One of the major recommendations we make for our clients is to leverage PDF as much as possible within Alfresco, both for source documents and for renditions. This post will share some of our observations and lessons learned.
Alfresco – Why PDF?
Many of the legacy ECM customers we work with are used to older formats (like TIFF) or to simply storing native formats (like Word). When discussing why PDF, TSG will typically bring up the following reasons:
- Portable Document Format – PDF stands for “Portable Document Format”. PDF can easily be sent to other people outside the organization without being concerned the audience has the right application to view the document. With security, certain functions like altering or printing can be curtailed.
- Browser Support – As all of our ECM customers now leverage web browsers for viewing documents, PDF allows the document to be viewed within the browser without any add-ons and without launching the native application (ex: Word). This viewing or previewing capability lets users open a document faster than a native application (Word) could be launched. There are lots of viewing options: the browser itself, PDF.JS or our viewing/annotation/redacting product OpenAnnotate.
- Printing Support – PDF provides benefits over other content types that might need the native application to initiate a print. See our earlier article on printing with Alfresco.
- Overlay Support – PDF also provides the capability for overlays as we do with our OpenOverlay product.
- Combining and Manipulating – Having PDF renditions available allows pages to be added and deleted or combined to create supersets or subsets of documents. One of the major features of OpenContent Case is the Combine PDF for our insurance clients looking to send out organized packets of documents.
- Annotation Support – PDF has built-in annotation support with the XFDF format. Annotations can be stored as separate files within Alfresco (with different security) or burned into the PDFs themselves. See examples with OpenAnnotate.
- Long Term Archival – PDF/A is the ISO-standardized version of the Portable Document Format for use in the archiving and long-term preservation of electronic documents. Clients with records management requirements will typically use PDF/A for long-term storage.
- Storage Costs – With storage now very cheap, it no longer makes sense to skip the PDF rendition of a document just to save on storage.
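To make the Annotation Support point more concrete, here is a minimal sketch of what an XFDF file holding a single sticky-note annotation might look like when built with the Python standard library. The function name `build_xfdf` and its parameters are our own illustration, not part of OpenAnnotate or Alfresco, and the element order follows what Acrobat typically exports. Storing this XML as its own node is what allows the annotation to carry different security than the PDF it refers to.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"


def build_xfdf(pdf_name: str, page: int, rect: tuple, author: str, note: str) -> str:
    """Return an XFDF document holding one text (sticky note) annotation.

    Hypothetical helper for illustration only.
    """
    # Serialize the XFDF namespace as the default namespace (no prefix).
    ET.register_namespace("", XFDF_NS)
    root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    annots = ET.SubElement(root, f"{{{XFDF_NS}}}annots")
    text = ET.SubElement(
        annots,
        f"{{{XFDF_NS}}}text",
        {
            "page": str(page),  # zero-based page index
            "rect": ",".join(str(v) for v in rect),  # left,bottom,right,top in points
            "title": author,  # XFDF stores the annotation author as "title"
        },
    )
    contents = ET.SubElement(text, f"{{{XFDF_NS}}}contents")
    contents.text = note
    # The <f> element points back at the PDF the annotations belong to.
    ET.SubElement(root, f"{{{XFDF_NS}}}f", {"href": pdf_name})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_xfdf("contract.pdf", 0, (100, 700, 120, 720), "reviewer", "Check clause 4"))
```

The resulting string can be saved next to the PDF rendition or merged into the PDF later by a tool that understands XFDF.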
Alfresco – How PDF?
There are multiple options to leverage PDF within Alfresco including:
- Scanning – This is the oldest way to turn paper into PDF. TSG typically recommends scanning and OCR to have the text available within the PDF for indexing in full-text search. PDF supports native text, image-only, and combined text/image content.
- Native PDF Content – More and more content is originating as PDF. Whether the content comes from outside parties or from capturing a print stream, many organizations now go directly to text PDF rather than printing and scanning.
- Native Content Transformed into PDF – This is when documents exist in Word, Excel or a variety of other formats. Alfresco provides transformation to PDF as part of the base product leveraging LibreOffice. Alfresco also provides an external transformation server for more difficult transformations at an added cost. TSG has also worked with Adlib for high-quality document transformation.
- Legacy Content Transformed into PDF – Within our OpenMigrate practice we typically work with a large volume of legacy TIFF files, particularly with FileNet, that came from the scanning approach. TIFF can be transformed with open source libraries into image PDF or OCRed into image-and-text PDF.
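For readers who want to see what a LibreOffice-based transformation looks like outside of Alfresco, the sketch below shells out to LibreOffice in headless mode. This is not how Alfresco invokes LibreOffice internally; it is a standalone illustration that assumes a `soffice` binary is on the PATH, and the function names are ours.

```python
import subprocess
from pathlib import Path


def libreoffice_pdf_command(source: Path, out_dir: Path, soffice: str = "soffice") -> list:
    """Build the LibreOffice command line that converts one file to PDF."""
    return [
        soffice,
        "--headless",            # no GUI, suitable for servers
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source: Path, out_dir: Path, timeout_s: int = 120) -> Path:
    """Run the conversion and return the expected path of the PDF rendition."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if LibreOffice exits with an error;
    # the timeout guards against documents that hang the converter.
    subprocess.run(libreoffice_pdf_command(source, out_dir), check=True, timeout=timeout_s)
    # LibreOffice names the output after the source file, with a .pdf extension.
    return out_dir / (source.stem + ".pdf")
```

A call such as `convert_to_pdf(Path("agreement.docx"), Path("renditions"))` would be expected to produce `renditions/agreement.pdf`, which is a convenient way to preview how a given document type renders before it is loaded into the repository.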
Some of the unique components of Alfresco to keep in mind when working with PDF include:
- Synchronous Transformation – For both the internal and external transformation server, Alfresco will transform the document when it is being stored. Having to wait until the transformation is complete can slow down ingestion, particularly on mass migrations. TSG has built alternatives with OpenMigrate to transform asynchronously.
- Transformation Quirks – Neither the native transformation server nor the external one always transforms documents correctly, and TSG recommends clients test their document types. TSG has found that both will sometimes struggle with certain MS Word documents, particularly legal agreements that have two columns, complex tables, or tricky fonts. Contact us if you need some example documents.
- Versioning – Out of the box, Alfresco versioning only allows for the latest version of a document to have a PDF rendition. TSG has built the Chain Versioning upgrade to allow for both current and previous versions in Alfresco to contain renditions. See the discussion around associations here for details as to why Alfresco cannot support separate renditions per version out of the box.
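The difference between synchronous and asynchronous transformation can be sketched in a few lines. The example below is not OpenMigrate or Alfresco code; `store_document` and `transform_to_pdf` are hypothetical stand-ins, and the repository is just an in-memory dictionary. The point is only the shape of the approach: each document is stored right away, and the rendition is filled in by a worker pool so that storing a document never waits on its transformation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def store_document(repo: dict, name: str, content: bytes) -> None:
    """Hypothetical stand-in for writing the native file to the repository."""
    repo[name] = {"native": content, "pdf": None}


def transform_to_pdf(content: bytes) -> bytes:
    """Hypothetical stand-in for a real transformation engine."""
    return b"%PDF-" + content


def ingest(repo: dict, documents: dict, workers: int = 4) -> None:
    """Store every document first, then attach renditions as workers finish."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {}
        for name, content in documents.items():
            store_document(repo, name, content)  # does not wait on the rendition
            pending[pool.submit(transform_to_pdf, content)] = name
        # Attach each rendition as soon as its transformation completes.
        for future in as_completed(pending):
            repo[pending[future]]["pdf"] = future.result()


if __name__ == "__main__":
    repo = {}
    ingest(repo, {"a.docx": b"alpha", "b.docx": b"beta"})
    print(sorted(repo))
```

In a real migration the worker pool would more likely be a separate process or queue, so that a slow or failing transformation cannot stall the load of the native content.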
TSG – PDF Recommendations and Best Practices for PDF in Alfresco
In talking with our team, some recommendations for clients include:
- Transform Legacy Content during the Migration rather than after – For our large volume clients, leaving the transformation until the end of the migration can result in a backlog of transformations that affect viewing performance as well as perception. We would recommend transforming during the migration process whenever possible.
- Test different document types and formats – As mentioned above, check how well your documents transform before choosing a transformation solution.
- Consider PDF Overlays for print control – Overlaying the date and user on a document is an easy step to make sure that printed documents are properly controlled.
- Implement TSG Chain Versioning – to allow versions to have PDF renditions.
- Transform to PDF and then Annotate rather than Annotate on Native Formats – Some tools support viewing and annotating on the native format. While that approach might seem cleaner, the tools and viewing can be expensive and troublesome.
- Build interfaces that rely on PDF Viewing First – With PDF available, viewing and printing should focus on PDF with access to the native application added later.
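The recommendation to test different document types lends itself to a small harness. The sketch below is one possible shape, under our own assumptions: `convert` is any callable you supply that takes a sample file and returns the bytes of its rendition (for example, a wrapper around whichever transformation engine is being evaluated), and the only automated check is that the output begins with the `%PDF-` marker.

```python
from collections import defaultdict
from pathlib import Path
from typing import Callable, Iterable


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files begin with the %PDF- marker."""
    return data.startswith(b"%PDF-")


def transformation_report(samples: Iterable[Path], convert: Callable[[Path], bytes]) -> dict:
    """Count passes and failures per file extension for a sample set."""
    report = defaultdict(lambda: {"ok": 0, "failed": 0})
    for sample in samples:
        ext = sample.suffix.lower()
        try:
            ok = looks_like_pdf(convert(sample))
        except Exception:
            ok = False  # a crash in the converter counts as a failed transformation
        report[ext]["ok" if ok else "failed"] += 1
    return dict(report)
```

A check like this only catches hard failures. The layout problems mentioned earlier (two-column agreements, complex tables, tricky fonts) still need a person to compare the rendition against the original.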
Summary
PDF renditions, used correctly within an Alfresco implementation, can simplify viewing, printing, annotating and manipulating documents. Teams should be aware of the Alfresco-specific behaviors described above (synchronous transformation, transformation quirks and versioning) and implement Alfresco with those factors in mind.
Let us know your thoughts or best practices below: