As we presented yesterday, one of the bigger issues with FileNet migrations is clients that have leveraged the COLD format, and TSG has built its own adapter to help those clients move from FileNet COLD to the less proprietary PDF format. That discussion got us thinking about what other formats need special attention during legacy ECM migrations. This post discusses several of those formats and our lessons learned from migrating them.
TIFF – Tagged Image File Format
For our legacy ECM clients, many imaging systems store documents as TIFF images. TIFF became popular alongside facsimile (fax), and fax-style images are typically stored with CCITT Group 3 or Group 4 compression. TIFF remains a very popular format supporting scanning, faxing, word processing and optical character recognition, as well as other processes. We typically recommend converting TIFF images to the PDF Image format when migrating to modern ECM systems in order to support in-browser viewing of documents. Some specific lessons learned about TIFF to PDF conversion include:
- Image Libraries – There are lots of high performance libraries available for this process. While ImageMagick is a common image conversion and manipulation tool, TSG typically recommends iText for TIFF to PDF conversion for throughput reasons. While many of the transformation tools (like Alfresco’s) can rendition many different formats, for our OpenMigrate product, we have chosen specific libraries in order to optimize migration throughput for high volume migrations.
- Single Page TIFF to Multi Page PDF – FileNet Image Services stores every page of a document as a separate TIFF file, adding bloat to the repository. This was helpful 25 years ago when these systems were originally implemented and network bandwidth was at a premium. It allowed documents to be served to the viewer page-by-page, rather than forcing a download of the entire document. Now that bandwidth is much less of a concern, we typically recommend combining the single page TIFF files into one multi-page PDF during migration.
- Proprietary TIFF – FileNet will have special image information in the header and footer of the TIFF documents when they’re stored on disk. OpenMigrate removes the proprietary headers and footers as part of the migration.
- Watermarks – When converting to PDF, there is an opportunity to add overlays on the documents. It’s also possible to “burn in” certain metadata into the document, such as created date, author, etc. OpenMigrate can add these overlays during migration or OpenOverlay can add them when viewing the documents/images in the new ECM system.
- Optical Character Recognition – The majority of OCR tools work with TIFF. If there is useful textual data contained within the document, the migration might be an opportunity to convert the TIFF file to not just PDF Image format, but to a PDF Text document that allows for full-text searching within many ECM systems.
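As a rough illustration of the single-page to multi-page step above, the sketch below groups exported page files by document and orders them by page number, which is the plan a converter would need before writing one PDF per document. It is not how OpenMigrate does it; the record shape `(document_id, page_number, tiff_path)`, the function name `group_pages` and the sample paths are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical export record: (document_id, page_number, tiff_path).
PageRecord = Tuple[str, int, str]


def group_pages(records: Iterable[PageRecord]) -> Dict[str, List[str]]:
    """Return {document_id: [tiff paths in page order]}."""
    pages_by_doc: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for doc_id, page_number, path in records:
        pages_by_doc[doc_id].append((page_number, path))

    ordered: Dict[str, List[str]] = {}
    for doc_id, pages in pages_by_doc.items():
        page_numbers = [number for number, _ in pages]
        # A repeated page number usually means a bad export; fail loudly.
        if len(set(page_numbers)) != len(page_numbers):
            raise ValueError(f"duplicate page number in document {doc_id}")
        ordered[doc_id] = [path for _, path in sorted(pages)]
    return ordered


# Made-up sample data: three single-page TIFFs belonging to two documents.
records = [
    ("DOC-1", 2, "/export/0002.tif"),
    ("DOC-2", 1, "/export/0003.tif"),
    ("DOC-1", 1, "/export/0001.tif"),
]
plan = group_pages(records)
for doc_id, paths in plan.items():
    print(doc_id, "->", paths)
```

Each list in `plan` would then be handed to whichever conversion library is chosen, producing one multi-page PDF per document.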
Microsoft Office (Word, Excel and PowerPoint)
For the majority of legacy ECM migrations, TSG will recommend migrating Word documents in their native format. Sometimes clients will look to update the format, but this is rare. While the migration will typically be standard, many clients will use the introduction of the new ECM system to introduce new Word or other templates to address the “Create New Document” requirement.
One issue for large migrations is renditions, typically PDF. Should the migration move the renditions from the legacy system, or leverage the PDF renditioning capabilities of the new system to create new renditions of the Office documents? TSG typically recommends generating PDF renditions for viewing Office documents, as it is the best way to quickly view a document without having to launch Office or give the user the impression that they can edit the document. The PDF format also has the benefit that annotations can be stored in separate XFDF files and merged into the PDF at viewing time, or into the PDF document itself at a later time. TSG leverages this capability for our OpenAnnotate PDF annotations.
The decision to migrate renditions versus generate all new renditions should be based on requirements as well as timing. Migrating PDF renditions from the legacy system will require more migration effort but considerably fewer transformations in the new system. Transforming in the new system will result in consistent PDF renditions but could overwhelm the transformation servers during a large migration effort. TSG would recommend each client weigh their own requirements and consider multiple options.
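To make the timing trade-off concrete, here is a back-of-the-envelope sketch for estimating how long it would take to regenerate every rendition in the new system. The function name and all of the numbers are hypothetical; real throughput depends heavily on document size, the transformation engine and server sizing, and should be measured on a sample.

```python
def rendition_hours(doc_count: int, seconds_per_doc: float, workers: int) -> float:
    """Estimate wall-clock hours to regenerate renditions.

    Assumes work is spread evenly across `workers` parallel transformation
    threads, which is an idealization.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return doc_count * seconds_per_doc / workers / 3600


# Made-up example: 2 million Office documents, 3 seconds each, 8 workers.
hours = rendition_hours(2_000_000, 3.0, 8)
print(f"about {hours:.0f} hours, or {hours / 24:.1f} days")
```

If the estimate is longer than the migration window allows, that is an argument for carrying the existing renditions over from the legacy system instead.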
Video and Audio File Formats
There are many proprietary video and audio formats. TSG typically recommends migrating all formats as native files, while also considering MP4 or MP3 renditions as part of the migration process so that users can watch or listen in-browser in more open formats without downloading plugins or players.
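One way to picture the rendition step is a small helper that builds a transcoding command for each native file. The sketch below assumes the open source ffmpeg tool (built with libx264 and libmp3lame) is the transcoder, which is our assumption for illustration and not something this post prescribes; the extension lists and the function name are hypothetical too. It only builds the command and does not run it.

```python
from pathlib import PurePosixPath
from typing import List

# Illustrative, incomplete extension lists (assumptions for this sketch).
VIDEO_EXTENSIONS = {".avi", ".wmv", ".mov", ".flv", ".mpg"}
AUDIO_EXTENSIONS = {".wav", ".wma", ".aiff", ".au"}


def rendition_command(source: str) -> List[str]:
    """Return an ffmpeg command for an MP4/MP3 rendition, or [] if not needed."""
    path = PurePosixPath(source)
    extension = path.suffix.lower()
    if extension in VIDEO_EXTENSIONS:
        target = str(path.with_suffix(".mp4"))
        # H.264 video with AAC audio in an MP4 container.
        return ["ffmpeg", "-i", source, "-c:v", "libx264", "-c:a", "aac", target]
    if extension in AUDIO_EXTENSIONS:
        target = str(path.with_suffix(".mp3"))
        return ["ffmpeg", "-i", source, "-c:a", "libmp3lame", target]
    return []


print(rendition_command("/export/training.WMV"))
print(rendition_command("/export/voicemail.wav"))
```

The native file is left untouched; the rendition is written alongside it and can be attached to the same document in the new repository.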
Image File Formats
Similar to audio and video formats, there are a number of different formats for storing images: GIF, JPEG, PNG, and BMP, just to name a few. TSG recommends migrating images from legacy systems in their native formats in order to preserve the original files in an unaltered state.
TSG also recommends generating renditions for image files if the native formats cannot be viewed in-browser without plugins. Image renditions can be generated by OpenMigrate as part of the migration from the legacy system, or as a separate process after the migration is complete.
Some user interfaces, like TSG’s OpenContent Management Suite, offer thumbnail views that provide an enhanced user experience in systems where large numbers of images are processed. Similar to other image renditions, thumbnails can also be generated using OpenMigrate as part of the migration from the legacy system, or as a side process.
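The two decisions above (does this image need a browser-friendly rendition, and how big should its thumbnail be) can be sketched as follows. The set of browser-viewable extensions, the 200 pixel bounding box and both function names are assumptions for illustration, not OpenMigrate settings.

```python
import os
from typing import Tuple

# Assumed set of formats most browsers display without plugins.
BROWSER_VIEWABLE = {".gif", ".jpg", ".jpeg", ".png", ".bmp"}


def needs_rendition(filename: str) -> bool:
    """True when the native format is not in the browser-viewable set."""
    extension = os.path.splitext(filename)[1].lower()
    return extension not in BROWSER_VIEWABLE


def thumbnail_size(width: int, height: int, max_side: int = 200) -> Tuple[int, int]:
    """Fit an image inside a max_side square, keeping its aspect ratio."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    # Cap the scale at 1.0 so small images are never enlarged.
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(needs_rendition("scan_0001.TIF"))
print(thumbnail_size(2550, 3300))
```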
COLD File Formats
COLD, or “Computer Output to Laser Disc”, refers to formats in which output is “printed” to documents directly from another system. See our previous post for addressing the specific FileNet COLD migration format. As each COLD format tends to be unique and require a specific viewer, TSG recommends that all COLD formats be converted to PDF so that documents can be opened in a variety of different viewers and can support annotations.
Annotation Formats
Along with legacy migrations comes the question of what to do with legacy viewers and annotations. Many legacy ECM systems leverage annotation tools that store annotations in proprietary formats. Just as with the files themselves, TSG recommends converting these annotations to the standard XFDF format to support the PDF transformations described above. See our related posts about FileNet Annotations as well as Daeja Annotations.
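For readers who have not seen XFDF, the sketch below writes a minimal XFDF file containing sticky-note (`text`) annotations that point at a PDF. Reading the proprietary source format is the hard part and is not shown; the input dictionaries, the function name `build_xfdf` and the sample values are hypothetical, and a real conversion would map many more annotation types and attributes.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

XFDF_NS = "http://ns.adobe.com/xfdf/"


def build_xfdf(pdf_name: str, notes: List[Dict]) -> str:
    """Serialize simple notes as XFDF <text> annotations for pdf_name."""
    ET.register_namespace("", XFDF_NS)  # emit XFDF as the default namespace
    root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    annots = ET.SubElement(root, f"{{{XFDF_NS}}}annots")
    for note in notes:
        attributes = {
            "page": str(note["page"]),  # XFDF page numbers are zero-based
            "rect": ",".join(f"{value:g}" for value in note["rect"]),
            "title": note["author"],  # "title" holds the author name
        }
        text = ET.SubElement(annots, f"{{{XFDF_NS}}}text", attributes)
        contents = ET.SubElement(text, f"{{{XFDF_NS}}}contents")
        contents.text = note["text"]
    # The <f> element names the PDF the annotations apply to.
    ET.SubElement(root, f"{{{XFDF_NS}}}f", {"href": pdf_name})
    return ET.tostring(root, encoding="unicode")


# Made-up legacy annotation, already parsed from its proprietary format.
legacy_notes = [
    {"page": 0, "rect": (100, 700, 120, 720), "author": "jsmith",
     "text": "Approved for payment"},
]
xfdf = build_xfdf("invoice-1001.pdf", legacy_notes)
print(xfdf)
```

Because the annotations live in their own file, the underlying PDF stays unchanged until someone chooses to merge them in.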
PDF or PDF/A
PDF/A is an ISO-standardized version of the Portable Document Format (PDF) specialized for use in the archiving and long-term preservation of electronic documents. PDF/A differs from PDF by prohibiting features unsuitable for long-term archiving, such as font linking (as opposed to font embedding) and encryption. The ISO requirements for PDF/A file viewers include color management guidelines, support for embedded fonts, and a user interface for reading embedded annotations. Clients should consider leveraging PDF/A when migrating documents that are purely archival but should consider the impacts of a larger document format caused by the embedded fonts.
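During a migration it can be handy to know which existing PDFs already claim to be PDF/A. PDF/A files identify themselves with `pdfaid:part` and `pdfaid:conformance` entries in their XMP metadata, so a quick scan can find that claim. The sketch below assumes the metadata is stored as uncompressed UTF-8 text inside the file, and it only detects the claim; it does not validate conformance, which requires a dedicated validator such as veraPDF. The function name is hypothetical.

```python
import re
from typing import Optional

# XMP can carry these values as elements or as attributes; match both forms.
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*['\"]|>)\s*(\d)")
_CONFORMANCE = re.compile(rb"pdfaid:conformance\s*(?:=\s*['\"]|>)\s*([A-Za-z])")


def claimed_pdfa_level(pdf_bytes: bytes) -> Optional[str]:
    """Return a label such as 'PDF/A-1b' if the file claims PDF/A, else None."""
    part = _PART.search(pdf_bytes)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conformance = _CONFORMANCE.search(pdf_bytes)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").lower()
    return f"PDF/A-{level}"


# Tiny stand-ins for real files: only the relevant metadata fragment.
element_form = b"%PDF-1.4 <pdfaid:part>1</pdfaid:part><pdfaid:conformance>B</pdfaid:conformance>"
attribute_form = b'%PDF-1.7 <rdf:Description pdfaid:part="2" pdfaid:conformance="U"/>'
print(claimed_pdfa_level(element_form))
print(claimed_pdfa_level(attribute_form))
print(claimed_pdfa_level(b"%PDF-1.4 no archival metadata here"))
```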
Summary
Migrating from legacy ECM systems should also involve migrating from legacy file formats to more modern and less proprietary formats. TSG has experience with a large number of formats and would recommend clients carefully address the upgrading of the formats along with their legacy ECM system migration.