As part of our series on Capture 2.0, this quarter TSG is focused on improving our ability to capture documents that are “born digital”, that is, documents that have a paper-style format but were never actually printed on paper. While capturing Word files or other documents created by an end user is fairly straightforward, capturing the output of computer batch processes is not so easy. For this post, we will focus on batch computer processes that produce output, often in document format, and will describe the issues as well as our plans for our Capture 2.0 solutions.
Computer Output – What are some of the issues?
The need to preserve document output from computer systems typically has to do with record keeping. When asked “What did we send the customer 2 years ago?”, legacy systems usually cannot easily reproduce the exact communication: the underlying data can change, and it is hard to defend a document that was just regenerated.
While it might make sense to always save the data, regenerating the document can be complex and time consuming, particularly if a customer is initiating the request. Often the document includes the results of report logic and calculations that do not exist in the source system and cannot be easily reproduced (for example, percentages, averages, etc.). Modern ECM approaches rely on saving the document itself, so that customers or internal resources can quickly retrieve the record of the communication, while also capturing much of the data for big data query requirements.
Capturing computer output has evolved over the years and, while the technology and formats have changed, Capture 2.0 solutions need to address all of the different ways computer output might be captured. Some examples of computer output include:
- Computer Output to pre-printed Forms – For volume printing of documents (think invoices, statements, direct mail, checks, etc.), often a printed template document would be created by the printer with common components such as logos and addresses already on the document. The computer output would fill in the data on the document in a fairly complex printing environment, so capturing the document requires combining both the template and the computer output. See how this is addressed in the FileNet COLD format from our previous post.
- Computer Output for Reports – Reports tend to not have pre-printed forms but are printed out in a stream batch process. While the batch might have several reports, often the reports are all in one print job. To capture the individual reports, the batch file needs to be parsed and individual reports identified and stored.
- Computer Output to PDF – Many modern solutions are printing directly to PDF and storing or emailing the PDF to the user. Again think statements or other financial transactions where customers can opt to save paper and distribute electronically.
A Capture 2.0 solution needs to address all of the above scenarios because current printing requirements often come from legacy solutions that have no plans of being updated.
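As a rough illustration of the report scenario, the sketch below splits one spooled print job into individual reports. It is not TSG's implementation. It assumes pages are separated by form feed characters and that the first page of each report starts with a `REPORT: <name>` banner line; both conventions, along with the `Report` class and `split_reports` function, are hypothetical and would need to be adapted to the actual legacy output.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical banner that marks the first page of a report,
# for example "REPORT: AR-AGING". Real print streams will differ.
REPORT_BANNER = re.compile(r"^REPORT:\s*(?P<name>\S+)", re.MULTILINE)


@dataclass
class Report:
    name: str
    pages: List[str] = field(default_factory=list)


def split_reports(print_job: str) -> List[Report]:
    """Split a text print job into reports.

    Assumes form feeds ("\f") separate pages and that a page carrying
    the banner starts a new report. Pages that appear before the first
    banner are ignored.
    """
    reports: List[Report] = []
    for page in print_job.split("\f"):
        if not page.strip():
            continue  # skip blank pages, such as after a trailing form feed
        banner = REPORT_BANNER.search(page)
        if banner:
            reports.append(Report(name=banner.group("name"), pages=[page]))
        elif reports:
            reports[-1].pages.append(page)  # continuation of the current report
    return reports
```

Each resulting `Report` could then be converted and stored as its own document rather than as one large batch file.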
Capture 2.0 – Capturing the Computer Documents
TSG would recommend that the first step is looking at how to convert the documents into a common format, typically PDF or PDF/A. In this manner, the documents represent a record in time that can stand alone regardless of the final storage system. TSG has worked with a variety of output types to convert to PDF and parse documents. Examples include:
- Postscript – Postscript output can be captured at the printer and easily converted into PDF with libraries such as Ghostscript.
- Text – streamed text can be converted to PDF in a variety of ways. In the past, TSG has used Apache FreeMarker templates to format the data for conversion to PDF.
- AFP – documents in the AFP format can be converted to PDF using open source libraries or commercial tools.
By converting to PDF, the computer documents can be easily displayed and shared without the need for the special programs that proprietary formats require. If PDF/A is selected, the document is in a self-contained archive format that can be relied on for long term storage needs. PDF/A differs from regular PDF in that features unsuitable for long term archival (for example, linking to external fonts rather than embedding them) are prohibited.
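For the PostScript case, a conversion step might look like the sketch below, which shells out to the Ghostscript command line using its `pdfwrite` device. This is only a minimal sketch: the function names are hypothetical, it assumes the `gs` executable is on the PATH, and it produces regular PDF. Producing validated PDF/A takes additional Ghostscript configuration that is not shown here.

```python
import shutil
import subprocess
from typing import List


def build_gs_command(ps_path: str, pdf_path: str) -> List[str]:
    """Build a Ghostscript command that converts PostScript to PDF."""
    return [
        "gs",
        "-dBATCH",            # exit when processing is finished
        "-dNOPAUSE",          # do not prompt between pages
        "-dSAFER",            # restrict file access from the PostScript program
        "-sDEVICE=pdfwrite",  # write PDF output
        f"-sOutputFile={pdf_path}",
        ps_path,
    ]


def convert_ps_to_pdf(ps_path: str, pdf_path: str) -> None:
    """Run the conversion, failing loudly if Ghostscript is missing or errors."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on the PATH")
    subprocess.run(build_gs_command(ps_path, pdf_path), check=True)
```

Keeping the command construction separate from the call makes it easy to log or adjust the exact options used for each captured document.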
Capture 2.0 – Capturing the Metadata
Once the document is created, the next step is extracting the metadata to correctly store and process the document. There are two basic methods:
- Smart File Naming and Tagged Metadata – With this method, the system producing the computer output names the file with the metadata components, creates a separate file with tagged metadata (think XML), or embeds the metadata in the source file itself. A Capture 2.0 solution needs to be able to process this output to store the document correctly in the ECM system.
- Data extraction from the Document itself – For most systems, the data needs to be extracted from the document itself as the system producing the report, statement or other output was not originally created with smart file naming or tagged metadata. See our related post on how to capture metadata from documents, particularly computer generated documents.
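To make the first method concrete, the sketch below reads metadata from a file name and from an XML sidecar file. The naming convention (`<doctype>_<account>_<yyyymmdd>.pdf`), the `<field name="...">` XML layout, and the function names are all assumptions made for illustration, not a format any particular system produces.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical convention: STATEMENT_12345_20150331.pdf
FILENAME_PATTERN = re.compile(
    r"^(?P<doc_type>[A-Za-z]+)_(?P<account>\d+)_(?P<date>\d{8})\.pdf$"
)


def metadata_from_filename(filename: str) -> Dict[str, str]:
    """Extract metadata encoded in the file name."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {filename}")
    return match.groupdict()


def metadata_from_sidecar(xml_text: str) -> Dict[str, str]:
    """Extract metadata from a sidecar file of <field name="..."> elements."""
    root = ET.fromstring(xml_text)
    metadata: Dict[str, str] = {}
    for element in root.iter("field"):
        name = element.get("name")
        if name:  # ignore fields that have no name attribute
            metadata[name] = (element.text or "").strip()
    return metadata
```

Rejecting file names that do not match the convention, rather than guessing, keeps badly named output from being stored under the wrong customer or date.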
Capture 2.0 – Big Data Storage Requirements
TSG would argue that a Capture 2.0 system has to address more than just document archive for computer generated documents. In extracting data, customers are asking for more and more data from the documents. We would forecast two unique scenarios.
- Historical Document Capture – Clients would like to capture current and historical documents in a “know your customer” approach where all current and past documents are stored. Extracting information from both new and old documents will be a key requirement, as the source data for old documents is often difficult to retrieve or not available.
- Update Historical Document Capture – While the documents might be stored in a new document system, some key component or data element was not stored as metadata and therefore not retrievable. The system needs to provide the ability to reprocess old documents while updating new documents to capture this new data.
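The reprocessing scenario can be sketched as applying a new extraction rule to the text of already stored documents and filling in only the metadata that is missing. The in-memory document layout (a dict with `text` and `metadata` keys), the `reprocess` function, and the sample rule are hypothetical; a real repository would be read and updated through its own API.

```python
import re
from typing import Dict, List, Pattern


def reprocess(documents: List[dict], rules: Dict[str, Pattern[str]]) -> int:
    """Fill in missing metadata fields from each document's text.

    Each rule maps a field name to a regex whose first group is the value.
    Existing metadata values are never overwritten.
    Returns the number of documents that were updated.
    """
    updated = 0
    for doc in documents:
        changed = False
        for field_name, pattern in rules.items():
            if field_name in doc["metadata"]:
                continue  # already captured, leave as-is
            match = pattern.search(doc["text"])
            if match:
                doc["metadata"][field_name] = match.group(1)
                changed = True
        if changed:
            updated += 1
    return updated
```

For example, a rule such as `{"amount_due": re.compile(r"Amount Due:\s*([\d.]+)")}` could be run over historical statements and also added to the capture process for new ones, so both sets end up with the same field.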
See our related post as we are continuing to add capture approaches for both new and historical documents to capture data from documents that were born digital.
TSG would recommend more modern NoSQL ECM repositories as the eventual repository, as these Big Data approaches better fit the complex query needs of data analysis. See our efforts for Hadoop and DynamoDB for some examples.
Summary
Capturing documents from computer systems can be complex given the different approaches taken by legacy systems. As clients are looking for more Big Data extraction from documents, a Capture 2.0 solution should be able to address all of the different scenarios, capture the metadata from both the files and the documents themselves, and store the results in a big data accessible repository.
Let us know your thoughts below: