Computer Generated Documents – What’s different about Capture 2.0 and Big Data?


August 1, 2019

As part of our series on Capture 2.0, this quarter TSG is focused on improving our ability to capture documents that are “born digital”, that is, documents that, while formatted like paper documents, were never actually printed on paper.  While capturing documents created by an end user in Word or similar applications is fairly straightforward, output from computer batch processes is not so easy.  For this post, we will focus on batch computer processes that produce output, often in document format, and will describe the issues as well as our plans for our Capture 2.0 solutions.

Computer Output – What are some of the issues?

The need to preserve document output from computer systems is typically driven by record keeping.  To answer the question, “What did we send the customer two years ago?”, most legacy systems cannot easily reproduce the exact communication: the underlying data can change over time, and it is hard to defend a document that was simply regenerated.

While it might make sense to always save the data, regenerating the document can be complex and time-consuming, particularly if a customer is initiating the request.  Often the data on the document includes report logic and calculations that do not exist in the source system and cannot be easily reproduced (percentages and averages, for example).  Modern ECM approaches rely on saving the document itself, so customers or internal resources can quickly retrieve the record of the communication, while also capturing much of the data for Big Data query requirements.

Capturing computer output has evolved over the years and, while the technology and formats have changed, Capture 2.0 solutions need to address all of the different ways computer output might be captured.  Some examples of computer output include:

  • Computer Output to Pre-printed Forms – For volume printing of documents (think invoices, statements, direct mail, checks, etc.), the printer often supplies a pre-printed template with common components such as logos and addresses already on the page.  The computer output fills in the data, so capturing the document requires merging the template with the computer output in a fairly complex printing environment.  See how this is addressed in the FileNet COLD format in our previous post.
  • Computer Output for Reports – Reports tend not to have pre-printed forms but are printed in a streamed batch process.  While the batch might contain several reports, often they are all in one print job.  To capture the individual reports, the batch file needs to be parsed so that individual reports can be identified and stored (a sketch follows this list).
  • Computer Output to PDF – Many modern solutions print directly to PDF and store or email the PDF to the user.  Again, think statements or other financial transactions where customers opt to save paper and receive documents electronically.
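
To make the report-splitting step concrete, below is a minimal sketch that splits a batch print stream on form-feed page breaks and groups pages by a report header.  The “REPORT ID:” header convention is a hypothetical assumption for illustration; real print streams vary by system.

    import re

    # Hypothetical header convention -- real print streams vary by system.
    REPORT_HEADER = re.compile(r"REPORT ID:\s*(\S+)")

    def split_reports(stream_path):
        """Split a batch print stream into (report_id, text) pairs."""
        with open(stream_path, "r", errors="replace") as f:
            pages = f.read().split("\f")          # form feed ends a page
        reports, current, current_id = [], [], None
        for page in pages:
            match = REPORT_HEADER.search(page)
            if match and match.group(1) != current_id:
                if current:                       # flush the previous report
                    reports.append((current_id, "\f".join(current)))
                current, current_id = [], match.group(1)
            current.append(page)
        if current:
            reports.append((current_id, "\f".join(current)))
        return reports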

A Capture 2.0 solution needs to address all of the above scenarios, as current printing requirements often come from legacy solutions that have no plans of being updated.

Capture 2.0 – Capturing the Computer Documents

TSG would recommend, as a first step, looking at how to convert the documents into a common format, typically PDF or PDF/A.  In this manner, the documents represent a record in time that can stand alone regardless of the final storage system.  TSG has worked with a variety of output types to convert to PDF and parse documents.  Examples include:

  • PostScript – PostScript output can be captured at the printer and easily converted into PDF with libraries such as Ghostscript (see the sketch below).
  • Text – Streamed text can be converted to PDF in a variety of ways.  In the past, TSG has used Apache FreeMarker templates to format the data for conversion to PDF.
  • AFP – Documents in IBM’s Advanced Function Presentation (AFP) format can be converted to PDF using open-source libraries or commercial tools.

By converting to PDF, the computer documents can be easily displayed and shared without the special programs required for proprietary formats.  If PDF/A is selected, the document is in a complete archival format that can be relied on for long-term storage needs.  PDF/A differs from regular PDF in that features unsuitable for long-term archival (for example, font linking) are disabled.
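
As a rough sketch of the conversion step, the fragment below shells out to Ghostscript to convert PostScript to PDF, with an optional PDF/A mode.  It assumes the gs binary is installed and on the PATH; fully conformant PDF/A output additionally requires an ICC profile and a PDFA_def.ps definition file, which are omitted here for brevity.

    import subprocess

    def ps_to_pdf(src, dest, pdfa=False):
        """Convert a PostScript file to PDF (optionally PDF/A) via Ghostscript."""
        cmd = ["gs", "-dBATCH", "-dNOPAUSE", "-dQUIET",
               "-sDEVICE=pdfwrite", f"-sOutputFile={dest}"]
        if pdfa:
            # PDF/A-2 output; policy 1 tells Ghostscript how to handle violations
            cmd += ["-dPDFA=2", "-dPDFACompatibilityPolicy=1",
                    "-sColorConversionStrategy=UseDeviceIndependentColor"]
        cmd.append(src)
        subprocess.run(cmd, check=True)

    ps_to_pdf("statement.ps", "statement.pdf", pdfa=True)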

Capture 2.0 – Capturing the Metadata

Once the document is created, the next step is extracting the metadata to correctly store and process the document.  There are two basic methods:

  • Smart File Naming and Tagged Metadata – With this method, the system producing the computer output names the file with the metadata components, creates a companion file with tagged metadata (think XML), or embeds the metadata in the source file itself.  The Capture 2.0 solution needs to be able to process the output to store the document correctly in the ECM system (a sketch follows this list).
  • Data Extraction from the Document Itself – For most systems, the data needs to be extracted from the document itself, as the system producing the report, statement or other output was not originally designed for smart file naming or tagged metadata.  See our related post on how to capture metadata from documents, particularly computer-generated documents.
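
For the first method, a minimal sketch is shown below: one helper parses a “smart” file name and another reads a companion XML file.  The naming convention (account_docType_date.pdf) and the flat XML layout are hypothetical; every source system defines its own.

    import xml.etree.ElementTree as ET
    from pathlib import Path

    def metadata_from_filename(path):
        """Parse a hypothetical account_docType_date.pdf naming convention."""
        account, doc_type, date = Path(path).stem.split("_")
        return {"account": account, "docType": doc_type, "date": date}

    def metadata_from_sidecar(xml_path):
        """Read flat <tag>value</tag> pairs from a companion XML file."""
        root = ET.parse(xml_path).getroot()
        return {child.tag: child.text for child in root}

    # e.g. metadata_from_filename("1234567_statement_2019-07-31.pdf")
    #      -> {"account": "1234567", "docType": "statement", "date": "2019-07-31"}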

Capture 2.0 – Big Data Storage Requirements

TSG would argue that a Capture 2.0 system has to address more than just document archiving for computer-generated documents.  When it comes to extracting data, customers are asking for more and more data from the documents.  We forecast two distinct scenarios:

  • Historical Document Capture – Clients would like to capture both current and historical documents in a “know your customer” approach where all past and present documents are stored.  Extracting information from both new and old documents will be a key requirement, as the data behind old documents is often difficult to retrieve or no longer available.
  • Updating Historical Document Capture – While the documents might already be stored in a new document system, some key component or data element was not stored as metadata and is therefore not retrievable.  The system needs to provide the ability to reprocess old documents, while also updating new documents, to capture this new data (a sketch follows this list).
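
As a sketch of the reprocessing scenario, the fragment below back-fills a metadata field that was never captured on already-archived documents.  The repository API (list_documents, get_content, update_metadata) and the “Policy Number” field are hypothetical stand-ins; substitute the client calls of your ECM or NoSQL repository.

    import re

    # Hypothetical field to back-fill on historical documents.
    POLICY_NO = re.compile(r"Policy Number:\s*(\S+)")

    def backfill_policy_numbers(repo):
        """Reprocess archived documents missing the policyNumber attribute."""
        for doc in repo.list_documents(missing="policyNumber"):
            text = repo.get_content(doc.id, rendition="text")
            match = POLICY_NO.search(text)
            if match:
                repo.update_metadata(doc.id, {"policyNumber": match.group(1)})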

See our related post as we continue to add approaches for capturing data from both new and historical born-digital documents.

TSG would recommend modern NoSQL ECM repositories as the eventual repository, as these Big Data approaches better fit the complex query needs of data analysis.  See our efforts with Hadoop and DynamoDB for some examples, and the sketch of a typical query below.
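
To illustrate the kind of query such a repository makes easy, below is a minimal sketch using boto3 against DynamoDB.  The table name and key schema (account as partition key, date as sort key) are assumptions for illustration, not a prescribed design.

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("documents")  # hypothetical table

    # All documents for an account over a date range -- a typical
    # "know your customer" retrieval.
    resp = table.query(
        KeyConditionExpression=Key("account").eq("1234567")
        & Key("date").between("2017-01-01", "2019-08-01")
    )
    for item in resp["Items"]:
        print(item["docType"], item["date"])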

Summary

Capturing documents from computer systems can be complex given legacy systems’ differing approaches.  As clients look for more Big Data extraction from documents, a Capture 2.0 solution should be able to address all of the different scenarios, capture the metadata both from companion files and from the document itself, and store it in a Big Data-accessible repository.

Let us know your thoughts below:

Filed Under: Content Capture, OpenContent Management Suite, OpenMigrate
