Content Capture 2.0 – The Upcoming Disruptions – Thoughts and Analysis

Similar to the disruption happening within the ECM marketplace, most analysts (and TSG) are predicting that the Capture component of ECM is ripe for disruption as well. The capture marketplace is very similar to the ECM marketplace with legacy products that haven’t evolved as new requirements and technologies are introduced. One of the strategic initiatives TSG is pursuing this quarter is to continue to update our capture capabilities for our own offerings. This post will be the first in a series of posts on our products and how they are evolving to meet the requirements of modern systems today.

Capture 1.0 – Born in the mailroom scanning paper

Most current capture vendors got their start helping customers capture paper electronically. Like the ECM market, the capture marketplace grew up in the ’80s and ’90s around the digital capture and scanning of paper documents. As companies pursued paperless initiatives, capture solutions evolved around scanning documents and capturing metadata values through automated OCR processing and manual indexing. Many capture products still have this basic focus, with additional logic for handwriting capture and other automation capabilities.

Built to automate paper capture in the mailroom, Capture 1.0 vendors focus on the following processing:

Scanning and Recognition – Components that are unique to mailroom scanning include batch scanning of documents, separator pages, bar-code reading, Optical Character Recognition, and handwriting recognition.
Indexing – Screens for indexing of documents. Because these screens were designed for paper, they center on recognizing characters and fields and on using confidence levels to decide which data must be keyed manually.
Bulk Ingestion into ECM and Data platforms – Having grown up as a separate infrastructure, capture vendors would provide specific adapters for wherever the content would flow after it was indexed. For example, before being acquired by Documentum, Captiva had adapters for Documentum and Application Extender along with many others.
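
As a rough illustration of how confidence levels can drive keying, the sketch below splits recognized fields into those that can be accepted automatically and those routed to a manual indexing screen. The `RecognizedField` type, the field names, and the 0.90 threshold are assumptions made for this example and are not taken from any particular capture product.

```python
from dataclasses import dataclass


@dataclass
class RecognizedField:
    name: str          # index field, e.g. "invoice_number" (hypothetical)
    value: str         # text returned by the recognition engine
    confidence: float  # engine-reported confidence, 0.0 to 1.0


def split_for_keying(fields, threshold=0.90):
    """Separate auto-accepted fields from those that need manual keying."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_keying = [f for f in fields if f.confidence < threshold]
    return accepted, needs_keying


# Illustrative data only: the second value has a low confidence score.
fields = [
    RecognizedField("invoice_number", "INV-1001", 0.98),
    RecognizedField("po_number", "P0-77", 0.61),
]
accepted, needs_keying = split_for_keying(fields)
print("accepted:", [f.name for f in accepted])
print("needs keying:", [f.name for f in needs_keying])
```

In practice the threshold would likely be tuned per field, since a mis-keyed account number usually costs more than a mis-keyed description.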

Capture 2.0 solutions need to do all of the above and more. As with the disruption of legacy ECM vendors, Capture 2.0 solutions will embrace the affordability and accessibility of nearly limitless computing power, with technologies like Machine Learning/Artificial Intelligence as well as cloud capabilities.

Rather than patching additional capabilities onto scanning solutions, new vendors will emerge that are built from the ground up with a disruptive technology and pricing model. Like the ECM disruption currently occurring, we predict Capture 2.0 solutions won’t immediately replace Capture 1.0 solutions but will nibble away at the documents that used to flow through paper and the mailroom.

Capture 2.0 – Making intelligent capture smarter with machine learning

Capture 1.0 vendors use the brand “Intelligent Capture” to describe their current method of extracting content from documents. These capture tools generally rely on two approaches to data capture:

Location Template Approach – a template defines where data is located in a given document. A zone is given to denote where a piece of data resides. For example, the tool could be told to look in a given box in the top right corner of the header to pull the “Report Number” value. This approach only works well when the positional data is known and very consistent across all documents. Templates need to be created for every type of captured document.
Key/Value pair Template Approach – A second approach is to provide a Key/Value pair template. In this approach, instead of defining the zonal position of the data, the tool is told to look for a given key, for example: “Invoice Number”, and then the tool will look at surrounding text to pull the value – for example, preferring text to the right of or underneath the key. This approach works well when the target data may be anywhere within the document, but runs into problems when the Key text is inconsistent. Using our invoice example, some vendors may display Invoice Number as Invoice Num, Invoice Nbr, Invoice #, etc. Existing Capture tools have approaches for minimizing this problem, but it is still an issue for many clients.
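
The two template approaches can be sketched in a few lines. This is a minimal illustration, not how any specific capture product implements them: the OCR word list, the zone coordinates, the sample text, and the function names `extract_zone` and `extract_by_key` are all hypothetical.

```python
import re

# Hypothetical OCR output: recognized words with page coordinates in pixels.
WORDS = [
    {"text": "ACME", "x": 40, "y": 30},
    {"text": "Corp", "x": 110, "y": 30},
    {"text": "RPT-2291", "x": 520, "y": 35},
]
# Location template: (left, top, right, bottom) box in the top right of the header.
REPORT_NUMBER_ZONE = (450, 0, 700, 60)


def extract_zone(words, zone):
    """Location template: join the words whose origin falls inside the zone."""
    left, top, right, bottom = zone
    hits = [w for w in words if left <= w["x"] <= right and top <= w["y"] <= bottom]
    hits.sort(key=lambda w: (w["y"], w["x"]))
    return " ".join(w["text"] for w in hits)


def extract_by_key(text, key_aliases):
    """Key/Value template: find any alias of the key, then read the value.

    The value is taken from the rest of the same line if present,
    otherwise from the line underneath the key.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        for alias in sorted(key_aliases, key=len, reverse=True):
            # The lookahead stops "Invoice Num" from matching inside "Invoice Number".
            pattern = re.escape(alias) + r"(?![A-Za-z])\s*[:\-]?\s*(.*)"
            match = re.search(pattern, line, re.IGNORECASE)
            if not match:
                continue
            value = match.group(1).strip()
            if value:
                return value
            if i + 1 < len(lines):
                return lines[i + 1].strip()
    return None


TEXT = "ACME Corp\nInvoice Nbr: 4471\nTotal Due\n$1,250.00"
ALIASES = ["Invoice Number", "Invoice Num", "Invoice Nbr", "Invoice #"]

report_number = extract_zone(WORDS, REPORT_NUMBER_ZONE)
print(report_number)
print(extract_by_key(TEXT, ALIASES))
print(extract_by_key(TEXT, ["Total Due"]))
print(extract_by_key(TEXT, ["Invoice Number"]))  # key text differs, nothing found
```

The last call shows the weakness described above: with only “Invoice Number” configured, a document that prints “Invoice Nbr” yields no value until someone adds the alias by hand.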

Both approaches are typically augmented with additional processing to look up and verify values against other systems (for example, PO number or account number). This processing can include both configuration and customization depending on requirements.

Capture 2.0 will combine the above approaches while adding Machine Learning to learn from incorrect extractions that the user corrects during indexing. With Capture 1.0 tools, an error that is manually corrected on one document will recur on the next similar document unless the algorithm or template is changed. Capture 2.0 approaches will look to provide the infrastructure to gradually reduce the indexing effort for subsequent documents.

In typical capture scenarios, templates based on location or Key/Value pairs are needed for a number of reasons. For example, templates can be used to classify documents into certain types (ex: invoice vs. purchase order vs. billing report). The key for templates in the Capture 2.0 future will be machine learning that evolves extraction and identification on the fly. If a document matches a given template but incorrect data is extracted from it, the user’s act of correcting the mistake will feed into machine learning algorithms to improve metadata extraction accuracy for subsequent documents. Current capture tools require a manual administrative update to the template or an entirely new template. In reality, this means that templates aren’t updated for most corrected extraction mistakes, leading to user frustration.
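
To make the feedback loop concrete, here is a deliberately simple sketch in which user corrections teach the extractor new key aliases. It is a frequency-counting heuristic standing in for a real machine learning model, and it is not a description of TSG’s implementation; the class `CorrectionLearner`, the `min_votes` parameter, and the sample documents are all assumptions for illustration.

```python
import re
from collections import Counter, defaultdict


class CorrectionLearner:
    """Learns key aliases per field from user corrections (heuristic stand-in for ML)."""

    def __init__(self, min_votes=2):
        self.min_votes = min_votes
        self.votes = defaultdict(Counter)  # field name -> Counter of candidate labels

    def record_correction(self, field, document_text, corrected_value):
        """Count the label printed before the value the user keyed in."""
        for line in document_text.splitlines():
            if corrected_value in line:
                label = line.split(corrected_value)[0].strip(" :\t")
                if label:
                    self.votes[field][label.lower()] += 1

    def learned_aliases(self, field):
        """Return labels seen often enough to be trusted as aliases."""
        return [label for label, n in self.votes[field].items() if n >= self.min_votes]


def extract(text, aliases):
    """Return the token following the first alias found at the start of a line."""
    for line in text.splitlines():
        for alias in aliases:
            m = re.match(re.escape(alias) + r"\s*:?\s*(\S+)", line.strip(), re.IGNORECASE)
            if m:
                return m.group(1)
    return None


base_aliases = ["Invoice Number"]
learner = CorrectionLearner(min_votes=2)

# Two documents where extraction failed and the user typed the right value.
learner.record_correction("invoice_number", "Inv No: 4471\nTotal: 10.00", "4471")
learner.record_correction("invoice_number", "Inv No: 5220\nTotal: 99.00", "5220")

# A third, similar document now extracts without an administrator editing the template.
new_doc = "Inv No: 6001\nTotal: 42.00"
before = extract(new_doc, base_aliases)
after = extract(new_doc, base_aliases + learner.learned_aliases("invoice_number"))
print(before, after)
```

Requiring more than one vote before trusting a label is one way to keep a single mistaken correction from polluting the template; a production system would more likely weigh corrections inside a trained model.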

Focusing on modern technologies, Capture 2.0 tools will do more than just intelligently extract content: they will use machine learning to allow the indexing components to learn over time and achieve better results. See our previous post on how TSG’s indexing can recognize and improve based on user input on real world data. Look for posts in the future on how this approach is evolving to address multiple indexing scenarios.

Capture 2.0 – Addressing digitally born content

Capture 2.0 has to address born digital content from external and internal sources. While some scanning and OCR will always exist for certain scenarios, more and more of the content being captured will be created digitally, and data can be easily captured based on the values in the content itself. Rather than passing through a mailroom function, born digital content will arrive at a company from many different and distributed sources. Capture 2.0 solutions need to provide for batch digital ingestion both from internal sources like computer output and from external vendors via EDI or email. Both large batch jobs and individual pieces of content need to be easily and consistently ingested, with automated indexing as appropriate.

As it relates to the TSG product roadmap, look for upcoming posts on how our tools can capture both scanned images as well as a variety of born digital content including forms, computer output and externally generated source content.

Capture 2.0 – Repository Based Capture in bulk or individually

All of the current capture tools were created as standalone or point solutions in which a batch of scanned documents was initially stored within the scanning solution and later exported to the final repository location. In this manner, capture vendors could more easily support multiple repositories with small integrations rather than having to rely on large integration efforts for each of the different repositories.

TSG solutions have always been repository based as they have evolved out of adding additional capabilities to our repository indexing tools. Benefits of this approach include:

Indexing – Having one indexing process for bulk or individual document ingestion. Logic for indexing captured content can be the same as general import of documents rather than requiring indexing logic in multiple places.
Infrastructure – Less infrastructure to procure and maintain.
Speed – Documents are immediately available to be processed in the repository rather than waiting for a batch process to run.
Business Process rather than point solution – By ingraining the solution in the repository, the business process, including workflow and ECM capabilities, can move from the point solution Capture 1.0 to the full business process of Capture 2.0.

Look for posts in the future on how repository-based capture provides benefits over typical standalone point solution capture tools.

Capture 2.0 – Cloud Friendly Solutions

As IT departments evolve within organizations, we are seeing more and more clients move away from on-premise data centers to Infrastructure as a Service (IaaS) providers in the cloud such as Amazon Web Services (AWS) and Microsoft Azure. Legacy Capture solutions were primarily on-premise installations that fed into one or more on-premise document repositories. Capture 2.0 solutions can utilize a cloud-first architecture that removes the need for an on-premise installation. TSG’s repository-based capture approach outlined above allows our tools to be easily deployed both in IaaS and on-premise.

Cloud based tools have a couple of major benefits over on-premise architectures:

Scaling – Typical ingestion processes still occur in bulk, requiring large infrastructures to process the batch components, with those infrastructures sitting idle most of the time. Cloud based pricing models can be more flexible for addressing both surge and idle requirements.
Cloud options – Cloud processes also provide different business models where external parties index documents as part of an extranet. This is one piece that typical Capture 1.0 vendors struggle to understand and price.

Another benefit to cloud-based capture is that related cloud services can be easily integrated. Using AWS services as examples:

Textract – Provides modern, cloud-based OCR and form recognition. See our post on how Textract compares to OpenText on-premise solutions and how we are adding Textract capabilities to our current products.
Rekognition – for image and video analysis.
Comprehend – to analyze text for items like key phrases, sentiment analysis (positive or negative), topics, people and more.
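
As one example of such an integration, the sketch below turns a Textract form-analysis response into plain key/value pairs. The parsing follows the block structure Textract returns for the `FORMS` feature (`KEY_VALUE_SET` blocks linked to `WORD` blocks through `VALUE` and `CHILD` relationships). The helper names and the trimmed, hand-written sample response are assumptions for illustration, and `analyze_file` requires the `boto3` package and AWS credentials, so it is defined but not called here.

```python
def _child_text(block, blocks_by_id):
    """Join the text of the WORD blocks that a block points to."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(response):
    """Map each form key found by Textract to the text of its value."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", [])
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(_child_text(blocks_by_id[value_id], blocks_by_id))
        pairs[_child_text(block, blocks_by_id)] = " ".join(value_parts)
    return pairs


def analyze_file(path):
    """Send a local image to Textract for form analysis (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parsing code above runs without AWS set up

    client = boto3.client("textract")
    with open(path, "rb") as f:
        return client.analyze_document(Document={"Bytes": f.read()}, FeatureTypes=["FORMS"])


# Hand-written, heavily trimmed response used only to exercise the parser.
SAMPLE_RESPONSE = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "4471"},
]}

pairs = key_value_pairs(SAMPLE_RESPONSE)
print(pairs)
```

Because the service returns keys as they are printed on the page, the resulting pairs would still need to be mapped onto repository index fields, for instance with alias handling like that discussed earlier.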

Look for posts in the future on how cloud-based approaches can provide additional capabilities over typical on-premise solutions.

Capture 2.0 – Mining data and documents with Big Data

Looking beyond the machine learning aspect of Capture 2.0, organizations should also look to big data tools to allow for data mining and analysis of both document content as well as the capture process itself. During both the automated and manual indexing process, data can be fed into a Hadoop or DynamoDB instance or other big data solutions. This data can then be mined and analyzed to provide insights such as:

What fields are users correcting most often?
Which fields have the highest automated extraction success rate?
Are there areas of the capture process that are inefficient for the organization?
If so, are there tweaks to the process or ways to manipulate the data to improve efficiency?
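
The first two questions can be answered with a simple aggregation over indexing events. The sketch below assumes each event records the field name, the automatically extracted value, and the final value after user review. The event shape and sample data are hypothetical, and an in-memory list stands in for a query against Hadoop, DynamoDB, or another big data store.

```python
from collections import Counter

# Hypothetical indexing events: what was extracted vs. what the user saved.
EVENTS = [
    {"field": "invoice_number", "extracted": "4471", "final": "4471"},
    {"field": "invoice_number", "extracted": "447I", "final": "4471"},
    {"field": "po_number", "extracted": "PO-77", "final": "PO-77"},
    {"field": "vendor_name", "extracted": "", "final": "ACME Corp"},
    {"field": "vendor_name", "extracted": "ACNE Corp", "final": "ACME Corp"},
]


def field_stats(events):
    """Per field: total events, user corrections, and automated success rate."""
    totals, corrected = Counter(), Counter()
    for event in events:
        totals[event["field"]] += 1
        if event["extracted"] != event["final"]:
            corrected[event["field"]] += 1
    return {
        field: {
            "total": totals[field],
            "corrected": corrected[field],
            "success_rate": 1 - corrected[field] / totals[field],
        }
        for field in totals
    }


stats = field_stats(EVENTS)
most_corrected = max(stats, key=lambda f: stats[f]["corrected"])
best_extracted = max(stats, key=lambda f: stats[f]["success_rate"])
print("most corrected:", most_corrected)
print("highest success rate:", best_extracted)
```

Adding timestamps and user identifiers to the same events would support the process-efficiency questions as well, such as how long documents wait in an indexing queue.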

Look for additional posts in the future on how Capture 2.0 is feeding Big Data. Also, see our related whitepaper on A Big Data Approach to ECM.

Summary

Capture 2.0 represents the evolution (and disruption) of Capture 1.0, a point solution associated with capturing and digitizing paper in a typical mailroom function. As Capture 2.0 solutions evolve, they will disrupt Capture 1.0 vendors by:

Better addressing both scanned content and the bulk of digitally born content.
Leveraging machine learning over typical template approaches to improve capabilities over time.
Leveraging the repository to move from a point solution to more of a full business process.
Offering cloud-native or cloud-friendly solutions rather than on-premise-only solutions.
Better addressing data mining and other big data requirements.

TSG is excited to explore these areas with our clients and incorporate Capture 2.0 features into our OpenContent Management Suite.

“Capture 2.0 in leveraging the power of AI and Machine Learning will require a complete rethink, often a rebuild from the ground up,” said Alan Pelz-Sharpe, long-time ECM analyst at Deep Analysis. “TSG’s new roadmap for Capture 2.0 appears to be building from the ground up focused on AI & Machine Learning and takes in the lessons learned from decades of deploying Capture solutions.”

Are there other areas you believe will be addressed in the next generation of capture tools? Let us know your thoughts below.
