As we begin to explore Capture 2.0 at TSG, the primary difference we see between legacy capture tools and the capture tools of the future is the inclusion of machine learning in the capture process. As discussed in our Capture 2.0 introductory post, the majority of legacy tools do not improve over time. This post takes a deeper dive into how Capture 2.0 tools will use machine learning to improve the capture process, so that extraction errors are corrected automatically over time, and how this capability is shaping our product roadmap.
Templates and Metadata Extraction
In traditional capture tools, a template is required to train the system. As part of the training process, an administrative user sets up extraction rules. Typically these fall into one of two approaches:
- Location / Zonal Approach – using this approach, the administrator defines a zone on the document to denote where a piece of data resides. For example, the tool could be told to look in a given box in the top right corner of the header to pull the “Report Number” value. This approach only works well when the positional data is known and very consistent across all documents.
- Key/Value Pair Approach – using this approach, instead of defining the zonal position of the data, the tool is told to look for a given key, for example: “Invoice Number”, and then the tool will look at surrounding text to pull the value – for example, preferring text to the left or underneath the key. This approach works well when the target data may be anywhere within the document, but runs into problems when the key text is inconsistent. Using our invoice example, some vendors may display Invoice Number as Invoice Num, Invoice Nbr, Invoice #, etc. Existing capture tools have approaches for minimizing this problem, but it is still an issue for many clients.
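To make the key/value pair approach and its weakness concrete, here is a minimal sketch in Python. It is not how any particular capture product works: the variant list, the function names, and the rule of taking the first token after the key (or the first token on the next line) are all assumptions made for illustration, and it operates on plain text lines rather than positioned OCR output.

```python
import re

# Hypothetical key variants for one field; a real template would define its own.
KEY_VARIANTS = ["Invoice Number", "Invoice Num", "Invoice Nbr", "Invoice #"]


def build_key_pattern(variants):
    """Compile one case-insensitive pattern that matches any key variant."""
    parts = []
    # Longest first, so "Invoice Number" is tried before "Invoice Num".
    for variant in sorted(variants, key=len, reverse=True):
        # A word boundary stops "Invoice Num" from matching inside "Invoice Number".
        suffix = r"\b" if variant[-1].isalnum() else ""
        parts.append(re.escape(variant) + suffix)
    return re.compile("|".join(parts), re.IGNORECASE)


def extract_value(lines, variants):
    """Return the first token after the key, or on the next line, else None."""
    key_pattern = build_key_pattern(variants)
    for index, line in enumerate(lines):
        match = key_pattern.search(line)
        if match is None:
            continue
        # Text following the key on the same line, minus separators.
        rest = line[match.end():].lstrip(" :.\t").strip()
        if rest:
            return rest.split()[0]
        # Key stands alone: look underneath it (the next line).
        if index + 1 < len(lines) and lines[index + 1].strip():
            return lines[index + 1].split()[0]
    return None


print(extract_value(["ACME Corp", "Invoice Nbr: 10442"], KEY_VARIANTS))
print(extract_value(["Invoice Number", "98765"], KEY_VARIANTS))
print(extract_value(["Invoice No. 5"], KEY_VARIANTS))
```

The last call returns None because “Invoice No.” is not in the variant list, which is exactly the inconsistent-key problem described above: every new spelling needs an administrator to extend the template.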
As we begin to expand the OpenContent Management Suite’s Capture modules, we plan on supporting the above templating process and extraction methods as well. However, the key to Capture 2.0 is that the extraction model does not stop there.
Improving Capture Templates with Machine Learning
Capture 2.0 tools will improve upon the above metadata extraction techniques by incorporating machine learning into the process and improving extraction results on the fly. If a document matches a given template, but incorrect data is extracted from the document, the user’s act of correcting the mistake will feed into machine learning algorithms to improve metadata extraction accuracy for subsequent documents. Current capture tools instead require a manual administrative update to the template or an entirely new template. In practice, this means that templates aren’t updated for most corrected extraction mistakes, leading to user frustration.
In the OpenContent Management Suite’s Capture solution, when a user notices an incorrect metadata extraction and corrects the data location, the correction data will be fed back into the extraction engine. As users correct extraction errors over time, machine learning algorithms will learn from these corrections to prevent future extraction errors. We can also feed other data points back into the system, such as the extraction confidence percentage or the number of mistakes the extraction process encountered. For example, if many changes are made and/or the extraction confidence is below a certain point, a new extraction template could be created on the fly rather than modifying an existing template.
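The decision described above, whether to adjust an existing template or create a new one, can be sketched as a simple rule. This is only an illustration under assumed names and thresholds: `ExtractionResult`, `feedback_action`, the 0.6 confidence cutoff, and the 50% correction ratio are hypothetical and are not part of the OpenContent Management Suite.

```python
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    template_id: str
    confidence: float       # 0.0 to 1.0, as reported by the extraction step
    fields_total: int       # number of fields the template tried to extract
    fields_corrected: int   # number of fields the user had to fix


def feedback_action(result, min_confidence=0.6, max_correction_ratio=0.5):
    """Pick a feedback action; both thresholds are illustrative assumptions."""
    if result.fields_total <= 0:
        raise ValueError("fields_total must be positive")
    ratio = result.fields_corrected / result.fields_total
    # Many corrections and/or low confidence: the template is a poor fit.
    if result.confidence < min_confidence or ratio > max_correction_ratio:
        return "create_new_template"
    # A few corrections: keep the template and learn from them.
    if result.fields_corrected > 0:
        return "update_existing_template"
    return "no_change"


print(feedback_action(ExtractionResult("invoice-a", 0.92, 10, 1)))
print(feedback_action(ExtractionResult("invoice-a", 0.41, 10, 1)))
```

In a real system the thresholds would likely be tuned per document type, and the rule itself could be learned rather than hard-coded.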
Here’s an overview of how the process will work:
- Create and Train – Capture administrators will be able to create initial templates with extraction rules (e.g., zonal, key/value pair, etc.). These templates will be fed into the suggestion engine.
- Bulk Ingestion – As documents enter the system, OpenMigrate can call the suggestion engine to classify documents and extract metadata.
- Store Completed Docs – After receiving the extracted data, if required fields are all filled with a high enough confidence level, the document is filed in the repository in the correct location.
- Queue Incomplete Docs – If all required fields cannot be completed with high enough confidence, the document is placed into the repository and queued for indexing in OCMS.
- Note that in either case above, the document is always ingested to the repository.
- Extract Metadata – During OCMS indexing, the suggestion engine can be called to return metadata suggestions for documents that have not yet been processed through the suggestion engine. This can happen, for example, for documents that were queued for indexing by a process other than OpenMigrate.
- Finalize Document – The user works through the queue of documents to index, verifying the metadata suggestions extracted from the document and saving the final metadata values.
- Extraction Error Corrections – During the previous step, the indexing module of OCMS keeps track of any error corrections that were made. For example, if the user dismisses one of the original suggestions and selects a different value on the document, that correction is fed back into the suggestion engine so that the next time a similar document is processed, the same mistake is not repeated.
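Two of the steps above lend themselves to a short sketch: deciding whether a document is stored directly or queued for indexing, and working out which corrections to send back to the suggestion engine. The code below is a simplified illustration, not the OCMS or OpenMigrate API: `Suggestion`, `route_document`, `collect_corrections`, and the 0.9 threshold are assumed names and values.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    value: str
    confidence: float  # 0.0 to 1.0


def route_document(suggestions, required_fields, threshold=0.9):
    """Return "store" if every required field is confidently filled, else "queue".

    Either way the document is ingested; this only decides whether a person
    needs to index it. The threshold is an illustrative assumption.
    """
    for name in required_fields:
        suggestion = suggestions.get(name)
        if suggestion is None or not suggestion.value:
            return "queue"
        if suggestion.confidence < threshold:
            return "queue"
    return "store"


def collect_corrections(suggestions, final_values):
    """Map each corrected field to (suggested value, value the user saved)."""
    corrections = {}
    for name, final in final_values.items():
        suggestion = suggestions.get(name)
        suggested = suggestion.value if suggestion is not None else None
        if suggested != final:
            corrections[name] = (suggested, final)
    return corrections


suggested = {
    "invoice_number": Suggestion("10442", 0.97),
    "vendor": Suggestion("ACME Crop", 0.55),
}
print(route_document(suggested, ["invoice_number", "vendor"]))
print(collect_corrections(suggested, {"invoice_number": "10442", "vendor": "ACME Corp"}))
```

Here the low-confidence vendor field sends the document to the indexing queue, and the user’s fix to that field is the only correction that would be fed back into the suggestion engine.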
One of the most important concepts within Capture 2.0 is that the capture process, including metadata extraction, should improve over time as users work with the system in live production environments. While templates set the stage for document classification and metadata extraction, machine learning improves the template over time, reducing user frustration with extraction errors. Let us know your thoughts below.