Capture 2.0 – Improving Metadata Extraction with Machine Learning

July 31, 2019

As we begin to explore the topic of Capture 2.0 at TSG, the primary component that we would propose differentiates legacy capture tools from the capture tools of the future is the inclusion of machine learning in the capture process. As discussed in our Capture 2.0 introductory post, the majority of legacy tools do not improve over time. This post takes a deeper dive into how Capture 2.0 tools will use machine learning to correct extraction errors automatically over time, and how this capability is shaping our product roadmap.

Templates and Metadata Extraction

In traditional capture tools, a template is required to train the system. As part of the training process, an administrative user sets up extraction rules. Typically these fall into one of two approaches:

  • Location / Zonal Approach – using this approach, the administrator defines a zone on the document to denote where a piece of data resides. For example, the tool could be told to look in a given box in the top right corner of the header to pull the “Report Number” value. This approach works well only when the position of the data is known and very consistent across all documents.
  • Key/Value Pair Approach – using this approach, instead of defining the zonal position of the data, the tool is told to look for a given key, for example “Invoice Number”, and then the tool looks at the surrounding text to pull the value – for example, preferring text to the right of or underneath the key. This approach works well when the target data may be anywhere within the document, but runs into problems when the key text is inconsistent. Using our invoice example, some vendors may display Invoice Number as Invoice Num, Invoice Nbr, Invoice #, etc. Existing capture tools have approaches for minimizing this problem, but it is still an issue for many clients. (A minimal sketch of both approaches follows this list.)
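
To make the two approaches concrete, below is a minimal Python sketch of both extraction styles. The word list, bounding-box format, and key synonyms are hypothetical stand-ins for whatever an OCR engine actually returns; production capture tools are considerably more sophisticated than this.

    # Hypothetical OCR output: each recognized word with its position on the page.
    words = [
        {"text": "Invoice", "x": 40,  "y": 50, "w": 60, "h": 12},
        {"text": "Nbr:",    "x": 105, "y": 50, "w": 35, "h": 12},
        {"text": "84721",   "x": 150, "y": 50, "w": 50, "h": 12},
    ]

    def extract_zonal(words, zone):
        """Location/zonal approach: return all text found inside a fixed box."""
        hits = [w["text"] for w in words
                if zone["x1"] <= w["x"] <= zone["x2"]
                and zone["y1"] <= w["y"] <= zone["y2"]]
        return " ".join(hits) or None

    def extract_key_value(words, key_synonyms):
        """Key/value approach: scan one- and two-word windows for a known key,
        then take the next word in reading order as the value."""
        for i in range(len(words)):
            for span in (1, 2):
                phrase = " ".join(w["text"] for w in words[i:i + span])
                if phrase.rstrip(":#").lower() in key_synonyms:
                    rest = words[i + span:]
                    if rest:
                        return rest[0]["text"]
        return None

    # The synonym set illustrates the inconsistent-key problem described above.
    print(extract_zonal(words, {"x1": 140, "x2": 220, "y1": 40, "y2": 70}))  # 84721
    print(extract_key_value(words, {"invoice number", "invoice num",
                                    "invoice nbr", "invoice #"}))            # 84721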

As we expand the OpenContent Management Suite’s Capture modules, we plan to support the above templating process and extraction methods as well. However, the key to Capture 2.0 is that the extraction model does not stop there.

Improving Capture Templates with Machine Learning

Capture 2.0 tools will improve upon the above metadata extraction techniques by incorporating machine learning into the process and improving extraction results on the fly. If a document matches a given template but incorrect data is extracted from it, the user’s act of correcting the mistake will feed into machine learning algorithms to improve metadata extraction accuracy for subsequent documents. Current capture tools require a manual administrative update to the template, or an entirely new template. In practice, this means that templates aren’t updated for most corrected extraction mistakes, leading to user frustration.

In the OpenContent Management Suite’s Capture solution, when a user notices an incorrect metadata extraction and corrects the data location, that correction will be fed back into the extraction engine. As users correct extraction errors over time, machine learning algorithms will learn from these corrections to prevent future extraction errors. We can also feed data points such as the extraction confidence percentage or the number of mistakes the extraction process encountered back into the system. For example, if many changes are made and/or the extraction confidence is below a certain point, a new extraction template could be created on the fly rather than modifying an existing template.
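
As a rough illustration of that decision point, here is a minimal Python sketch. The Correction and Template structures, the thresholds, and the refine-versus-new-template heuristic are all assumptions made for illustration; they are not the actual OpenContent Management Suite APIs.

    from dataclasses import dataclass, field

    @dataclass
    class Correction:
        field_name: str   # metadata field the user fixed, e.g. "invoice_number"
        suggested: str    # value the extraction engine originally proposed
        corrected: str    # value the user actually selected on the document
        location: tuple   # (x, y) of the corrected value on the page

    @dataclass
    class Template:
        name: str
        # Learned evidence per field: a list of (value, location) observations.
        observations: dict = field(default_factory=dict)

        def learn(self, c: Correction):
            self.observations.setdefault(c.field_name, []).append(
                (c.corrected, c.location))

    def feed_back(template, corrections, confidence,
                  max_changes=3, min_confidence=0.70):
        """Heuristic sketch: a few corrections on a high-confidence match refine
        the existing template; many corrections or low confidence suggest the
        document never really matched, so a new template is created on the fly.
        The thresholds are illustrative, not tuned values."""
        if len(corrections) > max_changes or confidence < min_confidence:
            fresh = Template(name=template.name + "-variant")
            for c in corrections:
                fresh.learn(c)
            return fresh
        for c in corrections:
            template.learn(c)
        return template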

Here’s an overview of how the process will work:

  1. Create and Train – Capture administrators will be able to create initial templates with extraction rules (e.g. zonal, key/value pair). These templates will be fed into the suggestion engine.
  2. Bulk Ingestion – As documents enter the system, OpenMigrate can call the suggestion engine to classify documents and extract metadata.
  3. Store Completed Docs – After receiving the extracted data, if required fields are all filled with a high enough confidence level, the document is filed in the repository in the correct location.
  4. Queue Incomplete Docs – If all required fields cannot be completed with high enough confidence, the document is placed into the repository and queued for indexing in OCMS.
    1. Note that in either case above, the document is always ingested into the repository.
  5. Extract Metadata – During OCMS indexing, the suggestion engine can be called to return metadata suggestions for documents that have not yet been processed through the suggestion engine. This can happen, for example, for documents that were queued for indexing by a process other than OpenMigrate.
  6. Finalize Document – The user works through the queue of documents to index, verifying the metadata suggestions extracted from the document and saving the final metadata values.
  7. Extraction Error Corrections – During the previous step, the indexing module of OCMS keeps track of any corrections that were made. For example, if the user dismisses one of the original suggestions and selects a different value on the document, that correction is fed back into the suggestion engine so that the same mistake is not repeated the next time a similar document is processed.
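
A compressed sketch of the routing logic in steps 2 through 4 might look like the following. The suggestion engine interface, the required-field schema, and the confidence floor are hypothetical; the one invariant carried over from the list above is that the document is stored in the repository either way.

    REQUIRED_FIELDS = ("document_type", "invoice_number")  # illustrative schema
    CONFIDENCE_FLOOR = 0.85                                # illustrative threshold

    def ingest(document, suggestion_engine, repository, index_queue):
        """Steps 2-4: classify and extract, then file the document or queue it
        for indexing. suggestion_engine.extract is assumed to return a dict
        of {field: (value, confidence)}."""
        suggestions = suggestion_engine.extract(document)

        complete = all(
            f in suggestions and suggestions[f][1] >= CONFIDENCE_FLOOR
            for f in REQUIRED_FIELDS
        )

        # Keep only high-confidence values as the initial metadata.
        metadata = {f: v for f, (v, conf) in suggestions.items()
                    if conf >= CONFIDENCE_FLOOR}

        # In either case, the document is always ingested into the repository.
        doc_id = repository.store(document, metadata)

        if not complete:
            # A user verifies and corrects the suggestions during OCMS indexing
            # (steps 5 and 6 above).
            index_queue.put(doc_id)
        return doc_id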

Summary

One of the most important concepts within Capture 2.0 is that the capture process, including metadata extraction, should improve over time as users utilize the system in live production environments. While templates set the stage for document classification and metadata extraction, machine learning improves the templates over time, reducing user frustration with extraction errors. Let us know your thoughts below.

Filed Under: Content Capture, OpenContent Management Suite
