ECM 2.0 – Machine Learning and Indexing

March 20, 2019

One of the more interesting uses for machine learning is its potential to speed up document indexing and make it more efficient.  At TSG, we are currently adding this capability to our document indexing application.  This post describes the current indexing methods from the major vendors and how an ECM 2.0 solution adds machine learning.

ECM 1.0 Indexing – TIFFs, Templates, and Inference

Most of the early indexing solutions were built in the 1990s as part of a scanning/image management solution.  We would group Captiva/Input Accel, Kofax and DataCap as these types of solutions.  Documents would be scanned and OCRed, and a template would be created for each unique document type and tuned to capture the OCRed data from that document.  This process worked well for automating a mail room where lots of paper arrived and needed to be scanned and have its data captured.

For some applications, a template strategy was not practical because the documents originated from many third parties.  Invoices are a perfect example: each company sending an invoice might require the paying company to create a separate template for that sender.  With hundreds or thousands of vendors, this solution isn’t realistic.

A more modern approach came from companies looking to avoid templates. Inference approaches would OCR the entire document and look for data points within the text.  For example, the system would infer that the text after “Invoice No” or “Invoice #” was the invoice number.  With an inference approach, templates do not need to be created. However, the weakness of both the template approach and the inference approach is that once the system makes a mistake, it cannot learn from the correction a user makes.  It can be incredibly frustrating for a user to correct a mistake only to have to correct the same mistake the next time a similar document arrives.
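As a minimal sketch of the inference idea, assuming the OCR output is available as plain text, a label-based lookup might look like the following. The pattern and the function name `infer_invoice_number` are illustrative assumptions, not taken from any vendor product.

```python
import re
from typing import Optional

# Hypothetical label pattern: matches "Invoice No", "Invoice No.",
# "Invoice Number" or "Invoice #", then captures the token that follows.
INVOICE_NO = re.compile(
    r"invoice\s*(?:no\b\.?|number\b|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]*)",
    re.IGNORECASE,
)

def infer_invoice_number(ocr_text: str) -> Optional[str]:
    """Return the token after an invoice-number label, or None if no label is found."""
    match = INVOICE_NO.search(ocr_text)
    return match.group(1) if match else None

print(infer_invoice_number("ACME Corp\nInvoice No: INV-10042\nTotal: $310.00"))
print(infer_invoice_number("Invoice # 7731"))
print(infer_invoice_number("Reference 7731"))
```

The rule is fixed: if a vendor labels the field “Bill Ref”, the lookup returns nothing every time, and a user’s correction changes nothing about the next document.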

ECM 2.0 – Adding Machine Learning and Creating Templates on the Fly

With ECM 2.0, we are starting to see both better OCR from components like Amazon Textract and the ability to add Machine Learning that feeds corrections back into the process to improve the results over time. To quote Alan Pelz-Sharpe, long-time ECM analyst at Deep Analysis:

“First generation capture products will give way to Machine Learning approaches over the next few years, bringing flexibility and organic adaptation to formerly fixed capture situations”

To illustrate how this works, consider an Accounts Payable scenario: if the system can recognize an invoice’s vendor, it can predict where key metadata fields are located. Rather than utilizing static templates, we analyze document content to create a “fingerprint”. These fingerprints allow the system to match Machine Learning features and predict where key metadata fields are on documents that share similar fingerprints.

For the OpenContent Indexing approach, the process looks something like the below:

  • A new document arrives in the system either as a PDF sent electronically or as paper that is scanned and converted with OCR results into a PDF with text.
  • Upon ingestion, the unique fingerprint of the document is created by extracting text and text location from the document. This initial fingerprint is added to a library, establishing a base feature set for future documents.
  • During indexing, the user selects text from the document for metadata fields leveraging an easy point and click interface.
  • The data locations from the document are saved back to the fingerprint library.
  • The next time a document’s fingerprint matches a feature set in the library, the locations of the data are used to automatically pull in the data from the document for review and verification.
  • The indexing user reviews the data and makes any corrections.  If any corrections are made, the locations are fed back into the fingerprint library as required – allowing the process to improve over time.
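The loop above can be sketched roughly as follows. This is an illustration under stated assumptions, not the OpenContent implementation: the post does not say how fingerprints are computed or matched, so the sketch assumes words arrive as `(text, x, y)` tuples in 0..1 page coordinates, builds the fingerprint from each token plus a coarse grid cell, and compares fingerprints with Jaccard similarity. The class and function names and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass, field

def fingerprint(words, grid=10):
    """Assumed feature set: lower-cased token plus the coarse grid cell it sits in."""
    return frozenset((t.lower(), int(x * grid), int(y * grid)) for t, x, y in words)

def jaccard(a, b):
    """Share of features two fingerprints have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

@dataclass
class Entry:
    features: frozenset
    locations: dict = field(default_factory=dict)  # field name -> (x0, y0, x1, y1)

class FingerprintLibrary:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.entries = []

    def add(self, words):
        """Store the fingerprint of a document with no match in the library."""
        entry = Entry(fingerprint(words))
        self.entries.append(entry)
        return entry

    def match(self, words):
        """Return the most similar entry if it clears the threshold, else None."""
        fp = fingerprint(words)
        best = max(self.entries, key=lambda e: jaccard(fp, e.features), default=None)
        if best is not None and jaccard(fp, best.features) >= self.threshold:
            return best
        return None

def read_box(words, box):
    """Join the words whose position falls inside a saved location."""
    x0, y0, x1, y1 = box
    return " ".join(t for t, x, y in words if x0 <= x <= x1 and y0 <= y <= y1)

def suggest(entry, words):
    """Pull values from the saved locations for the user to review."""
    return {name: read_box(words, box) for name, box in entry.locations.items()}

def invoice(number, total):
    # Same hypothetical vendor layout each time; only the values change.
    return [("ACME", 0.12, 0.05), ("Corp", 0.18, 0.05), ("Invoice", 0.72, 0.05),
            ("No", 0.78, 0.05), (number, 0.86, 0.05),
            ("Total", 0.72, 0.92), (total, 0.86, 0.92)]

library = FingerprintLibrary()
first = invoice("INV-100", "310.00")
entry = library.match(first) or library.add(first)       # no match yet, so add
entry.locations["invoice_number"] = (0.82, 0.02, 0.95, 0.08)  # user's point and click

second = invoice("INV-200", "99.50")
matched = library.match(second)                          # same layout, new values
print(suggest(matched, second))
matched.locations["total"] = (0.82, 0.88, 0.95, 0.96)    # correction fed back
print(suggest(matched, second))
```

Because corrections are written to the shared library entry, the next document with a similar fingerprint starts from the corrected locations rather than repeating the mistake.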

The approach above eliminates the need to create templates before documents arrive and provides a feedback loop that improves the process over time.

TSG Labs – Machine Learning from Existing Content

While the process above works well for a “green field” or new system, it does require the indexers to initially build up the fingerprint library for each unique document type.  TSG is currently exploring the idea of creating the library from existing content that has already been indexed.  In this scenario the process would look something like the below:

  • Documents and index data are extracted from the repository.
  • These documents are analyzed to build up the fingerprint library by extracting text and text location. This creates the Machine Learning features necessary to match future documents against the library.
  • The process also searches through the document to look for the location of key metadata fields.  If found, the locations from the document are saved back to the fingerprint library.
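The last step, finding where already-indexed values sit on the page, might be sketched like this. It assumes words are `(text, x, y)` tuples in 0..1 page coordinates and only handles values that appear as a single OCR token; `locate_fields` and the `pad` margin are hypothetical names chosen for illustration.

```python
def locate_fields(words, metadata, pad=0.02):
    """Return {field: (x0, y0, x1, y1)} for each metadata value found verbatim.

    Values that do not appear on the page are skipped, so no location is
    saved for them.
    """
    locations = {}
    for name, value in metadata.items():
        wanted = str(value).strip().lower()
        for text, x, y in words:
            if text.strip().lower() == wanted:
                # Save a small box around the first occurrence of the value.
                locations[name] = (x - pad, y - pad, x + pad, y + pad)
                break
    return locations

# Hypothetical document and index data pulled from a repository.
words = [("Invoice", 0.72, 0.05), ("No", 0.78, 0.05), ("INV-100", 0.86, 0.05),
         ("Total", 0.72, 0.92), ("310.00", 0.86, 0.92)]
metadata = {"invoice_number": "INV-100", "total": "310.00", "po_number": "PO-9"}

print(sorted(locate_fields(words, metadata)))
```

Multi-word values, reformatted dates and amounts would need fuzzier matching than this exact comparison provides.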

After the library is built and Machine Learning features can be matched across documents, the indexing process follows exactly as in the previous section above:

  • During indexing, if a document’s fingerprint matches a feature set in the library, the locations of the data are used to automatically pull in the data from the document for review and verification.
  • The indexing user reviews the data and makes any corrections.  If any corrections are made, the locations are fed back into the fingerprint library as required – allowing the process to improve over time.

In both approaches above, the document is always presented to a reviewer, who checks the extracted data for accuracy.  We would anticipate that, as accuracy improves, the approach would evolve to skip review for documents that can auto-populate all required metadata fields and satisfy certain confidence levels.
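One way such a rule could be expressed is sketched below. The post does not define how confidence is scored, so the per-field scores, the `needs_review` name and the 0.95 threshold are assumptions for illustration only.

```python
def needs_review(extracted, confidences, required, min_confidence=0.95):
    """Send a document to a reviewer unless every required field is
    populated and meets the (assumed) confidence threshold."""
    for name in required:
        if not extracted.get(name):
            return True   # required field missing or empty
        if confidences.get(name, 0.0) < min_confidence:
            return True   # value present but not trusted enough
    return False

required = ["invoice_number", "total"]
extracted = {"invoice_number": "INV-200", "total": "99.50"}

print(needs_review(extracted, {"invoice_number": 0.99, "total": 0.97}, required))
print(needs_review(extracted, {"invoice_number": 0.99, "total": 0.60}, required))
```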

Summary

Machine Learning will provide a new way for user feedback to correct and improve the document indexing process.  Innovative solution providers and clients will begin to add these capabilities to their products.  As TSG’s product and experience evolve, we will be posting other ideas on how machine learning can be leveraged to improve indexing and other capabilities.

If you have other thoughts, please add below.

Filed Under: ECM Solutions, OpenContent Management Suite, TSG Labs
