
Technology Services Group


Amazon Textract for Full Text Search


June 3, 2019

TSG has added Amazon Textract to our ECM offerings for Alfresco, Documentum, Hadoop and DynamoDB as well as our search offerings with Solr and Elasticsearch. Previously, we looked at Textract’s text extraction capabilities. In this post, we discuss and demonstrate how Amazon Textract can be used as a modern OCR indexing engine that converts images to support full-text search for both on-premise and cloud-based solutions.

Amazon Textract Background

Amazon Textract, released on May 29th, is an exciting new service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. In TSG’s initial review with actual scanned client documents, Textract compared favorably with a traditional OCR engine from OpenText.

Textract is built to extract data from image files. In a later post, TSG will demonstrate how Textract can be used in a common indexing application. TSG has also found that Textract, with some additions, can serve as a better, highly scalable OCR engine for ongoing or backfile image conversion to support full-text search.

Amazon Textract for Full Text Search

The Textract API accepts image files (such as PNG or JPEG) or PDF documents. Textract then responds with a JSON object that includes the data fields it identified, the placement of each piece of data in the image, and a confidence level for each result.
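As a rough illustration of the response described above, the sketch below pulls line-level text, confidence and placement out of a Textract response. The `Blocks`, `BlockType`, `Text`, `Confidence` and `Geometry.BoundingBox` keys follow the documented response of the `DetectDocumentText` operation. The sample values, the helper names `extract_lines` and `to_search_text`, and the 90 percent threshold are assumptions made for illustration and are not part of TSG's products.

```python
from typing import Any, Dict, List

# Hypothetical, hand-written sample in the shape of a DetectDocumentText
# response. A real response would come from a call such as:
#   boto3.client("textract").detect_document_text(Document={"Bytes": image_bytes})
SAMPLE_RESPONSE: Dict[str, Any] = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {
            "BlockType": "LINE", "Id": "l1",
            "Text": "Claim Number 12345", "Confidence": 99.1,
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.1,
                                         "Width": 0.4, "Height": 0.03}},
        },
        {
            "BlockType": "LINE", "Id": "l2",
            "Text": "smudged text", "Confidence": 41.7,
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2,
                                         "Width": 0.3, "Height": 0.03}},
        },
        {
            "BlockType": "WORD", "Id": "w1",
            "Text": "Claim", "Confidence": 99.5,
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.1,
                                         "Width": 0.1, "Height": 0.03}},
        },
    ]
}


def extract_lines(response: Dict[str, Any],
                  min_confidence: float = 0.0) -> List[Dict[str, Any]]:
    """Return text, confidence and bounding box for each LINE block."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        if block.get("Confidence", 0.0) < min_confidence:
            continue  # drop low-confidence results
        lines.append({
            "text": block["Text"],
            "confidence": block["Confidence"],
            "box": block["Geometry"]["BoundingBox"],
        })
    return lines


def to_search_text(response: Dict[str, Any],
                   min_confidence: float = 0.0) -> str:
    """Join the accepted lines into one string for a full-text index."""
    return "\n".join(line["text"]
                     for line in extract_lines(response, min_confidence))
```

Calling `to_search_text(SAMPLE_RESPONSE, min_confidence=90.0)` keeps only the first line of the sample. Whether low-confidence lines should be dropped or indexed anyway is a design choice that depends on how much search recall matters for the collection.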

TSG recommends that clients use the image PDF format and embed the Textract OCR results as text behind the image. See our related post, “Redacting PDF – What did the Manafort Lawyers do wrong”, to better understand image PDFs and how text can be embedded in a PDF. With the text results embedded, the PDF can be ingested by most standard full-text search engines.
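Placing OCR text behind the image depends on mapping Textract's placement data onto the PDF page. Textract reports each bounding box as ratios of the page width and height, measured from the top-left corner, while default PDF user space is measured in points from the bottom-left corner. The following is a minimal sketch of that conversion only. The function name `bbox_to_pdf_rect` and the page size are assumptions for illustration, and writing the hidden text layer itself would be left to whichever PDF library is in use.

```python
from typing import Dict, Tuple


def bbox_to_pdf_rect(bbox: Dict[str, float],
                     page_width_pt: float,
                     page_height_pt: float) -> Tuple[float, float, float, float]:
    """Convert a Textract BoundingBox to (x, y, width, height) in PDF points.

    Textract: Left/Top/Width/Height are ratios (0..1), origin at top-left.
    PDF:      units are points, origin at bottom-left, y grows upward.
    The returned (x, y) is the lower-left corner of the box.
    """
    x = bbox["Left"] * page_width_pt
    width = bbox["Width"] * page_width_pt
    height = bbox["Height"] * page_height_pt
    # Flip the vertical axis: the bottom edge of the box sits at
    # (1 - Top - Height) of the page height above the bottom of the page.
    y = (1.0 - bbox["Top"] - bbox["Height"]) * page_height_pt
    return (x, y, width, height)


# Hypothetical box on a hypothetical 600 x 800 point page.
example_box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
example_rect = bbox_to_pdf_rect(example_box, 600.0, 800.0)
```

Text drawn invisibly at these coordinates lines up with the scanned words, which is what lets a viewer highlight search hits on top of the image.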

TSG Product Plans for Textract

TSG is currently building Textract connectivity into the following products and scenarios:

  • OpenContent Services – OpenContent will provide an asynchronous endpoint capable of calling Textract with an image document (or multiple image pages). Once Textract responds, the resulting OCR text will be combined with the PDF image in the repository.
  • OpenMigrate – OpenMigrate will be able to call the OpenContent endpoint to ingest or migrate images and index them for full-text search, as well as to index documents already contained in a repository. OpenMigrate can be used for both on-premise and cloud-based solutions. We anticipate adding steps that call Textract during migrations from FileNet, Documentum, OpenText or any other platform, so that the conversion creates content with intelligent data. OpenMigrate can also be used during any document ingestion process for bulk import of documents with full-text search.
  • OpenContent Management Suite (OCMS) – OCMS will provide both search across the intelligent image documents and updated indexing as part of our case offering. OCMS will call the Textract endpoint in OpenContent to OCR the image if needed, and will use Textract’s ability to intelligently identify relevant document data for either metadata extraction or suggested redactions. Look for a post shortly on updates to our indexing process.
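To make the asynchronous pattern in the list above more concrete, here is a hedged sketch of what a caller of Textract's asynchronous text detection might look like. It is not how OpenContent implements its endpoint. The `start_document_text_detection` and `get_document_text_detection` operations and their `JobId`, `JobStatus` and `NextToken` fields are part of the Textract API as exposed by boto3. The function name `collect_async_text`, the polling interval and the retry limit are assumptions for illustration.

```python
import time
from typing import Any, List


def collect_async_text(client: Any, bucket: str, key: str,
                       poll_seconds: float = 5.0,
                       max_polls: int = 120) -> List[str]:
    """Start an async Textract job for an S3 object and return its LINE text.

    `client` is expected to behave like boto3.client("textract").
    """
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        page = client.get_document_text_detection(JobId=job_id)
        status = page["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            break
        if status == "FAILED":
            raise RuntimeError("Textract job %s failed" % job_id)
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("Textract job %s did not finish" % job_id)

    # Results for multi-page documents are paginated with NextToken.
    lines: List[str] = []
    while True:
        lines.extend(block["Text"] for block in page.get("Blocks", [])
                     if block["BlockType"] == "LINE")
        token = page.get("NextToken")
        if not token:
            break
        page = client.get_document_text_detection(JobId=job_id,
                                                  NextToken=token)
    return lines


# Example call (requires AWS credentials and boto3; bucket and key are
# hypothetical):
#   import boto3
#   text = collect_async_text(boto3.client("textract"),
#                             "my-bucket", "scans/claim-0001.pdf")
```

Polling keeps the sketch short. Textract can also publish job completion to an Amazon SNS topic, which avoids holding a worker open while a large backfile is processed.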

Let us know your thoughts below.

Filed Under: Amazon, OpenCapture, OpenContent Management Suite, OpenMigrate
