TSG has added Amazon Textract to our ECM offerings for Alfresco, Documentum, Hadoop and DynamoDB, as well as our search offerings with Solr and Elasticsearch. Previously, we looked at Textract’s text extraction capabilities. In this post, we discuss and demonstrate how Amazon Textract can serve as a modern OCR indexing engine, converting images to support full-text search for both on-premise and cloud-based solutions.
Amazon Textract Background
Amazon Textract, recently released on May 29th, is an exciting new service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. In TSG’s initial review with actual scanned client documents, Textract’s results compared favorably with those of a traditional OCR engine from OpenText.
Textract is built to extract data from image files. In a later post, TSG will demonstrate how Textract can be used in a common indexing application. TSG has also found that Textract, with some additions, can serve as a better and highly scalable OCR engine for ongoing or backfile image conversion to support full-text search.
Amazon Textract for Full Text Search
The Textract API accepts image files (PNG or JPEG) or PDF documents. Textract responds with a JSON object that contains the text and data fields it identified, along with the position of each item in the image and a confidence level for each result.
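As a minimal sketch of what that response looks like in practice, the Python below calls Textract's synchronous `detect_document_text` operation through boto3 and pulls out each recognized line with its confidence and bounding box. The function names `extract_lines` and `ocr_image`, the `min_confidence` threshold and the default region are assumptions made for illustration, not part of TSG's products.

```python
def extract_lines(response, min_confidence=0.0):
    """Return (text, confidence, bounding_box) for each LINE block.

    `response` is the dict returned by Textract's detect_document_text.
    Confidence is reported by Textract on a 0-100 scale.
    """
    lines = []
    for block in response.get("Blocks", []):
        # Textract also returns PAGE and WORD blocks; keep whole lines only.
        if block.get("BlockType") != "LINE":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence < min_confidence:
            continue
        box = block.get("Geometry", {}).get("BoundingBox", {})
        lines.append((block.get("Text", ""), confidence, box))
    return lines


def ocr_image(image_bytes, region_name="us-east-1"):
    """Send one PNG or JPEG image to Textract (requires AWS credentials)."""
    import boto3  # imported here so extract_lines can be used without AWS

    client = boto3.client("textract", region_name=region_name)
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return extract_lines(response)
```

Filtering on confidence is one way to keep low-quality OCR results out of a search index; whether to drop or keep them is a design choice that depends on the documents involved.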
TSG recommends that clients use the image PDF format, embedding the Textract OCR results as a text layer behind the page image. See our related post, "Redacting PDF – What did the Manafort Lawyers do wrong", for background on image PDFs and how text can be embedded in them. With the text results embedded, the PDF can be ingested by most standard full-text search engines.
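Placing the OCR text behind the image means mapping Textract's coordinates onto the PDF page. Textract reports each bounding box as ratios of the page width and height measured from the top-left corner, while PDF user space by default is measured in points from the bottom-left corner. The sketch below shows only that conversion; the function name `to_pdf_rect` is hypothetical, and actually writing the hidden text (typically with PDF text rendering mode 3, which draws invisible text) is left to whichever PDF library is in use.

```python
def to_pdf_rect(box, page_width_pt, page_height_pt):
    """Convert a Textract BoundingBox to a PDF rectangle in points.

    `box` holds Left, Top, Width and Height as fractions of the page,
    with the origin at the top-left. The result is (x, y, width, height)
    with the origin at the bottom-left, as PDF expects by default.
    """
    x = box["Left"] * page_width_pt
    width = box["Width"] * page_width_pt
    height = box["Height"] * page_height_pt
    # Flip the vertical axis: y is the distance from the page bottom
    # to the bottom edge of the box.
    y = page_height_pt * (1.0 - box["Top"] - box["Height"])
    return (x, y, width, height)


# Example: a line of text on a US Letter page (612 x 792 points).
rect = to_pdf_rect(
    {"Left": 0.1, "Top": 0.2, "Width": 0.5, "Height": 0.1}, 612.0, 792.0
)
```

The returned rectangle gives the position and size at which the line's text would be drawn so that selecting or searching the text highlights the matching area of the scanned image.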
TSG Product Plans for Textract
TSG is currently building Textract connectivity into the following products and scenarios:
- OpenContent Services – OpenContent will provide an asynchronous endpoint capable of calling Textract with an image document (or multiple image pages). Once Textract responds, the resulting OCR text will be combined with the PDF image in the repository.
- OpenMigrate – OpenMigrate will be able to call the OpenContent endpoint during ingestion or migration of images to index them for full-text search, and to index documents already stored in a repository. OpenMigrate can be used for both on-premise and cloud-based solutions. We anticipate adding the steps that call Textract to migration efforts from FileNet, Documentum, OpenText or any other platform, so that a conversion produces content with intelligent data. OpenMigrate can also be used during any document ingestion process for bulk import of documents with full-text search.
- OpenContent Management Suite (OCMS) – OCMS will provide both search for intelligent image documents and updated indexing as part of our case offering. OCMS will call the Textract endpoint in OpenContent to OCR an image when needed, and will use Textract’s ability to intelligently identify relevant document data for either metadata extraction or suggested redactions. Look for a post shortly on updates to our indexing process.
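The asynchronous pattern described for the OpenContent endpoint can be illustrated with Textract's own asynchronous operations, which read a document from S3, return a job ID, and deliver results in pages. This is a sketch under stated assumptions, not TSG's implementation: the function name `collect_async_blocks`, the polling interval and the poll limit are hypothetical, and `client` is assumed to be a boto3 Textract client (for example `boto3.client("textract")`).

```python
import time


def collect_async_blocks(client, bucket, key, poll_seconds=5.0, max_polls=120):
    """Start an async Textract job for an S3 object and gather all blocks."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        status = result["JobStatus"]
        if status == "SUCCEEDED":
            break
        if status != "IN_PROGRESS":
            # FAILED or PARTIAL_SUCCESS: treated as an error in this sketch.
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    # Large documents are returned in pages linked by NextToken.
    blocks = list(result.get("Blocks", []))
    token = result.get("NextToken")
    while token:
        result = client.get_document_text_detection(JobId=job_id, NextToken=token)
        blocks.extend(result.get("Blocks", []))
        token = result.get("NextToken")
    return blocks
```

Polling keeps the example short; for bulk or backfile conversion, Textract can instead publish a completion notification to an SNS topic, which avoids repeated status calls.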
Let us know your thoughts below.