Editor’s Note (5/31/19): This article was originally posted on 2/12/19, when Amazon Textract was in a pre-release phase. All results were produced using a non-production version of the service. Now that Textract has been formally released, look for results to be updated shortly.
As part of our R&D effort into Amazon Textract with Alfresco, TSG conducted some initial research on the quality of Textract's OCR results on a sample set of images from a real-world TSG client. This blog post shares our findings and some of the differences between Textract and the OpenText Capture Recognition Engine, formerly known as RecoStar. We should note that Textract is currently in beta preview from Amazon, while we used RecoStar version 7.5 for this comparison.
Invoice Sample Set
As an initial document set, we took 14 scanned invoices from a real-world client sample. These documents leaned toward the more difficult side for OCR processing. A concept discussed in a few of the points below is a “High Confidence Type II Error”. This is an error where the tool reports high confidence that a field is correct when the value is actually incorrect. This is important because the only way to catch these errors is a manual review of the values in the document. Our thoughts:
- Textract had a much better overall OCR result. OpenText specifically struggled with watermarks and overlays.
- In most cases, Textract had a lower rate of misreading a field on a document, with an average error rate of about 6.5% on fields within a document. OpenText averaged about a 26% field error rate for the same sample set. OpenText struggled with logos and other types of watermarks. We thought Textract did a significantly better job of filtering out noise from a background image or logo when attempting to extract text.
- We also noted that Textract had a slightly lower chance of making a High Confidence Type II Error. We found 7.35% Type II errors with Textract vs. 10.75% with OpenText.
- However, when looking at the confidence levels on fields that were Type II errors, OpenText’s levels were between 50% and 80%, while Textract’s levels were upwards of 95%. This seems to point to OpenText having slightly more realistic confidence levels in cases where the OCR result was incorrect. It should be noted that this seemed to happen only in the case of bounding box determination in Textract, not in character classification.
- While both tools seem to have difficulty extracting white text on a black backdrop, Textract was able to correctly extract this text in more cases than OpenText.
- When parsing text in different fonts (for example, cursive), Textract seems to perform better. This could help to explain why it parses handwritten notes and angled text more effectively.
- Textract seems to be more capable of filtering out “noise” in the document such as logos, gridlines, and watermarks. In several of the test cases, OpenText’s confidence levels and prediction success were affected by such noise.
- In several cases, the inclusion of non-alphanumeric characters in an otherwise alphanumeric string led to misreading of the string by OpenText. Textract had fewer misreads of text following this pattern.
- In many of Textract's errors, the bounding area surrounding the text would either begin too late or end too early, slicing part of the text off.
- Textract did a much better job not trying to read/parse logos and other header images.
Both struggled with handwriting, specifically notes jotted on the invoice (the handwriting was not in a structured format).
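To make the two metrics above concrete, here is a minimal sketch of how a field error rate and a High Confidence Type II Error rate could be computed from manually reviewed results. The `FieldResult` structure, the function names, and the 90% confidence threshold are assumptions for illustration; this post does not spell out the exact calculation or cutoff that was used.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    expected: str      # ground-truth value from manual review
    extracted: str     # value returned by the OCR tool
    confidence: float  # tool-reported confidence, on a 0-100 scale


def field_error_rate(results):
    """Fraction of fields whose extracted value differs from the ground truth."""
    errors = sum(1 for r in results if r.extracted != r.expected)
    return errors / len(results)


def high_confidence_error_rate(results, threshold=90.0):
    """Fraction of fields that are wrong AND reported at or above the threshold.

    The threshold is a hypothetical cutoff, not a value taken from either tool.
    """
    hits = sum(
        1
        for r in results
        if r.extracted != r.expected and r.confidence >= threshold
    )
    return hits / len(results)
```

Both functions divide by the total number of fields reviewed, so the high-confidence rate is always less than or equal to the overall field error rate.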
Example Images
The below shows one of the high-confidence bounding errors where Textract dropped the state abbreviation, SC, as part of the recognition of an address field. This type of error could be programmatically corrected as the zip code could be used to generate the state.
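As a rough sketch of that kind of programmatic correction, the snippet below fills in a missing state from the ZIP code. The prefix table is deliberately tiny and only illustrative, and the `state_from_zip` and `fill_missing_state` helpers and the address dictionary layout are hypothetical; a real implementation would rely on a complete ZIP code database.

```python
# Illustrative, incomplete table of 3-digit ZIP prefix ranges.
# A production system would use a full ZIP code database instead.
ZIP_PREFIX_RANGES = [
    (270, 289, "NC"),
    (290, 299, "SC"),
]


def state_from_zip(zip_code):
    """Return the state for a ZIP code, or None if the prefix is not in the table."""
    digits = zip_code.strip()[:3]
    if len(digits) < 3 or not digits.isdigit():
        return None
    prefix = int(digits)
    for low, high, state in ZIP_PREFIX_RANGES:
        if low <= prefix <= high:
            return state
    return None


def fill_missing_state(address):
    """Return a copy of the address with 'state' filled from 'zip' when it is empty."""
    fixed = dict(address)
    if not fixed.get("state"):
        state = state_from_zip(fixed.get("zip", ""))
        if state is not None:
            fixed["state"] = state
    return fixed
```

A value that is already present is left untouched, so the correction only applies when the OCR result dropped the state entirely.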
The below example shows a result from OpenText where white text on a black background was not detected at all by OpenText. Textract was able to recognize Quantity, Product Id and Description correctly.
The below example shows how Textract better recognized a cursive (non-handwriting) font. We are not sure this would be relevant for most cases, but it was interesting.
The below example shows how background logos corrupted the recognition by OpenText and not Textract.
The below example shows how OpenText struggled with a combined alpha and numeric string example.
The below shows an example where Textract struggled with some bounding issues.
While not important for most scenarios, Textract did do an admirable job capturing handwriting.
Summary
From our review, Textract did a better job on our sample set when compared to OpenText for just about every scenario. Our biggest concern was with Textract’s confidence levels, as they did result in Type II errors, where Textract said it was highly confident of a result but that result turned out to be incorrect. We would recommend clients begin looking at Textract when it is generally available as an alternative to OpenText/RecoStar.
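For readers who want to look at confidence values themselves, here is a minimal sketch that pulls low-confidence lines out of a Textract-style response for manual review. It assumes the response shape of the released API (a `Blocks` list whose `LINE` entries carry `Text` and a 0-100 `Confidence`), which may differ from the pre-release version we tested; the `lines_needing_review` name and the 90% threshold are hypothetical. A threshold like this would not catch the high-confidence errors described above, since those are reported with confidence near the top of the scale.

```python
def lines_needing_review(response, threshold=90.0):
    """Return (text, confidence) pairs for LINE blocks below the threshold.

    `response` is assumed to be a dict shaped like a Textract text-detection
    result; the threshold is an illustrative cutoff, not a recommended value.
    """
    flagged = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and other block types
        confidence = block.get("Confidence", 0.0)
        if confidence < threshold:
            flagged.append((block.get("Text", ""), confidence))
    return flagged
```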
From a TSG roadmap perspective, look for future posts as we begin to incorporate Textract into our products, both as a full-text search indexing alternative and as part of our metadata indexing capabilities.