Recently we’ve been updating our products to leverage Amazon Textract, officially released on May 29th, for its text extraction capabilities and its use in creating full-text PDFs. For this post, we’re going to look into how we can use the indexing modules within the OpenContent Management Suite (OCMS) to improve document indexing, and also explore some possible future improvements that Textract and other tools could enable.
Indexing Scenario
The scenario we’ve been using for our Textract research has been an invoice scenario for Accounts Payable. The idea is that the organization has many invoices coming into the system from a variety of vendors. Some of these invoices may be clean, full-text PDFs. However, others may be image-only PDFs that can be OCR’d by Textract and turned into full-text PDFs.
As documents are ingested into the system, they are added to an indexing queue. Users grab a document from the queue, and it is displayed in the OCMS indexing module:
The primary idea here is that the user can clearly see the document while also filling out the configured metadata required for indexing the document into the system. Previously, OpenAnnotate’s text select mode allowed users to place the cursor in the desired field, for example “Invoice Number”. Then, the user could highlight the invoice number on the image, and the text would automatically be placed into the Invoice Number field.
Our latest iteration goes a step further. Along with highlighting text, which is still an option, the user can double-click a piece of text to select it and pull the value over to the indexing field. Check out the video below to see the indexer in action:
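As a rough illustration of the double-click selection described above, here is a minimal sketch of how a click position could be matched to a word. It assumes the viewer has access to Textract-style WORD blocks, whose bounding boxes are expressed as fractions of the page width and height. The function name `word_at_point` and the sample data are hypothetical, and this is not necessarily how OpenAnnotate or OCMS implement the feature.

```python
from typing import Dict, List, Optional


def word_at_point(blocks: List[Dict], x: float, y: float) -> Optional[str]:
    """Return the text of the first WORD block whose bounding box contains (x, y).

    x and y are fractions of the page width and height (0.0 to 1.0), the same
    coordinate system Textract uses for Geometry.BoundingBox.
    """
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue
        box = block["Geometry"]["BoundingBox"]
        inside_x = box["Left"] <= x <= box["Left"] + box["Width"]
        inside_y = box["Top"] <= y <= box["Top"] + box["Height"]
        if inside_x and inside_y:
            return block["Text"]
    return None


# Hand-written sample shaped like Textract output; not real invoice data.
sample_blocks = [
    {"BlockType": "WORD", "Text": "Invoice",
     "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.05,
                                  "Width": 0.12, "Height": 0.02}}},
    {"BlockType": "WORD", "Text": "INV-1001",
     "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.05,
                                  "Width": 0.15, "Height": 0.02}}},
]

# A double-click 30% across and 6% down the page lands on the second word.
selected = word_at_point(sample_blocks, 0.30, 0.06)
```

The returned text would then be copied into whichever indexing field currently has focus, such as Invoice Number.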
Potential Future Improvements
Because Amazon Textract was only recently released, we can see a number of possible future enhancements as it evolves. Some of these ideas include:
- Utilize Textract’s text detection capabilities to suggest metadata values to the user. For example, if Textract can identify the Invoice Number value based on the document analysis, automatically put this value into the metadata field for Invoice Number. Utilizing this approach, the user only needs to verify the value rather than typing or clicking on the document.
- Integration with Machine Learning tools to better predict and learn where certain text is on the page. For example, based on a document’s unique fingerprint, learn where certain fields are on the page based on prior invoices. This could also be used to improve upon the suggestion feature mentioned in the previous bullet point. For example, if a certain value is suggested, but the user corrects the value by grabbing text from somewhere else on the document, feed that data into the Machine Learning engine so that future suggestions are improved.
- Depending on how accurate suggestions can get as Textract and Machine Learning tools improve, we could start automatically filling out metadata fields rather than simply suggesting values. This would turn the indexing process into one of confirmation rather than assignment.
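To make the suggestion and auto-fill ideas above a little more concrete, here is a minimal sketch of how key-value pairs could be read out of a Textract forms analysis and turned into either a suggestion or an automatic fill. It assumes the blocks come from a Textract document analysis run with the FORMS feature (for example, boto3’s `analyze_document` with `FeatureTypes=["FORMS"]`), which is not called here. The helper names, the sample blocks, and the 95% threshold are all hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical cutoff; Textract reports Confidence on a 0-100 scale.
AUTO_FILL_THRESHOLD = 95.0


def _text_of(block: Dict, by_id: Dict[str, Dict]) -> str:
    """Join the text of the WORD children of a KEY or VALUE block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(blocks: List[Dict]) -> Dict[str, Tuple[str, float]]:
    """Map each detected key label to (value text, confidence of the VALUE block)."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_block = by_id[value_id]
                    pairs[_text_of(block, by_id)] = (
                        _text_of(value_block, by_id),
                        value_block["Confidence"],
                    )
    return pairs


def propose(field_label: str, blocks: List[Dict],
            threshold: float = AUTO_FILL_THRESHOLD) -> Optional[Tuple[str, str]]:
    """Return (value, mode) with mode 'auto_fill' or 'suggest', or None if not found."""
    wanted = field_label.strip().rstrip(":").lower()
    for label, (value, confidence) in key_value_pairs(blocks).items():
        if label.strip().rstrip(":").lower() == wanted:
            mode = "auto_fill" if confidence >= threshold else "suggest"
            return value, mode
    return None


# Hand-written sample shaped like Textract FORMS output; not real invoice data.
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Confidence": 97.0,
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Confidence": 88.5,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice", "Confidence": 99.0},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:", "Confidence": 99.0},
    {"Id": "w3", "BlockType": "WORD", "Text": "INV-1001", "Confidence": 98.0},
]

result = propose("Invoice Number", sample_blocks)
```

With the sample data, the value’s confidence falls below the cutoff, so it would be offered as a suggestion for the user to verify rather than filled in automatically. A correction made by the user at that point is the kind of signal that could be fed back into a learning component, as described in the second bullet.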
Overall, we’re excited about the possibilities as Textract and Machine Learning tools mature over time. Let us know your thoughts below.