As we have discussed in our Hadoop Series, more and more companies are considering Hadoop for storage and management of documents and files. Just like our ECM clients, companies storing documents or scanned files in Hadoop want to provide PDF renditions of documents for easy viewing and other PDF capabilities. This post will discuss how Adlib can be leveraged with Solr/Lucene behind TSG’s OpenContent layer to provide robust ECM capabilities for your Hadoop repository.
Hadoop – Storing Documents and Image Files
When using Hadoop to store documents, it is important to consider the usage patterns of the document. It is very easy to just store the native Word/Excel document or image scan (typically TIFF) in a Hadoop repository and retrieve the document by its Hadoop row key when it is needed. Some potential issues with this approach are:
- Users need to have Word/Excel or a TIFF Image Viewer installed on their PC in order to view the document.
- Mobile/tablet users don’t always have access to the applications required to view the document.
- Hadoop doesn’t have a very robust way to search for documents by anything other than the unique ID of the document.
In an ECM system, we typically see that 70% of users only need to view and print most documents. PDF provides an easy way to quickly view and print documents. To address the first two issues above, it is a best practice to store the native content (Word, Excel, AutoCAD, etc.) in the Hadoop repository alongside a PDF “rendition” of the document. Storing the native content and the PDF rendition together in Hadoop allows for fast retrieval of either format, depending on the use case.
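As a rough illustration of keeping both formats together, here is a minimal Python sketch that uses an in-memory dictionary as a stand-in for a row-keyed Hadoop table. The column names, the DocumentStore class and the sample row key are all hypothetical; the real layout depends on how your repository is configured.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical column names; a real repository would define its own schema.
NATIVE = "content:native"
PDF = "content:pdf"


@dataclass
class DocumentStore:
    """In-memory stand-in for a row-keyed table (illustration only)."""
    rows: Dict[str, Dict[str, bytes]] = field(default_factory=dict)

    def put(self, row_key: str, column: str, data: bytes) -> None:
        # Both formats live in the same row, under different columns.
        self.rows.setdefault(row_key, {})[column] = data

    def get(self, row_key: str, column: str) -> Optional[bytes]:
        return self.rows.get(row_key, {}).get(column)

    def get_for_viewing(self, row_key: str) -> Optional[bytes]:
        # View/print users get the PDF rendition; fall back to the
        # native content if no rendition has been stored yet.
        pdf = self.get(row_key, PDF)
        return pdf if pdf is not None else self.get(row_key, NATIVE)


store = DocumentStore()
store.put("doc-0001", NATIVE, b"<word document bytes>")
store.put("doc-0001", PDF, b"%PDF-1.7 <rendition bytes>")
viewable = store.get_for_viewing("doc-0001")
```

Because both formats share one row key, a viewer can ask for the PDF while an author asks for the native file, and neither needs a second lookup to find the other format.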
Another consideration when storing documents in Hadoop is how users are going to search for and retrieve these documents. As we have blogged about in the past, Hadoop doesn’t provide a very robust “search” feature set by itself. To address the third issue, you can run Apache Solr alongside your Hadoop repository to provide a searchable index. This is an ECM best practice that allows for efficient and robust searching of your repository.
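To give a feel for what searching the index looks like, here is a small Python sketch that builds a request URL for Solr's standard select handler. The q, fq, fl, rows and wt parameters are standard Solr query parameters, but the host, the collection name ("documents") and the field names ("content", "doc_type", "title") are assumptions made up for this example.

```python
from urllib.parse import urlencode
from typing import Optional


def build_solr_query(base_url: str, collection: str, text: str,
                     doc_type: Optional[str] = None, rows: int = 10) -> str:
    """Build a URL for Solr's /select handler (field names are hypothetical)."""
    params = [
        ("q", text),           # main query, e.g. a full-text term
        ("fl", "id,title"),    # return only the ID and title fields
        ("rows", str(rows)),   # maximum number of hits to return
        ("wt", "json"),        # ask for a JSON response
    ]
    if doc_type is not None:
        # Filter query: narrow the hits by an attribute value.
        params.append(("fq", f"doc_type:{doc_type}"))
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


url = build_solr_query("http://localhost:8983", "documents",
                       "content:invoice", doc_type="contract")
```

If the indexed ID is the same value as the Hadoop row key (an assumption here), each hit returned by Solr can be used to fetch the matching document straight from the repository.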
If you are storing scanned documents in Hadoop, you also need to consider how users will search for and find them. If scanned documents are dropped directly into Hadoop, there is no meaningful way to find them or mine them for data. TSG’s partnership with Adlib solves this problem by leveraging the OCR (Optical Character Recognition) capabilities of the Adlib PDF Conversion Software to read in the scanned image and produce a full-text searchable PDF document. When this OCRed PDF document is checked into a Solr/Lucene-enabled Hadoop repository, users can search for words and phrases inside the document.
OpenContent for Hadoop
TSG’s recent efforts to address all of the above issues include standardizing these best practices behind our OpenContent web services layer. Our OpenContent API abstracts all of this behavior behind a simple-to-use web service call for storing and retrieving documents in Hadoop. The scenario for adding a document to a Hadoop ECM looks like this:
- Put the document’s native content in Hadoop
- Request a PDF rendition of the native content by calling Adlib, or have Adlib OCR the image if it is a scanned document
- Store the PDF rendition produced by Adlib next to the native content in Hadoop
- Full-text index the PDF rendition in Solr/Lucene to allow for full-text and attribute searching
Once these documents are stored in Hadoop, users can easily search for them in the Solr/Lucene index.
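The four steps above can be sketched in a few lines of Python. This is not the OpenContent API and it does not call Adlib or Solr; the add_document function, the toy rendering and OCR functions, and the dictionary-based store and index are all hypothetical stand-ins that only show the order of operations.

```python
from typing import Callable, Dict, Set

Store = Dict[str, Dict[str, bytes]]


def add_document(row_key: str, native: bytes, is_scan: bool, store: Store,
                 render_pdf: Callable[[bytes], bytes],
                 ocr_to_pdf: Callable[[bytes], bytes],
                 index_pdf: Callable[[str, bytes], None]) -> None:
    # Step 1: put the native content in the repository.
    store.setdefault(row_key, {})["native"] = native
    # Step 2: request a PDF rendition, using OCR for scanned images.
    pdf = ocr_to_pdf(native) if is_scan else render_pdf(native)
    # Step 3: store the rendition next to the native content.
    store[row_key]["pdf"] = pdf
    # Step 4: full-text index the rendition.
    index_pdf(row_key, pdf)


# Toy stand-ins for the rendition service and the search index.
store: Store = {}
index: Dict[str, Set[str]] = {}


def fake_render(native: bytes) -> bytes:
    return b"PDF " + native


def fake_ocr(native: bytes) -> bytes:
    return b"PDF-OCR " + native


def fake_index(row_key: str, pdf: bytes) -> None:
    # Map each lowercase word to the set of row keys that contain it.
    for word in pdf.decode("utf-8", errors="ignore").lower().split():
        index.setdefault(word, set()).add(row_key)


add_document("doc-0001", b"quarterly report", False, store,
             fake_render, fake_ocr, fake_index)
add_document("scan-0001", b"signed invoice", True, store,
             fake_render, fake_ocr, fake_index)
```

Passing the rendition and indexing steps in as functions keeps the workflow itself independent of any particular product, which is the same idea as hiding them behind a single web service call.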
Adding Adlib PDF rendition capabilities alongside a Hadoop document system provides robust document transformation capabilities that enable better document searching and viewing. For additional information about Hadoop and ECM, see these related posts.
Please let us know how you are leveraging Hadoop for your ECM platform in the comments below.