Technology Services Group

Hadoop for Enterprise Content Management – Adding PDF Renditions with Adlib

March 10, 2015

As we have discussed in our Hadoop Series, more and more companies are considering Hadoop for storing and managing documents and files. Like our ECM clients, companies that store documents or scanned files in Hadoop want PDF renditions of those documents for easy viewing and other PDF capabilities. This post discusses how Adlib can be combined with Solr/Lucene behind TSG’s OpenContent layer to provide robust ECM capabilities for your Hadoop repository.

Hadoop – Storing Documents and Image Files

When using Hadoop to store documents, it is important to consider how the documents will be used. It is very easy to store the native Word/Excel document or image scan (typically TIFF) in a Hadoop repository and retrieve it by its Hadoop row key when needed. Some potential issues with this approach are:

  1. Users need to have Word/Excel or a TIFF image viewer installed on their PC in order to view the document.
  2. Mobile/tablet users don’t always have access to the applications required to view the document.
  3. Hadoop doesn’t have a very robust way to search for documents on anything except the unique ID of the document.

In an ECM system, a typical usage pattern we see is that 70% of users only need to view and print most documents, and PDF provides an easy way to do both quickly. To address issues #1 and #2 above, it is a best practice to store the native content (Word, Excel, AutoCAD, etc.) in the Hadoop repository alongside a PDF “rendition” of the document. Because the native content and the PDF rendition are stored together in Hadoop, either format can be retrieved quickly depending on the use case.
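As a rough illustration of storing both formats together, the sketch below models one row per document with a separate cell for each format. It is an in-memory stand-in, not Hadoop or OpenContent code: the class and function names and the `content:native` / `content:pdf` column names are assumptions made for this example.

```python
from typing import Dict, Optional


class InMemoryStore:
    """Hypothetical stand-in for a row-keyed repository (not a real Hadoop client)."""

    def __init__(self) -> None:
        # row key -> {column name -> content bytes}
        self._rows: Dict[str, Dict[str, bytes]] = {}

    def put(self, row_key: str, column: str, data: bytes) -> None:
        """Write one format of a document into its row."""
        self._rows.setdefault(row_key, {})[column] = data

    def get(self, row_key: str, column: str) -> Optional[bytes]:
        """Return the requested format, or None if the row or cell is missing."""
        return self._rows.get(row_key, {}).get(column)


# Assumed column names for the two formats kept in the same row.
NATIVE = "content:native"
PDF = "content:pdf"


def content_for(store: InMemoryStore, row_key: str, needs_native: bool) -> Optional[bytes]:
    """Pick the format by use case: authors get native, viewers get the PDF.

    Falls back to the native content when no rendition has been stored yet.
    """
    if needs_native:
        return store.get(row_key, NATIVE)
    pdf = store.get(row_key, PDF)
    return pdf if pdf is not None else store.get(row_key, NATIVE)


store = InMemoryStore()
store.put("doc-1", NATIVE, b"<word document bytes>")
store.put("doc-1", PDF, b"<pdf rendition bytes>")
viewer_copy = content_for(store, "doc-1", needs_native=False)
author_copy = content_for(store, "doc-1", needs_native=True)
```

Keeping both formats under the same row key means a single lookup by document ID can serve either the view/print majority or the smaller group that needs the native file.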

Another consideration when storing documents in Hadoop is how users will search for and retrieve them. As we have blogged about in the past, Hadoop doesn’t provide a very robust “search” feature set by itself. To address issue #3, you can run Apache Solr alongside your Hadoop repository to provide a searchable index. This is an ECM best practice that allows efficient and robust searching of your repository.

When storing scanned documents in Hadoop, being able to search for and find them deserves extra attention. If the scanned images are dropped directly into Hadoop, there is no meaningful way to find them or mine them for data. TSG’s partnership with Adlib solves this problem by using the OCR (Optical Character Recognition) capabilities of the Adlib PDF Conversion Software to read the scanned image and produce a full-text searchable PDF document. When this OCRed PDF is checked into a Solr/Lucene-enabled Hadoop repository, users can search for words and phrases inside the document.

OpenContent for Hadoop

TSG’s recent efforts to address all of the above issues include standardizing these best practices behind our OpenContent web services layer. The OpenContent API abstracts this behavior behind a simple-to-use web service call for storing and retrieving documents in Hadoop. The scenario for adding a document to a Hadoop ECM looks like this:

  • Put the document’s native content in Hadoop
  • Request a PDF rendition of the native content from Adlib, or, if it is a scanned document, have Adlib OCR the scanned image
  • Store the PDF rendition produced by Adlib next to the native content in Hadoop
  • Full-text index the PDF rendition in Solr/Lucene to allow for full-text and attribute searching

Once these documents are stored in Hadoop, users can easily search for them in the Solr/Lucene index.
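The four steps above can be sketched as a single function. This is a minimal in-memory illustration, not the OpenContent API: `ToyIndex` stands in for Solr/Lucene, a plain dictionary stands in for the Hadoop repository, and `fake_convert`, `fake_ocr` and `fake_extract_text` are hypothetical placeholders for the Adlib conversion, Adlib OCR and PDF text extraction that a real deployment would call.

```python
from typing import Callable, Dict, List, Set

PREFIX = b"%PDF-fake "  # marker used only by the placeholder functions below


class ToyIndex:
    """Stand-in for a Solr/Lucene index: maps lower-cased words to row keys."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = {}

    def add(self, row_key: str, text: str) -> None:
        for word in text.lower().split():
            self._postings.setdefault(word, set()).add(row_key)

    def search(self, word: str) -> List[str]:
        return sorted(self._postings.get(word.lower(), set()))


def fake_convert(native: bytes) -> bytes:
    """Placeholder for a PDF rendition request (hypothetical, not Adlib's API)."""
    return PREFIX + native


def fake_ocr(scan: bytes) -> bytes:
    """Placeholder for OCR of a scanned image into a searchable PDF."""
    return PREFIX + scan


def fake_extract_text(pdf: bytes) -> str:
    """Placeholder for pulling the text layer out of a PDF."""
    return pdf[len(PREFIX):].decode("utf-8")


def add_document(
    store: Dict[str, Dict[str, bytes]],
    index: ToyIndex,
    row_key: str,
    native: bytes,
    is_scan: bool,
    convert: Callable[[bytes], bytes] = fake_convert,
    ocr: Callable[[bytes], bytes] = fake_ocr,
    extract_text: Callable[[bytes], str] = fake_extract_text,
) -> None:
    # 1. Put the native content in the repository.
    store.setdefault(row_key, {})["native"] = native
    # 2. Request a PDF rendition, or OCR the image if it is a scan.
    pdf = ocr(native) if is_scan else convert(native)
    # 3. Store the rendition next to the native content.
    store[row_key]["pdf"] = pdf
    # 4. Full-text index the rendition so it can be found by its words.
    index.add(row_key, extract_text(pdf))


store: Dict[str, Dict[str, bytes]] = {}
index = ToyIndex()
add_document(store, index, "doc-1", b"claim form for water damage", is_scan=False)
add_document(store, index, "doc-2", b"scanned invoice for water heater", is_scan=True)
hits = index.search("water")
```

Passing the conversion, OCR and text-extraction steps in as callables keeps the flow itself independent of whichever rendition and indexing services are actually used.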

Summary

Adding Adlib PDF rendition capabilities alongside a Hadoop document store provides robust document transformation capabilities that enable better document searching and viewing. For additional information about Hadoop and ECM, see these related posts:

Hadoop for HPI/OpenContent Product Plans

Hadoop Data Model for ECM applications

Hadoop as a Content Store for an ECM Cache

Please let us know how you are leveraging Hadoop for your ECM platform in the comments below.
