
Technology Services Group

Hadoop Document Transformations Using Adlib


March 5, 2015

Continuing our series exploring the use of Hadoop for ECM: our years of ECM experience tell us that documents should be stored both in their native format and as a PDF rendition. Storing a PDF rendition gives consumers quick access to view the content, and allows the content to be watermarked and controlled so that consumers cannot alter the documents. This post explores TSG’s partnership with Adlib and how we are using Adlib’s PDF conversion suite to transform documents being stored in Hadoop.

We have had a few conversations with clients about using Hadoop as an ECM platform, and a few unique business scenarios have come up in which we could leverage the Adlib PDF conversion suite. Adlib PDF delivers an enterprise-class, document-to-PDF conversion framework that offers the highest-fidelity PDF rendering engine on the market, with accurate OCR capabilities.

The first scenario involves a large collection of scanned TIF and PDF files stored on a legacy system that makes it very difficult to find anything. There are over a million documents spread across many folders; they are not full-text searchable and are sitting around without adding any value to the business. By leveraging Adlib’s OCR capabilities we can:

  1. Read in the scanned TIF/PDF files from the legacy system
  2. Run the scanned files through Adlib’s OCR engine to extract the text from the document
  3. Migrate the documents into Hadoop using OpenMigrate
  4. Index the documents using Solr to allow for full-text searching on the contents of the documents

With the documents transformed by Adlib, stored in Hadoop, and indexed by Solr, the business can now perform meaningful analytics on them.
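The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `migrate_scanned_files`, `find_scanned_files`, and the `ocr`, `store`, and `index` callables are hypothetical names, not part of Adlib, OpenMigrate, Hadoop, or Solr. In a real deployment each callable would wrap the corresponding product's own interface.

```python
from pathlib import Path
from typing import Callable, List

# File extensions treated as scanned input (an assumption for this sketch).
SCANNED_SUFFIXES = {".tif", ".tiff", ".pdf"}


def find_scanned_files(root: Path) -> List[Path]:
    """Step 1: walk the legacy folder tree and collect the scanned files."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SCANNED_SUFFIXES
    )


def migrate_scanned_files(
    root: Path,
    ocr: Callable[[bytes], str],           # hypothetical stand-in for the OCR engine
    store: Callable[[str, bytes], None],   # hypothetical stand-in for the repository write
    index: Callable[[str, str], None],     # hypothetical stand-in for the search indexer
) -> int:
    """Run read -> OCR -> store -> index for every scanned file under root."""
    migrated = 0
    for path in find_scanned_files(root):
        content = path.read_bytes()
        text = ocr(content)                         # Step 2: extract the text
        doc_id = path.relative_to(root).as_posix()  # folder path doubles as the id
        store(doc_id, content)                      # Step 3: write to the repository
        index(doc_id, text)                         # Step 4: make the text searchable
        migrated += 1
    return migrated
```

Passing the three stages in as callables keeps the sketch independent of any vendor API and makes it easy to exercise the flow with in-memory stand-ins before pointing it at real systems.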

The second business scenario involves a topic we have previously blogged on: running your ECM against Hadoop instead of a traditional RDBMS. When storing documents in an ECM repository, best practice is to store the native content for authors and PDF renditions for consumers. Since TSG leverages our OpenContent web services layer for all interaction with Hadoop, we have integrated the Adlib transformation engine so that a PDF rendition is requested for every document when it is added to Hadoop. Adlib’s transformation suite offers best-in-class transformations of over 400 file types, so storing each document in Hadoop with a PDF rendition ensures that all users can view the documents in the repository, regardless of the native application used to author them. This consideration is easy to overlook when leveraging Hadoop as your document repository, since ECM vendors have been transforming to PDFs for years:

  • Users expect mobile access to documents, and PDF is the only format every device manufacturer can reliably display
  • On PCs, there is no guarantee that everybody is going to have the correct version of every piece of software on their machine just to be able to view the document
  • PDFs are the industry standard for long-term storage of documents
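The rendition-on-add behavior described above can be sketched as follows. This is a minimal in-memory illustration, not the OpenContent or Adlib API: `RenditioningRepository`, `StoredDocument`, the `to_pdf` callable, and the `"author"` role name are all hypothetical, and a dictionary stands in for Hadoop.

```python
from dataclasses import dataclass
from typing import Callable, Dict

PDF_MIME_TYPE = "application/pdf"


@dataclass
class StoredDocument:
    native: bytes          # what authors edit
    mime_type: str         # type of the native content
    pdf_rendition: bytes   # what consumers view


class RenditioningRepository:
    """Stores every document with a PDF rendition created at add time."""

    def __init__(self, to_pdf: Callable[[bytes, str], bytes]) -> None:
        # to_pdf is a hypothetical stand-in for the transformation engine.
        self._to_pdf = to_pdf
        self._docs: Dict[str, StoredDocument] = {}

    def add(self, doc_id: str, content: bytes, mime_type: str) -> None:
        # Native PDFs need no conversion; everything else is transformed.
        if mime_type == PDF_MIME_TYPE:
            rendition = content
        else:
            rendition = self._to_pdf(content, mime_type)
        self._docs[doc_id] = StoredDocument(content, mime_type, rendition)

    def content_for(self, doc_id: str, role: str) -> bytes:
        # Authors get the native file; every other role gets the PDF.
        doc = self._docs[doc_id]
        return doc.native if role == "author" else doc.pdf_rendition
```

Creating the rendition once at add time means a consumer request never depends on the native authoring application being available.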

The overall architecture for transforming everything to PDFs allows us to leverage Adlib’s transformation abilities to properly store any type of content in Hadoop.

[Figure: Hadoop with Adlib]

We see Adlib and Hadoop as a perfect marriage for ensuring that documents are properly stored in Hadoop. Let us know how you are leveraging Hadoop in the comments below.

Filed Under: Adlib, Hadoop, Migrations, OpenMigrate, Scanning Tagged With: Adlib, Hadoop
