Hadoop – Why Hadoop as a Content Store when Caching Content for ECM Consumers


January 19, 2015

Last week we posted on a publishing approach for enterprise search. Along with enterprise search, we have seen more and more ECM clients look to publish content out of the ECM repository for a variety of business reasons, including performance, business continuity, and cost reduction. This post will highlight how Hadoop can be used within a publishing architecture and explain some of the benefits.

ECM Publishing Infrastructure – Reasons and Justifications

As presented last week as well as in other posts, ECM customers look to publish from one or multiple ECM environments for a number of reasons, including:

  • Business Continuity – Processes that rely on documents managed by the ECM infrastructure can be interrupted if that system becomes unavailable. Publishing the content to redundant infrastructures for consumption avoids that interruption.
  • System Performance – Typically ECM systems carry overhead (ex: security checks, extra interface elements for authors) that can slow down retrieval. A published approach uses a simple consumer-only security model to improve retrieval performance, while removing expensive consumer queries (and the resulting performance drag) from the ECM repository used by authors and approvers.
  • User Training – ECM “do-it-all” interfaces can be confusing for the average user. A simplified consumer-only interface can typically be set up to require zero training.
  • Enterprise Search – As presented last week, content can be published from multiple sources to allow a true “enterprise” search experience across all systems.
  • User License and Maintenance – Additional consumers can easily be added to the published repository without adding and maintaining more users in the ECM authoring and approval repository. This has been helpful as companies grow and add employees, as well as when bringing on third-party consumers/contractors.

ECM Publishing Infrastructure – Components

ECM publishing infrastructure has four major components as depicted below:

[Diagram: OpenMigrate publishing architecture]

These components include:

  • Publishing Infrastructure – Pictured above as OpenMigrate. This component polls the ECM repository (or repositories) and publishes content once it reaches an approved state. The publishing job retrieves the document (typically only the PDF rendition) along with its metadata, posting the metadata to the index and the content to the content store (see the sketch after this list).
  • Index – The index maintains the information about the documents (metadata) as well as, potentially, the full-text index for the documents. We typically recommend Lucene/Solr for its performance and cost (open source).
  • Interface – The interface provides access to the index so users can identify documents as part of a search.
  • Content Store – The content store holds the documents themselves. Typically the content store is a mounted file system or SAN.
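
To make the publishing pass concrete, below is a minimal sketch in Java. It is not OpenMigrate itself: the EcmRepository interface and its methods are hypothetical stand-ins for whatever source repository is being polled, and the index calls use the standard SolrJ client against an assumed schema (id, content_path).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

/** Minimal publishing pass: poll, fetch rendition and metadata, store, index. */
public class PublishingJob {

    /** Hypothetical stand-in for the source ECM repository being polled. */
    interface EcmRepository {
        List<String> findNewlyApprovedDocumentIds();
        byte[] fetchPdfRendition(String docId);
        Map<String, String> fetchMetadata(String docId);
    }

    public static void publish(EcmRepository ecm, String solrUrl, Path contentStoreRoot) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(solrUrl).build()) {
            for (String docId : ecm.findNewlyApprovedDocumentIds()) {
                // 1. Write the PDF rendition to the consumer content store.
                Path target = contentStoreRoot.resolve(docId + ".pdf");
                Files.createDirectories(target.getParent());
                Files.write(target, ecm.fetchPdfRendition(docId));

                // 2. Post the metadata, plus a pointer to the content, to the index.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", docId);
                doc.addField("content_path", target.toString());
                ecm.fetchMetadata(docId).forEach(doc::addField);
                solr.add(doc);
            }
            solr.commit(); // make the newly published documents searchable
        }
    }
}
```

The key point is the dual write: the PDF rendition lands in the consumer content store while the metadata, plus a pointer to the content, lands in the index.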

Typically we will see components of the infrastructure replicated to multiple environments for quick access as well as business continuity. Typical scenarios include geographic splits (ex: North America, Europe, Asia) as well as business splits (Plant A, Plant B). Clients accomplish the redundancy either with multiple publishing jobs or by leveraging other replication capabilities.

Hadoop as a Content Store for Caching Consumers

Hadoop has some great features that make it a natural fit as the content store for the publishing infrastructure. These include:

  • Open Source – Like other components of the architecture, Hadoop is open source and does not require an additional purchase.
  • Hadoop Distributed File System (HDFS) – Hadoop is built on an architecture of replication, pushing content to redundant servers that can be geographically separated. Typically we want servers close to the physical location of the consumers to speed content retrieval. By serving each request from the closest replica, HDFS provides quicker access than maintaining a distributed SAN whose duplicate copies become difficult to maintain.
  • Reindexing Scenarios – Hadoop can store not just the content file but also the metadata in a redundant environment. Oftentimes clients will rebuild the index of the publishing repository to meet new taxonomy requirements. With most solutions, the publishing job (ex: OpenMigrate) would have to re-run against each separate source ECM repository to perform the reindex. With Hadoop, the reindex can be accomplished entirely within the publishing environment, with no need to access the source repositories (see the storage sketch after this list).
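
Below is a minimal sketch of both ideas against the standard Hadoop FileSystem API. The /publish layout, the replication factor of 3, and the .properties sidecar format are illustrative assumptions, not fixed conventions of our products; the FileSystem handle would come from FileSystem.get(...) pointed at the cluster's NameNode.

```java
import java.util.Map;
import java.util.Properties;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

/** Sketch: keep each rendition and its metadata side by side in HDFS. */
public class HdfsContentStore {

    /** Publish one document; HDFS replicates both files across data nodes. */
    public static void store(FileSystem fs, String docId,
                             byte[] pdfBytes, Map<String, String> metadata) throws Exception {
        // Replication factor 3: HDFS keeps copies on separate (possibly remote) nodes.
        try (FSDataOutputStream out = fs.create(new Path("/publish/" + docId + ".pdf"), (short) 3)) {
            out.write(pdfBytes);
        }
        // Metadata sidecar: the publishing environment now holds everything
        // needed to rebuild the index later.
        Properties props = new Properties();
        metadata.forEach(props::setProperty);
        try (FSDataOutputStream out = fs.create(new Path("/publish/" + docId + ".properties"), (short) 3)) {
            props.store(out, "published metadata for " + docId);
        }
    }

    /** Reindex: rebuild Solr from the sidecars alone, with no call to the source ECM. */
    public static void reindex(FileSystem fs, SolrClient solr) throws Exception {
        for (FileStatus status : fs.listStatus(new Path("/publish"))) {
            if (!status.getPath().getName().endsWith(".properties")) continue;
            Properties props = new Properties();
            try (FSDataInputStream in = fs.open(status.getPath())) {
                props.load(in);
            }
            SolrInputDocument doc = new SolrInputDocument();
            props.stringPropertyNames().forEach(k -> doc.addField(k, props.getProperty(k)));
            solr.add(doc);
        }
        solr.commit();
    }
}
```

Because HDFS replicates both files, the reindex is just a scan of the sidecars, which is exactly why the source repositories never need to be touched.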

Because of the ability to store metadata in Hadoop, some clients have asked whether Hadoop can replace the index/search component as well. We would recommend sticking with Lucene/Solr, as it provides both metadata and full-text capability as a tuned search appliance.
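
As an example of why, a consumer search that mixes a full-text phrase with metadata filters is only a few lines of SolrJ; the core URL and the doc_type, status, and content_path fields below are hypothetical:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

/** Why Solr stays as the search tier: one query mixes full text and metadata. */
public class ConsumerSearch {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/publish").build()) {
            SolrQuery query = new SolrQuery("\"standard operating procedure\""); // full-text phrase
            query.addFilterQuery("doc_type:policy");  // metadata filters (hypothetical fields)
            query.addFilterQuery("status:approved");
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("content_path"));
            }
        }
    }
}
```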

Summary

Hadoop can provide a more robust open source content store for ECM publishing infrastructures with added redundancy and performance as well as better support for reindexing requirements.  From a TSG perspective, we have added Hadoop support for this approach with:

  • Publishing – OpenMigrate supports all of the publishing and reindexing requirements.
  • Interface – HPI can store documents in and retrieve them from Hadoop as well as other ECM repositories.

If you have any thoughts, please add your comments below.

Filed Under: Alfresco, Documentum, Hadoop, OpenContent Management Suite, OpenMigrate, Product Suite Tagged With: ECM, Hadoop, HPI, OpenMigrate, portal


Comments

  1. shiv Kumar Napit says

    June 5, 2017 at 1:54 am

    Hi,

    Can someone please confirm whether any Alfresco customers/clients are using Hadoop as a storage option in their projects?

    Thanks in advance!

