Hadoop – Disrupting the Relational Database Component of ECM


February 24, 2015

We had a good conversation yesterday with a long-time and innovative TSG client. The client has a mix of technical and business skills that makes him a visionary on Enterprise Content Management in a highly regulated industry. In addition to our normal catch-up discussion about plans for the year and what we are seeing other clients do, we also talked about Hadoop and how it could disrupt traditional relational databases (RDBMS). This post presents highlights of that discussion from a business perspective.

Relational Databases – Built for Scarcity and Search

In the 70s and 80s, hierarchical, network and finally relational databases emerged. They were built for the mainframe systems of the time, when CPU, memory and disk space were expensive, so the database systems were designed to use these limited resources with maximum efficiency. That scarcity drove extensive effort around the following constraints:

  • only relevant data was stored
  • archived data might be pushed off to tape
  • the Database Administrator (DBA) was a critical role in the care and feeding of the database: adding indexes, normalizing the data, and generally making sure the data maintained in the database could be quickly found and processed

Disc, Memory and CPU – No longer Scarce

Since relational databases came of age, Moore’s Law has radically changed the scarcity of computing resources. Some notable items to consider in regard to the function of databases:

  • Inexpensive disk space – disk prices went from $100,000/GB (early 80s) to $0.20/GB (2013)
  • Google – introduced the concept that a farm of inexpensive computers could support efficient searching
  • Consumer computing – pushed for additional savings in the cost of memory, disk and CPU
  • Open source focus – companies like Google and Facebook build technologies to offer better services for their clients and release/enhance open source rather than turning to traditional software (or database) vendors

It makes much more sense today to store as much data as you can possibly produce, since there is value that can be squeezed out of every piece of data captured as part of a company’s business process. It is more expensive to throw data away and lose its potential value than to keep it around at 20 cents per GB. No longer do we have to be as concerned with normalizing the data and making sure every piece of data fits a well-defined schema when it is produced.

Hadoop versus Relational

This post won’t get into all of the different underlying technologies of Hadoop (MapReduce, NameNode, DataNode) but will instead focus on the use cases for ECM. For ECM customers, let’s examine the traditional story of storing a document with attributes.

  • Relational Database – would store the attributes in the columns/rows of the relational database, with a pointer to the document file’s location on a SAN.
  • Hadoop – would store the attributes in a Hadoop entry, with tags/metadata that describe the document alongside the document content itself. There is no need for a SAN, as Hadoop provides its own distributed data store (see the sketch below).
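
As a minimal sketch of the Hadoop side, here is one plausible layout using HBase, a datastore built on Hadoop, with attributes and content kept together under one row key. The table name (“documents”), the column families (“meta”, “file”) and the field names are illustrative assumptions, not a prescribed schema:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.nio.file.Files;
import java.nio.file.Paths;

public class StoreDocument {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table documents = conn.getTable(TableName.valueOf("documents"))) {

            Put put = new Put(Bytes.toBytes("doc-00001")); // row key for this document

            // Attributes live right next to the content -- no separate RDBMS schema
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("title"), Bytes.toBytes("SOP-1234"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("doc_type"), Bytes.toBytes("SOP"));

            // The content itself is stored in the same row -- no pointer out to a SAN
            byte[] content = Files.readAllBytes(Paths.get("SOP-1234.pdf"));
            put.addColumn(Bytes.toBytes("file"), Bytes.toBytes("content"), content);

            documents.put(put);
        }
    }
}
```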

To illustrate the differences, let’s now compare what happens when searching for a document with the title “SOP-1234”.

  • Relational Database – the database could be queried to find the row where document name equals “SOP-1234”. Given a well-indexed database, this would be sub-second. Once on that row, it would be fairly easy to retrieve the file content from the SAN and related attributes from other tables.
  • Hadoop – Hadoop would have to rely on parallel processing to query ALL of the nodes looking for rows where title = “SOP-1234”, which is very inefficient (see the sketch below). Once the document is identified, all attributes can be quickly retrieved, as can the document content itself.
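
Continuing the hypothetical HBase layout above, here is a sketch of what that attribute query would look like. Because there is no index on meta:title, the filter must be evaluated against every row on every region server, the Hadoop equivalent of a full table scan:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FindByTitleScan {
    /** Returns the content of the first document whose meta:title matches, or null. */
    static byte[] findByTitle(Table documents, String title) throws IOException {
        Scan scan = new Scan();
        // No secondary index exists for meta:title, so HBase ships this filter
        // to every region server and tests it against every row.
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("meta"), Bytes.toBytes("title"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes(title)));

        try (ResultScanner results = documents.getScanner(scan)) {
            for (Result row : results) {
                // Once found, attributes and content come back from the same row
                return row.getValue(Bytes.toBytes("file"), Bytes.toBytes("content"));
            }
        }
        return null;
    }
}
```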

In the example above, it would appear that Hadoop, while a good fit for big data, doesn’t necessarily line up with the existing paradigm for search and retrieval in an ECM system.

Hadoop ECM Search – Search Appliance to the rescue

To truly understand how Hadoop can disrupt the relational database, particularly in an ECM scenario, let’s adjust the scenario to take advantage of another disrupting technology: the search appliance. For the bulk of ECM vendors, an appliance approach (Solr/Lucene) that provides indexing/searching of the content eliminates the need to query within the relational database.

Let’s review the same search scenario coupled with a search appliance. Searching for “SOP-1234”:

  • Relational Database with Search Appliance – Solr/Lucene would be used to perform a search for “SOP-1234” and would return a pointer to the row containing the attributes. Once on that row, it would be fairly easy to retrieve the document content from the SAN. Legacy ECM vendors have moved to this paradigm over the years to escape the performance issues of querying million-row tables with complex queries.
  • Hadoop with Search Appliance – Solr/Lucene would be used to perform a search for “SOP-1234” and would return a pointer to the entry in Hadoop, making it possible to quickly retrieve the document content from HDFS (see the sketch below).
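
As a sketch of the Hadoop-plus-appliance flow, assume each document’s title and HBase row key were indexed into a Solr core (the core name “documents” and the row_key field are assumptions). The Solr hit then turns retrieval into a single point read instead of a cluster-wide scan:

```java
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class FindByTitleSolr {
    /** Returns the content of the first matching document, or null. */
    static byte[] findByTitle(Table documents, String title) throws Exception {
        // 1. The search appliance answers "which document?" with a Hadoop pointer.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build();
        SolrQuery query = new SolrQuery("title:\"" + title + "\"");

        for (SolrDocument hit : solr.query(query).getResults()) {
            String rowKey = (String) hit.getFieldValue("row_key"); // hypothetical indexed field

            // 2. The pointer becomes a direct Get by row key -- a fast point
            // lookup on a single region server, no scan of the cluster.
            Result row = documents.get(new Get(Bytes.toBytes(rowKey)));
            return row.getValue(Bytes.toBytes("file"), Bytes.toBytes("content"));
        }
        return null;
    }
}
```

The division of labor is the point: Solr/Lucene answers “which document matches?”, and Hadoop only has to answer “fetch row X”, which it does very efficiently.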

In the case above, users wouldn’t notice the difference between an ECM tool built on Hadoop and one built on a traditional relational database.

Hadoop ECM – What else can it do?

In addition to being able to do whatever an RDBMS can do for legacy ECM vendors, Hadoop differentiates itself by providing the following:

  • Cost – removal of the RDBMS and the SAN for file storage
  • Unstructured – data can be dumped into Hadoop as it is captured, rather than worrying about designing a schema or making the data fit an existing schema (see the sketch after this list)
  • Unlimited – can store an unlimited number of data objects. This could include the audit trail (such as auditing every content view) or other big data items that tend to break RDBMS structures.
  • Backup – Hadoop includes replication/clustering “out of the box”, removing the need for a separate database backup.
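
To make the “Unstructured” bullet concrete, here is a short sketch continuing the hypothetical “documents” table from earlier: a never-before-seen attribute can be written to a single document with no ALTER TABLE, no migration and no up-front schema design; the column simply comes into existence on that row:

```java
// Continuing the earlier sketch ("documents" table, "meta" column family):
// add a brand-new attribute to one document. No schema change is required --
// the meta:reviewed_by column now exists only on this row.
Put put = new Put(Bytes.toBytes("doc-00001"));
put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("reviewed_by"), Bytes.toBytes("qa-team"));
documents.put(put);
```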

Summary

The combination of a search appliance with Hadoop has the capability of disrupting the RDBMS and SAN components of typical ECM systems. Other shifts in the technology landscape tell a similar story:

  • Commodity servers disrupting proprietary hardware vendors
  • Commodity storage disrupting proprietary SAN vendors
  • Linux disrupting UNIX
  • Apache and Tomcat disrupting proprietary application servers
  • Solr/Lucene disrupting Verity/FAST/Autonomy

Let us know your thoughts in the comments below.

Filed Under: ECM Landscape, Hadoop Tagged With: ECM, Hadoop


Comments

  1. Mike Pinter says

    February 24, 2015 at 8:40 am

Thank you for all of this information about Hadoop. It looks like a very important new direction for content management that I can see having a marked effect on technical direction going forward. Three things came to mind while I reviewed this article and your others that I feel are important to think about:

    1) Retention/Destruction – if the metadata for a document is stored as a blob rather than intertwined in a multitude of database tables, it will be easier to destroy a document without damage to traditional audit trails and tables.

    2) Movement of documents – See 1) above – it may be easier to migrate content to other systems if you can move the blob with it rather than having to extract data from tables and merge it with the content.

    3) Data backup – traditional backups were full and incremental backups of servers, tied with database backups to keep the data in sync with the documents. It took a marrying of multiple technologies to ensure that the RDBMS layer and the data files were backed up in sync and could be restored together. Hadoop has native, inherent replication capabilities, but it also offers large-scale enterprises a technology footprint that aligns well with the latest enterprise block-level real-time replication tools built directly into SANs.
