In doing our predictions post for 2015, one thing we have been discussing and targeted for further research is Hadoop and potential opportunities for Enterprise Content Management. This post will begin a series of posts on our thoughts and R&D activities to provide an OpenContent connection to Hadoop.
Background on Hadoop
Hadoop is an open-source software framework for distributed storage and distributed processing on clusters of commodity hardware. Two core components provide some interesting capabilities for content management:
- Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity hardware, providing very high aggregate bandwidth across the cluster.
- Hadoop MapReduce – a programming model for large-scale data processing.
All of the modules in Hadoop are designed with a “clustered” assumption that any individual machine or machines failure would be automatically managed by the framework. Apache Hadoop’s MapReduce and HDFS components originally derived respectively from Google’s MapReduce and Google File System (GFS) papers.
Hadoop has benefited by heavy investment both by services providers as well as corporations. From the Apache_Hadoop Wiki, Facebook and Yahoo have used Hadoop extensively for petabyte installations. The Wiki also references more than half of the Fortune 50 use Hadoop.
Hadoop Architecture – HBase Database
One of the potential benefits for clients, is a “not only SQL” approach, often referred to as NoSQL approach. HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data, that sits on top of Hadoop. Since the 1990’s, all ECM vendors have leveraged some type of database under the covers of their architecture to manage the metadata of documents. Metadata/attributes include title, file location, author, security and all other data associated with the document. Typically we see Oracle or MS SQLServer as well as Alfresco customers also picking MySQL. Hadoop with HBase, could be leveraged by ECM vendors or solutions to provide an alternative database. Some of the unique and modern features of HBase versus traditional relational databases include:
- Limited Database/Docbase Administration – Built for a “big data” approach, HBase focuses on an approach that allows the database to adapt as new data is presented rather than the traditional “call the DBA to add a column or an index”. Users should think of it as a “tagging” structure rather than a traditional relational database module with a strict schema. Tagging is something that inherently fits into a content management framework or understanding as we are always tagging documents with metadata. For an ECM example, if I am storing an invoice document, I can just send HBase all of the attributes as consistent column families and HBase will take care of storing all the descriptors and values. If at a later time, the next invoice has a new value not in the old invoices, the HBase can easily append the value to just those documents.
- Limited Back-up/Recovery – Also as a “big data” approach, Hadoop provides a scalable, redundant model that can be leveraged across multiple inexpensive servers with automatic redundancy/clustering. One of the big issues with our typical ECM relational database approach has always been coordinating the back-up of the relational database with the file-store. Hadoop provides the ability to remove that requirement as well as simplify setting up of a clustered environment.
Hadoop and HBase – What’s the Catch – Searching and Commit
Hadoop’s redundancy/clustering approach does come with several small warnings.
For traditional search, a query like “find me all the documents created last year by this author” would initiate a search across the cluster without indexed values and could take a long time. Since HBase and BigTable were architected to deal with billions of rows, searching/querying have to be done either by id, or offloaded to some external searching appliance.
Also, as a big data push, Hadoop does not provide a transaction or commit/rollback option as provided by legacy databases. From our work with ECM clients, it really doesn’t matter as this functionality is more associated with complex transactions involving separate updates to several related tables rather than simple document tagging.
HBase Searching – can we just use Solr/Lucene?
Retrieving metadata from HBase is slightly different from a relational database as it will farm the request out (again a big-data approach) to multiple servers and compile the results. Performance, and particularly search performance, has always been a key requirement by our ECM customers. In our initial testing, we had some concern about a federated search and performance of HBase.
Just like the ECM vendors Documentum and Alfresco, we would recommend leveraging Solr/Lucene as both the metadata and full-text search engine. Similar to how Documentum/Alfresco use a relational database to store attributes, HBase retrieval could be used for system of record requests (ex: What are the attributes of this document). Anytime a search against metadata is needed, Solr/Lucene would be used for searching for documents.
As mentioned above, Hadoop leverages a cluster/redundancy approach to duplicate those files across servers. This capability has the ability to replace another component of the ECM architecture (and one near and dear for EMC), the Storage Area Network or SAN.
Hadoop was written from the ground up to work on cheap commodity hardware (specifically low-cost high-capacity hard drives) that are expected to fail at some point. Hadoop Distributed File System (HDFS) framework is architected on the idea that content will be written to multiple places on the cluster, so that
if when a hard drive or server fails, the system remains completely available and access is uninterrupted since there are multiple copies of the data. There is no need for scheduled backups or outages to restore data in the event of a failure.
Look for more posts over the beginning on 2015 in regard to our TSG R&D efforts on Hadoop. Articles include:
- Hadoop as a Content Store when Caching Content for ECM Consumers
- Hadoop – Data Model for ECM applications
- Hadoop – OpenContent/HPI Product Plans
- TSG Announces creation of Hadoop Practice
If you have any thoughts, please post below.