When we talk to clients about Hadoop with HBase for ECM, we too often hear “but isn’t that just for big data?” This post explains the benefits of Hadoop’s big data capabilities and data model in an ECM context, compared to traditional database systems.
Hadoop – Big Data versus traditional Big Database
At its core, Hadoop provides a very robust, distributed data store that combines powerful parallel processing with flexible data storage. Understanding the difference between Hadoop and traditional databases requires understanding the constraints (and the era) in which each was created.
Relational databases became mainstream in the 1980s, when disks were slow and disk space and CPU time were at a premium. Built for critical business systems, relational databases focused on storing known data in a static data model. DBAs were extensively employed to update the model and add indexing and other performance improvements.
Hadoop is a newer approach, driven by huge gains in the economics of disk space and hardware and by new requirements around unstructured big data. Hadoop allows for very quick, distributed, parallel retrieval of a specific data file that can contain data in a variety of formats.
Hadoop – Schema-on-Read versus Schema-on-Write
One of the big differences between Hadoop and a traditional RDBMS is how data is organized in a schema. Traditional databases require Schema-on-Write, where the DB schema is very static and needs to be well-defined before the data is loaded. The process for Schema-on-Write requires:
- Analysis of data processes and requirements
- Data Modeling
If any of the requirements change, the process has to be repeated.
Schema-on-Read takes a less restrictive approach that allows raw, unprocessed data to be stored immediately. How the data is used is determined when the data is read. The table below summarizes the differences:
|Traditional Database (RDBMS)|Hadoop|
|New columns must be added (by a DBA) before new data can be added|New (Big) data can start flowing any time|
Schema-on-Read provides a unique benefit for Big Data in that data can be written to Hadoop without having to know exactly how it will be retrieved.
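To make the contrast concrete, here is a minimal sketch rather than a real Hadoop setup. It assumes Python's built-in `sqlite3` as a stand-in for a traditional RDBMS and a plain list of JSON lines as a stand-in for raw storage; the `documents` table, the `review_date` attribute, and the sample values are all hypothetical.

```python
import json
import sqlite3

# Schema-on-Write: the table layout must exist before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO documents (id, title) VALUES (?, ?)", (1, "Contract A"))


def insert_with_review_date(connection):
    """Try to store an attribute the schema may not know about yet."""
    try:
        connection.execute(
            "INSERT INTO documents (id, title, review_date) VALUES (?, ?, ?)",
            (2, "Contract B", "2015-06-30"),
        )
        return True
    except sqlite3.OperationalError:
        # The column is missing, so the insert is rejected.
        return False


rejected = not insert_with_review_date(conn)

# The schema has to change (the DBA step) before the new data can land.
conn.execute("ALTER TABLE documents ADD COLUMN review_date TEXT")
accepted = insert_with_review_date(conn)

# Schema-on-Read: raw records are stored as-is, one JSON document per entry.
raw_store = [
    json.dumps({"id": 1, "title": "Contract A"}),
    json.dumps({"id": 2, "title": "Contract B", "review_date": "2015-06-30"}),
]


def read_review_dates(lines):
    """Apply a structure at read time; records without the field are skipped."""
    parsed = (json.loads(line) for line in lines)
    return {rec["id"]: rec["review_date"] for rec in parsed if "review_date" in rec}


review_dates = read_review_dates(raw_store)
```

The relational insert only succeeds after the `ALTER TABLE`, while the raw store accepts the new attribute immediately and leaves the interpretation to whoever reads it.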
Hadoop – what it means for Big Data
In a Big Data world, data needs to be captured without first knowing the structure that will hold it. As is often mentioned, tools like Hadoop can be used for consumer/social sites that need to store a huge amount of unstructured data quickly, with the consumption of that data coming at a later time.
As a typical Big Data example, a social site stores all of the different links clicked on by a user. The storing application might store date and time and other data in one HDFS file for the user that is updated each time the user returns. Given the user, a retrieval application can quickly access a large, semi-structured file on that particular user’s activity over time.
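The click-tracking example could be sketched roughly as follows. This is an illustration only: a local temporary directory stands in for HDFS, and the function names (`record_click`, `read_activity`), the one-file-per-user layout, and the event fields are our assumptions, not a Hadoop API.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical stand-in for an HDFS directory; a real system would write to HDFS.
STORE = Path(tempfile.mkdtemp())


def record_click(user_id, url, **extra):
    """Append one click event to the user's activity file without a fixed schema."""
    event = {"ts": datetime.now(timezone.utc).isoformat(), "url": url, **extra}
    with open(STORE / f"{user_id}.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def read_activity(user_id):
    """Given the user, load that user's whole activity history in one read."""
    path = STORE / f"{user_id}.jsonl"
    if not path.exists():
        return []
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh]


record_click("user42", "/products/1")
record_click("user42", "/products/7", referrer="newsletter")  # new field, no schema change
record_click("user99", "/home")

history = read_activity("user42")
```

Retrieval is keyed entirely on the user: the reader goes straight to one file and decides afterwards which fields of each event it cares about.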
Hadoop for ECM – It’s not about Search
Schema-on-Write works very well for the “known unknown”, or what we would typically call document search in ECM, something that Schema-on-Read does not handle as well. To illustrate, let’s take a typical ECM requirement: search for all documents requiring review by a given date.
- In an RDBMS example, the date column would be combined in a query with the “required review” column to quickly provide a list of all the documents that need to be reviewed. If the performance is not acceptable, indexes could be added to the database.
- In the Hadoop example, ALL of the documents in the repository would first need to be retrieved and opened. Once each is opened, the required review and date data would be read to build the list. There are no indexes that could speed up performance.
After this example, a typical ECM architect might conclude that Hadoop doesn’t really fit this basic requirement. What this opinion doesn’t take into account is the emergence of the “Search Appliance”, and particularly Lucene/Solr, as the default indexing/searching engine for ECM.
For better-performing search in BOTH the RDBMS and Hadoop implementations, we would recommend leveraging Lucene/Solr to index the document’s full text AND metadata. As we mentioned in the Enterprise Search article, most ECM vendors (Documentum and Alfresco as examples) now leverage Lucene/Solr for all search.
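The two bullets above can be sketched side by side. This is a simplified illustration under our own assumptions: `sqlite3` stands in for the RDBMS, a Python list of dictionaries stands in for the documents stored in Hadoop, and the field names (`requires_review`, `review_date`) and sample documents are hypothetical.

```python
import sqlite3

docs = [
    {"id": "doc-1", "requires_review": True, "review_date": "2015-03-01"},
    {"id": "doc-2", "requires_review": False},
    {"id": "doc-3", "requires_review": True, "review_date": "2015-09-15"},
]

# RDBMS side: the attributes are columns, so the database can use an index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id TEXT PRIMARY KEY, requires_review INTEGER, review_date TEXT)"
)
conn.execute("CREATE INDEX idx_review ON documents (requires_review, review_date)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(d["id"], int(d["requires_review"]), d.get("review_date")) for d in docs],
)


def due_for_review_sql(cutoff):
    """Let the database filter on the two columns."""
    rows = conn.execute(
        "SELECT id FROM documents "
        "WHERE requires_review = 1 AND review_date <= ? ORDER BY id",
        (cutoff,),
    )
    return [row[0] for row in rows]


def due_for_review_scan(repository, cutoff):
    """Without an index, every stored document is opened and inspected."""
    opened = 0
    matches = []
    for doc in repository:
        opened += 1
        # ISO dates compare correctly as strings; a missing date never matches.
        if doc.get("requires_review") and doc.get("review_date", "9999-12-31") <= cutoff:
            matches.append(doc["id"])
    return sorted(matches), opened


sql_result = due_for_review_sql("2015-06-30")
scan_result, documents_opened = due_for_review_scan(docs, "2015-06-30")
```

Both approaches return the same list, but the scan touches every document in the repository, which is the cost that grows with repository size.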
Hadoop – Big Data for ECM – Schema-on-Read Example
Given that search performance would be identical once both approaches rely on Lucene/Solr, the major advantage of Schema-on-Read is the ability to easily store all of the document’s metadata without having to define columns or worry about database sizing. Some quick examples would include:
- Versioning – One difficulty with most ECM tools is that, given a structured Schema-on-Write DB model, the ability to store different attributes on different versions is not always available or requires a work-around. With Hadoop, each version would have its own data, and different attributes could be stored with different documents.
- Audit Trail – We often see this one prove difficult to implement, with a single “audit” database table that ends up getting huge (and is itself a big data application). With Hadoop, the audit trail could be maintained as a single entry in that document’s row and quickly be retrieved and parsed.
- Content/Renditions – By storing the content file and its renditions together in Hadoop, we can store and retrieve all of the data for that row quickly.
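As a rough sketch of the three examples above, the snippet below keeps versions, audit entries, and renditions in a single row per document. It uses an in-memory dictionary as a hypothetical stand-in for a wide-column store; the column naming scheme (`v1:`, `audit:`, `rendition:`) and the helper functions are our own illustration, not an HBase API.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a wide-column store: row key -> {column: value}.
repository = defaultdict(dict)


def add_version(doc_id, version, attributes):
    """Store a version's attributes under version-qualified columns."""
    for name, value in attributes.items():
        repository[doc_id][f"v{version}:{name}"] = value


def add_audit_entry(doc_id, sequence, action, user):
    """Keep audit events in the document's own row rather than in one global table."""
    repository[doc_id][f"audit:{sequence:06d}"] = f"{action} by {user}"


def add_rendition(doc_id, fmt, content):
    """Store a rendition next to the document's other data."""
    repository[doc_id][f"rendition:{fmt}"] = content


def columns_with_prefix(doc_id, prefix):
    """Read one row and pick out the columns that share a prefix, in column order."""
    row = repository[doc_id]
    return {k: v for k, v in sorted(row.items()) if k.startswith(prefix)}


add_version("doc-1", 1, {"title": "Draft SOP"})
add_version("doc-1", 2, {"title": "Approved SOP", "approver": "jsmith"})  # extra attribute on v2 only
add_audit_entry("doc-1", 1, "created", "adoe")
add_audit_entry("doc-1", 2, "approved", "jsmith")
add_rendition("doc-1", "pdf", b"%PDF-1.4 ...")

audit_trail = list(columns_with_prefix("doc-1", "audit:").values())
v1 = columns_with_prefix("doc-1", "v1:")
v2 = columns_with_prefix("doc-1", "v2:")
```

Version 2 carries an `approver` attribute that version 1 never had, and the audit trail for the document comes back from a single row read instead of a query against one very large shared table.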
Hadoop – Building an HBase Document Model
- objects table contains a single row for each document including:
This is a rough schema of common data that could exist when a document is added, but since HBase is schemaless, we are not bound by the picture above. Every document can add columns “on the fly”, which means no database schema updates are needed when new metadata is to be captured. One benefit shows up in a corporate acquisition: when documents from the acquired company need to be entered into the system quickly, the metadata can be dumped into the HBase repository as-is, without having to worry about mapping individual fields to the existing schema.
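A small sketch of the “columns on the fly” idea, including the acquisition scenario, might look like the following. The `ObjectsTable` class is a hypothetical in-memory stand-in and not the HBase client API; the `meta:` prefix mimics HBase's family:qualifier column naming, where the column family is fixed when the table is created but qualifiers can be added freely. All field names and values are invented for illustration.

```python
class ObjectsTable:
    """Hypothetical in-memory stand-in for an HBase-style 'objects' table."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, columns):
        """Merge columns into a row; unknown column names are simply created."""
        self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        """Return a copy of one row, or an empty dict if the row does not exist."""
        return dict(self._rows.get(row_key, {}))

    def column_names(self):
        """All column names seen so far, across every row."""
        return sorted({name for row in self._rows.values() for name in row})


objects = ObjectsTable()

# A document created in the existing system.
objects.put("doc-1001", {"meta:title": "Batch Record", "meta:owner": "qa-team"})

# Metadata exported from an acquired company's system, kept under its original field names.
legacy_export = {"DocName": "Supplier Audit", "Dept_Code": "PUR-7", "RetainUntil": "2020-01-01"}
objects.put("doc-2001", {f"meta:{field}": value for field, value in legacy_export.items()})

# Later, one more attribute is captured for a single document; no schema update is involved.
objects.put("doc-1001", {"meta:review_date": "2015-06-30"})
```

The two documents end up with completely different sets of columns in the same table, and any mapping of the legacy field names onto the existing ones can be deferred until someone actually needs it.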
Hadoop/HBase has an advantage over traditional RDBMS systems in that data can be stored and retrieved quickly in an unstructured file without requiring extensive analysis of how that data will typically be retrieved. The drawback of a Hadoop/HBase-only approach is that searches on any attribute other than the “documentId” do not perform well at scale. We would advise that Hadoop/HBase implementations utilize Lucene/Solr as the query component of an ECM solution to allow performant searches against the repository.
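The division of labor we are recommending can be sketched as follows: the search index answers “which documentIds match?”, and the repository is then read only by key. Both parts are hypothetical in-memory stand-ins (a dictionary for the HBase table, a simple inverted index for Lucene/Solr), and the field names and documents are invented.

```python
from collections import defaultdict

# Hypothetical stand-ins: `repository` plays the HBase table, `index` plays the search engine.
repository = {}
index = defaultdict(set)  # (field, value) -> set of documentIds


def store_document(document_id, metadata):
    """Write the row, then index each attribute so it can be searched later."""
    repository[document_id] = metadata
    for field, value in metadata.items():
        index[(field, value)].add(document_id)


def search(field, value):
    """Resolve the query in the index, then fetch only the matching rows by key."""
    document_ids = sorted(index.get((field, value), set()))
    return [(doc_id, repository[doc_id]) for doc_id in document_ids]


store_document("doc-1", {"status": "review", "owner": "adoe"})
store_document("doc-2", {"status": "approved", "owner": "adoe"})
store_document("doc-3", {"status": "review", "owner": "jsmith"})

in_review = [doc_id for doc_id, _ in search("status", "review")]
```

The repository is never scanned: every read against it is a lookup by documentId, which is the access pattern it handles well at scale.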
Stay tuned to the blog for more developments on using Hadoop/HBase as an ECM repository. Please leave your comments below.