DynamoDB and Hadoop – Why Big Data will disrupt Document Management

Back in the 80’s, the emergence and low cost of PCs, networks and relational databases combined to create the beginnings of modern image management systems. Solutions and approaches gradually evolved to include documents and the creation of the ECM (enterprise content management/document management) industry. Back in 2015, TSG began brainstorming about what’s next for the document management industry. In addition to cloud computing already disrupting the data center for ECM applications, we would argue that the emergence of big data repositories will eventually disrupt the traditional relational database component of any document management solution. This post will present a brief history of disruption within the document management industry along with the rationale for why big data repositories will be the next disruption.

Download our white paper A Big Data Approach to ECM below:

Document Management – A History of Disruption

While there were some mainframe or DEC-driven systems, mainstream document management truly began with the success of FileNet and other image management systems. Early image management systems were built on heavily customized PCs, with image processing and network boards and monitors for scanning and viewing images, which at the time was a fairly expensive solution. Key back-end components included a relational database (FileNet was built on Oracle) running on expensive UNIX solutions from HP, Sun, IBM and others to manage all of the document properties and links to the location of the image files.

Typically, technology disruption involves a better approach, a cheaper approach, or both. Within the Document Management industry, key disruptions have included:

As PC processing power, memory and network capabilities improved, the processing boards and specialty monitors were gradually displaced by commodity components as part of the PC revolution of the 80s and 90s.
With the rise of the internet in the late 90s, client-server applications were replaced with browser-based systems that were more flexible and easier to support.
In the 2000’s, the evolution of Linux as well as Microsoft as cost-effective server options displaced the formerly expensive UNIX components from Sun, HP, IBM and others.
Open Source movements drove alternatives like Lucene/Solr/Elasticsearch to replace former search options from Verity and others in the 2000’s.
Cloud options like Amazon began to replace on-premise systems around 2005 and continue to disrupt the traditional data center today.

In 2015, TSG first identified the relational database component of traditional document management systems as one of the components that has not really changed in 20+ years. We saw the current trend around “Big Data” solutions as a way to offer better, more cost-effective alternatives. In looking toward the future, we saw “Big Data” solutions better positioned to take advantage of other disruptions including open source, cloud, and the investment by large consumer solutions (Amazon, Google, Netflix…) that are driving disruption across all of the IT industry. In 2015 we began offering Hadoop as an ECM alternative. In 2019 we will be offering Amazon DynamoDB as an option as well for Amazon Web Services clients.

Relational Databases – Built for Scarcity and Search

In the 70’s and 80’s, Hierarchical, Network and finally Relational databases all emerged. They were built for the capabilities of systems of the time: CPU, memory and disk space were expensive, and the database systems were built for maximum efficiency with these limited resources. Extensive effort went into working within the following constraints:

Only relevant data was stored
Archived data might be pushed off to tape
The Database Administrator (DBA) was a critical role in making sure the database was cared for and performed well.
Adding indexes
Normalizing the data
Technology and approaches to make sure that the data maintained in the database could be quickly found and processed.

Since relational databases came of age, Moore’s Law has radically changed the scarcity of computing resources. Some notable items to consider in regard to the function of databases:

Inexpensive disk space – Disk prices went from $100,000/GB (early 80’s) to $0.20/GB (2013)
Google – introduced the concept that a farm of inexpensive computers could support efficient searching
Consumer computing – pushed for additional cost savings in memory, disk and CPU
Open Source Focus – companies like Google and Facebook are building technologies to offer better services for their clients and releasing/enhancing open source rather than turning to traditional software (or database) vendors

NoSQL versus Relational for Document Management

One of the big advances for big data is the introduction of NoSQL (Not Only SQL, where SQL stands for Structured Query Language) as a data storage and retrieval approach. Developed as an alternative to relational databases, benefits of this approach include:

Simplicity of design
Simpler “horizontal” scaling to clusters of machines (which is a problem for relational databases), and finer control over availability.
The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL.
Sometimes the data structures used by NoSQL databases are also viewed as more flexible than relational database tables.

Specifically for Document Management customers, there is a simple difference between the two approaches.

Relational Database – Would store the attributes in columns/rows of the relational database with a pointer to the document file location in a SAN or object store.
NoSQL – Would store the attributes in an entry with tags/metadata that describe the document, possibly along with the document content itself, either in the repository or in a SAN or object store. Tags can be XML, JSON or a variety of other alternatives.
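
To make the difference concrete, here is a minimal sketch of the two storage styles. It uses Python's built-in sqlite3 module as a stand-in for a relational database and a plain JSON entry as a stand-in for a NoSQL record. The table name, attribute names and file path are all hypothetical and chosen only for illustration.

```python
import json
import sqlite3

# Relational style: attributes live in fixed columns, plus a pointer to the file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, title TEXT, author TEXT, file_path TEXT)"
)
conn.execute(
    "INSERT INTO documents (id, title, author, file_path) VALUES (?, ?, ?, ?)",
    (1, "Claim Form", "jsmith", "/store/2019/claim-form.pdf"),
)
relational_row = conn.execute(
    "SELECT title, author, file_path FROM documents WHERE id = 1"
).fetchone()

# NoSQL style: one self-describing entry that carries its own tags/metadata.
nosql_entry = {
    "id": "doc-1",
    "title": "Claim Form",
    "author": "jsmith",
    "tags": ["claims", "2019"],
    "content_location": "/store/2019/claim-form.pdf",
}

# The entry can be serialized (here as JSON) and read back unchanged.
serialized = json.dumps(nosql_entry)
restored = json.loads(serialized)
```

In the relational case the shape of the data is owned by the table definition. In the NoSQL case the shape travels with each entry.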

Schema-on-Read versus Schema-on-Write

One of the other big differences between Big Data NoSQL approaches and a traditional RDBMS is how data is organized in a schema. Traditional databases require Schema-on-Write, where the DB schema is very static and needs to be well-defined before the data is loaded. The process for Schema-on-Write requires:

Analysis of data processes and requirements
Data Modeling
Loading/Testing

If any of the requirements change, the process has to be repeated. Schema-on-Read takes a less restrictive approach that allows raw, unprocessed data to be stored immediately. How the data is used is determined when the data is read. The table below summarizes the differences:

Traditional RDBMS	NoSQL
Create static DB schema	Copy data in native format
Transform data into RDBMS	Create schema and parser
Query data in RDBMS format	Query data in native format
New columns must be added (DBA) before new data can be added	New (Big) data can start flowing any time
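
The contrast in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sqlite3 stands in for a traditional relational database, a list of JSON strings stands in for a NoSQL store, and the `department` attribute and helper names are hypothetical.

```python
import json
import sqlite3

# Schema-on-Write: the table must know about every column before data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")


def insert_with_department(connection):
    """Try to store an attribute ('department') the schema may not know yet."""
    connection.execute(
        "INSERT INTO documents (id, title, department) VALUES (?, ?, ?)",
        (1, "Policy", "Legal"),
    )


try:
    insert_with_department(conn)
    write_failed = False
except sqlite3.OperationalError:
    # The write is rejected because the column does not exist.
    write_failed = True

# The schema has to change first; only then does the same write succeed.
conn.execute("ALTER TABLE documents ADD COLUMN department TEXT")
insert_with_department(conn)
row = conn.execute("SELECT department FROM documents WHERE id = 1").fetchone()

# Schema-on-Read: store the raw entries as they come, decide the shape later.
raw_store = [
    json.dumps({"id": 1, "title": "Policy"}),
    json.dumps({"id": 2, "title": "Memo", "department": "Legal"}),
]


def read_departments(lines):
    """Apply the 'schema' at read time, tolerating entries without the field."""
    return [json.loads(line).get("department", "unknown") for line in lines]


departments = read_departments(raw_store)
```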

Schema-on-Read provides a unique benefit in that data can be written without having to know exactly how it will be retrieved. The major advantage of Schema-on-Read is the ability to easily store all of a document’s metadata without having to define columns and worry about database sizing. Some quick examples would include:

Versioning – With most ECM tools built on a structured Schema-on-Write DB model, storing different attributes on different versions after a schema change (such as adding a new attribute) is not always possible or requires a work-around. With NoSQL, each version would have its own data, and different attributes/schemas could be stored with different documents.
Audit Trail – We often see this one as difficult to implement, with one “audit” database table that ends up getting huge (and is itself a big data application). With NoSQL, the audit trail could be maintained as a single entry for that given document row and quickly be retrieved and parsed.
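
Both examples can be sketched with a single in-memory entry per document. The sketch below is an assumption about how such an entry might be shaped, not a description of any particular product. The field names (`versions`, `audit_trail`, `approver`) and helper functions are hypothetical.

```python
from datetime import datetime, timezone


def new_document(doc_id):
    """Create one entry that holds every version and the whole audit trail."""
    return {"id": doc_id, "versions": [], "audit_trail": []}


def add_version(document, attributes, user):
    """Append a version with whatever attributes it has and log the event."""
    version_number = len(document["versions"]) + 1
    document["versions"].append({"version": version_number, **attributes})
    document["audit_trail"].append(
        {
            "event": "version_added",
            "version": version_number,
            "user": user,
            "at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return version_number


doc = new_document("doc-42")
add_version(doc, {"title": "SOP Draft"}, user="alice")
# The second version carries an attribute the first one never had.
add_version(doc, {"title": "SOP Final", "approver": "bob"}, user="alice")
```

Reading the audit trail for a document is then a matter of fetching one entry rather than filtering a very large shared table.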

Schema Differences

Below is a typical schema for documents in a relational database model. In this case, the model is from Documentum.

NoSQL models can be much simpler as all of the data is stored in an object as depicted below.

This is a rough schema of common data that could exist when a document is added, but since NoSQL is schema-less, the application is not bound by the picture above. Every document can add columns “on the fly”, which means no database schema updates when new metadata is to be captured. One benefit of this is in the case of a corporate acquisition, when new documents need to quickly be entered into the system from the old company, the metadata can be dumped into the repository as-is, without having to worry about mapping individual fields to the existing schema.
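
As a rough sketch of the acquisition scenario, the snippet below ingests records from two sources without mapping fields up front, and resolves naming differences only when the data is read. The repository is just a Python list, and the field names (`title`, `doc_title`, `region`) are hypothetical examples of what two companies might have used.

```python
repository = []


def ingest_as_is(records, source):
    """Store each record with its original field names; no mapping step."""
    for record in records:
        repository.append({"source": source, **record})


# Field names that different sources may have used for the same idea.
TITLE_ALIASES = ("title", "doc_title")


def title_of(entry):
    """Resolve the title at read time, whichever alias the entry uses."""
    for name in TITLE_ALIASES:
        if name in entry:
            return entry[name]
    return None


ingest_as_is([{"title": "Master Agreement", "owner": "legal"}], source="existing")
ingest_as_is([{"doc_title": "Supplier Contract", "region": "EMEA"}], source="acquired")

titles = [title_of(entry) for entry in repository]
```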

Clustering Differences

One of the major expenses of managing a highly available document management system based on a relational database is setting up the cluster for the database beneath the document management software. Document management grew up in an environment where customers had typically already selected their RDBMS (Oracle, SQL Server, MySQL), forcing the ECM vendors to support a variety of different repositories and all of their associated differences in clustering. Clients have to deal with not only the cost of the repository but also the intricacies of managing a clustered solution and the associated costs.

Due to the simplicity of design as well as the schema-less object model, NoSQL with either Hadoop or DynamoDB provides better support of horizontal scaling and clustering. Simply writing an object and having that object replicated to the other servers within the cluster is easier than updating all of the related tables required to replicate an object in a relational database.
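
The idea of replicating one self-contained object can be illustrated with a toy in-memory cluster. This is only a conceptual sketch: the `Node` class and `write_replicated` function are hypothetical, and real platforms handle replication internally with far more care around failures and consistency.

```python
class Node:
    """A toy server that keeps whole objects in a key/value map."""

    def __init__(self, name):
        self.name = name
        self.objects = {}


def write_replicated(nodes, key, obj):
    """Write one object to every node and return the number of copies made."""
    for node in nodes:
        # Each node gets its own copy, so nodes do not share state.
        node.objects[key] = dict(obj)
    return len(nodes)


cluster = [Node("a"), Node("b"), Node("c")]
copies = write_replicated(
    cluster, "doc-7", {"title": "Invoice", "status": "approved"}
)
```

Because the object carries all of its own attributes, there are no related tables that need to be kept in step on each server.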

NoSQL solutions like Hadoop and DynamoDB from AWS both come clustered “out of the box”, with no need for expensive DBAs and other support costs.

Why a Schema-less, Big Data approach will disrupt Legacy Document Management

Many large public and private organizations run large on-premise document management systems. In many cases these organizations store hundreds of millions and in some cases billions of files in first generation ECM (enterprise content management) systems. In almost every instance, these systems were installed a decade or more ago and at the time were the best option available. Today they are expensive, outdated and difficult to manage, yet these organizations continue to run them as they see no clear or easily available alternative.

The combination of the desire to move to the cloud along with a need to capture cost savings will push clients to look for alternatives to expensive legacy document management/ECM solutions, both for existing repositories and new repository efforts. As organizations evolve their database choices away from traditional RDBMS vendors, we would predict that Big Data approaches like Hadoop or DynamoDB will present an alternative to expensive and difficult-to-maintain clustered RDBMS solutions, along with a technology set that can be used elsewhere in the organization.

How will Big Data disrupt Legacy Document Management?

We would predict that Big Data approaches will disrupt traditional document management in a variety of ways including:

Build your own – We would predict that some organizations would attempt to build their own approaches with a combination of cloud object stores (S3 from AWS) and Hadoop or DynamoDB. Just like early image and document management systems from the 1990’s, these custom efforts will include a database with an object store pointer and a custom interface or API.
Vendor Approaches – Some of the traditional ECM vendors will embrace Hadoop or DynamoDB as an alternative to their traditionally supported list (Oracle, SQL Server…). We would predict that these vendors would rely on the ability of NoSQL to process SQL and would implement it in a more traditional relational approach.
New Vendors – We would predict that new vendors will emerge that don’t have the legacy relational DBMS install base to disrupt traditional vendors.
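
The “build your own” pattern described above (a database entry holding an object store pointer, behind a custom API) can be sketched as follows. Everything here is a hypothetical in-memory stand-in: `ObjectStore` plays the role of a cloud object store, `MetadataStore` plays the role of the NoSQL database, and `DocumentApi` is an invented name for the custom interface. A real implementation would call the actual cloud services instead.

```python
import hashlib


class ObjectStore:
    """In-memory stand-in for a cloud object store."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


class MetadataStore:
    """In-memory stand-in for the NoSQL database holding document entries."""

    def __init__(self):
        self._items = {}

    def put_item(self, item):
        self._items[item["id"]] = item

    def get_item(self, doc_id):
        return self._items[doc_id]


class DocumentApi:
    """Custom API tying a metadata entry to its content via a pointer."""

    def __init__(self, objects, metadata):
        self.objects = objects
        self.metadata = metadata

    def add_document(self, doc_id, content, attributes):
        # Derive the object key from the content so it is stable and unique.
        object_key = "documents/" + hashlib.sha256(content).hexdigest()
        self.objects.put(object_key, content)
        self.metadata.put_item({"id": doc_id, "object_key": object_key, **attributes})
        return object_key

    def get_document(self, doc_id):
        item = self.metadata.get_item(doc_id)
        return item, self.objects.get(item["object_key"])


api = DocumentApi(ObjectStore(), MetadataStore())
stored_key = api.add_document("d1", b"hello", {"title": "Welcome Letter"})
item, content = api.get_document("d1")
```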

TSG’s Approach

In 2015, TSG began introducing Hadoop as both an RDBMS and ECM vendor alternative to clients looking for an alternative to legacy ECM vendors and RDBMS approaches. TSG has successfully implemented Hadoop as an ECM solution for clients in a variety of industries including health insurance, pharmaceutical and others. In 2019 we will begin offering DynamoDB as an Amazon-based alternative. Highlights of the TSG approach include:

Complete Hadoop or DynamoDB backend along with API calls. TSG leverages our OpenContent Web Services to access either Hadoop, DynamoDB or other ECM repositories like Documentum or Alfresco. OpenContent now supports both Hadoop and DynamoDB in clustered settings.
OpenContent Product Suite – TSG supports our complete product offering including OpenContent Search and Case, OpenAnnotate and OpenOverlay on both Hadoop and DynamoDB. Clients can leverage both an “out of the box” backend as well as a highly configurable, efficient and supported front end that has been implemented for hundreds of customers.
OpenMigrate – TSG supports migrations to and from repositories with OpenMigrate. OpenMigrate currently supports a variety of legacy ECM platforms as well as Hadoop. DynamoDB support is being added in 2019.

Update – see a related article by Jeff Potts at ECM Architect on NoSQL and his experience with Alfresco.

Summary – Why will existing ECM vendors be disrupted by Big Data

TSG would predict that it would be very difficult to see the traditional ECM vendors embrace a big data approach for the following reasons:

Cost – Legacy ECM vendors are tied to their current cost point to clients. We have even seen pressure from clients to reduce ECM costs. Supporting additional repositories is not a trivial activity and we don’t see existing vendors making that investment without an identified revenue stream from new clients that doesn’t cannibalize existing revenue streams.
Focus – Legacy ECM vendors have been focused on selling more to clients, not necessarily delivering more capabilities at the same price. See our thoughts on the ECM Suite for more detail.
Interfaces – Many of the Legacy ECM vendors’ interfaces are hardcoded to their current repositories with proprietary API and RDBMS repository calls and would not be portable to a new NoSQL repository that would take advantage of the NoSQL approach.

The innovative vendors will embrace the new technologies provided by Big Data and provide both interfaces and pricing that disrupt rather than continue an existing approach.

