One of the myths about DynamoDB for Document Management that we hear too often is “but isn’t that just for big data?” This post explains the benefits of DynamoDB’s big data capabilities and data model in a Document Management context compared to traditional database systems. Examples will include how we are currently building our own DynamoDB offering, scheduled for release at the end of 2018.
DynamoDB – Big Data versus traditional Big Database
At its core, DynamoDB provides a very robust, distributed data store with powerful parallel processing capabilities. Understanding the difference between DynamoDB and traditional databases requires understanding when, and for what purposes, each was created.
Relational databases first emerged in the 1980s, when disk speed was slow and disk space and CPU time were at a premium. Built for critical business systems, relational databases focused on storing known data in a static data model. DBAs were employed extensively to update the model and to add indexing and other performance improvements. Performance tuning reached down to the hardware level, to where individual fields were stored within the disk array, a very expensive component at the time. Most modern ECM solutions, emerging in the 1990s and 2000s, rely on a relational database to store the critical document management fields.
DynamoDB, like Hadoop and other “not only SQL” (NoSQL) repositories, follows the more modern approach based on the huge gains in the economics of disk space and hardware cost, and on new requirements for unstructured big data. DynamoDB allows very quick, distributed/parallel retrieval of a specific data record that can contain data in a variety of different formats.
DynamoDB – Schema-on-Read versus Schema-on-Write
One of the big differences between DynamoDB and a traditional RDBMS is how data is organized into a schema. Traditional databases require schema-on-write: the DB schema is static and must be well defined before the data is loaded. The schema-on-write process requires:
- Analysis of data processes and requirements
- Data Modeling
- Loading/Testing
If any of the requirements change, the process has to be repeated.
Schema-on-read takes a less restrictive approach, allowing raw, unprocessed data to be stored immediately; how the data is used is determined when the data is read. The table below summarizes the differences:
| Traditional Database (RDBMS) | DynamoDB |
| --- | --- |
| Create static DB schema | Copy data in native format |
| Transform data into RDBMS format | Create schema and parser |
| Query data in RDBMS format | Query data in native format |
| New columns must be added by a DBA before new data can be added | New data can start flowing in at any time |
Schema-on-Read provides a unique benefit for Big Data in that data can be written to DynamoDB without having to know exactly how it will be retrieved.
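The schema-on-read benefit can be sketched in a few lines of Python: two documents with entirely different attribute sets can live in the same DynamoDB table, because only the key attribute is fixed by the table definition. The table name `Documents` and all field names below are illustrative assumptions, not part of any real system; the actual write calls are shown commented out because they would require boto3 and AWS credentials.

```python
# Sketch: two items with different attribute sets in one DynamoDB table.
# Only the key attribute ("objectId") is fixed; everything else is
# schema-on-read. All names here are illustrative assumptions.

def make_item(object_id, **attributes):
    """Build a DynamoDB-style item dict; non-key attributes are free-form."""
    item = {"objectId": object_id}
    item.update(attributes)
    return item

invoice = make_item("doc-001", objectName="invoice.pdf",
                    invoiceNumber="INV-2018-042", amountDue="1250.00")
contract = make_item("doc-002", objectName="contract.docx",
                     counterparty="Acme Corp", effectiveDate="2018-06-01")

# With boto3 and credentials configured, each dict could be written as-is:
# import boto3
# table = boto3.resource("dynamodb").Table("Documents")
# table.put_item(Item=invoice)
# table.put_item(Item=contract)
```

Note that the two items share only the key attribute; no schema migration was needed before writing either one.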
DynamoDB – what it means for Big Data (and big Document issues)
In a Big Data world, data needs to be captured without knowing, up front, the structure that will hold it. As is often mentioned, stores like DynamoDB are used by consumer/social sites that need to store huge amounts of unstructured data quickly, with consumption of that data coming at a later time.
As a typical Big Data example, a social site stores all of the different links clicked by a user. The storing application might keep the date, time, and other data in one record per user that is updated each time the user returns. Given the user, a retrieval application can quickly access a large, semi-structured record of that particular user’s activity over time.
Running on Amazon also allows the solution to leverage S3 buckets. While content could technically be stored as bytes within DynamoDB, it makes more sense to store the content in Amazon S3 and keep a simple S3 link within the DynamoDB metadata to connect the two. This lets DynamoDB function as it was designed, as a pure database, rather than serving as both a metadata and content store.
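A minimal sketch of this split: the bytes go to S3, and only a link lands in the DynamoDB metadata row. The bucket name, key layout, and field names are assumptions for illustration; the boto3 calls that would actually move the bytes are commented out.

```python
# Sketch: document content lives in S3; DynamoDB stores only a pointer.
# Bucket and key layout are illustrative assumptions.

S3_BUCKET = "ecm-content-store"

def content_link(object_id, filename):
    """Derive the S3 location for a document's content."""
    return f"s3://{S3_BUCKET}/{object_id}/{filename}"

metadata_item = {
    "objectId": "doc-001",
    "objectName": "invoice.pdf",
    "content": content_link("doc-001", "invoice.pdf"),  # pointer, not bytes
}

# With boto3, the bytes go to S3 and only the pointer goes to DynamoDB:
# import boto3
# boto3.client("s3").put_object(Bucket=S3_BUCKET,
#                               Key="doc-001/invoice.pdf", Body=pdf_bytes)
# boto3.resource("dynamodb").Table("Documents").put_item(Item=metadata_item)
```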
DynamoDB for Document Management – It’s not about Search
Schema-on-write works very well for the “known unknowns” of what we would typically call, in document management, a document search; schema-on-read does not handle this as well. To illustrate, let’s take a typical document management requirement: find all documents requiring review by a given date.
In the RDBMS case, the date column would be joined in a query with the “required review” column to quickly produce a list of all the documents needing review. If performance is not acceptable, indexes can be added to the database.
In the DynamoDB case, ALL of the documents in the repository would first need to be retrieved and opened; once opened, the required-review and date attributes would be read to build the list. There are no indexes to speed up performance.
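The cost of that approach can be sketched in plain Python, with an in-memory list standing in for a full table scan. The field names are illustrative assumptions; the point is that every item must be touched regardless of how many actually match.

```python
# Sketch of why a date-based search is expensive without an index:
# every item must be examined. The list below stands in for a full
# DynamoDB table scan; field names are illustrative assumptions.

table_items = [
    {"objectId": "doc-001", "requiredReview": True,  "reviewDate": "2018-09-01"},
    {"objectId": "doc-002", "requiredReview": False, "reviewDate": "2018-09-15"},
    {"objectId": "doc-003", "requiredReview": True,  "reviewDate": "2018-10-01"},
]

def documents_due(items, cutoff):
    """Touch every item (a full scan) to find documents due for review."""
    return [i["objectId"] for i in items
            if i["requiredReview"] and i["reviewDate"] <= cutoff]

due = documents_due(table_items, "2018-09-30")  # → only doc-001 matches
```

The cost is proportional to the total number of documents, not the number of matches; in DynamoDB terms this corresponds to a Scan with a FilterExpression, which still reads the entire table.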
After this example, a typical document management architect might conclude that DynamoDB doesn’t really fit this basic requirement. What that opinion doesn’t take into account is the emergence of the “search appliance”, and particularly Lucene/Solr/Elasticsearch, as the default indexing/search engine for document management.
For better-performing search in BOTH the RDBMS and DynamoDB implementations, we would recommend leveraging Lucene/Solr/Elasticsearch to index the document’s full text AND metadata. All modern ECM/Document Management vendors (Documentum and Alfresco, as examples) now leverage some type of Lucene/Solr/Elasticsearch for search.
Amazon provides both AWS-hosted SolrCloud and Elasticsearch services as an easy way to deploy a combined DynamoDB architecture. By having AWS manage and maintain the physical infrastructure, organizations get easier integration across the combined environments.
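The pattern can be sketched as projecting each DynamoDB item into a search document: DynamoDB remains the system of record, while attribute queries hit the search engine. The index name and searchable fields are illustrative assumptions, and the commented-out calls assume the `elasticsearch` Python client against a running cluster.

```python
# Sketch: mirror DynamoDB metadata into a search index so attribute
# queries hit the search engine, not the table. Index and field names
# are illustrative assumptions.

def to_search_doc(item):
    """Project a DynamoDB item down to the fields worth searching on."""
    searchable = ("objectName", "title", "requiredReview", "reviewDate")
    return {k: item[k] for k in searchable if k in item}

item = {"objectId": "doc-001", "objectName": "invoice.pdf",
        "reviewDate": "2018-09-01",
        "content": "s3://ecm-content-store/doc-001/invoice.pdf"}

doc = to_search_doc(item)  # S3 pointer and key stay out of the index body

# With the elasticsearch Python client and a running cluster:
# from elasticsearch import Elasticsearch
# es = Elasticsearch()
# es.index(index="documents", id=item["objectId"], body=doc)
```

On a search hit, the returned id is then used for a fast DynamoDB key lookup, which is the one access pattern DynamoDB excels at.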
DynamoDB – Big Data for Document Management – Schema-on-Read Example
The major advantage of schema-on-read is the ability to easily store all of a document’s metadata without having to define columns up front. Some quick examples:
- Versioning – A difficulty with most document management tools is that, given a structured schema-on-write DB model, the ability to store different attributes on different versions is not always available, or requires a work-around. With DynamoDB, each version has its own data, and different attributes can be stored on different versions.
- Audit Trail – This is one we often see implemented with difficulty: a single “audit” database table that ends up getting huge (and is itself a big data application). With DynamoDB, the audit trail can be maintained within the given document’s row and quickly retrieved and parsed.
- Model Updates – Metadata often needs to be added to a content model after a system has gone live, because the content it stores matures over time. With DynamoDB, new metadata can always be written in as a new column.
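The model-update case above can be sketched as a single-item update that introduces a brand-new attribute, with no schema change anywhere. The helper below simulates what a DynamoDB UpdateItem “SET” does to one row; the attribute name and the commented-out boto3 call are illustrative assumptions.

```python
# Sketch: adding new metadata after go-live needs no schema migration.
# simulate_update mimics a DynamoDB UpdateItem "SET" on a single item;
# attribute names are illustrative assumptions.

def simulate_update(item, **new_attributes):
    """Return the item with new attributes merged in, as UpdateItem would."""
    updated = dict(item)
    updated.update(new_attributes)
    return updated

doc = {"objectId": "doc-001", "objectName": "invoice.pdf"}

# A brand-new attribute is introduced for this one document only:
doc = simulate_update(doc, legacySystemId="OLDCO-778123")

# The equivalent real call (boto3, credentials required):
# table.update_item(
#     Key={"objectId": "doc-001"},
#     UpdateExpression="SET legacySystemId = :v",
#     ExpressionAttributeValues={":v": "OLDCO-778123"})
```

Other documents in the same table are untouched; only this row carries the new attribute.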
DynamoDB – Building a Document Model
```
Object Model {
    objectId (key)
    objectName
    title
    modifyDate
    creationDate
    creator
    contains
    audits
    ...
    content (S3 link)
    rendition (S3 link)
}
```
The Objects table contains a single row for each document, including:
- metadata
- content (reference to S3 object)
- renditions (references to S3 object)
This is a rough schema of the common data that could exist when a document is added, but since DynamoDB is schemaless, the system is not bound by the model above and can evolve and change. Every document can add columns “on the fly”, with no database schema updates, whenever new metadata needs to be captured. One relevant example of this approach is a corporate acquisition or merger of repositories, when new documents from the old company need to be entered into the system quickly. The metadata can be dumped into the DynamoDB repository as-is, without having to worry about mapping individual fields to the existing schema.
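Tying the model together, a single “Objects” row for a newly added document might look like the sketch below. The field names follow the rough model above; the values, the shape of the audit entries, and the commented-out boto3 call are illustrative assumptions.

```python
# Sketch: one Objects-table row for a newly added document, following
# the rough model above. Values and the audit-entry shape are
# illustrative assumptions.

document = {
    "objectId": "doc-001",                     # partition key
    "objectName": "invoice.pdf",
    "title": "Q3 Invoice",
    "creationDate": "2018-08-01T09:30:00Z",
    "modifyDate": "2018-08-01T09:30:00Z",
    "creator": "jsmith",
    "audits": [                                # audit trail lives on the row
        {"event": "CREATE", "user": "jsmith", "at": "2018-08-01T09:30:00Z"},
    ],
    "content": "s3://ecm-content-store/doc-001/invoice.pdf",
    "rendition": "s3://ecm-content-store/doc-001/invoice-preview.png",
}

# With boto3 and credentials configured:
# import boto3
# boto3.resource("dynamodb").Table("Objects").put_item(Item=document)
```

Because the table is schemaless, a second document written alongside this one could carry an entirely different set of non-key attributes.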
Summary
DynamoDB has an advantage over traditional RDBMS systems in that data can be stored and retrieved quickly as unstructured records, without requiring extensive analysis of how that data will typically be retrieved. The drawback of a DynamoDB-only approach is that searches on any attribute other than the “documentId” do not perform well at scale. We advise that DynamoDB implementations use Lucene/Solr/Elasticsearch as the query component of a Document Management solution to allow highly performant searches against the repository.
Stay tuned to the blog for more developments on using DynamoDB as a document management repository. Please leave your comments below.