One exciting item TSG is pursuing this quarter as part of our Amazon Web Services practice is adding Enterprise Content Management capabilities on top of Amazon DynamoDB. This post begins a series on our design and development activities to provide an OpenContent connection to Amazon DynamoDB.
Background on Amazon DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. The main advantage of Dynamo is that it lets customers offload the administrative work of operating and scaling a distributed database to AWS: customers do not need to worry about hardware provisioning, replication, software patching or cluster scaling, as AWS handles those functions. Amazon believes in the power of Dynamo so much that it has become the main database for Amazon's own daily transactions.
Amazon is able to trust Dynamo with all of this data due to Dynamo’s high availability and durability design. Dynamo leverages global tables that replicate data across AWS regions, so if one region’s cluster goes down, the data remains available from another region’s table. Dynamo also leverages the AWS cloud to provide both point-in-time recovery to any second within the last 35 days and on-demand backups and restores as required.
Dynamo compares very favorably in these regards to Hadoop’s HBase, which is also a non-relational database that provides redundancy across clusters and is available as a TSG ECM solution. One of the main differences between Hadoop and Dynamo is the servers that are required for each solution. Hadoop requires Linux servers for installation. Customers can implement Hadoop either on premise or in the cloud on provisioned servers, but they need to manage the database and clusters manually or engage a vendor like Hortonworks to manage the Hadoop cluster. Dynamo automatically handles the administrative tasks associated with provisioning servers but is only available in the AWS Cloud. Customers that have a strict on-prem policy would not be able to use Dynamo.
Another major difference between Dynamo and Hadoop is that Hadoop is an Apache open-source project while Dynamo APIs are controlled by Amazon. Hadoop allows you to see the source and even contribute enhancements to it if needed, and generally the libraries will not change in substantial ways after deployment. Dynamo APIs on the other hand are controlled by Amazon, and are subject to change at any point, which could affect deployments against the DB. We would expect major Dynamo API changes to be rare, but it is an important consideration when choosing a database.
DynamoDB Architecture – NoSQL Database
One of the major benefits for Dynamo customers is a “not only SQL” approach, often referred to as NoSQL. DynamoDB is a distributed, non-relational key-value and document database that builds on the principles of Amazon’s original Dynamo paper (Dynamo: Amazon’s Highly Available Key-value Store). Since the 1990s, all ECM vendors have leveraged some type of database under the covers of their architecture to manage the metadata of documents. Metadata/attributes include title, file location, author, security and all other data associated with the document. Typically we see Oracle or Microsoft SQL Server, with Alfresco customers also picking MySQL. Dynamo could be leveraged by ECM vendors or solutions to provide an alternative database. Some of the unique and modern features of Dynamo versus traditional relational databases include:
- Limited Database/Docbase Administration – Built for a “big data” approach, Dynamo allows the database to adapt as new data is presented rather than relying on the traditional “call the DBA to add a column or an index”. Users should think of it as a “tagging” structure rather than a traditional relational database model with a strict schema. Tagging inherently fits a content management framework, as we are always tagging documents with metadata. For an ECM example, when storing an invoice document, Dynamo can receive all of the attributes as name/value pairs on a single item and will take care of storing all the descriptors and values. If a later invoice has a new attribute not present on the old invoices, Dynamo can simply store that attribute on just those documents.
- Limited Back-up/Recovery – Also as a “big data” approach, Dynamo provides a scalable, redundant model that can be leveraged across multiple AWS Regions with automatic redundancy/clustering. One of the big issues with our typical ECM relational database approach has always been coordinating the back-up of the relational database with the file-store. Dynamo provides the ability to remove that requirement as well as simplify setting up of a clustered environment.
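The tagging idea in the first bullet can be sketched in a few lines of Python. This is a minimal illustration, not TSG's OpenContent code: the helper `to_attribute_values` and the invoice attribute names are hypothetical, and only the typed attribute-value layout (`S` for strings, `N` for numbers sent as strings, `BOOL`, `L` for lists) follows DynamoDB's low-level item format. It shows two invoice items destined for the same table, where only the second carries an extra attribute.

```python
from decimal import Decimal


def to_attribute_values(metadata):
    """Convert a plain dict into DynamoDB's typed attribute-value layout.

    Hypothetical helper for illustration; it handles only str, bool,
    int/float/Decimal, and lists of those.
    """
    def convert(value):
        # bool is tested before int because bool is a subclass of int.
        if isinstance(value, bool):
            return {"BOOL": value}
        if isinstance(value, str):
            return {"S": value}
        if isinstance(value, (int, float, Decimal)):
            return {"N": str(value)}  # numbers travel as strings
        if isinstance(value, list):
            return {"L": [convert(v) for v in value]}
        raise TypeError(f"unsupported type: {type(value).__name__}")

    return {name: convert(value) for name, value in metadata.items()}


# Two invoices for the same table; the attribute names are made up.
invoice_1 = to_attribute_values({
    "document_id": "inv-0001",
    "title": "Invoice 0001",
    "amount": Decimal("125.50"),
})

# A later invoice adds "po_number"; older items need no schema change.
invoice_2 = to_attribute_values({
    "document_id": "inv-0002",
    "title": "Invoice 0002",
    "amount": Decimal("80.00"),
    "po_number": "PO-778",
})

# Attributes present only on the newer invoice.
print(sorted(set(invoice_2) - set(invoice_1)))
```

Only the key attributes of a table are fixed up front; everything else, like the hypothetical `po_number` here, is simply present or absent per item.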
DynamoDB – What’s the Catch – Searching and Commit (Transactional Queries)
Dynamo’s redundancy/clustering approach does come with several small warnings.
For traditional search, a query like “find me all the documents created last year by this author” would initiate a scan across the cluster when those values are not indexed, and could take a long time. Since Dynamo was architected to deal with billions of rows, searching/querying has to be done either by id or offloaded to some external search appliance.
Also, in keeping with its big-data design, Dynamo does not, at the time of writing, provide the transaction options of legacy databases. From our work with ECM clients, this rarely matters, as that functionality is more associated with complex transactions involving separate updates to several related tables than with simple document tagging.
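To make the cost difference concrete, here is a toy in-memory model, not DynamoDB itself. The class `ToyTable` and its fields are hypothetical; it only counts how many items each access path has to examine. A lookup by id touches one item, while an unindexed filter on author and year must examine every item in the table.

```python
class ToyTable:
    """In-memory stand-in for a key-value table (illustration only)."""

    def __init__(self):
        self._items = {}   # document id -> attribute dict
        self.examined = 0  # items touched by the last operation

    def put(self, doc_id, attributes):
        self._items[doc_id] = dict(attributes)

    def get_by_id(self, doc_id):
        """Key lookup: examines at most one item."""
        item = self._items.get(doc_id)
        self.examined = 1 if item is not None else 0
        return item

    def scan(self, predicate):
        """Unindexed search: every item is examined, then filtered."""
        self.examined = 0
        matches = []
        for doc_id, item in self._items.items():
            self.examined += 1
            if predicate(item):
                matches.append(doc_id)
        return matches


table = ToyTable()
for i in range(10_000):
    table.put(f"doc-{i}", {"author": f"user{i % 50}", "year": 2016 + i % 3})

table.get_by_id("doc-4242")
lookup_cost = table.examined

hits = table.scan(lambda d: d["author"] == "user7" and d["year"] == 2017)
scan_cost = table.examined

# Items examined by the key lookup vs. the scan, and the number of matches.
print(lookup_cost, scan_cost, len(hits))
```

The scan examines all 10,000 items to return a small number of matches, which is the behavior that motivates pushing metadata searches to an external index.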
DynamoDB Searching – can we just use Solr or Elasticsearch?
Retrieving metadata from Dynamo is slightly different from a relational database, as Dynamo will farm the request out (again a big-data approach) to multiple servers and compile the results. Performance, and particularly search performance, has always been a key requirement for our ECM customers. In our initial testing, we had some concern about a federated search and the performance of Dynamo.
Just like the ECM vendors Documentum and Alfresco, we would recommend leveraging Solr/Lucene as both the metadata and full-text search engine. Similar to how Documentum/Alfresco use a relational database to store attributes, Dynamo retrieval could be used for system-of-record requests (e.g., what are the attributes of this document?). Anytime a search against metadata is needed, Solr/Lucene would be used to find the documents. (See our related post on Solr Services)
AWS also has various index services integrated into its platform. While we are using Solr today for our Dynamo search index, this implementation could be moved over to Amazon CloudSearch and/or Amazon Elasticsearch Service in the future, the benefit being that AWS would manage the index in much the same way as it manages Dynamo.
We are also considering adding a normalized relational DB for certain large-volume clients (think 500 million documents for our Insurance Claim clients) to support not only Hadoop/Dynamo but also Alfresco and Documentum. Look for more in future posts.
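The division of labor described above can be sketched as follows. This is a simplified in-memory illustration rather than the OpenContent implementation: `ToySearchIndex` stands in for Solr, a plain dict stands in for the Dynamo table, and all names are hypothetical. Metadata searches go to the index and return document ids; the system of record is then read only by id.

```python
from collections import defaultdict


class ToySearchIndex:
    """Tiny inverted index standing in for a real search engine."""

    def __init__(self):
        self._postings = defaultdict(set)  # (field, value) -> doc ids

    def index(self, doc_id, attributes):
        for field, value in attributes.items():
            self._postings[(field, value)].add(doc_id)

    def search(self, **criteria):
        """Return ids of documents matching every field=value pair."""
        id_sets = [self._postings.get(pair, set()) for pair in criteria.items()]
        return sorted(set.intersection(*id_sets)) if id_sets else []


record_store = {}  # stands in for the system-of-record table
search_index = ToySearchIndex()


def add_document(doc_id, attributes):
    """Write to the system of record, then to the search index."""
    record_store[doc_id] = dict(attributes)
    search_index.index(doc_id, attributes)


def find_documents(**criteria):
    """Search the index for ids, then fetch each record by id."""
    return [record_store[doc_id] for doc_id in search_index.search(**criteria)]


add_document("doc-1", {"author": "jsmith", "year": 2017, "title": "Claim A"})
add_document("doc-2", {"author": "jsmith", "year": 2018, "title": "Claim B"})
add_document("doc-3", {"author": "mlee", "year": 2017, "title": "Claim C"})

results = find_documents(author="jsmith", year=2017)
print([doc["title"] for doc in results])
```

In a real deployment the two writes in `add_document` would need to be kept in sync (for example by reindexing on failure), which this sketch does not attempt.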
AWS S3 or Glacier FileStore
Dynamo is unique in that it is priced not solely on storage size but also on read and write “units” into the DB. This pushes the solution away from storing the physical content of documents within Dynamo (which is technically possible, but TSG would not recommend it) and toward AWS storage solutions, in particular S3 and Glacier. These services have the ability to replace another component of the ECM architecture (and one near and dear for EMC), the Storage Area Network or SAN. On a related note, see our thoughts on how Cloud Object Stores are disrupting traditional ECM.
Dynamo has the ability to set a Time To Live (TTL) on everything stored within the DB. If data storage and the pricing of that data become a concern, customers can configure a TTL for all main content and have it linked back to a normal S3 bucket. Once the TTL in the DB is hit, the content could be archived in Glacier and the DB metadata could be offloaded, if needed, to a lower-cost DB store.
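A quick back-of-the-envelope calculation shows why the read and write units steer content out of the database. The sketch below assumes DynamoDB's provisioned-capacity rules as we understand them from the AWS documentation: a write unit covers an item of up to 1 KB, a read unit covers a strongly consistent read of up to 4 KB (an eventually consistent read costs half), and an item is limited to 400 KB. The function names and the 2 KB and 300 KB example sizes are made up for illustration.

```python
import math

WRITE_UNIT_BYTES = 1024      # one write unit covers up to 1 KB
READ_UNIT_BYTES = 4 * 1024   # one read unit covers up to 4 KB
MAX_ITEM_BYTES = 400 * 1024  # DynamoDB item size limit


def write_units(item_bytes):
    """Write units consumed by writing one item of the given size."""
    if item_bytes > MAX_ITEM_BYTES:
        raise ValueError("item exceeds the 400 KB item size limit")
    return max(1, math.ceil(item_bytes / WRITE_UNIT_BYTES))


def read_units(item_bytes, strongly_consistent=True):
    """Read units consumed by reading one item of the given size."""
    if item_bytes > MAX_ITEM_BYTES:
        raise ValueError("item exceeds the 400 KB item size limit")
    units = max(1, math.ceil(item_bytes / READ_UNIT_BYTES))
    return units if strongly_consistent else units / 2


metadata_only = 2 * 1024    # hypothetical metadata item pointing at S3
with_content = 300 * 1024   # hypothetical item embedding the document

for label, size in [("metadata only", metadata_only), ("with content", with_content)]:
    print(label, write_units(size), read_units(size))
```

Under these assumptions, embedding a 300 KB document costs 150 times more write units per write than storing a 2 KB metadata item that points at the file in S3.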
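As a rough sketch of how a TTL might be attached to document metadata: DynamoDB's TTL feature reads a Number attribute holding a Unix epoch time in seconds, and the attribute name is chosen when TTL is enabled on the table. The attribute names `expires_at` and `content_s3_key`, the helper functions, and the 365-day retention period below are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def ttl_epoch_seconds(created_at, retention_days):
    """Expiry time as Unix epoch seconds, the form DynamoDB TTL expects."""
    expires_at = created_at + timedelta(days=retention_days)
    return int(expires_at.timestamp())


def build_metadata_item(doc_id, s3_key, created_at, retention_days):
    """Metadata item that points at content in S3 and carries a TTL."""
    return {
        "document_id": {"S": doc_id},
        "content_s3_key": {"S": s3_key},  # content itself lives in S3
        "expires_at": {"N": str(ttl_epoch_seconds(created_at, retention_days))},
    }


created = datetime(2018, 1, 1, tzinfo=timezone.utc)
item = build_metadata_item("inv-0001", "invoices/inv-0001.pdf", created, 365)
print(item["expires_at"])
```

Note that DynamoDB removes expired items in the background rather than at the exact second of expiry, and moving the S3 content to Glacier would be a separate step (for example an S3 lifecycle rule), which this sketch does not cover.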
TSG Dynamo Roadmap Plans
TSG is currently extending our OpenContent Web Services connector to support DynamoDB, similar to our support for Hadoop. Our roadmap is to move all of the OpenContent Services over gradually, starting with simple add and retrieve and then adding more complex items like check-in/check-out, renditioning and annotations. By adding OpenContent Web Services for DynamoDB, TSG can support our full OpenContent Search, OpenContent Case, OpenAnnotate/OpenRedact, OpenOverlay and OpenMigrate as well as other components like Adlib for transformation services with DynamoDB. We are currently targeting the end of 2018 for the majority of APIs to be available, with specific client requirements driving the exact order.
Look for more posts over the next few months as our plans and clients mature. Posts will include both new product development and lessons learned.
If you have any thoughts, please post below.