DynamoDB 11 Billion Benchmark – Document and Folder Details

TSG started an 11 Billion Document Benchmark with DynamoDB last Friday to test and verify the power of Amazon Web Services as well as the TSG ECM products on an unprecedented scale. As of this morning we have migrated approximately 9 billion documents. This post will present some underlying detail of the DynamoDB repository, with a particular focus on document and folder objects.

Monday’s post detailed the reasons and expectations for the 11 billion document benchmark, and Tuesday’s post showed the interface and migration process. This post adds detail on the document and folder objects and how they tie into the OpenContent Management Suite’s ability to configure the user experience.

DynamoDB Object Model – NoSQL Approach

One of the big advances for big data is the introduction of NoSQL (commonly read as “Not Only SQL”) as a data storage and retrieval approach. Developed as an alternative to relational databases, benefits of this approach include:

Simplicity of design
Simpler “horizontal” scaling to clusters of machines (which is a problem for relational databases), and finer control over availability.
The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL.
Sometimes the data structures used by NoSQL databases are also viewed as more flexible than relational database tables.

Specifically for Document Management customers, there is a simple difference between the two approaches.

Relational Database – Would store the attributes in columns/rows of the relational database with a pointer to the document file location in a SAN or object store.
NoSQL – Would store the attributes in an entry with tags/metadata that describe the document, possibly along with the document content itself, either in the repository or in a SAN or object store. Tags can be XML, JSON or a variety of other alternatives.

For our DynamoDB approach, we are using JSON with the following layouts for documents

As well as a similar layout for folders

(Note: these are examples of our Claim Auto Document and Folder types from the benchmark. Other types may have slightly different metadata fields.)
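As a rough illustration of what JSON items of this kind might look like, the sketch below builds one document and one folder as Python dictionaries and serializes them. Every attribute name and value is hypothetical, including the object store path, with one exception: rel_children_ss is the folder attribute discussed in the Folder Detail section of this post. The real Claim Auto types have their own fields.

```python
import json

# Hypothetical document item: attribute names and values are illustrative only.
document = {
    "objectId": "doc-0001",
    "objectType": "claimAutoDocument",
    "claimNumber": "CLM-12345",
    "documentDate": "2019-05-01",
    # Assumed pointer to the content file in an object store.
    "contentLocation": "s3://example-bucket/doc-0001.pdf",
}

# Hypothetical folder item. Only rel_children_ss comes from the post:
# it lists the ids of the documents contained in the folder.
folder = {
    "objectId": "folder-0001",
    "objectType": "claimAutoFolder",
    "claimNumber": "CLM-12345",
    "rel_children_ss": ["doc-0001"],
}

document_json = json.dumps(document, indent=2)
folder_json = json.dumps(folder, indent=2)
print(document_json)
print(folder_json)
```

Because each item carries its own attribute names, a different object type can add or drop fields without a schema change.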

One advantage the OpenContent Management Suite (OCMS) has over traditional document management interfaces is the ability to configure all the portions of the interface without requiring any code. While the JSON object can have all of the detail to describe each attribute, OCMS will map the name to a label for display in the interface, allowing names/interfaces to change and adapt for different users and languages without requiring the underlying repository to change.
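The name-to-label idea can be sketched in a few lines. This is not how OCMS is implemented or configured; the LABELS table, the display_fields function and the attribute names are all made up for illustration, and the choice to hide attributes without a configured label is an assumption of this sketch.

```python
# Hypothetical label configuration: attribute name -> display label, per language.
LABELS = {
    "en": {"claimNumber": "Claim Number", "documentDate": "Document Date"},
    "es": {"claimNumber": "Número de reclamo", "documentDate": "Fecha del documento"},
}


def display_fields(item, language="en"):
    """Return (label, value) pairs for attributes that have a configured label."""
    # Fall back to English when the requested language is not configured.
    labels = LABELS.get(language, LABELS["en"])
    return [(labels[name], value) for name, value in item.items() if name in labels]


# The stored item never changes; only the label configuration differs per user.
item = {"claimNumber": "CLM-12345", "documentDate": "2019-05-01", "objectId": "doc-0001"}
print(display_fields(item, "en"))
print(display_fields(item, "es"))
```

Renaming a label or adding a language touches only the configuration table, which is the point made above about the repository not having to change.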

DynamoDB – Folder Detail

One interesting component of the Folder model is that the folder itself includes detail about the document objects it contains. (The rel_children_ss attribute from the folder picture above is a list of all document ids in the folder.)

As part of our benchmark, the team is going to test two models for displaying the contents of a folder.

JSON storage of document objects – Currently the folder object contains all of the document ids in a repeating field. This allows for fast viewing of the objects in the folder, a typical requirement for case management/folder viewing. Benefits include fast, scalable access to folder objects without a large Elasticsearch index. Downsides would be having to update the folder object every time a document is added or deleted, and having to handle very large folders (TSG has one client with 65,000 documents in a folder).
Elasticsearch for documents – Currently, Elasticsearch is only being used for access to folder objects. Phase 2 of the benchmark will include indexing all or part of the 11 billion documents to test using Elasticsearch to display the objects contained in a folder. Benefits include not having to update folder objects when adding or removing documents. Downsides include having to maintain an index for all documents and needing additional Elasticsearch resources.
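The trade-off between the two models can be sketched with in-memory dictionaries standing in for the repository and the search index. No real DynamoDB or Elasticsearch calls are made here, and the parentFolder attribute and all function names are hypothetical; only rel_children_ss comes from the folder object described above.

```python
# In-memory stand-ins for the repository; not real DynamoDB or Elasticsearch calls.
folders = {"folder-0001": {"rel_children_ss": ["doc-0001", "doc-0002"]}}
documents = {
    "doc-0001": {"parentFolder": "folder-0001"},  # parentFolder is a hypothetical attribute
    "doc-0002": {"parentFolder": "folder-0001"},
    "doc-0003": {"parentFolder": "folder-0002"},
}


def list_children_from_folder(folder_id):
    """Model 1: read the child ids straight from the folder object."""
    return list(folders[folder_id]["rel_children_ss"])


def add_document_model1(folder_id, doc_id):
    """Model 1 cost: adding a document also rewrites the folder object."""
    documents[doc_id] = {"parentFolder": folder_id}
    folders[folder_id]["rel_children_ss"].append(doc_id)


def list_children_from_index(folder_id):
    """Model 2: find children by searching documents (stand-in for an index query)."""
    return sorted(d for d, attrs in documents.items() if attrs["parentFolder"] == folder_id)


def add_document_model2(folder_id, doc_id):
    """Model 2: only the document is written; the folder object is untouched."""
    documents[doc_id] = {"parentFolder": folder_id}


add_document_model1("folder-0001", "doc-0004")
print(list_children_from_folder("folder-0001"))
print(list_children_from_index("folder-0001"))
```

In model 1 a folder read is a single lookup, but every add or delete is a second write whose size grows with the folder. In model 2 writes stay small, at the price of keeping every document searchable.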

TSG is planning on testing both options and may offer both to DynamoDB customers, since one or the other might make more sense depending on the customer’s use case. Below is a daily video of our progress to date on the benchmark with some more detail on how the object model is configured in the OpenContent Management Suite.

Let us know your thoughts below and look for another entry tomorrow.
