TSG started an 11 Billion Document Benchmark with DynamoDB last Friday to test and verify the power of Amazon Web Services, as well as the TSG ECM products, at an unprecedented scale. As of this morning we have migrated approximately 6.3 billion documents, averaging 20,000 ingested per second – well over halfway to our goal of 11 billion. This post will discuss the goals and approach of the benchmark.
DynamoDB Benchmark Goals
TSG initially announced our development efforts to create an ECM offering for DynamoDB back in October 2018. Based on the success of our current Hadoop product, developing a product for DynamoDB was greatly simplified, with most of our development and testing efforts completed in a couple of months. While we have already had some success with multiple Hadoop clients, our team thought an internal benchmark in partnership with Amazon would show off the true power of DynamoDB/AWS. Our goal is to simulate all the components of a massively large repository to verify that our tools and approaches can scale. One truly powerful aspect of DynamoDB and other NoSQL approaches is how quickly we can build a massive repository by leveraging AWS hardware scaling. When compared with other ECM repositories based on legacy relational database technologies, DynamoDB’s “schema on read” approach can ingest at remarkable speed. (See our related post on the data model differences between relational databases and big data “Not Only SQL – NoSQL” approaches.)
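The “schema on read” point can be sketched in a few lines of Python. The table and attribute names below are illustrative, not TSG’s actual data model: the idea is simply that items destined for the same DynamoDB table can carry entirely different attribute sets, so ingestion never waits on schema changes.

```python
# Two document types bound for the same DynamoDB table, each with its own
# attributes; only the table key (doc_id here) needs to be common.
def make_claim_item(doc_id, claim_number):
    # Attributes specific to an insurance claim document.
    return {"doc_id": doc_id, "case_type": "Auto Claim", "claim_number": claim_number}

def make_hr_item(doc_id, employee_id):
    # Attributes specific to a Human Resources document; no shared schema
    # beyond the key, and no ALTER TABLE step to slow the load down.
    return {"doc_id": doc_id, "case_type": "Human Resources", "employee_id": employee_id}

claim = make_claim_item("claim-0001", "AC-2019-4411")
hr = make_hr_item("hr-0001", "E-1002")

# Both items could be written to the same (hypothetical) table with boto3:
#   boto3.resource("dynamodb").Table("documents").put_item(Item=claim)
print(sorted(set(claim) ^ set(hr)))  # attributes unique to each item
```

A relational schema would require every one of these attributes to exist as a column up front; here each item simply carries whatever attributes its case type needs.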
The overall goal of the benchmark is to prove out the scaling capabilities and advantages of AWS, DynamoDB and AWS Elasticsearch for our large volume, case management clients. Rather than pursue a benchmark built on a “do everything” approach, the benchmark will focus on the limited capabilities typically required by large volume clients like our health and insurance claim repositories. In working with our large volume clients, we have found that oftentimes the “do everything” document management solution can add significant overhead and performance issues for massive repositories. Some examples of the large volume approach include:
- Case Security rather than repository security – The case management approach assumes that an external system handles the security of which case is available to which user. In working with our insurance claim clients, we have set up this approach multiple times to improve performance and reduce the bloat and added system requirements of the ECM system. While our DynamoDB solution supports ACL and document security, it will not be enabled for this benchmark.
- Case Search rather than document search – Similar to the security approach, a typical case or claim search allows users to find a case or claim and then, once in the claim, search through its documents. The benchmark’s initial phase will include an all-repository case search based on AWS Elasticsearch and will not initially include an all-repository document search. Phase 2 will look to add document searching by leveraging Lambda to populate the Elasticsearch index from DynamoDB.
- DynamoDB scaling but not S3 scaling – We are targeting 10 billion unique document objects in DynamoDB pointing to content in S3. The benchmark will share S3 links to actual files in S3, as we did not see the need to test the scaling of S3. The benchmark assumes the content files have already been migrated to S3 with AWS Snowball or another file migration approach in preparation for the migration activities.
- TSG’s Product Set – The benchmark will include migration activities leveraging TSG’s OpenMigrate as well as case viewing and updating with TSG’s OpenContent Management Suite, consistent with our current clients’ access needs. To hit our goals for the benchmark, we were shooting for 12,000 documents migrated per second, or 1 billion documents migrated per day. In our preliminary tests, we were able to increase our throughput to 20,000 documents per second. For viewing, we are targeting sub-second viewing of case folder contents as well as document viewing. We will be supporting both viewing and annotating documents, as well as document combining capabilities, in Phase 1.
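The metadata-in-DynamoDB, content-in-S3 split described above can be illustrated with a minimal sketch. The bucket, key and attribute names are hypothetical: the point is that each DynamoDB item holds only metadata plus a pointer to content already staged in S3.

```python
# Hypothetical document item: metadata lives in DynamoDB, the (large) content
# file stays in S3 and is referenced by a link.
def make_document_item(doc_id, case_id, s3_bucket, s3_key):
    return {
        "doc_id": doc_id,
        "case_id": case_id,
        "content_location": f"s3://{s3_bucket}/{s3_key}",
    }

item = make_document_item("doc-42", "case-7", "tsg-benchmark-content", "samples/claim-form.pdf")
print(item["content_location"])
```

Because the item is just a pointer, ingesting billions of documents exercises DynamoDB write throughput without moving any content bytes, which is exactly the scaling question the benchmark is testing.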
DynamoDB AWS Benchmark Environment
Components of our testing environment include:
- DynamoDB AWS Managed Service – 26,000 write units, minimal read units
- Elasticsearch AWS Managed Service – 8 data nodes – r5.4xlarge.elasticsearch (100 GiB EBS gp2), 3 master nodes – r5.xlarge.elasticsearch
- OpenMigrate – 2 EC2 m5.24xlarge instances (380 GB Java heap – 96 CPUs)
- OpenContent Management Suite and OpenAnnotate – EC2 t2.medium instance (4GB RAM – 2 CPUs)
Post large-scale migration, we would anticipate significantly reducing the OpenMigrate servers (or retiring them if all migration is complete) as well as reconfiguring DynamoDB write and read units to be more in line with daily usage.
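A rough capacity calculation shows how the 26,000 write units relate to the 20,000 documents/second target. One DynamoDB write capacity unit (WCU) covers one write per second for an item up to 1 KB, and larger items consume proportionally more; the 30% headroom figure below is our assumption for illustration, not a stated AWS guideline.

```python
import math

def required_wcus(docs_per_second, item_size_kb, headroom_pct=30):
    # Each write of an item up to 1 KB costs 1 WCU; larger items cost
    # ceil(size in KB) WCUs per write. Integer math keeps the result exact.
    base = docs_per_second * math.ceil(item_size_kb)
    return base + base * headroom_pct // 100

# 20,000 writes/sec of <=1 KB metadata items, with 30% headroom:
print(required_wcus(20000, 1))  # 26000
```

After the bulk load completes, the same arithmetic run against normal daily traffic explains why the table can be scaled down dramatically.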
DynamoDB Benchmark Details – How did we do it?
One of the biggest challenges in creating a representative benchmark is providing realistic data at volume. While most benchmarks, including ours, rely on sample/text data, the benchmark needs to make sure that the documents and folders created in the large repository are sufficiently different to be a real-world example of how the repository would function at scale.
For our sample data, we are going to rely on Open Addresses to create example case folders for each address. We have targeted example case folders for Accounts Payable, Auto Claim, Health Care Claim and Human Resources, with different types of documents for each type of case. Each address will have the four case folders attached, with the folder and document names modified with part of the address to keep each document and folder unique. We are leveraging the multi-threaded capabilities of OpenMigrate, set up with a massive number of threads, to read through the addresses and create the folders and documents in DynamoDB. Since we are not testing the S3 volume store, each document in DynamoDB will point to the same 52 links in S3 for the initial load. Additional documents, versions or annotations added manually after the load will all have new content in both DynamoDB and S3. We are targeting all capabilities except redaction, given the sharing of file sources.
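The load pattern above – many threads fanning out over addresses, each producing four uniquely named case folders – can be sketched as follows. The real benchmark uses OpenMigrate; here a simple in-memory writer stands in for the DynamoDB batch write, and all names are illustrative.

```python
# Simplified sketch of the multi-threaded load: each address yields four case
# folders whose names are made unique by embedding the address itself.
from concurrent.futures import ThreadPoolExecutor

CASE_TYPES = ["Accounts Payable", "Auto Claim", "Health Care Claim", "Human Resources"]

def folders_for_address(address):
    # Part of the address keeps every folder name unique across the repository.
    return [f"{case_type} - {address}" for case_type in CASE_TYPES]

def load_address(address, writer):
    for folder in folders_for_address(address):
        writer(folder)  # real code would batch-write folder/document items to DynamoDB

written = []
addresses = ["123 Main St, Springfield", "9 Elm Ave, Centerville"]
with ThreadPoolExecutor(max_workers=8) as pool:
    for addr in addresses:
        pool.submit(load_address, addr, written.append)

print(len(written))  # 2 addresses x 4 case folders = 8
```

Scaling this shape up – millions of addresses across hundreds of threads on two large EC2 instances – is what drives the 20,000 documents/second throughput.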
For Phase 2 of the benchmark, we are going to test Amazon’s ability to create an Elasticsearch index on the fly, leveraging Lambda to populate the Elasticsearch index from DynamoDB to allow for document and expanded folder metadata searching.
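The usual shape of this pattern is a Lambda function subscribed to the table’s DynamoDB Stream, transforming each new item into an Elasticsearch index action. The handler below is a hedged sketch of that transformation only; the index name is an assumption, and the actual Elasticsearch client call is left as a comment.

```python
# Sketch of a Lambda-side transform: turn a DynamoDB Streams INSERT record
# into an Elasticsearch index action. Stream records wrap each value in a
# type descriptor, e.g. {"S": "some string"}.
def stream_record_to_action(record):
    if record["eventName"] != "INSERT":
        return None  # this sketch only indexes newly created documents
    image = record["dynamodb"]["NewImage"]
    doc = {k: v["S"] for k, v in image.items() if "S" in v}
    # The real handler would send this via the Elasticsearch bulk API.
    return {"_index": "documents", "_id": doc["doc_id"], "_source": doc}

record = {
    "eventName": "INSERT",
    "dynamodb": {"NewImage": {"doc_id": {"S": "doc-42"}, "case_type": {"S": "Auto Claim"}}},
}
action = stream_record_to_action(record)
print(action["_id"])
```

Because the stream delivers items as they are written, the index can be built “on the fly” during ingestion rather than by re-scanning billions of items afterward.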
For Phase 3 of the benchmark, we are planning to provide the ability to add new documents to the case.
For Phase 4 of the benchmark, we are planning on testing concurrent users to verify performance at scale, tentatively targeting 11,000 concurrent users.
DynamoDB Benchmark Details – So what does it do?
One thing that has guided the benchmark efforts from the beginning was getting input from outside advisors, including Alan Pelz-Sharpe of Deep Analysis as well as Rich Medina of Doculabs. Early on, Alan stated that, while multiple companies had done billion-document benchmarks in the past, his concern as an analyst was: “Great – you can hold a billion documents in the repository, what can you do with those documents?” Once the migration activities are complete, here are the different scenarios we plan on supporting.
Benchmark Phase 1 scenarios will include:
- Search for case folder – Users will be able to search across the entire repository for any folder
- View case documents – Users will be able to view a listing of all documents or videos in the folder
- Document Actions – Users will be able to view and update properties and annotate both documents and video
- Folder Actions – Users will be able to view all documents, with user preferences for attribute order and column sorting, and combine documents into a single PDF or into a ZIP file.
- Related Documents and Folders – Users will be able to see related folders
Benchmark Phase 2 scenarios will include:
- Search for documents – With a new Elasticsearch index covering all documents, users will be able to search for documents based on attributes across the repository.
Benchmark Phase 3 scenarios will include:
- Add documents – Users will be able to add documents via a variety of supported methods highlighted in this post.
Benchmark Phase 4 scenario will include:
- Concurrent user test of 11,000 threads performing standard document management, including search, annotation and adding documents.
DynamoDB Benchmark – How to monitor our results
One of the benefits of leveraging OpenMigrate for the bulk load of the benchmark test is OpenMigrate’s ability to monitor a migration. To be transparent, TSG will be blogging every day with our results, providing access to the environment and monitoring of our migrations. Check in to see how we are doing. We are hoping to post daily updates to record both our successes and struggles.
(Update 5/13, 10:29 AM – We have kicked off our latest batch of one billion documents using two OpenMigrate instances, each moving roughly half a billion documents. You can check in on their progress using the links below.)
(Update 5/17 – As the ingestion has finished, these sites are no longer active. See the recordings in subsequent posts.)
http://ec2-3-17-163-13.us-east-2.compute.amazonaws.com/om-a/#/dashboard
http://ec2-3-17-163-13.us-east-2.compute.amazonaws.com/om-b/#/dashboard
Let us know in the comments below if there are other things you would like to see in the benchmark.