TSG initiated our 11 billion document benchmark on Friday, May 10th. The first phase was aimed at building a large repository with our OpenMigrate ingestion tool and proving access for OpenContent Search, OpenContent Case and OpenAnnotate. That initial ingestion phase concluded on May 17th with 11 billion documents ingested at speeds of 20,000 documents per second to DynamoDB, with the related folders indexed into Elasticsearch. After taking some time to decompress, we started the second phase of the benchmark, focused on building the search indices required for document search against the DynamoDB documents; that phase completed successfully on June 11th. Today we have successfully completed the third phase of the project, adding documents. This post will highlight the success of this third phase of the benchmark as well as present how the final phase, testing a large number of concurrent users, will proceed.
Adding Documents – Benchmark Approach
When setting up the benchmark, we specifically chose to separate the first large-scale ingestion phase (11 billion documents and almost 1 billion folders) from the third phase of users adding documents to folders. This approach is consistent with how many of our larger clients operate: they have chosen to expose our OpenContent interfaces on their existing content after a large or rolling migration before allowing users to add documents to the new repository. (See related webinar with Tony Parzgnat on how a rolling migration can help retire FileNet early)
One of the key discussions for this phase focused on how to add and retrieve content from a folder. As we mentioned in our initial post describing the benchmark, a key requirement was viewing case documents: users would be able to “view a listing of all documents or videos in the folder”.
As part of our benchmark, the team tested two approaches for displaying the contents of a folder.
- JSON storage of document objects – The folder DynamoDB object contains all of the document ids in a repeating field. This allows for fast viewing of the objects in the folder, a typical requirement for case management/folder viewing. Benefits included fast, scalable access to folder objects without a large Elasticsearch index.
- Elasticsearch for documents – In our first ingestion phase, Elasticsearch was only being used for access to folder objects. Phase 2 of the benchmark indexed a portion of the 11 billion documents to test leveraging Elasticsearch for displaying the objects contained in a folder.
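The two approaches above can be sketched in a few lines of Python. This is an illustrative model only, not the benchmark code; the field names (`folder_id`, `document_ids`) are hypothetical stand-ins.

```python
# Approach 1 - JSON storage: the folder item itself carries a repeating
# field of document ids, so listing a folder is a single key lookup.
folder_item = {
    "folder_id": "case-001234",
    "name": "Claim 001234",
    "document_ids": ["doc-1", "doc-2", "doc-3"],  # repeating field
}

def list_folder_json(folder):
    """List a folder's documents straight from the folder object
    (one DynamoDB read, no search index involved)."""
    return folder["document_ids"]

# Approach 2 - Elasticsearch: each document is indexed with its folder id,
# and listing a folder becomes a term query against the document index.
def folder_query(folder_id):
    """Build the Elasticsearch query body that returns a folder's documents."""
    return {"query": {"term": {"folder_id": folder_id}}}

print(list_folder_json(folder_item))
print(folder_query("case-001234"))
```

The trade-off is visible in the sketch: approach 1 answers the folder-listing question with a single item read but caps out as the repeating field grows, while approach 2 requires every document to be indexed just to answer the same question.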
After testing, the team determined that the JSON object store made the most sense for our sample set, but that we would continue to offer both alternatives for customers based on the number of documents in the folder. (TSG has one client with 65,000 documents in a folder.)
Add Document Results
Below is a video showing documents being added to multiple types of folders with a variety of ingestion options. The video highlights adding and annotating content in the large repository.
Lessons Learned – Elasticsearch versus DynamoDB
We found the current pricing of Elasticsearch versus DynamoDB to be very different, with Elasticsearch costing more due to the number of cores needed to support large ingestion and indices. In our benchmark, DynamoDB stored around 13 times as many items as Elasticsearch did, yet Elasticsearch cost about 1.3 times more than DynamoDB over the course of the benchmark.
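A quick back-of-the-envelope calculation shows how stark that difference is on a per-item basis, using only the two ratios reported above:

```python
# Relative figures from the benchmark observations above.
dynamo_items = 13.0          # DynamoDB held ~13x as many items
elastic_items = 1.0          # Elasticsearch item count, normalized
elastic_cost_multiple = 1.3  # Elasticsearch total cost vs. DynamoDB

# Cost per stored item on Elasticsearch, relative to DynamoDB:
# 1.3x the total spend spread over 1/13th the items.
per_item_ratio = elastic_cost_multiple * (dynamo_items / elastic_items)
print(per_item_ratio)  # ~16.9x more per item on Elasticsearch
```

In other words, storing a folder/document object in Elasticsearch ran roughly 17 times the per-item cost of DynamoDB in this benchmark.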
Unlike DynamoDB where we could scale up for ingestion and then drop read/write units once the large migration was complete, our approach required Elasticsearch servers to be maintained and operational for both ingestion and later access. DynamoDB read/write units are priced and maintained very differently than Elasticsearch EC2 instances. Maintaining the folder objects/documents in Elasticsearch was price prohibitive for a simple use case that can be accomplished with DynamoDB.
Due to the pricing and the added overhead, we concluded that leveraging DynamoDB for most folder viewing requirements made the most sense from both a cost and a scalability perspective.
Phase 4 – What’s Next and Last
We are looking to finish the benchmark within the next week or two. Our last test will be a concurrent user test of 11,000 threads performing standard document management operations including search, annotate and adding documents.
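A minimal sketch of this kind of concurrent-user test can be built with a thread pool. This is not the benchmark harness itself; the operation names are hypothetical stand-ins for the real service calls, and the pool here is scaled far below the planned 11,000 threads so it runs anywhere:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulated_user(user_id):
    """Each thread performs one of the benchmarked operation types.
    A real test would call the repository services here instead."""
    operation = random.choice(["search", "annotate", "add_document"])
    return (user_id, operation)

# Scaled-down pool for illustration; the actual test targets 11,000 threads.
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(simulated_user, range(1000)))

print(len(results))  # 1000 simulated operations completed
```

The real test would replace `simulated_user` with calls against the search, annotation and add-document services and measure latency and throughput under load.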
Stay tuned as we look to wrap up the benchmark. Thanks again for all of your questions and thoughts.