TSG initiated our 11 billion document benchmark test on Friday, May 10th. The first phase of the benchmark was aimed at building a large repository with our OpenMigrate ingestion tool and proving access for OpenContent Search, OpenContent Case and OpenAnnotate. The initial ingestion phase concluded on May 17th with 11 billion documents, ingestion speeds of 20,000 documents per second to DynamoDB, and related folders indexed into Elasticsearch. We took some time to decompress and started the second phase of the benchmark last week, focused on building the search indices required for document search against the DynamoDB documents. TSG successfully completed the indexing benchmark today. This post will highlight the success of this phase of the benchmark as well as present how the final two phases will proceed.
Elasticsearch Index – Benchmark Approach
One of the key focus areas of the benchmark in regards to the document index was departmental search versus a complete repository search. As we pointed out in our DynamoDB and AWS post on “How to build your own ECM capabilities for massive scale and performance”, we recommended pushing for multiple, efficient search indices rather than one large index, to allow content services to perform more quickly with less complexity. With the success, capabilities and cost of robust search tools like Solr and Elasticsearch, TSG has been recommending for years that clients create focused indices rather than one large “do all” index that serves every possible purpose.
For the benchmark, we focused on two indices:
- Large Folder Index – All folders would have an index for basic search. Typically in a case management scenario, large clients are looking to navigate to a folder to look at the documents in the case. Getting access to the case folder and its documents quickly is a key requirement.
- Focused Document Index – More of a focused index for a subset of documents. These indices could be created when needed for specific scenarios or requirements.
While we felt very comfortable that we could create an Elasticsearch index for all the documents in the repository especially with Sharding capabilities, the index services from AWS come with a higher price than just DynamoDB storage. Our main focus was to prove out on a smaller sample the indexing capabilities while providing a realistic approach for clients to consider when building and maintaining Elasticsearch indices.
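As a sketch of the “focused index” idea above, the function below builds a small, purpose-built mapping for a case-folder index. The field names, index name, and shard counts are illustrative assumptions, not the actual benchmark schema; the point is a schema scoped to one job (folder navigation and basic search) rather than a single do-all repository index.

```python
def focused_folder_mapping() -> dict:
    """Mapping for a small, purpose-built folder index.

    All field names and counts here are illustrative assumptions.
    The schema carries only what case-folder navigation needs,
    rather than every attribute in the repository.
    """
    return {
        "settings": {
            "number_of_shards": 6,       # illustrative: sized to the data nodes
            "number_of_replicas": 1,
        },
        "mappings": {
            "properties": {
                "folder_path":  {"type": "keyword"},  # exact-match navigation
                "case_number":  {"type": "keyword"},
                "case_name":    {"type": "text"},     # full-text search
                "created_date": {"type": "date"},
            }
        },
    }

# With the Python Elasticsearch client and a reachable cluster,
# creating the index would look like:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   es.indices.create(index="folders", body=focused_folder_mapping())
```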
Folder Index Results
As part of the 11 billion document ingestion, we leveraged OpenMigrate to create an Elasticsearch index entry for every folder in the repository, finishing with 925,837,980 folders in the Elasticsearch index. We created this index as part of the OpenMigrate ingestion process, which was storing 21,000 objects per second. Statistics from the folder indexing benchmark include:
- 925,837,980 folders stored in DynamoDB and indexed in Elasticsearch
- 6 Data Nodes and 3 Master Elasticsearch servers to maintain the index
- Objects (documents and folders) ingested per second – 21,000
- Indexing time for 925,837,980 folders – 159 hours
- Elasticsearch Index Size – 372 Gigabytes – DynamoDB Repository Size – 5.32 Terabytes
Below is the search video we recorded during the ingestion process.
Document Index Results
For the document index, we wanted to show a scenario where an index would be created for a specific purpose. We decided to index 1 million documents from our accounts payable scenario as a feasible test. This is consistent with client experience, where clients have said “I only need to index the last X months for X”. Statistics from the document indexing benchmark are different from the DynamoDB effort, as the content was already in DynamoDB (and needed to be retrieved to be indexed):
- 1 million documents already existing in DynamoDB
- Documents indexed per second – 501.4
- Indexing time – 33 minutes 14 seconds
- Elasticsearch Index Increase – 4 Gigabytes – Total Size 376 Gigabytes
This effort scanned through the existing DynamoDB content, indexing each object whose type was an AP invoice document, until 1 million documents were found.
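The scan-and-filter step above can be sketched as a small generator that turns scanned items into Elasticsearch bulk actions. The attribute names (`object_type`, `doc_id`), type value, and index name are assumptions for illustration; the actual benchmark used OpenMigrate rather than this hand-rolled loop.

```python
from typing import Iterable, Iterator

# Hypothetical type value -- the benchmark's actual DynamoDB schema is not public.
AP_INVOICE_TYPE = "ap_invoice"

def invoice_actions(items: Iterable[dict], limit: int) -> Iterator[dict]:
    """Turn scanned DynamoDB items into Elasticsearch bulk actions.

    Mirrors the approach described above: scan everything, keep only the
    objects whose type is an AP invoice, and stop once `limit` documents
    have been found.
    """
    found = 0
    for item in items:
        if item.get("object_type") != AP_INVOICE_TYPE:
            continue
        yield {
            "_index": "ap_invoices",
            "_id": item["doc_id"],
            "_source": {k: v for k, v in item.items() if k != "doc_id"},
        }
        found += 1
        if found >= limit:
            return

# Against live services (needs AWS credentials and a reachable cluster),
# the items would come from paginated boto3 scans, e.g.:
#   table = boto3.resource("dynamodb").Table("documents")
#   helpers.bulk(es, invoice_actions(scan_all_pages(table), 1_000_000))
```

Here `scan_all_pages` stands in for a paginated `table.scan()` loop; it is not a real helper in boto3.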
We were able to create the 1 million document index after the DynamoDB ingestion process was complete in just under 35 minutes, or roughly 500 documents per second. Unlike the main ingestion process, where we only had 7 servers, we had added more processing power to the Elasticsearch cluster to get up to 9 servers. We were confident that we could add additional servers to further reduce the indexing time and improve throughput.
Lessons Learned – AWS Lambda
We were hoping to use AWS Lambda for indexing from DynamoDB to Elasticsearch. We found that:
- AWS Lambda currently has a 5 minute execution timeout, making our large volume difficult to index.
- We considered indexing from DynamoDB Streams, but stream records are only retained for 24 hours, so they didn’t fit our scenario of building a purpose-built index after the fact.
We decided to leverage OpenMigrate to both read and index documents to be consistent and provide the same infrastructure for both initial ingestion as well as the creation of new indices.
Lessons Learned – Scaling Elasticsearch versus DynamoDB
We found the current pricing of Elasticsearch versus DynamoDB to be very different, with Elasticsearch costing more due to the size of the cores needed to support large ingestion and indices. In our benchmark, DynamoDB stored around 13 times the amount of data that Elasticsearch did, but Elasticsearch cost about 1.3 times more than DynamoDB over the course of the benchmark.
Unlike DynamoDB where we could scale up for ingestion and then drop read/write units once the large migration was complete, Elasticsearch requires servers to be maintained and operational for both ingestion and later access. DynamoDB read/write units are priced and maintained very differently than Elasticsearch EC2 instances.
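The scale-down pattern for DynamoDB described above can be sketched with the `UpdateTable` API: high write capacity during the bulk migration, then a sharp drop once only read traffic remains. The table name and unit values below are illustrative assumptions, not the benchmark's actual settings.

```python
def scaled_throughput(table_name: str, read_units: int, write_units: int) -> dict:
    """Build UpdateTable parameters for re-provisioning a DynamoDB table.

    Illustrative sketch: after a bulk migration, write units can be
    dropped sharply while read units are kept for serving traffic.
    """
    return {
        "TableName": table_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }

# Applying the change requires AWS credentials:
#   import boto3
#   boto3.client("dynamodb").update_table(**scaled_throughput("documents", 5000, 100))
```

There is no Elasticsearch equivalent of this call: the EC2 instances backing the cluster keep running (and billing) whether or not ingestion is active, which is the cost asymmetry noted above.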
Phase 3 and Phase 4 – What’s Next
We are looking to finish the benchmark over the next one to two weeks. Upcoming tests will include:
Benchmark Phase 3 – Add documents – users will be able to add documents through a variety of supported methods highlighted in this post.
Benchmark Phase 4 – Concurrent user test of 11,000 threads performing standard document management tasks, including search, annotation and adding documents.
Stay tuned as we look to wrap up the benchmark. Thanks again for all of your questions and emails.
Jeff Potts says
I like hearing about the progress you are making with your custom content repository built on AWS. Thanks for sharing all of this info!
A few things to consider regarding Elasticsearch…
You mentioned that Elasticsearch nodes must be maintained after ingestion rather than being scaled down. But you can scale down a cluster as long as you’ve left enough nodes to handle the query traffic and the number of shards you’ve allocated. Although the number of primary shards cannot be changed, the number of replica shards can, so if you needed to, you could decrease the number of replica shards.
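As a minimal sketch of that replica adjustment (the index name and client setup are assumptions, not anything from the benchmark):

```python
def replica_settings(replicas: int) -> dict:
    """Settings body for changing an index's replica shard count.

    Primary shard count is fixed at index creation, but replicas can
    be changed at any time -- which is what makes scaling a cluster
    back down safe once heavy ingestion is over.
    """
    return {"index": {"number_of_replicas": replicas}}

# With the Python client ("folders" is an illustrative index name):
#   es.indices.put_settings(index="folders", body=replica_settings(1))
```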
Another thing to consider is the mix of different Elasticsearch node types. It sounds like you used only Master nodes and Data nodes, which is fine, but you can also have Ingest nodes. As the name suggests these nodes are dedicated to ingesting data. If your ingestion was not full-time you could spin up ingestion nodes to handle ingestion, then turn them off while they are not being used.
It sounds like you were able to get some good throughput on your Elasticsearch cluster. In a real-world scenario it makes sense to spend some time tuning the cluster. On one of our projects, we found that we got better query throughput by increasing the number of client nodes (nodes that are neither master nor data nodes).
By tweaking the number of nodes and different types of nodes you can ensure that you have maximum throughput for both querying and indexing, and that mix can change over time as your workload changes.
Joe Hof says
Thanks so much for your comment, really good stuff on Elasticsearch configuration. Just want to respond to a few of your points.
In regards to scaling back the cluster as long as you have enough nodes to handle query traffic, that is spot on. We found you really needed the horsepower when actually indexing, but it was overkill for search, so scaling back helped on cost while maintaining consistent search performance.
All of the node and shard tweaks you mentioned are great and would benefit a normal Elasticsearch instance. Unfortunately, the Amazon Elasticsearch managed service only has Master and Data nodes, so we were a little limited in how we could tweak the setup outside of those two. We thought the benefits of the managed service (being able to scale server size and clusters as needed) outweighed the con of not having all of the node types that an on-premises Elasticsearch can have.
However, if we are ever deploying DynamoDB with Elasticsearch installed to a server instead of using the managed service, we will definitely use all of the advice posted here!