DynamoDB 11 Billion Benchmark 11 Thousand Concurrent Users Success!!! – Lessons Learned

TSG initiated our 11 Billion benchmark on Friday, May 10th. As of today we have completed the fourth and final phase, testing our repository with 11,000 concurrent users. This post will share the results.

Background on the 11 Billion Document Benchmark

The first phase of the benchmark was aimed at building a large repository with our OpenMigrate ingestion tool and proving access for OpenContent Search, OpenContent Case and OpenAnnotate. The initial ingestion phase concluded on May 17th with 11 billion documents, ingestion speeds of 20,000 documents per second to DynamoDB, and related folders indexed into Elasticsearch. The second phase of the benchmark focused on building the search indices required for document search over the documents already in DynamoDB, and ended successfully on June 11th. The third phase focused on user addition of documents and finished on June 12th. This last phase focused on testing a large number of users concurrently accessing the system for search, view, add and annotation.

Testing 11,000 Concurrent Users – Benchmark Testing Goals

Goals of the concurrent user test were to replicate some of the different issues we have seen from clients in production when a large number of document management users are accessing the system. Components that we specifically wanted to test include:

Application servers to reveal bottlenecks or choke points
DynamoDB performance to detect and remediate any scanning of the table
Document searching performance with Elasticsearch service
Document retrieval and viewing performance under stress
Document annotations

The benchmark test was patterned after the most common use case for our insurance clients – Claim Viewing. The claim viewing scenario is scripted as a user opening a claim folder with 25 documents followed by viewing 3 to 5 of the documents within our OpenAnnotate viewer. This simulates a user accessing the document management system directly from an insurance claim system. For the test we ran a batch of 11,000 users performing the claim viewing scenario across a selection of 20,000 different medical and auto claims out of the total repository of 11 billion documents. The Claim Viewing scenario tested the Application servers and the DynamoDB table under stress.
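As a rough illustration of the scenario described above, the following Python sketch builds one claim viewing plan per user: pick a claim folder, then pick 3 to 5 of its 25 documents to view. It is not the JMeter script used in the benchmark. The function name, the id formats and the random selection are assumptions made for the example.

```python
import random

DOCS_PER_FOLDER = 25          # from the scenario: each claim folder holds 25 documents
MIN_VIEWS, MAX_VIEWS = 3, 5   # each user views 3 to 5 of them


def build_claim_viewing_plan(user_ids, folder_ids, seed=0):
    """Return a dict mapping each user to one folder and the documents to view."""
    rng = random.Random(seed)  # seeded so a test run can be repeated
    plan = {}
    for user_id in user_ids:
        folder_id = rng.choice(folder_ids)
        view_count = rng.randint(MIN_VIEWS, MAX_VIEWS)
        # Positions of the documents inside the folder, chosen without repeats.
        doc_positions = sorted(rng.sample(range(DOCS_PER_FOLDER), view_count))
        plan[user_id] = {"folder_id": folder_id, "doc_positions": doc_positions}
    return plan


# Hypothetical ids, sized like the benchmark: 11,000 users over 20,000 claims.
users = [f"user{i:05d}" for i in range(11_000)]
folders = [f"claim-{i:05d}" for i in range(20_000)]
plan = build_claim_viewing_plan(users, folders)
print(len(plan))
```

A fixed seed keeps the plan identical between runs, which makes it easier to compare results after a tuning change.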

Searching

Not all users execute a search in a Claim Viewing scenario. Approximately 25% of the entire universe of users require the ability to search the repository to locate claim information; these are often supervisors, management, or administrators. During the performance test we kept the script simple and executed a search on Name and Address for every user, which taxed the Elasticsearch cluster beyond what we originally targeted for testing.

Annotations

The final use case we implemented was to annotate one of the documents viewed during the Claim Viewing test. Approximately 10% of documents are annotated and for the test we planned on annotating a document for 1 of every 10 users. One issue we discovered is that while it was simple to add an annotation, resetting the data and deleting the annotations between each test run was out of scope for what we wanted to accomplish in this benchmark. We elected to postpone the performance testing of the annotations until an update was made to allow overwrites and deletes for the annotations.

Testing 11 Thousand Concurrent Users – Benchmark Testing with AWS

Leveraging Amazon Web Services, we started with a moderate sizing of the application servers, DynamoDB read units, and Elasticsearch cluster. Before we scaled the environment up or down, we ran initial tests to set a baseline.

Test Runs

We started with simple baseline test runs and, as we encountered and resolved issues, expanded the architecture to use two JMeter instances and two OCMS instances. When troubleshooting, we simplified the process for several test runs and used only one JMeter and one OCMS instance.

Test Run	Users	JMeter Instance (#) (vCPU / Memory GiB)	OCMS Instance (#) (vCPU / Memory GiB)	DynamoDB Capacity Units (min read / write)	Elasticsearch Data Nodes (#) – 2 AZs
0,1	100	t2.micro (1) (1 / 1)	t2.medium (1) (2 / 4)	100 / 50	r5.4xlarge (6)
2	2,000	m5a.4xlarge (1) (16 / 64)	t2.medium (1) (2 / 4)	5000 / 50 and 7000 / 50	r5.large (6)
3	2,000	m5a.4xlarge (2) (16 / 64)	r5a.12xlarge (2) (48 / 384)	2000 / 50 (auto scale)	r5.12xlarge (6)
4	5,500	m5a.4xlarge (1) (16 / 64)	r5a.12xlarge (1) (48 / 384)	2000 / 50 (auto scale)	r5.12xlarge (6)
5	5,500	m5a.12xlarge (1) (48 / 192)	r5a.12xlarge (1) (48 / 384)	2000 / 50 (auto scale)	c5.18xlarge (6)
6	11,000	m5a.12xlarge (2) (48 / 192)	r5a.4xlarge (2) (16 / 128)	2000 / 50 (auto scale)	c5.18xlarge (6)
7	11,000	m5a.2xlarge (2) (8 / 32)	r5a.4xlarge (2) (16 / 128)	2000 / 50 (auto scale)	r5.4xlarge (6)

When running the initial small baseline tests and the subsequent 2,000, 5,500, and 11,000 user tests, we identified several issues, bottlenecks and choke points in each tier of the environment.

JMeter instance issues included:

Unviewable characters in the test dataset CSV files caused bad searches and logins. Resolved by changing file encoding settings in JMeter.
Encountered timeout issues and maxed-out CPU. Tested the ramp-up period and task timers to determine their impact on the OCMS instance. Resolved by increasing the CPU of the JMeter instance and raising the limits in JMeter.
JVM generic issue message for out of memory was resolved by increasing the Linux process limit.
Script did not properly handle mime types. Updated script to display mime types correctly.

OpenContent Management Suite issues included:

Encountered thread timeouts. Modified the JVM memory settings and increased the Tomcat threads.
Out of memory errors caused by an issue with garbage collection. Switched to the G1 garbage collector and updated the heap settings.
Linux OS error – too many open files. Resolved by increasing the limit.
Encountered HTTP non-response messages and timeouts. Increasing the Apache httpd threads resolved the issue.
JVM generic issue message for out of memory was resolved by increasing the Linux process limit.
Transformation queue for document viewing maxed out the server CPU. Updated the script to include fewer transformations; we would recommend leveraging external servers.
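Several of the items above came down to operating system limits. A small pre-flight check like the Python sketch below can flag low limits before a test run starts. The minimum values are placeholders chosen for illustration, not the values TSG used. The sketch also assumes a Linux host, where threads count against the per-user process limit.

```python
import resource

# Hypothetical minimums for a load-test host; tune these for the real workload.
ASSUMED_MINIMUMS = {
    "open files (RLIMIT_NOFILE)": (resource.RLIMIT_NOFILE, 65_536),
    "processes and threads (RLIMIT_NPROC)": (resource.RLIMIT_NPROC, 16_384),
}


def is_below(limit, minimum):
    """True when a soft limit is finite and smaller than the wanted minimum."""
    return limit != resource.RLIM_INFINITY and limit < minimum


def check_limits(minimums=ASSUMED_MINIMUMS):
    """Return (name, soft_limit, minimum) for every limit that looks too low."""
    findings = []
    for name, (resource_id, minimum) in minimums.items():
        soft, _hard = resource.getrlimit(resource_id)
        if is_below(soft, minimum):
            findings.append((name, soft, minimum))
    return findings


for name, soft, minimum in check_limits():
    print(f"{name}: soft limit {soft} is below the assumed minimum {minimum}")
```

The check reads the limits of the current process only, so it needs to run as the same user, and through the same startup path, as the JMeter or Tomcat process it is meant to protect.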

DynamoDB issues included:

Searching with a bad value for an object id caused a table scan and then blocked the remaining threads. Updated code to validate id and prevent the scan.
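The fix described above can be sketched in Python as follows: validate the id first, and only then issue a lookup by primary key. This is an illustration, not the OCMS code. The UUID-like id format and the `objectId` key attribute are assumptions, since the post does not describe the real id format or table schema. The `table` argument is expected to behave like a boto3 DynamoDB `Table` resource, whose `get_item(Key=...)` call reads a single item by key.

```python
import re

# Assumption: object ids are lowercase UUID-like strings.
OBJECT_ID_PATTERN = re.compile(r"[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def is_valid_object_id(value):
    """Accept only strings that match the assumed id format exactly."""
    return isinstance(value, str) and OBJECT_ID_PATTERN.fullmatch(value) is not None


def fetch_document(table, object_id):
    """Fetch by primary key only; reject malformed ids before they reach DynamoDB."""
    if not is_valid_object_id(object_id):
        raise ValueError(f"rejected malformed object id: {object_id!r}")
    # "objectId" is a hypothetical key attribute name.
    response = table.get_item(Key={"objectId": object_id})
    return response.get("Item")  # None when no item has that key
```

Failing fast on a bad id means the request never falls through to a broader query path, which is where an unintended scan would otherwise start.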

Elasticsearch issues included:

Searches with unviewable/hidden characters caused non-terminating search errors and maxed out CPU and memory. Test updated to not include unviewable characters.
Executing searches concurrently for each of the 11,000 users across the 5 shards in the cluster maxed out the Elasticsearch cluster. Modified the test to include a more reasonable percentage of users performing a search versus direct access to a folder (the typical use case).
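A back-of-the-envelope calculation shows why searching for every user was so expensive. An Elasticsearch search against an index runs on one copy of each of its shards, so every search request becomes several shard-level operations. The Python sketch below assumes all searches arrive in a single wave and target one index with 5 shards. The 25% share is the estimate of searching users given earlier in this post.

```python
def shard_level_searches(concurrent_users, search_share, shards):
    """Estimate shard-level search operations for one wave of concurrent searches."""
    searching_users = round(concurrent_users * search_share)
    return searching_users * shards


print(shard_level_searches(11_000, 1.0, 5))   # every user searches: 55000
print(shard_level_searches(11_000, 0.25, 5))  # about 25% of users search: 13750
```

Under these assumptions, dropping from every user to a quarter of users cuts the shard-level work per wave from 55,000 operations to 13,750.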

Each of the points above was resolved and retested in subsequent test runs.

One bottleneck we experienced that did represent a real-world scenario was maxing out the transformation queue on the OCMS server. For clients with a large volume of users, the transformation queue and process are moved off the OCMS server and scaled out separately. TSG can provide reference architectures for clients implementing OCMS with a large number of users.

Creating the Test Data

The test data for each of the runs was pulled directly from the Elasticsearch cluster using a Python script. Details for ten thousand auto claim and ten thousand medical claim folders were selected and split across the 11,000 users who were generated in the system for testing. The JMeter test scripts read two CSV files, one for users and one for claim folders and properties.
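The splitting step can be sketched as below. This is not the script used for the benchmark, which pulled its data from Elasticsearch. The ids here are synthetic, and the round-robin assignment and the CSV column names are assumptions made for the example.

```python
import csv
import io


def assign_folders(user_ids, folder_ids):
    """Deal folders out round-robin so the load is spread evenly across users."""
    assignments = {user_id: [] for user_id in user_ids}
    for index, folder_id in enumerate(folder_ids):
        assignments[user_ids[index % len(user_ids)]].append(folder_id)
    return assignments


def write_claims_csv(assignments, out):
    """Write one row per (user, folder) pair; column names are hypothetical."""
    writer = csv.writer(out)
    writer.writerow(["username", "folder_id"])
    for user_id, folders in assignments.items():
        for folder_id in folders:
            writer.writerow([user_id, folder_id])


users = [f"user{i:05d}" for i in range(11_000)]
folders = [f"claim-{i:05d}" for i in range(20_000)]
assignments = assign_folders(users, folders)

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_claims_csv(assignments, buffer)
```

With 20,000 folders and 11,000 users, round-robin gives the first 9,000 users two folders each and the remaining 2,000 users one folder each.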

Creating the Test Scripts

Defining realistic test scenarios seemed fairly simple to do at the outset but required more iteration than expected. We started by using BlazeMeter to record a sample set of actions we wanted to take in OCMS. The recording was exported to a JMX file and then edited to better mimic entry to OCMS from a claim system and to accept dynamic parameters from the CSV test data files. To make the test plan realistic and dynamic, we added HTTP response extractors and conditional logic to link the task steps for searching, viewing, and annotating together.

Lessons Learned – Concurrent User Testing

While the Claim Viewing scenario on its surface is a well-defined set of REST endpoints, adding in the automation to process responses and conditionally execute statements took a few days longer than we expected and required more iterative baseline testing loops than originally planned.

The more interesting troubleshooting issues occurred when we scaled up to two thousand users. At that point we observed DynamoDB scans where we had not seen them at a smaller scale. We cautiously disabled pieces of the test script and added several debuggers. It took several iterations of increasing thread limits and open file limits to reduce the noise in the testing logs before we reached the issue of unviewable data in the CSV test files. Even after adjusting JMeter to UTF-8 encoding, the first line of the CSV file might start with an unviewable character. We added a conditional statement in JMeter to avoid executing any statements containing the character. Alternatively, a pre-processor or other script could have stripped the character from the file. This issue stemmed from our address data file, and we would not anticipate it in a normal client environment.
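A pre-processor of the kind mentioned above could look like the following Python sketch. It was not part of the benchmark. It assumes the unviewable characters were Unicode format or control characters such as a byte order mark or a zero-width space, which this post does not confirm.

```python
import csv
import io
import unicodedata


def clean_cell(value):
    """Drop invisible characters: format (Cf, e.g. BOM) and control (Cc) code points."""
    visible = (ch for ch in value if unicodedata.category(ch) not in ("Cf", "Cc"))
    return "".join(visible).strip()


def clean_csv(src, dst):
    """Copy CSV rows from src to dst with every cell cleaned."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow([clean_cell(cell) for cell in row])


# In-memory demonstration; real files would be opened with newline="".
dirty = io.StringIO("\ufeffname,address\nJane\u200b Doe,12 Main St\n")
cleaned = io.StringIO()
clean_csv(dirty, cleaned)
print(cleaned.getvalue())
```

If the only problem is a byte order mark at the start of the file, opening it in Python with `encoding="utf-8-sig"` removes that mark on read without any further processing. Note that the sketch also drops tab characters inside cells, because tabs are control characters.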

Once we had ensured that only good data was being fed into the OCMS test script and had resolved the thread, process, and open file limits, we scaled up to 5,500 and 11,000 users. While other smaller issues with large numbers of users surfaced, the team was confident that they could be resolved for a production client and ended the testing.

Summary – Concurrent Users Benchmark Recommendations

There are several recommendations we’d make for a client wanting to build out a production version of the 11 billion benchmark for a set of 11,000 users.

During document ingestion, when writing data into Elasticsearch, create enough replicas along with shards to service the expected search demand. If that level of demand is unknown, expect that the Elasticsearch index may need to be rebuilt in order to balance the search needs with the number of shards and replicas.
Plan that the application instance will need to scale to satisfy the viewer transformation needs, or that the transformation process will need to be farmed out to a separate server or set of servers. If the documents are pre-transformed and available on S3, then this may not become a bottleneck. Pre-transformation might be done at the point of ingestion, for example.
TSG has put together a day 0 guide for tuning our implementations and has included all of the lessons learned on file, process, thread and JVM limits. Our approach was to start out with out-of-the-box AWS EC2 instances and services to observe the behavior of the tests and identify where our clients can expect to see limitations. This approach allowed us to see break points, add more capabilities and scale up to provide the most cost-effective instances.
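To make the first recommendation concrete, the Python sketch below builds the settings body for creating an Elasticsearch index and counts the shard copies available to serve searches. The shard count of 5 matches the benchmark, while the replica count of 2 is only an illustrative assumption. In Elasticsearch, `number_of_shards` cannot be changed in place on an existing index, whereas `number_of_replicas` can be adjusted later, which is why an unknown search demand may end in a rebuild.

```python
import json


def index_settings(primary_shards, replicas):
    """Build the settings body for creating an Elasticsearch index."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,  # set at index creation
                "number_of_replicas": replicas,      # adjustable on a live index
            }
        }
    }


def total_shard_copies(primary_shards, replicas):
    """Each primary shard plus its replicas can serve search requests."""
    return primary_shards * (1 + replicas)


print(json.dumps(index_settings(5, 2), indent=2))
print(total_shard_copies(5, 2))  # 15
```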

We hope this journey through our 11 billion benchmark has inspired a new look at how ECM solutions can be implemented using AWS services.
