
Technology Services Group

DynamoDB 11 Billion Benchmark 11 Thousand Concurrent Users Success!!! – Lessons Learned


June 19, 2019

TSG initiated our 11 Billion benchmark on Friday, May 10th.  As of today we have completed the fourth and final phase, testing our repository with 11,000 concurrent users.  This post will share the results.

11 Billion Document White Paper

Background on the 11 Billion Document Benchmark

The first phase of the benchmark was aimed at building a large repository with our OpenMigrate ingestion tool and proving access for OpenContent Search, OpenContent Case and OpenAnnotate.  The initial ingestion phase concluded on May 17th with 11 billion documents ingested at speeds of 20,000 documents per second into DynamoDB, with the related folders indexed into Elasticsearch.  The second phase, which ended successfully on June 11th, focused on building the search indices required for document search across the documents already in DynamoDB.  The third phase of the project focused on user addition of documents and finished on June 12th.  This last phase focused on testing a large number of users concurrently accessing the system for search, view, add and annotation.

Testing 11,000 Concurrent Users – Benchmark Testing Goals

Goals of the concurrent user test were to replicate some of the different issues we have seen from clients in production when a large number of document management users are accessing the system.  Components that we specifically wanted to test include:

  • Application servers to reveal bottlenecks or choke points
  • DynamoDB performance to detect and remediate any scanning of the table
  • Document searching performance with Elasticsearch service
  • Document retrieval and viewing performance under stress
  • Document annotations

The benchmark test was patterned after the most common use case for our insurance clients – Claim Viewing. The claim viewing scenario is scripted as a user opening a claim folder with 25 documents followed by viewing 3 to 5 of the documents within our OpenAnnotate viewer.  This simulates a user accessing the document management system directly from an insurance claim system. For the test we ran a batch of 11,000 users performing the claim viewing scenario across a selection of 20,000 different medical and auto claims out of the total repository of 11 billion documents.  The Claim Viewing scenario tested the Application servers and the DynamoDB table under stress.
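As a back-of-envelope sketch, the Claim Viewing scenario above can be expressed as a per-user plan generator. The folder, claim, and user counts come from the test described here; the data shapes, field names, and random seed are illustrative only:

```python
import random

DOCS_PER_CLAIM = 25     # each claim folder in the test holds 25 documents
TOTAL_CLAIMS = 20_000   # medical and auto claims selected for the test
TOTAL_USERS = 11_000    # concurrent users in the final phase

def claim_viewing_plan(user_id: int, rng: random.Random) -> dict:
    """One user's Claim Viewing scenario: open a claim folder, then
    view 3 to 5 of its 25 documents (data shape is illustrative)."""
    claim = rng.randrange(TOTAL_CLAIMS)
    docs_to_view = rng.sample(range(DOCS_PER_CLAIM), rng.randint(3, 5))
    return {"user": user_id, "claim": claim, "view": docs_to_view}

rng = random.Random(42)  # fixed seed so the generated plan is reproducible
plans = [claim_viewing_plan(u, rng) for u in range(TOTAL_USERS)]
```

A load driver such as JMeter would then replay each plan as a folder-open request followed by the individual document views.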

Searching

Not all users execute a search in a Claim Viewing scenario. Approximately 25% of the entire universe of users require the ability to search the repository to locate claim information; these are often supervisors, managers, or administrators. During the performance test we kept the script simple and executed a search on name and address for each user, which taxed the Elasticsearch cluster beyond what we had originally targeted for testing.

Annotations

The final use case we implemented was to annotate one of the documents viewed during the Claim Viewing test. Approximately 10% of documents are annotated, so for the test we planned on annotating a document for 1 of every 10 users. One issue we discovered is that while it was simple to add an annotation, resetting the data and deleting the annotations between each test run was out of scope for what we wanted to accomplish in this benchmark. We elected to postpone performance testing of the annotations until an update is made to allow overwrites and deletes for annotations.
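Mixing the scenarios together, the workload assignment might be sketched as follows. The assignment mechanism (a random draw for the ~25% of searchers, every tenth user for annotation) is our illustrative reading of the percentages above, not the actual test script:

```python
import random

def assign_roles(num_users: int, rng: random.Random) -> list:
    """Tag each simulated user with optional search (~25% of users) and
    annotate (1 in 10 users) steps on top of the base Claim Viewing flow."""
    roles = []
    for user in range(num_users):
        roles.append({
            "user": user,
            "search": rng.random() < 0.25,  # ~25% of users perform a search
            "annotate": user % 10 == 0,     # 1 of every 10 users annotates
        })
    return roles

rng = random.Random(7)  # fixed seed for a reproducible mix
roles = assign_roles(11_000, rng)
searchers = sum(r["search"] for r in roles)
annotators = sum(r["annotate"] for r in roles)
```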

Testing 11 Thousand Concurrent Users – Benchmark Testing with AWS

Leveraging Amazon Web Services, we started with a moderate sizing of the application servers, DynamoDB read units, and Elasticsearch cluster. Before we scaled the environment up or down, we ran initial tests to set a baseline.

Test Runs

We started with simple baseline test runs and, as we encountered and resolved issues, expanded the architecture to two JMeter instances and two OCMS instances. When troubleshooting several test runs, we simplified the process back down to one JMeter and one OCMS instance.

| Test Run | Users  | JMeter Instance (#) (vCPU / memory) | OCMS Instance (#) (vCPU / memory) | DynamoDB Units (min read / write) | Elasticsearch Data Nodes (#) – 2 AZs |
|----------|--------|-------------------------------------|-----------------------------------|-----------------------------------|--------------------------------------|
| 0, 1     | 100    | t2.micro (1) (1 / 1)                | t2.medium (1) (2 / 4)             | 100 / 50                          | r5.4xlarge (6)                       |
| 2        | 2,000  | m5a.4xlarge (1) (16 / 64)           | t2.medium (1) (2 / 4)             | 5000 / 50 and 7000 / 50           | r5.large (6)                         |
| 3        | 2,000  | m5a.4xlarge (2) (16 / 64)           | r5a.12xlarge (2) (48 / 384)       | 2000 / 50 (auto scale)            | r5.12xlarge (6)                      |
| 4        | 5,500  | m5a.4xlarge (1) (16 / 64)           | r5a.12xlarge (1) (48 / 384)       | 2000 / 50 (auto scale)            | r5.12xlarge (6)                      |
| 5        | 5,500  | m5a.12xlarge (1) (48 / 192)         | r5a.12xlarge (1) (48 / 384)       | 2000 / 50 (auto scale)            | c5.18xlarge (6)                      |
| 6        | 11,000 | m5a.12xlarge (2) (48 / 192)         | r5a.4xlarge (2) (16 / 128)        | 2000 / 50 (auto scale)            | c5.18xlarge (6)                      |
| 7        | 11,000 | m5a.2xlarge (2) (8 / 32)            | r5a.4xlarge (2) (16 / 128)        | 2000 / 50 (auto scale)            | r5.4xlarge (6)                       |

When running the initial small baseline tests and the subsequent 2,000, 5,500, and 11,000 user tests, we identified several issues, bottlenecks, and choke points in each tier of the environment.

JMeter instance issues included:

  • Unviewable characters in the test dataset CSV files caused bad searches and logins. Resolved by changing the file encoding settings in JMeter.
  • Encountered timeouts and maxed-out CPU. Tested ramp-up periods and task timers to determine the impact on the OCMS instance. Resolved by increasing the CPU available to JMeter and raising JMeter's internal limits.
  • A generic JVM out-of-memory message was resolved by increasing the Linux process limit.
  • The script did not properly handle MIME types. Updated the script to handle MIME types correctly.

OpenContent Management Suite issues included:

  • Encountered thread timeouts. Modified the JVM memory settings and increased the Tomcat threads.
  • Out-of-memory errors caused by garbage collection. Modified the JVM to use the G1 collector and updated the heap settings.
  • Linux OS error – too many open files. Resolved by increasing the open-file limit.
  • Encountered HTTP non-response messages and timeouts; increasing the Apache httpd threads resolved the issue.
  • A generic JVM out-of-memory message was resolved by increasing the Linux process limit.
  • The transformation queue for document viewing maxed out the server CPU. Updated the script to request fewer transformations; we would recommend offloading transformation to external servers.

DynamoDB issues included:

  • Searching with a bad value for an object ID caused a table scan and then blocked the remaining threads. Updated the code to validate the ID and prevent the scan.
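The fix amounts to validating the object ID before it can ever reach the table, so a malformed value can never degrade into a scan. A minimal sketch, assuming a hypothetical 32-character hex ID format (adjust the pattern to the real scheme):

```python
import re

# Hypothetical object-id format: 32 lowercase hex characters.
OBJECT_ID_RE = re.compile(r"^[0-9a-f]{32}$")

def safe_get_item(object_id: str) -> dict:
    """Reject malformed ids up front so a bad value can never fall back
    to a full table scan; only well-formed ids reach the keyed lookup."""
    if not OBJECT_ID_RE.fullmatch(object_id):
        raise ValueError(f"malformed object id: {object_id!r}")
    # ... issue a keyed GetItem/Query against DynamoDB here ...
    return {"id": object_id}
```

Because DynamoDB lookups by a well-formed key never scan, pushing the validation in front of the data layer keeps one bad request from starving every other thread.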

Elasticsearch issues included:

  • Searches containing unviewable/hidden characters caused non-terminating search errors and maxed out CPU and memory. Updated the test to exclude unviewable characters.
  • Executing searches concurrently for each of the 11,000 users across the 5 shards in the cluster maxed out the Elasticsearch cluster. Modified the test to use a more reasonable percentage of users performing search versus direct folder access (the typical use case).

Each of the issues above was resolved and retested in subsequent test runs.

One bottleneck we experienced that did represent a real-world scenario was maxing out the transformation queue on the OCMS server. For clients with a large volume of users, the transformation queue and process are moved off the OCMS server and scaled out separately.  TSG can provide reference architectures for clients implementing OCMS with a large number of users.

Creating the Test Data

The test data for each of the runs was pulled directly from the Elasticsearch cluster using a Python script. Details for ten thousand auto claim and ten thousand medical claim folders were selected and split across the 11,000 users generated in the system for testing. The JMeter test scripts read two CSV files: one for users and one for claim folders and their properties.
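The split itself is simple bookkeeping. A small sketch of a round-robin assignment and the CSV output, with toy counts standing in for the 20,000 claims and 11,000 users (the file layout and helper are illustrative, not the actual script):

```python
import csv
import io

def split_claims_across_users(claims: list, users: list) -> list:
    """Assign claim folders to users round-robin so every user has a
    claim to open; a claim may serve several users (illustrative)."""
    return [(user, claims[i % len(claims)]) for i, user in enumerate(users)]

# Toy stand-ins for the real claim folders and generated test users.
claims = [f"CLAIM-{n:05d}" for n in range(20)]
users = [f"user{n:04d}" for n in range(11)]

buf = io.StringIO()  # stands in for the CSV file JMeter reads
writer = csv.writer(buf)
writer.writerow(["user", "claim_folder"])
writer.writerows(split_claims_across_users(claims, users))
```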

Creating the Test Scripts

Defining realistic test scenarios seemed fairly simple at the outset but required more iteration than expected.  We started by using BlazeMeter to record a sample set of actions we wanted to take in OCMS. The recording was exported to a JMX file and then edited to better mimic entry into OCMS from a claim system and to accept dynamic parameters from the CSV test data files. To make the test plan realistic and dynamic, we added HTTP response extractors and conditional logic to link together the task steps for searching, viewing, and annotating.

Lessons Learned – Concurrent User Testing

While the Claim Viewing scenario on its surface is a well-defined set of REST endpoints, adding in the automation to process responses and conditionally execute statements took a few days longer than we expected and required more iterative baseline testing loops than originally planned.

The more interesting troubleshooting issues occurred when we scaled up to two thousand users. At that point we observed DynamoDB scans where we had not seen them at a smaller scale. We cautiously disabled pieces of the test script and added several debuggers. It took several iterations of increasing thread limits and open-file limits to reduce the noise in the testing logs before we isolated the issue of unviewable data in the CSV test files.  Even after configuring JMeter for UTF-8 encoding, the first line of a CSV file could still begin with an unviewable character. We added a conditional statement in JMeter to skip any statement containing the character; alternatively, a more sophisticated approach would strip the character from the file with a pre-processor or other script. This issue stemmed from our address data file, and we would not anticipate it in a normal client environment.
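The stray "unviewable character" at the start of a CSV file is typically a UTF-8 byte-order mark. One pre-processing approach, sketched below, is to decode the file with Python's `utf-8-sig` codec, which drops a leading BOM if present and is a no-op otherwise (the file contents here are a made-up example):

```python
import csv
import io

def read_claim_csv(raw_bytes: bytes) -> list:
    """Decode with utf-8-sig so a leading BOM (the 'unviewable
    character') is stripped before the rows reach the test script."""
    text = raw_bytes.decode("utf-8-sig")
    return list(csv.reader(io.StringIO(text)))

# A file whose first line starts with a BOM, as we saw in testing.
raw = "\ufefffirst_name,last_name\nJane,Doe\n".encode("utf-8")
rows = read_claim_csv(raw)
```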

Once we had ensured only good data was being fed into the OCMS test script and had resolved the thread, process, and open-file limits, we scaled up to 5,500 and 11,000 users. While other smaller issues surfaced with large numbers of users, the team was confident they could be resolved for a production client and ended the testing.

Summary – Concurrent Users Benchmark Recommendations

There are several recommendations we’d make for a client wanting to build out a production version of the 11 billion benchmark for a set of 11,000 users.

  • During document ingestion, when writing data into Elasticsearch, create enough replicas along with shards to service the expected search demand. If that level of demand is unknown, expect that the Elasticsearch index may need to be rebuilt to balance the search needs against the number of shards and replicas.
  • Plan for the application instance to scale to satisfy the viewer transformation needs, or for the transformation process to be farmed out to a separate server or set of servers. If the documents are pre-transformed and available on S3, this may not become a bottleneck; pre-transformation might be done at the point of ingestion, for example.
  • TSG has put together a day 0 guide for tuning our implementations that includes all of the lessons learned on file, process, thread, and JVM limits. Our approach was to start with out-of-the-box AWS EC2 instances and services, observe the behavior of the tests, and identify where our clients can expect to see limitations. This allowed us to find break points, add capability, and scale up to the most cost-effective instances.
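The first recommendation is back-of-envelope capacity arithmetic: total shard-copy throughput must cover peak search demand. A sketch of the replica calculation, with all throughput numbers illustrative rather than measured:

```python
import math

def replicas_needed(peak_searches_per_sec: float,
                    searches_per_shard_copy: float,
                    primary_shards: int) -> int:
    """Rough replica count: enough shard copies that their combined
    search throughput covers the expected peak demand."""
    copies = math.ceil(peak_searches_per_sec / searches_per_shard_copy)
    copies_per_primary = math.ceil(copies / primary_shards)
    return max(copies_per_primary - 1, 0)  # replicas = copies minus the primary

# e.g. 500 searches/sec peak, 20 searches/sec per shard copy, 5 primaries:
# 25 copies needed -> 5 copies per primary -> 4 replicas per shard.
```

Since replica count can be changed on a live index but primary shard count cannot, undersizing the primaries is what forces the index rebuild the recommendation warns about.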

We hope this journey through our 11 billion document benchmark has inspired a new look at how ECM solutions can be implemented using AWS services.

11 Billion Document White Paper

Filed Under: Amazon, Amazon EC2, Cloud Computing, DynamoDB, ECM Landscape, ECM Solutions, OpenContent Management Suite, Performance Tuning, Uncategorized

Reader Interactions

Comments

  1. Jeff Potts says

    June 20, 2019 at 9:25 am

    How many JMeter machines did you use to drive this traffic?

    • Christine Adcock says

      June 20, 2019 at 9:56 am

      Hi Jeff. We used two jmeter machines.


Trackbacks

  1. The Deep Analysis Podcast – The 11 Billion File Benchmark says:
    July 31, 2019 at 3:51 pm

    […] documents and then Phase 4 is going to be something our clients and struggle with is which is concurrent user access we want to scale up to five to ten thousand concurrent users to hit that part as well. Phase 3 and […]

  2. Content Service Platform Scaling - How Good Key Design and NoSQL can avoid the need for Elastic/Solr or other indexes — Technology Services Group says:
    April 16, 2020 at 6:16 am

    […] When a document has a Case ID in the object model, NoSQL can make use of a design pattern by prepending the Case ID to the beginning of the document ID. In this manner, when a user performs a search for documents, HBase design best practices and DynamoDB design best practices dictate a key pattern to allow a lightning fast Range Scan to quickly bring back all of the documents for a particular case. This operation is significantly faster in a large repository, with the added benefit of not needing to leverage the search index at all. We have found that scanning the database directly via these patterns offers predictably fast access to view all documents in a particular claim, regardless if there are 1 million or 11 billion documents in the repository. […]




