Technology Services Group

DynamoDB 11 Billion Benchmark Search Index Success!!! – Lessons Learned


June 12, 2019

TSG initiated our 11 billion document benchmark test on Friday, May 10th.  The first phase of the benchmark was aimed at building a large repository with our OpenMigrate ingestion tool and proving access for OpenContent Search, OpenContent Case and OpenAnnotate.  That initial ingestion phase concluded on May 17th with 11 billion documents loaded into DynamoDB at speeds of 20,000 documents per second, with the related folders indexed into Elasticsearch.  After taking some time to decompress, we started the second phase of the benchmark last week, focused on building the search indices required for document search against the DynamoDB documents.  TSG successfully completed the indexing benchmark today.  This post highlights the results of this phase as well as how the final two phases will proceed.

11 Billion Document White Paper

Elasticsearch Index – Benchmark Approach

One key focus area of the benchmark for the document index was departmental search versus a complete repository search.  As we pointed out in our DynamoDB and AWS post on “How to build your own ECM capabilities for massive scale and performance”, we recommend multiple, efficient search indices rather than one large index, allowing content services to perform faster with less complexity.  Given the capabilities and cost profile of robust search tools like Solr and Elasticsearch, TSG has recommended for years that clients create focused indices rather than one large “do all” index meant to serve every possible purpose.

For the benchmark, we focused on two indices:

  • Large Folder Index – All folders have an index entry for basic search.  Typically in a case management scenario, large clients want to navigate to the folder to look at the documents in the case, so getting to the case folder and its documents quickly is a key requirement.
  • Focused Document Index – A narrower index covering a subset of documents.  These indices can be created as needed for specific scenarios or requirements.

While we felt very comfortable that we could create an Elasticsearch index for all of the documents in the repository, especially given Elasticsearch’s sharding capabilities, the indexing services from AWS come at a higher price than DynamoDB storage alone.  Our main focus was to prove out the indexing capabilities on a smaller sample while providing a realistic approach for clients to consider when building and maintaining Elasticsearch indices.
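To illustrate the focused-index idea, a folder index for case navigation and a narrower departmental document index might be defined with mappings like the following.  This is a sketch against the Elasticsearch mapping format; the index and field names are hypothetical, not TSG’s actual schema.

```python
import json

# Folder index: only the minimal metadata needed to navigate to a case
# folder quickly. Field names are illustrative.
folder_index_mapping = {
    "mappings": {
        "properties": {
            "folder_path":  {"type": "keyword"},   # exact-match navigation
            "case_number":  {"type": "keyword"},
            "created_date": {"type": "date"},
        }
    }
}

# Focused document index: only the fields one department searches on,
# built for a subset of documents (e.g. AP invoices) when needed.
ap_invoice_index_mapping = {
    "mappings": {
        "properties": {
            "invoice_number": {"type": "keyword"},
            "vendor_name":    {"type": "text"},
            "invoice_date":   {"type": "date"},
            "amount":         {"type": "double"},
        }
    }
}

# These bodies would be sent as, e.g., PUT /folders and PUT /ap-invoices.
print(json.dumps(folder_index_mapping, indent=2))
```

Keeping each mapping this small is what lets the index stay a fraction of the repository size, as the benchmark numbers below show.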

Folder Index Results

As part of the 11 billion document ingestion, we leveraged OpenMigrate to create an Elasticsearch index entry for every folder in the repository, finishing with 925,837,980 folders in the Elasticsearch index.  We created this index as part of the OpenMigrate ingestion process, which was storing 21,000 objects per second.  Statistics from the folder indexing benchmark include:

  • 925,837,980 folders stored in DynamoDB and indexed in Elasticsearch
  • 6 Data Nodes and 3 Master Elasticsearch servers to maintain the index
  • Objects (documents and folders) ingested per second – 21,000
  • Indexing time for 925,837,980 folders – 159 hours
  • Elasticsearch Index Size – 372 Gigabytes – DynamoDB Repository Size – 5.32 Terabytes

Below is the search video we recorded during the ingestion process.

Document Index Results

For the document index, we wanted to show a scenario where an index is created for a specific purpose.  We decided to index 1 million documents from our accounts payable scenario as a feasible test.  This is consistent with client experience, where clients have said “I only need to index the last X months for X”.  Statistics from the document indexing benchmark differ from the DynamoDB effort, as the content was already in DynamoDB (and needed to be retrieved to be indexed):

  • 1 million documents already existing in DynamoDB
  • Documents indexed per second – 501.4
  • Indexing time – 33 minutes 14 seconds
  • Elasticsearch Index Increase – 4 Gigabytes – Total Size 376 Gigabytes

This effort scanned through existing DynamoDB content, and if the object type of the content was an AP invoice document, it indexed the content until 1 million documents were found.
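The scan-and-filter step described above can be sketched as follows.  This is a simplified stand-in for OpenMigrate’s logic; the item shape and the `object_type` attribute name are assumptions for illustration.

```python
def select_for_indexing(items, object_type="ap_invoice", limit=1_000_000):
    """Yield items of the requested type until `limit` are found,
    mimicking the scan-until-1-million approach described above."""
    found = 0
    for item in items:
        if item.get("object_type") == object_type:
            yield item
            found += 1
            if found >= limit:
                return

# In practice, `items` would come from paginated DynamoDB Scan pages;
# here a small in-memory sample demonstrates the filtering behavior.
sample = [
    {"id": "1", "object_type": "ap_invoice"},
    {"id": "2", "object_type": "hr_record"},
    {"id": "3", "object_type": "ap_invoice"},
    {"id": "4", "object_type": "ap_invoice"},
]
picked = list(select_for_indexing(sample, limit=2))
print([d["id"] for d in picked])  # → ['1', '3']
```

Because the generator stops as soon as the limit is reached, the scan does not have to touch the full 11 billion items to build a 1 million document index.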

We were able to create the 1 million document index after the DynamoDB ingestion process was complete in just under 35 minutes, or roughly 500 documents per second.  Unlike the main ingestion process, where we had only 7 servers, we added more processing power to the Elasticsearch cluster to get up to 9 servers.  We are confident that we could add additional servers to further reduce the indexing time and improve throughput.

Lessons Learned – AWS Lambda

We were hoping to use AWS Lambda for indexing from DynamoDB to Elasticsearch.  We found that:

  • AWS Lambda currently has a 5 minute timeout for execution, which made indexing our large volume difficult.
  • We considered indexing from DynamoDB Streams, but stream records are only retained for 24 hours, so they didn’t fit our scenario of building a purpose-built index later.

We decided to leverage OpenMigrate to both read and index documents to be consistent and provide the same infrastructure for both initial ingestion as well as the creation of new indices.
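Whatever tool drives it, the indexing step ultimately reduces to batching retrieved items into Elasticsearch `_bulk` requests.  A minimal sketch of building one such payload, with a hypothetical index name and fields:

```python
import json

def to_bulk_payload(docs, index_name="ap-invoices"):
    """Build the newline-delimited body for an Elasticsearch _bulk request:
    one action line followed by one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps({k: v for k, v in doc.items() if k != "id"}))
    return "\n".join(lines) + "\n"   # _bulk bodies must end with a newline

docs = [{"id": "inv-1", "vendor_name": "Acme", "amount": 120.5}]
print(to_bulk_payload(docs))
```

An OpenMigrate-style pipeline would POST batches of these payloads to the cluster, which is why the same infrastructure can serve both the initial ingestion and later purpose-built indices.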

Lessons Learned – Scaling Elasticsearch versus DynamoDB

We found the current pricing of Elasticsearch versus DynamoDB to be very different, due to the size of the cores needed to support large ingestion and indices.  In our benchmark, DynamoDB stored roughly 14 times as much data as Elasticsearch (5.32 terabytes versus 372 gigabytes), yet Elasticsearch cost about 1.3 times more than DynamoDB over the course of the benchmark.

Unlike DynamoDB where we could scale up for ingestion and then drop read/write units once the large migration was complete, Elasticsearch requires servers to be maintained and operational for both ingestion and later access.  DynamoDB read/write units are priced and maintained very differently than Elasticsearch EC2 instances.
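The DynamoDB scale-down pattern is a single throughput change once the migration finishes.  With boto3 it would look roughly like the following; the table name and capacity numbers are illustrative, and the actual AWS call is shown but not executed here.

```python
def scale_down_params(table_name, read_units, write_units):
    """Parameters for dynamodb.update_table() to drop provisioned
    read/write capacity once a bulk migration completes."""
    return {
        "TableName": table_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }

# After the migration, drop from migration-scale units to steady-state
# levels (numbers are illustrative):
params = scale_down_params("documents", read_units=1000, write_units=100)
print(params["ProvisionedThroughput"])

# In a real deployment:
#   import boto3
#   boto3.client("dynamodb").update_table(**params)
```

There is no equivalent one-call move for an Elasticsearch cluster, which has to keep enough nodes running to serve queries, and that asymmetry drives the cost difference noted above.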

Phase 3 and Phase 4 – What’s Next

We are looking to finish the benchmark over the next one to two weeks.  Upcoming tests will include:

Benchmark Phase 3 – Add documents – users will be able to add documents in a variety of supported methods highlighted in this post.

Benchmark Phase 4 – Concurrent user test of 11,000 threads performing standard document management, including search, annotation and adding documents.

Stay tuned as we look to wrap up the benchmark.  Thanks again for all of your questions and emails.


Filed Under: DynamoDB, ECM Landscape, Elastic APM, OpenAnnotate, OpenContent Management Suite, OpenOverlay, Uncategorized

Reader Interactions

Comments

  1. Jeff Potts says

    June 14, 2019 at 12:39 pm

    I like hearing about the progress you are making with your custom content repository built on AWS. Thanks for sharing all of this info!

    A few things to consider regarding Elasticsearch…

    You mentioned that Elasticsearch nodes must be maintained after ingestion rather than being able to scale them down. But you can scale down a cluster as long as you’ve left enough nodes to handle the query traffic and the number of shards you’ve allocated. Although primary shards cannot be changed, replica shards can, so if you needed to, you could decrease the number of replica shards.

    Another thing to consider is the mix of different Elasticsearch node types. It sounds like you used only Master nodes and Data nodes, which is fine, but you can also have Ingest nodes. As the name suggests these nodes are dedicated to ingesting data. If your ingestion was not full-time you could spin up ingestion nodes to handle ingestion, then turn them off while they are not being used.

    It sounds like you were able to get some good throughput on your Elasticsearch cluster. In a real world scenario it makes sense to spend some time tuning the cluster. On one of our projects, we found that we got better query throughput by increasing the number of client nodes (nodes that are neither master nor data nodes).

    By tweaking the number of nodes and different types of nodes you can ensure that you have maximum throughput for both querying and indexing, and that mix can change over time as your workload changes.

    Jeff

    Reply
    • Joe Hof says

      June 18, 2019 at 9:01 am

      Hey Jeff,

      Thanks so much for your comment, really good stuff on Elasticsearch configuration. Just want to respond to a few of your points.

      In regards to scaling back the cluster as long as you have enough nodes to handle query traffic, that is spot on. We found you really needed horsepower when actually indexing but found it overkill for search, so scaling back helped on cost while maintaining a consistent performance for search.

      All of the node and shard tweaks you mentioned are great and would benefit a normal Elasticsearch instance. Unfortunately, the Amazon Elasticsearch managed service only has Master and Data nodes, so we were a little limited in how we could tweak the setup beyond those two. We thought the benefits of the managed service (being able to scale server size and clusters as needed) outweighed the con of not having all of the node types that an on-premises Elasticsearch can have.

      However, if we are ever deploying DynamoDB with Elasticsearch installed to a server instead of using the managed service, we will definitely use all of the advice posted here!

      Joe

      Reply


