DynamoDB Benchmark – Building an 11 Billion Document DR Process


July 8, 2019

In June 2019, Technology Services Group completed an unprecedented 11 billion document benchmark leveraging Amazon Web Services, specifically DynamoDB and Elasticsearch. As with any of our enterprise-class solutions, we didn't view the benchmark as complete without a disaster recovery process in place. This post shares our architecture approach and the lessons learned from implementing DR for the benchmark.

11 Billion Document White Paper

To back up the benchmark data, we tried a few different Amazon services with varying levels of success. The most crucial parts of the solution to secure were the DynamoDB tables and the S3 buckets. We felt the remaining components (the Elasticsearch cluster, EC2 instances, auto-scaling, logging, and alarms) could be re-created from CloudFormation templates or with OpenMigrate and a few Python scripts, given the specifics of our case management approach.

Amazon S3

For the content stored on Amazon S3, the bucket's default durability of 99.999999999% means there is very little we need to do to prevent data loss, and we relied on this for the benchmark. We felt confident that if a client needed higher durability or complete replication, the content could be copied to a second bucket or archived to Glacier.

DynamoDB

Given 11 billion documents and almost 1 billion folder objects, the DynamoDB table for the benchmark grew to 5.3 TB, which made backing up the data a challenge. Taking a DynamoDB table backup within the DDB service is very easy, but storing that amount of data gets expensive; at the pricing as of this blog post, $0.25 per GB, it costs $1,325 per month. Going a bit cheaper, we could use the AWS Backup service, where the data would be stored at $0.10 per GB, but that is still $530 per month. Since we want to preserve the data at the lowest possible cost, storing the table data on S3 would be cheaper still at $0.023 per GB, or $121.90 per month; and Glacier's pricing is even better at $0.004 per GB, or $21.20 per month.
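For reference, the monthly cost comparison works out as follows; this is a quick sketch that uses the per-GB rates quoted above and treats the 5.3 TB table as 5,300 GB.

# Rough monthly cost comparison for storing the 5.3 TB table backup.
# Per-GB rates are the ones quoted in this post and will vary by region and date.
TABLE_SIZE_GB = 5_300

rates_per_gb_month = {
    "DynamoDB on-demand backup": 0.25,
    "AWS Backup": 0.10,
    "S3 Standard": 0.023,
    "S3 Glacier": 0.004,
}

for service, rate in rates_per_gb_month.items():
    print(f"{service}: ${TABLE_SIZE_GB * rate:,.2f} per month")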

To copy data from DynamoDB to S3, AWS provides a template as part of the AWS Data Pipeline service. Unfortunately, we built the benchmark in the us-east-2 (Ohio) region, which does not yet support Data Pipeline. The next out-of-the-box option is to use the EMR service and follow AWS's published guidance to copy the data from DynamoDB to S3 using Hadoop, Hive, and a built-in connector library. The EMR process was very appealing because restoring the table data from S3 back into DynamoDB is as straightforward as copying it out.

We tested the EMR process successfully on some smaller DDB tables, but confirmed that moving the billions of entries would take tens of hours. The EMR process executes a SELECT * query against the DynamoDB table, and the performance of that query is directly related to the read units allocated to the table. At 10,000 read units it would have taken over 35 hours simply to execute the query; even at the maximum setting of 40,000 read units the query would still take about 9 hours. Once the query finishes, the time to write the data out is constrained by the EMR instance sizes and the number of mapper threads.
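A rough back-of-the-envelope estimate shows where those hours come from; this sketch assumes roughly 4 KB scanned per read unit per second, and actual throughput depends on item sizes and read consistency.

# Back-of-the-envelope scan-time estimate for the 5.3 TB table.
# Assumes ~4 KB read per provisioned read unit per second; real throughput
# varies with item size and read consistency.
TABLE_SIZE_KB = 5.3e9  # 5.3 TB expressed in KB

for read_units in (10_000, 40_000):
    hours = TABLE_SIZE_KB / (read_units * 4) / 3600
    print(f"{read_units:,} read units -> ~{hours:.0f} hours")
# ~37 hours at 10,000 read units and ~9 hours at 40,000, in line with what we saw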

In sizing out the EMR process, our lesson learned was that it would have been easier to use EMR to copy the data to S3 if we could have run queries that selected subsets of the data instead of all rows at once. Since our only indexed value was objectId_s, a full table scan was necessary.

Given the high cost of the DynamoDB read units and the EMR cluster, we tested a few other options for writing the table data to S3. The first was AWS Glue. The process worked without error on the smaller tables but hit an internal error when the crawler ran against the large documentProperties table; we suspect the error was due to the table's size. One of the lessons learned with Glue is that while it is point-and-click to configure and export the DynamoDB data, there isn't an equivalent way to ingest it back in; a separate program would be required to restore the data.
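As a rough sketch of what such a restore program could look like, the snippet below reads newline-delimited JSON from S3 and batch-writes it back into DynamoDB; the bucket, prefix, and table names are hypothetical, and it assumes the export was written as plain (untyped) item attributes.

import json
from decimal import Decimal

import boto3

# Hypothetical names for illustration; the actual format and location depend
# on how the export job wrote the data to S3.
BUCKET = "ddb-backup-bucket"
PREFIX = "documentProperties/"
TABLE = "documentProperties"

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table(TABLE)

paginator = s3.get_paginator("list_objects_v2")
with table.batch_writer() as batch:  # groups PutItem calls into 25-item batches
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                if line:
                    # DynamoDB requires Decimal rather than float for numbers.
                    batch.put_item(Item=json.loads(line, parse_float=Decimal))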

The final DDB backup process we tested was a short Python script that initiated a scan against the table and paged through the results 1 MB at a time, putting each record onto an AWS Kinesis stream. To write the data to S3, we configured Kinesis Firehose. Kinesis is designed to handle large volumes of serial data, and while we didn't process the entire table, it easily handled our test volumes. This approach is also limited by the read units on the DynamoDB table, which caps the speed at which the data can be copied. Unfortunately, as with Glue, there isn't an out-of-the-box way to restore the data back into the DynamoDB table; a separate program is necessary.
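A minimal sketch of that scan-and-stream script is below; the table and stream names are hypothetical, and a production version would batch records with put_records and throttle itself against the table's read capacity.

import json

import boto3

# Hypothetical names for illustration.
TABLE = "documentProperties"
STREAM = "ddb-backup-stream"  # Kinesis stream with a Firehose delivery to S3

dynamodb = boto3.client("dynamodb")
kinesis = boto3.client("kinesis")

kwargs = {"TableName": TABLE}
while True:
    # Each Scan call returns at most 1 MB of items plus a LastEvaluatedKey cursor.
    page = dynamodb.scan(**kwargs)
    for item in page["Items"]:
        kinesis.put_record(
            StreamName=STREAM,
            Data=(json.dumps(item) + "\n").encode("utf-8"),  # DynamoDB typed JSON
            PartitionKey=item["objectId_s"]["S"],  # the table's key attribute
        )
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]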

Elasticsearch

Backing up and restoring a domain in AWS Elasticsearch was a very easy process. Even though we had over 920 million documents indexed, it took merely seconds to store the 40 GB of data. For our benchmark domain, we used a short Python script to register the S3 bucket as a backup location and then executed a REST API call to create the manual backup in the bucket. Once the backup completed, we made an API call to get the details of the backup and confirmed all the pieces were in S3. With another simple REST API call we tested restoring the index into a separate Elasticsearch domain. The process was quick and flawless. The trickiest piece was setting up the Python script to create the needed AWS authentication object and getting the command syntax correct. Once we had that down, it was quick to get the rest working.
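A minimal sketch of those snapshot calls is below. The domain endpoint, bucket, and IAM role are hypothetical placeholders, and one way to build the AWS authentication object mentioned above is the AWS4Auth helper from the requests_aws4auth package.

import boto3
import requests
from requests_aws4auth import AWS4Auth

# Hypothetical endpoint, bucket, and role for illustration.
HOST = "https://search-benchmark-domain.us-east-2.es.amazonaws.com"
REGION = "us-east-2"
REPO_BUCKET = "es-snapshot-bucket"
SNAPSHOT_ROLE_ARN = "arn:aws:iam::123456789012:role/es-snapshot-role"

# The AWS authentication object: sign each request with the caller's credentials.
creds = boto3.Session().get_credentials()
awsauth = AWS4Auth(creds.access_key, creds.secret_key, REGION, "es",
                   session_token=creds.token)

# 1. Register the S3 bucket as a snapshot repository for the domain.
repo_body = {
    "type": "s3",
    "settings": {"bucket": REPO_BUCKET, "region": REGION,
                 "role_arn": SNAPSHOT_ROLE_ARN},
}
requests.put(f"{HOST}/_snapshot/benchmark-repo", auth=awsauth, json=repo_body)

# 2. Create a manual snapshot, then check its details once it completes.
requests.put(f"{HOST}/_snapshot/benchmark-repo/snapshot-1", auth=awsauth)
print(requests.get(f"{HOST}/_snapshot/benchmark-repo/snapshot-1", auth=awsauth).json())

Restoring into a separate domain follows the same pattern: a signed POST to _snapshot/benchmark-repo/snapshot-1/_restore against the target domain's endpoint.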

Summary

The 11 billion document benchmark gave us a great opportunity to investigate DR options at scale. We've written about several of these options, and more, in our Disaster Recovery / Business Continuity white paper. We'd love to hear more about your thoughts and experience in the comments below.

11 Billion Document White Paper

Filed Under: Amazon, Amazon EC2, Cloud Computing, DynamoDB, Elastic APM
