In June 2019, Technology Services Group completed an unprecedented 11 billion document benchmark leveraging Amazon Web Services, specifically DynamoDB and Elasticsearch. As with any of our enterprise-class solutions, we didn't view the benchmark as complete without a disaster recovery process in place. This post shares our architectural approach and lessons learned from implementing DR for the benchmark.
To back up the benchmark data we tried a few different Amazon services with varying levels of success. The most crucial parts of the solution to secure were the DynamoDB tables and the S3 buckets. We felt the remaining components (the Elasticsearch cluster, EC2 instances, auto-scaling, logging, and alarms) could be re-created from CloudFormation templates or with OpenMigrate and a few Python scripts, given the specifics of our case management approach.
For the content stored on Amazon S3, the bucket's default durability of 99.999999999% means there is very little we needed to do to ensure no data loss occurred to the content, and we relied on this for the benchmark. We felt confident that if a client needed higher durability or complete replication, the content could be copied to a second bucket or archived and stored in Glacier.
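Archiving content to Glacier can be automated with an S3 lifecycle rule. A minimal sketch using boto3; the bucket name and the 30-day window here are illustrative assumptions, not values from the benchmark:

```python
def glacier_lifecycle_rule(days=30):
    """Build a lifecycle configuration that transitions every object
    in a bucket to Glacier after `days` days."""
    return {
        "Rules": [{
            "ID": "archive-to-glacier",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix = apply to all objects
            "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
        }]
    }

if __name__ == "__main__":
    import boto3  # assumes AWS credentials and region are configured
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="benchmark-content-bucket",  # hypothetical bucket name
        LifecycleConfiguration=glacier_lifecycle_rule(),
    )
```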
Given 11 billion documents and almost 1 billion folder objects, the DynamoDB table for the benchmark grew to 5.3 TB in size and posed a challenge in backing up the data easily. Taking a DynamoDB table backup within the DDB service is very easy, but storing that amount of data gets expensive: at the price as of this blog post, $0.25 per GB, it costs $1,325 for a month. Going a bit cheaper, we could use the AWS Backup service, where the data would be stored at $0.10 per GB, but that is still $530 per month. Since we wanted to preserve the data at the lowest cost possible, storing the table data on S3 would be the cheapest option: $0.023 per GB, or $121.90 per month; Glacier's pricing is even better at $0.004 per GB, or $21.20 per month.
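The per-tier math above works out as a quick sketch, treating the 5.3 TB table as 5,300 decimal GB (the way AWS bills storage):

```python
TABLE_GB = 5.3 * 1000  # 5.3 TB table, in decimal GB as AWS bills storage

# $/GB-month rates quoted in the post (as of publication)
PRICES_PER_GB_MONTH = {
    "DynamoDB backup": 0.25,
    "AWS Backup": 0.10,
    "S3 Standard": 0.023,
    "S3 Glacier": 0.004,
}

monthly_cost = {tier: round(TABLE_GB * price, 2)
                for tier, price in PRICES_PER_GB_MONTH.items()}

for tier, cost in monthly_cost.items():
    print(f"{tier}: ${cost:,.2f}/month")
```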
To copy data from DynamoDB to S3, AWS provides a template as part of the AWS Data Pipeline service. Unfortunately, we built the benchmark in the us-east-2 Ohio region, which does not yet support Data Pipeline. The next out-of-the-box option was to use the EMR service and follow guidance published by AWS to use a built-in library with Hadoop and Hive to copy the data from DynamoDB to S3. The EMR process was very appealing because restoring the data back into DynamoDB is as straightforward as copying the table data out to S3.
We tested the EMR process on some smaller DDB tables successfully, but confirmed that moving the billions of entries would take tens of hours. The EMR process executes a SELECT * query against the DynamoDB table, and the performance of that query is directly related to the allocated read units on the table. If we set it at 10,000 read units, it would have taken over 35 hours simply to execute the query. Even at the maximum setting of 40,000 read units, the query would still take about 9 hours. Once the query finishes, the process of writing the data is constrained by the EMR instance sizes and the number of mapper threads.
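The timings above follow from back-of-the-envelope math. A sketch, assuming strongly consistent reads (one read unit covers 4 KB per second) and a scan that consumes the table's full provisioned read capacity:

```python
TABLE_BYTES = 5.3 * 10**12  # the 5.3 TB documentProperties table

def full_scan_hours(read_units):
    """Estimate hours to scan the entire table, assuming each strongly
    consistent read unit delivers 4 KB per second."""
    bytes_per_second = read_units * 4 * 1024
    return TABLE_BYTES / bytes_per_second / 3600

print(f"10,000 RCU: ~{full_scan_hours(10_000):.0f} hours")  # over 35 hours
print(f"40,000 RCU: ~{full_scan_hours(40_000):.0f} hours")  # about 9 hours
```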
In sizing out the EMR process, our lesson learned was that copying the data to S3 with EMR would have been easier if we could have run queries selecting subsets of the data instead of all rows at once. Since we had only one value as the index, objectId_s, it was necessary to run a full table scan.
With the high cost of the DynamoDB read units and the EMR cluster, we tested a few other options for writing the table data to S3. The first was AWS Glue. The process worked without error on the smaller tables but ran into an internal error when the crawler executed against the large documentProperties table; we suspect the error was due to the table's size. One lesson learned with Glue is that while it's point-and-click to configure and export the DynamoDB data, there isn't an equivalent way to ingest it back in; a separate program would be required to restore the data.
The final DDB backup process we tested was a short Python script that initiated a scan against the table and processed the results 1 MB at a time, putting each record onto an AWS Kinesis stream. To write the data to S3, we configured Kinesis Firehose. Kinesis is designed to handle large volumes of serial data, and while we didn't process the entire table, it easily handled our test volumes. The scan feeding the Kinesis stream is still limited by the read units on the DynamoDB table, which caps the speed at which the data can be copied. Unfortunately, as with Glue, there isn't an out-of-the-box way to restore the data back into the DynamoDB table; a separate program is necessary.
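A sketch of that scan-and-stream script. The function takes the low-level boto3 clients as arguments so the AWS wiring stays at the edge; the table name comes from the post, but the stream name is a hypothetical placeholder, and objectId_s is the table's index attribute:

```python
import json

def backup_table_to_kinesis(dynamodb, kinesis, table_name, stream_name):
    """Scan the DynamoDB table one 1 MB page at a time and put each
    record onto a Kinesis stream; Firehose drains the stream to S3."""
    records = 0
    start_key = None
    while True:
        kwargs = {"TableName": table_name}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key  # resume where the last page ended
        page = dynamodb.scan(**kwargs)
        for item in page.get("Items", []):
            kinesis.put_record(
                StreamName=stream_name,
                Data=json.dumps(item),
                PartitionKey=item["objectId_s"]["S"],
            )
            records += 1
        start_key = page.get("LastEvaluatedKey")
        if not start_key:  # no more pages
            return records

if __name__ == "__main__":
    import boto3  # assumes AWS credentials and region are configured
    total = backup_table_to_kinesis(
        boto3.client("dynamodb"),
        boto3.client("kinesis"),
        "documentProperties",
        "ddb-backup-stream",  # hypothetical stream, drained to S3 by Firehose
    )
    print(f"streamed {total} records")
```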
Backing up and restoring a domain in AWS Elasticsearch was a very easy process. Even though we had over 920 million documents indexed, it took mere seconds to store the 40 GB of data. For our benchmark domain, we used a short Python script to register the S3 bucket as a backup location and then executed a REST API call to create the manual backup in the bucket. Once the backup completed, we made an API call to get the details of the backup and confirmed all the pieces were in S3. With another simple REST API call, we tested restoring the index into a separate Elasticsearch domain. The process was quick and flawless. The trickiest piece was setting up the Python script to create the needed AWS authentication object and getting the command syntax correct; once we had that down, the rest came together quickly.
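The snapshot workflow boils down to a handful of REST calls against the domain's _snapshot API. A sketch that builds those requests; the endpoint, repository, bucket, and role names below are hypothetical, and the actual calls would be sent with a signed HTTP client (the AWS authentication object mentioned above):

```python
import json

def register_repo_request(endpoint, repo, bucket, role_arn, region):
    """Build the PUT request that registers an S3 bucket as a snapshot
    repository; AWS Elasticsearch requires an IAM role it can assume
    to write to the bucket."""
    url = f"{endpoint}/_snapshot/{repo}"
    body = json.dumps({
        "type": "s3",
        "settings": {"bucket": bucket, "region": region, "role_arn": role_arn},
    })
    return url, body

def snapshot_url(endpoint, repo, snapshot):
    """URL used to create (PUT), inspect (GET), or, with /_restore
    appended, restore (POST) a named snapshot."""
    return f"{endpoint}/_snapshot/{repo}/{snapshot}"

url, body = register_repo_request(
    "https://search-benchmark.us-east-2.es.amazonaws.com",  # hypothetical
    "backup-repo", "bench-snapshots",
    "arn:aws:iam::123456789012:role/es-snapshot-role", "us-east-2",
)
```

Sending PUT to `snapshot_url(...)` starts the backup, GET on the same URL returns its status, and POST to `snapshot_url(...) + "/_restore"` against a second domain performs the restore.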
The 11 billion document benchmark gave us a great opportunity to investigate DR options at scale. We've written about several of these options and more in our Disaster Recovery / Business Continuity white paper. We'd love to hear your thoughts and experiences in the comments below.