TSG initiated our 11 Billion Document DynamoDB benchmark on Friday, May 10th 2019 and concluded all testing activities on June 20th 2019, with our findings documented in our DynamoDB 11 Billion Document Migration – Summary of Findings post. The benchmark was a tremendous success, with our team learning many lessons about scaling AWS, DynamoDB, Elasticsearch and our OpenContent and OpenMigrate products. Since the benchmark was completed, we have continued to apply our experience with large migrations, NoSQL and Elasticsearch for multiple clients. This post presents our thoughts, 12 months later, on how to approach massive migrations and the lessons we learned.
Before going through what we learned, it is worth noting that early in 2020 we posted an article on what not to do, entitled Migrations – why do they fail? (12 Worst Practices). As TSG has been involved in fixing multiple failed migrations over the years, some of the points presented included:
- Oversimplifying the migration effort based on sales input
- Staffing the migration effort with internal resources or other on-site resources that lack migration experience and have other commitments
- Planning for a “Big Bang” migration with all content, people and integrations moving en masse to the new system
- Assuming the data in the old system is clean and can be easily translated to the new system
- Assuming that the new system will offer similar capabilities without deep diving into the specific user scenarios
- Assuming ideal performance during a migration
- Assuming that sample data/documents and environments will be the same as production
- Migrating everything ASAP rather than by department and document type
- Migrating without business benefits for the platform migration
- Assuming that the cloud will make things easy
- Not involving the business throughout the process
- Treating export and import as separate teams
With that said, the rest of the article shares our specific experience with large migrations and our 10 lessons learned.
Lesson #1 – AWS is a great cloud partner
One of the key findings since the benchmark was just how easy it was to work with AWS in prototyping this type of benchmark effort. After the benchmark was complete, we reached out to the other major cloud vendors about conducting the same type of benchmark in their environments. Compared to AWS, the other vendors were much more difficult to coordinate with, and none matched the investment AWS was willing to make in our efforts. By our calculation, the benchmark consumed roughly US $17,000 of AWS resources, all of which were absorbed by AWS with minimal hassle or delay.
While all of our products are designed to be cloud vendor neutral, AWS was the easiest partner to work with to set up the effort.
Lesson #2 – Leverage Good NoSQL Key Design
One of the major updates since the benchmark has been our experience with other NoSQL clients, specifically our 4 Billion Document FileNet Migration. Like other clients that have traditionally struggled to scale their large content services platforms, our client had one or two defined access patterns through which roughly 90% of users access documents. In many high volume environments, the vast majority of users access a document based on one key attribute. In these applications, NoSQL allows a key design that provides fast access to the case without requiring any use of the search/index server. While Elasticsearch or Solr indices can always be added, by leveraging a smart key design many of our case management clients can access and work the case without a Solr/Elasticsearch index at all.
See our detailed post on Good NoSQL Key Design for a unique approach that provides rapid access to large repositories without the need for Elasticsearch or Solr indices.
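To make the key design idea concrete, below is a minimal sketch (not our actual OpenContent schema) of a DynamoDB table keyed by the business identifier users search by, assumed here to be a claim number, with a sort key that encodes document type and date. With this design, all documents for a claim can be retrieved with a single Query call and no Elasticsearch or Solr index; the table and attribute names are illustrative assumptions only.

```python
# Minimal sketch (boto3) of a single-table key design where the partition key
# is the business identifier users search by (e.g. a claim number) and the
# sort key encodes document type and date. Table and attribute names are
# hypothetical, not TSG's actual schema.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Documents")  # hypothetical table name


def put_document(claim_number, doc_type, doc_date, metadata):
    """Write a document record keyed so it can be fetched by claim number alone."""
    table.put_item(Item={
        "pk": f"CLAIM#{claim_number}",       # partition key: one claim = one item collection
        "sk": f"DOC#{doc_type}#{doc_date}",  # sort key: supports type/date range queries
        **metadata,
    })


def get_claim_documents(claim_number, doc_type=None):
    """Fetch all documents for a claim (optionally one type) with a single
    Query call; no Elasticsearch/Solr index involved."""
    condition = Key("pk").eq(f"CLAIM#{claim_number}")
    if doc_type:
        condition &= Key("sk").begins_with(f"DOC#{doc_type}#")
    return table.query(KeyConditionExpression=condition)["Items"]
```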
Lesson #3 – Elasticsearch on Demand
The second phase of the benchmark, which successfully concluded on June 11th, focused on building Elasticsearch indices as required for document search over the documents already in DynamoDB. While the initial migration included an Elasticsearch index for all of the 925,837,980 folders in the repository, we wanted to show a modern approach of creating indices for specific scenarios “on demand” rather than one massive search index for the entire repository. For this test we created a one million document index for accounts payable in about 33 minutes. We learned plenty about AWS Lambda, DynamoDB Streams and the differences between scaling DynamoDB and Elasticsearch.
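As an illustration of the “on demand” pattern, the sketch below shows a simplified Lambda handler that consumes a DynamoDB stream and bulk indexes only the records relevant to one scenario into Elasticsearch. This is not the benchmark code; the endpoint, index name, attribute names and filter are assumptions, and authentication/request signing is omitted for brevity.

```python
# Simplified sketch of the DynamoDB Streams -> Lambda -> Elasticsearch pattern.
# The endpoint, index name and attribute names are assumptions for illustration;
# authentication/request signing is omitted for brevity.
import os
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(os.environ["ES_ENDPOINT"])  # hypothetical environment variable
INDEX = "accounts-payable"                     # an "on demand" index for one scenario


def handler(event, context):
    """Lambda entry point: index only the stream records relevant to this scenario."""
    actions = []
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]
        # Filter to the one access pattern this index serves (assumed attribute).
        if image.get("doc_type", {}).get("S") != "invoice":
            continue
        actions.append({
            "_index": INDEX,
            "_id": image["pk"]["S"] + "|" + image["sk"]["S"],
            # Flatten the DynamoDB typed attribute values for indexing (simplified).
            "_source": {k: list(v.values())[0] for k, v in image.items()},
        })
    if actions:
        helpers.bulk(es, actions)
```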
Based on our 4 Billion object migration client case study from FileNet, we would say a best practice is to dive deep into user search patterns and move away from a “just index everything” approach. Taking a “less is more” approach to Elasticsearch and keeping the search index focused on what is required provides:
- Significant cost savings in AWS, as Elasticsearch is one of the more expensive components
- Significant effort reduction in indexing, and a search engine that stays simple and efficient
You can learn more about our plans and recommendations for Elasticsearch in our roadmap.
Lesson #4 – NoSQL Horizontal Scaling
The first phase of the benchmark was aimed at building a large repository with our OpenMigrate ingestion tool and proving access with OpenContent Search, OpenContent Case and OpenAnnotate (now the Alfresco Enterprise Viewer). The initial ingestion phase concluded on May 17th with 11 Billion documents ingested at speeds of 20,000 documents per second to DynamoDB and the related folders indexed into Elasticsearch.
We posted daily during the migration run. For additional detail, view the following posts and videos:
- DynamoDB – Repository Walkthrough
- DynamoDB Document and Folder Details
- DynamoDB – AWS Walk Through
- DynamoDB – Ingestion Success!!! – Lessons Learned
Our biggest lesson learned was that NoSQL, and specifically DynamoDB, scales consistently with the resources provided, making it easy to improve ingestion speeds. Where our traditional relational database approaches typically hit throughput limitations at large volume, we found no evidence of the same restriction with DynamoDB.
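For readers who want a feel for how write throughput scales, here is a simple sketch (not OpenMigrate itself) of parallel batch ingestion into DynamoDB with boto3. Throughput grows roughly with the number of writer threads and the capacity provisioned or auto-scaled on the table; the table name and item shape are assumptions.

```python
# Sketch of parallel batch ingestion into DynamoDB (not OpenMigrate itself).
# Throughput scales roughly with the number of writer threads and the capacity
# available on the table; names and item shapes below are assumptions.
import boto3
from concurrent.futures import ThreadPoolExecutor


def write_chunk(items):
    # Create a resource per call: boto3 resources are not thread safe.
    table = boto3.resource("dynamodb").Table("Documents")  # hypothetical table
    # batch_writer handles the 25-item BatchWriteItem limit and retries
    # unprocessed items automatically.
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)


def ingest(items, workers=32, chunk_size=500):
    """Split the items into chunks and write them in parallel threads."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(write_chunk, chunks))
```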
Lesson #5 – Rolling Migration
For our 4 Billion Document FileNet Migration we implemented a rolling migration for a variety of reasons. See our post on Migrating to Alfresco – Reducing Risk, Stress and Cost with a Rolling Migration for additional detail. For our client, relevant reasons included:
- As the effort focused on the retirement of FileNet, our unique approach of being able to “go under the covers” of FileNet to migrate documents allowed us to turn off FileNet and access the native storage directly during the rolling migration. See our post on the specifics for FileNet COLD – cracking the proprietary format issue.
- The cost and effort of a large migration did not delay getting the benefits of the new system. Pilot users were able to start using the system immediately without having to wait for all documents to be migrated.
- The migration could begin without all the storage and other resources required for the full migration.
- Users could be gradually rolled onto the new system. Training and change management could be scheduled and conducted with small groups, increasing acceptance of the system.
- Leveraging a rolling migration significantly de-risked the project rollout, as there were considerable concerns about the ability to scale the system. By running both systems in parallel during the rollout, users and management (IT and the business) were able to build confidence in the new platform and interface throughout the rollout.
Lesson #6 – Build versus Buy?
For our development community, we also posted How to build your own ECM capabilities for massive scale and performance, sharing our experience and background on how to simplify and build these capabilities.
See an updated post now that TSG is part of Alfresco, Modern Content Services Platform in the Cloud – Build versus Buy. We would share that the determining factor in any build versus buy decision is people, particularly when it comes to long-term support.
Lesson #7 – Deduplication Efforts
Investigating the content before migration can have huge benefits, especially for legacy solutions. For our health insurance client, a single explanation of benefits document might be linked to hundreds of different locations. While FileNet access made each appear to be a different file (all with different document ids), the storage location was the same. By recognizing that content was deduplicated in the source and mimicking that behavior during the migration, linking these documents rather than creating multiple copies, we were able to improve migration speeds by 5 times. This also turned out to be a huge storage saving; even though storage is generally cheap, when content reaches the billions it becomes more of a concern.
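The deduplication approach can be summarized in a short sketch: group the source documents by their underlying storage location (or a content hash), migrate each unique content object once, and link every source document record to it. The helper functions below are hypothetical placeholders for the target repository's ingestion APIs, not actual OpenMigrate calls.

```python
# Sketch of the deduplication idea: migrate each unique content object once
# and link every source document record to it. migrate_content and
# link_document are hypothetical placeholders for the target repository's APIs.
from collections import defaultdict


def deduplicate_and_migrate(source_docs, migrate_content, link_document):
    """source_docs: iterable of dicts with 'doc_id', 'storage_path', 'metadata'."""
    by_storage = defaultdict(list)
    for doc in source_docs:
        # A content hash works as well when storage paths are not comparable.
        by_storage[doc["storage_path"]].append(doc)

    for storage_path, docs in by_storage.items():
        content_ref = migrate_content(storage_path)  # copy the bytes exactly once
        for doc in docs:
            # Each of the (possibly hundreds of) source document ids becomes a
            # lightweight metadata record pointing at the shared content.
            link_document(doc["doc_id"], content_ref, doc["metadata"])
```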
Lesson #8 – NoSQL scales, Legacy systems not so much
Even though NoSQL has proven to scale in our benchmark using bulked up infrastructure, the legacy systems the content comes from may not. For our 4 Billion FileNet migration, FileNet P8 content (about 30% of the migration) had to be run through FileNet APIs, while the FileNet ImageServices storage used a proprietary COLD format that needed to be converted. Converting any format into a different one takes time, and usually the older the system, the longer the conversion will take. Converting the COLD format included parsing and decoding two separate headers, decompression, background image retrieval and page combining. Understanding your legacy platform and its limitations leads to a better understanding of the true level of effort to move to NoSQL, and of whether you can move everything at once or should take a rolling approach.
Lesson #9 – Managed NoSQL
In comparing our 4 Billion document FileNet migration running on an on-premises Hadoop cluster to our benchmark, we would say that cloud vendors provide huge benefits in regards to managed cluster offerings and serverless environments. Azure and AWS manage distributing the database load (and even Hadoop can be installed using Ambari on physical servers), which drastically reduces the learning curve of working with NoSQL databases. Leveraging managed/serverless environments also means fewer dedicated resources are needed to manage the databases during and after setup.
Lesson #10 – Efficient Metadata
While the main benefit of NoSQL is its ability to massively scale, metadata should be managed responsibly, just as in relational models. During the benchmark we used an object model we commonly find in place for insurance claims processing; it consisted of about 15-20 attributes per document or folder, including “system” attributes like created and modified date.
When migrating to a new system, many clients understandably want to move all current metadata from the old system, but in most cases at least some of that metadata is obsolete or unused. Legacy systems often came with default attributes like “keywords” that are no longer applicable. Trimming metadata from the object model before migration saves migration time and cost as well as storage costs.
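A simple way to enforce this during migration is to filter each document's legacy attributes against an allowlist agreed upon with the business. The sketch below is illustrative only; the attribute names are assumptions, not a recommended object model.

```python
# Sketch of trimming legacy metadata against an allowlist agreed with the
# business before migration; attribute names are illustrative only.
KEEP = {
    "claim_number", "doc_type", "doc_date", "member_id",
    "created_date", "modified_date",  # "system" attributes still in use
}


def trim_metadata(legacy_attributes):
    """Drop obsolete/unused attributes (e.g. a legacy 'keywords' field) so they
    are never migrated or stored in the new object model."""
    return {k: v for k, v in legacy_attributes.items() if k in KEEP}
```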
Summary
In hindsight, our 11 Billion document benchmark thoroughly prepared us for our eventual 4 billion document migration from FileNet as well as other migrations in the last year. Overall, with billions rather than millions of documents, the migration process is similar but any mistakes are magnified by the volume. We found AWS to be a great partner and want to thank them specifically for their support during the benchmark and for giving our team the critical experience that has already translated into client success.
Let us know your thoughts below.