TSG has implemented multiple large document solutions in our Alfresco, Documentum and NoSQL (DynamoDB and HBase/Hadoop) practices, with volumes in the hundreds of millions and sometimes billions of documents. Our most recent FileNet migration had 4 billion documents, and in 2019 TSG successfully benchmarked 11 billion documents internally on DynamoDB. This post will highlight the ‘key’ design and access patterns we see in many of our large content services deployments to avoid issues that arise when scaling into the hundreds of millions or billions of documents.
As we have posted in the past, there are a few access patterns that a typical content services deployment sees.
Repository-Wide Search (Less than 50 million documents)
Platforms like Documentum and Alfresco include a Solr (Lucene under the covers) index of every document in the repository to fulfill the requirement of searching across the entire repository for specific attributes, or combining attribute search with full-text search to provide Google-like searching within the contents of documents. For simple deployments of a few million documents, this architecture can scale relatively easily without the need for a complex Solr sharded deployment across multiple servers.
Once a repository grows beyond 50 million documents, a complex sharded deployment can become necessary to support that volume. At this scale, search performance can degrade, as these deployments rely on one massive index that requires each shard in the cluster to return its result set before any results can be presented to the end user. Issues also come into play with the size of the relational database needed to hold the metadata and check security.
Index on Demand rather than One Index to Rule them All (25 million to 100 million documents)
As we have posted from our Solr and Elasticsearch practices, a much more scalable approach to search is to create separate indexes for each area of the business that needs to search for documents. By simplifying the architecture, the solution allows independent scaling for the areas of the business that leverage search, meeting each area’s specific Service Level Agreement (SLA).
Separate indices can take advantage of several differences from the one massive index for the entire repository:
- Scale and tune independent indices rather than one large index
- Application Level Security rather than per-document security to keep performance consistent
- Separate indices and access for analytics users who “bang on the system” with large, expensive queries
See our posts on how to leverage a separate index with Solr, or with Elasticsearch for clients that are more comfortable scaling their Elastic stack. With this “on demand” approach to indexing, the areas of the business that utilize search heavily can get the exact indices they need to support their use case without jeopardizing the content services platform as a whole.
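The index-on-demand and application-level-security points above can be sketched together in a few lines of Python. This is a minimal illustration, not a specific deployment: the area names, index names, and `search_target` helper are all hypothetical.

```python
# Hypothetical sketch: each business area gets its own index, sized and
# tuned for that area's SLA, instead of one repository-wide index.
AREA_INDEX = {
    "claims": "claims_docs_v1",        # heavy search usage, larger cluster
    "accounting": "ap_docs_v1",        # light usage, small footprint
    "analytics": "analytics_docs_v1",  # isolated so expensive analytics
                                       # queries cannot impact other users
}

def search_target(user_areas: set, area: str) -> str:
    """Return the dedicated index a user should query for a business area.

    Application-level security: entitlement is checked once per request
    against the user's business areas, rather than evaluating
    per-document security inside the index on every result.
    """
    if area not in user_areas:
        raise PermissionError(f"User is not entitled to area: {area}")
    if area not in AREA_INDEX:
        raise ValueError(f"No index provisioned for area: {area}")
    return AREA_INDEX[area]
```

Because each index serves one area, it can be scaled, reindexed, or retired independently of the others.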
No Index, No Problem (More than 100 million documents)
In working with clients that have traditionally struggled to scale their large content services platform, we typically see that 90% of the time, users access documents through one or two defined access patterns. Rather than “I want to search on anything,” these high-volume patterns follow specific business usage.
Particularly for insurance and other case management clients, there is typically one key attribute or number that controls access to the case or a particular set of documents. Some examples:
- Claim Number (Insurance)
- Policy Number (Insurance)
- Member ID (Healthcare)
- Incident Number (Government)
- Matter Number (Legal)
- Employee ID (HR)
- Vendor Number (Accounts Payable/Receivable)
For many high-volume environments, the vast majority of users access a document based on one key attribute. In these applications, it is important to leverage the proper architecture to allow fast access to the case without requiring any use of the search/index server. While Elastic or Solr indices can always be added, by leveraging a smart key design, many of our case management clients can access and work the case without a Solr/Elasticsearch index at all.
NoSQL Key Design
When a document has a Case ID in the object model, NoSQL can make use of a design pattern that prepends the Case ID to the beginning of the document ID. In this manner, when a user requests documents, HBase design best practices and DynamoDB design best practices dictate a key pattern that allows a lightning-fast range scan to quickly bring back all of the documents for a particular case. This operation is significantly faster in a large repository, with the added benefit of not needing to leverage the search index at all. We have found that scanning the database directly via these patterns offers predictably fast access to view all documents in a particular claim, regardless of whether there are 1 million or 11 billion documents in the repository.
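The prefix-key pattern can be sketched in a few lines of Python, using a sorted in-memory list to stand in for the table’s sorted key space. This is an illustration only: real HBase row keys are byte-encoded, the `#` separator and claim numbers are made up, and `bisect` stands in for the database’s range scan.

```python
import bisect

def make_key(claim_number: str, doc_id: str) -> str:
    # Composite row key: the claim number is prepended to the document
    # ID, so all documents for one claim sort contiguously.
    return f"{claim_number}#{doc_id}"

# Sorted key space standing in for the NoSQL table.
rows = sorted([
    make_key("CLM-1001", "doc-a"),
    make_key("CLM-1001", "doc-b"),
    make_key("CLM-1002", "doc-c"),
    make_key("CLM-1003", "doc-d"),
])

def range_scan(claim_number: str) -> list:
    """Return every row key that starts with the claim prefix.

    Because the keys are sorted, the scan touches only the contiguous
    block for this claim -- no search index is consulted, and the cost
    is independent of total repository size.
    """
    prefix = claim_number + "#"
    lo = bisect.bisect_left(rows, prefix)
    hi = bisect.bisect_right(rows, prefix + "\xff")
    return rows[lo:hi]
```

The same scan returns the claim’s documents in the same few hops whether the table behind it holds a million rows or billions, which is the property the key design is buying.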
In the example of an insurance claim management implementation, putting the claim number in the document key makes it a trivial operation for the NoSQL database to return all documents for that claim. As the system scales to multiple billions of documents, the power of NoSQL key design means lookups remain fast at any scale.
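On DynamoDB, this key design maps naturally onto a table whose partition key is the claim number and whose sort key is the document ID. The sketch below shows an illustrative table definition (the table and attribute names are assumptions, not a specific production schema); a single `Query` with a `KeyConditionExpression` on the partition key then returns every document for the claim.

```python
# Hypothetical DynamoDB table definition for the claim-number key
# design. Names are illustrative; the KeySchema/AttributeDefinitions
# structure matches the DynamoDB CreateTable API.
table_definition = {
    "TableName": "claim_documents",
    "KeySchema": [
        {"AttributeName": "claim_number", "KeyType": "HASH"},   # partition key
        {"AttributeName": "doc_id", "KeyType": "RANGE"},        # sort key
    ],
    "AttributeDefinitions": [
        {"AttributeName": "claim_number", "AttributeType": "S"},
        {"AttributeName": "doc_id", "AttributeType": "S"},
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

# The corresponding lookup touches only one partition, so its cost is
# proportional to the claim's documents, not the repository size, e.g.:
#   table.query(KeyConditionExpression=Key("claim_number").eq("CLM-1001"))
```

Because every document for a claim shares one partition key, no scatter-gather across shards (and no search index) is involved in the lookup.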
Clients with a case management use case can apply this design pattern to entirely avoid creating a massive Solr or Elasticsearch index/infrastructure, which in our experience can be difficult and expensive to maintain for multi-billion-document repositories. Many of our case management clients run in production without an index server at all. As we found in our indexing benchmark, the infrastructure costs alone for an index of this size can be multiple times more expensive than the database infrastructure, so we recommend that clients take this approach for case management.
Let us know your thoughts on scaling your content services platform in the comments below.