TSG continues to have success with Hadoop/HBase and DynamoDB as alternatives to legacy ECM platforms that rely on relational databases. As more clients adopt modern NoSQL platforms for ECM, we are seeing additional and interesting benefits from NoSQL/Key-Value databases. This post will highlight the ‘key’ design in these Key-Value databases that allows for massively scalable content management on HBase and DynamoDB, with billions of documents and millisecond retrieval times.
As we have discussed in Hadoop for ECM and DynamoDB for ECM, both Key-Value store databases are superior for storing links to documents in an ECM system. DynamoDB and HBase do away with the need to define a schema ahead of time, which allows for a very flexible data model. Documents stored in a Key-Value database like HBase or DynamoDB leveraging OpenContent are stored via the below architecture.
Architecture of an ECM on HBase/DynamoDB
- NoSQL repository – Stores all attributes – stores content path link to the SAN – lightning fast retrievals by ID
- Search Index – Index all attributes that are searchable – fully searchable index of all attributes and fulltext content
- SAN/Object Store/S3 – Store the physical document (PDF, Word, Video, etc) – Inexpensive and fast storage/retrieval of large amounts of data retrievable by path/id
By leveraging Hadoop/HBase, Solr/Elastic and SAN/Object Store to perform what they do best, a modern ECM architecture can easily solve many of the issues of scaling a legacy ECM beyond billions of documents. HBase and DynamoDB were built from the ground up with scale in mind, so they are well suited to storing and retrieving documents in repositories of a billion-plus objects. In our 11 billion document benchmark, we noticed no performance degradation when we went from 1 million documents in the repository to 11 billion documents in the repository.
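To make the division of labor concrete, here is a minimal Python sketch of the three tiers. It is an illustration under stated assumptions, not OpenContent code: plain dictionaries stand in for the NoSQL repository, the search index and the object store, and the class name, method names and the content/<id> path scheme are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EcmSketch:
    """Toy stand-ins for the three tiers; plain dicts, not real clients."""
    nosql: dict = field(default_factory=dict)         # doc ID -> all attributes + content path
    search_index: dict = field(default_factory=dict)  # doc ID -> searchable attributes only
    object_store: dict = field(default_factory=dict)  # path -> physical document bytes

    def ingest(self, doc_id, attributes, content, searchable=()):
        path = f"content/{doc_id}"  # hypothetical path scheme
        # The physical document goes to the object store.
        self.object_store[path] = content
        # The NoSQL record holds every attribute plus the link to the content.
        self.nosql[doc_id] = {**attributes, "content_path": path}
        # Only the attributes marked searchable are indexed.
        self.search_index[doc_id] = {k: attributes[k] for k in searchable}

    def retrieve(self, doc_id):
        # Lookup by ID, then follow the stored path to the content.
        record = self.nosql[doc_id]
        return record, self.object_store[record["content_path"]]


ecm = EcmSketch()
ecm.ingest("doc-001", {"title": "Policy", "author": "jdoe"},
           b"%PDF-...", searchable=("title",))
record, content = ecm.retrieve("doc-001")
```

Note that retrieval by ID never touches the search index; the index is only needed when the ID is not yet known.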
For the majority of ECM implementations, there are typically two (and sometimes three) distinct patterns for how users get to documents in the repository:
- Search – Users know one or more attributes that they want to search on
- Analytics – Data Scientists or Business Analysts ask open ended questions of the data and are looking for patterns/trends
- Case Management – Users are coming into the system looking for documents based on a particular case number that is known (ex: insurance claim number)
Retrieval Pattern 1: Search
Typically search users will know a few of the attributes of the document (title, date, author) and will run various searches against the Solr/Elasticsearch index to find the documents. Solr and Elasticsearch are the perfect tools for quickly searching through the index and returning the IDs and metadata for each document that fits the criteria. Once the user has found the document they would like to work with, the ID of that document is passed along to HBase/DynamoDB in order to retrieve the content for the user to view, edit, annotate or perform another document action.
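The two-step flow described above (search the index for IDs, then fetch by ID) can be sketched as follows. This is a simplified stand-in: dictionaries replace Solr/Elasticsearch and HBase/DynamoDB, matching is exact equality only, and the function names and sample documents are hypothetical.

```python
def search_ids(index, **criteria):
    """Return the IDs of indexed documents whose attributes match every criterion."""
    return [doc_id for doc_id, attrs in index.items()
            if all(attrs.get(name) == value for name, value in criteria.items())]


def fetch(repository, doc_id):
    """Direct lookup by ID, the access path a key-value store is built for."""
    return repository[doc_id]


# Stand-in for the search index: searchable attributes only.
index = {
    "doc-001": {"title": "Q1 Report", "author": "jdoe"},
    "doc-002": {"title": "Q2 Report", "author": "jdoe"},
    "doc-003": {"title": "Q1 Report", "author": "asmith"},
}

# Stand-in for the NoSQL repository: full record, including the content path.
repository = {
    doc_id: {**attrs, "content_path": f"content/{doc_id}"}
    for doc_id, attrs in index.items()
}

hits = search_ids(index, author="jdoe", title="Q1 Report")
records = [fetch(repository, doc_id) for doc_id in hits]
```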
For a typical ECM deployment with high search requirements, the Solr/Elasticsearch infrastructure can be scaled to meet the needs of the system.
Retrieval Pattern 2: Analytics
If the ability to perform deep analytics on the attributes and full-text content of documents is a requirement, we typically recommend separate indexes in Solr/Elasticsearch targeted at the specific use cases that the data scientists are requesting. TSG no longer recommends one massive index for all of the attributes and full-text content for all documents in the entire repository, as it can be problematic, especially if this is the same index that is being used by end users. See our thoughts on creating separate indexes with Solr in this post.
Retrieval Pattern 3: Case Management
For many high-volume environments, Case Management represents the vast majority of how users access a document or a case of multiple documents. Typically a Case ID or similar ID (for example, a claim number or vendor number) can be used to uniquely identify the folder/case. In these applications, it is important to leverage the proper architecture to allow for fast access to the case without requiring the use of the search/index server. By leveraging a smart key design, many of our case management clients can access the case without the Solr/Elasticsearch infrastructure at all.
When a document has a Case ID in the object model, the NoSQL key can follow a simple design pattern: prepend the Case ID to the document ID. HBase design best practices and DynamoDB design best practices both dictate a pattern like this, because when a user asks for the documents in a case, a lightning fast Range Scan over that key prefix quickly brings back all of the documents for that particular case. This operation is significantly faster in a large repository, with the added benefit of not needing to leverage the search index at all. We have found that scanning the database directly via these patterns offers predictably fast access to view all documents in a particular claim.
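A minimal Python sketch of composing such a key follows. The separator character, the function names and the sample IDs are assumptions for illustration, not part of any HBase or DynamoDB API. The one design point worth noting is the delimiter: without it, a prefix for case CLM-1 would also match keys belonging to case CLM-10.

```python
SEP = "|"  # hypothetical delimiter; it must never appear inside a case ID


def make_row_key(case_id, doc_id):
    """Prepend the case ID to the document ID to form the stored key."""
    if SEP in case_id:
        raise ValueError("case_id must not contain the separator")
    return f"{case_id}{SEP}{doc_id}"


def case_prefix(case_id):
    """Prefix shared by every key in one case, including the trailing separator."""
    return f"{case_id}{SEP}"


def split_row_key(row_key):
    """Recover (case_id, doc_id) from a stored key."""
    case_id, doc_id = row_key.split(SEP, 1)
    return case_id, doc_id


key = make_row_key("CLM-10", "doc-0001")
belongs_to_clm_10 = key.startswith(case_prefix("CLM-10"))
belongs_to_clm_1 = key.startswith(case_prefix("CLM-1"))
```

Fixed-width, zero-padded case IDs would be another way to avoid the same prefix collision.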
Clients with a case management use case can make use of this design pattern to avoid creating a Solr or Elasticsearch index entirely, which in our experience can be difficult and expensive to maintain for multi-billion document repositories. Many of our case management clients run in production without an index server at all. As we found in our indexing benchmark, the infrastructure costs alone for an index of this size can be multiple times more expensive than the NoSQL database infrastructure, so we recommend that clients take this approach for case management.
Below is an example of how an HBase/DynamoDB table looks and how it can efficiently get to the documents for a case when the case number is known:
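As a rough illustration of such a table, the Python sketch below keeps rows sorted by key and uses a binary search to jump to the first row of a case, then reads forward until the prefix no longer matches. HBase stores rows sorted by row key, which is what makes this kind of scan cheap; in DynamoDB a comparable layout would likely use the case ID as the partition key with the document ID as the sort key, though that mapping is an assumption here. The claim numbers, document IDs, attributes and the scan_case helper are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical table: key = "<case id>|<document id>", value = attributes.
rows = sorted([
    ("CLM-1003|doc-0002", {"doc_type": "invoice", "content_path": "content/e5"}),
    ("CLM-1001|doc-0001", {"doc_type": "claim form", "content_path": "content/a1"}),
    ("CLM-1002|doc-0001", {"doc_type": "claim form", "content_path": "content/c3"}),
    ("CLM-1001|doc-0002", {"doc_type": "photo", "content_path": "content/b2"}),
    ("CLM-1003|doc-0001", {"doc_type": "claim form", "content_path": "content/d4"}),
], key=lambda row: row[0])
keys = [key for key, _ in rows]


def scan_case(case_id):
    """Return all rows for one case using a prefix range scan over sorted keys."""
    prefix = f"{case_id}|"
    start = bisect_left(keys, prefix)  # first key >= prefix, found in O(log n)
    end = start
    while end < len(keys) and keys[end].startswith(prefix):
        end += 1                       # read forward only while still in the case
    return rows[start:end]


case_docs = scan_case("CLM-1001")
case_keys = [key for key, _ in case_docs]
```

The work done is proportional to the number of documents in the case plus a logarithmic seek, not to the size of the repository, and no search index is consulted.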
As more of our clients are moving to modern NoSQL architecture for their document management needs, good key pattern design can be used to avoid additional Solr or Elasticsearch indexes while providing very quick user response times.
Let us know your thoughts in the comments.