• Skip to primary navigation
  • Skip to main content
  • Skip to primary sidebar
  • Skip to footer
TSB Alfresco Cobrand White tagline

Technology Services Group

  • Home
  • Products
    • Alfresco Enterprise Viewer
    • OpenContent Search
    • OpenContent Case
    • OpenContent Forms
    • OpenMigrate
    • OpenContent Web Services
    • OpenCapture
    • OpenOverlay
  • Solutions
    • Alfresco Content Accelerator for Claims Management
      • Claims Demo Series
    • Alfresco Content Accelerator for Policy & Procedure Management
      • Compliance Demo Series
    • OpenContent Accounts Payable
    • OpenContent Contract Management
    • OpenContent Batch Records
    • OpenContent Government
    • OpenContent Corporate Forms
    • OpenContent Construction Management
    • OpenContent Digital Archive
    • OpenContent Human Resources
    • OpenContent Patient Records
  • Platforms
    • Alfresco Consulting
      • Alfresco Case Study – Canadian Museum of Human Rights
      • Alfresco Case Study – New York Philharmonic
      • Alfresco Case Study – New York Property Insurance Underwriting Association
      • Alfresco Case Study – American Society for Clinical Pathology
      • Alfresco Case Study – American Association of Insurance Services
      • Alfresco Case Study – United Cerebral Palsy
    • HBase
    • DynamoDB
    • OpenText & Documentum Consulting
      • Upgrades – A Well Documented Approach
      • Life Science Solutions
        • Life Sciences Project Sampling
    • Veeva Consulting
    • Ephesoft
    • Workshare
  • Case Studies
    • White Papers
    • 11 Billion Document Migration
    • Learning Zone
    • Digital Asset Collection – Canadian Museum of Human Rights
    • Digital Archive and Retrieval – ASCP
    • Digital Archives – New York Philharmonic
    • Insurance Claim Processing – New York Property Insurance
    • Policy Forms Management with Machine Learning – AAIS
    • Liferay and Alfresco Portal – United Cerebral Palsy of Greater Chicago
  • About
    • Contact Us
  • Blog

Content Service Platform Scaling – How Good Key Design and NoSQL can avoid the need for Elastic/Solr or other indexes

You are here: Home / Alfresco / Content Service Platform Scaling – How Good Key Design and NoSQL can avoid the need for Elastic/Solr or other indexes

April 15, 2020

TSG has implemented multiple large document solutions in our Alfresco, Documentum and NoSQL (DynamoDB and HBase/Hadoop) practices with volumes in the hundreds of millions and sometimes billions of documents. Our most recent FileNet Migration had 4 Billion Documents and in 2019 TSG successfully benchmarked 11 Billion documents internally on DynamoDB. This post will highlight the ‘key’ design and access patterns that we see in many of our large content services deployments to avoid issues that arise when scaling into the 100’s of millions or billions of documents.

As we have posted in the past, there are a few access patterns that a typical content services deployment sees.

Repository Wide Search (Less than 50 million documents)

Platforms like Documentum and Alfresco include a Solr (Lucene under the covers) index of every document in the repository in order to fulfill the requirements of being able to search across the entire repository for specific attributes or combined with full-text search to provide Google like searching within the contents of the documents. For simple deployments of a few million documents, this architecture can scale relatively easily without the need for a complex Solr sharded deployment across multiple servers.

Once a repository begins to reach beyond 50+ million documents, the need for a complex sharded deployment can become necessary to support that many documents. At this scale, the performance of a search can degrade as the infrastructure of these deployments with one massive index that requires each shard in the cluster to provide their result set before any results can be returned to the end user. Issues also come into play with the size of the relational database to hold the meta-data and check for security.

Index on Demand rather than One Index to Rule them All (25 million to 100 million documents)

As we have posted from our Solr and Elasticsearch practices, a much more scalable approach to search is to create separate indexes for each area that needs access to be able to search for documents. By simplifying the architecture, the solution allows for independent scaling for the areas of the business that leverage search to meet each area’s specific Service Level Agreement (SLA).

Separate indices can take advantage of several differences from the one massive index for the entire repository:

  • Scale and tune independent indices rather than one large index
  • Application Level Security rather than per-document security to keep performance consistent
  • Separate Indices and Access for analytics users to “bang on the system” when they run large/expensive queries to do analytics

See our posts on how to leverage a separate index for Solr or also with Elasticsearch for clients that are more comfortable with scaling their Elastic stack. With this “on demand” approach to indexing, the areas of the business that utilize search heavily can get the exact indices that they need to support their use case without jeopardizing the content services platform as a whole.

No Index, No Problem (More than 100 million documents)

In working with clients that have traditionally struggled to scale their large content services platform, we typically see 90% the way users access documents is through one or two defined access patterns. Rather than a “I want to search on anything”, these high volume patterns follow specific business usage.

Particularly for insurance and other case management clients, typically there is one key attribute or number that is the controls access to the case or particular set of documents. Some examples:

  • Claim Number (Insurance)
  • Policy Number (Insurance)
  • Member ID (Healthcare)
  • Incident Number (Government)
  • Matter Number (Legal)
  • Employee ID (HR)
  • Vendor Number (Accounts Payable/Receivable)

For many high volume environments, the vast majority of users access a document based on one key attribute. In these applications, it is important to leverage the proper architecture to allow for fast access to the case without requiring any use of the search/index server. While Elastic or Solr indices can always be added, often, by leveraging a smart key design, many of our case management clients can function access and work the case without a Solr/Elasticsearch index at all.

No SQL Key Design

When a document has a Case ID in the object model, NoSQL can make use of a design pattern by prepending the Case ID to the beginning of the document ID. In this manner, when a user performs a search for documents, HBase design best practices and DynamoDB design best practices dictate a key pattern to allow a lightning fast Range Scan to quickly bring back all of the documents for a particular case. This operation is significantly faster in a large repository, with the added benefit of not needing to leverage the search index at all. We have found that scanning the database directly via these patterns offers predictably fast access to view all documents in a particular claim, regardless if there are 1 million or 11 billion documents in the repository.

In the example of an Insurance Claim management implementation, putting the claim number in the document key makes it a trivial operation for the NoSQL database to get you all documents for that claim. As the system scales to multiple billions of documents, the power of NoSQL key design means lookups will remain efficient and performant at any scale.

Claim Number in the Key

Clients with a case management use case can make use of this design pattern to avoid entirely having to create a massive Solr or Elasticsearch index/infrastructure, which in our experience can be difficult/expensive to maintain for multi billion document repositories. We have many of our case management clients in a production environment without having an index server at all. As we found in our indexing benchmark, the infrastructure costs alone for an index of this size can be multiple times more expensive than the database infrastructure, so we recommend that clients take this approach for case management.

Let us know your thoughts on scaling your content services platform in the comments below.

Filed Under: Alfresco, Documentum, DynamoDB, Elasticsearch, Hadoop, HBase, Insurance, Search

Reader Interactions

Trackbacks

  1. 11 Billion Documents, 12 Months Later - Thoughts and best practices 1 year after our industry leading document benchmark. — Technology Services Group says:
    May 21, 2020 at 8:49 am

    […] our detailed post on Good NoSQL Key Design as a unique approach to provide rapid access for large repositories without the need for […]

    Reply

Leave a Reply Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Primary Sidebar

Search

Related Posts

  • Adobe Acrobat Alternative – Doing more with just a web browser
  • Elastic Services for ECM – TSG OpenContent Roadmap
  • Print to Repository – OpenContent Print Driver Support
  • ECM 2.0 – One-Step vs. Two-Step Migrations
  • ECM 2.0 – Can you build it yourself?
  • ECM 2.0 – What does it mean?
  • Case Analytics for Insurance
  • Redaction for AWS, Alfresco, Documentum and Hadoop – Folder Case Redaction
  • OpenContent Solr Services – New TSG Product Offering
  • Suggested Redactions for Documentum, Alfresco or Hadoop using OpenRedact

Recent Posts

  • Alfresco Content Accelerator and Alfresco Enterprise Viewer – Improving User Collaboration Efficiency
  • Alfresco Content Accelerator – Document Notification Distribution Lists
  • Alfresco Webinar – Productivity Anywhere: How modern claim and policy document processing can help the new work-from-home normal succeed
  • Alfresco – Viewing Annotations on Versions
  • Alfresco Content Accelerator – Collaboration Enhancements
stacks-of-paper

11 BILLION DOCUMENT
BENCHMARK
OVERVIEW

Learn how TSG was able to leverage DynamoDB, S3, ElasticSearch & AWS to successfully migrate 11 Billion documents.

Download White Paper

Footer

Search

Contact

22 West Washington St
5th Floor
Chicago, IL 60602

inquiry@tsgrp.com

312.372.7777

Copyright © 2023 · Technology Services Group, Inc. · Log in

This website uses cookies to improve your experience. Please accept this site's cookies, but you can opt-out if you wish. Privacy Policy ACCEPT | Cookie settings
Privacy & Cookies Policy

Privacy Overview

This website uses cookies to improve your experience while you navigate through the website. Out of these cookies, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may have an effect on your browsing experience.
Necessary
Always Enabled
Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.
Non-necessary
Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.
SAVE & ACCEPT