DynamoDB and AWS – How to build your own ECM capabilities for massive scale and performance

After the successful 11 Billion Document Benchmark with DynamoDB, we had a discussion with a couple of the major analysts and received a nice shout-out from Jeff Potts on ECM Architect. We received a ton of great feedback on the scale of what we were able to accomplish, along with lots of discussion about how we built it and the lessons we learned. This post breaks down the components of our NoSQL (DynamoDB and Hadoop) ECM offering, with background and a comparison to the broader ECM market, to explain our strategy.

ECM 2.0 (or Content Services) – Building and simplifying from the bottom up rather than the complexity of top down

Back in the 1990s, ECM began with goals to:

Replace network drives
Manage all types of documents in one shared repository (when CPU, memory and storage were very expensive)
Provide one interface, which was often seen as the document management application
Allow for security and search across the repository for a variety of user scenarios

Back in the day, the core infrastructure for the ECM platform was a relational database. Those of us who had to “look under the covers” to tweak performance would often see pages of SQL to tie in the complex data model of the repository, combined with even more pages of SQL to wrap in the security and Access Control List capabilities. When Verity and other full-text search tools were added on top, the security, object model and tuning got even more complicated.

In building a new content services model, we wanted to question some of the primary tenets of the old ECM model, specifically the “Enterprise” part of ECM: the assumption that enterprises want one central place to store lots of different documents, with one index to search them all.

With a move toward content services and away from a “document management as an application” focus, modern content services need to be rebuilt from the ground up around a new approach rather than relying on paradigms from the 1990s that haven’t worked. This post will break down each of the ECM components and present how and why TSG built our alternatives with NoSQL to give clients the benefits of a modern architecture and the reduced cost of the new approach, combined with all of the capabilities of a legacy ECM suite.

Application Security rather than Repository Security

One of the many performance improvements for content services clients is moving the security control from the repository to the application. Consider the two paradigms below:

Repository Security – ECM as an application – When a user enters the document management system, if they are in the Human Resources department they will have access to the HR documents.
Application Security – Content Services – Only the Human Resources system will have access to content services that access the HR documents.

For repository security and document management as a shared application, every user needs to be defined within the ECM repository and placed in an Access Control List or other mechanism that grants access to the respective documents. While this is typically handled through LDAP integration, it still adds complexity and performance degradation to the query.

For Application Security, what if HR is the only department that can access the content services that access HR documents? Building from the bottom up, content services could make sure that only HR has access to the queries that reach HR documents (which could be fine-tuned by role), and potentially give HR a small repository all to themselves. The result is simplified searching and better performance, and it could still include LDAP integration. Many times we see that another application controlling access to the data that points to documents (for example, the ERP system for employee data) is the only way to access the respective employee documents.

By pushing security to the application, the repository can perform quicker with less complexity and management. We see this often with our case management clients and with large clients such as those handling insurance claims.
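The application-security paradigm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the application names, the service name "hr-documents" and the SERVICE_GRANTS table are hypothetical and are not part of any TSG product. The point it shows is that the check is made once against the calling application, so the repository lookup carries no per-document ACL logic.

```python
from dataclasses import dataclass

# Hypothetical grant table: which calling application may use which content service.
SERVICE_GRANTS = {
    "hr-portal": {"hr-documents"},
    "claims-app": {"claims-documents"},
}


@dataclass(frozen=True)
class Caller:
    application: str  # the system making the request, e.g. the HR portal
    role: str         # available for role-based fine-tuning inside the application


def can_use_service(caller: Caller, service: str) -> bool:
    """Return True if the caller's application is granted the content service."""
    return service in SERVICE_GRANTS.get(caller.application, set())


def fetch_hr_document(caller: Caller, doc_id: str) -> dict:
    """Check the application grant, then do a plain lookup with no ACL evaluation."""
    if not can_use_service(caller, "hr-documents"):
        raise PermissionError(f"{caller.application} may not use hr-documents")
    # Placeholder for a simple key lookup in the repository.
    return {"id": doc_id, "service": "hr-documents"}
```

A request from the hypothetical claims-app for an HR document is rejected before the repository is ever queried, while the HR portal gets a direct lookup.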

Departmental Search versus Repository Search

Coming from the top down, ECM tools typically have a “Search the Repository for X” mentality. See our thoughts from 2011 on “what to do if a user asks for a Google Search”. As the failures of Enterprise Search suggest, most content services clients are looking for specific searching scenarios. Consider the two paradigms below:

Enterprise/Repository Search – Document Management as an application – I would like to search the entire repository/organization for information about an employee
Departmental Search – Content Services – I would like to search just HR for information about an employee

With the success, capabilities and cost of robust search tools like Solr and Elasticsearch, TSG has been recommending for years that clients create focused indexes rather than one large index to serve every purpose. Similar to the security point above, what if HR had its own index, potentially spanning multiple repositories or content services, that included its security needs?

By pushing for multiple, efficient search indices rather than one large index, content services can perform quicker with less complexity.
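As a rough sketch of the focused-index idea, the Python below keeps one small inverted index per department and only ever consults the index that was asked for. It is a toy in-memory stand-in, not Solr or Elasticsearch, and the class name, department names and document ids are all hypothetical.

```python
from collections import defaultdict


class DepartmentIndexes:
    """Toy stand-in for one focused search index per department."""

    def __init__(self):
        # department -> term -> set of document ids
        self._indexes = defaultdict(lambda: defaultdict(set))

    def add(self, department, doc_id, text):
        """Index each lowercase word of the text under the given department."""
        for term in text.lower().split():
            self._indexes[department][term].add(doc_id)

    def search(self, department, term):
        """Look up a term in one department's index only."""
        index = self._indexes.get(department, {})
        return set(index.get(term.lower(), set()))


indexes = DepartmentIndexes()
indexes.add("hr", "emp-1", "Jane Smith offer letter")
indexes.add("finance", "inv-9", "Smith consulting invoice")

hr_hits = indexes.search("hr", "smith")            # only the HR index is read
finance_hits = indexes.search("finance", "smith")  # only the finance index is read
```

Because each search touches a single, smaller index, an HR query never has to filter out finance documents, which is the simplification the departmental approach is after.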

Object Store versus Mounted Drives

Back in the 1990s, setting up an ECM system involved setting up the database and the storage hardware itself. Often we would have to configure a RAID 0 disk array for the database with RAID 5 for the storage of the documents.

With the evolution of SANs and object stores, TSG has seen more and more clients move from having the ECM system manage the storage of content to having the object store manage the content, with a link placed in the content repository. This includes not only the storage of content but also integration with the cloud, encryption and a host of other capabilities that used to be in the domain of the ECM system. See our thoughts from last year on how cloud object stores will disrupt ECM.

The object store, whether in the cloud or on premises, is better positioned than the repository API to store and retrieve content (with speed and streaming) directly from the browser. Does it still make sense for the repository to include APIs and other capabilities for content storage, update and retrieval?

By building content storage and retrieval with the content store in mind, content services can retrieve, store and stream content quicker with less complexity.
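The "link placed in the content repository" idea can be sketched as follows. This is a minimal Python illustration, assuming an S3-style object store; the class names, bucket and key are hypothetical. The repository record holds only a pointer, and the service hands back a location instead of proxying the bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentLink:
    """Pointer kept in the repository; the bytes themselves live in the object store."""
    bucket: str
    key: str

    def uri(self) -> str:
        # s3:// style location, assumed here purely for illustration
        return f"s3://{self.bucket}/{self.key}"


@dataclass
class DocumentRecord:
    doc_id: str
    title: str
    content: ContentLink


def content_location(record: DocumentRecord) -> str:
    """Return where a client should fetch the content from."""
    return record.content.uri()


record = DocumentRecord(
    doc_id="doc-1",
    title="Offer letter",
    content=ContentLink(bucket="example-bucket", key="hr/doc-1.pdf"),
)
location = content_location(record)
```

In a real deployment the location handed to a browser would more likely be a time-limited signed URL issued by the object store, so that streaming happens between the browser and the store without passing through the repository.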

NoSQL versus relational database for the Metadata, Versioning and Relationships

In talking with multiple analysts and experienced ECM clients, one thing all agree on is that a software company building a new metadata and document relationship repository today would develop on a NoSQL repository rather than a traditional relational database. See our post and whitepaper from earlier this year on Why Big Data will disrupt Document Management; some relevant points to consider:

Simplicity of design
Simpler “horizontal” scaling to clusters of machines (which is a problem for relational databases), and finer control over availability.
The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL, specifically storage and retrieval of document properties.
The data structures used by NoSQL databases are viewed as more flexible than relational database tables.

Specifically for Document Management customers, there is a simple difference between the two approaches.

Relational Database – Would store the attributes in columns/rows of the relational database with a pointer to the document file location in a SAN or object store.
NoSQL – Would store the attributes in an entry with tags/metadata that describe the document, possibly along with the document content itself, either in the repository or in a SAN or object store. Tags can be XML, JSON or a variety of other alternatives.

To illustrate, consider the pictures below, which compare a document management relational database layout (Documentum) for a document and its properties (8+ tables) with one simple JSON object.
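For readers who want a textual version of the single-object idea, here is a hedged Python sketch of what one document entry might look like. Every field name and value is hypothetical and chosen only for illustration; it is not TSG's actual object model. Properties, version information, relationships and the content pointer all travel together in one object.

```python
import json

# One self-contained entry per document (all field names are illustrative).
document = {
    "id": "doc-0001",
    "objectType": "hr_document",
    "properties": {
        "employeeId": "E12345",
        "documentType": "Offer Letter",
        "title": "Offer letter for E12345",
    },
    "version": {"label": "1.0", "isCurrent": True},
    "relationships": [{"type": "attachment_of", "target": "doc-0000"}],
    "content": {"store": "object-store", "location": "hr/doc-0001.pdf"},
}

# The whole entry serializes to, and restores from, a single JSON string.
serialized = json.dumps(document, sort_keys=True)
restored = json.loads(serialized)
```

Reading the document back is a single fetch and a single parse, with no joins across property, version and relationship tables.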

It isn’t difficult to see that this simplicity of design makes ingestion and retrieval patterns for typical content services actions faster and simpler. With our latest DynamoDB benchmark, TSG was able to ingest 20,000 documents per second into an 11 billion document repository, compared with a maximum of 250 documents per second when calling the traditional APIs of relational-database ECM repositories, which struggled beyond 100 million documents.
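To put the two ingestion rates in perspective, the short Python calculation below estimates how long an 11 billion document load would take at each rate. It assumes, purely for the arithmetic, that each rate could be sustained for the entire load, which the benchmark description does not claim.

```python
TOTAL_DOCUMENTS = 11_000_000_000
SECONDS_PER_DAY = 86_400


def days_to_ingest(total_documents: int, docs_per_second: int) -> float:
    """Days needed to load total_documents at a constant rate."""
    return total_documents / docs_per_second / SECONDS_PER_DAY


nosql_days = days_to_ingest(TOTAL_DOCUMENTS, 20_000)    # about 6.4 days
relational_days = days_to_ingest(TOTAL_DOCUMENTS, 250)  # about 509 days
speedup = relational_days / nosql_days                  # 80x
```

Under that assumption the difference is roughly a week versus well over a year, an 80-fold gap.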

One great advantage of NoSQL is its simplicity as it relates to ingestion speed. Since there is only one “write” per document, we found that our repository did not slow down as it got larger. For our benchmark, the first thousand and the first billion documents were stored as fast as the last billion. We were also fairly confident that, if we added more servers and OpenMigrate instances, we could scale past 20,000 documents a second.
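The "one write per document" point can be illustrated with a small Python sketch that turns a document's properties into a single item in DynamoDB's low-level attribute-value notation (S for strings, N for numbers, BOOL, NULL, L for lists, M for maps). The helper names and the sample fields are hypothetical, and this is not how OpenMigrate is implemented; it only shows that a whole document, nested properties included, fits in one item and therefore one write request.

```python
def to_attribute_value(value):
    """Convert a Python value to DynamoDB's low-level attribute-value form."""
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers are sent as strings
    if isinstance(value, str):
        return {"S": value}
    if value is None:
        return {"NULL": True}
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def to_item(document: dict) -> dict:
    """Build the single item that represents one document."""
    return {name: to_attribute_value(v) for name, v in document.items()}


item = to_item({
    "id": "doc-1",
    "pages": 3,
    "current": True,
    "tags": ["hr"],
    "props": {"employeeId": "E1"},
})
```

The resulting item is what a single put request would carry, in contrast to a relational layout where the same properties would be spread across several tables and several inserts.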

By building metadata and document relationship objects with a NoSQL repository rather than relational database, content services can store and retrieve document properties and relationships quicker with less complexity.

But what about Records Management?

As we have mentioned before, one of the last things implemented for most ECM repositories is the records management component. Most modern records management platforms come with a robust, Department of Defense-capable add-on that the vendors are happy to sell, but few companies implement it to the full extent of the capabilities the tools provide. At its most basic, there are three activities in an RM program: retention, disposition, and holds.

TSG often highlights a solution-based approach that leverages minor customizations and out-of-the-box (OOTB) capabilities to satisfy most electronic and physical records management requirements regardless of the repository. With a build-from-the-bottom-up strategy, clients should consider whether a light records management approach will give them adequate and cost-effective compliance.
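A light approach to the three basic activities (retention, disposition and holds) might look something like the Python sketch below. The Record fields and the rule that retention is a whole number of years from a start date are assumptions made for illustration; real retention schedules are usually driven by policy and event triggers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    doc_id: str
    retention_start: date   # e.g. the date the document was declared a record
    retention_years: int    # retention period from the schedule
    on_hold: bool = False   # a legal hold blocks disposition


def retention_end(record: Record) -> date:
    """Date on which the retention period ends."""
    start = record.retention_start
    target_year = start.year + record.retention_years
    try:
        return start.replace(year=target_year)
    except ValueError:
        # Feb 29 start date landing in a non-leap year: fall back to Feb 28
        return start.replace(year=target_year, day=28)


def eligible_for_disposition(record: Record, today: date) -> bool:
    """A record can be disposed of once retention has ended and no hold applies."""
    return not record.on_hold and today >= retention_end(record)


sample = Record("doc-1", date(2012, 6, 1), 7)
ready = eligible_for_disposition(sample, date(2019, 6, 1))
```

A nightly job could run a check like this over candidate records and queue the eligible ones for review, which covers the core of retention, disposition and holds without a full records management add-on.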

Modern Content Services Interfaces versus Document Management Applications

The last component of the new content services model is the interface, along with the other services that round out the “suite” of content services capabilities. TSG has a substantial advantage over those looking to build on their own with a custom interface, due to the portability of all our products from current ECM tools to DynamoDB or Hadoop offerings. TSG has a jump start with interfaces for search, viewing, annotation, redacting, form and workflow, migration, audit trail and other integrations that have been proven in production environments and offer “out of the box” configurable approaches for DynamoDB and Hadoop. We would recommend that those building solutions look to leverage other products or simplify their development efforts to avoid building difficult-to-support custom interfaces.

Clients looking to build similar capabilities for their modern solution should consider some of the capabilities of the OpenContent Management Suite including:

OpenContent Web Services – Provides isolation as well as a robust API to access Alfresco, Documentum, Hadoop or DynamoDB repositories. All of our interfaces are built on OpenContent Web Services so that DynamoDB or Hadoop customers have access to interfaces that have been thoroughly tested and in production for other repositories. For DynamoDB and Hadoop, we have added 100% support for OpenContent Web Services to support versioning, relationships and other complex ECM capabilities.
OpenAnnotate – Provides for viewing, annotation and redaction of documents. Based on OpenContent Web Services, OpenAnnotate works with DynamoDB and Hadoop “out of the box”. Currently available with OpenContent Search and Case on the AWS Marketplace for DynamoDB.
OpenContent Search and Case – Provides for robust Search and Case Management features configurable for different object models and easily supportable. Currently available with OpenAnnotate on the AWS marketplace for DynamoDB.
OpenContent Forms – Provides form and workflow capabilities leveraging Activiti.
No Code Configuration – OpenContent Management Suite features configuration with a no code approach rather than low code approaches to allow for easy support and adoption of new documents and object models.
OpenMigrate – OpenMigrate supports DynamoDB and Hadoop as well as a host of other ECM repositories. Clients will need products for migration, ingestion and publishing. OpenMigrate also provides integration to Ephesoft to allow for capture. Sign up to attend the webinar tomorrow with Tony Parzgnat – Senior Product Manager – OpenMigrate.
ELK Audit Trail – One of the more interesting add-ons for monitoring access and performance is to leverage the ELK stack (Elasticsearch, Logstash and Kibana) to record activity. See our post and look for more on a better monitoring and auditing tool with ELK.
Other – lots of other integrations, including DocuSign, WorkShare Compare, Box, Office 365 and the rest, are part of the OpenContent Management Suite.

Our 11 Billion document benchmark with AWS and DynamoDB leveraged all of the capabilities above with the exception of the OpenContent Forms component (although that would not be difficult to add).
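As an illustration of the audit-trail item in the list above, the Python sketch below writes one JSON line per user action, a shape that log shippers such as Logstash can parse and forward for dashboards. The function name and the field names (other than the conventional @timestamp) are hypothetical and do not describe the format TSG's products emit.

```python
import json
from datetime import datetime, timezone


def audit_event(user, action, doc_id, duration_ms, when=None):
    """Return one JSON line describing who did what to which document."""
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {
            "@timestamp": when.isoformat(),
            "user": user,
            "action": action,          # e.g. "view", "annotate", "search"
            "docId": doc_id,
            "durationMs": duration_ms,  # lets the same log feed performance charts
        },
        sort_keys=True,
    )


line = audit_event(
    "jsmith", "view", "doc-1", 42,
    when=datetime(2019, 5, 28, 7, 24, tzinfo=timezone.utc),
)
```

Recording the duration alongside the action is what lets one stream of events serve both the access audit and the performance monitoring uses mentioned above.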

Summary

For those looking to build their own AWS/DynamoDB or Hadoop solution, we shared some of the underlying concepts and recommendations from building our own NoSQL offerings including:

pushing security to the application so the repository can perform quicker with less complexity and management.
pushing for multiple, efficient search indices rather than one large index so content services can perform quicker with less complexity.
building content storage and retrieval with the content store in mind where content services can retrieve, store and stream content quicker with less complexity.
leveraging NoSQL rather than relational database approaches to improve flexibility and performance while simplifying and reducing cost.

One issue those building will have to address is the actual interface for content services. TSG has a substantial advantage over “build from scratch” efforts due to the portability of all our products from current ECM tools to DynamoDB or Hadoop offerings. TSG has a jump start with interfaces for search, viewing, annotation, redacting, form and workflow, migration and other integrations that have been proven in production environments and offer “out of the box” configurable approaches for DynamoDB and Hadoop.

For those looking to build their own front-end, TSG would recommend leveraging other products where possible or simplifying the development effort to avoid building difficult-to-support custom interfaces.

Look for more posts this and next week as we enter Phase 2 of our 11 Billion Document benchmark and let us know your thoughts below.

Comments

Mike Waldrop says

May 28, 2019 at 7:24 am

Dave, this is great stuff. I’ve always suspected this approach would not only address scale concerns, but ultimately actually create a much cleaner architecture. I really wonder what your customers are asking for in relation to a single centralized content system vs. a distributed approach. It would seem you have an opportunity to approach things differently with this architecture – so I’m curious what feedback you are getting from customers about the desire or need to address requirements like a distributed user community that requires a multi-datacenter type implementation. This has always been very difficult with traditional pre-built ECM systems – but the elements you are using are more conducive to a distributed approach I would think.

Dave Giordano says

May 28, 2019 at 7:48 am

Mike, Most of our customers are still working their way to the cloud and even the ones in the cloud are doing single implementations. I would agree that a cleaner approach to ECM does open up the possibilities of a better multi-center approach, particularly what we are trying to do with AWS.

Sam says

June 4, 2019 at 1:33 am

Dave – This is a great article. What approach are you seeing customers taking in securing content or search services by departments or role level security for documents and folders? I have seen this to be one aspect that really complicated the ECM implementations.

Dave Giordano says

June 4, 2019 at 8:53 am

Sam – We would recommend adding security at the application level, whether that be limiting a department to the repository or adding it at the search layer. For example, only HR can access HR documents but maybe the application can let an employee see only parts of just their HR folder. Typically we will integrate LDAP.
