DynamoDB – Database model for Document Management


November 16, 2018

One of the ongoing myths about DynamoDB for Document Management that we hear too often is "but isn't that just for big data?" This post explains the benefits of DynamoDB's big data capabilities and data model in a Document Management context compared to traditional database systems. Examples include how we are currently building our own DynamoDB offering, scheduled for release at the end of 2018.

DynamoDB – Big Data versus traditional Big Database

At its core, DynamoDB provides a very robust, distributed data store that allows for powerful parallel processing and unique data storage capabilities. Understanding the difference between DynamoDB and traditional databases requires an understanding of the context (and timing) in which each was created.

Relational databases first emerged in the 1980s, when disk speed was slow and disk space and CPU usage were at a premium. Built for critical business systems, relational databases focused on storing known data in a static data model. DBAs were extensively employed to update the model and add indexing and other performance improvements. Performance tuning reached down to the hardware level, to where individual fields were stored within the disk array, a very expensive component back in the day. Most modern ECM solutions, which emerged in the 1990s and 2000s, rely on the critical document management fields being stored in a relational database.

DynamoDB, like Hadoop and other "Not Only SQL" (NoSQL) repositories, follows the more modern approach made possible by huge gains in the economics of disk space and hardware cost, along with new requirements for unstructured big data. DynamoDB allows for very quick, distributed/parallel retrieval of a specific data record that can contain data in a variety of different formats.

DynamoDB – Schema-on-Read versus Schema-on-Write

One of the big differences between DynamoDB and a traditional RDBMS is how data is organized in a schema. Traditional databases require Schema-on-Write, where the DB schema is very static and must be well-defined before the data is loaded. The process for Schema-on-Write requires:

  • Analysis of data processes and requirements
  • Data Modeling
  • Loading/Testing

If any of the requirements change, the process has to be repeated.

Schema-on-Read takes a less restrictive approach that allows raw, unprocessed data to be stored immediately. How the data is used is determined when the data is read. The table below summarizes the differences:

| Traditional Database (RDBMS) | DynamoDB |
| --- | --- |
| Create static DB schema | Copy data in its native format |
| Transform data into the RDBMS schema | Create schema and parser |
| Query data in RDBMS format | Query data in its native format |
| New columns must be added by a DBA before new data can be added | New data can start flowing in at any time |

Schema-on-Read provides a unique benefit for Big Data in that data can be written to DynamoDB without having to know exactly how it will be retrieved.
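
As a rough illustration, the sketch below (Python with boto3; the table and attribute names are our own assumptions, not a prescribed model) writes two documents with different attribute sets to the same table. No DBA or schema migration is involved:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Objects")  # hypothetical table keyed on objectId

# Document written with the attributes known at design time
table.put_item(Item={
    "objectId": "doc-001",
    "objectName": "policy.pdf",
    "creationDate": "2018-11-16",
})

# A later document simply carries attributes the original model never anticipated
table.put_item(Item={
    "objectId": "doc-002",
    "objectName": "claim.tif",
    "claimNumber": "CLM-4821",
    "adjuster": "jsmith",
})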

DynamoDB – what it means for Big Data (and big Document issues)

In a Big Data world, data needs to be captured without requiring knowledge of the structure that will hold it. As is often mentioned, stores like DynamoDB can be used for consumer/social sites that need to store a huge amount of unstructured data quickly, with the consumption of that data coming at a later time.

As a typical Big Data example, a social site stores all of the different links clicked on by a user. The storing application might store the date, time, and other data in a single record for the user that is updated each time the user returns. Given the user, a retrieval application can quickly access a large, semi-structured record of that particular user's activity over time.

Also, using Amazon allows the solution to leverage S3 buckets. While content could technically be stored as bytes within DynamoDB, it makes more sense to use Amazon's S3 for content storage and store a simple S3 link within the DynamoDB metadata to connect the two. This allows DynamoDB to function as designed, as a pure database, rather than serving as both a metadata store and a content store.
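
A minimal sketch of this S3-link pattern follows (again boto3; the bucket, key, and table names are illustrative assumptions):

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("Objects")

bucket = "example-content-store"          # hypothetical S3 bucket
key = "documents/doc-001/v1/policy.pdf"   # hypothetical object key

# Content bytes live in S3...
with open("policy.pdf", "rb") as f:
    s3.upload_fileobj(f, bucket, key)

# ...while DynamoDB holds the metadata plus a simple S3 link
table.put_item(Item={
    "objectId": "doc-001",
    "objectName": "policy.pdf",
    "content": f"s3://{bucket}/{key}",
})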

DynamoDB for Document Management – It’s not about Search

Schema-on-Write works very well for "known unknowns", or what we would typically call a document search in document management, something that Schema-on-Read does not handle as well. To illustrate, take a typical document management requirement: search for all documents requiring review by a given date.

In the RDBMS example, the date column would be joined in a query with the "required review" column to quickly provide a list of all the documents needing review. If the performance is not acceptable, indexes could be added to the database.

In the DynamoDB example, ALL of the documents in the repository would first need to be retrieved and opened; once opened, the required-review and date data would be read to build the list. Without a secondary index defined in advance on those exact attributes, there is no index that can speed up this kind of ad-hoc query.
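
To make the cost concrete, the sketch below expresses the review search as a DynamoDB Scan (attribute names are hypothetical). The FilterExpression only trims results after every item has already been read, which is why it cannot substitute for an index:

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("Objects")

# Scan reads every item in the table; the filter is applied only afterwards
response = table.scan(
    FilterExpression=Attr("requiredReview").eq(True)
                     & Attr("reviewDate").lte("2018-12-31")
)
docs = response["Items"]  # pagination via LastEvaluatedKey is still needed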

After this example, a typical document management architect might conclude that DynamoDB doesn't really fit this basic requirement. What this opinion doesn't take into account is the emergence of the "search appliance", particularly Lucene/Solr/Elasticsearch, as the default indexing/searching engine for document management.

For a better performing search in BOTH the RDBMS and DynamoDB implementations, we would recommend leveraging Lucene/Solr/Elasticsearch to index the document's fulltext AND metadata for superior search performance. All modern ECM/Document Management vendors (Documentum and Alfresco, as examples) now leverage some type of Lucene/Solr/Elasticsearch engine for all search.

Amazon provides hosted Solr-based (Amazon CloudSearch) and Elasticsearch services as an easy way to deploy a combined DynamoDB architecture. By having AWS manage and maintain the physical architecture, organizations get easier integration between the combined environments.
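
One common way to wire the two together is a Lambda function fed by DynamoDB Streams that pushes each changed item into the search engine. The sketch below is one hedged possibility, not our shipped implementation; the endpoint, index name, and attribute flattening are assumptions:

from elasticsearch import Elasticsearch

# Hypothetical AWS-hosted Elasticsearch endpoint
es = Elasticsearch("https://search-example.us-east-1.es.amazonaws.com")

def handler(event, context):
    """Lambda handler receiving DynamoDB Stream records."""
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # Stream images are type-annotated (e.g. {"S": "value"});
            # flatten the simple string attributes for indexing
            doc = {name: attr.get("S", str(attr)) for name, attr in image.items()}
            es.index(index="documents", id=doc["objectId"], body=doc)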

DynamoDB – Big Data for Document Management – Schema-on-Read Example

The major advantage of Schema-on-Read is the ability to easily store all of a document's metadata without having to define columns up front. Some quick examples (with a sketch following the list) include:

  • Versioning – One difficulty with most document management tools is that, given a structured Schema-on-Write DB model, the ability to store different attributes on different versions is not always available or requires a work-around. With DynamoDB, each version has its own data, and different attributes can be stored with different versions of a document.
  • Audit Trail – This is one we often see implemented with a single "audit" database table that ends up getting huge (and is itself a big data application). With DynamoDB, the audit trail can be maintained as a single entry on that document's row and quickly retrieved and parsed.
  • Model Updates – Metadata often needs to be added to a content model after a system has gone live, because the content it stores matures with it. With DynamoDB, new metadata can always be written in as a new column.
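
A sketch of the audit trail and model update bullets (attribute names are illustrative): a single update_item call appends an audit entry to the document's row and adds a brand-new attribute, with no schema migration:

import boto3

table = boto3.resource("dynamodb").Table("Objects")

table.update_item(
    Key={"objectId": "doc-001"},
    UpdateExpression=(
        "SET audits = list_append(if_not_exists(audits, :empty), :entry), "
        "reviewStatus = :status"  # brand-new metadata added after go-live
    ),
    ExpressionAttributeValues={
        ":empty": [],
        ":entry": [{"user": "jsmith", "action": "checkout",
                    "date": "2018-11-16T10:30:00"}],
        ":status": "pending",
    },
)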

DynamoDB – Building a Document Model

Object Model

{
  objectId (key)
  objectName
  title
  modifyDate
  creationDate
  creator
  contains
  audits
  ...
  content (S3 link)
  rendition (S3 link)
}

The Objects table contains a single row for each document, including:

  • metadata
  • content (reference to S3 object)
  • renditions (references to S3 object)

This is a rough schema of common data that could exist when a document is added, but since DynamoDB is schemaless, the system is not bound by the model above and can evolve and change. Every document can add columns "on the fly", with no database schema updates when new metadata needs to be captured. One relevant example of this approach is a corporate acquisition or merging of repositories, when documents from the old company need to be entered into the system quickly. The metadata can be dumped into the DynamoDB repository as-is (see the sketch below), without having to worry about mapping individual fields to the existing schema.
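For the acquisition scenario, a hedged sketch (the legacy rows and field names are invented for illustration) could batch-write the old system's metadata in unchanged, adding only the required key:

import uuid
import boto3

table = boto3.resource("dynamodb").Table("Objects")

# Metadata exported from the acquired company's system, fields as-is
legacy_rows = [
    {"LegacyDocNo": "A-1001", "Dept": "Claims", "FileName": "a1001.pdf"},
    {"LegacyDocNo": "A-1002", "Region": "Midwest"},  # different fields -- fine
]

with table.batch_writer() as batch:
    for row in legacy_rows:
        # Only the key is required; everything else flows in unmapped
        batch.put_item(Item={"objectId": str(uuid.uuid4()), **row})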

Summary

DynamoDB has an advantage over traditional RDBMS systems in that data can be stored and retrieved quickly in an unstructured record without requiring extensive analysis of how that data will typically be retrieved. The drawback of a DynamoDB-only approach is that searches on any attribute other than the documentId do not perform well at scale. We would advise that DynamoDB implementations utilize Lucene/Solr/Elasticsearch as the query component of a Document Management solution to allow highly performant searches against the repository.

Stay tuned to the blog for more developments on using DynamoDB as a document management repository. Please leave your comments below.

Filed Under: Amazon, DynamoDB

