• Skip to primary navigation
  • Skip to main content
  • Skip to primary sidebar
  • Skip to footer
TSB Alfresco Cobrand White tagline

Technology Services Group

  • Home
  • Products
    • Alfresco Enterprise Viewer
    • OpenContent Search
    • OpenContent Case
    • OpenContent Forms
    • OpenMigrate
    • OpenContent Web Services
    • OpenCapture
    • OpenOverlay
  • Solutions
    • Alfresco Content Accelerator for Claims Management
      • Claims Demo Series
    • Alfresco Content Accelerator for Policy & Procedure Management
      • Compliance Demo Series
    • OpenContent Accounts Payable
    • OpenContent Contract Management
    • OpenContent Batch Records
    • OpenContent Government
    • OpenContent Corporate Forms
    • OpenContent Construction Management
    • OpenContent Digital Archive
    • OpenContent Human Resources
    • OpenContent Patient Records
  • Platforms
    • Alfresco Consulting
      • Alfresco Case Study – Canadian Museum of Human Rights
      • Alfresco Case Study – New York Philharmonic
      • Alfresco Case Study – New York Property Insurance Underwriting Association
      • Alfresco Case Study – American Society for Clinical Pathology
      • Alfresco Case Study – American Association of Insurance Services
      • Alfresco Case Study – United Cerebral Palsy
    • HBase
    • DynamoDB
    • OpenText & Documentum Consulting
      • Upgrades – A Well Documented Approach
      • Life Science Solutions
        • Life Sciences Project Sampling
    • Veeva Consulting
    • Ephesoft
    • Workshare
  • Case Studies
    • White Papers
    • 11 Billion Document Migration
    • Learning Zone
    • Digital Asset Collection – Canadian Museum of Human Rights
    • Digital Archive and Retrieval – ASCP
    • Digital Archives – New York Philharmonic
    • Insurance Claim Processing – New York Property Insurance
    • Policy Forms Management with Machine Learning – AAIS
    • Liferay and Alfresco Portal – United Cerebral Palsy of Greater Chicago
  • About
    • Contact Us
  • Blog

Alfresco – Adding the ability to query on very large data sets

You are here: Home / Alfresco / Alfresco – Adding the ability to query on very large data sets

January 30, 2018

During a recent migration from an on-premise Documentum system to Alfresco on Amazon Web Services, a client wanted to check the target repository for document counts and other pertinent information to ensure that the migration executed smoothly.  For example, a client may want to query to know “how many documents of type X are in the repository”?  If the migration says that 50,000 documents have been migrated, then that query should return a count of 50,000.  With out of the box in Alfresco, however, queries are limited to a maximum return size of 2,000 results.  While this number can be increased via an API call for a particular query, we see many of our clients running into issues when attempting to query on large document sets (ex: more than 100,000 documents) such as queries timing out and returning an incomplete result (how many were returned before the timeout) that is often inconsistent from one query to the next.  In this post, we will explore how the OpenAdmin module of the OpenContent Management Suite has been improved to enhance Alfresco and address this requirement.

Why is this a Problem?

To be clear, Alfresco can support very large data sets and querying these documents out of the box.  For clients that require this functionality, Alfresco recommends to have multiple external Solr instances, load balanced and sharded to accommodate the query load.  However some clients, particularly those with large data sets and who are exclusively meta-data searching, don’t require or desire the extensive Solr  infrastructure.  These clients really only need to return very large result sets during migrations or during other administrative tasks.  For these clients, a single Alfresco database compliant query (check out our previous posts on utilizing the Alfresco database and Alfresco TMQ Queries for more information) that runs against the database does not perform well and can time out, causing the results to differ between executions.  This issue has been logged to Alfresco.

For many of our clients with large data sets, it makes sense to utilize the database as much as possible.  For these clients, it’s important to know when the Alfresco query will hit the database or Solr.  The posts linked in the previous paragraph have more detail about database vs. Solr searching and Solr eventual consistency issues, but some situations may not be expected.  For example, if your query has date or boolean searches, the query will execute against Solr rather than the database.  To avoid Solr issues altogether, some of our clients with very large data sets (hundreds of millions of documents) opt to forego using Solr at all.

OpenContent Admin to the Rescue

To solve this issue as well as offer the highest performance and flexibility, we added a feature to OpenContent Admin that allows an administrator to run a read-only SQL query directly against the Alfresco database.  If the query passes, we run it under a pre-configured read-only database user account to get the results.  Finally, we package the results in an Excel spreadsheet and stream it back to the admin.  The result is a relatively simple interface:

Admin Query to return Document Counts:

The following query simply counts the number of ‘claimsDocument’ objects in the repository.  To see examples of other types of queries, check out the OpenAdmin Query Console page on our product wiki.

Query Result Excel File

Note – we return the results as a spreadsheet in order to support queries that return many thousands of results.  In the future, we will add an enhancement that will display the results inline if less than a configured number of results (ex: 2,000) are present.

As for benchmarking, a current client was looking to query for document counts on a type that numbered just over 300,000 total documents.  With an Alfresco database query, the operation times out after 45 minutes and returns inconsistent results.  Utilizing the OpenAdmin direct SQL interface, the operation returns successfully in 8 seconds.

Summary

One item to keep in mind with our direct SQL approach is that since we are querying directly against the database, it is difficult to filter documents out based on security.  However, administrative users can run queries quickly against very large data sets to return overall document counts or large result sets, without needing to setup a load balanced and sharded external Solr cluster.

In the future, we are planning a number of enhancements to the direct SQL tool.  As mentioned above, we could eventually display a results table for smaller result sets.  Additionally, Saved Queries make sense to allow administrators to quickly run common queries.  We could further enhance the tool to allow for “build a query” type functionality where the administrator selects types and attributes via dropdowns rather than typing in the SQL directly.

Let us know your thoughts below.

 

Filed Under: Alfresco, OpenContent Web Services, Search

Reader Interactions

Trackbacks

  1. Alfresco Performance – Making Property Queries Faster says:
    June 13, 2018 at 2:13 pm

    […] OpenContent Web Services has typically just provided a front-end to either the Alfresco API or CMIS implementation doing the normal database or Solr Query.  OpenContent Web Services executes the Alfresco query and then calls the property API and passes the results back.  Back in January, for our large clients, TSG started executing database calls within our OpenAdmin tool to provide counts and other administrative access to the Alfresco database.  See related post – Alfresco – Adding the ability to query on very large data sets. […]

    Reply

Leave a Reply Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Primary Sidebar

Search

Related Posts

  • Elastic Services for ECM – TSG OpenContent Roadmap
  • Content Service Platform Scaling – How Good Key Design and NoSQL can avoid the need for Elastic/Solr or other indexes
  • Alfresco No Code – Do More with OpenContent
  • Alfresco – Building a Search Interface – ADF vs. OpenContent Management Suite
  • Federated Search and Content Services – Is a Publishing Approach better?
  • Alfresco – ADF and OpenContent Search Comparison
  • Alfresco – ADF and OpenContent Comparison- REST Web Services
  • Alfresco Performance – Making Property Queries Faster
  • OpenContent Solr Services – New TSG Product Offering
  • Do More with Your Alfresco Deployment – HPI vs. a Custom Search Interface

Recent Posts

  • Alfresco Content Accelerator and Alfresco Enterprise Viewer – Improving User Collaboration Efficiency
  • Alfresco Content Accelerator – Document Notification Distribution Lists
  • Alfresco Webinar – Productivity Anywhere: How modern claim and policy document processing can help the new work-from-home normal succeed
  • Alfresco – Viewing Annotations on Versions
  • Alfresco Content Accelerator – Collaboration Enhancements
stacks-of-paper

11 BILLION DOCUMENT
BENCHMARK
OVERVIEW

Learn how TSG was able to leverage DynamoDB, S3, ElasticSearch & AWS to successfully migrate 11 Billion documents.

Download White Paper

Footer

Search

Contact

22 West Washington St
5th Floor
Chicago, IL 60602

inquiry@tsgrp.com

312.372.7777

Copyright © 2023 · Technology Services Group, Inc. · Log in

This website uses cookies to improve your experience. Please accept this site's cookies, but you can opt-out if you wish. Privacy Policy ACCEPT | Cookie settings
Privacy & Cookies Policy

Privacy Overview

This website uses cookies to improve your experience while you navigate through the website. Out of these cookies, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may have an effect on your browsing experience.
Necessary
Always Enabled
Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.
Non-necessary
Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.
SAVE & ACCEPT