Alfresco – Adding the ability to query on very large data sets

During a recent migration from an on-premise Documentum system to Alfresco on Amazon Web Services, a client wanted to check the target repository for document counts and other pertinent information to ensure that the migration executed smoothly. For example, a client may want to query to know “how many documents of type X are in the repository”? If the migration says that 50,000 documents have been migrated, then that query should return a count of 50,000. With out of the box in Alfresco, however, queries are limited to a maximum return size of 2,000 results. While this number can be increased via an API call for a particular query, we see many of our clients running into issues when attempting to query on large document sets (ex: more than 100,000 documents) such as queries timing out and returning an incomplete result (how many were returned before the timeout) that is often inconsistent from one query to the next. In this post, we will explore how the OpenAdmin module of the OpenContent Management Suite has been improved to enhance Alfresco and address this requirement.

Why is this a Problem?

To be clear, Alfresco can support very large data sets and querying these documents out of the box. For clients that require this functionality, Alfresco recommends to have multiple external Solr instances, load balanced and sharded to accommodate the query load. However some clients, particularly those with large data sets and who are exclusively meta-data searching, don’t require or desire the extensive Solr infrastructure. These clients really only need to return very large result sets during migrations or during other administrative tasks. For these clients, a single Alfresco database compliant query (check out our previous posts on utilizing the Alfresco database and Alfresco TMQ Queries for more information) that runs against the database does not perform well and can time out, causing the results to differ between executions. This issue has been logged to Alfresco.

For many of our clients with large data sets, it makes sense to utilize the database as much as possible. For these clients, it’s important to know when the Alfresco query will hit the database or Solr. The posts linked in the previous paragraph have more detail about database vs. Solr searching and Solr eventual consistency issues, but some situations may not be expected. For example, if your query has date or boolean searches, the query will execute against Solr rather than the database. To avoid Solr issues altogether, some of our clients with very large data sets (hundreds of millions of documents) opt to forego using Solr at all.

OpenContent Admin to the Rescue

To solve this issue as well as offer the highest performance and flexibility, we added a feature to OpenContent Admin that allows an administrator to run a read-only SQL query directly against the Alfresco database. If the query passes, we run it under a pre-configured read-only database user account to get the results. Finally, we package the results in an Excel spreadsheet and stream it back to the admin. The result is a relatively simple interface:

Admin Query to return Document Counts:

The following query simply counts the number of ‘claimsDocument’ objects in the repository. To see examples of other types of queries, check out the OpenAdmin Query Console page on our product wiki.

Query Result Excel File

Note – we return the results as a spreadsheet in order to support queries that return many thousands of results. In the future, we will add an enhancement that will display the results inline if less than a configured number of results (ex: 2,000) are present.

As for benchmarking, a current client was looking to query for document counts on a type that numbered just over 300,000 total documents. With an Alfresco database query, the operation times out after 45 minutes and returns inconsistent results. Utilizing the OpenAdmin direct SQL interface, the operation returns successfully in 8 seconds.

Summary

One item to keep in mind with our direct SQL approach is that since we are querying directly against the database, it is difficult to filter documents out based on security. However, administrative users can run queries quickly against very large data sets to return overall document counts or large result sets, without needing to setup a load balanced and sharded external Solr cluster.

In the future, we are planning a number of enhancements to the direct SQL tool. As mentioned above, we could eventually display a results table for smaller result sets. Additionally, Saved Queries make sense to allow administrators to quickly run common queries. We could further enhance the tool to allow for “build a query” type functionality where the administrator selects types and attributes via dropdowns rather than typing in the SQL directly.

Let us know your thoughts below.

Trackbacks

Alfresco Performance – Making Property Queries Faster says:

June 13, 2018 at 2:13 pm

[…] OpenContent Web Services has typically just provided a front-end to either the Alfresco API or CMIS implementation doing the normal database or Solr Query. OpenContent Web Services executes the Alfresco query and then calls the property API and passes the results back. Back in January, for our large clients, TSG started executing database calls within our OpenAdmin tool to provide counts and other administrative access to the Alfresco database. See related post – Alfresco – Adding the ability to query on very large data sets. […]

Why is this a Problem?

OpenContent Admin to the Rescue

Summary

Reader Interactions

Trackbacks

Leave a ReplyCancel reply