During a recent migration from an on-premise Documentum system to Alfresco on Amazon Web Services, a client wanted to check the target repository for document counts and other pertinent information to ensure that the migration executed smoothly. For example, a client may want to query to know “how many documents of type X are in the repository”? If the migration says that 50,000 documents have been migrated, then that query should return a count of 50,000. With out of the box in Alfresco, however, queries are limited to a maximum return size of 2,000 results. While this number can be increased via an API call for a particular query, we see many of our clients running into issues when attempting to query on large document sets (ex: more than 100,000 documents) such as queries timing out and returning an incomplete result (how many were returned before the timeout) that is often inconsistent from one query to the next. In this post, we will explore how the OpenAdmin module of the OpenContent Management Suite has been improved to enhance Alfresco and address this requirement.
Why is this a Problem?
To be clear, Alfresco can support very large data sets and querying these documents out of the box. For clients that require this functionality, Alfresco recommends to have multiple external Solr instances, load balanced and sharded to accommodate the query load. However some clients, particularly those with large data sets and who are exclusively meta-data searching, don’t require or desire the extensive Solr infrastructure. These clients really only need to return very large result sets during migrations or during other administrative tasks. For these clients, a single Alfresco database compliant query (check out our previous posts on utilizing the Alfresco database and Alfresco TMQ Queries for more information) that runs against the database does not perform well and can time out, causing the results to differ between executions. This issue has been logged to Alfresco.
For many of our clients with large data sets, it makes sense to utilize the database as much as possible. For these clients, it’s important to know when the Alfresco query will hit the database or Solr. The posts linked in the previous paragraph have more detail about database vs. Solr searching and Solr eventual consistency issues, but some situations may not be expected. For example, if your query has date or boolean searches, the query will execute against Solr rather than the database. To avoid Solr issues altogether, some of our clients with very large data sets (hundreds of millions of documents) opt to forego using Solr at all.
OpenContent Admin to the Rescue
To solve this issue as well as offer the highest performance and flexibility, we added a feature to OpenContent Admin that allows an administrator to run a read-only SQL query directly against the Alfresco database. If the query passes, we run it under a pre-configured read-only database user account to get the results. Finally, we package the results in an Excel spreadsheet and stream it back to the admin. The result is a relatively simple interface:
Admin Query to return Document Counts:
The following query simply counts the number of ‘claimsDocument’ objects in the repository. To see examples of other types of queries, check out the OpenAdmin Query Console page on our product wiki.
Query Result Excel File
Note – we return the results as a spreadsheet in order to support queries that return many thousands of results. In the future, we will add an enhancement that will display the results inline if less than a configured number of results (ex: 2,000) are present.
As for benchmarking, a current client was looking to query for document counts on a type that numbered just over 300,000 total documents. With an Alfresco database query, the operation times out after 45 minutes and returns inconsistent results. Utilizing the OpenAdmin direct SQL interface, the operation returns successfully in 8 seconds.
One item to keep in mind with our direct SQL approach is that since we are querying directly against the database, it is difficult to filter documents out based on security. However, administrative users can run queries quickly against very large data sets to return overall document counts or large result sets, without needing to setup a load balanced and sharded external Solr cluster.
In the future, we are planning a number of enhancements to the direct SQL tool. As mentioned above, we could eventually display a results table for smaller result sets. Additionally, Saved Queries make sense to allow administrators to quickly run common queries. We could further enhance the tool to allow for “build a query” type functionality where the administrator selects types and attributes via dropdowns rather than typing in the SQL directly.
Let us know your thoughts below.