Recently TSG has assisted multiple clients with unique scenarios to make Alfresco queries more efficient. Like a good auto mechanic that has learned tips and tricks for getting better performance from a car without sacrificing quality, TSG has recently developed performance improvements for Alfresco in OpenContent to address clients’ needs. This post will summarize the approach and updates to OpenContent Web Services.
Alfresco Queries – How do they work?
For those unfamiliar with ECM, most ECM systems including Alfresco are built on a database to store meta-data, security, as well as pointers to the file locations. In TSG’s opinion, Alfresco has the most flexible database schema in the marketplace. By isolating property names and values, Alfresco can best support aspects to allow for hundreds of different document types and the ability to easily search across documents.
Alfresco has both Solr and direct database query options. Many of TSG’s clients that have large volumes of transactional documents have chosen the database query (Alfresco Transactional Metadata Query – TMDQ) access method for a variety of reasons including:
- No requirement for full-text search – as all searching relies on properties.
- Large volumes of documents – (ex: over 200 million) make setting up a sharded Solr environment costly.
- Performance – tests comparing Solr to TMDQ for metadata queries have resulted in TMDQ being comparable for certain queries.
See a more detailed discussion of the differences between Solr and TMDQ in a previous post – Alfresco Solr Issues – when does just leveraging the database make sense?
Performance issues for both TMDQ and Solr occur when queries return large amounts of metadata along with the documents. Users typically will require name of the document, tags, author, create date or a variety of other values returned about the document. Examples of client large queries have included:
- Administration Queries – For large volume clients that need to see a large list of documents. Examples could include how many documents were migrated today or last week. Another example is which documents haven’t been renditioned yet.
- Large User Queries – Searches that will return large number of results in a user scenario. For example a listing of all the documents (and metadata) in a folder when the folder contains 60,000 documents.
For some of our clients, OOTB the Alfresco queries above might run as long as 30 minutes and potentially time out. In working with alternatives for our clients, TSG was able to confirm that it wasn’t so much the issue with how long the query ran but more the formatting of results once queries returned. Because a query against the database would quickly return a list of objects that satisfied the query, performance improvements have focused on how to more quickly retrieve the values of the properties for the list of objects.
Alfresco Properties and Security – Understanding how they work
Typical SQL database queries for Alfresco with the API look something like:
SELECT node.id as id FROM alf_node node, etc… WHERE node.type_qname_id <> 149 AND node.store_id = 6 AND node.id IN ( SELECT PROP.node_id FROM alf_node_properties PROP WHERE (26 = PROP.qname_id) AND PROP.string_value = ‘123456’)
The above query would return the list of node ids that satisfied the value of the name being equal to “123456”. Once Alfresco has a list of node-ids, the API will then run security logic on the entire set to pare down the results to what the user executing the query can access. The API will then return the pared down set of node-ids. Regardless of whether the query is run by Solr or TMDQ, further API calls for each one of the returned node-ids would need to look up any properties that might be needed for display in the application. That API would look something like:
Map<QName, Serializable> props = serviceRegistry.getNodeService().getProperties(node);
For Alfresco and particularly large volume customers, two unique situations can occur based on this approach:
- For a large list returned values, the number of calls for the properties is very large. (1 call for each node)
- For large volume customers, the alf_node_properties table can get very large. For one client with 33 properties per document and 380 million documents, we estimated that the table would include 12.5 billion rows making it difficult to cache/tune the large table.
For a test benchmark, TSG built a 250,000 Alfresco repository with Solr enabled. To make sure we had a clean test, security was not applied to any of the queries. TSG had the following very similar results for Solr and TMDQ.
- 100 document results
- TMDQ (1.967 seconds) – Solr (1.814 seconds)
- 1,000 document results
- TMDQ (8.536 seconds) – Solr (5.471 seconds)
- 10,000 document results
- TMDQ (1 minute 12.427 seconds) – Solr (1 minute 10.555 seconds)
- 50,000 document results
- TMDQ (6 minutes 2.336 seconds) – Solr (5 minutes 8.269 seconds)
- 100,000 document results
- TMDQ (11 minutes 14.938 seconds) – Solr (10 minutes 33.680 seconds)
In running our tests, we were surprised just how close the timings were between Solr and TMDQ. TMDQ would execute the query quickly (4% of the total time) but most of the time was retrieving the properties (96% of the total time). Solr would take longer to execute the query (95% of the total time) but the property fetch (5% of the total time) was quicker due to Solr caching.
OpenContent Database Query Updates
OpenContent Web Services has typically just provided a front-end to either the Alfresco API or CMIS implementation doing the normal database or Solr Query. OpenContent Web Services executes the Alfresco query and then calls the property API and passes the results back. Back in January, for our large clients, TSG started executing database calls within our OpenAdmin tool to provide counts and other administrative access to the Alfresco database. See related post – Alfresco – Adding the ability to query on very large data sets.
To improve query performance and extending on the experience from January, the new capability built into OpenContent Web Services leverages the power of the database to put all of the property retrieval in one query avoiding the separate calls to retrieve the necessary property values. The resulting query looks something like the below:
SELECT node.id AS nodeID, props.boolean_value AS boolean_value, props.long_value AS long_value, props.float_value AS float_value, props.double_value AS double_value, props.string_value AS string_value, node.audit_creator AS creator, node.audit_created AS created, node.audit_modifier AS modifier, node.audit_modified AS modified, etc… FROM alf_node_properties props JOIN alf_node node ON node.id = props.node_id, etc… WHERE node.type_qname_id <> 149 AND node.store_id = 6 AND node.id IN ( SELECT PROP.node_id FROM alf_node_properties PROP WHERE (26 = PROP.qname_id AND PROP.string_value = ‘123456’) AND (29 = PROP.qname_id AND PROP.string_value = ‘Medical’)
Some of the reasons for the performance improvement include:
- Direct Database Query – Leveraging a read-only database user, OpenContent formats a more efficient database query and parsing of results.
- No Security – to support admin users as well as large volume clients that leverage a different security model passed to the query in regards to the search itself, the DB query does not have to check Alfresco security.
- Returning properties in query results – by restructuring the query, OpenContent is able to return properties from the initial query itself, negating the need for a second API call per document to retrieve properties for display in the application.
From our benchmark test against the same repository:
- 100 document results
- OpenContent (0.169 seconds) – TMDQ (1.967 seconds) – Solr (1.814 seconds)
- 1,000 document results
- OpenContent (1.361 seconds) – TMDQ (8.536 seconds) – Solr (5.471 seconds)
- 10,000 document results
- OpenContent (5.725 seconds) – TMDQ (1 minute 12.427 seconds) – Solr (1 minute 10.555 seconds)
- 50,000 document results
- OpenContent (25.209 seconds) – TMDQ (6 minutes 02.336 seconds) – Solr (5 minutes 08.269 seconds)
- 100,000 document results
- OpenContent (30.216 seconds) – TMDQ (11 minutes 14.938 seconds) – Solr (10 minutes 33.680 seconds)
From the numbers above, OpenContent was roughly 4 to 20 times faster depending on query size results.
Summary
OpenContent can now provide performance improvements for property queries for many client scenarios by making a more efficient database call against Alfresco regardless of the size of the repository or query result set. This performance is particularly important for queries that will have large result sets but would also increase performance for normal queries as well. For other scenarios or even great performance improvements, TSG would recommend looking at our Solr Services.
Let us know your thoughts below.
[…] the speed of certain Alfresco queries for large repositories, OpenContent Web Services now includes specific DB tuned queries not available through the Alfresco API to improve performance […]