The last post discussed the results of an HPI Lucene Search test compared to a Webtop FAST Search as part of a proof of concept for a client looking to provide a consumer interface. As we have often mentioned on this forum, we continually see clients looking for a better search interface than Webtop, as well as some content cached outside of Documentum for business continuity, performance, and licensing.
One accurate comment raised in response to that post was that our comparison of HPI/Lucene against a Webtop/FAST search wasn’t really comparing apples to apples, as the Webtop search was running against Documentum with security while the Lucene search was not. While the client’s goal was to show the benefits of the cached repository and Lucene against Documentum, many Documentum users would like to know how Lucene would perform directly against a Documentum repository (as with the upcoming DSS).
For this post, we will discuss TSG’s strategy and initial proof of concept results in leveraging Lucene for a Documentum full text search engine.
In our typical consumer portal, clients often choose either to push only “World View” documents or to implement some type of light application security (e.g., only these users have access to these types of documents). For integration with Documentum, the search should leverage the existing ACL security layer already in place in the repository. The main security issue to address is that users without at least “browse” access (can see the document’s metadata, but can’t open it) on a given document shouldn’t be able to see that document in the Lucene search results. Keep in mind that the Documentum API would still be used to view the document (either from HPI or Webtop) and would check the ACL for “read” access, so unauthorized viewing of a document is not an issue.
Lucene Integration to Documentum
The goal of the Lucene integration is to continue to return results quickly and avoid Documentum API calls where possible, while following the ACL requirements. One strategy would be to check each Lucene search result against Documentum for “browse” access. Although this approach performs the same way regardless of the complexity of the repository security model, we quickly determined that a database hit for every search result would slow performance. An alternative method that we feel would be faster is described below.
One approach is to look up the Documentum user’s ACL rights before a search, determining which ACLs in the repository grant the user at least “browse” access. By indexing documents in Lucene with both content and metadata (including ACL information), we were able to leverage a single Lucene search (as with the cached approach) without having to check ACLs on documents individually. Our thinking was that document ACLs only change after certain events, such as a lifecycle state change, and that we could capture and re-index these documents in Lucene after those changes. With this approach, the Lucene query ends up looking like this:
document_type:sop AND text:"Change Request" AND acl_name:("Global Read ACL" OR "SOP Effective ACL" OR "SOP Approved ACL" OR "Drawing Read ACL")
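To make the query construction concrete, here is a minimal sketch of how the ACL clause could be appended to a user’s search. It is only an illustration, not our production code: the function names are hypothetical, and it assumes the index stores each document’s ACL name in a field called acl_name, as in the query above.

```python
from typing import List, Optional


def quote_term(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_secured_query(base_query: str, browse_acls: List[str]) -> Optional[str]:
    """Restrict base_query to documents whose ACL the user can at least browse.

    Hypothetical helper: browse_acls is the list of ACL names looked up for
    the user before the search. Returns None when the user has no browsable
    ACLs, in which case the caller can skip the search entirely.
    """
    # Drop duplicates while keeping the original order.
    unique_acls = list(dict.fromkeys(browse_acls))
    if not unique_acls:
        return None
    acl_clause = " OR ".join(quote_term(name) for name in unique_acls)
    # Parenthesize the user's query so its own OR terms cannot escape the ACL filter.
    return f"({base_query}) AND acl_name:({acl_clause})"


if __name__ == "__main__":
    print(build_secured_query(
        'document_type:sop AND text:"Change Request"',
        ["Global Read ACL", "SOP Effective ACL"],
    ))
```

Wrapping the user’s query in parentheses matters: without it, a search such as “a OR b” would let the second term bypass the ACL clause.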
This approach works well for systems with a small and finite number of ACLs. Because many systems have a large number of ACLs that grant dm_world “read” access, the Lucene query could become very large for systems with complex security models. An alternative, more hybrid approach would be to continue to look up the user’s “browse” ACLs before running the Lucene query but, rather than adding ACL clauses to the query, check each search result against the list of the user’s “browse” ACLs. Because this list can easily be held in memory, it eliminates the need for costly Documentum API calls or DQL queries for each search result.
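The hybrid check can be sketched as follows. This is an illustration under stated assumptions rather than the actual integration: the ACL entries are assumed to have been pulled from the repository ahead of time as (acl_name, accessor, permit) tuples, search hits are assumed to carry their stored acl_name field, and both function names are hypothetical. Permit levels follow Documentum’s numeric convention (1 = None, 2 = Browse, 3 = Read, and so on). The sketch ignores dm_owner, extended permissions, and access restrictions.

```python
from typing import Dict, Iterable, List, Set, Tuple

BROWSE = 2  # Documentum permit level for "browse" (1 = None, 3 = Read, ...)

# Hypothetical shape for ACL data loaded once per user session.
AclEntry = Tuple[str, str, int]  # (acl_name, accessor_name, permit_level)


def resolve_browse_acls(entries: Iterable[AclEntry], user: str,
                        groups: Iterable[str]) -> Set[str]:
    """Return the names of ACLs granting the user at least browse access."""
    # An entry applies if it names the user, one of their groups, or dm_world.
    accessors = {user, "dm_world"} | set(groups)
    return {
        acl_name
        for acl_name, accessor, permit in entries
        if accessor in accessors and permit >= BROWSE
    }


def filter_hits(hits: Iterable[Dict[str, str]],
                browse_acls: Set[str]) -> List[Dict[str, str]]:
    """Keep only hits whose stored acl_name is in the user's browse set."""
    # Set membership is an in-memory check, so no repository call per hit.
    return [hit for hit in hits if hit.get("acl_name") in browse_acls]


if __name__ == "__main__":
    entries = [
        ("Global Read ACL", "dm_world", 3),
        ("SOP Approved ACL", "qa_group", 2),
        ("HR ACL", "hr_group", 6),
    ]
    allowed = resolve_browse_acls(entries, "jsmith", ["qa_group"])
    hits = [
        {"id": "1", "acl_name": "Global Read ACL"},
        {"id": "2", "acl_name": "HR ACL"},
    ]
    print(filter_hits(hits, allowed))
```

One trade-off to keep in mind: because hits are dropped after the search runs, a page of results can come back short, so the caller would likely need to request more hits than it intends to display.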
Because the Documentum/Lucene integration uses OpenMigrate to perform the incremental publish to the Lucene index, a lot of flexibility is built in. The publishing job is driven by a DQL query, so it can easily be configured to index only the documents that should be searchable while ignoring others, such as job logs and other repository documents that users should never see. This flexibility, combined with the security integration options described above, provides a more tunable full-text indexing solution designed with performance in mind.
Based on our preliminary results, we believe that integrating Documentum ACL security with Lucene search has minimal performance impact on the system. Additional test results will be posted here as they become available. Please comment below if you have any thoughts or questions.