Enterprise Document Search – A Publishing rather than Crawler/Federated Approach

I met with a client yesterday that was struggling with a typical ECM issue: how to support searches across multiple ECM environments. While ECM vendors love to pitch the “one repository to rule them all”, most ECM users have content in a variety of different solutions that have evolved over time. This post offers thoughts on how to publish to a common search/retrieval repository and the benefits of doing so over a crawling or federated approach.

Enterprise Document Search – What is the Issue?

When departments within organizations capture documents electronically, these documents are typically stored in the department’s content management system. As an example, Accounts Payable will typically receive and process all invoices and store each invoice and its supporting documents in the accounts payable content management application, which is often tightly tied to the accounting system and indexed by voucher number, invoice date, invoice amount, company name and other relevant data. If a different department, for example the New Ventures Department, would like to leverage a vendor for a new engagement, New Ventures might like to review the relevant billings. In this case, New Ventures would have to ask for access to the Accounts Payable system or ask the Accounts Payable department to provide a copy of the documents. Enterprise Search would allow the New Ventures group to retrieve the appropriate documents on their own.

While it would be easy to push for a single Enterprise Content Management (ECM) platform across all department systems, most clients have evolved, based on a variety of factors, into a “Silo” approach where content is not shared across departments or, even if a single ECM platform is used, security and access are restricted to within the department. Typically the Silo approach results from:

Security – First and foremost, many documents are not meant to be shared across departments. In securing these documents, the department can focus on locking down all documents. Coming up with security requirements for users outside of the department is typically not in scope.
Funding – While the department might have funding to implement its own content storage system, it doesn’t always have additional funding (or the incentive) to provide documents outside of the department.
Other Department Requirements – Typical implementations are not built to provide for other department search requirements. This could include not having the right company name/taxonomy to help with search requirements.

Our client had identified six potential systems. These spanned multiple ECM vendors and also included cases where the index data is stored not in the ECM system but in a departmental database system (example CRM, ERM), with only the document itself stored in ECM.

The desire to bring all the searching into one central location to simplify access and training is a common request.

More than just a Google Search

Before comparing search approaches, it is worth recalling what we described back in 2011: ECM customers want a robust search that includes:

Document Types – The first selection should be a document type (or grouping, ex: SOP) before any attributes are selected. The attributes available should be based on the selected document type.
Drop Downs rather than free form text – Many attributes are defined in the data dictionary or have a limited set of values. Users should be able to select from this list easily rather than taking the time to key in text, risking a spelling error and having to run another search.
Equals – Pick list selections should use “Equals” as the operator for performance (rather than contains, starts with, ends with, greater than…). Use “Contains” for the free form fields (example title).
Eliminate “Ors” – When pushed, users rarely need a “Show me all the SOPs for this product or this product” search. Best practice is to run two separate searches, which simplifies the interface and training.
Search Results should allow for:
- Sorting by columns (typically by clicking the header)
- Display or removal of certain columns
- Export to Excel/CSV

See the complete article on search, written specifically for Documentum customers, on why just full-text or a “Google” search doesn’t fit ECM user needs.

Issues with a Federated Approach

One typical approach for providing cross repository search would be to set up a federated search where one interface would initiate searches across each repository and concatenate the results.

We presented this type of approach back in 2011 when discussing OpenSearch for SharePoint. For this approach, the index and content would remain in the separate systems and integration would just bring together multiple searches into one interface and results list. Benefits of this approach include:

Security – Since the source content management system is being used, any search would follow the security from that system. Other departmental users would need to be defined in the system with appropriate access rights.

A couple of issues with this approach include:

Integration – The search interface would need a standard (OpenSearch as presented for SharePoint) but would still require integration into each of the departmental solutions to initiate the search and concatenate the results.
Performance – Search performance would be limited by the system with the slowest response time. It would be difficult to tune the system or add indexes to improve performance without updating the departmental system.
Content Type – Content in departmental systems is often stored in different formats (think TIFF for old imaging systems). Access to the system is often not enough, as users will also need viewers for what could be proprietary content files.
Administration – Every user would need to be defined and maintained in every departmental system with a user id and appropriate security.
Licensing – Tied to Administration, every user would need to have a license for the departmental system. In the case where ECM was tied to another system (CRM), every user would need a license for both systems.
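
The performance point can be made concrete with a minimal sketch. The three “repositories” below are stand-in functions with artificial delays, and all names are made up for illustration. Even with every search issued in parallel, the combined result list cannot be returned until the slowest system has answered.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_repo(name, delay_s, titles):
    """Return a fake search function for one departmental repository."""
    def search(term):
        time.sleep(delay_s)  # stand-in for network and query time
        return [
            {"source": name, "title": title}
            for title in titles
            if term.lower() in title.lower()
        ]
    return search


def federated_search(repos, term):
    """Query every repository in parallel and concatenate the results."""
    with ThreadPoolExecutor(max_workers=len(repos)) as pool:
        result_lists = list(pool.map(lambda search: search(term), repos))
    return [hit for hits in result_lists for hit in hits]


REPOS = [
    make_repo("accounts_payable", 0.01, ["Acme invoice 1001", "Globex invoice 2002"]),
    make_repo("contracts", 0.05, ["Acme master agreement"]),
    make_repo("legacy_imaging", 0.20, ["Acme scanned statement"]),
]

start = time.perf_counter()
hits = federated_search(REPOS, "acme")
elapsed = time.perf_counter() - start
print(len(hits), "hits in", round(elapsed, 2), "seconds")
```

The elapsed time tracks the 0.20 second legacy system, not the faster ones, and nothing in the search interface can change that without tuning the departmental system itself.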

Issues with a Crawler Approach

With the Crawler approach, a search engine would crawl each of the repositories to extract the meta-data and potentially the full-text index components.

For this approach, the content would remain in the separate systems and the index data would be contained in the overall search engine. This approach provides some benefits over the federated search in that:

Search Interface – Could be tuned and would not require integration into each of the departmental systems. Search tools will often have crawlers for different systems.
Performance – Could be tuned for search results but not for content retrieval as content would still be stored in the underlying systems.

Issues with this approach include:

Security – When crawling the different systems, the search engine would have to capture some of the security components. A concern for many clients is making sure that other departments do not see confidential documents or documents that are still in a DRAFT state. Clients have had issues with a crawler approach when the user can see in the search results that a document exists (ex: HR Incident Write-up) even if that document can’t be retrieved from the departmental system.
Index Components – Crawlers are often written just to retrieve full-text and not meta-data. Crawler capabilities might need to be tuned to capture both content and index values.
Licensing – As with the federated approach, every user would need to have a license for every departmental ECM system. In the case where ECM was tied to another system (CRM), the user would not need a license for both systems as they are only retrieving the content.
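
To illustrate the security issue, here is a small sketch of result trimming in a crawled index. It assumes the crawler has captured a lifecycle status and a list of allowed groups for each document; the IndexedDoc structure and the group names are hypothetical. Without a filter like this, the HR title would appear in another department’s result list even though the document itself cannot be opened.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    title: str
    status: str                # lifecycle state captured by the crawler
    allowed_groups: frozenset  # groups copied from the source system's ACL


def visible_results(hits, user_groups):
    """Drop drafts and any document the user's groups cannot access."""
    groups = frozenset(user_groups)
    return [
        doc for doc in hits
        if doc.status != "DRAFT" and groups & doc.allowed_groups
    ]


HITS = [
    IndexedDoc("HR Incident Write-up", "FINAL", frozenset({"hr"})),
    IndexedDoc("Acme invoice 1001", "FINAL",
               frozenset({"accounts_payable", "new_ventures"})),
    IndexedDoc("Vendor onboarding SOP", "DRAFT", frozenset({"new_ventures"})),
]

for doc in visible_results(HITS, {"new_ventures"}):
    print(doc.title)  # prints only "Acme invoice 1001"
```

Excluding every DRAFT document, regardless of who asks, is a simplifying assumption for this sketch. The filter is also only as good as the security data the crawler managed to capture, which is the underlying weakness of the approach.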

Publishing Approach

A publishing approach works differently: when content is ready to be shared, it is “published”, or copied, to a new repository.

In this approach, a job is set up to monitor the business system, looking for documents of a given type that have reached a stage where they can be pushed to the separate repository. With this push, the new repository will have all the meta-data as well as a copy of the document itself. Typically we see clients publish just a PDF of the document since it will only be used for read access. The publishing job might also push a light version of security in the form of meta-data if required.
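
The publishing job described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the SourceDoc structure, the PUBLISHABLE rules, and the render_pdf placeholder are all hypothetical, and a real job would read from the business system and call an actual rendition service rather than work on in-memory data.

```python
from dataclasses import dataclass


@dataclass
class SourceDoc:
    doc_id: str
    doc_type: str
    state: str
    metadata: dict
    content: bytes
    share_with: tuple = ()  # light security pushed along as meta-data


# Hypothetical rules: which (type, state) pairs are ready to be shared.
PUBLISHABLE = {("Invoice", "Approved"), ("SOP", "Effective")}


def render_pdf(content: bytes) -> bytes:
    """Placeholder for a real PDF rendition step."""
    return b"%PDF-placeholder " + content


def publish(source_docs, target: dict) -> list:
    """Copy meta-data and a PDF rendition of each ready document to target."""
    published = []
    for doc in source_docs:
        if (doc.doc_type, doc.state) not in PUBLISHABLE:
            continue  # drafts and other in-progress documents stay private
        target[doc.doc_id] = {
            "metadata": dict(doc.metadata, doc_type=doc.doc_type),
            "pdf": render_pdf(doc.content),
            "allowed_groups": list(doc.share_with),
        }
        published.append(doc.doc_id)
    return published


source = [
    SourceDoc("inv-1001", "Invoice", "Approved",
              {"company_name": "Acme"}, b"invoice body", ("new_ventures",)),
    SourceDoc("inv-1002", "Invoice", "Draft",
              {"company_name": "Globex"}, b"draft body"),
]
search_repository = {}
print(publish(source, search_repository))  # prints ['inv-1001']
```

Because the decision about what is ready to share is made inside the job, the draft invoice never reaches the search repository at all, so there is nothing to trim at search time.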

In this manner, the department can ensure that access to its own system is still controlled, while documents that need to be shared can be pushed to a common search and retrieval repository. Many of our ECM clients are currently implementing this approach for even just one repository as it provides:

Business Continuity – It provides a search and retrieval capability even if the ECM is down or unavailable. See Business Continuity post here.
Consumer Only Interface – By removing consumers from the departmental system, the load on the ECM system, for both performance and licensing, would be reduced.
Simplified Searching – Consumers no longer need access to vendor interfaces. See Documentum comparison review here.

Besides business continuity, advantages of this approach over a Federated or Crawler approach include:

Integration – Rather than having to write real-time integration to the departmental repository, the integration would only be required in the publishing job. The Search Interface could be written for just the new repository and take advantage of all the capabilities of that repository.
Performance – Search performance is not limited by the system with the slowest response time.
Content Type – As part of the publishing job, content could be converted (typically to PDF) and could also include additional items (headers/footers…) to provide consistency between systems.
Administration – Each user would need to be defined and maintained only in the overall search repository rather than in each departmental system.
Licensing – Tied to Administration, each user would only need to have a license for the search system.

TSG has implemented the publishing approach for multiple clients with our OpenMigrate tool. Several features include:

Ability to pull from a wide variety of ECM repositories including Documentum, FileNet, Alfresco and SharePoint, as well as database-driven systems (example Custom Oracle/SAP).
Ability to “poll” a repository and push content on a set interval (example every 5 minutes or once a day).
Ability to transform content from a variety of formats into PDF.
Ability to store and index into a variety of repositories including Alfresco, Documentum as well as Lucene/Solr and Hadoop.
Ability to delete outdated or superseded documents from target repository.
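
The polling and delete behaviour can be pictured with a generic sketch. This is not OpenMigrate code; the sync_once function and the dictionary layout are assumptions made purely for illustration. Each pass publishes new or changed documents and removes anything from the target that is no longer current in the source.

```python
def sync_once(source: dict, target: dict) -> dict:
    """Run one polling pass.

    source and target map doc_id -> {"state": ..., "version": ...}.
    Only documents in the "Effective" state belong in the target.
    """
    current = {
        doc_id: doc for doc_id, doc in source.items()
        if doc["state"] == "Effective"
    }

    # Delete outdated or superseded documents from the target.
    removed = [doc_id for doc_id in target if doc_id not in current]
    for doc_id in removed:
        del target[doc_id]

    # Push documents that are new or whose version has changed.
    updated = []
    for doc_id, doc in current.items():
        if target.get(doc_id, {}).get("version") != doc["version"]:
            target[doc_id] = dict(doc)
            updated.append(doc_id)

    return {"updated": sorted(updated), "removed": sorted(removed)}


source = {
    "sop-1": {"state": "Effective", "version": 2},
    "sop-2": {"state": "Superseded", "version": 1},
    "sop-3": {"state": "Effective", "version": 1},
}
target = {
    "sop-1": {"state": "Effective", "version": 1},
    "sop-2": {"state": "Effective", "version": 1},
}

first_pass = sync_once(source, target)
second_pass = sync_once(source, target)
print(first_pass)
print(second_pass)
```

A scheduler would call sync_once on the chosen interval. When nothing has changed in the source, as in the second pass here, the call does no work, which keeps frequent polling cheap.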

Thoughts on the Publishing Repository

We typically recommend leveraging Lucene/Solr as the index component of the publishing repository with a pointer to a file system for the published PDFs. Other options could include full-feature ECM platforms like Alfresco or Documentum, or some of our newer work to leverage Hadoop. Look for an article next week on some of the benefits of Hadoop for a publishing solution.
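
As a small sketch of the “index plus pointer” idea, the snippet below builds the kind of record that might be sent to a Lucene/Solr index: the searchable attributes plus a path to the published PDF on a file system. The field names, the /published/pdfs root, and the to_index_doc helper are assumptions for illustration only. No Solr server is contacted; the code only prepares a JSON array of documents, the general shape Solr’s JSON update handler accepts.

```python
import json
from pathlib import PurePosixPath

PDF_ROOT = PurePosixPath("/published/pdfs")  # hypothetical file system root


def to_index_doc(doc_id: str, doc_type: str, attributes: dict) -> dict:
    """Combine search attributes with a pointer to the published PDF."""
    doc = dict(attributes)
    doc.update(
        id=doc_id,
        doc_type=doc_type,
        pdf_path=str(PDF_ROOT / doc_type.lower() / f"{doc_id}.pdf"),
    )
    return doc


index_doc = to_index_doc(
    "inv-1001",
    "Invoice",
    {"company_name": "Acme", "invoice_amount": 1250.0},
)
payload = json.dumps([index_doc])
print(payload)
```

Keeping only a path in the index means the search tier stays small, and the PDFs can be served by whatever file or web server is already in place.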

If you have any thoughts please comment below.
