
Technology Services Group

Enterprise Document Search – A Publishing rather than Crawler/Federated Approach


January 16, 2015

I met with a client yesterday who was struggling with a typical ECM issue: how to provide search across multiple ECM environments. While ECM vendors love to pitch the “one repository to rule them all”, most ECM users have content in a variety of different solutions that have evolved over time. This post offers thoughts on how to publish to a common search/retrieval repository and the benefits of doing so over a crawling or federated approach.

Enterprise Document Search – What is the Issue?

When departments within organizations capture documents electronically, these documents are typically stored in the department’s content management system. As an example, Accounts Payable will typically receive and process all invoices and store each invoice and its supporting documents in the accounts payable content management application, which is often tightly tied to the accounting system and indexed by voucher number, invoice date, invoice amount, company name and other relevant data. If a different department, for example the New Ventures Department, would like to leverage a vendor for a new engagement, New Ventures might like to review the relevant billings. In this case, New Ventures would have to ask for access to the Accounts Payable system or ask the Accounts Payable department to provide a copy of the documents. Enterprise Search would allow the New Ventures group to retrieve the appropriate documents on their own.

While it would be easy to push for a single Enterprise Content Management (ECM) platform across all department systems, most clients have evolved, based on a variety of factors, into a “Silo” approach where content is not shared across departments or, even if a single ECM platform is used, security and access are restricted to within the department. Typically the Silo approach results from:

  • Security – First and foremost, many documents are not meant to be shared across departments. The department’s security model can then focus on locking down all of its documents; defining security requirements for users outside of the department is typically not in scope.
  • Funding – While the department might have funding to implement its own content storage system, it doesn’t always have additional funding (or the incentive) to provide documents outside of the department.
  • Other Department Requirements – Typical implementations are not built to support other departments’ search requirements. This could include not having the right company name/taxonomy to help with those searches.

Our client had identified six potential systems. These spanned multiple ECM vendors and included cases where the index data is not stored in the ECM system but in a departmental database system (example CRM, ERM), with only the document itself stored in ECM.

[Figure: Separate Search]

The desire to bring all the searching into one central location to simplify access and training is a common request.

More than just a Google Search

Before talking about the different search approaches, recall that, as we described back in 2011, ECM customers want a robust search that includes:

  • Document Types – The user should first pick a document type (or grouping – ex: SOP) before selecting any attributes. The specific attributes available should be based on the selected document type.
  • Drop Downs rather than free form text – Many attributes are defined in the data dictionary or have a limited set of values. Users should be able to select from this list rather than taking the time to key in text, risking a spelling error and having to run another search.
  • Equals – Pick list selections should use “Equals” as the operator for performance (rather than contains, starts with, ends with, greater than...). Use “Contains” for the free form fields (example title).
  • Eliminate “Ors” – When pushed, users rarely need a “Show me all the SOPs for this product or this product” search. Best practice is to run two different searches, which simplifies the interface and training.
  • Search Results should allow for:
    • Sorting by columns (typically by clicking the header)
    • Display or removal of certain columns
    • Export to Excel/CSV
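
The requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DATA_DICTIONARY`, the attribute names and the sample documents are hypothetical, and an in-memory list stands in for a real repository. It shows the document type being chosen first, pick-list attributes matched with "equals", free-form attributes matched with "contains", all criteria ANDed together, and results that can be sorted by a column and exported to CSV with a chosen set of columns.

```python
import csv
import io

# Hypothetical data dictionary: the attributes offered for each document type.
# A list means "pick list" (drop-down); None means free-form text.
DATA_DICTIONARY = {
    "SOP": {"product": ["Widget A", "Widget B"], "title": None},
    "Invoice": {"company_name": ["Acme", "Globex"], "title": None},
}

# Hypothetical sample content standing in for a repository.
DOCUMENTS = [
    {"doc_type": "SOP", "title": "Line Cleaning Procedure", "product": "Widget A"},
    {"doc_type": "SOP", "title": "Batch Cleanup Checklist", "product": "Widget A"},
    {"doc_type": "SOP", "title": "Packaging Procedure", "product": "Widget B"},
    {"doc_type": "Invoice", "title": "Cleaning supplies", "company_name": "Acme"},
]


def search(documents, doc_type, criteria, sort_by="title"):
    """Pick a document type first, then AND all criteria together (no ORs)."""
    attributes = DATA_DICTIONARY[doc_type]
    for name, value in criteria.items():
        if name not in attributes:
            raise ValueError(f"{name!r} is not an attribute of {doc_type}")
        if attributes[name] is not None and value not in attributes[name]:
            raise ValueError(f"{value!r} is not in the pick list for {name!r}")

    def matches(doc):
        for name, value in criteria.items():
            actual = doc.get(name, "")
            if attributes[name] is not None:
                # Pick-list attribute: exact "equals" comparison.
                if actual != value:
                    return False
            elif value.lower() not in actual.lower():
                # Free-form attribute: case-insensitive "contains".
                return False
        return True

    hits = [d for d in documents if d["doc_type"] == doc_type and matches(d)]
    return sorted(hits, key=lambda d: d.get(sort_by, ""))


def to_csv(results, columns):
    """Export only the selected columns, in the order given."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


hits = search(DOCUMENTS, "SOP", {"product": "Widget A", "title": "clean"})
print(to_csv(hits, ["title", "product"]))
```

Rejecting values that are not in the pick list mirrors the drop-down idea: a misspelled value fails immediately instead of silently returning no results.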

See the complete article on search, specifically for Documentum customers on why just full-text or a “Google” Search doesn’t fit ECM user needs.

Issues with a Federated Approach

One typical approach for providing cross repository search would be to set up a federated search where one interface would initiate searches across each repository and concatenate the results.

[Figure: Federated Search]

We presented this type of approach back in 2011 when discussing OpenSearch for SharePoint.  For this approach, the index and content would remain in the separate systems and integration would just bring together multiple searches into one interface and results list.  Benefits of this approach include:

  • Security – Since the source content management system is being used, any search would follow the security from that system. Other departmental users would need to be defined in the system with appropriate access rights.

Issues with this approach include:

  • Integration – The search interface would need a standard (OpenSearch as presented for SharePoint) but would also require integration into each of the departmental solutions to initiate the search and concatenate the results.
  • Performance – Search performance would be limited by the system with the slowest response time. It would be difficult to tune the system or add indexes to improve performance without updating the departmental system.
  • Content Type – Content in departmental systems is often stored in different formats (think TIFF for old imaging systems). Access to the system is often not enough, as users will need viewers for what could be proprietary content files.
  • Administration – Every user would need to be defined and maintained in every departmental system with a user id and appropriate security.
  • Licensing – Tied to Administration, every user would need to have a license for the departmental system. In the case where ECM was tied to another system (CRM), every user would need a license for both systems.
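
A minimal sketch of the federated pattern, assuming hypothetical repository names, users and simulated latencies (none of these come from a real system). It illustrates two of the points above: each repository applies its own security, so a user must be defined in every source system, and the combined response time is bounded by the slowest repository even if the searches run in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Stand-in for one departmental system (hypothetical, in-memory)."""
    name: str
    latency_seconds: float          # simulated response time
    titles: list = field(default_factory=list)
    authorized_users: set = field(default_factory=set)

    def search(self, user, term):
        # Source-system security applies: users not defined here see nothing.
        if user not in self.authorized_users:
            return []
        return [t for t in self.titles if term.lower() in t.lower()]


def federated_search(repositories, user, term):
    """Query every repository and concatenate the results."""
    combined = []
    for repo in repositories:
        combined.extend((repo.name, title) for title in repo.search(user, term))
    # Even with parallel calls, the caller waits for the slowest repository.
    response_time = max((r.latency_seconds for r in repositories), default=0.0)
    return combined, response_time


REPOSITORIES = [
    Repository("AP", 0.2, ["Invoice 1001 - Acme", "Voucher 77"], {"ann"}),
    Repository("CRM", 3.0, ["Acme invoice dispute note"], {"ann", "bob"}),
]

results, seconds = federated_search(REPOSITORIES, "ann", "invoice")
print(results, seconds)
```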

Issues with a Crawler Approach

With the Crawler approach, a search engine would crawl each of the repositories to extract the meta-data and potentially full-text index components as depicted below:

[Figure: Crawler Search]

For this approach, the content would remain in the separate systems and the index data would be contained in the overall search engine. This approach provides some benefits over the federated search in that:

  • Search Interface – Could be tuned and would not require integration into each of the departmental systems. Search tools will often have crawlers for different systems.
  • Performance – Could be tuned for search results but not for content retrieval as content would still be stored in the underlying systems.

Issues with this approach include:

  • Security – When crawling the different systems, the search engine would have to capture some of the security components. An issue for many clients is making sure that other departments do not see confidential documents or documents that are still in a DRAFT state. Clients have had issues with a crawler approach when the user can see that a document exists (ex: HR Incident Write-up) in the search results even if that document can’t be retrieved from the departmental system.
  • Index Components – Crawlers are often written just to retrieve full-text and not meta-data. Crawler capabilities might need to be tuned to capture both content and index values.
  • Licensing – Every user would still need a license for every departmental ECM system from which content is retrieved. In the case where the ECM is tied to another system (CRM), the user would need only the ECM license rather than a license for both systems, as they are only retrieving the content.
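
The security concern above can be made concrete with a small sketch. It assumes a hypothetical crawler that copies each document's status and an `allowed_groups` list into the central index along with the text; the field names and sample data are illustrative only. The search then trims results before they are returned, so a user never learns that a confidential or DRAFT document exists.

```python
def crawl(repositories):
    """Build one central index entry per document: meta-data, text and ACL."""
    index = []
    for repo_name, documents in repositories.items():
        for doc in documents:
            index.append({
                "repository": repo_name,
                "title": doc["title"],
                "status": doc["status"],
                "text": doc["text"],
                # Security captured at crawl time (hypothetical field).
                "allowed_groups": set(doc["allowed_groups"]),
            })
    return index


def search(index, term, user_groups):
    """Full-text match, trimmed so hidden documents never appear in results."""
    groups = set(user_groups)
    hits = []
    for entry in index:
        if term.lower() not in entry["text"].lower():
            continue
        if entry["status"] == "DRAFT":
            continue  # drafts are never exposed outside the source system
        if not entry["allowed_groups"] & groups:
            continue  # user may not even see that this document exists
        hits.append(entry["title"])
    return hits


SOURCES = {
    "HR": [
        {"title": "Incident Write-up", "status": "APPROVED",
         "text": "Incident report involving vendor Acme", "allowed_groups": ["hr"]},
    ],
    "AP": [
        {"title": "Invoice 1001", "status": "APPROVED",
         "text": "Acme invoice for services", "allowed_groups": ["ap", "new_ventures"]},
        {"title": "Invoice 1002", "status": "DRAFT",
         "text": "Acme invoice pending approval", "allowed_groups": ["ap", "new_ventures"]},
    ],
}

INDEX = crawl(SOURCES)
print(search(INDEX, "acme", ["new_ventures"]))
```

Note that security copied at crawl time is only as fresh as the last crawl, which is part of why this approach is harder to get right than it first appears.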

Publishing Approach

A publishing approach differs in that, when content is ready to be shared, it is “published” or copied to a new repository.

[Figure: Publish Search]

In this approach, a job is set up to monitor the business system, looking for documents of a given type that have reached a stage at which they can be pushed to the separate repository. With this push, the new repository will have all the meta-data as well as a copy of the document itself. Typically we see clients publish just a PDF of the document since it is only to be used for read access. The publishing job might also push a light version of security in the form of meta-data if required.

In this manner, the department can ensure that access to its own system is still controlled while documents that need to be shared are pushed to a common search and retrieval repository. Many of our ECM clients are currently implementing this approach for even just one repository as it provides:

  • Business Continuity – It provides a search and retrieval capability even if the ECM is down or unavailable. See Business Continuity post here.
  • Consumer Only Interface – Removing consumers from the departmental system reduces the load on the ECM system, for both performance and licensing.
  • Simplified Searching – Consumers no longer need access to vendor interfaces. See Documentum comparison review here.

Besides business continuity, advantages of this approach over a Federated or Crawler approach include:

  • Integration – Rather than having to write real-time integration to the departmental repository, the integration would be required only at the publishing job. The Search Interface could be written for just the new repository and take advantage of all the capabilities of that repository.
  • Performance – Search performance is not limited by the system with the slowest response time.
  • Content Type – As part of the publishing job, content could be changed (typically to PDF) and also include additional items (headers/footers….) to provide consistency between systems.
  • Administration – Each user would need to be defined and maintained in the overall search repository rather than the departmental system.
  • Licensing – Tied to Administration, each user would only need to have a license for the search system.
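
The publishing job described in this section might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `publish_rules` mapping of document type to "ready" status, the field names, and the `to_pdf` placeholder, which only tags the bytes and does not perform a real conversion. It shows the selection by type and stage, the copy of meta-data plus a PDF rendition, and a light version of security carried as meta-data.

```python
def to_pdf(content):
    """Placeholder for a real rendition step; it only tags the bytes."""
    return b"%PDF-placeholder\n" + content


def publish(source_documents, publish_rules, target):
    """Copy share-ready documents (meta-data plus a PDF copy) into `target`."""
    published = []
    for doc in source_documents:
        ready_status = publish_rules.get(doc["doc_type"])
        if ready_status is None or doc["status"] != ready_status:
            continue  # wrong type, or not yet at a shareable stage
        target[doc["id"]] = {
            "metadata": dict(doc["metadata"]),
            "pdf": to_pdf(doc["content"]),
            # "Light" security pushed as plain meta-data (hypothetical field).
            "visible_to": list(doc.get("visible_to", [])),
        }
        published.append(doc["id"])
    return published


# Hypothetical rules: which document types are published, and at what stage.
RULES = {"Invoice": "PAID", "SOP": "EFFECTIVE"}

SOURCE = [
    {"id": "inv-1", "doc_type": "Invoice", "status": "PAID",
     "metadata": {"company_name": "Acme"}, "content": b"invoice body",
     "visible_to": ["new_ventures"]},
    {"id": "inv-2", "doc_type": "Invoice", "status": "PENDING",
     "metadata": {"company_name": "Globex"}, "content": b"invoice body"},
    {"id": "hr-1", "doc_type": "Incident", "status": "CLOSED",
     "metadata": {}, "content": b"confidential"},
    {"id": "sop-2", "doc_type": "SOP", "status": "EFFECTIVE",
     "metadata": {"product": "Widget A"}, "content": b"sop body"},
]

search_repository = {}
print(publish(SOURCE, RULES, search_repository))
```

Because the department decides which types and stages appear in the rules, documents such as the HR incident in the sample data never reach the shared repository at all.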

TSG has implemented the publishing approach for multiple clients with our OpenMigrate tool. Features include:

  • Ability to pull from a wide variety of ECM repositories including Documentum, FileNet, Alfresco, SharePoint as well as database driven systems (example Custom Oracle/SAP)
  • Ability to “poll” a repository and push content on a set interval (example 5 minutes or once a day).
  • Ability to transform content from a variety of formats into PDF.
  • Ability to store and index into a variety of repositories including Alfresco, Documentum as well as Lucene/Solr and Hadoop.
  • Ability to delete outdated or superseded documents from target repository.
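
To illustrate the polling and clean-up ideas in the list above, here is a generic sketch. It is not OpenMigrate's API or configuration; the function name, status values and integer timestamps are all hypothetical. Each pass pushes documents modified since the previous poll and removes documents that have been superseded, and it returns a new cursor for the next pass. A scheduler would call it on whatever interval is chosen (every 5 minutes, once a day).

```python
REMOVED_STATUSES = ("SUPERSEDED", "OBSOLETE")  # hypothetical status values


def poll_once(source, target, last_poll):
    """One polling pass: push changes since `last_poll`, drop superseded docs.

    Returns the newest modification time seen, to use as the next cursor.
    """
    newest = last_poll
    for doc in source:
        if doc["modified"] <= last_poll:
            continue  # already handled by an earlier pass
        newest = max(newest, doc["modified"])
        if doc["status"] in REMOVED_STATUSES:
            target.pop(doc["id"], None)  # delete from the target if present
        else:
            target[doc["id"]] = {"title": doc["title"], "version": doc["version"]}
    return newest


SOURCE = [
    {"id": "sop-1", "title": "Cleaning SOP", "version": "1.0",
     "status": "SUPERSEDED", "modified": 120},
    {"id": "sop-2", "title": "Cleaning SOP", "version": "2.0",
     "status": "EFFECTIVE", "modified": 120},
    {"id": "sop-3", "title": "Packaging SOP", "version": "1.0",
     "status": "EFFECTIVE", "modified": 50},
]

# Target state left behind by earlier passes.
target = {
    "sop-1": {"title": "Cleaning SOP", "version": "1.0"},
    "sop-3": {"title": "Packaging SOP", "version": "1.0"},
}

cursor = poll_once(SOURCE, target, last_poll=100)
print(cursor, sorted(target))
```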

Thoughts on the Publishing Repository

We typically recommend leveraging Lucene/Solr as the index component of the publishing repository with a pointer to a file system for the published PDFs.  Other options could include full-feature ECM platforms like Alfresco or Documentum, or some of our newer work to leverage Hadoop.  Look for an article next week on some of the benefits of Hadoop for a publishing solution.
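
The "index plus pointer to a file system" layout can be sketched as follows. This is an assumption-laden illustration: a plain dictionary stands in for the Lucene/Solr index, and the `content_path` field name and the directory layout are hypothetical. The point is only that the index entry holds the searchable meta-data and a relative path, while the PDF bytes live on disk.

```python
import tempfile
from pathlib import Path


def store_published(root, index, doc_id, metadata, pdf_bytes):
    """Write the PDF under `root` and index its meta-data with a pointer."""
    # Spread files across sub-folders using the first characters of the id.
    relative_path = Path(doc_id[:2]) / f"{doc_id}.pdf"
    full_path = Path(root) / relative_path
    full_path.parent.mkdir(parents=True, exist_ok=True)
    full_path.write_bytes(pdf_bytes)
    # The index entry carries only meta-data and the pointer, not the content.
    index[doc_id] = {**metadata, "content_path": relative_path.as_posix()}


def retrieve(root, index, doc_id):
    """Follow the pointer stored in the index to read the published PDF."""
    return (Path(root) / index[doc_id]["content_path"]).read_bytes()


with tempfile.TemporaryDirectory() as root:
    index = {}  # stand-in for the search index
    store_published(root, index, "inv-1", {"company_name": "Acme"}, b"%PDF-sample")
    print(index["inv-1"])
    print(retrieve(root, index, "inv-1"))
```

Storing a relative path rather than an absolute one keeps the index valid if the file store is later moved or mounted elsewhere.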

If you have any thoughts please comment below.

Filed Under: Alfresco, Documentum, Hadoop, Lucene, Migrations, OpenMigrate Tagged With: Hadoop
