
Technology Services Group

Documentum Search – Lucene versus FAST

March 17, 2010

As mentioned in a previous article, many clients are moving away from FAST in preparation for the release of Documentum Search Services (DSS), slated for June, which leverages the open source product Apache Lucene.  This post shares the results from one client that ran a proof of concept to compare the two search engines.

Proof of Concept Approach – As we have mentioned before, many clients have decided to implement an external cache outside of Documentum to address business continuity, performance and licensing issues.  For a large pharmaceutical client, TSG was tasked with performing a proof of concept on 156,000 documents in an external data source indexed by Lucene.  The proof of concept compared the search results and response times of FAST within Documentum (Webtop) against Lucene (HPI) outside of Documentum.  It also evaluated using Lucene for metadata storage rather than storing the metadata in a separate database such as Oracle.

POC Findings – Lucene/HPI with the external repository was found to be considerably quicker than the existing FAST/Webtop implementation on most queries.

Specific results:

Query            FAST/Webtop    Lucene/HPI
1200 Results     90 seconds     3 seconds
8 Results        5 seconds      3 seconds
10 Results       8 seconds      4 seconds
76 Results       10 seconds     5 seconds
5100 Results     72 seconds     5 seconds
65 Results       6 seconds      3 seconds
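
To put the timings above in relative terms, the short Python sketch below computes the ratio of FAST/Webtop time to Lucene/HPI time for each query. The numbers are transcribed from the results above; the variable and function names are illustrative only and are not part of the proof of concept.

```python
# Timings transcribed from the results above:
# (result count, FAST/Webtop seconds, Lucene/HPI seconds)
timings = [
    (1200, 90, 3),
    (8, 5, 3),
    (10, 8, 4),
    (76, 10, 5),
    (5100, 72, 5),
    (65, 6, 3),
]


def speedups(rows):
    """Return (result_count, FAST seconds / Lucene seconds) for each row."""
    return [(count, fast / lucene) for count, fast, lucene in rows]


for count, ratio in speedups(timings):
    print(f"{count:>5} results: Lucene/HPI took 1/{ratio:.1f} of the FAST/Webtop time")
```

The gap is widest on the two largest result sets (1200 and 5100 results), which is consistent with the point raised in the comments that per-result work on the Webtop side, such as ACL checks, grows with the size of the result set.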

A simple configuration of the Lucene index returned a more complete result set than the standard FAST/Webtop configuration.  The additional hits included documents containing logical derivatives of the search words.  For example, a search for “exception report” could return “exceptions report” or “exception reports”.  The proof of concept data set also included German documents, and Lucene demonstrated multilingual stemming capability.
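
The stemming behavior described above can be illustrated with a small Python sketch. This is not Lucene's analyzer: `naive_stem` is a deliberately crude, hypothetical stand-in that only strips a trailing plural "s", and the document titles are made up for the example. It is meant only to show why reducing both the query and the indexed text to stems lets “exception report” match its plural variants.

```python
def naive_stem(word):
    """Toy stemmer: lowercase and strip a trailing plural 's'.

    Hypothetical stand-in for a real analyzer; real stemmers handle far more cases.
    """
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stemmed_tokens(text):
    """Split on whitespace and stem every token."""
    return [naive_stem(token) for token in text.split()]


def phrase_matches(query, title):
    """True if the stemmed query tokens appear as a consecutive run in the stemmed title."""
    q, t = stemmed_tokens(query), stemmed_tokens(title)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


# Made-up titles for illustration only.
titles = ["Exceptions Report Q1", "Exception Reports archive", "Expense Report"]
hits = [title for title in titles if phrase_matches("exception report", title)]
print(hits)
```

Both plural variants match because the query and the titles are reduced to the same stems before comparison, while "Expense Report" does not.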

Key Stats – Lucene

  • 156,000 Documents – 31.6 Gigabytes
  • Total Index Space – 521 MB
  • Total Index Build Time – 10 hours – The client was very interested in the time it took to index the content and metadata in Lucene because they had experienced lengthy indexing times with FAST in their 5.3 upgrade. This was tracked as part of the proof of concept; however, the corresponding FAST data from the 5.3 upgrade is no longer available.
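
A few derived figures follow directly from the stats above. The Python sketch below works them out; the only assumption added here is that 1 GB is treated as 1024 MB, since the post does not say which convention was used.

```python
# Figures taken from the key stats above.
DOCS = 156_000
CORPUS_GB = 31.6
INDEX_MB = 521
BUILD_HOURS = 10

# Assumption: 1 GB = 1024 MB (the post does not state the convention).
corpus_mb = CORPUS_GB * 1024

index_ratio = INDEX_MB / corpus_mb           # index size relative to the content
docs_per_hour = DOCS / BUILD_HOURS           # average indexing throughput
docs_per_second = docs_per_hour / 3600
avg_doc_kb = corpus_mb * 1024 / DOCS         # average document size

print(f"Index is {index_ratio:.1%} of the content size")
print(f"Throughput: {docs_per_hour:,.0f} docs/hour ({docs_per_second:.1f} docs/second)")
print(f"Average document size: {avg_doc_kb:.0f} KB")
```

Under that assumption the index is under 2% of the content size, and the build averaged roughly four documents per second.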

FAST and Lucene – Full Text Syntax Differences

  • FAST
    • “One Two” – will return documents with the exact phrase “One Two” in the document
    • One Two – will return documents with the words One OR Two in the document
    • One+Two – will return documents with the words One OR Two in the document
    • One and Two – will return documents with the words One AND Two in the document
  • Lucene – Based on the Proof of Concept’s configuration
    • “One Two” – will return documents with the exact phrase “One Two” in the document
    • One Two – will return documents with the words One AND Two in the document
    • One OR Two – will return documents with the words One OR Two in the document
    • One and Two – will return documents with the words One AND Two in the document
    • One+Two  – will return documents with the exact phrase “One Two” in the document
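
The differences listed above can be captured in a small lookup table. The Python sketch below is a toy matcher, not the FAST or Lucene query parser: `RULES`, `classify` and `matches` are hypothetical names, it only understands the two-term query shapes listed above, and it uses straight quotes for phrases. The `OR` keyword is not described for FAST in the list, so the sketch raises an error rather than guessing.

```python
# Operator semantics transcribed from the list above.
# "or_kw" is intentionally absent for FAST because the list does not describe it.
RULES = {
    "fast":   {"quoted": "phrase", "bare": "or",  "plus": "or",     "and_kw": "and"},
    "lucene": {"quoted": "phrase", "bare": "and", "plus": "phrase", "and_kw": "and", "or_kw": "or"},
}


def classify(query):
    """Return (form, terms) for the query shapes covered in the list above."""
    q = query.strip()
    if len(q) >= 2 and q.startswith('"') and q.endswith('"'):
        return "quoted", q[1:-1].split()
    if "+" in q:
        return "plus", [t.strip() for t in q.split("+") if t.strip()]
    words = q.split()
    if "OR" in words:
        return "or_kw", [w for w in words if w != "OR"]
    if "and" in [w.lower() for w in words]:
        return "and_kw", [w for w in words if w.lower() != "and"]
    return "bare", words


def matches(query, document, engine):
    """Apply the engine's listed semantics to a whitespace-tokenized document."""
    form, terms = classify(query)
    mode = RULES[engine].get(form)
    if mode is None:
        raise ValueError(f"{form!r} is not described for {engine!r}")
    doc_words = document.lower().split()
    terms = [t.lower() for t in terms]
    if mode == "phrase":
        n = len(terms)
        return any(doc_words[i:i + n] == terms for i in range(len(doc_words) - n + 1))
    if mode == "and":
        return all(t in doc_words for t in terms)
    return any(t in doc_words for t in terms)


doc = "one red fish"
print(matches("One Two", doc, "fast"))    # bare terms are OR'ed in FAST
print(matches("One Two", doc, "lucene"))  # bare terms are AND'ed in the PoC's Lucene configuration
```

The practical consequence is that the same unquoted two-word search tends to return a broader set in FAST (either word) and a narrower one in the proof of concept's Lucene configuration (both words), so users moving between the two should expect different hit counts for identical input.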

Overall Thoughts

Overall, the client was very satisfied with the findings and is moving forward with the solution.  The flexibility of Lucene to index both the metadata and full-text values allowed the client to avoid adding an Oracle database to their external cache for attribute storage.  The client also liked the simpler, more intuitive search interface of HPI compared to the Webtop interface.

In addition to leveraging Lucene for searching an external cache, we are also working to leverage Lucene for internal Documentum/Webtop search.

If you have any questions or would like more detailed information, please contact us or comment below:

Filed Under: Documentum, Lucene, OpenContent Management Suite, Product Suite, R&D, Search

Comments

  1. Anhtuan Doventry says

    March 18, 2010 at 8:27 am

    wow…that’s fast!

  2. mikew says

    March 19, 2010 at 7:51 am

    Is that a valid comparison? A FAST search via Webtop also has to process the security applied by all the ACLs specified in Documentum for each result. Did the external repository also have this concept? What was the external repository?

  3. bethtee says

    March 19, 2010 at 12:29 pm

    Mike – I would agree with your point, the comparison is not completely apples to apples and there are definitely processing tasks on the Webtop/FAST side of things that were not done on the Lucene/HPI side (e.g., ACL application). The client was frustrated with full text search performance and could expose documents out to an external read-only cache where all users have access. This approach enables consumers to quickly search for content without unnecessary Documentum overhead.

    We realize that this solution does not always apply – there are situations where ACL security needs to be honored. As mentioned in the post, we are also looking at integrating Lucene with Documentum which (to your point) would provide a more apples to apples comparison. Once we have more concrete findings, we will post on that.

  4. Ramesh says

    March 22, 2010 at 12:51 am

    I agree that FAST is very bad with the results it provides, but trust me, it is FAST. It takes about 1-2 seconds on a 2 million document repository a few terabytes in size. What takes longer is the ACL verification. You can validate this yourself by going to the search interface provided by the index server and firing an FT-DQL query.

    Just saying, the comparison is not fair. But I like the community embracing open source search platforms.

    -Ramesh

    • TSG Dave says

      March 24, 2010 at 10:39 am

      Ramesh,

      You bring up a great point in regards to security and ACLs. The client was looking for something faster, and the combination of the cache and Lucene was definitely faster for what was loosely secured content. We have been doing the “web cache” approach for a while with just attributes, so adding Lucene wasn’t that much of a stretch. Like Bethany said – not really apples to apples.

      Look for another post here shortly on our approach to searching with Lucene in Documentum and preserving ACL security. I just saw the first draft so it should be out in a day or so.

      Dave
