We commonly see clients struggle with reporting requirements for their ECM platform, whether that is Documentum, Alfresco or another ECM repository. While most ECM repositories include a relational database in their infrastructure, that database is usually difficult to navigate with standard SQL-based reporting tools. This post discusses how TSG has been leveraging Open Source and Big Data tools to provide more “out of the box” reporting for ECM customers.
ECM Reporting – What are the issues?
Reporting requirements for typical ECM systems can be divided into two basic categories:
1. Reporting on Documents
Reporting on Documents can include requirements like:
- How many documents were created, approved or retired last month, faceted by specific object types or attribute values?
- What is my backlog of unprocessed documents?
These requirements can typically be answered from the documents and attributes already stored in the system, but they can be difficult to report on because the metadata that drives the reporting might be constantly changing.
2. Reporting on Actions
Reporting on Actions can include requirements like:
- How many times was this document viewed?
- Who has viewed this document?
- How long did the approver’s task for this document take to complete?
Most legacy reporting tools (such as Cognos and Pentaho) focus on SQL reporting against a relational database, so the difficulty with most ECM systems centers on the inability to pull reporting statistics directly from the underlying, abstracted relational database. The way that Alfresco and Documentum abstract and normalize their databases makes it impractical to point these tools at the database and pull any meaningful data. Most ECM relational databases spread data across a variety of tables so that attributes can be added easily. This structure makes it difficult to report on document metadata.
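To make the difficulty concrete, here is a minimal sketch using a hypothetical, heavily simplified layout in which every attribute is stored as its own row. The table and column names (`node`, `node_property`, `prop_name`, `prop_value`) are assumptions for illustration only and are not the actual Documentum or Alfresco schema. Even a simple question like "how many approved documents per vendor?" needs one self-join per attribute.

```python
import sqlite3

# Hypothetical, simplified layout: one row per (document, attribute) pair.
# This is NOT the real Documentum or Alfresco schema, only an illustration
# of a structure that makes adding attributes easy and reporting harder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, type_name TEXT);
    CREATE TABLE node_property (
        node_id INTEGER REFERENCES node(id),
        prop_name TEXT,
        prop_value TEXT
    );
    INSERT INTO node VALUES (1, 'invoice'), (2, 'invoice'), (3, 'contract');
    INSERT INTO node_property VALUES
        (1, 'status', 'Approved'), (1, 'vendor', 'Acme'),
        (2, 'status', 'Draft'),    (2, 'vendor', 'Acme'),
        (3, 'status', 'Approved'), (3, 'vendor', 'Globex');
""")

# "How many approved documents per vendor?" needs one self-join per
# attribute, because each attribute lives in its own row.
report_sql = """
    SELECT v.prop_value AS vendor, COUNT(*) AS approved_docs
    FROM node n
    JOIN node_property s ON s.node_id = n.id AND s.prop_name = 'status'
    JOIN node_property v ON v.node_id = n.id AND v.prop_name = 'vendor'
    WHERE s.prop_value = 'Approved'
    GROUP BY v.prop_value
    ORDER BY v.prop_value
"""
approved_by_vendor = conn.execute(report_sql).fetchall()
print(approved_by_vendor)
```

Every additional attribute in the report adds another join, which is the kind of query that drag-and-drop SQL reporting tools tend to handle poorly.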
Another major issue is typically the ECM tool’s built-in audit capabilities. While the audit capabilities exist, the resulting relational table within the database can quickly become overwhelmed (tens of millions of entries). As an example, Documentum provides a capability to turn on an audit trail for document viewing. While this would seem to address our requirement of who has viewed the document, Documentum counts a view not when the document file is requested (through the API, for example) but when any of the metadata is viewed. For a search that returns 100 results in a list, the audit trail would gain 100 entries, quickly filling the audit trail.
ECM Reporting with Big Data Tools – A New Approach
For multiple clients, TSG has developed solutions leveraging Big Data Open Source tools to construct robust “out of the box” reporting. The solutions are constructed in such a way as to not affect performance or flexibility of the underlying ECM system. The individual components we recommend make up the Open Source ELK stack:
- Elasticsearch – Robust Open Source index (built on Apache Lucene efforts) for indexing timestamped events/actions.
- Logstash – Robust Open Source data collection pipeline for pushing data into Elasticsearch.
- Kibana – Reporting tool for visualizing and navigating data indexed in Elasticsearch.
To illustrate how these tools can be leveraged for ECM reporting, consider the following two scenarios.
- File Access – Clients have used the ELK stack to log an event into a standalone log file every time a file is retrieved. Logstash then captures and streams that log data to Elasticsearch. The event includes all of the attributes (e.g., Document Status, Vendor Name) as well as the username and time of the event. Leveraging the Kibana GUI, the business can quickly construct a report of how many documents of a certain type were retrieved as well as who is reading those documents.
- Performance – Some of our clients have included additional information in the logging, such as performance data. While more difficult to capture, the retrieval time was calculated as the time between when the file (or search) was requested and when the results were ultimately returned to the user. In this manner, administrators can get a clear understanding of system performance by time of day as well as by user activity.
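As a rough illustration of the two scenarios above, the sketch below appends one JSON object per line to a standalone log file, a format that a Logstash file input with a JSON codec can typically parse. The function names (`write_event`, `timed_retrieval`), the field names (`action`, `username`, `elapsed_ms`, `document_status`, `vendor_name`) and the log file name are all assumptions for illustration, not the format used in any client implementation.

```python
import json
import os
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path


def write_event(log_path, action, username, attributes, elapsed_ms=None):
    """Append one event as a single JSON line and return the event dict."""
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "username": username,
        **attributes,  # document metadata, e.g. status and vendor
    }
    if elapsed_ms is not None:
        event["elapsed_ms"] = round(elapsed_ms, 1)
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
    return event


def timed_retrieval(log_path, username, attributes, fetch):
    """Run fetch(), then log the retrieval along with how long it took."""
    started = time.perf_counter()
    content = fetch()
    elapsed_ms = (time.perf_counter() - started) * 1000
    write_event(log_path, "getContent", username, attributes, elapsed_ms)
    return content


# Example usage with a stand-in for the real content retrieval call.
log_file = os.path.join(tempfile.mkdtemp(), "ecm_events.log")
content = timed_retrieval(
    log_file,
    "jsmith",
    {"document_status": "Approved", "vendor_name": "Acme"},
    lambda: b"file bytes",
)
print(Path(log_file).read_text(encoding="utf-8"))
```

Because each event already carries the document attributes, the username and the elapsed time, the reporting side does not need to go back to the ECM database to answer "who retrieved what, and how long did it take?".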
Adding Big Data reporting to Documentum or Alfresco
To add this functionality to your existing ECM repository, clients have added hooks at the following places to create the “event” entries:
- Documentum – TBO Code (for low level events like create, checkout, checkin, getContent, etc.)
- Alfresco – Alfresco Behaviour (for low level events like create, checkout, checkin, getContent, etc.)
- Application specific events (for logging when users click on certain actions in the UI in Webtop/D2)
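Real TBOs and Alfresco Behaviours are written in Java against each product's own API, so the sketch below is only a language-neutral illustration of the hook pattern, not actual Documentum or Alfresco code. The decorator `emits_event`, the `get_content` function, the in-memory `EVENTS` list and the document dictionary are all hypothetical. The point it tries to show is that the hook records the event after the low-level operation completes and never lets a reporting failure break that operation.

```python
import functools

# In-memory stand-in for the event log; a real hook would write to the
# standalone log file that Logstash watches.
EVENTS = []


def emits_event(action):
    """Decorator that records an event after the wrapped operation succeeds."""
    def decorator(operation):
        @functools.wraps(operation)
        def wrapper(document, username, *args, **kwargs):
            result = operation(document, username, *args, **kwargs)
            try:
                EVENTS.append(
                    {"action": action, "username": username, **document["attributes"]}
                )
            except Exception:
                # Reporting must never break the underlying ECM operation.
                pass
            return result
        return wrapper
    return decorator


@emits_event("getContent")
def get_content(document, username):
    """Hypothetical low-level operation that returns the document's bytes."""
    return document["content"]


doc = {
    "content": b"%PDF-1.7",
    "attributes": {"object_type": "invoice", "status": "Approved"},
}
data = get_content(doc, "jsmith")
print(EVENTS)
```

Keeping the event creation inside a wrapper like this means the same approach can be applied to create, checkout, checkin and similar operations without changing their behaviour.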
Summary
ECM customers have always struggled with reporting. By adding Big Data reporting, users can get better reporting while not affecting their ECM infrastructure.