TSG conducted our annual client briefing on Monday, June 5th. One of the more interesting presentations and discussions covered how clients are finding innovative ways to leverage the capabilities of various object storage devices and technologies. This post presents some of our experiences and best practices regarding object storage and ECM.
Object Storage and ECM
Most ECM products grew up in a pre-object storage world where the ECM system stored documents in mounted file stores or a SAN controlled by and only accessed by the ECM system. For the majority of systems, the file names and storage areas were created and managed exclusively by the ECM software. Typically the ECM interface or application would call an ECM API to create the database entries for meta-data as well as store the file in a mounted file store. In the old client-server days, the API would then ship the document file to the server containing the ECM software.
As browser systems evolved, different interfaces would use default browser transmission capabilities to ship the document file to the application server and then call the ECM API to store the document file in the mounted file store. Storing the file might also kick off other activities, such as PDF renditioning or full-text indexing.
Object Storage provides the ability to move beyond the mounted file system to more secure and efficient storage. Rather than relying on a file path and file name, a storage device or application can just store the files and return a pointer that can be placed in the ECM system. Some additional components (typically called connectors) might be required for the ECM system to store and access content in the object store.
As object storage has evolved, it has gained many capabilities that the ECM system can creatively take advantage of, including:
- De-duplication – rather than store two copies of the same file, if the object store realizes that the files are exactly the same, it can remove one copy to reduce storage while making it appear that both files still exist with multiple pointers.
- High Speed Ingestion – As a separate layer or architecture, the separate hardware and software can allow speedy and parallel storage of files.
- Object Store and File Path – the Object Store can leverage both object storage with simple pointers as well as emulate an actual filesystem with folders/paths and metadata.
- Encryption – Object Stores can be set up with encryption embedded within the object store’s hardware, removing the performance impact of performing the encryption on the already overloaded ECM system.
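The de-duplication behavior in the list above can be illustrated with a small content-addressed store. This is only a toy sketch, not how any particular object store implements it: the `DedupStore` class, its method names, and the `obj-N` pointer format are made up for illustration, and real products may de-duplicate at the block level or in hardware.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): identical bytes are kept once."""

    def __init__(self):
        self._blobs = {}     # content hash -> bytes (the single physical copy)
        self._pointers = {}  # pointer id -> content hash
        self._next_id = 0

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Keep the bytes only if this exact content has not been seen before.
        self._blobs.setdefault(digest, data)
        # Every caller still receives its own pointer.
        self._next_id += 1
        pointer = f"obj-{self._next_id}"
        self._pointers[pointer] = digest
        return pointer

    def get(self, pointer: str) -> bytes:
        return self._blobs[self._pointers[pointer]]

    def physical_copies(self) -> int:
        return len(self._blobs)


store = DedupStore()
a = store.put(b"quarterly report")
b = store.put(b"quarterly report")  # same bytes, different pointer
c = store.put(b"different file")
```

Here `a` and `b` are distinct pointers that resolve to the same content, while the store physically holds only two blobs for the three uploads.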
Storing and Linking with ECM and Object Stores
For a typical migration or ingestion, most ECM programs call the ECM API to ingest the file and create the meta-data in the ECM repository. With Alfresco, one concept supported by our migration tool, OpenMigrate, is to leverage the high-speed ingestion of the object store and then add the link into Alfresco rather than having the Alfresco API do the file storage. By allowing the object store to handle the ingestion of the file itself, the process removes the overhead of transmitting the file to the application server as well as the overhead of the Alfresco createNode API. From a repository perspective, the entry in the Alfresco DB is exactly the same as if the API had been called.
With the Alfresco API, typical client environments are limited by network bandwidth as well as memory and CPU constraints within Alfresco. With direct storage in the object store, we have seen our typical throughput grow from 20 documents/second to approach 250 documents/second or more. We are working on ways to embed the linking approach within our OpenContent webservices and JavaScript to add the bulk upload linking capabilities to our typical user upload functions.
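The store-then-link flow can be sketched roughly as follows. This is not OpenMigrate or the Alfresco API: `ObjectStore`, `Repository`, `link_node`, and `ingest` are hypothetical in-memory stand-ins, used only to show the shape of the two steps (parallel writes to the object store, then a meta-data entry that holds just the pointer).

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class ObjectStore:
    """Hypothetical stand-in for an object store client."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        object_id = uuid.uuid4().hex
        self._objects[object_id] = data
        return object_id  # the pointer handed back to the caller

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


class Repository:
    """Hypothetical stand-in for the ECM meta-data database."""

    def __init__(self):
        self.nodes = []

    def link_node(self, name: str, object_id: str, properties: dict) -> dict:
        # Only meta-data and the pointer are recorded; no file bytes arrive here.
        node = {"name": name, "content_ref": object_id, **properties}
        self.nodes.append(node)
        return node


def ingest(documents, store, repo, workers=8):
    # Step 1: write the file bytes to the object store in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        object_ids = list(pool.map(lambda d: store.put(d["data"]), documents))
    # Step 2: create the repository entries that link to the stored objects.
    return [
        repo.link_node(d["name"], oid, d["properties"])
        for d, oid in zip(documents, object_ids)
    ]


docs = [
    {"name": f"invoice-{i}.pdf", "data": f"content {i}".encode(), "properties": {"type": "invoice"}}
    for i in range(3)
]
store, repo = ObjectStore(), Repository()
nodes = ingest(docs, store, repo)
```

The design point is that the repository side never handles the file bytes, so the expensive transfer can be scaled independently of the ECM server.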
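To put those throughput figures in perspective, here is a quick back-of-the-envelope calculation. The batch size of one million documents is a hypothetical example, not a number from a client project, and the rates are treated as constant for simplicity.

```python
def hours_to_ingest(documents: int, docs_per_second: float) -> float:
    """Elapsed hours to ingest a batch at a constant rate."""
    return documents / docs_per_second / 3600


BATCH = 1_000_000  # hypothetical migration size

via_api = hours_to_ingest(BATCH, 20)       # rate seen through the API
via_linking = hours_to_ingest(BATCH, 250)  # rate seen with direct storage and linking

print(f"API: {via_api:.1f} h, linking: {via_linking:.1f} h, "
      f"speedup: {via_api / via_linking:.1f}x")
# prints: API: 13.9 h, linking: 1.1 h, speedup: 12.5x
```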
Preserving Existing Integrations with the Object Store File Mapping Features
Another great use of the Object Store is the ability to have it work both as a mounted file system and as an Object Store. For one of our clients, having existing integrations continue to use a fairly complex file system to store documents, while the object store passes the object-id to ECM for linking, provides a best-of-both-worlds approach. Existing integrations can remain unchanged while the ECM repository is not clogged up with long and unnecessary file locations and file names. As the integrations are changed to either leverage the file store or different directory/file naming, the ECM system will remain consistent with the object store ids.
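A minimal sketch of that dual-access idea, assuming a store that keeps a path-to-id mapping alongside the objects themselves. `DualAccessStore` and its methods are invented for illustration and do not correspond to any vendor's API; the sample path is also made up.

```python
import uuid


class DualAccessStore:
    """Hypothetical store reachable both by file path and by object id."""

    def __init__(self):
        self._objects = {}  # object id -> bytes
        self._paths = {}    # file path -> object id

    def write_file(self, path: str, data: bytes) -> str:
        """Path-based write, as a legacy integration would do it."""
        object_id = self._paths.get(path) or uuid.uuid4().hex
        self._objects[object_id] = data
        self._paths[path] = object_id
        return object_id  # this id is what gets linked in the ECM system

    def read_by_path(self, path: str) -> bytes:
        return self._objects[self._paths[path]]

    def read_by_id(self, object_id: str) -> bytes:
        return self._objects[object_id]

    def move(self, old_path: str, new_path: str) -> None:
        # Renaming only touches the path mapping; the object id is unchanged.
        self._paths[new_path] = self._paths.pop(old_path)


dual = DualAccessStore()
oid = dual.write_file("/claims/2017/region-4/batch-12/scan_0001.tif", b"image bytes")
ecm_record = {"title": "Claim scan", "content_ref": oid}  # no path stored in ECM

# Later the integration switches to a simpler directory layout.
dual.move("/claims/2017/region-4/batch-12/scan_0001.tif", "/claims/scan_0001.tif")
still_there = dual.read_by_id(ecm_record["content_ref"])
```

Because the ECM record holds only the object id, the directory change does not require any update on the ECM side.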
Consumer and Browser Integration to the Object Store Rather Than to the ECM System
Another innovative approach similar to the object linking involves having users or other systems store their content directly into the object store and posting the link either in the ECM system or another application. The scenario for one client involves the upload of large video files to Amazon’s S3 object store. The non-ECM application allows users from their smartphones or other devices to upload a video directly to Amazon to take advantage of Amazon’s CloudFront service to quickly store and stream the video without any ECM involvement. For our clients that are worried about the network constraints of having hundreds of users stream video from within their ECM architecture, this is an easy way to leverage the scale of Amazon’s infrastructure to offload the storage and network strain from the ECM repository.
From the ECM application, the browser interface queries the non-ECM system for any videos associated with a given case and streams them directly from S3. In talking with the clients about what TSG is doing with video annotations, we are considering storing the annotations outside of the ECM system in the Object Store with meta-data added to the non-ECM system about the video annotations.
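The lookup side of this pattern might look something like the sketch below. It is a simplified in-memory model, not the client's application: `VideoRegistry`, the case ids, the object keys, and the `media.example.com` base URL are all placeholders, and the actual upload to S3 is left out entirely.

```python
from collections import defaultdict


class VideoRegistry:
    """Hypothetical non-ECM application that tracks uploaded videos per case."""

    def __init__(self, base_url: str):
        self._base_url = base_url.rstrip("/")
        self._by_case = defaultdict(list)  # case id -> object keys

    def register_upload(self, case_id: str, object_key: str) -> None:
        # Called after a device has uploaded the video straight to the object store.
        self._by_case[case_id].append(object_key)

    def videos_for_case(self, case_id: str) -> list:
        # The ECM browser interface asks for links and streams from them directly.
        return [f"{self._base_url}/{key}" for key in self._by_case.get(case_id, [])]


registry = VideoRegistry("https://media.example.com/")
registry.register_upload("case-42", "videos/case-42/inspection.mp4")
registry.register_upload("case-42", "videos/case-42/follow-up.mp4")

links = registry.videos_for_case("case-42")
no_links = registry.videos_for_case("case-99")
```

The ECM repository never stores or serves the video bytes in this arrangement; it only needs the case identifier to ask the other system for links.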
Summary
Initial ECM systems had to manage both the meta-data and the file storage requirements for ECM. With the continued evolution, performance and capabilities of object stores including Amazon S3 and Hitachi Content Platform (HCP), TSG is recommending that clients consider innovative ways to leverage these object stores to improve document ingestion and retrieval.
Let us know your thoughts below.