The Hadoop Distributed File System (HDFS) can store an enormous quantity of files with redundancy. In our first release of OpenContent for Hadoop, we have included the ability to annotate PDF documents with OpenAnnotate and to store and retrieve the resulting annotation layers in Hadoop. This post describes the integration with Hadoop as the ECM repository and highlights some benefits of using an annotation tool built on open specifications.
PDF Annotations – What is involved?
Too often, users of Acrobat or other client-based PDF tools think of annotation as something that is done only with desktop tools. In a Hadoop environment, storing annotations back in the Hadoop repository as a separate, secured layer cannot be done with Acrobat or other PDF client tools unless additional software is installed on each client machine. The IT burden of supporting a client-based tool was one of the reasons Documentum, an ECM vendor, discontinued support for its own Acrobat annotation software.
OpenAnnotate supports adding annotations in the browser, leveraging the XFDF standard from Adobe. With OpenAnnotate, Hadoop users can:
- View a document in the browser window
- Add their own annotations
- See other users' annotations appear as separate layers as they are added (real-time collaboration)
- Store their annotations back in Hadoop
- Download the document with the annotations embedded in the PDF for printing or distribution
OpenAnnotate supports all modern browsers including IE9+, Chrome, Safari and Firefox.
Hadoop Repository – How are Annotations Stored?
XFDF files are stored in Hadoop as a separate file for each reviewer, and these files are assembled when a document is requested. This ensures that users can add and edit only their own annotations. It also lets users work on a document collaboratively without having to worry about who has the document “checked out”.
Because each reviewer's annotations live in a separate XFDF file, each file can also carry its own security settings. As a quick example:
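As a rough illustration of assembling per-reviewer layers on request, the Python sketch below merges several XFDF documents into one. It is not OpenAnnotate's implementation: the function name merge_xfdf and the reviewer-to-text dictionary are hypothetical, and reading the files from HDFS is left out. It assumes each file uses the standard XFDF namespace and keeps its annotations under an annots element.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"
# Serialize XFDF elements with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", XFDF_NS)


def merge_xfdf(layers):
    """Combine per-reviewer XFDF documents into a single XFDF document.

    `layers` is a hypothetical input: a dict mapping a reviewer name to
    the text of that reviewer's XFDF file.
    """
    merged_root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    merged_annots = ET.SubElement(merged_root, f"{{{XFDF_NS}}}annots")

    # Sort by reviewer name so the merged output is deterministic.
    for reviewer in sorted(layers):
        root = ET.fromstring(layers[reviewer])
        annots = root.find(f"{{{XFDF_NS}}}annots")
        if annots is None:
            continue  # this reviewer has no annotations yet
        for annotation in annots:
            merged_annots.append(annotation)

    return ET.tostring(merged_root, encoding="unicode")
```

Because each reviewer's file is only read during the merge, no reviewer's stored layer is modified when another reviewer's annotations are added to the combined view.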
- User 1 might be able to see all annotations
- User 2 might be able to see all users’ annotations except User 1’s
- User 3 might not be able to see (or store) any annotations
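The three-user example above can be sketched as a small policy lookup that decides which reviewers' layers are assembled for a given viewer. This is only an illustration under assumed names (LayerPolicy, POLICIES, visible_owners); the source does not describe how OpenAnnotate or Hadoop actually enforces these permissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerPolicy:
    """Hypothetical per-user annotation permissions."""
    can_view: bool = True
    can_store: bool = True
    hidden_owners: frozenset = frozenset()  # owners whose layers are hidden


# Illustrative policies mirroring the three users in the example.
POLICIES = {
    "user1": LayerPolicy(),                                   # sees everything
    "user2": LayerPolicy(hidden_owners=frozenset({"user1"})),  # all but user1's
    "user3": LayerPolicy(can_view=False, can_store=False),    # no annotations
}

# Assumption: users without an explicit policy get no access.
NO_ACCESS = LayerPolicy(can_view=False, can_store=False)


def visible_owners(viewer, owners, policies=POLICIES):
    """Return the layer owners whose annotations `viewer` may see."""
    policy = policies.get(viewer, NO_ACCESS)
    if not policy.can_view:
        return []
    return [owner for owner in owners if owner not in policy.hidden_owners]
```

The returned list could then drive which per-reviewer XFDF files are read when the document is requested, so hidden layers never reach the viewer's browser.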
XFDF versus other proprietary annotations
Many other annotation tools do not follow the open XFDF specification and instead use their own proprietary formats. In our work with clients, we have found that converting these proprietary formats as part of a migration can be very difficult. With support from Adobe, XFDF is the recognized industry standard and has benefits not found in other formats, including:
- Ability to view the annotations with Acrobat Reader
- Open specification, which prevents vendor lock-in
- Ease of migration to and from other tools
Summary
Hadoop users storing PDF documents should consider adding PDF annotations to their business processes so that their content can be used for collaboration and review. For documents that are stored in Word or other print formats, please see our integration with Adlib to create PDF renditions of those documents in Hadoop.