Editor’s Note: This article was originally posted in August of 2010. We’ve been answering quite a few questions lately from businesses looking to migrate off of their older FileNet systems. Having been away from this area for a bit, I was looking back over some older blog entries I had written. I came across this write-up, which described a unique engagement where we ended up pulling some of the FileNet migration knowledge we had put INTO OpenMigrate, OUT OF OpenMigrate, and using it as the basis for a pretty complex—but successful—migration. Most of our projects do not require quite this much engineering, reverse-engineering, and nerdy problem-solving; but it’s nice to know we were up to the task when we needed to be.
I recently had the opportunity to work with a defense company that was looking to migrate data out of FileNet using our OpenMigrate solution. Compared to the other FileNet migrations we’ve done, this one seemed much simpler at first, since they only had three doc classes they wanted to migrate, but we soon realized we had some challenges ahead of us.
Challenge One: A Very, Very Old AIX Server
First we found that the FileNet server they were using was a very old AIX machine, and that the latest version of Java supported on that version of AIX was Java 1.1. OpenMigrate (OM), on the other hand, had only been run on Java 1.4 and Java 1.5. We might have been able to stretch OM back to Java 1.3, but anything lower than that would probably not have worked, since OM is built on the Spring framework. Instead, we decided to take OpenMigrate’s FileNet logic and execute the steps manually.
The first step was to run queries against the database to extract all the metadata for each doc class into Excel spreadsheets for the client (we could also have loaded it into a database). Here are a few things we encountered that are good to remember when querying records in FileNet:
- The F_DOCNUMBER column is populated by FileNet in sequential order, so a higher doc number signifies a record that was created later in time.
- FileNet date fields, like F_ENTRYDATE, are integer Julian-type dates: the number of days since 1/1/1970. In our case, we had to convert the entry dates to a readable date format to figure out how many documents were created in a given year, in order to assist with capacity planning. All we did was add the value of F_ENTRYDATE, in days, to the date 1/1/1970 to figure out what date it represented. For example, the date 1/1/2010 would be represented in the F_ENTRYDATE column as 14610 (see the short conversion sketch after this list).
- Some FileNet servers store multi-page documents as one document and some store them as multiple documents, one per page. In our case, by looking at some sample data, we found that FileNet stored each page as a separate document and merged them together when serving the document up to the user for viewing. The number of pages in the document is recorded in the F_PAGES column, with one exception: for a document with only one page, FileNet does not store a 1 in that column as you would expect, but a NULL.
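To make the date and page-count handling concrete, here is a minimal Python sketch of the arithmetic described above. It is only an illustration (the real work was done in the extraction queries), and the function names are our own.

```python
from datetime import date, timedelta

FILENET_EPOCH = date(1970, 1, 1)  # F_ENTRYDATE counts days from this date

def entry_date_to_date(f_entrydate):
    """Convert a FileNet F_ENTRYDATE day count into a calendar date."""
    return FILENET_EPOCH + timedelta(days=f_entrydate)

def page_count(f_pages):
    """F_PAGES is NULL (None here) for single-page documents."""
    return 1 if f_pages is None else f_pages

print(entry_date_to_date(14610))  # -> 2010-01-01, matching the example above
print(page_count(None))           # -> 1
```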
After all the metadata had been retrieved, we generated Korn shell scripts to run on the FileNet server and download the FileNet content from the database to the filesystem. The scripts leveraged the FileNet system tools to store each page of a document as a separate TIFF file in a folder specific to that document. If we had been able to use OM, it would have taken care of creating these scripts, running them, and deleting them once they completed.
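For illustration, here is a rough Python sketch of the kind of script generation OM would otherwise have handled for us. The get_doc_pages command is a stand-in for whatever FileNet Image Services retrieval utility is available on the server (not an actual tool name), and the CSV column names are assumed to match the metadata extract described above.

```python
import csv

def write_export_script(metadata_csv, script_path, export_root):
    """Generate a Korn shell script that pulls each document's pages to disk.

    'get_doc_pages' is a placeholder for the site-specific FileNet retrieval
    tool; the doc numbers and page counts come from the metadata extract.
    """
    with open(metadata_csv, newline="") as meta, open(script_path, "w") as ksh:
        ksh.write("#!/bin/ksh\n")
        for row in csv.DictReader(meta):
            doc_id = row["F_DOCNUMBER"]
            pages = int(row["F_PAGES"] or 1)   # NULL F_PAGES means one page
            doc_dir = f"{export_root}/{doc_id}"
            ksh.write(f"mkdir -p {doc_dir}\n")
            # One TIFF per page, dropped into a folder named for the document
            ksh.write(f"get_doc_pages -doc {doc_id} -pages {pages} -out {doc_dir}\n")

write_export_script("docclass_metadata.csv", "export_docclass.ksh", "/export/filenet")
```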
Challenge Two: Annotations
Now that we had all the metadata and content extracted, we had one more hurdle to face: annotations. The last requirement was to extract all the annotations on each of the documents so that they could be maintained when moving into the new system, in their case OpenText. We had hoped that the annotation information, such as the text, width, height, angle, color, etc., would simply be stored in additional columns of the same Oracle database table that held the document records. However, we soon found that the annotations were stored separately, in their own table in FileNet’s proprietary MKF database. This was a setback, since it meant we couldn’t just update our original queries to pull the annotation information. Instead, we had to launch the MKF tool in order to query for the annotations.
We went ahead with using this tool, but found that most of the information we needed in order to reproduce an annotation in the new system was stored as a hex string that could be up to 800 characters long. It was definitely not an easy task, but after much trial and error we were able to crack the hex code enough to at least get the text of the annotation out of the string, which, after talking to the business, seemed to be enough. There wasn’t much of a need to replicate annotations such as arrows and highlights in the new system. Here’s an example of what one of the hex values in the annotation table looked like:
1f00104119f15c8f01d011a87a00a0246922a504000200010100040190023a0b0004014c00130d000101100006ff000000ffff2400010
123000700020002ff0000250006417269616c002600050c0100000003002b6e65656473207375627061636b657220666f722065787065
6469746520656c6220313033303031202000002b00020000
From here, we found that this long hex string could be decoded into key/length/value combinations. The first 4 hex characters signify the key, or field identifier; the next 2 characters, converted to decimal, give the length of the value in bytes; and the bytes that follow for that length, converted to ASCII, are the value of the field.
Our analysis did not take us deep enough to find out what all the fields in the hex string represent, but we did find that 0300 is the field identifier for the annotation’s text. The other field we know for sure is in the hex string is the name of the font used for the text. It seems that the font name and the annotation text are the only variable-length fields in the string. I would imagine the rest of the hex string somehow contains the positioning of the annotation, colors, etc., but it would take some further analysis to really confirm that.
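Here is a minimal Python sketch of that decoding, run against the sample hex string above. It assumes the length byte counts value bytes (two hex digits per byte), which is consistent with the sample; the only field identifier we are reasonably sure of is 0300 for the annotation text (the font field, 2500 in this sample, decodes to "Arial").

```python
def parse_annotation_hex(hex_string):
    """Walk a FileNet annotation hex string as key/length/value fields."""
    data = bytes.fromhex(hex_string)
    fields = {}
    pos = 0
    while pos + 3 <= len(data):
        key = data[pos:pos + 2].hex()          # 4 hex chars: field identifier
        length = data[pos + 2]                 # 1 byte: value length in bytes
        fields[key] = data[pos + 3:pos + 3 + length]
        pos += 3 + length
    return fields

# The sample annotation hex value shown above
sample = (
    "1f00104119f15c8f01d011a87a00a0246922a504000200010100040190023a0b0004014c00130d000101100006ff000000ffff2400010"
    "123000700020002ff0000250006417269616c002600050c0100000003002b6e65656473207375627061636b657220666f722065787065"
    "6469746520656c6220313033303031202000002b00020000"
)

fields = parse_annotation_hex(sample)
# 0300 holds the annotation text; strip the trailing NUL padding and spaces
print(fields["0300"].rstrip(b"\x00").decode("ascii").rstrip())
# -> "needs subpacker for expedite elb 103001"
```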
In the end, we delivered the content and metadata, plus annotations, which the business needed to get out of the stone age and into a more modern CMS. While we didn’t actually run OpenMigrate, we were able to leverage its “knowledge” to get the job done.
I have the same requirement to get content out of FileNet Image Services to a local file system. You mentioned that you used Korn shell scripts to download content from the database to the file system, and we have a lot of files (*.dat) to handle. Could you tell us how fast the content retrieval was? Thanks.
Hi Susan,
We’ve had varying degrees of success, ranging from roughly 1 doc per second, up to 10 docs per second. It all depends on the power of the hardware and the network infrastructure. The lag is never with the database, but always with getting the content off the platters.
We always run some benchmark migrations as early as possible to assist with the migration planning–that has proved quite valuable in the past. Not only can we get some timings down, but we can get the business users some of their images and metadata to look at in order to help define exactly what needs to be migrated and what can be bypassed.
Please drop me a line, tpierzina@tsgrp.com, if you’d like to discuss this further.
Is there a faster way to get annotations off the system?
Currently we are using this command to get a range of annotations:
select annotations * where doc_id >10,000,000 and doc_id < 20,000,000
Because of the '*', I believe it is doing a table scan.
Is there a way to use a range of annot_key?
Thanks