Documentum 6.5 Upgrade – Character Encoding Issues

Special Note: Anyone planning an upgrade from Documentum 5.3 to 6.5 should read this note closely, because some types of upgrades (clone or in-place) could leave content that was retrievable in 5.3 unavailable in 6.5.

This post is based on recent work for a major pharmaceutical client. The client, on Documentum 5.3, was developing a consumer interface application leveraging Lucene. As we mentioned in a previous post, the client chose Lucene over FAST based on benchmarking results for over 150,000 documents.

Background

For the application, the client was leveraging OpenMigrate with DFC 6.5 to retrieve content and metadata for nearly 1,000,000 documents from their 5.3 docbase to be indexed in Lucene. Per the product release notes, using DFC 6.5 to access a 5.3 repository is a supported configuration. An issue was identified when around 5,000 documents failed to migrate. The OpenMigrate error logs showed that the DFC call IDfSession.getObject(), used to retrieve documents from the repository, was failing. The stack trace made it apparent that the error was being thrown from within the DFC code itself. The error surprised the team because the same documents could be retrieved without a problem by client applications running a 5.3 DFC, such as Webtop and Samson. The DFC error messages that were encountered are shown below:

[DFC_OBJPROTO_BAD_NUMBER_FORMAT] Invalid number format for string length in serialized object

[DFC_OBJPROTO_BAD_STRING_FORMAT] Unknown string format in serialized object
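
When thousands of documents fail, it can help to tally the failures by DFC error code before digging into individual records. The sketch below is illustrative only: the class name DfcErrorTally is hypothetical, and it assumes each log line carries the bracketed code in the same form as the two messages above. The actual OpenMigrate log layout is not described in this post.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts log lines per bracketed DFC error code.
final class DfcErrorTally {
    // Matches codes such as [DFC_OBJPROTO_BAD_STRING_FORMAT] anywhere in a line.
    private static final Pattern CODE = Pattern.compile("\\[(DFC_[A-Z0-9_]+)\\]");

    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = CODE.matcher(line);
            if (m.find()) {
                // Count only the first code found on each line.
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "[DFC_OBJPROTO_BAD_NUMBER_FORMAT] Invalid number format for string length in serialized object",
            "[DFC_OBJPROTO_BAD_STRING_FORMAT] Unknown string format in serialized object");
        tally(sample).forEach((code, n) -> System.out.println(code + ": " + n));
    }
}
```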

After some further investigation, the team noticed a common trait among the documents that were failing to migrate: all of them contained metadata with special characters. After duplicating the error in a development environment, the team removed the special characters from the metadata, retried the migration, and the documents were retrieved successfully with DFC 6.5.

DFC 6.5 and Character Encoding

Upon review with Documentum support, it was noted that DFC 6.5 enforces character encoding more strictly than DFC 5.3. This explained why the documents could be retrieved successfully with 5.3 client applications but not with DFC 6.5. The team wondered how these documents were ever stored in the repository with invalid character encoding. Our best guess was:

The documents were moved into the repository as part of a migration effort that took place a long time ago. Most likely the loose enforcement of character encoding by legacy versions of the DFC was the culprit.
Users may have set metadata values on documents by copying and pasting from other applications, such as Microsoft Word, that may have used a different character encoding.
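
To make the copy-and-paste scenario concrete, here is a minimal sketch of how bytes written in one encoding can be rejected by a strict decoder for another. It rests on two assumptions that this post does not confirm: that the text is expected to be UTF-8, and that the pasted text arrived as Windows-1252, where Word-style curly quotes are the single bytes 0x93 and 0x94. The class name EncodingCheck is hypothetical, and the code uses only the standard Java library, not the DFC.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation of raw bytes.
final class EncodingCheck {
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed Windows-1252 bytes for a title wrapped in curly quotes.
        byte[] pasted = {(byte) 0x93, 'D', 'r', 'a', 'f', 't', (byte) 0x94};
        // The same title encoded as UTF-8.
        byte[] utf8 = "\u201CDraft\u201D".getBytes(StandardCharsets.UTF_8);

        System.out.println("pasted bytes valid UTF-8: " + isValidUtf8(pasted));
        System.out.println("UTF-8 bytes valid UTF-8:  " + isValidUtf8(utf8));
    }
}
```

A lenient reader that replaces or passes through bad bytes would return something for both inputs, while a strict one fails on the first. That is consistent with, though not proof of, the behavior difference support described between DFC 5.3 and 6.5.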

Since the client wasn’t upgrading, only indexing the content in Lucene, the client decided to swap out the DFC 6.5 that OpenMigrate was using for DFC 5.3 in order to complete the migration. Unfortunately, using DFC 5.3 requires a more invasive installation process that the client was trying to avoid. The issues with the 5,000+ affected documents will be addressed when the client upgrades to 6.5.

Character Encoding and Effect on Upgrade

This particular client was fortunate enough to be able to “test drive” DFC 6.5 with their process that indexed to Lucene. This migration uncovered an issue that would have been significantly more serious had the client upgraded their entire Documentum system to 6.5. Had the upgrade been completed, users would not have been able to access these documents with the upgraded Documentum client tools such as Webtop, or any other custom applications utilizing the DFC. Since the number of documents with the character encoding problem is relatively small in relation to the total number of documents in the system, they might have gone unnoticed during testing. Because of the migration, the client is now able to come up with a proactive plan to rectify the issue prior to their full Documentum 6.5 upgrade.

Possible Resolutions

To identify issues with existing data, such as the character encoding problem described above, prior to an upgrade, TSG recommends several alternatives:

Consider leveraging OpenMigrate or a similar application to “scan” your data with DFC 6.5 to determine if any encoding errors exist prior to the upgrade.
During the upgrade, use OpenMigrate to migrate data into a clean repository instead of performing a typical in-place upgrade or dump and load. Migrations are a great opportunity to “scrub” and validate existing data. Because every document is touched during a migration, corrupt data can be more easily identified.
Utilize database tools to help identify potential problems. Oracle has a Character Set Scanner Utility that can scan an entire database to verify that all data stored in the database uses the correct character encoding.
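
The "scan" idea in the first option boils down to attempting to retrieve every object and recording which ones fail, rather than stopping at the first error. The sketch below shows that loop only; it is not how OpenMigrate is implemented. RetrievalScan and its Fetcher interface are hypothetical names, and the stub fetcher stands in for a real call such as IDfSession.getObject(), since the DFC is not on the classpath here. The object ids are placeholders.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical scan loop: try every id, keep going on failure, report at the end.
final class RetrievalScan {
    /** Hypothetical abstraction over whatever fetches a single object by id. */
    interface Fetcher {
        void fetch(String objectId) throws Exception;
    }

    /** Returns failed ids mapped to their error messages, in scan order. */
    static Map<String, String> scan(List<String> objectIds, Fetcher fetcher) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String id : objectIds) {
            try {
                fetcher.fetch(id);
            } catch (Exception e) {
                // Record the failure and continue with the next object.
                failures.put(id, String.valueOf(e.getMessage()));
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Stub fetcher: in a real scan this would wrap the DFC 6.5 retrieval call.
        Fetcher stub = id -> {
            if (id.equals("doc-2")) {
                throw new Exception("[DFC_OBJPROTO_BAD_STRING_FORMAT] Unknown string format in serialized object");
            }
        };
        Map<String, String> failures = scan(Arrays.asList("doc-1", "doc-2", "doc-3"), stub);
        failures.forEach((id, msg) -> System.out.println(id + " -> " + msg));
    }
}
```

Collecting failures instead of aborting matters here because the affected documents were a small fraction of the repository; the resulting list gives a concrete set of ids to clean up before the upgrade.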

Check out TSG’s free Documentum upgrade planning guide for additional upgrade tips.

Comments

Sorin Marinescu says

December 15, 2010 at 10:05 am

Hello,

You suggest using OpenMigrate to “scan” the 5.3 repository with DFC 6.5 to determine if any encoding errors exist prior to the upgrade.

I’ve been reading the documentation but I haven’t found a way to do that…

Could you please offer more information about this?

Thanks,
Sorin.

TParz says

December 15, 2010 at 10:13 pm

Sorin,

We’ve actually recently packaged a pre-configured version of OpenMigrate to perform the metadata validation with DFC 6.5. It can be downloaded from TSG’s download site. You’ll find a link to download the Documentum Metadata Validator at the bottom of the page in the Useful Tools section.

Tony

Sorin Marinescu says

December 16, 2010 at 2:52 am

Hello Tony,

Thank you for your prompt answer.
I didn’t notice the Useful Tools section, you guys are great 🙂

Regards,
Sorin
