This week we are urgently reminding clients, as part of their upgrade evaluation, to look seriously for character encoding issues in their current Documentum content and the effect those issues can have on upgrades.
This is an update to the original article that was written in August. While the post highlighted character encoding issues and DFC 6.5, we are not quite sure readers fully realized the impact to their upgrade efforts.
The Scenario
There are two scenarios that could result in bad characters in the docbase:
- Over time, users will “cut and paste” from Word or other applications into Webtop fields, custom applications, or other Documentum interfaces. Within the browser (Internet Explorer, Firefox, Netscape for the old-timers) the character string will look fine, but in reality the field could contain “special characters” that end up being passed through to the database.
- Migration efforts from previous upgrades/consolidations resulted in character encoding issues that were not identified.
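To illustrate the copy-and-paste scenario above, here is a minimal sketch (not from the original article) of how a Word “smart quote”, stored as a Windows-1252 byte, becomes an invalid byte sequence once a UTF-8 consumer tries to read it:

```python
# A Word-style curly apostrophe (U+2019) encoded as Windows-1252.
smart_quote = "it\u2019s"
cp1252_bytes = smart_quote.encode("cp1252")  # b'it\x92s'

# The lone byte 0x92 is not valid UTF-8, so a UTF-8 reader fails.
try:
    cp1252_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(cp1252_bytes)  # b'it\x92s'
print(valid_utf8)    # False
```

The value looks perfectly normal in the browser that produced it; the problem only surfaces when a later component (such as a newer DFC) insists on a stricter encoding.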
Documentum, before version 6.5, allowed storage and retrieval of these characters without an error. As noted in the previous post, version 6.5 of the DFC does not support these formats and will throw errors on retrieval such as:
[DFC_OBJPROTO_BAD_NUMBER_FORMAT] Invalid number format for string length in serialized object
[DFC_OBJPROTO_BAD_STRING_FORMAT] Unknown string format in serialized object
Why is this a Big Deal?
The potentially critical issues for clients would be:
- An upgrade to 6.5/6.6 (either Migration, DB Clone or Upgrade in Place) that leaves these characters in the database.
- Any 6.5 interface (Webtop, xCP) throws an error when it tries to retrieve content with character encoding issues.
- xPlore will index (but very slowly) any content with Character Encoding Issues.
The tough part is the classic garbage in/garbage out problem: the metadata should be cleaned up before either the upgrade or the adoption of DFC 6.5 or 6.6.
We should point out that we have only seen this issue on Oracle. We can neither confirm nor rule out that SQL Server clients would have the same issues.
Possible Resolution
Consistent with the previous post – we recommend the following:
- Consider leveraging OpenMigrate or a similar application to “scan” your data with DFC 6.5 to determine whether any encoding errors exist prior to the upgrade. DFC 6.5 is compatible with 5.3.
- During the upgrade, use OpenMigrate to migrate data into a clean repository instead of performing a typical in-place upgrade or dump and load. Migrations are a great opportunity to “scrub” and validate existing data. Because every document is touched during a migration, corrupt data can be more easily identified. We are working on adding a character encoding check for typical errors.
- Utilize database tools to help identify potential problems. Oracle has a Character Set Scanner Utility (CSSCAN) that can scan an entire database to verify that all data stored in the database use the correct character encoding.
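A minimal sketch of the kind of pre-upgrade scan the steps above describe. This is a hypothetical helper, not OpenMigrate or CSSCAN: it assumes the raw attribute values have already been pulled out of the database as byte strings, and simply flags any that do not decode in the target encoding:

```python
def find_bad_values(values, encoding="utf-8"):
    """Return (index, value) pairs whose raw bytes are not valid
    in the target encoding. `values` is a list of byte strings as
    exported from the repository's database."""
    bad = []
    for i, raw in enumerate(values):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad.append((i, raw))
    return bad

# Example: one clean value and one carrying a stray Windows-1252 byte.
rows = [b"Quarterly Report", b"John\x92s Draft"]
print(find_bad_values(rows))  # [(1, b"John\x92s Draft")]
```

The point of running a check like this before the upgrade is that remediation can happen while the 5.3 environment still reads the values without error.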
As one last push, we are reaching out to Documentum to ask the simple question: “Hey, why not return the string with the bad character encoding rather than throwing the error, consistent with what pre-6.5 versions of the DFC did?” Given that the DFC is eventually giving way to DFS, it is worth asking.
Please comment below with any thoughts….
Paras Jethwani says
Hi,
Can you give some examples of ‘special characters’?
What locales does this issue impact? English only, or international as well?
– Paras
Chris3192 says
Hi Paras.
The characters that typically cause the issue are not usually ones that are easily seen by eye. I believe this could occur with international character sets as well; I did see it with some Chinese characters at a client. Basically, anywhere you might be converting from one character encoding to another could produce bad data.
We find that many of the characters end up coming from fields where users have copied and pasted data from another application into the Documentum attribute. This is why it’s important to do a very thorough test of the migration data, including retrieving it through a client app to view it.
For one client, we ran migrated data through a routine that checked that the characters were valid ASCII characters for the target system. This did take a while, and the client then had to remediate any documents and metadata that were deemed “invalid” and process them manually.
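For readers curious what that kind of check looks like, here is a minimal sketch (hypothetical, not the client's actual routine) that flags any attribute value containing characters outside the printable ASCII range:

```python
def is_clean_ascii(value: str) -> bool:
    """True only if every character is printable ASCII (0x20-0x7E)."""
    return all(32 <= ord(ch) <= 126 for ch in value)

print(is_clean_ascii("Annual Report 2010"))  # True
print(is_clean_ascii("Annual\u2019Report"))  # False (curly apostrophe)
```

A real system would likely allow a broader character set than plain ASCII, but the idea is the same: define exactly what the target system accepts and flag everything else for remediation.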
Thank you for your comment and please let us know if you found this helpful.