Unicode Database Conversion And Usage

August 19, 2008

Introduction

Long-time ClarifyCRM users may find themselves in a situation driven by their business international expansion which calls for ability to handle client data in languages other than English and character set other than ASCII. If the database was not originally planed to store non-ASCII data, it needs to be ‘converted’ to be able to use Unicode.

The conversion may involve entire database, or only selected columns, depending on the needs.

Once the decision is made to store Unicode characters in existing non-Unicode Clarify database, there are several conversion scenarios to consider, depending on:

database system: MS SQL or Oracle
whether all the information stored must be in Unicode or only selected data items.

Dovetail Software, Inc. researched four scenarios how to make a database which was not originally intended to store Unicode data capable of just that, and allow manipulation by both Clarify Classic Client and Dovetail Software Agent.

Terms

Database Conversion means the process of modifying a database such that Unicode characters can be properly handled by Clarify and Dovetail applications that use that database for all string columns.

Partial Database Conversion means the process of modifying a database such that Unicode characters can be properly handled by Clarify and Dovetail applications that use that database for selected string columns.

Clarify documentation describes only entire database conversion process and does not mention any possibility of partial conversion or its adverse impact. Because entire database conversion is going to result in significant size increase of the database, partial conversion scenarios may be of interest to some users.

Dovetail Software applications are fully capable of handling Unicode databases, however, Dovetail Software doesn’t currently offer any product that performs the process of converting a database to Unicode.

Summary Of Results

	ClarifyCRM v. 11.1 or higher	Dovetail Software
supports Unicode data before database conversion	no	MS SQL – no Oracle – yes (*)
supports Unicode data after entire database conversion	yes	yes
supports Unicode data after partial database conversion	yes	yes
ability to store/display Unicode via GUI	via multiline textbox fields only (**)	yes
relies on Clarify Unicode flag	yes	yes

(*) Need to be aware of issues with multibyte characters in regards to defined column sizes. As long as the data will fit in the column, all is well, see Oracle Database Concepts, explanation of VARCHAR2 datatype dependencies.

(**) During our testing, Unicode data only seemed to work properly if the UI control was a multiline textbox. Unicode data in simple textboxes did not work properly. Further research on this issue is suggested.

Oracle Database Concepts

Oracle databases support Unicode by default. Therefore, in this article we understand ‘Oracle non-Unicode database’ in Clarify terms, not Oracle terms.

Oracle supports a reliable Unicode datatype through NCHAR, NVARCHAR2, and NCLOB. The character sets used by NCHAR/NVARCHAR2 can be either UTF8 or AL16UTF16, depending on the setting of the national character set when the database is created. These datatypes allow character data in Unicode to be stored in a database that may or may not use Unicode as database character set. The NCHAR datatype stores fixed-length character strings that correspond to the national character set. The NVARCHAR2 datatype stores variable length character strings. When you create a table with an NCHAR or NVARCHAR2 column, the maximum size specified is always in character length semantics which is the default and only length semantics for NCHAR or NVARCHAR2.

VARCHAR2 and VARCHAR Datatypes: the VARCHAR2 datatype stores variable-length character strings. When you create a table with a VARCHAR2 column, you specify a maximum string length (in bytes or characters) between 1 and 4000 bytes for the VARCHAR2 column. For each row, Oracle stores each value in the column as a variable-length field unless a value exceeds the column’s maximum length, in which case Oracle returns an error. Using VARCHAR2 and VARCHAR saves on space used by the table. Columns of both types can store Unicode characters, you need to be careful when specifying the size if you use byte length semantics to make sure there is enough room for expected data.

CLOB and NCLOB datatypes store single-byte and multibyte character strings of up to (4 gigabytes – 1) * (the value obtained from DBMS_LOB.GETCHUNKSIZE). Columns of both types can store Unicode characters.

LONG datatype stores variable-length character strings containing up to two gigabytes, but with many restrictions. This datatype is provided only for backward compatibility with existing applications. In general, new applications should use CLOB and NCLOB datatypes to store large amounts of character data, and BLOB and BFILE to store large amounts of binary data.

As a summary: CHAR, VARCHAR2, CLOB, and LONG columns can also store Unicode text. If the VARCHAR2 column is defined in character length semantics then it can store Unicode text up to the length specified. If the VARCHAR2 column is defined in byte length semantics then the length of the Unicode text it can handle (in Unicode characters) varies.

Database Character Set: An Oracle database has a database character set defined either at the installation time or as UTF-8 by default. This database character set can be changed after the database has been created. In most cases, a full export/import is needed to properly convert all data to the new character set. However, if, and only if, the new character set is a strict superset of all of the schema data, then it is possible to use the CSALTER script to expedite the change in the database character set. This article does not elaborate on the use of CSALTER script, refer to Oracle documentation for details.

Character Set Conversion Between Clients and the Server: If the database character set is different from the character set on the client operating system, then the Oracle database can convert the operating system character set to the database character set. Character set conversion increases the overhead and can sometimes cause data loss. Any characters that are not available in target character set are converted to a replacement character. The replacement character is often specified as a question mark or as a linguistically related character. The best way to prevent character conversion problems is to set a character set on Client system the same as the database character set. You set up the character set on Client MS Windows system by updating HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\KEY_OraClient10g_home1\NLS_LANG registry key. For example, AMERICAN_AMERICA.AL32UTF8 value defines AL32UTF8 as Client system character set.

The best way to minimize problems with future conversions of a database to Unicode is to make it the most Unicode-ready by setting these values at the database creation time:

NLS_CHARACTERSET as AL32UTF8
NLS_NCHAR_CHARACTERSET as AL16UTF16

These are the most recent versions of Unicode character set and should also be used on Client systems accessing the database. Oracle Database 10g Release 2 supports the Unicode Standard, Version 4.0.

Note: All Oracle conversion scenarios described in this article have been performed on Oracle Database 10g Express Edition Release 10.2.0.1.0 database, ClarifyCRM e|FrontOffice 12.5, Schema Revision 60.

MS SQL Database Concepts

In MS SQL Server, only NCHAR, NVARCHAR, and NTEXT data types support Unicode data. Use of NCHAR, NVARCHAR, and NTEXT is the same as CHAR, VARCHAR, and TEXT, respectively, except that:

– Unicode supports a wider range of characters.

– More space is needed to store Unicode characters.

– The maximum size of NCHAR columns is 4,000 characters, not 8,000 characters like CHAR and VARCHAR.

– The maximum size for NVARCHAR columns, using the max specifier, is 2^31-1 bytes.

– NTEXT columns store variable-length Unicode data with a maximum length of (2^30–2)/2 (536,870,911) characters. Storage size, in bytes, is two times the number of characters entered. NTEXT is no longer supported in string functions.

– All Unicode data uses the character set defined by the Unicode standard. Unicode collations used for Unicode columns are based on attributes such as case sensitivity, accent sensitivity, Kana sensitivity, width sensitivity, and binary.

SQL Server 2005 supports the Unicode Standard, Version 3.2.

Unicode data types store each character with two bytes instead of one byte. SQL Server column limit is 8000 bytes for both Non-Unicode and Unicode column. The maximum size of a Unicode column is 4000 characters. When migrating to Unicode, all column data beyond 4000 characters will be truncated. Column data should be scanned to prevent data loss. If a Unicode column needs to hold more than 4000 characters, the column should be converted to NVARCHAR(max). Using NVARCHAR(max) also prevents problems related to SQL Server page overflow – the server will automatically allow page overflow in case of a record length greater than 8060 bytes, serious performance factor.

Note: All MS SQL conversion scenarios described in this article have been performed on MS SQL Server 2005 database, ClarifyCRM e|FrontOffice 12.5, Schema Revision 60.

ClarifyCRM Concepts

Database Flags

A column named ‘flags’ of type INT (on MS SQL Server) or NUMBER (on Oracle) in ADP_DB_HEADER table:

bit number from right to left	hexadecimal mask	decimal mask	meaning
1	0x0001	1	case insensitive (‘searchable’ columns added for certain columns with names starting with S_ )
2	0x0002	2	password encrypted
3	0x0004	4	DWE enabled
4	0x0008	8	secure query enabled
5	0x0010	16	reserved
6	0x0020	32	Traveler enabled
7	0x0040	64	Unicode database

Unicode Flag

The 7th bit from right in ADP_DB_HEADER.FLAGS column. If set to 1 it indicates that this database is Unicode-enabled in ClarifyCRM terms. Both ClarifyCRM and Dovetail Software products rely on this flag when handling Unicode data.

The Unicode Flag is the only bit in ADP_DB_HEADER.FLAGS of concern for the purpose of this article.

Entire Database Conversion To Unicode

MS SQL Database

The goal: convert entire MS SQL non-Unicode database to a Unicode-enabled one and allow both Clarify Classic Client and Dovetail Agent to store/retrieve Unicode text.

Steps to perform:

Use Clarify dataex utility to export database schema:

dataex -gen_sc dbname.sch

Use Clarify ddcomp utility to upgrade and convert the database (note that there will be no actual upgrade unless the schema file changes):

ddcomp dbname servername userid password dbname.sch -unicode -upgrade -sqllog

the run should finish with the following message: “*** Schema Upgrade SUCCEEDED! ***”

Verify successful conversion to Unicode database:

Run SQL statement:

SELECT flags FROM adp_db_header

Expected result:

The Unicode Flag is set to 1

Examine table column types: former char, varchar, and text columns should now have types of nchar, nvarchar, and ntext.

Prepare Clarify Classic Client:

Run SQL statement: select * from table_locale
Find the row corresponding to the target locale. For Japanese text look for iso_cntry = ‘JP’, note the value of win32_lcid column (should be equal to 1041).
Delete Clarify Classic Client cache files.
Update Clarify Classic Client command line parameters: add -lcid 1041 (1041 should be the value of table_locale.win32_lcid column for Japanese)

Execute Verification Test for both Clarify Classic Client and Dovetail Agent.

Oracle Database

The goal: convert entire Oracle non-Unicode database to a Unicode-enabled one and allow both Clarify Classic Client and Dovetail Agent to store/retrieve Unicode text.

Initial verification:

Run SQL statement:

SELECT * from NLS_DATABASE_PARAMETERS;

Expected result:

NLS_LANGUAGE = AMERICAN

NLS_TERRITORY = AMERICA

NLS_CHARACTERSET and NLS_NCHAR_CHARACTERSET values are not set.

Run SQL statement:

SELECT flags FROM adp_db_header

Expected result:

The Unicode Flag is set to 0

Steps to perform:

Use Clarify dataex utility to export database schema:

dataex -gen_sc dbname.sch

Use Clarify ddcomp utility to upgrade and convert the database (note that there will be no actual upgrade unless the schema file changes):

ddcomp dbname servername userid password dbname.sch -unicode -upgrade -allow_drops -sqllog

the run should finish with the following message: “*** Schema Upgrade SUCCEEDED! ***”

Verify successful conversion to Unicode database:

Run SQL statement:

SELECT flags FROM adp_db_header

Expected result:

The Unicode Flag is set to 1

Run SQL statement:

SELECT * from NLS_DATABASE_PARAMETERS;

Expected result:

NLS_LANGUAGE = AMERICAN

NLS_TERRITORY = AMERICA

NLS_CHARACTERSET = AL32UTF8

NLS_NCHAR_CHARACTERSET = AL16UTF1

Examine table column types: VARCHAR2 columns should now have their sizes increased threefold, for example table_act_entry.addnl_info column size increases from 255 to 765.

Change NLS_LANG for Client Windows:

Update Windows registry key:

HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\KEY_OraClient10g_home1

Change NLS_LANG entry to AMERICAN_AMERICA.AL32UTF8

Prepare Clarify Classic Client:

Run SQL statement: select * from table_locale
Find the row corresponding to the target locale. For Japanese text look for iso_cntry = ‘JP’, note the value of win32_lcid column (should be equal to 1041).
Delete Clarify Classic Client cache files.
Update Clarify Classic Client command line parameters: add -lcid 1041 (1041 should be the value of table_locale.win32_lcid column for Japanese)

Execute Verification Test for both Clarify Classic Client and Dovetail Agent.

What Happens During Conversion Of Entire Database To Unicode

To convert entire database to Unicode, Clarify tool ddcomp does the following:

for every user table (with name starting with table_):

– creates a temporary table with text columns enlarged (i.e. VARCHAR2 for Oracle) or text columns types changed (i.e. VARCHAR to NVARCHAR for MS SQL, the declared length remains unchanged),

– grants access privileges on the newly created table,

– copies the data from original table to the temporary one,

– drops the original table,

– renames the temporary table to original name,

– recreates all indexes defined for the table,

– drops all the views referring to the table,

– creates all the views referring to the table,

– drops all the csp_get_name stored procedures referring to the table or a view,

– creates all the csp_get_name stored procedures referring to the table or a view

for adp tables

– updates adp_db_header with new timestamp (last_modify, clarify_schema_date, and cust_schema_date fields),

– updates adp_db_header: the Unicode Flag is set to 1,

– recreates adp_sch_index entry for every index,

– recreates adp_tbl_name_map entry for every table,

– recreates adp_sch_info entry for every table column (ARRAY_SIZE value remains unchanged),

– recreates adp_sch_rel_info entry for every table relation,

– adds new entry to adp_db_comments table with comment = ‘Upgraded database to Clarify schema revision 60, customer revision ‘

for internal DBMS tables (Oracle only)

sets NLS_DATABASE_PARAMETERS.NLS_CHARACTERSET and NLS_DATABASE_PARAMETERS.NLS_NCHAR_CHARACTERSET fields.

Partial Database Conversion To Unicode

Clarify does not support partial database conversion process through its products. Such possibility is not even mentioned anywhere in Clarify literature.

Dovetail Software has researched Partial Unicode Database Conversion scenarios and executed them successfully in test environment. The key to success lies in proper determination of all the data elements that need to be Unicode-enabled. For example, if the goal is to have all the notes Unicode-enabled the user must determine the list of all table columns which must be converted.

Partial conversion scenarios described here are based on the following assumptions:

– tables are not recreated,

– table columns types are redefined,

– indexes are not recreated,

– views are not recreated,

– stored procedures are not recreated,

– adp tables are not recreated,

– adp_db_header.flags field is updated with Unicode Flag set to 1,

– no Clarify schema changes are needed, the values of all CMN_TYPE, DB_TYPE, and ARRAY_SIZE columns for all adp_sch_info rows remain unchanged,

– no export/import of entire database or even affected tables is required,

– Oracle CSALTER script is not used,

– no Clarify tools (i.e. ddcomp) are used,

– no verification has been made whether the database after partial conversion remains compatible with other ClarifyCRM components, like: Rule Manager, DDEditor, ddcomp, dataex.

MS SQL Database

The goal: convert selected columns in MS SQL non-Unicode database and allow both Clarify Classic Client and Dovetail Agent to store/retrieve Unicode text in these columns.

Initial verification:

Run SQL statement:

SELECT flags FROM adp_db_header

Expected result:

The Unicode Flag is set to 0

Steps to perform:

Change column types:

For each columns being converted execute SQL statement:

ALTER TABLE table_name ALTER COLUMN column_name new_type

where new_type =

NVARCHAR(max) when converting from TEXT

NVARCHAR(n) when converting from VARCHAR(n)

NCHAR(n) when converting from CHAR(n)

where n = original column width

For each column being converted with original type of TEXT also execute SQL statement:

ALTER TABLE table_name ALTER COLUMN column_name NTEXT

(a column would not directly convert from TEXT to NTEXT, must go through intermediate state of NVARCHAR(max))

Set Unicode Flag

Run SQL statement:

UPDATE adp_db_header SET flags = (SELECT flags + 64 FROM adp_db_header)

Convert table_source.code_chunk column to NVARCHAR(max)

Run SQL statement:

ALTER TABLE table_source ALTER COLUMN code_chunk NVARCHAR(max)

Without this conversion Clarify Client would not start.

Prepare Clarify Classic Client:

Run SQL statement: select * from table_locale
Find the row corresponding to the target locale. For Japanese text look for iso_cntry = ‘JP’, note the value of win32_lcid column (should be equal to 1041).
Delete Clarify Classic Client cache files.
Update Clarify Classic Client command line parameters: add -lcid 1041 (1041 should be the value of table_locale.win32_lcid column for Japanese)

Execute Verification Test for both Clarify Classic Client and Dovetail Agent.

Oracle Database

The goal: convert selected columns in Oracle non-Unicode database and allow both Clarify Classic Client and Dovetail Agent to store/retrieve Unicode text in these columns.

Initial verification:

Run SQL statement:

SELECT * from NLS_DATABASE_PARAMETERS;

Expected result:

NLS_LANGUAGE = AMERICAN

NLS_TERRITORY = AMERICA

NLS_CHARACTERSET and NLS_NCHAR_CHARACTERSET values are not set.

Run SQL statement:

SELECT flags FROM adp_db_header

Expected result:

The Unicode Flag is set to 0

Steps to perform:

Change column types:

For each columns being converted execute SQL statement:

ALTER TABLE table_name MODIFY (column_name VARCHAR2(n))

where n = 3 * ARRAY_SIZE

Set Unicode Flag

Run SQL statement:

UPDATE adp_db_header SET flags = (SELECT flags + 64 FROM adp_db_header)

Change NLS_LANG for Client Windows:

Update Windows registry key:

HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\KEY_OraClient10g_home1

Change NLS_LANG entry to AMERICAN_AMERICA.AL32UTF8

Prepare Clarify Classic Client:

Run SQL statement: select * from table_locale
Find the row corresponding to the target locale. For Japanese text look for iso_cntry = ‘JP’, note the value of win32_lcid column (should be equal to 1041).
Delete Clarify Classic Client cache files.
Update Clarify Classic Client command line parameters: add -lcid 1041 (1041 should be the value of table_locale.win32_lcid column for Japanese)

Execute Verification Test for both Clarify Classic Client and Dovetail Agent.

Verification Test

After each scenario is executed, the following test is performed to verify successful conversion (use Japanese text sample which may be obtained here: http://www.lorem-ipsum.info/generator3 ):

Start Clarify Classic Client

Test Clarify application:

Open a screen which accesses the column(s) modified, in edit mode,
Paste a sample of Japanese text into these fields. It should show Japanese characters. Save the data.
Open the same screen again. These fields should show the text in Japanese.

Prepare Dovetail Agent:

Unload the Dovetail Agent web page in IIS.

Start Dovetail Agent

Test Dovetail Agent:

Open a screen which accesses the column(s) modified, in edit mode,
Paste a sample of Japanese text into these fields. It should show Japanese characters. Save the data.
Open the same screen again. These fields should show the text in Japanese.

Test inter-application compatibility:

Save a sample of Japanese text into the modified fields in Clarify Client.
Go to Dovetail Agent, access the same information. It should show Japanese characters.
Save a sample of Japanese text into the modified fields in Dovetail Agent.
Go to Clarify Client, access the same information. It should show Japanese characters.

References

For more information on ClarifyCRM, MS SQL Server, and Oracle refer to the following documentation:

Clarify Upgrade Guide
Clarify System Administration Guide
MS SQL Server Database Engine Guide
MSDN Articles on MS SQL Server
Oracle Database Concepts Guide
Oracle Database Application Developer’s Guide
Oracle Globalization Support Guide
Oracle Unicode Support Guide

Introduction

Terms

Summary Of Results

Oracle Database Concepts

MS SQL Database Concepts

ClarifyCRM Concepts

Database Flags

Unicode Flag

Entire Database Conversion To Unicode

MS SQL Database

Oracle Database

What Happens During Conversion Of Entire Database To Unicode

Partial Database Conversion To Unicode

MS SQL Database

Oracle Database

Verification Test

References

Related

Dovetail

HR SOLUTIONS

CLARIFY SOLUTIONS

Unicode Database Conversion And Usage

Introduction

Terms

Summary Of Results

Oracle Database Concepts

MS SQL Database Concepts

ClarifyCRM Concepts

Database Flags

Unicode Flag

Entire Database Conversion To Unicode

MS SQL Database

Oracle Database

What Happens During Conversion Of Entire Database To Unicode

Partial Database Conversion To Unicode

MS SQL Database

Oracle Database

Verification Test

References

Share this:

Related