User experience: migrating databases to Unicode

User experience: migrating databases to Unicode • May 2010 • Marek Läll • Swedbank Baltics [email protected] • +372 50 250 27 What is Unicode? ...
Author: Sibyl Lambert
1 downloads 2 Views 487KB Size
User experience: migrating databases to Unicode



May 2010

• Marek Läll • Swedbank Baltics [email protected] • +372 50 250 27

What is Unicode? • Unicode is a universal encoded character set that allows you to store information from any language. Unicode defines properties for each character, standardizes script behavior, provides a standard algorithm for bi-directional text, and defines cross-mappings to other standards. Unicode provides a unique code value for every character, regardless the platform, program or language.

© Swedbank

Motivation • • • • •

1998 – Bank was opened in Lithuania 2000 – Bank was bought in Latvia 2002 – Bank was opened in Moscow Data (customers) need group level consolidation Web applications (internet bank) supports Unicode by default • Java supports Universal Character Set (UCS-2) natively

© Swedbank

Motivation - Non ASCII letters EE:

ŠŽÕÄÖÜ ŠŽ

LV: LT: RU:

© Swedbank

ŠŽ

Unicode implementations • Common Unicode encoding formats are: – UTF-8 (variable width, 8 bit, 1, 2, 3 , 4, 5 or 6 bytes long) – UCS-2 (fixed width, 16 bit, 2 bytes long) – UTF-16 (variable width, 16 bit, 2 or 4 bytes long)

• Conversion between them is algorithm based (simple bitwise operation), an expensive mapping table is unnecessary

© Swedbank

Unicode Character Sets In The Oracle Database [Metalink note: 260893.1]

Characterset Name AL24UTFFSS UTF8 UTFE

RDBMS Unicode version version 7.2-8.1 1.1 8.0-11g 2.1 (8.0-8.1.6) , 3.0 (8.1.7-11g) 8.0-11g 2.1 (8.0-8.1.6) , 3.0 (8.1.7-11g)

AL32UTF8*

9.0-11g

3.0 (9.0), 3.1 (9.2), 3.2 (10.1), 4.01 (10.2), 5.0 (11.1)

AL16UTF16**

9.0-11g

3.0 (9.0), 3.1 (9.2), 3.2 (10.1), 4.01 (10.2), 5.0 (11.1)

* cannot be used as NLS_NCHAR_CHARACTERSET , can only be used as NLS_CHARACTERSET ** can only be used as NLS_NCHAR_CHARACTERSET, cannot be used as NLS_CHARACTERSET

© Swedbank

Expectations (around year ~2002) • • • • •

Use latest available database version: 8.1 Transparent migration / no need to changes apps Just create new UTF8 database Export / import structure / data Use it!

• But … it did not fly. Why?

© Swedbank

Configuration change planned • Old – Windows encoding: cp1257 – Oracle Forms/Client encoding: NEE8ISO8859P4 – Database encoding: NEE8ISO8859P4

• New – Windows encoding: cp1257 – Oracle Forms/Client encoding: BLT8MSWIN1257 – Database encoding: UTF8

• NB! If client and database encoding is the same then CHAR/VARCHAR2 columns can be used as raw/binary columns. There is no any checking or conversion of character/byte values © Swedbank

Important facts • Version 8.x – VARCHAR2(1) – means 1 BYTE! It is not in characters

• Version 9.0 – New init.ora parameter: nls_length_semantics=BYTE | CHAR – New syntax: VARCHAR2( [BYTE | CHAR]) – 3 levels to set semantics: database / init.ora parameter; alter session; object level: create table ... varchar2(5 char) – column DATA_LENGTH in view *_tab_columns shows length in BYTES! (opinion: major conceptual mistake) – new columns in *_tab_columns view: CHAR_LENGTH, CHAR_USED – Limitation: VARCHAR2(4000 CHAR) – max limit is 4000 bytes © Swedbank

Issues discovered • several applications went crazy - they saw that all varchar2 columns are 3 times longer because DATA_LENGTH is in bytes • chr() function does not work. Functions unistr(‘‘) or convert() are alternatives • Make sure you do not store "binary" data in character datatypes (CHAR,VARCHAR2,CLOB). • Export file includes source db nls_length_semantics value. Default is BYTE. Not found a way to override it.

© Swedbank

How did we fix wrong encoding • Old db encoding was nee8iso8859p4 but characters inside were actually in blt8mswin1257 encoding • Upgrade was done using exp / imp – Exp | | imp

• 3rd byte in export file was replaced: – Value 179 – represented encoding blt8mswin1257 – Value 34 – represented encoding nee8iso8859p4

© Swedbank

Conclusions / Suggestions • Create database; encoding AL32UTF8 (or UTF8); one char can take up to 4 bytes (or 3 bytes) • Init.ora: nls_length_semantics=CHAR • 4000 byte limit: varchar2( 4000 CHAR) column theoretical minimum is only 1000 characters • There are ways to enter byte sequence which is not valid Unicode character

© Swedbank

Tools / functions • Csscan/Csalter - utility for checking data before migrating character sets • Function: convert( string1, char_set_to, [char_set_from]) • length(), lengthb(), lengthc(); instr*(); substr*() • select dump(owner, 1016) from all_objects … Typ=1 Len=3 CharacterSet=AL32UTF8: 53,59,53 • SQL*Plus> desc all_objects Name Null? Type -------------- ---------------- ---------------------------OWNER NOT NULL VARCHAR2(30 BYTE) © Swedbank

References • ID 788156.1: AL32UTF8 / UTF8 (Unicode) Database Character Set Implications • ID 144808.1: Examples and limits of BYTE and CHAR semantics usage (NLS_LENGTH_SEMANTICS) • ID 179133.1: The correct NLS_LANG in a Windows Environment • http://www.oracle.com/technology/tech/globalization/pdf/T WP_AppDev_Unicode_10gR2.pdf • http://www.unicode.org/standard/standard.html • http://en.wikipedia.org/wiki/Russian_alphabet © Swedbank

Marek Läll [email protected] +37 250 250 27 © Swedbank

Standards / UCS • ISO 10646 defines the Universal Character Set (UCS) http://en.wikipedia.org/wiki/Universal_Character_Set • UCS is a superset of all other character set standards • UCS-2 and UCS-4 are mostly used • PDF copy of ISO 10646:2003 http://standards.iso.org/ittf/PubliclyAvailableStandards/c03 9921_ISO_IEC_10646_2003(E).zip (~80Mb)

© Swedbank

Standards / UTF-8 • UTF-8 : UTF stands for UCS Transformation Format • http://en.wikipedia.org/wiki/UTF-8 • Mapping between UCS and UTF-8 UCS U-00000000 – U-0000007F: U-00000080 – U-000007FF: U-00000800 – U-0000FFFF: U-00010000 – U-001FFFFF: U-00200000 – U-03FFFFFF: U-04000000 – U-7FFFFFFF:

© Swedbank

UTF-8 0xxxxxxx 110xxxxx 10xxxxxx 1110xxxx 10xxxxxx 10xxxxxx 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

BLT8MSWIN1257 vs NEE8ISO8850P4 • • • • • •

Euro sign – € Trade mark sign – ™ Cent sign - ¢ Pound sign - £ Copyright sign - © Registered sign - ®

© Swedbank

Encodings available for Baltics (around year 2000) • BLT8EBCDIC1112 - EBCDIC Code Page 1112 8-bit Baltic Multilingual • BLT8PC775 - IBM-PC Code Page 775 8-bit Baltic • BLT8MSWIN1257 - MS Windows Code Page 1257 8-bit Baltic • BLT8CP921 - Latvian Standard LVS8-92(1) Windows/Unix 8-bit Baltic

© Swedbank

letters that have the same code in both encoding Letter

õ Õ ä Ä ö Ö ü Ü © Swedbank

Name nee8iso8859p4 blt8mswin1257 EE LV LT C with caron 200 200 Y Y c with caron 232 232 Y Y o with tilde 245 245 Y O with tilde 213 213 Y a with diaeresis 228 228 Y A with diaeresis 196 196 Y o with diaeresis 246 246 Y O with diaeresis 214 214 Y u with diaeresis 252 252 Y U with diaeresis 220 220 Y

Letters that have different code Letter

ž Ž š Š

© Swedbank

Name nee8iso8859p4 blt8mswin1257 EE LV LT U with ogonek 217 216 Y u with ogonek 249 248 Y U with macron 222 219 Y Y u with macron 254 251 Y Y z with caron 190 254 Y Y Y Z with caron 174 222 Y Y Y s with caron 185 240 Y Y Y S with caron 169 208 Y Y Y R with cedilla 163 170 Y r with cedilla 179 186 Y O with macron 210 212 Y o with macron 242 244 Y