User experience: migrating databases to Unicode

User experience: migrating databases to Unicode • May 2010 • Marek Läll • Swedbank Baltics [email protected] • +372 50 250 27 What is Unicode? ...

Author: Sibyl Lambert

1 downloads 2 Views 487KB Size

Report

Download PDF

Recommend Documents

AN EXPERIENCE IN MIGRATING FROM ASP TO ASP.NET

Migrating to Windows 7

Migrating to the Cloud

GreekKeys Unicode 2008 USER S GUIDE

Conversion of Blaise Databases to Relational Databases *

Text Conversion Tool: Non Unicode to Unicode Text

Intro to User Experience & UI Design

Migrating from SharePoint to SharePoint

Migrating ExcelForms to the.net Framework

How to Type Khmer Unicode

Transitioning to Unicode : What To Expect

An Intro to Databases

Access to databases (JDBC)

Introduction to Databases

COMPLETE USER EXPERIENCE (UX) CHECKLIST. Guide to helping create the best user experience for your brand

Faktoren der User Experience

The Cooltown User Experience

Measuring the User Experience

Workday User Experience

User Experience bei Softwareanbietern

User Experience Management

USER EXPERIENCE STORIES

Observing the User Experience

Migrating Borland Database Engine Applications to dbexpress

User experience: migrating databases to Unicode

•

May 2010

• Marek Läll • Swedbank Baltics [email protected] • +372 50 250 27

What is Unicode? • Unicode is a universal encoded character set that allows you to store information from any language. Unicode defines properties for each character, standardizes script behavior, provides a standard algorithm for bi-directional text, and defines cross-mappings to other standards. Unicode provides a unique code value for every character, regardless the platform, program or language.

© Swedbank

Motivation • • • • •

1998 – Bank was opened in Lithuania 2000 – Bank was bought in Latvia 2002 – Bank was opened in Moscow Data (customers) need group level consolidation Web applications (internet bank) supports Unicode by default • Java supports Universal Character Set (UCS-2) natively

© Swedbank

Motivation - Non ASCII letters EE:

ŠŽÕÄÖÜ ŠŽ

LV: LT: RU:

© Swedbank

ŠŽ

Unicode implementations • Common Unicode encoding formats are: – UTF-8 (variable width, 8 bit, 1, 2, 3 , 4, 5 or 6 bytes long) – UCS-2 (fixed width, 16 bit, 2 bytes long) – UTF-16 (variable width, 16 bit, 2 or 4 bytes long)

• Conversion between them is algorithm based (simple bitwise operation), an expensive mapping table is unnecessary

© Swedbank

Unicode Character Sets In The Oracle Database [Metalink note: 260893.1]

Characterset Name AL24UTFFSS UTF8 UTFE

RDBMS Unicode version version 7.2-8.1 1.1 8.0-11g 2.1 (8.0-8.1.6) , 3.0 (8.1.7-11g) 8.0-11g 2.1 (8.0-8.1.6) , 3.0 (8.1.7-11g)

AL32UTF8*

9.0-11g

3.0 (9.0), 3.1 (9.2), 3.2 (10.1), 4.01 (10.2), 5.0 (11.1)

AL16UTF16**

9.0-11g

3.0 (9.0), 3.1 (9.2), 3.2 (10.1), 4.01 (10.2), 5.0 (11.1)

* cannot be used as NLS_NCHAR_CHARACTERSET , can only be used as NLS_CHARACTERSET ** can only be used as NLS_NCHAR_CHARACTERSET, cannot be used as NLS_CHARACTERSET

© Swedbank

Expectations (around year ~2002) • • • • •

Use latest available database version: 8.1 Transparent migration / no need to changes apps Just create new UTF8 database Export / import structure / data Use it!

• But … it did not fly. Why?

© Swedbank

Configuration change planned • Old – Windows encoding: cp1257 – Oracle Forms/Client encoding: NEE8ISO8859P4 – Database encoding: NEE8ISO8859P4

• New – Windows encoding: cp1257 – Oracle Forms/Client encoding: BLT8MSWIN1257 – Database encoding: UTF8

• NB! If client and database encoding is the same then CHAR/VARCHAR2 columns can be used as raw/binary columns. There is no any checking or conversion of character/byte values © Swedbank

Important facts • Version 8.x – VARCHAR2(1) – means 1 BYTE! It is not in characters

• Version 9.0 – New init.ora parameter: nls_length_semantics=BYTE | CHAR – New syntax: VARCHAR2( [BYTE | CHAR]) – 3 levels to set semantics: database / init.ora parameter; alter session; object level: create table ... varchar2(5 char) – column DATA_LENGTH in view *_tab_columns shows length in BYTES! (opinion: major conceptual mistake) – new columns in *_tab_columns view: CHAR_LENGTH, CHAR_USED – Limitation: VARCHAR2(4000 CHAR) – max limit is 4000 bytes © Swedbank

Issues discovered • several applications went crazy - they saw that all varchar2 columns are 3 times longer because DATA_LENGTH is in bytes • chr() function does not work. Functions unistr(‘‘) or convert() are alternatives • Make sure you do not store "binary" data in character datatypes (CHAR,VARCHAR2,CLOB). • Export file includes source db nls_length_semantics value. Default is BYTE. Not found a way to override it.

© Swedbank

How did we fix wrong encoding • Old db encoding was nee8iso8859p4 but characters inside were actually in blt8mswin1257 encoding • Upgrade was done using exp / imp – Exp | | imp

• 3rd byte in export file was replaced: – Value 179 – represented encoding blt8mswin1257 – Value 34 – represented encoding nee8iso8859p4

© Swedbank

Conclusions / Suggestions • Create database; encoding AL32UTF8 (or UTF8); one char can take up to 4 bytes (or 3 bytes) • Init.ora: nls_length_semantics=CHAR • 4000 byte limit: varchar2( 4000 CHAR) column theoretical minimum is only 1000 characters • There are ways to enter byte sequence which is not valid Unicode character

© Swedbank

Tools / functions • Csscan/Csalter - utility for checking data before migrating character sets • Function: convert( string1, char_set_to, [char_set_from]) • length(), lengthb(), lengthc(); instr*(); substr*() • select dump(owner, 1016) from all_objects … Typ=1 Len=3 CharacterSet=AL32UTF8: 53,59,53 • SQL*Plus> desc all_objects Name Null? Type -------------- ---------------- ---------------------------OWNER NOT NULL VARCHAR2(30 BYTE) © Swedbank

References • ID 788156.1: AL32UTF8 / UTF8 (Unicode) Database Character Set Implications • ID 144808.1: Examples and limits of BYTE and CHAR semantics usage (NLS_LENGTH_SEMANTICS) • ID 179133.1: The correct NLS_LANG in a Windows Environment • http://www.oracle.com/technology/tech/globalization/pdf/T WP_AppDev_Unicode_10gR2.pdf • http://www.unicode.org/standard/standard.html • http://en.wikipedia.org/wiki/Russian_alphabet © Swedbank

Marek Läll [email protected] +37 250 250 27 © Swedbank

Standards / UCS • ISO 10646 defines the Universal Character Set (UCS) http://en.wikipedia.org/wiki/Universal_Character_Set • UCS is a superset of all other character set standards • UCS-2 and UCS-4 are mostly used • PDF copy of ISO 10646:2003 http://standards.iso.org/ittf/PubliclyAvailableStandards/c03 9921_ISO_IEC_10646_2003(E).zip (~80Mb)

© Swedbank

Standards / UTF-8 • UTF-8 : UTF stands for UCS Transformation Format • http://en.wikipedia.org/wiki/UTF-8 • Mapping between UCS and UTF-8 UCS U-00000000 – U-0000007F: U-00000080 – U-000007FF: U-00000800 – U-0000FFFF: U-00010000 – U-001FFFFF: U-00200000 – U-03FFFFFF: U-04000000 – U-7FFFFFFF:

© Swedbank

UTF-8 0xxxxxxx 110xxxxx 10xxxxxx 1110xxxx 10xxxxxx 10xxxxxx 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

BLT8MSWIN1257 vs NEE8ISO8850P4 • • • • • •

Euro sign – € Trade mark sign – ™ Cent sign - ¢ Pound sign - £ Copyright sign - © Registered sign - ®

© Swedbank

Encodings available for Baltics (around year 2000) • BLT8EBCDIC1112 - EBCDIC Code Page 1112 8-bit Baltic Multilingual • BLT8PC775 - IBM-PC Code Page 775 8-bit Baltic • BLT8MSWIN1257 - MS Windows Code Page 1257 8-bit Baltic • BLT8CP921 - Latvian Standard LVS8-92(1) Windows/Unix 8-bit Baltic

© Swedbank

letters that have the same code in both encoding Letter

õ Õ ä Ä ö Ö ü Ü © Swedbank

Name nee8iso8859p4 blt8mswin1257 EE LV LT C with caron 200 200 Y Y c with caron 232 232 Y Y o with tilde 245 245 Y O with tilde 213 213 Y a with diaeresis 228 228 Y A with diaeresis 196 196 Y o with diaeresis 246 246 Y O with diaeresis 214 214 Y u with diaeresis 252 252 Y U with diaeresis 220 220 Y

Letters that have different code Letter

ž Ž š Š

© Swedbank

Name nee8iso8859p4 blt8mswin1257 EE LV LT U with ogonek 217 216 Y u with ogonek 249 248 Y U with macron 222 219 Y Y u with macron 254 251 Y Y z with caron 190 254 Y Y Y Z with caron 174 222 Y Y Y s with caron 185 240 Y Y Y S with caron 169 208 Y Y Y R with cedilla 163 170 Y r with cedilla 179 186 Y O with macron 210 212 Y o with macron 242 244 Y