Character Encoding, Conversion and DataFlex

Author: Meghan Garrett
Character Encoding, Conversion and DataFlex
Presented by John Tuohy

PRESENTED BY ORYX

1

2

History of Character Encoding

• In the beginning we were all single byte
  – ASCII
    • Defines code points 0-127. Was a US set of characters
  – Extended ASCII
    • Uses the additional high bit (128 more characters) for whatever
  – Code-pages
    • Formalized the extended definitions. All (most?) code-pages use the ASCII subset
  – OEM
    • Is one set of code-page definitions, used mostly by IBM, MS-DOS and non-Windows systems
  – ANSI
    • Is another set of code-page definitions, used by Windows
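
To make the code-page idea concrete, here is a minimal Python sketch. It is not DataFlex code: Python's built-in `cp1252` and `cp437` codecs are used as stand-ins for one ANSI and one OEM code page, and that choice of code pages is an assumption. The same high-bit byte decodes to a different character under each, while the ASCII range is identical.

```python
# Illustrative only: Python codecs stand in for an ANSI and an OEM code page.
raw = bytes([0x41, 0xE9])  # 'A' followed by one high-bit byte

as_ansi = raw.decode("cp1252")  # Windows Western European (ANSI)
as_oem = raw.decode("cp437")    # original IBM PC code page (OEM)

# Print code points rather than glyphs so the output is console-independent.
print([f"U+{ord(c):04X}" for c in as_ansi])  # ['U+0041', 'U+00E9']
print([f"U+{ord(c):04X}" for c in as_oem])   # ['U+0041', 'U+0398']

# Bytes 0-127 decode to the same characters under both code pages.
ascii_bytes = bytes(range(128))
same_ascii = ascii_bytes.decode("cp1252") == ascii_bytes.decode("cp437")
print(same_ascii)  # True
```

Byte 0xE9 is "é" (U+00E9) in the ANSI code page and the Greek capital theta (U+0398) in the OEM one, which is why text must never be interpreted without knowing its code page.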

3

History of Character Encoding

• Then multi-byte characters emerged
  – Unicode
    • Unicode attempts to define all possible characters in a single set. No code pages
    • Each possible character is assigned a “code-point” (a number)
    • Most characters, but not all, are assigned within the first 64K range
    • Therefore most characters, but not all, can be represented with a 16-bit word
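
As a small illustration of code points, the following Python sketch (the sample characters are arbitrary choices, not taken from the slides) prints the code point of three characters and checks whether each fits in a single 16-bit word.

```python
# Sample characters, written as escapes so the source stays plain ASCII.
samples = {
    "letter A": "A",
    "euro sign": "\u20ac",
    "emoji": "\U0001F600",
}

fits_in_word = {}
for name, ch in samples.items():
    code_point = ord(ch)                 # the number Unicode assigns to the character
    fits_in_word[name] = code_point <= 0xFFFF
    print(f"{name}: U+{code_point:04X}, fits in 16 bits: {fits_in_word[name]}")
```

The first two fall inside the first 64K range; the emoji does not, so it is one of the "but not all" cases.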

4

History of Character Encoding

• Unicode Encoding
  – Encoding refers to how a Unicode code point is represented (i.e. serialized)
  – First came UTF-16 encoding
  – This was followed by UTF-8 encoding
  – Currently they co-exist
  – Unicode is not an encoding format
    • However, it is often talked about as if it were

5

History of Character Encoding

• UTF-16
  – 2 or 4 bytes to represent a character
  – Normally it’s 2 bytes
  – Not compatible with anything (ASCII, ANSI, CString)
  – If limited to 2 bytes, it is UCS-2
    • Wide-strings are good for processing
    • Used this way in Windows
  – It can contain embedded zeros
  – It doubles, at least, storage/transmission requirements
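
The UTF-16 properties above can be checked with a short Python sketch. The helper `c_string_length` is hypothetical and only mimics how a zero-terminated string routine would measure a buffer; it is not the DataFlex `CStringLength` function.

```python
def utf16_bytes(text: str) -> bytes:
    """Serialize text as UTF-16, little-endian, without a byte-order mark."""
    return text.encode("utf-16-le")


def c_string_length(buffer: bytes) -> int:
    """Hypothetical helper: length up to the first zero byte, as a C string routine sees it."""
    index = buffer.find(b"\x00")
    return len(buffer) if index < 0 else index


plain = utf16_bytes("ABC")         # 41 00 42 00 43 00
emoji = utf16_bytes("\U0001F600")  # a surrogate pair: 3D D8 00 DE

print(len(plain))               # 6  (2 bytes per character)
print(len(emoji))               # 4  (one character outside the first 64K range)
print(c_string_length(plain))   # 1  (the embedded zero ends the "string" early)
```

The last line shows why UTF-16 data cannot be passed through code that assumes zero-terminated single-byte strings.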

6

History of Character Encoding

• UTF-8 Encoding
  – Uses 1-4 bytes to represent a character (the original design allowed up to 6)
  – Is ASCII compatible (i.e., first 128 characters)
  – Is CString compatible (no embedded 0)
  – Is the encoding of choice for Web and XML
  – Is difficult to use for string function processing
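
A short Python sketch of the same points for UTF-8, using arbitrary sample characters: the byte count varies per character, ASCII is unchanged, no zero bytes appear, and the byte length no longer matches the character count (which is what makes byte-oriented string functions awkward).

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, an emoji

# Number of bytes UTF-8 needs for each sample character.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]

# ASCII text is byte-for-byte identical in UTF-8.
ascii_compatible = "DataFlex".encode("utf-8") == "DataFlex".encode("ascii")

# None of the encoded samples contains a zero byte.
no_embedded_zero = all(b"\x00" not in ch.encode("utf-8") for ch in samples)

# Character count and byte count differ once non-ASCII text is involved.
word = "caf\u00e9"
char_count = len(word)                  # 4 characters
byte_count = len(word.encode("utf-8"))  # 5 bytes
print(ascii_compatible, no_embedded_zero, char_count, byte_count)
```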

7

History of Character Encoding

• So computer processing must deal with
  – ANSI code pages
  – OEM code pages
  – UTF-16
  – UTF-8
• You handle this via character conversions

8

Encoding and Conversions

• Character Encoding and Conversions
  – First you must know the encoding
    • You cannot process character text if you don’t know the encoding
  – Functions exist to convert between all encodings
  – Character conversion always works as designed
  – Mostly, character encoding works perfectly
  – When it does not, it can result in “lossy” data

9

Encoding and Conversions

• Encoding conversions can lead to lossy data
    • Unicode to Unicode – always works
    • ANSI/OEM to Unicode – always works
    • Unicode to ANSI/OEM – possible loss
    • OEM to ANSI – possible loss
    • ANSI to OEM – possible loss
    • Any code-page conversion – possible loss
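
The loss rules above can be demonstrated with a hedged Python sketch. Python codecs (`cp1252` as ANSI, `cp437` as OEM, both assumed) stand in for the real conversion functions, and `round_trips` and `oem_to_ansi` are hypothetical helper names. Note that Python substitutes "?" for an unmappable character, whereas Windows conversion routines may pick a "best fit" character instead; either way the original is lost.

```python
def round_trips(text: str, codec: str) -> bool:
    """True if text survives Unicode -> codec -> Unicode unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeEncodeError:
        return False


def oem_to_ansi(data: bytes) -> bytes:
    """Hypothetical stand-in for an OEM-to-ANSI conversion (cp437 -> cp1252 assumed)."""
    return data.decode("cp437").encode("cp1252", errors="replace")


results = {
    "Unicode to Unicode": round_trips("\u0416\u20ac", "utf-16-le"),  # True
    "e-acute to ANSI": round_trips("\u00e9", "cp1252"),              # True
    "Cyrillic to ANSI": round_trips("\u0416", "cp1252"),             # False
    "euro to OEM": round_trips("\u20ac", "cp437"),                   # False
}
print(results)

# OEM byte 0xC9 is a box-drawing corner in cp437; cp1252 has no such character.
print(oem_to_ansi(b"\xc9"))  # b'?'
```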

10

Code page madness



The sad, sad tale of the €
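
The slide only names the euro sign, so the following is one plausible illustration rather than the presenter's own example. This Python sketch lists how several common encodings represent "€" (the selection of code pages is an assumption): older code pages have no euro at all, and those that do disagree on the byte value.

```python
EURO = "\u20ac"

# A selection of encodings, identified by their Python codec names.
CODE_PAGES = ["cp437", "cp850", "cp858", "cp1252", "latin_1", "iso8859_15", "utf-8"]


def euro_bytes(codec: str):
    """Return the bytes for the euro sign in codec, or None if it has no euro."""
    try:
        return EURO.encode(codec)
    except UnicodeEncodeError:
        return None


euro_table = {codec: euro_bytes(codec) for codec in CODE_PAGES}
for codec, encoded in euro_table.items():
    print(f"{codec:12} {encoded.hex(' ') if encoded else 'no euro sign'}")
```

Under these codecs the euro is missing from `cp437`, `cp850` and `latin_1`, is byte D5 in `cp858`, byte 80 in `cp1252`, byte A4 in `iso8859_15`, and the three bytes E2 82 AC in UTF-8.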

11

Encoding and Conversions

• Understanding the differences between
  – Encoding vs. Conversion
  – Processing vs. Transmission / Storage
  – Conversion Loss vs. Programming Errors

12

Encoding and Conversions

• Let’s see some examples

13

Encoding in DataFlex

• Our Environment
  – Windows Controls are Unicode
    • They have an ANSI history
    • Most but not all controls have an ANSI (A) interface
    • All controls have a Unicode (W) interface
  – COM is Unicode
  – The Internet's default transmission format is UTF-8
  – Databases can support various formats
    • Embedded database assumes OEM with an appropriate OEM collating sequence
    • SQL databases can be ANSI, OEM or Unicode

14

Encoding in DataFlex

• DataFlex Strings
  – DataFlex strings are single byte
  – DataFlex strings are OEM
  – Much of DataFlex relies on zero terminated strings
• Windows Controls are Unicode with (mostly) an ANSI portal
• COM uses Unicode via Variant Strings
• Variant strings are Unicode
• Flexml internally is Unicode
• Flexml reads and writes (normally) as UTF-8

15

Encoding in DataFlex

• WebApp transmits as UTF-8
• Sequential file handling interface does no conversions
  – This implies OEM
  – It can be any format

There are unavoidable legacy inconsistencies throughout the framework

16

Encoding in DataFlex

• The Development Environment
  – We expect all DataFlex source code to be OEM
  – All language text not enclosed within quotes should be ASCII
  – The Studio encodes all DataFlex source in OEM
  – The Studio shows other source as ANSI

17

Encoding in DataFlex

• Typical Windows control with ANSI interface
    DF OEM -- (OemToAnsi) -- ANSI -- (AnsiToUnicode) -- Windows Unicode
• Windows control with only a Unicode interface
    DF OEM -- (OemToUnicode) -- Windows Unicode
• COM Control
    DF OEM -- (OemToUnicode) -- COM control
• Variant Handling
    DF OEM -- (OemToUnicode) -- Variant

18

Encoding in DataFlex

• Database to variable
    OEM Database -- DF OEM
    ANSI Database -- (AnsiToOem) -- DF OEM
    Unicode Database -- (UnicodeToOem) -- DF OEM

19

Encoding in DataFlex

• ANSI Database to Windows “A” control
    ANSI Database -- (AnsiToOem) -- DF OEM -- (OemToAnsi) -- ANSI -- (AnsiToUnicode) -- Windows Unicode
• ANSI Database to Windows “W” control or COM control
    ANSI Database -- (AnsiToOem) -- DF OEM -- (OemToUnicode) -- Windows Unicode
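
The chain from an ANSI database to a Windows “A” control can be simulated in Python to show where data is lost. This is a sketch under stated assumptions, not DataFlex code: the function names are hypothetical stand-ins for the conversion steps in the diagram, `cp1252` is assumed for ANSI and `cp850` for OEM, and unmappable characters become "?" (real conversion routines may substitute differently).

```python
ANSI = "cp1252"  # assumed ANSI code page
OEM = "cp850"    # assumed OEM code page


def ansi_to_oem(data: bytes) -> bytes:
    """Stand-in for the AnsiToOem step."""
    return data.decode(ANSI).encode(OEM, errors="replace")


def oem_to_ansi(data: bytes) -> bytes:
    """Stand-in for the OemToAnsi step."""
    return data.decode(OEM).encode(ANSI, errors="replace")


def ansi_to_unicode(data: bytes) -> str:
    """Stand-in for the AnsiToUnicode step performed on the Windows side."""
    return data.decode(ANSI)


def database_to_control(db_bytes: bytes) -> str:
    """ANSI database -> DF OEM string -> ANSI -> Windows Unicode."""
    df_string = ansi_to_oem(db_bytes)   # value as held in a DataFlex string
    ansi_value = oem_to_ansi(df_string)  # value handed to the "A" interface
    return ansi_to_unicode(ansi_value)


survives = database_to_control("caf\u00e9".encode(ANSI))  # e-acute exists in both code pages
damaged = database_to_control("\u20ac5".encode(ANSI))     # the euro sign has no cp850 equivalent
print(survives == "caf\u00e9")  # True
print(damaged)                  # ?5
```

The loss happens at the first step, when the value enters the OEM string; the later conversions cannot restore it.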

20

Encoding in DataFlex

• How to Store Character Data
  – String
  – Memory (Address)
  – Variant

21

Encoding in DataFlex

• String
  – DF expects this to be OEM
  – It is best to treat this as a zero terminated string
  – Length limited by Argument_Size
  – String functions
  – Can cast between all data types
  – Allocation and disposal is automatic
  – Will work with ANSI with limitations
  – Will work with UTF-8 with limitations

22

Encoding in DataFlex

• Memory
  – Works with pointers
  – Must allocate and dispose yourself
  – Can be any length
  – Very limited ability to cast to other types
  – Format is undefined - can be anything
  – No string functions
  – Can use low level memory functions
  – Tedious to work with

23

Encoding in DataFlex

• Variant String
  – We created this to support COM
  – Internal format is Unicode
  – Allocation and disposal is automatic
  – Can be any length
  – Limited ability to cast to other types - goes through strings
  – No string library - limited variant and helper functions

24

Encoding and Conversion in DataFlex

• Global Functions
  – Utf8ToOemBuffer
  – OemtoUtf8Buffer
  – ToOem
  – ToAnsi
  – VariantStringLength
  – CStringLength
• XML Variant Functions
  – Get / Set pvNodeValue
  – Get pvXML
  – LoadXMLFromVariant

25

Encoding and Conversion: cCharTranslate class

• Conversion between UTF-16 buffer and a buffer/string of ANSI, OEM, UTF-8
  – UTF16FromBuffer
  – UTF16ToBuffer
  – UTF16FromStr
  – UTF16ToStr
• Conversion between UTF-8 buffer and a buffer/string of ANSI, OEM
  – UTF8FromBuffer
  – UTF8ToBuffer
  – UTF8FromStr
  – UTF8ToStr

26

Encoding and Conversion: cCharTranslate class

• Conversion between Variant String (Unicode) and a buffer/string of ANSI, OEM, UTF-8
  – VariantStrFromBuffer
  – VariantStrToBuffer
  – VariantStrFromStr
  – VariantStrToStr
• Conversion between Variant String (Unicode) and a buffer of UTF-16
  – VariantStrFromUTF16
  – VariantStrToUTF16

27

Encoding and Conversion: cCharTranslate class

• Buffer Base64 Encode/Decode to/from a string or variant
  – Base64EncodeToStr
  – Base64DecodeFromStr
  – Base64EncodeToVariantStr
  – Base64DecodeFromVariantStr
• Plus global Base64 Encode/Decode to buffer
  – Base64Encode
  – Base64Decode
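
For readers unfamiliar with why Base64 sits next to the character conversions, this Python sketch (using Python's standard `base64` module, not the cCharTranslate methods listed above) shows the idea: an arbitrary binary buffer, including zero and high-bit bytes, becomes plain ASCII text that survives any single-byte string handling, and decodes back unchanged.

```python
import base64

# A binary buffer containing a zero byte and high-bit bytes.
buffer = bytes([0x00, 0xFF, 0x10, 0x80])

# Encode to Base64; the result contains only ASCII letters, digits, '+', '/' and '='.
encoded_text = base64.b64encode(buffer).decode("ascii")

# Decode back to the original binary buffer.
decoded = base64.b64decode(encoded_text)

print(encoded_text.isascii())  # True
print(decoded == buffer)       # True
```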

28

Encoding and Conversion: cSeqFileHelper

• Reads or writes a binary file to/from a buffer
  – WriteBinFileFromBuffer
  – ReadBinFileToBuffer
• Reads or writes an OEM, ANSI, UTF-8 or UTF-16 file to/from a variant string (Unicode)
  – WriteFileFromVariantStr
  – ReadFileToVariantStr
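
The pattern behind these helpers (read raw bytes, then convert from a known encoding to Unicode) can be sketched in Python. The function names below are hypothetical and are not the cSeqFileHelper methods; `cp850` and `cp1252` are assumed as the OEM and ANSI code pages.

```python
import os
import tempfile


def write_file_from_text(path: str, text: str, encoding: str) -> None:
    """Hypothetical helper: convert Unicode text to the given encoding and write the bytes."""
    with open(path, "wb") as handle:
        handle.write(text.encode(encoding))


def read_file_to_text(path: str, encoding: str) -> str:
    """Hypothetical helper: read raw bytes and convert them from the given encoding."""
    with open(path, "rb") as handle:
        return handle.read().decode(encoding)


text = "caf\u00e9"
results = {}
with tempfile.TemporaryDirectory() as folder:
    for encoding in ("cp850", "cp1252", "utf-8", "utf-16-le"):
        path = os.path.join(folder, f"sample_{encoding}.txt")
        write_file_from_text(path, text, encoding)
        # Record the file size on disk and the text recovered from it.
        results[encoding] = (os.path.getsize(path), read_file_to_text(path, encoding))

for encoding, (size, recovered) in results.items():
    print(f"{encoding:10} {size} bytes, round trip ok: {recovered == text}")
```

The same four characters occupy 4, 4, 5 and 8 bytes respectively, and each file only reads back correctly when the caller supplies the encoding it was written in.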

29

DataFlex: Encoding

• Let’s see some examples

30

Thank You