An Introduction to Unicode’s Role in XML
Alex Brown, Technical Director, Griffin Brown Digital Publishing Ltd.
21 December, 2001
Unicode is an essential part of the XML jigsaw, yet those unready for it can have a nasty shock when the need to process ‘special characters’ in XML content arises. This article is aimed at users with some knowledge of XML and/or SGML. It gives some background on Unicode and discusses the practicalities of using it correctly in XML content.
The SGML Way
In SGML, the most usual way of treating special characters was to use SDATA (‘specific character data’). A DTD would typically define a number of entities as SDATA entities, and then in an instance, an SGML processor would be able to recognise references to such an entity as representing SDATA, and take appropriate action depending on the special meaning of the content. In practice, however, the scarcity of true SGML processors meant that the entity reference itself (e.g., ‘&eacute;’) often came to have a special meaning and was frequently subject to search-and-replace operations when SGML data was processed.
An SGML declaration allows an SGML application to use different character sets (even Unicode). At least in my experience, however, this is rarely done: most SGML systems used minimal SGML documents and a set of SDATA entities, usually the ISO entity sets that appeared with the SGML standard.
The XML Way: Unicode
In accord with the general principles of XML, its character set handling is more simply defined. Simply put, all XML is Unicode. Unicode defines a character set and encoding scheme designed to cater for the display of characters in all languages and disciplines. These two subjects are worth dealing with separately, and indeed confusion between them is often a difficulty for Unicode newcomers.
The Unicode Character Set
The Unicode 3.0 character set itself contains 49,194 pre-defined characters. This is sometimes called the ‘universal character set’ (ISO 10646) and it is soon expected to replace all other ASCII extensions. Most other character sets are today already defined in terms of the ISO 10646 character code positions.
What this means in practice is that every character you might normally need to use has been defined in Unicode, and that it has a set ‘code position’ (i.e., number) to identify it. For example, the letter ‘a’ is character number 97, the Greek letter ‘α’ is character number 945, and the dagger symbol ‘†’ is character number 8224 (these are decimal numbers). Any XML processor (for example, an XML parser) will represent XML data internally as a sequence of these numbers, which represent characters within Unicode.
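The code positions quoted above can be checked directly. The following is a minimal sketch in Python, using only the standard library; the variable names `samples`, `positions` and `round_trip` are illustrative choices and not part of any Unicode or XML interface. It prints each position in decimal and in hexadecimal, the latter in the form conventionally written U+nnnn.

```python
import unicodedata

# The three sample characters from the text: 'a', Greek alpha, dagger.
samples = ["a", "\u03b1", "\u2020"]

# ord() gives a character's Unicode code position as an integer.
positions = {ch: ord(ch) for ch in samples}

for ch, number in positions.items():
    # Decimal position, hexadecimal position, and the character's Unicode name.
    print(ch, number, f"U+{number:04X}", unicodedata.name(ch))

# chr() goes the other way, from code position back to character.
round_trip = [chr(number) for number in positions.values()]
```

The mapping is fixed: whatever the platform or encoding, `ord` of the Greek alpha is 945.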
Extending the Character Set
For some applications of XML, for example scientific publishing or linguistic work, there may on occasion be a need to use characters which are not pre-defined in Unicode. To allow for this, Unicode has a Private Use Area within which additional characters can be defined. There are mechanisms to define nearly a million additional characters. In a production environment, the most important aspect of such extensions is documenting them and making sure their meanings are agreed and understood.
Notation
By convention, Unicode characters are commonly referred to in documentation with the form ‘U+nnnn’, where nnnn is a four-character hexadecimal number giving a character’s position. So the character ‘a’ is U+0061, and the character ‘α’ is U+03B1. This notation will be used for the remainder of this article.
The size of the Unicode character set raises a problem. It has 50,000-odd characters predefined (which is good), yet the fundamental storage unit for computer systems is the byte, an 8-bit switch capable of being in 2^8 (= 256) distinct states, and typically used to represent the numbers 0 to 255. In the past we could often get by using those 256 states to store a useful sample of the characters we need. For Western users on desktop PCs this was generally the main alphanumeric and punctuation symbols of US ASCII, and a handful of common accented characters and symbols. However, we clearly cannot store one character in one byte if we wish to be able to represent the whole Unicode character set. This is the problem that encoding addresses: how to represent the full diversity of Unicode character positions given the limitations of the byte as a storage unit.
An Example
The following XML will be used for some encoding examples:
<grade>α</grade>
In an ideal future world we would have input software and fonts such that we might enter and see this symbol as easily as (with luck, more easily than) we can with modern word-processing software like Microsoft Word, and think no more about it. However, in a typical present-day XML environment, we will probably be forced to have some knowledge of Unicode.
Since XML is always Unicode, we need to ask two questions when faced with a special character:
1. What is the Unicode character position for the character?
2. How shall it be encoded?
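Both questions can be answered mechanically for a single character. The sketch below is illustrative only: `describe` and `hex_pairs` are hypothetical helper names, and the two encodings shown are simply the ones XML processors are required to read. It reports the position of the Greek alpha and the bytes it becomes under each encoding.

```python
def hex_pairs(data: bytes) -> str:
    """Format bytes as space-separated upper-case hexadecimal pairs."""
    return " ".join(f"{b:02X}" for b in data)

def describe(ch: str) -> dict:
    """Answer the two questions for one character (hypothetical helper)."""
    return {
        # Question 1: the character's position, in U+nnnn notation.
        "position": f"U+{ord(ch):04X}",
        # Question 2: the bytes that result under two possible encodings.
        "utf-8": hex_pairs(ch.encode("utf-8")),
        "utf-16-le": hex_pairs(ch.encode("utf-16-le")),
    }

alpha = describe("\u03b1")
print(alpha)
```

The answer to the first question is the same in every row; only the answer to the second depends on the scheme chosen.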
The characters in this fragment have Unicode positions as listed below:
<   U+003C
g   U+0067
r   U+0072
a   U+0061
d   U+0064
e   U+0065
>   U+003E
α   U+03B1
<   U+003C
/   U+002F
g   U+0067
r   U+0072
a   U+0061
d   U+0064
e   U+0065
>   U+003E
This mapping will never alter. To represent this text these Unicode characters always have to be used. The ‘a’ is always character number U+0061, the Greek alpha is always character number U+03B1.
The method for encoding this text, however, varies according to the scheme we decide to use.
UTF-16
UTF-16 is a scheme which uses two bytes to encode each character (strictly, each character among the first 65,536 positions; higher positions take a pair of two-byte units, though none of the characters discussed here needs one). Since each byte is 8 bits this allows 2^16 (= 65,536) combinations, more than enough to encompass the predefined Unicode characters.
The text of the above 16 character fragment encoded as UTF-16 should therefore take 32 bytes. Each character is represented by a two byte sequence, one byte representing units and the other multiples of 256. So for example ‘a’ would be represented by the (hexadecimal) byte values 61 and 00 (97 units, 0 multiples of 256); the ‘α’ by the hex byte values B1 and 03 (177 units, 3 multiples of 256).
To complicate matters further, historically computer systems have taken different approaches to storing multi-byte numbers. Some, so-called little-endian systems, have elected to store units before the higher powers of the base (in which case ‘a’ would be represented by the bytes 61 and 00); others, big-endian systems, have taken the view that the units come last (in which case the same character would be represented by the bytes 00 and 61). (The terms little-endian and big-endian come from Swift’s Gulliver’s Travels, in which two countries go to war over which end of a hard-boiled egg should be eaten first, the little end or the big end. In computing, arguments over the merits of the two schemes are comparably important.) For interoperability, Unicode permits both methods of storage, and to allow for this difference UTF-16 data starts with a ‘byte order mark’ (BOM) which describes which ordering is used. This takes the form of a special ‘ZERO WIDTH NO-BREAK SPACE’ code, U+FEFF. XML is stricter than Unicode here: the Unicode Standard states this mark is a ‘strong hint’ as to endianism; the XML Recommendation however states that ‘entities encoded in UTF-16 must begin with the Byte Order Mark’ [XML, §4.3.3]. This adds an extra two bytes to the storage requirement.
Stored in memory, the 34 byte sequence for this XML entity encoded using little-endian UTF-16 is therefore:
FF FE 3C 00 67 00 72 00 61 00 64 00 65 00 3E 00 B1 03 3C 00 2F 00 67 00 72 00 61 00 64 00 65 00 3E 00
Gratifyingly, opening this file with Microsoft Internet Explorer (version 5.5) shows the XML document with a Greek alpha correctly rendered.
Generally, though, UTF-16 still has little support in systems or applications. However, Windows NT/2000/XP has Unicode support (not yet exploited by many applications) and many modern programming languages including Java, C# and Python support Unicode ‘out of the box’.
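As an illustration of that built-in support, the little-endian byte sequence given for the example entity can be reproduced in a few lines of Python. This is a sketch, not a recommended serialisation routine: `hex_pairs`, `little` and `big` are illustrative names, and the byte order mark is prepended by hand so that the byte order is explicit rather than left to the platform default.

```python
import codecs

# The article's 16-character example entity.
fragment = "<grade>\u03b1</grade>"

def hex_pairs(data: bytes) -> str:
    """Format bytes as space-separated upper-case hexadecimal pairs."""
    return " ".join(f"{b:02X}" for b in data)

# Byte order mark (U+FEFF) followed by the text, in each byte order.
little = codecs.BOM_UTF16_LE + fragment.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + fragment.encode("utf-16-be")

print(len(little), hex_pairs(little))
print(len(big), hex_pairs(big))
```

The two results contain the same 34 bytes' worth of information; only the order within each two-byte unit differs, which is exactly what the leading FF FE or FE FF tells a reader.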
XML processors must be able to read entities in both the UTF-8 and UTF-16 encodings. [XML, §4.3.3] So, strictly speaking, any application which claims to be XML-conformant must accept UTF-8 or UTF-16 encoded XML documents. In practice, this is not always so.
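What this requirement amounts to can be sketched as follows: the same entity, whether it arrives as UTF-16 (in either byte order, with its byte order mark) or as UTF-8, must yield identical characters once read. Python's codecs stand in here for a parser's decoding layer. `read_entity` is a hypothetical helper, and its check for a leading byte order mark covers only these two encodings; it is not the full encoding-detection procedure a real XML processor follows.

```python
import codecs

def read_entity(data: bytes) -> str:
    """Decode an entity that is either UTF-16 (with BOM) or UTF-8."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order, and drops the BOM.
        return data.decode("utf-16")
    # No UTF-16 BOM: treat as UTF-8 ("utf-8-sig" also tolerates an optional UTF-8 BOM).
    return data.decode("utf-8-sig")

fragment = "<grade>\u03b1</grade>"
as_utf16_le = codecs.BOM_UTF16_LE + fragment.encode("utf-16-le")
as_utf16_be = codecs.BOM_UTF16_BE + fragment.encode("utf-16-be")
as_utf8 = fragment.encode("utf-8")

decoded = [read_entity(data) for data in (as_utf16_le, as_utf16_be, as_utf8)]
print(decoded)
```

A legacy component that assumes one byte per character would fail this check on all three inputs, since even the UTF-8 form stores the alpha as two bytes.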
UTF-8
A glance at the UTF-16 entity shows it is a wasteful format: much space is taken storing zeroes. UTF-8 is an encoding scheme which attempts to be less wasteful. Using this scheme common characters (‘a’, ‘b’, ‘c’, ‘1’, ‘2’, ‘3’, etc.) are encoded using a single byte. More unusual characters are encoded using multiple bytes. Using UTF-8 encoding, our entity is stored as these 17 bytes:
3C 67 72 61 64 65 3E CE B1 3C 2F 67 72 61 64 65 3E
Avoiding Encoding Troubles
There is no doubt that UTF-16 and UTF-8 encoded data can pose particular problems for XML systems, particularly legacy components (Perl scripts, perhaps) built in the SGML days, which expect XML to be nothing more than a benign mix of text and angle brackets. The lack of widespread tools for cleanly handling UTF-8 and UTF-16 encoded text means that a pragmatic approach to ‘special characters’ can pay dividends, especially since this is possible without moving away from the benefits of Unicode.
To avoid any multi-byte sequences, it is possible to use character references [XML, §4.1].
As you can see, the common characters are represented by single bytes (the initial 3C 67 72 represent the ‘