Unicode for Rails. Dominic Mitchell

Author: Byron Davis

Introduction to Unicode

What Is Unicode? Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. http://www.unicode.org/standard/WhatIsUnicode.html

More often...

Characters
A, A, A == U+0041
a, a, a == U+0061
Ā == U+0100
ﬃ == U+FB03
☺ == U+263A
☂, ☃, ✄, ☠, …

ASCII “In the beginning...” First published in 1963; the version still in use dates from 1967. 7-bit.

ISO-8859 Mid ‘80s 8-bit ASCII superset 16 different, related standards ISO-8859-1 (aka Latin-1) is most common

Windows-1252 Like ISO-8859-1, but with extra characters e.g. smart quotes, em dash The bane of your life

Unicode 21-bit Pretty much all characters in use, in the same character set.

But There’s More! Unicode also specifies: Character Properties Encodings Algorithms

Sounds Complex? It is. The real world is complex. But, you can get by on fairly minimal subset...

Encodings How do you turn characters into octets? It’s simple for ASCII & ISO-8859. Unicode has three different schemes.

UTF-32

code point      character          UTF-32 code value(s)
122 (7A)        small Z (Latin)    00 00 00 7A
27700 (6C34)    water (Chinese)    00 00 6C 34

4 octets (32 bits) per character. Very inefficient. Not used much.

UTF-16

code point      character          UTF-16 code value(s)
122 (7A)        small Z (Latin)    00 7A
27700 (6C34)    water (Chinese)    6C 34

2 octets (16 bits) per character (mostly). Common on Windows & Java. Somewhat wasteful for mostly Western text.
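The "(mostly)" refers to characters outside the 16-bit range, which UTF-16 encodes as two 16-bit units (a surrogate pair). A minimal sketch, assuming Ruby 1.9 or later for String#encode (the slides themselves target Ruby 1.8); the helper name utf16be_octets is made up for illustration:

```ruby
# Assumes Ruby 1.9+ (String#encode); utf16be_octets is a hypothetical helper.
def utf16be_octets(code_point)
  utf8 = [code_point].pack("U")                 # build the character as UTF-8
  utf8.encode("UTF-16BE").bytes.map { |b| format("%02X", b) }.join(" ")
end

puts utf16be_octets(0x6C34)   # inside the 16-bit range: one unit, "6C 34"
puts utf16be_octets(0x1D11E)  # outside it: a surrogate pair, "D8 34 DD 1E"
```

U+1D11E (MUSICAL SYMBOL G CLEF) is used here only because it is a well-known example of a character that needs a surrogate pair.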

UTF-8

code point      character          UTF-8 code value(s)
122 (7A)        small Z (Latin)    7A
27700 (6C34)    water (Chinese)    E6 B0 B4

Multi-byte, but ASCII compatible. Very common in Internet protocols. Reliably recognisable.
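The octets on the three encoding slides can be reproduced directly. This is a sketch assuming Ruby 1.9 or later, where String#encode is available; hex_octets is a hypothetical helper, and the big-endian variants are chosen to match the byte order shown on the slides:

```ruby
# Assumes Ruby 1.9+; hex_octets is a made-up helper for this illustration.
def hex_octets(str, encoding)
  str.encode(encoding).bytes.map { |b| format("%02X", b) }.join(" ")
end

small_z = [0x7A].pack("U")     # U+007A
water   = [0x6C34].pack("U")   # U+6C34

%w[UTF-32BE UTF-16BE UTF-8].each do |enc|
  puts "#{enc}: z = #{hex_octets(small_z, enc)}, water = #{hex_octets(water, enc)}"
end
```

Only the UTF-8 form of "z" is identical to its ASCII octet, which is what "ASCII compatible" means here.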

Which Encoding? By default, pick UTF-8. Choose UTF-16 when: you have lots of non-Western text; you are interfacing with other UTF-16 systems.

Accents Some are built-in (e.g. é). But you can build your own with “combining characters” (e.g. ĵ): U+006A LATIN SMALL LETTER J + U+0302 COMBINING CIRCUMFLEX ACCENT

Normalisation How can I spot é if There’s More Than One Way To Do It? “Normalize” all strings before use. Four forms of normalisation: NFC, NFD, NFKC, NFKD. But only NFC matters, “says W3C”.
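To make the problem concrete: the two spellings of ĵ compare as different strings until they are normalised. A minimal sketch, assuming Ruby 2.2 or later, where String#unicode_normalize is built in (on the Ruby 1.8 of these slides a separate library would be needed):

```ruby
# Assumes Ruby 2.2+ for String#unicode_normalize.
precomposed = [0x0135].pack("U")            # ĵ as a single code point
combined    = [0x006A, 0x0302].pack("U*")   # j + COMBINING CIRCUMFLEX ACCENT

puts precomposed == combined                          # false: different code points
puts combined.unicode_normalize(:nfc) == precomposed  # true: NFC composes
puts precomposed.unicode_normalize(:nfd) == combined  # true: NFD decomposes
```

Normalising input to NFC once, at the boundary, means later comparisons and lookups see a single spelling.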

Why bother? It’s more work now... But it opens everything up in the future! The rest of the world is heading this way

Unicode in Rails

Where to start? Examine a typical request. Where to begin thinking about Unicode?

Model? Controller? HTTP Headers? URI? Domain name!

IDN Internationalized Domain Names Punycode iñtërnâtiônàlizætiøn.net xn--itrntinliztin-vdb0a5exd8ewcye.net

URIs Called IRIs when used with Unicode Must use percent-encoded UTF-8 Not %uXXXX (IE only)
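Percent-encoded UTF-8 means each octet of the character's UTF-8 form becomes one %XX escape. A small sketch using ERB::Util.url_encode from Ruby's standard library; percent_encode_utf8 is a hypothetical wrapper, and the "\u" string escapes assume Ruby 1.9 or later:

```ruby
require "erb"

# Hypothetical wrapper: make sure the text is UTF-8, then escape octet by octet.
def percent_encode_utf8(text)
  ERB::Util.url_encode(text.encode("UTF-8"))
end

puts percent_encode_utf8("caf\u00E9")  # é is C3 A9 in UTF-8, so "caf%C3%A9"
puts percent_encode_utf8("\u6C34")     # three octets, so "%E6%B0%B4"
```

The non-standard form would instead write é as %u00E9, which is the variant the slide warns against.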

Browsers HTML Forms as input Uses page charset unless... Form has @accept-charset Patchy support...

Finally We get to your code, in the controller i.e. Ruby

Ruby & Unicode Bad reputation Somewhat deserved: Ruby understands bytes Not characters

But! There’s a magic flag! -K kcode Specifies KANJI (Japanese) encoding. -Ku turns on UTF-8 mode $KCODE = "UTF8"

$KCODE =~ /^u/i Sets encoding in Tk Allows CGI::unescapeHTML to output UTF-8 SOAP libs use it here and there Big user is the regex engine /./u matches a UTF-8 char
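As a quick illustration of the /u flag, the sketch below splits a string into UTF-8 characters with a regex and compares that with the octet count. It uses only pack, unpack and scan, so it should behave the same on Ruby 1.8 and on later versions:

```ruby
adam = [0x100, 0x64, 0x61, 0x6d].pack("U*")   # "Ādam" as UTF-8 octets

chars  = adam.scan(/./u)        # /u makes "." match one UTF-8 character
octets = adam.unpack("C*")      # raw octets, one integer each

puts chars.length    # 4 characters
puts octets.length   # 5 octets, because Ā takes two
```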

pack / unpack The only other place Ruby understands UTF-8.
[0x100, 0x64, 0x61, 0x6d].pack("U*") => "Ādam"
"Ādam".unpack("U*") => [256, 100, 97, 109]

Unicode affects... Any character processing. In String: [] []= =~ == capitalize casecmp center chomp chop count delete downcase dump each eql? gsub index insert length ljust lstrip replace reverse rindex rjust rstrip scan slice split squeeze strip sub succ swapcase tr upcase upto

And regexes

jcode.rb Core library Enhances String
"Ādam".length => 5
"Ādam".jlength => 4
Not very complete

iconv Another core library Converts between character encodings
conv = Iconv.new("UTF-8", "WINDOWS-1252")
conv.iconv "\223foo\224" => "“foo”"

Many Alternatives icu4r, unicode, utf8proc, characterencodings But they’re less relevant as of Rails 1.2

ActiveSupport::Multibyte In Rails 1.2 (see RC1 blog post) Adds .chars method to all strings "Ādam".chars.length => 4 Optional C extension for speed

Controllers Ensure all parameters are Unicode Use a filter in ApplicationController e.g. convert from Windows-1252 to UTF-8.

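require 'iconv'

class ApplicationController < ActionController::Base
  @@conv = Iconv.new "UTF-8", "WINDOWS-1252"

  before_filter :fix_windows_1252

  private

  # Convert the parameters only when they are not already valid UTF-8.
  def fix_windows_1252
    fix_windows_1252_in_hash(request.parameters) unless is_utf8(request.parameters.to_s)
  end

  # is_utf8 is not defined on the slide; this is one possible definition.
  # unpack("U*") raises ArgumentError on malformed UTF-8.
  def is_utf8(str)
    str.unpack("U*")
    true
  rescue ArgumentError
    false
  end

  def fix_windows_1252_in_hash(h)
    h.each do |k, v|
      if v.is_a?(Hash)
        fix_windows_1252_in_hash(v)
      elsif v.is_a?(Array)
        v.map! { |item| to_utf8(item) }
      else
        h[k] = to_utf8(v)
      end
    end
  end

  # Leave non-String values (e.g. uploaded files) untouched.
  def to_utf8(value)
    value.is_a?(String) ? @@conv.iconv(value) : value
  end
end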
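The same idea can be tried outside Rails. The sketch below is an assumption-laden variant rather than the filter from the slide: it targets Ruby 1.9 or later (where iconv has since been replaced by String#encode), returns a converted copy instead of modifying the hash in place, and utf8_from_windows_1252 is a made-up name.

```ruby
# Assumes Ruby 1.9+; utf8_from_windows_1252 is a hypothetical helper.
def utf8_from_windows_1252(value)
  case value
  when Hash
    value.each_with_object({}) { |(k, v), out| out[k] = utf8_from_windows_1252(v) }
  when Array
    value.map { |item| utf8_from_windows_1252(item) }
  when String
    candidate = value.dup.force_encoding("UTF-8")
    if candidate.valid_encoding?
      candidate   # already valid UTF-8: leave the octets alone
    else
      # Treat the octets as Windows-1252; replace the few undefined ones.
      value.dup.force_encoding("Windows-1252")
           .encode("UTF-8", invalid: :replace, undef: :replace)
    end
  else
    value         # numbers, uploaded files, nil and so on
  end
end

params = { "q" => "\x93foo\x94", "tags" => ["caf\xE9"], "user" => { "name" => "\u0100dam" } }
fixed = utf8_from_windows_1252(params)
puts fixed["q"]   # the smart quotes arrive as U+201C and U+201D
```

Checking each string separately, rather than the whole parameter set at once, avoids re-encoding values that were already UTF-8 when only one field was not.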

Models As with Controllers, most issues are in Ruby itself But keep an eye out for String processing

Validation validates_format_of validates_length_of

Databases Most of the model Need to ensure we can get out what we put in


MySQL
ALTER DATABASE dev CHARACTER SET utf8;
SET NAMES 'utf8';
encoding: utf8 in database.yml

PostgreSQL
CREATE DATABASE foo ENCODING = 'UTF-8';
SET client_encoding = 'UTF-8';
encoding: UTF-8 in database.yml

SELECT name,setting FROM pg_settings WHERE name LIKE 'lc_%';

Controllers (again) Have to tell HTTP what character encoding you are sending Content-Type: text/html; charset=UTF-8

NB: Content-Length is bytes, not characters
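The difference is easy to see on any string with a multi-byte character. A small sketch, assuming Ruby 1.9 or later, where String#length counts characters and String#bytesize counts octets (on Ruby 1.8, length already counts octets):

```ruby
body = [0x100, 0x64, 0x61, 0x6d].pack("U*")   # "Ādam" as UTF-8

puts body.length     # 4 characters on Ruby 1.9+
puts body.bytesize   # 5 octets: this is the number Content-Length needs
```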

class ApplicationController < ActionController::Base
  after_filter :fix_charset

  def fix_charset
    headers["Content-Type"] ||= "text/html; charset=UTF-8"

    if headers["Content-Type"].include?('text/') &&
       !headers["Content-Type"].include?('charset')
      headers["Content-Type"] += "; charset=UTF-8"
    end
  end
end

View Specify the encoding in the HTML itself as well (e.g. a meta tag), in case the page is saved. Watch out for helpers that go near Strings, e.g. excerpt, highlight, truncate
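The risk with such helpers is cutting a string in the middle of a multi-byte character. The sketch below contrasts an octet-based cut with a code-point-based one; truncate_chars is a hypothetical helper, not the Rails truncate, and the byteslice and valid_encoding? calls assume Ruby 1.9.3 or later:

```ruby
# Hypothetical helper: truncate by code points rather than octets.
def truncate_chars(text, limit)
  code_points = text.unpack("U*")
  return text if code_points.length <= limit
  code_points.first(limit).pack("U*")
end

text = [0x100, 0x64, 0x61, 0x6d].pack("U*")   # "Ādam" as UTF-8

broken = text.byteslice(0, 1)       # cuts Ā in half
puts broken.valid_encoding?         # false: no longer valid UTF-8
puts truncate_chars(text, 2)        # "Ād", still valid
```

Cutting on code points can still separate a base letter from a combining accent, so normalising to NFC first reduces, but does not remove, that risk.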

View link_to Safe! If using UTF-8 everywhere

View JavaScript Should be Unicode safe in all browsers

Apache Good tip for .htaccess AddDefaultCharset UTF-8

Conclusion Unicode is hard Ruby doesn’t have much support Rails is better Use UTF-8 everywhere Test, test, test

ⓐⓢⓒⓘⓘ ⓢⓣⓤⓟⓘⓓ ⓠⓤⓔⓢⓣⓘⓞⓝ ⒢⒠⒯ ⒜ ⒮⒯⒰⒫⒤⒟ ⒜⒩⒮⒤