Unicode for Rails Dominic Mitchell
Introduction to Unicode
What Is Unicode? Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. http://www.unicode.org/standard/WhatIsUnicode.html
More often...
Characters A, A, A == U+0041 a, a, a == U+0061 Ā == U+0100 ﬃ == U+FB03 ☺ == U+263A ☂, ☃, ✄, ☠, …
ASCII “In the beginning...” First published in 1963, revised in 1967. 7-bit.
ISO-8859 Mid ‘80s 8-bit ASCII superset 16 different, related standards ISO-8859-1 (aka Latin-1) is most common
Windows-1252 Like ISO-8859-1, but with extra characters e.g. smart quotes, em dash The bane of your life
Unicode 21-bit Pretty much all characters in use, in the same character set.
But There’s More! Unicode also specifies: Character Properties Encodings Algorithms
Sounds Complex? It is. The real world is complex. But you can get by on a fairly minimal subset...
Encodings How do you turn characters into octets? It’s simple for ASCII & ISO-8859. Unicode has three different schemes.
UTF-32
code point   | character       | UTF-32 code value(s) | glyph
122 (7A)     | small Z (Latin) | 00 00 00 7A          | z
27700 (6C34) | water (Chinese) | 00 00 6C 34          | 水
4 octets (32 bits) per character. Very inefficient. Not used much.
UTF-16
code point   | character       | UTF-16 code value(s) | glyph
122 (7A)     | small Z (Latin) | 00 7A                | z
27700 (6C34) | water (Chinese) | 6C 34                | 水
2 octets (16 bits) per character (mostly). Common on Windows & Java. Somewhat wasteful for mostly Western text.
UTF-8
code point   | character       | UTF-8 code value(s) | glyph
122 (7A)     | small Z (Latin) | 7A                  | z
27700 (6C34) | water (Chinese) | E6 B0 B4            | 水
Multi-Byte, but ASCII compatible. Very common in Internet protocols. Reliably recognisable.
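The three tables above can be reproduced with a few lines of Ruby. This is only a sketch: it assumes Ruby 1.9 or later (where String#encode exists), uses the big-endian variants to match the octet order shown in the tables, and hex_octets is a made-up helper name.

```ruby
# Sketch only: assumes Ruby 1.9+ for String#encode.
# hex_octets is a hypothetical helper, not part of any library.
def hex_octets(codepoint, encoding)
  char = [codepoint].pack("U")   # build the character (as UTF-8) from its code point
  char.encode(encoding).bytes.map { |b| format("%02X", b) }.join(" ")
end

# Big-endian variants are used so the octets line up with the tables.
%w[UTF-32BE UTF-16BE UTF-8].each do |enc|
  puts "#{enc}: z = #{hex_octets(0x7A, enc)}, water = #{hex_octets(0x6C34, enc)}"
end
```

The same code point yields four, two, or between one and three octets depending on the scheme, which is the size trade-off the slides describe.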
Which Encoding? By default, pick UTF-8. Choose UTF-16 when: you have lots of non-Western text; you are interfacing with other UTF-16 systems.
Accents Some are built-in (e.g. é) But you can build your own with “combining characters” (e.g. ĵ) U+006A LATIN SMALL LETTER J U+0302 COMBINING CIRCUMFLEX ACCENT
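A small sketch of the combining-character idea, using pack("U*") to build strings from code points. The method names are illustrative only. Note that Unicode also has a precomposed ĵ (U+0135), so the same visible character can be spelled two ways.

```ruby
# Sketch only: precomposed_j and combined_j are hypothetical helper names.

# U+0135 LATIN SMALL LETTER J WITH CIRCUMFLEX, a single code point.
def precomposed_j
  [0x0135].pack("U*")
end

# U+006A LATIN SMALL LETTER J followed by U+0302 COMBINING CIRCUMFLEX ACCENT.
def combined_j
  [0x006A, 0x0302].pack("U*")
end

puts combined_j.unpack("U*").length     # 2 code points, rendered as one glyph
puts precomposed_j.unpack("U*").length  # 1 code point
puts precomposed_j == combined_j        # false: the underlying bytes differ
```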
Normalisation How can I spot é if There’s More Than One Way To Do It? “normalize” all strings before use Four forms of normalisation NFC, NFD, NFKC, NFKD But only NFC matters, “says W3C”
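To illustrate why normalising before comparison helps, here is a hedged sketch. It assumes Ruby 2.2 or later, where String#unicode_normalize is available (it did not exist when these slides were written), and the helper names are invented for the example.

```ruby
# Sketch only: assumes Ruby 2.2+ for String#unicode_normalize.
# e_acute_precomposed, e_acute_decomposed and same_text? are hypothetical names.

def e_acute_precomposed
  [0x00E9].pack("U*")          # é as one code point
end

def e_acute_decomposed
  [0x0065, 0x0301].pack("U*")  # e + COMBINING ACUTE ACCENT
end

# Compare two strings after bringing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

puts e_acute_precomposed == e_acute_decomposed           # false
puts same_text?(e_acute_precomposed, e_acute_decomposed) # true
```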
Why bother? It’s more work now... But it opens everything up in the future! The rest of the world is heading this way
Unicode in Rails
Where to start? Examine a typical request. Where to begin thinking about Unicode?
Model? Controller? HTTP Headers? URI? Domain name!
IDN Internationalized Domain Names Punycode iñtërnâtiônàlizætiøn.net xn--itrntinliztin-vdb0a5exd8ewcye.net
URIs Called IRIs when used with Unicode Must use percent-encoded UTF-8 Not %uXXXX (IE only)
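As a rough illustration of percent-encoded UTF-8, the sketch below uses URI.encode_www_form_component from Ruby's standard library (available in Ruby 1.9.2 and later). The percent_encode wrapper is a made-up name. This particular method follows form-encoding rules, so it also turns spaces into "+".

```ruby
require 'uri'

# Sketch only: percent_encode is a hypothetical wrapper name.
# Each octet of the UTF-8 encoding becomes one %XX escape.
def percent_encode(text)
  URI.encode_www_form_component(text)
end

puts percent_encode([0x00E9].pack("U"))  # é  -> %C3%A9
puts percent_encode([0x6C34].pack("U"))  # 水 -> %E6%B0%B4
```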
Browsers HTML Forms as input Uses page charset unless... Form has @accept-charset Patchy support...
Finally We get to your code, in the controller i.e. Ruby
Ruby & Unicode Bad reputation Somewhat deserved: Ruby understands bytes Not characters
But! There’s a magic flag! -K kcode Specifies KANJI (Japanese) encoding. -Ku turns on UTF-8 mode $KCODE = "UTF8"
$KCODE =~ /^u/i Sets encoding in Tk Allows CGI::unescapeHTML to output UTF-8 SOAP libs use it here and there Big user is the regex engine /./u matches a UTF-8 char
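A minimal sketch of the /u regex flag in use: splitting a string into UTF-8 characters rather than bytes. utf8_chars is an illustrative name; the /m flag is added so that newlines are matched too.

```ruby
# Sketch only: utf8_chars is a hypothetical helper name.
# /u makes "." match one whole UTF-8 character; /m lets it match newlines.
def utf8_chars(text)
  text.scan(/./mu)
end

name = [0x100, 0x64, 0x61, 0x6d].pack("U*")  # "Ādam"
puts utf8_chars(name).length                 # 4 characters
```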
pack / unpack The only other place Ruby understands UTF-8. [0x100, 0x64, 0x61, 0x6d].pack("U*") =>"Ādam" "Ādam".unpack("U*") => [256, 100, 97, 109]
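Building on the unpack example above, one possible way to count characters rather than bytes is to count the unpacked code points. The helper names here are invented for illustration.

```ruby
# Sketch only: char_count and byte_count are hypothetical helper names.

# Number of Unicode code points in a UTF-8 string.
def char_count(utf8)
  utf8.unpack("U*").length
end

# Number of octets in the same string.
def byte_count(utf8)
  utf8.unpack("C*").length
end

name = [0x100, 0x64, 0x61, 0x6d].pack("U*")  # "Ādam"
puts char_count(name)  # 4
puts byte_count(name)  # 5, because Ā takes two octets in UTF-8
```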
Unicode affects... Any character processing. In String: [] []= =~ == capitalize casecmp center chomp chop count delete downcase dump each eql? gsub index insert length ljust lstrip replace reverse rindex rjust rstrip scan slice split squeeze strip sub succ swapcase tr upcase upto
And regexes
jcode.rb Core library Enhances String "Ādam".length => 5 "Ādam".jlength => 4 Not very complete
iconv Another core library Converts between character encodings conv = Iconv.new("UTF-8", "WINDOWS-1252") conv.iconv "\223foo\224" => "“foo”"
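For readers on newer Rubies: Iconv is no longer part of the standard library from Ruby 2.0 onwards. A sketch of the same Windows-1252 to UTF-8 conversion with String#encode (Ruby 1.9+) might look like the following; windows_1252_to_utf8 is a made-up name.

```ruby
# Sketch only: assumes Ruby 1.9+; windows_1252_to_utf8 is a hypothetical name.
# The second argument tells encode how to interpret the incoming bytes.
def windows_1252_to_utf8(bytes)
  bytes.encode("UTF-8", "Windows-1252")
end

quoted = windows_1252_to_utf8("\x93foo\x94")
puts quoted.unpack("U*").inspect  # first and last code points are the curly quotes U+201C and U+201D
```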
Many Alternatives icu4r, unicode, utf8proc, characterencodings But they’re less relevant as of Rails 1.2
ActiveSupport::Multibyte In Rails 1.2 (see RC1 blog post) Adds .chars method to all strings "Ādam".chars.length => 4 Optional C extension for speed
Controllers Ensure all parameters are Unicode Use a filter in ApplicationController e.g. convert from Windows-1252 to UTF-8.
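require 'iconv'

class ApplicationController < ActionController::Base
  # Converts Windows-1252 input to UTF-8.
  @@conv = Iconv.new "UTF-8", "WINDOWS-1252"

  before_filter :fix_windows_1252

  # Re-encode the parameters only if they are not already valid UTF-8.
  def fix_windows_1252
    params = request.parameters
    fix_windows_1252_in_hash(params) unless is_utf8(params.to_s)
  end

  # Walk nested parameter hashes, converting every value in place.
  def fix_windows_1252_in_hash(h)
    h.each do |k,v|
      if v.is_a?(Hash)
        fix_windows_1252_in_hash(v)
      elsif v.is_a?(Array)
        v.map! { |item| @@conv.iconv(item) }
      else
        h[k] = @@conv.iconv(v)
      end
    end
  end

  # Assumption: is_utf8 is not defined on the slide. One possible version:
  # unpack("U*") raises ArgumentError on malformed UTF-8.
  def is_utf8(str)
    str.unpack("U*")
    true
  rescue ArgumentError
    false
  end
end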
Models As with Controllers, most issues are in Ruby itself But keep an eye out for String processing
Validation validates_format_of validates_length_of
Databases Most of the model Need to ensure we can get out what we put in
MySQL
ALTER DATABASE dev CHARACTER SET utf8; SET NAMES 'utf8'; encoding: utf8 in database.yml
PostgreSQL CREATE DATABASE foo ENCODING = 'UTF-8'; SET client_encoding = 'UTF-8'; encoding: UTF-8 in database.yml
SELECT name,setting FROM pg_settings WHERE name LIKE 'lc_%';
Controllers (again) Have to tell HTTP what character encoding you are sending Content-Type: text/html; charset=UTF-8
NB: Content-Length is bytes, not characters
class ApplicationController < ActionController::Base
  after_filter :fix_charset

  # Make sure every text response declares its charset.
  def fix_charset
    headers["Content-Type"] ||= "text/html; charset=UTF-8"

    if headers["Content-Type"].include?('text/') && \
       !headers["Content-Type"].include?('charset')
      headers["Content-Type"] += "; charset=UTF-8"
    end
  end
end
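The header logic can be tried outside Rails as a plain function, together with the earlier point that Content-Length counts bytes. This is only a sketch: with_charset and content_length are invented names, and String#bytesize assumes Ruby 1.9 or later.

```ruby
# Sketch only: with_charset and content_length are hypothetical helper names.

# Return a Content-Type value that always names a charset for text types.
def with_charset(content_type)
  content_type ||= "text/html; charset=UTF-8"
  if content_type.include?("text/") && !content_type.include?("charset")
    content_type += "; charset=UTF-8"
  end
  content_type
end

# Content-Length must be the number of octets, not characters.
# Assumes Ruby 1.9+ for String#bytesize.
def content_length(body)
  body.bytesize
end

puts with_charset("text/plain")                               # text/plain; charset=UTF-8
puts content_length([0x100, 0x64, 0x61, 0x6d].pack("U*"))     # 5 for the 4-character "Ādam"
```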
View Specify encoding in a <meta> tag as well In case page is saved Watch out for helpers that go near Strings e.g. excerpt, highlight, truncate
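To show why string helpers need care, here is a hedged sketch of a truncation that cuts on character boundaries instead of bytes, so a multi-byte character is never split in half. truncate_chars is a made-up name and is not the Rails truncate helper.

```ruby
# Sketch only: truncate_chars is a hypothetical helper, not the Rails helper.
# Works on code points, so it never cuts a UTF-8 sequence in the middle.
def truncate_chars(text, limit)
  codepoints = text.unpack("U*")
  return text if codepoints.length <= limit
  codepoints[0, limit].pack("U*")
end

name = [0x100, 0x64, 0x61, 0x6d].pack("U*")      # "Ādam"
puts truncate_chars(name, 2).unpack("U*").length # 2 characters kept
```

Cutting on code points still does not account for combining characters, which could be separated from their base letter; normalising to NFC first reduces that risk.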
View link_to Safe! If using UTF-8 everywhere
View JavaScript Should be Unicode safe in all browsers
Apache Good tip for .htaccess AddDefaultCharset UTF-8
Conclusion Unicode is hard Ruby doesn’t have much support Rails is better Use UTF-8 everywhere Test, test, test
ⓐⓢⓒⓘⓘ ⓢⓣⓤⓟⓘⓓ ⓠⓤⓔⓢⓣⓘⓞⓝ ⒢⒠⒯ ⒜ ⒮⒯⒰⒫⒤⒟ ⒜⒩⒮⒤