Unicode 101
How to avoid corrupting international text
David Foster
Goal
Learn just enough to:
– Avoid corrupting international text in your code
Out of Scope
• Internationalization (i18n)
  – Extending a program to emit messages in multiple languages
• Localization (l10n)
  – Extending a program to emit messages in a specific language, such as German
• Manipulating Unicode characters within strings
Problems
• Customer A writes some text to a file or app. Customer B reads it back, but it is different. In particular, it has a bunch of ??? or ��� in it.
  – ß ➔ �
• UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
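Both symptoms can be reproduced in a few lines. The following is a minimal Python 3 sketch, not taken from the original slides; the variable names are illustrative. It shows one plausible way each symptom arises: a failed or lossy encode produces the error or "?", and a lossy decode with the wrong encoding produces "�".

```python
# Python 3 sketch: how an encode error, "?" and "�" can end up in text.

# Encoding a character that the target encoding cannot represent fails loudly...
try:
    "\ua000".encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# ...unless errors are replaced, which silently turns the character into "?".
question_mark = "ß".encode("ascii", errors="replace")

# Decoding bytes with the wrong encoding (Latin-1 bytes read as UTF-8) while
# replacing errors yields U+FFFD, the replacement character "�".
replacement = "ß".encode("latin-1").decode("utf-8", errors="replace")

print(error_message)
print(question_mark)
# ascii() keeps the output printable on any terminal.
print(ascii(replacement))
```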
Bytes vs. Characters
Byte stream: 77 101 105 110 32 70 117 195 159
Decode using the character encoding: utf-8 (☟ often forgotten!)
Character stream: M e i n (space) F u ß
☝ ß is multiple bytes wide (195 159)!
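The decoding step can be tried directly. This is a small Python 3 sketch using the byte values from the slide; the variable names are illustrative.

```python
# The nine bytes from the slide, as a byte string.
byte_stream = bytes([77, 101, 105, 110, 32, 70, 117, 195, 159])

# Turning bytes into characters requires naming a character encoding.
character_stream = byte_stream.decode("utf-8")

# Nine bytes become eight characters: "ß" occupies two bytes (195, 159) in UTF-8.
byte_count = len(byte_stream)
character_count = len(character_stream)

# Encoding goes the other way, again with an explicit encoding.
round_trip = character_stream.encode("utf-8")

print(byte_count, character_count)
```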
What is the character encoding?
• There is usually some signal (sometimes out-of-band) that specifies the encoding that should be used to interpret a byte stream as characters.
  – HTTP: Content-Type: text/html; charset=UTF-8
  – HTML: <meta charset="UTF-8"/>
  – XML: <?xml version="1.0" encoding="UTF-8"?>
  – Python: # -*- coding: utf-8 -*-
  – POSIX: LANG=en_US.UTF-8
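As an illustration of using such a signal, the sketch below reads the charset out of an HTTP Content-Type header value with Python's standard `email.message` module and uses it to decode a body. The helper name `charset_from_content_type` and the sample body are assumptions made for this example, not part of the original slides.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, lowercased.

    Hypothetical helper for illustration; returns `default` if no charset
    parameter is present.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


declared = charset_from_content_type("text/html; charset=UTF-8")
missing = charset_from_content_type("text/html")

# Use the declared encoding to turn the body bytes into characters.
body_bytes = "Mein Fuß tut weh!".encode("utf-8")
body_text = body_bytes.decode(declared)

print(declared, missing)
```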
What is the character encoding?
• Unfortunately some types of files don't contain any information about their encoding.
  – Text files (*.txt)
    • Usually the OS default character encoding is assumed, which depends on its locale. Yikes.
  – JSON files (*.json)
    • Usually UTF-8 is assumed, but other Unicode encodings are permitted by RFC 4627. (Its successor, RFC 8259, requires UTF-8 for JSON exchanged between systems.)
  – Java source files (*.java)
    • Encoding is derived from the -encoding compiler flag.
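A rough Python 3 sketch of two of these cases follows. It assumes Python 3.6 or later, where `json.loads` accepts bytes and detects UTF-8, UTF-16, or UTF-32 on its own. `locale.getpreferredencoding(False)` reports the locale-dependent encoding that `open()` has traditionally assumed for text files when none is given; the exact default can differ between Python versions and platforms.

```python
import json
import locale

# The encoding that is typically assumed for a text file when nothing is
# specified. It depends on the machine's locale, so the value varies.
default_encoding = locale.getpreferredencoding(False)

# The same JSON document stored in two different Unicode encodings.
document = '{"message": "Mein Fuß tut weh!"}'
as_utf8 = document.encode("utf-8")
as_utf16 = document.encode("utf-16")

# json.loads inspects the leading bytes to pick the Unicode encoding.
from_utf8 = json.loads(as_utf8)
from_utf16 = json.loads(as_utf16)

print(default_encoding)
```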
Big Mistake #1
You cannot interpret a byte sequence as a character sequence without knowing the character encoding.
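To see why, here is a short Python 3 sketch that decodes the same bytes under two different assumptions. The choice of Windows-1252 as the wrong guess is only an example.

```python
# The same bytes, interpreted under two different encoding assumptions.
data = "Mein Fuß".encode("utf-8")

# Correct guess: the text comes back intact.
as_utf8 = data.decode("utf-8")

# Wrong guess: the two bytes of "ß" (0xC3 0x9F) are read as two separate
# Windows-1252 characters, "Ã" and "Ÿ".
as_cp1252 = data.decode("cp1252")

# ascii() keeps the output printable on any terminal.
print(ascii(as_utf8))
print(ascii(as_cp1252))
```

Neither call raises an error, so the wrong guess goes unnoticed until someone reads the text.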
What's wrong with this code? (A1)
#!/usr/bin/python2.7
with open("names.txt", "r") as f:
    for name in f:
        print('Hello ' + name.strip())
• No character encoding is specified!
  – Python 2 does no decoding here: the raw bytes pass through unchanged and are interpreted in whatever encoding the terminal uses, which depends on the OS locale.
  – Therefore a customer running this program on a Japanese OS will see different text than on an English OS!
• Reads byte strings instead of character strings!
What's wrong with this code? (A1)
#!/usr/bin/python2.7
import codecs
with codecs.open("names.txt", "r", "utf-8") as f:
    for name in f:
        print(u'Hello ' + name.strip())
• Fixed. Will always read character strings, and as UTF-8.
What's wrong with this code? (A2)
#!/usr/bin/python3.4
with open("names.txt", "r") as f:
    for name in f:
        print('Hello ' + name.strip())
• No character encoding is specified!
  – Python 3.4 falls back to the locale's preferred encoding, which varies by OS and locale.
What's wrong with this code? (A2)
#!/usr/bin/python3.4
with open("names.txt", "r", encoding="utf-8") as f:
    for name in f:
        print('Hello ' + name.strip())
• Fixed. Will always read as UTF-8.
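A self-contained Python 3 variant of the fixed program is sketched below. The sample names and the temporary file are assumptions added so that the example runs on its own; the point is that writing and reading both name the encoding explicitly.

```python
import os
import tempfile

# Sample data for illustration only.
names = ["Jürgen", "Zoë", "名前"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.txt")

    # Write with an explicit encoding...
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(names) + "\n")

    # ...and read back with the same explicit encoding, independent of locale.
    with open(path, "r", encoding="utf-8") as f:
        greetings = ["Hello " + name.strip() for name in f]

# ascii() keeps the output printable on any terminal.
print(ascii(greetings))
```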
What's wrong with this code? (B)
<!DOCTYPE html>
<html>
  <head>
    <title>Krankenzimmer</title>
  </head>
  <body>Mein Fuß tut weh!</body>
</html>
• No character encoding is specified!
What's wrong with this code? (B)
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8"/>
    <title>Krankenzimmer</title>
  </head>
  <body>Mein Fuß tut weh!</body>
</html>
• Fixed. Declares self as UTF-8 encoded.
What's wrong with this code? (C)
<?xml version="1.0"?>
<messages>
  <message>Mein Fuß tut weh!</message>
</messages>
• No character encoding is specified!
What's wrong with this code? (C)
<?xml version="1.0" encoding="UTF-8"?>
<messages>
  <message>Mein Fuß tut weh!</message>
</messages>
• Fixed. Declares self as UTF-8 encoded.
What's wrong with this code? (D)
// C#
// TextReader is a character stream
// OpenText always assumes UTF-8 encoding
using (TextReader r = File.OpenText("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(r);
    ...
}
• The encoding declaration in the XML is ignored! UTF-8 is always forced.
What's wrong with this code? (D)
// C#
// Stream is a byte stream
using (Stream s = File.OpenRead("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(s);
    ...
}
• Fixed. XmlDocument will internally determine the encoding based on the declaration in the byte stream.
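The same principle can be sketched in Python 3 with the standard library's `xml.etree.ElementTree`. This is an analogous illustration, not a translation of the C# code: the sample document is an assumption and is deliberately encoded as ISO-8859-1 to show that the parser honors the declaration when given bytes.

```python
import xml.etree.ElementTree as ET

# An XML document encoded as ISO-8859-1, where "ß" is the single byte 0xDF.
xml_bytes = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<messages><message>Mein Fu\xdf tut weh!</message></messages>"
)

# Feeding bytes lets the parser read the encoding declaration itself.
root = ET.fromstring(xml_bytes)
message = root.find("message").text

# Forcing UTF-8 before parsing ignores the declaration and fails on 0xDF.
try:
    xml_bytes.decode("utf-8")
    forced_utf8_failed = False
except UnicodeDecodeError:
    forced_utf8_failed = True

# ascii() keeps the output printable on any terminal.
print(ascii(message), forced_utf8_failed)
```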
Big Mistake #2
Bytes and characters are not the same thing. Do not mix them.
Unfortunately many languages blur the line between byte strings and character strings.
– Python 2.x
  • String literals are byte strings by default.
  • Byte and ASCII character strings are implicitly convertible.
– C / C++
  • String functions in the C standard library manipulate byte strings by default.
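For contrast, Python 3 keeps the two types strictly apart. The sketch below uses illustrative variable names. It shows that a character string and its encoded bytes have different lengths, and that mixing them raises an error instead of converting implicitly.

```python
text = "Fuß"                  # character string (str)
data = text.encode("utf-8")   # byte string (bytes)

# Three characters, but four bytes, because "ß" takes two bytes in UTF-8.
character_count = len(text)
byte_count = len(data)

# Python 3 refuses to mix the two types.
try:
    b"Mein " + text
    mixing_rejected = False
except TypeError:
    mixing_rejected = True

# The explicit route: decode the bytes first, then combine characters.
combined = b"Mein ".decode("ascii") + text

print(character_count, byte_count, mixing_rejected)
```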
What's wrong with this code? (E1)
#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
print('Mein Fuß tut weh!')
• A byte string (with international chars) was printed. Only character strings should be printed.
  – On OS X, which has the UTF-8 locale by default rather than Windows-1252, the second word will be printed as "Fu?" instead of "Fuß".
What's wrong with this code? (E1)
#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
print(u'Mein Fuß tut weh!')
• This is the smallest possible fix.
What's wrong with this code? (E1)
#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
from __future__ import unicode_literals
print('Mein Fuß tut weh!')
• A better fix, since it avoids adding u'…' everywhere.
What's wrong with this code? (E2)
#!/usr/bin/python3.4
# -*- coding: windows-1252 -*-
print('Mein Fuß tut weh!')
• Nothing!
  – Python 3.x interprets string literals as character strings by default.
What's wrong with this code? (F)
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
import codecs
with codecs.open('hurts.txt', 'r', 'utf-8') as f:
    status = f.read().strip()
print('Schädigung: ' + status)
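• Mixing a byte string literal with character input.
  – Python 2.x interprets string literals as bytes by default.
  – The concatenation implicitly decodes the byte string as ASCII, which raises UnicodeDecodeError on the "ä".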
What's wrong with this code? (F)
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
with codecs.open('hurts.txt', 'r', 'utf-8') as f:
    status = f.read().strip()
print('Schädigung: ' + status)
• Fixed. All strings are character strings now.
Summary: Special Considerations
• Python 2.x
  – String literals are byte strings by default rather than characters.
  – Implicitly converts between byte strings and ASCII character strings.
• HTML, CSS, JavaScript
  – Must declare an encoding in HTML.
• XML files
  – Must declare an encoding in XML. Must honor such a declaration.
  – Feed bytes to XML parsers rather than characters.
• Text files
  – Must always assume an encoding. Usually UTF-8.
Don't Forget
1. You cannot interpret a byte sequence as a character sequence without knowing the character encoding.
2. Bytes and characters are not the same thing. Do not mix them.
Thank You
More broken programs…
What's wrong with this code? (#1)
// Java
Reader r = new FileReader("names.txt");
• No character encoding is specified!
  – Before Java 18, Java falls back to the OS default character encoding, which depends on its locale.
  – Therefore a customer running this program on a Japanese OS will read different text than on an English OS!
What's wrong with this code? (#1)
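// Java
// FileReader only accepts a charset argument on Java 11 and later,
// so wrap a byte stream in an InputStreamReader instead.
Reader r = new InputStreamReader(
    new FileInputStream("names.txt"), "UTF-8");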
• Fixed. Will always read as UTF-8.
What's wrong with this code? (#2)
// C#
TextReader r = new StreamReader("names.txt");
• Nothing!
  – C#'s StreamReader always uses UTF-8 encoding if no encoding is specified.
  – You must always read the documentation. Don't assume.
What's wrong with this code? (#2)
// C#
TextReader r = new StreamReader("names.txt", Encoding.UTF8);
• Nevertheless, always explicitly specifying the encoding is still a good idea.