Unicode 101

How to avoid corrupting international text

David Foster

Goal

Learn just enough to:

– Avoid corrupting international text in your code

Out of Scope

• Internationalization (i18n)
– Extending a program to emit messages in multiple languages

• Localization (l10n)
– Extending a program to emit messages in a specific language, such as German

• Manipulating Unicode characters within strings

Problems

• Customer A writes some text to a file or app. Customer B reads it back, but it is different. In particular, it contains a bunch of ??? or �� characters.
– ß ➔ �

• UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
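Both symptoms can be reproduced in a few lines. The following is a minimal Python 3 sketch, not taken from the slides; the sample text and the choice of Windows-1252 as the mismatched encoding are assumptions made for illustration.

```python
# Minimal sketch (Python 3). The sample text and the mismatched
# encodings are illustrative assumptions.
text = 'Mein Fuß'

# Symptom 1: characters that cannot be encoded are replaced by '?'.
as_ascii = text.encode('ascii', errors='replace')

# Symptom 2: bytes written as Windows-1252 but read back as UTF-8
# turn into U+FFFD, the replacement character.
garbled = text.encode('windows-1252').decode('utf-8', errors='replace')

# Symptom 3: without an error handler, encoding raises instead.
try:
    '\ua000'.encode('ascii')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(as_ascii)
print(error_message)
```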

Bytes vs. Characters

Byte Stream: 77 101 105 110 32 70 117 195 159

↓ Decode utf-8 (the Character Encoding)

Character Stream: M e i n (space) F u ß

• ß is multiple bytes wide (195 159)!
• The character encoding is often forgotten!
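The decoding step can be tried directly. This is a small Python 3 sketch using the byte values shown on the slide; the variable names are chosen here for illustration.

```python
# Minimal sketch (Python 3): the byte values from the slide, decoded as UTF-8.
byte_stream = bytes([77, 101, 105, 110, 32, 70, 117, 195, 159])

# Decoding needs the character encoding; here it is known to be UTF-8.
character_stream = byte_stream.decode('utf-8')

# 9 bytes become 8 characters, because ß occupies two bytes (195 159).
byte_count = len(byte_stream)
character_count = len(character_stream)
```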

What is the character encoding?

• There is usually some signal (sometimes out-of-band) that specifies the encoding that should be used to interpret a byte stream as characters.

– HTTP: Content-Type: text/html; charset=UTF-8
– HTML: <meta charset="UTF-8"/>
– XML: <?xml version="1.0" encoding="UTF-8"?>
– Python: # -*- coding: utf-8 -*-
– POSIX: LANG=en_US.UTF-8
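As one example of reading such a signal, the charset parameter of an HTTP Content-Type header can be extracted with Python's standard library. This is a sketch only; the helper name charset_from_content_type is hypothetical and not part of the slides.

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, or default.

    Hypothetical helper for illustration; it reuses the MIME header
    parser from the standard library.
    """
    message = Message()
    message['Content-Type'] = header_value
    return message.get_content_charset(default)

declared = charset_from_content_type('text/html; charset=UTF-8')
missing = charset_from_content_type('text/html')
```

The parser lowercases the charset name. When no charset is declared the caller still has to decide which encoding to assume, which is why the helper takes an explicit default.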

What is the character encoding?

• Unfortunately some types of files don't contain any information about their encoding.

– Text files (*.txt)
  • Usually the OS default character encoding is assumed, which depends on its locale. Yikes.

– JSON files (*.json)
  • Usually UTF-8 is assumed, but other Unicode encodings are permitted by RFC 4627.

– Java source files (*.java)
  • Encoding is derived from the -encoding compiler flag.
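For JSON in particular, one way to avoid guessing is to hand the raw bytes to the parser. The sketch below assumes Python 3.6 or later, where json.loads accepts bytes and detects UTF-8, UTF-16, or UTF-32 on its own; the sample payload is made up for illustration.

```python
import json

# Hypothetical payload: the same JSON document stored in two Unicode encodings.
document = '{"word": "Fuß"}'
utf8_bytes = document.encode('utf-8')
utf16_bytes = document.encode('utf-16')  # includes a byte order mark

# json.loads detects the Unicode encoding of bytes input itself,
# so the bytes never pass through a guessed decode step.
from_utf8 = json.loads(utf8_bytes)
from_utf16 = json.loads(utf16_bytes)
```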

Big Mistake #1

You cannot interpret a byte sequence as a character sequence without knowing the character encoding.
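To see why, consider the two bytes that encode ß in UTF-8. The Python 3 sketch below decodes them under three encodings; the choice of alternative encodings is an assumption for illustration.

```python
# Minimal sketch (Python 3): one byte sequence, three interpretations.
raw = bytes([195, 159])

as_utf8 = raw.decode('utf-8')           # the intended character, ß
as_cp1252 = raw.decode('windows-1252')  # two characters: Ã followed by Ÿ
as_utf16 = raw.decode('utf-16-le')      # a single unrelated character

# None of these calls fails, so the bytes alone cannot reveal
# which interpretation the writer intended.
```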

What's wrong with this code? (A1)

#!/usr/bin/python2.7
with open("names.txt", "r") as f:
    for name in f:
        print('Hello ' + name.strip())



• No character encoding is specified!
– Python will fall back to the OS default character encoding, which depends on its locale.
– Therefore a customer running this program on a Japanese OS will read different text than a customer on an English OS!

• Reads byte strings instead of character strings!


#!/usr/bin/python2.7
import codecs
with codecs.open("names.txt", "r", "utf-8") as f:
    for name in f:
        print(u'Hello ' + name.strip())

• Fixed. Will always read character strings, and as UTF-8.



• No character encoding is specified!


#!/usr/bin/python3.4
with open("names.txt", "r", encoding="utf-8") as f:
    for name in f:
        print('Hello ' + name.strip())

• Fixed. Will always read as UTF-8.
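The effect of the explicit encoding can be checked with a round trip. This is a sketch under stated assumptions: the sample names and the temporary file stand in for the names.txt of the slides, and Latin-1 plays the role of a wrong default encoding.

```python
import os
import tempfile

# Hypothetical sample data standing in for names.txt.
names = ['Jürgen', 'Zoë', '太郎']

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'names.txt')

    # Write and read with the same explicit encoding.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(names) + '\n')
    with open(path, 'r', encoding='utf-8') as f:
        greetings = ['Hello ' + name.strip() for name in f]

    # Reading the same bytes under a different encoding yields different text.
    with open(path, 'r', encoding='latin-1') as f:
        misread = [name.strip() for name in f]
```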

What's wrong with this code? (B)

<!DOCTYPE html>
<html>
  <head>
    <title>Krankenzimmer</title>
  </head>
  <body>Mein Fuß tut weh!</body>
</html>



<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8"/>
    <title>Krankenzimmer</title>
  </head>
  <body>Mein Fuß tut weh!</body>
</html>

• Fixed. Declares self as UTF-8 encoded.

What's wrong with this code? (C)

<?xml version="1.0"?>
<messages>
  <message>Mein Fuß tut weh!</message>
</messages>



<?xml version="1.0" encoding="UTF-8"?>
<messages>
  <message>Mein Fuß tut weh!</message>
</messages>

• Fixed. Declares self as UTF-8 encoded.

What's wrong with this code? (D)

// C#
// TextReader is a character stream
// OpenText always assumes UTF-8 encoding
using (TextReader r = File.OpenText("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(r);
    ...
}

• The encoding declaration in the XML is ignored! UTF-8 is always forced.


// C#
// Stream is a byte stream
using (Stream s = File.OpenRead("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(s);
    ...
}

• Fixed. XmlDocument will internally determine the encoding based on the declaration in the byte stream.
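The same principle can be sketched in Python with xml.etree.ElementTree. This is an analogue of the C# fix rather than a translation of it; the sample document and its ISO-8859-1 declaration are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose declaration says ISO-8859-1,
# so ß is stored as the single byte 0xDF.
xml_bytes = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<messages><message>Mein Fu\xdf tut weh!</message></messages>'
)

# Feeding bytes lets the parser honor the declared encoding.
root = ET.fromstring(xml_bytes)
message = root.find('message').text

# Decoding up front with a forced encoding fails before the parser
# ever sees the declaration.
try:
    xml_bytes.decode('utf-8')
    forced_utf8_ok = True
except UnicodeDecodeError:
    forced_utf8_ok = False
```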

Big Mistake #2

Bytes and characters are not the same thing.

Do not mix them.

Unfortunately many languages blur the line between byte strings and character strings.

– Python 2.x
  • All strings are byte strings by default.
  • Byte and ASCII character strings are implicitly convertible.

– C / C++
  • String functions in the C standard library manipulate byte strings by default.
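For contrast, Python 3 keeps the two types strictly apart. The sketch below is an illustration added here, not part of the slides; it shows that lengths differ between the two forms and that mixing them raises an error instead of converting implicitly.

```python
# Minimal sketch (Python 3): character strings vs. byte strings.
word = 'Fuß'                    # character string (str)
encoded = word.encode('utf-8')  # byte string (bytes)

character_count = len(word)     # counts characters
byte_count = len(encoded)       # counts bytes; ß takes two

# Python 3 refuses to mix the two types.
try:
    'Mein ' + encoded
    mixed = True
except TypeError:
    mixed = False

# Convert explicitly at the boundary, naming the encoding.
sentence = 'Mein ' + encoded.decode('utf-8')
```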

What's wrong with this code? (E1)

#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
print('Mein Fuß tut weh!')



• A byte string (with international chars) was printed. Only character strings should be printed.
– On OS X, which has the UTF-8 locale by default rather than Windows-1252, the second word will be printed as "Fu?" instead of "Fuß".


#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
print(u'Mein Fuß tut weh!')

• This is the smallest possible fix.


#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
from __future__ import unicode_literals
print('Mein Fuß tut weh!')

• A better fix, since it avoids adding u'…' everywhere.



• Nothing!
– Python 3.x interprets string literals as character strings by default.

What's wrong with this code? (F)

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
import codecs
with codecs.open('hurts.txt', 'r', 'utf-8') as f:
    status = f.read().strip()
print('Schädigung: ' + status)

• Mixing a byte string literal with character input.
– Python 2.x interprets string literals as bytes by default.


#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
with codecs.open('hurts.txt', 'r', 'utf-8') as f:
    status = f.read().strip()
print('Schädigung: ' + status)

• Fixed. All strings are character strings now.

Summary: Special Considerations

• Python 2.x
– String literals are byte strings by default rather than character strings.
– Implicitly converts between byte strings and ASCII character strings.

• HTML, CSS, JavaScript
– Must declare an encoding in HTML.

• XML files
– Must declare an encoding in XML. Must honor such a declaration.
– Feed bytes to XML parsers rather than characters.

• Text files
– Must always assume an encoding. Usually UTF-8.

Don't Forget

1. You cannot interpret a byte sequence as a character sequence without knowing the character encoding.

2. Bytes and characters are not the same thing. Do not mix them.

Thank You

More broken programs…

What's wrong with this code? (#1)

// Java
Reader r = new FileReader("names.txt");

• No character encoding is specified!
– Java will fall back to the OS default character encoding, which depends on its locale.
– Therefore a customer running this program on a Japanese OS will read different text than a customer on an English OS!


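// Java
// FileReader accepts a charset argument only from Java 11 on,
// so wrap a byte stream in an InputStreamReader instead.
Reader r = new InputStreamReader(
    new FileInputStream("names.txt"), "UTF-8");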

• Fixed. Will always read as UTF-8.


// C#
TextReader r = new StreamReader("names.txt");

• Nothing!
– C#'s StreamReader always uses UTF-8 encoding if no encoding is specified.
– You must always read the documentation. Don't assume.


// C#
TextReader r = new StreamReader("names.txt", Encoding.UTF8);

• Nevertheless, always explicitly specifying the encoding is still a good idea.
