Unicode 101

Unicode 101: How to avoid corrupting international text (ß ➔ �). David Foster


Page 1: Unicode 101

Unicode 101
How to avoid corrupting international text

ß ➔ �

David Foster

Page 2: Unicode 101

Goal

Learn just enough to:
– Avoid corrupting international text in your code

Page 3: Unicode 101

Out of Scope

• Internationalization (i18n)
  – Extending a program to emit messages in multiple languages
• Localization (l10n)
  – Extending a program to emit messages in a specific language, such as German
• Manipulating Unicode characters within strings

Page 4: Unicode 101

Problems

• Customer A writes some text to a file or app. Customer B reads it back, but it is different. In particular, it contains a run of ??? or ��� characters.
  – ß ➔ �
• UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
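
Both symptoms can be reproduced in a few lines. The following is a minimal Python 3 sketch, not taken from the slides; the variable names are made up for illustration. It shows the exception, the "?" substitution on encoding, and the � (U+FFFD) substitution on decoding.

```python
# Encoding turns characters into bytes. ASCII has no byte for U+A000,
# so a strict encode raises the error quoted above.
try:
    "\ua000".encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# With errors="replace", encoding silently substitutes "?" for the
# character that the target encoding cannot represent.
question_marks = "Mein Fuß".encode("ascii", errors="replace")

# Decoding UTF-8 bytes with the wrong codec and errors="replace"
# substitutes U+FFFD for every byte that cannot be decoded.
replacement_chars = "Mein Fuß".encode("utf-8").decode("ascii", errors="replace")

print(error_message)
print(question_marks)
print(replacement_chars)
```

In both substitution cases the original ß is gone for good: nothing downstream can tell which character the "?" or � used to be.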

Page 5: Unicode 101

Bytes vs. Characters

Byte stream:       77 101 105 110 32 70 117 195 159

                   ↓ decode using the character encoding (UTF-8)

Character stream:  M e i n (space) F u ß
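
The same conversion can be written out in Python 3. This is a small sketch using the byte values from the slide; the variable names are illustrative only.

```python
# The byte stream from the slide, as raw integer values.
raw = bytes([77, 101, 105, 110, 32, 70, 117, 195, 159])

# Decoding needs the character encoding as an explicit input.
text = raw.decode("utf-8")

# Nine bytes become eight characters: ß occupies two bytes (195 159).
byte_count = len(raw)
char_count = len(text)

# Encoding with the same encoding reverses the step exactly.
round_trip = text.encode("utf-8")

print(text, byte_count, char_count)
```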

Page 6: Unicode 101

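Bytes vs. Characters

Two details of the byte stream to character stream conversion deserve attention:

• Multiple bytes wide! The single character ß is stored as two bytes (195 159) in UTF-8.
• Often forgotten! The character encoding is a required input to the decode step.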

Page 7: Unicode 101

What is the character encoding?

• There is usually some signal (sometimes out-of-band) that specifies the encoding that should be used to interpret a byte stream as characters.

  – HTTP: Content-Type: text/html; charset=UTF-8
  – HTML: <meta charset="UTF-8"/>
  – XML: <?xml version="1.0" encoding="UTF-8"?>
  – Python: # -*- coding: utf-8 -*-
  – POSIX: LANG=en_US.UTF-8
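
As an illustration of honoring an out-of-band signal, here is a hedged Python 3 sketch that extracts the charset from an HTTP Content-Type value using the standard library's email.message.Message. The helper name charset_from_content_type and the fallback value are assumptions made for this example, not part of any protocol.

```python
from email.message import Message


def charset_from_content_type(header_value, fallback="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, lowercased.

    `fallback` is a hypothetical default for this sketch; what to assume
    when no charset is declared depends on the protocol in use.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=fallback)


declared = charset_from_content_type("text/html; charset=UTF-8")
undeclared = charset_from_content_type("text/html")

# The declared charset is then the codec used to decode the body bytes.
body = "Mein Fuß tut weh!".encode("utf-8")
text = body.decode(declared)
print(declared, undeclared, text)
```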

Page 8: Unicode 101

What is the character encoding?

• Unfortunately some types of files don't contain any information about their encoding.

  – Text files (*.txt)
    • Usually the OS default character encoding is assumed, which depends on its locale. Yikes.
  – JSON files (*.json)
    • Usually UTF-8 is assumed, but other Unicode encodings (UTF-16, UTF-32) are permitted by RFC 4627. Its successor, RFC 8259, requires UTF-8 for JSON exchanged between systems that are not part of a closed ecosystem.
  – Java source files (*.java)
    • Encoding is derived from the -encoding compiler flag.
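
A short Python 3 sketch of what "no information about the encoding" means in practice. It assumes Python 3.6 or later, where json.loads accepts bytes and detects UTF-8, UTF-16, or UTF-32 itself; the payload is invented for the example.

```python
import json
import locale

payload = {"patient": "Fuß"}

# The same JSON document serialized with two different Unicode encodings.
utf8_bytes = json.dumps(payload, ensure_ascii=False).encode("utf-8")
utf16_bytes = json.dumps(payload, ensure_ascii=False).encode("utf-16")

# json.loads inspects the leading bytes to pick the Unicode encoding.
from_utf8 = json.loads(utf8_bytes)
from_utf16 = json.loads(utf16_bytes)

# Plain text files carry no such hint. This is the locale-dependent
# encoding that is commonly assumed for them on this machine.
locale_default = locale.getpreferredencoding(False)

print(from_utf8, from_utf16, locale_default)
```

The last value differs from machine to machine, which is exactly why relying on it is fragile.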

Page 9: Unicode 101

Big Mistake #1

You cannot interpret a byte sequence as a character sequence without knowing the character encoding.
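
To make the point concrete, the sketch below (Python 3, illustrative variable names) decodes one byte sequence with two different encodings, then shows the opposite mismatch failing outright.

```python
# One byte sequence...
raw = "Mein Fuß".encode("utf-8")

# ...two readings, depending on which encoding the reader assumes.
as_utf8 = raw.decode("utf-8")      # what the writer meant
as_cp1252 = raw.decode("cp1252")   # the two bytes of ß become two unrelated characters

# The reverse mistake: Windows-1252 bytes read as UTF-8 are not even
# valid UTF-8, so a strict decode raises instead of returning text.
legacy = "Mein Fuß".encode("cp1252")
try:
    legacy.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

print(as_utf8)
print(as_cp1252)
print(strict_decode_failed)
```

Note that the Windows-1252 reading succeeds without any error, which is what makes this kind of corruption easy to miss.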

Page 11: Unicode 101

What's wrong with this code? (A1)

    #!/usr/bin/python2.7
    with open("names.txt", "r") as f:
        for name in f:
            print('Hello ' + name.strip())

• No character encoding is specified!
  – Python 2 performs no decoding here, so the raw bytes end up being interpreted with whatever encoding the environment assumes, typically the OS default character encoding, which depends on its locale.
  – Therefore a customer running this program on a Japanese OS will see different text than a customer on an English OS!
• Reads byte strings instead of character strings!

Page 12: Unicode 101

What's wrong with this code? (A1)

    #!/usr/bin/python2.7
    import codecs
    with codecs.open("names.txt", "r", "utf-8") as f:
        for name in f:
            print(u'Hello ' + name.strip())

• Fixed. Will always read character strings, and as UTF-8.

Page 14: Unicode 101

What's wrong with this code? (A2)

    #!/usr/bin/python3.4
    with open("names.txt", "r") as f:
        for name in f:
            print('Hello ' + name.strip())

• No character encoding is specified!

Page 15: Unicode 101

What's wrong with this code? (A2)

    #!/usr/bin/python3.4
    with open("names.txt", "r", encoding="utf-8") as f:
        for name in f:
            print('Hello ' + name.strip())

• Fixed. Will always read as UTF-8.
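
A self-contained variant of the fixed program, as a sketch: it writes its own names.txt into a temporary directory so it can run anywhere. The sample names are invented. It also prints the locale-dependent encoding that open() has traditionally fallen back to when no encoding argument is given.

```python
import locale
import os
import tempfile

names = ["Jürgen", "Zoë", "Strauß"]  # sample data, made up for this sketch

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "names.txt")

    # Write and read with the same explicit encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(names) + "\n")
    with open(path, "r", encoding="utf-8") as f:
        greetings = ["Hello " + name.strip() for name in f]

    # The bytes on disk are UTF-8 no matter which OS wrote them.
    with open(path, "rb") as f:
        raw = f.read()

# What an open() call without encoding= would have depended on instead.
implicit_default = locale.getpreferredencoding(False)

print(greetings)
print(implicit_default)
```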

Page 17: Unicode 101

What's wrong with this code? (B)

    <!DOCTYPE html>
    <html>
      <head>
        <title>Krankenzimmer</title>
      </head>
      <body>Mein Fuß tut weh!</body>
    </html>

• No character encoding is specified!

Page 18: Unicode 101

What's wrong with this code? (B)

    <!DOCTYPE html>
    <html>
      <head>
        <meta charset="UTF-8"/>
        <title>Krankenzimmer</title>
      </head>
      <body>Mein Fuß tut weh!</body>
    </html>

• Fixed. Declares self as UTF-8 encoded.
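
For illustration, a consumer of such a page could look for the declaration roughly as follows. This is a simplified Python 3 sketch, not the algorithm browsers actually use; MetaCharsetFinder, declared_html_charset, and the windows-1252 fallback are all hypothetical names and choices made for this example.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Records the first <meta charset="..."> value it sees."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.charset is None:
            for name, value in attrs:
                if name == "charset" and value:
                    self.charset = value.strip().lower()


def declared_html_charset(raw, fallback="windows-1252"):
    """Scan the first bytes of a page for a declared charset.

    The declaration itself is plain ASCII, so a lossy ASCII decode of the
    head is enough to find it. `fallback` is an assumption of this sketch.
    """
    finder = MetaCharsetFinder()
    finder.feed(raw[:1024].decode("ascii", errors="replace"))
    return finder.charset or fallback


page = ('<!DOCTYPE html><html><head><meta charset="UTF-8"/>'
        '<title>Krankenzimmer</title></head>'
        '<body>Mein Fuß tut weh!</body></html>').encode("utf-8")

charset = declared_html_charset(page)
text = page.decode(charset)
undeclared = declared_html_charset(b"<html><body>hi</body></html>")
print(charset, undeclared)
```

Without the meta element the reader is left guessing, which is the situation the unfixed page creates.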

Page 20: Unicode 101

What's wrong with this code? (C)

    <?xml version="1.0"?>
    <messages>
      <message>Mein Fuß tut weh!</message>
    </messages>

• No character encoding is specified!

Page 21: Unicode 101

What's wrong with this code? (C)

    <?xml version="1.0" encoding="UTF-8"?>
    <messages>
      <message>Mein Fuß tut weh!</message>
    </messages>

• Fixed. Declares self as UTF-8 encoded.

Page 23: Unicode 101

What's wrong with this code? (D)

    // C#
    // TextReader is a character stream
    // OpenText always assumes UTF-8 encoding
    using (TextReader r = File.OpenText("names.xml"))
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(r);
        ...
    }

• The encoding declaration in the XML is ignored! UTF-8 is always forced.

Page 24: Unicode 101

What's wrong with this code? (D)

    // C#
    // Stream is a byte stream
    using (Stream s = File.OpenRead("names.xml"))
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(s);
        ...
    }

• Fixed. XmlDocument will internally determine the encoding based on the declaration in the byte stream.
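
The same principle carries over to other languages. As a hedged sketch in Python 3 with the standard library's xml.etree.ElementTree (the document content is invented for the example): hand the parser bytes and let it honor the declaration, rather than decoding the bytes yourself with a guessed encoding.

```python
import xml.etree.ElementTree as ET

# An XML document that declares, and is stored in, ISO-8859-1.
document = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
            '<messages><message>Mein Fuß tut weh!</message></messages>')
raw = document.encode("iso-8859-1")

# Feeding bytes lets the parser read the declaration and decode correctly.
root = ET.fromstring(raw)
message = root.find("message").text

# Forcing UTF-8 first, as the broken variant effectively does, cannot work:
# these bytes are not valid UTF-8.
try:
    raw.decode("utf-8")
    forced_utf8_failed = False
except UnicodeDecodeError:
    forced_utf8_failed = True

print(message, forced_utf8_failed)
```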

Page 25: Unicode 101

Big Mistake #2

Bytes and characters are not the same thing.

Do not mix them.
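
Python 3 enforces this separation, which makes it a convenient way to see the rule in action. A minimal sketch with made-up values: mixing the two types fails, and the conversion has to be spelled out with a named encoding at the boundary.

```python
label = "Schädigung: "                      # character string (str)
status_bytes = "geprellt".encode("utf-8")   # byte string, e.g. as read from a socket

# Python 3 refuses to concatenate characters and bytes.
try:
    label + status_bytes
    mix_error = None
except TypeError as exc:
    mix_error = exc

# Decode on the way in, work with characters, encode on the way out.
message = label + status_bytes.decode("utf-8")
wire = message.encode("utf-8")

print(type(mix_error).__name__)
print(message)
```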

Page 26: Unicode 101

Unfortunately many languages blur the line between byte strings and character strings.

– Python 2.x
  • All strings are byte strings by default.
  • Byte and ASCII character strings are implicitly convertible.
– C / C++
  • String functions in the C standard library manipulate byte strings by default.

Page 28: Unicode 101

What's wrong with this code? (E1)

    #!/usr/bin/python2.7
    # -*- coding: windows-1252 -*-
    print('Mein Fuß tut weh!')

• A byte string (with international chars) was printed. Only character strings should be printed.
  – On OS X, which has the UTF-8 locale by default rather than Windows-1252, the second word will be printed as "Fu?" instead of "Fuß".

Page 29: Unicode 101

What's wrong with this code? (E1)

    #!/usr/bin/python2.7
    # -*- coding: windows-1252 -*-
    print(u'Mein Fuß tut weh!')

• This is the smallest possible fix.

Page 30: Unicode 101

What's wrong with this code? (E1)

    #!/usr/bin/python2.7
    # -*- coding: windows-1252 -*-
    from __future__ import unicode_literals
    print('Mein Fuß tut weh!')

• A better fix, since it avoids adding u'…' everywhere.

Page 32: Unicode 101

What's wrong with this code? (E2)

    #!/usr/bin/python3.4
    # -*- coding: windows-1252 -*-
    print('Mein Fuß tut weh!')

• Nothing!
  – Python 3.x interprets string literals as character strings by default.
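
A quick Python 3 check of that claim, as a sketch with illustrative variable names: the literal is a character string, and bytes only come into existence when an encoding is named. The two encodings used here produce different bytes for ß, which is why the source file's coding declaration has to match how the file is actually saved.

```python
literal = 'Mein Fuß tut weh!'

# In Python 3 a plain string literal is a character string.
literal_type = type(literal).__name__

# Bytes only appear once an encoding is chosen explicitly.
as_cp1252 = literal.encode("windows-1252")   # ß becomes the single byte 0xDF
as_utf8 = literal.encode("utf-8")            # ß becomes the two bytes 0xC3 0x9F

print(literal_type, len(literal), len(as_cp1252), len(as_utf8))
```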

Page 34: Unicode 101

What's wrong with this code? (F)

    #!/usr/bin/python2.7
    # -*- coding: utf-8 -*-
    import codecs
    with codecs.open('hurts.txt', 'r', 'utf-8') as f:
        status = f.read().strip()
    print('Schädigung: ' + status)

• Mixing a byte string literal with character input.
  – Python 2.x interprets string literals as bytes by default.
  – The concatenation implicitly decodes the literal as ASCII, which fails on the ä with a UnicodeDecodeError.

Page 35: Unicode 101

What's wrong with this code? (F)

    #!/usr/bin/python2.7
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import codecs
    with codecs.open('hurts.txt', 'r', 'utf-8') as f:
        status = f.read().strip()
    print('Schädigung: ' + status)

• Fixed. All strings are character strings now.

Page 36: Unicode 101

Summary: Special Considerations

• Python 2.x
  – String literals are byte strings by default rather than characters.
  – Implicitly converts between byte strings and ASCII character strings.
• HTML, CSS, JavaScript
  – Must declare an encoding in HTML.
• XML files
  – Must declare an encoding in XML. Must honor such a declaration.
  – Feed bytes to XML parsers rather than characters.
• Text files
  – Must always assume an encoding. Usually UTF-8.

Page 37: Unicode 101

Don't Forget

1. You cannot interpret a byte sequence as a character sequence without knowing the character encoding.

2. Bytes and characters are not the same thing. Do not mix them.

Page 38: Unicode 101

Thank You

Page 39: Unicode 101

More broken programs…

Page 41: Unicode 101

What's wrong with this code? (#1)

    // Java
    Reader r = new FileReader("names.txt");

• No character encoding is specified!
  – Java (before version 18) will fall back to the OS default character encoding, which depends on its locale.
  – Therefore a customer running this program on a Japanese OS will read different text than a customer on an English OS!

Page 42: Unicode 101

What's wrong with this code? (#1)

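    // Java
    // FileReader accepts a charset only from Java 11 on, so wrap a byte stream instead.
    Reader r = new InputStreamReader(
        new FileInputStream("names.txt"), StandardCharsets.UTF_8);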

• Fixed. Will always read as UTF-8.

Page 44: Unicode 101

What's wrong with this code? (#2)

    // C#
    TextReader r = new StreamReader("names.txt");

• Nothing!
  – C#'s StreamReader always uses UTF-8 encoding if no encoding is specified.
  – You must always read the documentation. Don't assume.

Page 45: Unicode 101

What's wrong with this code? (#2)

    // C#
    TextReader r = new StreamReader(
        "names.txt", Encoding.UTF8);

• Nevertheless, always explicitly specifying the encoding is still a good idea.