Understand unicode & utf8 in perl (2)

Understand Unicode & UTF8 in Perl: avoid common issues and gain guru status. (You too can be John)

description

I'm not a Unicode guru, but when working with third parties I often find that many people consistently fail to get the basics of Unicode and encodings right. There must be something esoteric about it. So here's yet another set of slides about Unicode/UTF-8 in Perl. It's not meant to be a comprehensive presentation of all things Unicode in Perl. It's meant to insist on a couple of guidelines and give some pointers to get a good start writing a Unicode-compliant application and avoiding common issues.

Transcript of Understand unicode & utf8 in perl (2)

Page 1: Understand unicode & utf8 in perl (2)

Understand Unicode & UTF8 in Perl

avoid common issues and gain guru status. (You too can be John)

Page 2: Understand unicode & utf8 in perl (2)

Characters and Glyphs

A character: 'é'

Combination of 2 glyphs:

e (LATIN SMALL LETTER E)

Followed by:

◌́ (COMBINING ACUTE ACCENT, U+0301)

Page 3: Understand unicode & utf8 in perl (2)

Characters and Glyphs

A character: 'é'

Or a combined glyph:

é (LATIN SMALL LETTER E WITH ACUTE)

Page 4: Understand unicode & utf8 in perl (2)

So what is Unicode (in this context)?

A collection of characters (mainly), each assigned a unique number called a code point and a set of properties.

Example: E ( U+0045 )

Source : fileformat.info

Name: LATIN CAPITAL LETTER E

Block: Basic Latin

Category: Letter, Uppercase [Lu]

Combine (combining class): 0

BIDI: Left-to-Right [L]

Lower case: U+0065
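
The same properties can be read from inside Perl. The following is a minimal sketch, assuming the core modules charnames and Unicode::UCD are available (they ship with Perl); the variable names are only illustrative.

```perl
use strict;
use warnings;
use charnames ();                 # gives access to charnames::viacode
use Unicode::UCD qw(charinfo);    # core module exposing the Unicode database

my $code = 0x45;                  # U+0045
my $info = charinfo($code);       # hash ref of properties for this code point

my $name     = charnames::viacode($code);
my $category = $info->{category}; # general category, e.g. "Lu"
my $bidi     = $info->{bidi};     # bidirectional class
my $lower    = $info->{lower};    # lowercase mapping as a hex string

printf "U+%04X %s [%s] bidi=%s lower=U+%s\n",
    $code, $name, $category, $bidi, $lower;
```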

Page 5: Understand unicode & utf8 in perl (2)

What is a String?

An ordered collection of glyphs, i.e. an ordered collection of Unicode code points.

In Perl:

my $s = "he";

or

my $s = "\N{U+0068}\N{U+0065}";
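
To see that both spellings give the same string, here is a small sketch that compares them and lists the code points one by one. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $plain   = "he";
my $escaped = "\N{U+0068}\N{U+0065}";

# Both literals denote the same ordered collection of code points.
my $same = ($plain eq $escaped) ? 1 : 0;

# Walk the string character by character and show each code point.
my @code_points = map { sprintf('U+%04X', ord $_) } split //, $plain;

print "same: $same\n";
print "@code_points\n";    # U+0068 U+0065
```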

Page 6: Understand unicode & utf8 in perl (2)

What is a String ? - The glyph Pitfall

An ordered collection of glyphs. There's more than one way to write it.

In Perl:

my $s = "é";

is

my $s = "\N{U+00E9}"; OR..

my $s = "\N{U+0065}\N{U+0301}";

In practice, software prefers the first way (phew), but not always. See Unicode::Normalize.
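
A minimal sketch of the pitfall and of what Unicode::Normalize does about it. It uses U+0301 COMBINING ACUTE ACCENT as the combining mark and assumes you normalize both sides before comparing; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\N{U+00E9}";            # one code point
my $decomposed = "\N{U+0065}\N{U+0301}";  # letter e followed by a combining accent

# Same character for a human, different strings for eq.
my $raw_equal = ($composed eq $decomposed) ? 1 : 0;

# Normalizing both sides to the same form makes them comparable.
my $nfc_equal = (NFC($composed) eq NFC($decomposed)) ? 1 : 0;

# NFD goes the other way and splits the composed form in two.
my $nfd_length = length NFD($composed);

print "raw: $raw_equal, normalized: $nfc_equal, NFD length: $nfd_length\n";
```

Normalizing once on input (right after decoding) is usually simpler than normalizing at every comparison.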

Page 7: Understand unicode & utf8 in perl (2)

How does Perl represent Strings?

Short answer: It's not your business.

Long answer: It depends :(

Only "latin1 characters" -> Latin1. Anything outside that -> UTF-8.

Feeling fiddly, bug fixing? Use the utf8::* functions.

Bedtime read: perldoc perlunicode
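
If you are curious anyway, this sketch uses utf8::is_utf8 and utf8::upgrade to show that the internal representation can change while the string itself stays the same. It is for debugging only, and assumes no `use utf8` or `use feature 'unicode_strings'` is in effect.

```perl
use strict;
use warnings;

my $string = "caf\xE9";               # only Latin-1 characters

# Peek at the internal representation (debugging only).
my $flag_before = utf8::is_utf8($string) ? 1 : 0;

my $copy = $string;
utf8::upgrade($copy);                 # switch the internal storage to UTF-8
my $flag_after = utf8::is_utf8($copy) ? 1 : 0;

# The representation changed, the string did not.
my $still_equal = ($string eq $copy) ? 1 : 0;
my $same_length = (length($string) == length($copy)) ? 1 : 0;

print "before=$flag_before after=$flag_after equal=$still_equal\n";
```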

Page 8: Understand unicode & utf8 in perl (2)

Not my business? So what's this fuss about UTF-8 encoding?

How strings are represented internally is not your business.

How they are transmitted from/to the outside world is.

The outside world doesn't understand 'Strings'. It understands 'bytes'.

An encoding is a bijection:

Unicode Points (glyphs) <-> bytes

Page 9: Understand unicode & utf8 in perl (2)

UTF-8 encoding

Unicode Points (glyphs) <-> bytes

Variable number of bytes per Unicode code point. Examples:

a <-> \x{61} ,

☭ <-> \x{E2}\x{98}\x{AD} (gdrive FAIL)

Sometimes, the bytes begin with a BOM.
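
The mapping can be checked with the core Encode module. This is a sketch with illustrative variable names; it also shows the three bytes a UTF-8 BOM (U+FEFF) turns into.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "\N{U+262D}";                 # one code point
my $bytes  = encode('UTF-8', $string);     # three bytes

# Show the bytes in hex.
my $hex = join ' ', map { sprintf('%02X', ord $_) } split //, $bytes;
print "$hex\n";                            # E2 98 AD

# Decoding is the way back from bytes to the string.
my $back = decode('UTF-8', $bytes);

# The BOM is just U+FEFF encoded like any other code point.
my $bom_bytes = encode('UTF-8', "\N{U+FEFF}");
```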

Page 10: Understand unicode & utf8 in perl (2)

The encoding law

Never transfer Strings. Always transfer Bytes.

But inside Perl: You want to work with Strings as much as possible.

Sending: Encode as LATE as possible.

Receiving: Decode as EARLY as possible.
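
A sketch of the law in practice: bytes come in, get decoded at once, all the work happens on a string, and encoding happens only on the way out. The function name `shout` and the UTF-8 convention on both sides are assumptions for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes, returns UTF-8 bytes.
sub shout {
    my ($input_bytes) = @_;
    my $string = decode('UTF-8', $input_bytes);   # decode as EARLY as possible
    my $upper  = uc $string;                      # work on characters, not bytes
    return encode('UTF-8', $upper);               # encode as LATE as possible
}

my $in  = "caf\xC3\xA9";      # UTF-8 bytes for the word with an accented e
my $out = shout($in);

my $chars_in = length decode('UTF-8', $in);       # 4 characters, although 5 bytes
printf "%d bytes in, %d characters, %d bytes out\n",
    length($in), $chars_in, length($out);
```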

Page 11: Understand unicode & utf8 in perl (2)

Common outside worlds: STDOUT

Latin1 encoding by default :(

-> You can only output 'Latin1 compliant Strings'. And your shell should expect Latin1.

In the modern world:

# Set STDOUT to encode as UTF-8

binmode STDOUT, ':encoding(UTF-8)';
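
A sketch of what the layer does. To make the effect visible, the same layer is first applied to an in-memory filehandle, so the bytes that would reach the terminal can be inspected. The handle and variable names are illustrative.

```perl
use strict;
use warnings;

# Capture output in memory to see the bytes the layer produces.
open my $capture, '>', \my $bytes or die "Cannot open in-memory handle: $!";
binmode $capture, ':encoding(UTF-8)';
print {$capture} "\N{U+262D}";
close $capture or die "Cannot close in-memory handle: $!";

# Same layer on STDOUT: strings go in, UTF-8 bytes come out.
binmode STDOUT, ':encoding(UTF-8)';
print "hammer and sickle: \N{U+262D}\n";
printf "captured %d bytes\n", length $bytes;
```

Call binmode once per handle; each call with an encoding layer stacks another layer on top.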

Page 12: Understand unicode & utf8 in perl (2)

Common outside worlds: A text file

If you know the file encoding:

open(my $fh, "<:encoding(UTF-8)", "filename") or die "Cannot open filename: $!";

If you don't know:

Maybe you can count on the BOM (byte order mark).

But you don't want that. You want to know for sure -> set a convention.
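
A sketch of such a convention, assuming "all our text files are UTF-8": write with an explicit encoding layer and read with the same one. It uses a temporary file from the core File::Temp module so it can run anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "caf\N{U+00E9} \N{U+262D}";

# Temporary file, removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: the layer encodes the string to UTF-8 bytes.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} $text, "\n";
close $out or die "Cannot close $path: $!";

# Reading: the layer decodes the bytes back to a string.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "read back %d characters\n", length $line;
```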

Page 13: Understand unicode & utf8 in perl (2)

Common outside worlds: XML file

Encoding specified in the preamble:

<?xml version="1.0" encoding="utf-8"?>

If not specified -> utf8 is assumed.

Feed your XML parser with BYTES.

Write XML files in binary mode.

XML::LibXML calls bytes 'Strings'. People are confused. Trust no one.

Page 14: Understand unicode & utf8 in perl (2)

Common outside worlds: WWW

From a given page, browsers send parameters in the encoding of the page.

Correctly encode your binary responses.

Decode $c->params()

In Catalyst: Catalyst::Plugin::Unicode::Encoding

Page 15: Understand unicode & utf8 in perl (2)

Common outside worlds: Your own

Every time you communicate with a system, you will send/receive bytes. Never strings.

Think about encoding/decoding your strings to/from bytes, according to what your system expects/provides.

Sometimes, it's done automagically through some library options.

Page 16: Understand unicode & utf8 in perl (2)

Bug avoiding guidelines.

Test everything with Unicode characters.

English keyboard? chartables.de, unicode lorem ipsum.

Unit test => "\N{U+262D}"

Never i/o strings. Never. i/o is about bytes. Choose encodings explicitly.
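
A sketch of such a unit test with the core Test::More module. The `round_trip` function is a hypothetical stand-in for whatever your code does between receiving and sending (store to a database, call a service, write a file).

```perl
use strict;
use warnings;
use Test::More tests => 2;
use Encode qw(encode decode);

# Hypothetical code under test: bytes out, bytes back in.
sub round_trip {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);
    return decode('UTF-8', $bytes);
}

# A probe with a Latin-1 character and one well outside Latin-1.
my $probe = "caf\N{U+00E9} \N{U+262D}";

is(round_trip($probe), $probe, 'string survives the round trip');
is(length(round_trip($probe)), length($probe), 'character count is unchanged');
```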

Page 17: Understand unicode & utf8 in perl (2)

Bonus: Escaping

What if you want to represent your nice shiny UTF8 bytes as part of something else?

You need to escape them!

Example in URI, escaping parameters: (URI::Escape):

http://foo.com/?q=%E2%98%AD
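
To show the two steps separately (encode to bytes, then escape the bytes), here is a hand-rolled sketch using only core modules. `escape_param` and `unescape_param` are hypothetical helpers; in real code URI::Escape does this for you.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: string -> UTF-8 bytes -> percent-escaped ASCII.
sub escape_param {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);
    # Escape everything except the unreserved URI characters.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# Hypothetical helper: percent-escaped ASCII -> UTF-8 bytes -> string.
sub unescape_param {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex $1)/ge;
    return decode('UTF-8', $bytes);
}

my $escaped = escape_param("\N{U+262D}");
print "http://foo.com/?q=$escaped\n";    # http://foo.com/?q=%E2%98%AD
```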

Page 18: Understand unicode & utf8 in perl (2)

Bonus: Escaping for email headers

Encode AND Escape for Email subjects (Encode with MIME-Q):

Encode::encode('MIME-Q', "a\N{U+262D}c");

=?UTF-8?Q?a=E2=98=ADc?=

It encodes and escapes at the same time. Beware of confusion.

Keep strings as strings for as long as you can.
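
A sketch of the round trip, assuming the 'MIME-Q' and 'MIME-Header' encodings that ship with the core Encode module. The header value is plain ASCII bytes, and decoding it gives back the original string.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "a\N{U+262D}c";                  # a string, kept as long as possible

# Encode and escape in one step, right before building the email.
my $header = encode('MIME-Q', $subject);

# On the receiving side, decode the header back to a string.
my $decoded = decode('MIME-Header', $header);

my $is_ascii = ($header !~ /[^\x00-\x7F]/) ? 1 : 0;
print "Subject: $header\n";
```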

Page 19: Understand unicode & utf8 in perl (2)

Conclusion

Make sure you distinguish between Strings and Bytes. In Perl, that must come from discipline.

Make sure you always encode/decode on i/o as explicitly as possible. Don't let confused others confuse you.

Always wonder: what does this thing operate on, Bytes or Strings? When in doubt, investigate.