Understand Unicode & UTF8 in Perl
Avoid common issues and gain guru status. (You too can be John)
Characters and Glyphs
A character: 'é'
Combination of 2 glyphs:
e (LATIN SMALL LETTER E)
Followed by:
◌́ (COMBINING ACUTE ACCENT, U+0301)
Characters and Glyphs
A character: 'é'
Or a combined glyph:
é (LATIN SMALL LETTER E WITH ACUTE)
So what is Unicode (in this context)?
A collection of glyphs (mainly), called code points, each with a unique number and a set of properties.
Example: E (U+0045)
Source: fileformat.info
Name: LATIN CAPITAL LETTER E
Block: Basic Latin
Category: Letter, Uppercase [Lu]
Combine: 0
BIDI: Left-to-Right [L]
Lower case: U+0065
What is a String?
An ordered collection of glyphs, i.e. an ordered collection of Unicode code points.
In Perl:
my $s = "he";
or
my $s = "\N{U+0068}\N{U+0065}";
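A minimal sketch of the "ordered collection of code points" idea, using only core Perl. The variable names are illustrative; the point is that both spellings produce the same string, and that ord() exposes the number behind each character.

```perl
use strict;
use warnings;

# Two spellings of the same two-code-point string.
my $literal = "he";
my $escaped = "\N{U+0068}\N{U+0065}";

# ord() gives the code point number of a single character.
my @code_points = map { sprintf("U+%04X", ord($_)) } split //, $literal;

print "same string\n" if $literal eq $escaped;
print "@code_points\n";    # U+0068 U+0065
```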
What is a String ? - The glyph Pitfall
An ordered collection of glyphs. There's more than one way to write it.
In Perl:
my $s = "é";
is
my $s = "\N{U+00E9}";
or
my $s = "\N{U+0065}\N{U+0301}";
In practice, software prefers the first way (pffui), but not always. See Unicode::Normalize
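A short sketch of how Unicode::Normalize (a core module) helps here. It assumes the two-code-point spelling uses U+0301 COMBINING ACUTE ACCENT, which is what the canonical decomposition of U+00E9 contains. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\N{U+00E9}";              # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "\N{U+0065}\N{U+0301}";    # e + COMBINING ACUTE ACCENT

# As raw strings they differ, in content and in length.
print "raw strings differ\n" if $composed ne $decomposed;
printf "lengths: %d vs %d\n", length($composed), length($decomposed);

# After normalising both to the same form they compare equal.
print "equal under NFC\n" if NFC($composed) eq NFC($decomposed);
print "equal under NFD\n" if NFD($composed) eq NFD($decomposed);
```

Normalising on input (right after decoding) is one possible convention; which form to pick depends on what the rest of the system expects.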
How does Perl represent Strings?
Short answer: It's not your business.
Long answer: It depends :(
Only "latin1 characters" -> Latin1. Anything outside that -> UTF-8.
Feeling fiddly, or bug fixing? Use the utf8::* functions.
Bedtime read: perldoc perlunicode
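For the curious, a sketch of peeking at the internal representation with the utf8::* functions. This is for debugging only; the strings behave the same either way, which is exactly why it is "not your business". Variable names are illustrative.

```perl
use strict;
use warnings;

my $latin = "caf\x{E9}";    # every character fits in Latin-1
my $wide  = "\x{262D}";     # HAMMER AND SICKLE, outside Latin-1

printf "latin flagged: %s\n", utf8::is_utf8($latin) ? "yes" : "no";
printf "wide flagged:  %s\n", utf8::is_utf8($wide)  ? "yes" : "no";

# utf8::upgrade changes the internal representation only,
# not the characters the string contains.
my $copy = $latin;
utf8::upgrade($copy);
print "same string after upgrade\n" if $copy eq $latin;
```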
Not my business? So what's this fuss about UTF-8 encoding?
How strings are represented internally is not your business.
How they are transmitted from/to the outside world is.
The outside world doesn't understand 'Strings'. It understands 'bytes'.
An encoding is a bijection:
Unicode Points (glyphs) <-> bytes
UTF-8 encoding
Unicode Points (glyphs) <-> bytes
Variable number of bytes per Unicode code point. Examples:
a <-> \x{61} ,
☭ <-> \x{E2}\x{98}\x{AD} (gdrive FAIL)
Sometimes the bytes begin with a BOM (byte order mark).
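The two examples above can be checked with the core Encode module. This is a small sketch; the variable names are invented for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "a\N{U+262D}";               # 2 code points
my $bytes  = encode('UTF-8', $string);    # 1 + 3 = 4 bytes

printf "%d code points, %d bytes\n", length($string), length($bytes);
print join(' ', map { sprintf('%02X', ord($_)) } split //, $bytes), "\n";

# The mapping goes both ways.
my $back = decode('UTF-8', $bytes);
print "round trip ok\n" if $back eq $string;
```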
The encoding law
Never transfer Strings. Always transfer Bytes.
But inside Perl: You want to work with Strings as much as possible.
Sending: Encode as LATE as possible.
Receiving: Decode as EARLY as possible.
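A sketch of the law in practice, assuming the outside world speaks UTF-8. The incoming bytes are hard-coded here to stand in for a socket, file or request; all names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Pretend these bytes just arrived from the outside world (UTF-8 for "café").
my $incoming_bytes = "caf\xC3\xA9";

# Receiving: decode first, so the rest of the program sees characters.
my $name = decode('UTF-8', $incoming_bytes);

# Work on strings: lengths and concatenation are per character.
my $greeting = "Hello $name \N{U+262D}";

# Sending: encode at the very last moment.
my $outgoing_bytes = encode('UTF-8', $greeting);

printf "in: %d bytes, name: %d chars, out: %d bytes\n",
    length($incoming_bytes), length($name), length($outgoing_bytes);
```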
Common outside worlds: STDOUT
Latin1 encoding by default :(
-> You can only output 'Latin1 compliant Strings'. And your shell should expect Latin1.
In the modern world:
# Set STDOUT to encode as UTF-8
binmode STDOUT, ':encoding(UTF-8)';
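To see what an encoding layer actually does to output, here is a sketch that uses an in-memory filehandle as a stand-in for STDOUT, so the resulting bytes can be inspected. The handle and buffer names are illustrative.

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for STDOUT so the bytes can be inspected.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "Cannot open buffer: $!";
print {$out} "\N{U+262D}\n";
close($out) or die "Cannot close buffer: $!";

# The layer turned 1 character (plus newline) into 3 bytes (plus newline).
print join(' ', map { sprintf('%02X', ord($_)) } split //, $buffer), "\n";
```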
Common outside worlds: A text file
If you know the file encoding:
open(my $fh, "<:encoding(UTF-8)", "filename") or die "Cannot open filename: $!";
If you don't know:
Maybe you can count on the BOM (byte order mark).
But you don't want that. You want to know for sure -> set a convention.
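A sketch of the "known encoding" case, with the convention being UTF-8. A temporary file is written as raw bytes to play the role of a file produced elsewhere, then read back through a decoding layer. File::Temp is a core module; the variable names are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write known UTF-8 bytes with no translation at all.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;    # raw bytes
print {$tmp} "caf\xC3\xA9\n";
close($tmp) or die "Cannot close $path: $!";

# Read it back, decoding at the boundary.
open(my $fh, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
my $line = <$fh>;
close($fh) or die "Cannot close $path: $!";
chomp $line;

printf "%d characters read\n", length($line);    # 4 characters from 5 bytes
```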
Common outside worlds: XML file
Encoding specified in the preamble:
<?xml version="1.0" encoding="utf-8"?>
If not specified -> UTF-8 is assumed (or UTF-16 when the file starts with a BOM).
Feed your XML parser with BYTES.
Write XML files in binary mode.
XML::LibXML calls bytes 'strings'. People are confused. Trust no one.
Common outside worlds: WWW
From a given page, browsers send parameters in the encoding of the page.
Correctly encode your binary responses.
Decode $c->params()
In Catalyst: Catalyst::Plugin::Unicode::Encoding
Common outside worlds: Your own
Every time you communicate with a system, you will send/receive bytes. Never strings.
Think about encoding/decoding your strings to/from bytes, according to what your system expects/provides.
Sometimes it's done automagically through some library options.
Bug avoiding guidelines.
Test everything with Unicode characters.
English keyboard? chartables.de, unicode lorem ipsum.
Unit test => "\N{U+262D}"
Never i/o strings. Never. i/o is about bytes. Choose encodings explicitly.
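What such a unit test might look like, as a sketch with the core Test::More module. The function label_bytes is hypothetical and stands in for whatever code of yours turns strings into output bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Test::More tests => 2;

# Hypothetical function under test: takes a string, returns UTF-8 bytes.
sub label_bytes {
    my ($name) = @_;
    return encode('UTF-8', "Name: $name");
}

my $input = "\N{U+262D}";    # deliberately outside Latin-1
my $bytes = label_bytes($input);

is(length($bytes), length("Name: ") + 3, 'one wide character becomes 3 bytes');
is(decode('UTF-8', $bytes), "Name: $input", 'decoding gives the string back');
```

A test like this tends to fail loudly when an encode or decode step is missing or doubled, which plain ASCII test data would never reveal.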
Bonus: Escaping
What if you want to represent your nice shiny UTF8 bytes as part of something else?
You need to escape them!
Example in a URI, escaping parameters (URI::Escape):
http://foo.com/?q=%E2%98%AD
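To make the two steps visible (encode to bytes, then escape the bytes), here is a hand-rolled sketch using only core modules. The helper percent_encode is hypothetical and is not how URI::Escape is implemented; in real code, prefer the module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hand-rolled helper for illustration only.
sub percent_encode {
    my ($string) = @_;

    # Step 1: encode the string to UTF-8 bytes.
    my $bytes = encode('UTF-8', $string);

    # Step 2: escape every byte that is not an unreserved URI character.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

print "http://foo.com/?q=", percent_encode("\N{U+262D}"), "\n";
```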
Bonus: Escaping for email headers
Encode AND Escape for Email subjects (Encode with MIME-Q):
Encode::encode('MIME-Q', "a\N{U+262D}c");
=?UTF-8?Q?a=E2=98=ADc?=
It encodes and escapes at the same time. Beware of confusion.
Keep strings as strings for as long as you can.
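A sketch of the round trip with the core Encode module, to show that the header form is pure ASCII bytes and that decoding recovers the original string. The variable names are illustrative, and the exact header text may vary slightly between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "a\N{U+262D}c";

# One call both encodes to UTF-8 and escapes into ASCII.
my $header = encode('MIME-Q', $subject);
print "$header\n";

# The result is bytes that are safe for a header line: pure ASCII.
print "ASCII only\n" if $header !~ /[^\x00-\x7F]/;

# Decoding the header gives the original string back.
print "round trip ok\n" if decode('MIME-Header', $header) eq $subject;
```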
Conclusion
Make sure you distinguish between Strings and Bytes. In Perl, that must come from discipline.
Make sure you always encode/decode on i/o as explicitly as possible. Don't let confused others confuse you.
Always wonder: what does this thing operate on, Bytes or Strings? When in doubt, investigate.