Unicode 101

Bouvet søkekollokvie, 2011-10-26
Lars Marius Garshol, http://twitter.com/larsga

description

An overview of character sets and encodings, scripts and languages, display issues, and how to deal with these in programming.


Page 1: Unicode 101

1

Unicode 101

Bouvet søkekollokvie, 2011-10-26
Lars Marius Garshol
http://twitter.com/larsga

Page 2: Unicode 101

2

Agenda

• Scripts and languages
  – background for what follows
• Character encodings
  – basic concepts
• Unicode
  – it’s bigger than you think
• Programming
  – some practical lessons

Page 3: Unicode 101

3

A quick intro to grammatology

Scripts and languages

Page 4: Unicode 101

4

The beginning of writing

4000 BCE 3000 BCE 2000 BCE 1000 BCE

Sumerian cuneiform

Egyptian hieroglyphics

Chinese writing

Page 5: Unicode 101

5

Logographic scripts

• All of these first scripts are logographic
  – sometimes called “ideographic”, too
• One character – one word
  – 山 = mountain
  – 口 = mouth / door
  – 門 = gate
• Compounds for complicated concepts
  – 問 = question (mouth inside gate)
  – 酒店 = hotel (literally: alcohol shop)
  – 日本 = Japan (literally: sun root)

Page 6: Unicode 101

6

Simplified Chinese

• In the 1950s and 60s the Communist Chinese government simplified the shapes of many characters
  – these modified characters are used in mainland China
  – Taiwan still uses the original shapes; Japan adopted a smaller set of simplifications of its own

Character  Traditional  Simplified
Gate       門           门
Country    國           国
Vehicle    車           车
East       東           东

Page 7: Unicode 101

7

Logographic scripts (2)

Good
• Iconic and striking
• Compact
• Language-independent (to some degree)

Bad
• Hard to learn
• Hard to write on computers
• Works poorly for inflected languages
• Sorting is hard

Page 8: Unicode 101

8

The next step: abjads

• The Egyptians later developed a script called “hieratic”
  – meaning “priestly writing”
  – oldest known example from 1600 BCE, but must be older
  – actual origin kind of obscure
• It is an abjad
  – that is, an alphabet with only consonant signs
  – (and some logographic elements)
• A precursor to our own alphabet

Page 9: Unicode 101

9

Why only consonants?

• In Semitic languages everything revolves around “word roots”
  – these consist of three consonants
  – many, many words can be derived from the same root in a systematic way
  – (the same applies to Egyptian)
• Example
  – s l m = peace
  – salaam = peace (related to Hebrew shalom)
  – islam = to have peace
  – moslem = one who has peace
• Abjads therefore work well for Semites
  – and not quite as well for others

Page 10: Unicode 101

10

Abjad family tree

Hieratic

Proto-Sinaitic

Ethiopic Phoenician

Greek Aramaic

Arabic Hebrew Kharoshthi

Much simplified. This family has many, many more scripts.

Page 11: Unicode 101

11

The invention of the alphabet

• The Greeks didn’t much like not having signs for vowels
  – so they invented them, thus giving us the alphabet
• Salaam
  – in Arabic: سالم (m l s)
  – in Greek: σαλαμ (s a l a m)
• Grammatologists consider only scripts with both consonant and vowel signs to be alphabets
  – thus, there is no “Arabic alphabet” or “Chinese alphabet”

Page 12: Unicode 101

12

Alphabet family tree

Greek

Etruscan

Latin Armenian

Again much simplified.

Cyrillic Georgian

Page 13: Unicode 101

13

Alphabetic order

Page 14: Unicode 101

14

ABG?

• The Etruscans had no g sound, so they pronounced the third character k
  – the Romans inherited this
• Thus, in Latin “C” was used for both k and g
  – this left Spurius Carvilius Ruga with a bit of a problem
  – he had to write his surname as “Ruca”
  – (Latin “ruca”: to fart)
  – so he invented “G” (C with a stroke)

Page 15: Unicode 101

15

Backtracking

Hieratic

Proto-Sinaitic

Ethiopic Phoenician

Greek Aramaic

Arabic Hebrew Kharoshthi

Page 16: Unicode 101

16

Kharoshthi

• A script developed for writing Sanskrit in India
  – 4th century BCE, or thereabouts
• Sanskrit is Indo-European, like Greek
  – hence, the absence of vowels is a major problem
  – however, the Indians came up with a different solution
• This is an abugida
  – Ethiopic, too

Page 17: Unicode 101

17

The descendants of Kharoshthi

• The Indian subcontinent has a large number of scripts
  – all are descended from Kharoshthi and follow the same basic model
• These scripts also spread into Indo-China and beyond to Indonesia
  – Thai, Khmer, Javanese, Tibetan, etc.
• Total number of users must approach 2 billion

Page 18: Unicode 101

18

Back to China

• Japanese is an inflected language
  – therefore developed extra characters for writing grammatical endings
• The Vietnamese later moved on to Latin characters under French influence
• The Koreans did something completely different...
• Leaving out a number of minor ethnic scripts here

Chinese

Japanese Korean Vietnamese

Page 19: Unicode 101

19

Hangul

• On October 9, 1446, King Sejong announced the introduction of a new script: Hangul
  – it was designed to be easier to learn than Chinese characters
  – it was also designed to be a better fit to the Korean language
• Hangul is like an alphabet, but better
  – Wikipedia: Numerous linguists have praised Hangul for its featural design, describing it as “remarkable”, “the most perfect phonetic system devised”, and “brilliant, so deliberately does it fit the language like a glove.”

Page 20: Unicode 101

20

Hangul design

• Vowels have different shapes from consonants
  – these follow the shape of the vocal cords when speaking the vowel
• Consonants have a different system

Follows Chinese convention of equal-sized boxes for all characters. Each box contains hangul letters for one syllable.

Page 21: Unicode 101

21

Syllabaries

• Like abugidas, but unsystematic
  – that is, every letter combination must be learned by rote

Page 22: Unicode 101

22

1800 – present

• Cover this?

Page 23: Unicode 101

23

Kinds of scripts

[Figure: kinds of scripts ordered by what the basic shapes correspond to (feature, letter, syllable, word): featural, alphabet, syllabary, logographic, with abjad and abugida also shown.]

Page 24: Unicode 101

24

Text directions

• LTR top-down
  – Latin, Greek, Cyrillic, ...
• RTL top-down
  – Arabic, Hebrew, Syriac, ...
• Top to bottom, left to right
  – Mongolian, Uighur, Buryat, ...
• Top to bottom, right to left
  – Nushu, Rong
• Bottom up, left to right
  – some minor Indonesian abugidas
• Bottom up, right to left
  – one minor Moslem Chinese abjad
• Upwards boustrophedon
  – Ogham (ancient Irish runes)

Page 25: Unicode 101

25

Combining characters

• In Arabic, the same letter can have up to four different shapes– depending on its position in the word

Page 26: Unicode 101

26

Basics

Character encoding

Page 27: Unicode 101

27

Two key concepts

• Character set
  – a function number -> character
  – usually with a limited, fixed number of characters
• Character encoding
  – a mapping from a bit stream to a sequence of numbers
  – the numbers, of course, refer to characters in some character set
• Example
  – UTF-8 is an encoding for Unicode
  – UTF-16 is another
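To make the character set / encoding distinction concrete, here is a minimal Java sketch. The class name EncodingDemo and the hex helper are made up for illustration; only the standard charset API is assumed. One character (one number in the Unicode character set) comes out as different byte sequences under three encodings.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    /** Formats bytes as space-separated upper-case hex, e.g. "C3 A5". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One character: U+00E5 (small a with ring above) in the Unicode character set.
        String s = "\u00E5";
        // Three encodings turn the same number into different bytes.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 A5
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 E5
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // E5
    }
}
```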

Page 28: Unicode 101

28

Kinds of encodings

• Single-byte
  – each byte is a number 0-255, end of story
  – ISO 8859-x
• Double-byte
  – each word (2 bytes) is a number 0-65535, end of story
  – UCS-2
• Variable length
  – more complex rules (UTF-8)
• Escape code-based
  – uses escape codes to change between different modes
  – ISO 2022

Page 29: Unicode 101

29

ASCII

• The mother of all character sets
  – 7-bit
• Nearly all character sets today are ASCII supersets
• The exception is EBCDIC
  – mostly used by IBM mainframes
  – also a terrible design

Page 30: Unicode 101

30

Code pages

• Primitive character set solution on IBM PCs

• Basically, changing the code page would change the system font
  – 65 would always be ‘A’
  – 216 could be ‘Ø’, Cyrillic, Greek, ... depending on code page
• Essentially, swapping code page would change the contents of all text files...
  – awful for text processing software

Page 31: Unicode 101

31

ISO 8859-x series

• Lower 128 characters are ASCII
• Higher 128 characters are language-specific
• Now obsolete, thanks to Unicode
• Microsoft have their own extensions
  – Windows-12xx, which add extra characters where 8859 has obsolete control codes

 1 Western Europe    5 Cyrillic    9 Turkish          13 Baltic
 2 Central Europe    6 Arabic     10 Nordic (Sami)    14 Celtic
 3 South Europe      7 Greek      11 Thai             15 Latin-1 ++
 4 North Europe      8 Hebrew     12 Doesn’t exist    16 Latin-3 ++

Page 32: Unicode 101

32

The Far East

• Generally, one character set per country
  – Japan: JIS X 0208
  – Korea: KS X 1001 (and 1003)
  – China: GB 2312
  – Taiwan: ???
• Combined with different character encodings
  – ISO 2022 (-JP, -KR, -CN)
  – EUC (-JP, -KR, -TW)
• Additional variants
  – Shift-JIS (Japanese, from Microsoft)
  – Big5 (Taiwanese)
  – ...

Page 33: Unicode 101

33

China

• Doesn’t want to use Unicode
• Instead introduced GB 18030
  – takes GB 2312, then adds Unicode after the GB 2312 part
  – requires a mapping table for lower part
  – higher part can be mapped to Unicode algorithmically

Page 34: Unicode 101

34

VISCII

• Vietnamese character set
  – tries hard to maintain ASCII compatibility, but there are just too many Vietnamese characters...

Page 35: Unicode 101

35

Unicode

Page 36: Unicode 101

36

Unicode

• The character set to end all character sets

• Before, there was at least one character set for each script
  – generally, it would have Latin plus one more script
  – software therefore had to support many different internal representations of text
• Now, Unicode supports every character that’s ever appeared in a character set anywhere
  – therefore, it’s the only character set you need

Page 37: Unicode 101

37

Origin

• 1987
  – engineers at Xerox and Apple discuss the possibility of a universal character set
  – they investigate, and decide it’s feasible
• 1988
  – tentative proposal for a 16-bit character set
• 1989
  – Unicode Working Group set up
  – all of ISO 8859 added
• 1990
  – many more people join
  – Chinese characters added
• 1991
  – Unicode Consortium founded
  – Unicode 1.0 released
• 1992
  – the original ISO 10646 design killed off, and replaced by one aligned with Unicode

Page 38: Unicode 101

38

Design goals

• Universal
  – should be the only character set ever needed
• Semantics
  – characters should have well-defined semantics
  – Ø ≠ ∅
• Dynamic composition
  – characters can be composed dynamically
• Convertibility
  – every character in an existing character set must have a single corresponding character in Unicode

Page 39: Unicode 101

39

Structure

• Originally intended to be 16-bit
  – 0x0000 – 0xFFFF
  – explicit rationale: enough to encode all characters in daily use
  – implicit: not excessive use of space
• Unfortunately, this is not nearly enough
  – decided to expand it in 1996
  – keep the 16-bit structure
  – original range becomes Basic Multilingual Plane
  – each stretch of 0x10000 (65,536) code points is a plane
  – 17 planes (0-16) in all
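Since every plane holds 0x10000 code points, the plane of a code point is simply its value shifted right by 16 bits. A minimal sketch, where the class and method names are illustrative and not taken from the talk:

```java
public class PlaneDemo {
    /** Plane number (0-16) of a code point: each plane holds 0x10000 code points. */
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(plane(0x00E5));   // 0: a with ring, in the BMP
        System.out.println(plane(0x1D11E));  // 1: musical G clef, Supplementary Multilingual Plane
        System.out.println(plane(0x20000));  // 2: Supplementary Ideographic Plane
        System.out.println(plane(Character.MAX_CODE_POINT)); // 16: the last plane
    }
}
```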

Page 40: Unicode 101

40

Contents

• As of Unicode 6.0– more than 109,000 characters– 93 different scripts

Page 41: Unicode 101

41

The planes

• Plane 0: BMP
  – nearly the only one needed
• Plane 1: Supplementary Multilingual Plane
  – mostly historical scripts and weird symbols
• Plane 2: Supplementary Ideographic Plane
  – historical Chinese characters
• Plane 3: Tertiary Ideographic Plane
  – not in use, reserved for ancient Chinese characters
• Planes 4-13: Unused
• Plane 14: Supplementary Special-purpose Plane
• Planes 15-16: Private Use Area

Page 42: Unicode 101

42

Basic Multilingual Plane

[Figure: map of the BMP laid out by code point range (columns 0xxx to Fxxx, rows x1FF to xFFF). Labelled areas: Latin, Scripts, Symbols, Braille, Misc, CJK, Syll., Yi, Hangul, Surrogates, Private use, Stuff.]

Page 43: Unicode 101

43

Plane 1

[Figure: map of Plane 1 laid out by code point range (columns 0xxx to Fxxx, rows x1FF to xFFF). Labelled areas: LTR, RTL, Indic, African, Conlang, Near East, Hieroglyphics, Undeciphered, North Am., Sumerian, Notational, Large Asian.]

Page 44: Unicode 101

44

Encodings

• UCS-2
  – from ISO 10646
  – two bytes per character
  – can only encode the BMP
• UCS-4
  – also from ISO 10646
  – four bytes per character
  – can encode the whole thing
• UTF-32
  – same as UCS-4

Page 45: Unicode 101

45

UTF-16

• Like UCS-2
  – but extended with a trick to cover the full set
• Surrogates
  – a block of code points set aside specifically for UTF-16
• Each non-BMP character is written as two surrogates, first a high one and then a low one
  – subtract 0x10000 from the code point, leaving a 20-bit number
  – top 10 bits (0-03FF) added to D800 = first two bytes
  – bottom 10 bits added to DC00 = next two bytes
• So, the characters become:
  – 1101 10xx xxxx xxxx 1101 11xx xxxx xxxx
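The surrogate arithmetic can be checked by hand. In this sketch the helper toSurrogates is a made-up name: it subtracts 0x10000 from the code point, splits the remaining 20 bits into two halves of 10, and compares the result with what the standard Character.toChars produces.

```java
import java.util.Arrays;

public class SurrogateDemo {
    /** Splits a non-BMP code point into a {high, low} surrogate pair by hand. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("Not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit number
        char high = (char) (0xD800 + (offset >>> 10));  // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // MUSICAL SYMBOL G CLEF, in plane 1
        char[] manual = toSurrogates(clef);
        System.out.printf("%04X %04X%n", (int) manual[0], (int) manual[1]); // D834 DD1E
        // The JDK performs the same computation.
        System.out.println(Arrays.equals(manual, Character.toChars(clef))); // true
    }
}
```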

Page 46: Unicode 101

46

UTF-8

• Cleverly designed variable-length encoding

• ASCII is encoded as ASCII
• Can encode all of Unicode in at most 4 bytes per character
  – whether more or less compact than UTF-16 depends on the text being encoded
  – for files UTF-8 is usually far more compact
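How compact UTF-8 is depends on which characters the text contains. A small sketch (class and method names are illustrative) comparing encoded sizes for one ASCII letter, one Latin-1 letter, one Chinese character, and one character outside the BMP:

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // UTF_16BE writes no byte order mark, so only the characters are counted.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, ASCII
            "\u00E5",       // U+00E5, a with ring
            "\u5C71",       // U+5C71, the character for mountain
            "\uD834\uDD1E"  // U+1D11E, outside the BMP
        };
        for (String s : samples) {
            System.out.println(utf8Length(s) + " vs " + utf16Length(s));
        }
        // prints: 1 vs 2, 2 vs 2, 3 vs 2, 4 vs 4 (one pair per line)
    }
}
```

Mostly-ASCII files (source code, markup, Western European text) therefore shrink under UTF-8, while CJK-heavy text is smaller in UTF-16.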

Page 47: Unicode 101

47

UTF-8

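• Far and away the most used Unicode encoding
  – because of compatibility with ASCII
• Easy to recognize bit patterns
  – “Lett Ã¥ kjenne igjen” (Norwegian “Lett å kjenne igjen”, “easy to recognize”) = UTF-8 interpreted as 8859-1
  – æøå = Ã¦Ã¸Ã¥
  – ÆØÅ = Ã†Ã˜Ã… (as displayed by Windows-1252; in strict 8859-1 the second byte of each pair is an invisible control code)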
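The tell-tale garbling of UTF-8 read as 8859-1 is easy to reproduce. In this sketch (helper names are illustrative) misread encodes a string as UTF-8 and decodes the bytes as ISO 8859-1, and repair reverses the mistake. The repair only works because 8859-1 maps every byte to a character, so nothing is lost along the way.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Encodes as UTF-8, then (wrongly) decodes the bytes as ISO 8859-1. */
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Reverses misread(): re-encodes as ISO 8859-1 and decodes as UTF-8. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00E6\u00F8\u00E5"; // the three Norwegian lower-case letters
        String garbled = misread(original);
        System.out.println(garbled.length());  // 6: every letter became two characters
        // Each pair starts with U+00C3 (capital A with tilde), the giveaway.
        System.out.println(garbled.equals("\u00C3\u00A6\u00C3\u00B8\u00C3\u00A5")); // true
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```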

Page 48: Unicode 101

48

Reading UTF-8 as UTF-16

• Two bytes get treated as a single character
  – effectively turns it into a random character

[Figure: the same text displayed as UTF-8 and as UTF-16]

Page 49: Unicode 101

49

Han unification

• Not only are there many Chinese characters
  – there are different variants of each character
  – differences between China (traditional & simplified), Japan, and Korea (also Vietnam)
• Unicode has decided to encode these only once
  – different renderings are considered visual differences only
• This is quite unpopular in the Far East
  – particularly in Japan

Page 50: Unicode 101

50

Too many characters

• Latin characters have a nearly infinite number of variants:

• New ones pop up all the time– these can’t all be encoded

a á à â

ã ä å ā

ą ă ậ ẩ

ạ aː ȧ ...

Page 51: Unicode 101

51

Solution: combining characters

• What if I needed Z with stroke, cedilla, and umlaut?
• Simple, encode as
  – U+01B5 LATIN CAPITAL LETTER Z WITH STROKE
  – U+0327 COMBINING CEDILLA
  – U+0308 COMBINING DIAERESIS
• Norwegian å should actually be written
  – a + combining ring

Page 52: Unicode 101

52

Many ways to write a character

• Unicode has inherited precomposed characters (like å) from older character sets
  – these are all included, for ease of roundtripping
• Unicode normalization provides ways of streamlining this
  – unfortunately, it’s complex, with numerous different normalization forms
  – won’t go further into it
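Without going into the details of the normalization forms, a short sketch shows why normalization matters in practice. The helper sameText is a made-up name; java.text.Normalizer is the standard JDK API. The precomposed and the combining spellings of the same letter are different strings until both are normalized.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    /** Compares two strings after normalizing both to NFC (composed form). */
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E5"; // a with ring as a single code point
        String combining = "a\u030A";  // a followed by COMBINING RING ABOVE
        System.out.println(precomposed.equals(combining));   // false: different code points
        System.out.println(sameText(precomposed, combining)); // true
        // NFD goes the other way and splits the precomposed character.
        System.out.println(Normalizer.normalize(precomposed, Normalizer.Form.NFD).length()); // 2
    }
}
```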

Page 53: Unicode 101

53

Unicode Character Database

• Unicode contains more than just the characters
  – there is a whole database of characters with many fields
• It contains things like
  – names for each character
  – decomposition mappings
  – deprecation mappings
  – case mappings
  – breakdown into blocks
  – category for each character
  – numeric value
  – what script the character belongs to
  – ...
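Several of these fields are exposed directly by java.lang.Character, which is backed by the UCD data of whichever Unicode version the JDK ships. A sketch, with describe as an illustrative helper name:

```java
public class UcdDemo {
    /** Summarizes a few UCD fields for one code point. */
    static String describe(int codePoint) {
        return String.format("U+%04X %s, script %s, block %s",
                codePoint,
                Character.getName(codePoint),
                Character.UnicodeScript.of(codePoint),
                Character.UnicodeBlock.of(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x00E5)); // name: LATIN SMALL LETTER A WITH RING ABOVE
        System.out.println(describe(0x5C71)); // script: HAN
        // Category, case mapping and numeric value come from the same database.
        System.out.println(Character.getType(0x00E5) == Character.LOWERCASE_LETTER); // true
        System.out.println(Integer.toHexString(Character.toUpperCase(0x00E5)));      // c5
        System.out.println(Character.getNumericValue(0x0663)); // 3: ARABIC-INDIC DIGIT THREE
    }
}
```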

Page 54: Unicode 101

54

Uses for UCD

• Matching in regular expressions
  – by character category
  – by script
  – ...
• Upper- and lower-casing of strings
  – beware, this is complicated...
• Stripping accents
  – use decomposition mappings in UCD

• ...
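Two of these uses can be sketched with the standard library. The class and method names below are illustrative. Accent stripping decomposes the text (NFD) and then drops everything in the Mark category; script matching uses the \p{IsGreek} property supported by java.util.regex.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class UcdUsesDemo {
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");         // category: Mark
    private static final Pattern GREEK = Pattern.compile("\\p{IsGreek}+");   // script: Greek

    /** Decomposes the text, then removes the combining marks. */
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    /** True if the string is non-empty and consists only of Greek-script characters. */
    static boolean isAllGreek(String s) {
        return GREEK.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("Cr\u00E8me br\u00FBl\u00E9e")); // Creme brulee
        System.out.println(isAllGreek("\u03C3\u03B1\u03BB\u03B1\u03BC")); // true (Greek salam)
        System.out.println(isAllGreek("salam"));                          // false
    }
}
```

Note that this only removes marks that the UCD decomposition mappings expose: a letter such as ø (U+00F8) has no decomposition and passes through unchanged.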

Page 55: Unicode 101

55

More stuff in the Unicode standard

• Guidance on upper/lower-casing
  – tricky, because there are national variations
• Unicode Normalization
• Sorting algorithm
• Regular expression guidelines
• Line breaking algorithm
• Bidirectional text display (Arabic)
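The national variations in casing are easy to run into. A small sketch using the standard locale-aware String methods (the class name is illustrative): Turkish has both a dotted and a dotless i, and German sharp s upper-cases to two letters, so case conversion depends on the locale and may change the length of the string.

```java
import java.util.Locale;

public class CasingDemo {
    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // In most locales i upper-cases to I ...
        System.out.println("i".toUpperCase(Locale.ROOT).equals("I"));     // true
        // ... but in Turkish it becomes U+0130, capital I with dot above.
        System.out.println("i".toUpperCase(turkish).equals("\u0130"));    // true
        // And Turkish I lower-cases to U+0131, the dotless i.
        System.out.println("I".toLowerCase(turkish).equals("\u0131"));    // true

        // German sharp s has no single-character upper case in the default mapping.
        System.out.println("stra\u00DFe".toUpperCase(Locale.ROOT));       // STRASSE
    }
}
```

For identifiers, protocol keywords and similar machine-readable text, passing an explicit locale such as Locale.ROOT avoids surprises when the code runs on a machine with a Turkish default locale.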

Page 56: Unicode 101

56

Dealing with characters

Programming

Page 57: Unicode 101

57

The key principle

• One internal representation for text
  – used everywhere, with no deviations
  – text from outside must always be converted
• Modern programming languages enforce this
  – char/String vs byte (in Java)
  – Stream vs Reader/Writer (also Java)
• Older languages do not
  – C char has no defined representation

Page 58: Unicode 101

58

In an ideal world

• The internal encoding of strings should not be visible
  – they should just be sequences of Unicode characters
• In practice, this turns out to be difficult
  – string.charAt(): what should this return?
  – string.length(): what should this return?
  – etc.

Page 59: Unicode 101

59

Internal encodings

• C: strings are byte arrays
• C++: bytes, UTF-16, or UTF-32
• Java: UTF-16
• .NET: UTF-16
• Python: UTF-16
• Ruby: ???
• JavaScript: UTF-16 or UCS-2 (poorly defined)
• PHP: strings are byte arrays 1)
• Perl: UTF-8

1) http://kore-nordmann.de/blog/php_charset_encoding_FAQ.html#which-charset-encoding-do-strings-have-in-php

Page 60: Unicode 101

60

Java

• String.charAt(int ix)
  – returns the UTF-16 code unit (word) at that index
• String.codePointAt(int ix)
  – like charAt, but if it’s a high surrogate, returns code point by combining with charAt(ix + 1)
• String.length()
  – returns the number of UTF-16 code units in the string representation

Few programming languages document the behaviour of the String class well enough for this to be clear...

Page 61: Unicode 101

61

How long is a string?

String str = "\u01B5\u0327\u0308";
System.out.println(str.length());

Output is 3, even though there is just a single, combined character.
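There are at least three defensible answers to "how long is this string": UTF-16 code units, code points, and user-perceived characters. The sketch below (class and helper names are illustrative) counts all three. For the last one it uses java.text.BreakIterator, whose exact boundary rules depend on the JDK version, so treat it as an approximation of what users see rather than a definition.

```java
import java.text.BreakIterator;

public class LengthDemo {
    /** Counts user-perceived characters using the JDK's character break iterator. */
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String z = "\u01B5\u0327\u0308"; // Z with stroke + combining cedilla + combining diaeresis
        System.out.println(z.length());                      // 3 UTF-16 code units
        System.out.println(z.codePointCount(0, z.length())); // 3 code points
        System.out.println(graphemes(z));                    // 1 user-perceived character

        String clef = "\uD834\uDD1E"; // U+1D11E as a surrogate pair
        System.out.println(clef.length());                         // 2 UTF-16 code units
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
        System.out.println(Integer.toHexString(clef.charAt(0)));      // d834: just the high surrogate
        System.out.println(Integer.toHexString(clef.codePointAt(0))); // 1d11e: the full code point
    }
}
```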

Page 62: Unicode 101

62

Find the bug

import java.io.*;

public class Cat {
  public static void main(String[] argv) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(argv[0]));
    String line = in.readLine();
    while (line != null) {
      System.out.println(line);
      line = in.readLine();
    }
  }
}

Q: What encoding are we reading?
A: We have no idea.

Page 63: Unicode 101

63

How to solve

• Either find some way to auto-detect encoding
  – requires you to know the syntax of the file
  – and that syntax to have auto-detect rules
• Or find some way for the user to specify the encoding
  – for example a command-line parameter
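A sketch of the second option, with the encoding passed as a command-line parameter. The class name CatWithEncoding and the readLines helper are made up for illustration; the point is only that the charset is named explicitly at the place where bytes become characters, instead of being left to the platform default.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CatWithEncoding {
    /** Reads all lines from the stream, decoding the bytes with the given charset. */
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<String>();
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] argv) throws IOException {
        if (argv.length != 2) {
            System.err.println("usage: java CatWithEncoding <file> <encoding>");
            return;
        }
        Charset charset = Charset.forName(argv[1]); // e.g. UTF-8 or ISO-8859-1
        try (InputStream in = new FileInputStream(argv[0])) {
            for (String line : readLines(in, charset)) {
                System.out.println(line);
            }
        }
    }
}
```

The output side has the same problem in mirror image: System.out encodes with whatever the platform considers the console encoding, which this sketch leaves alone.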

Page 64: Unicode 101

64

Find the bug, 2

public HttpResponse get(String request) throws IOException {
  InputStream responseContent = null;
  HttpGet httpGet = new HttpGet(request);
  HttpResponse response = new DefaultHttpClient().execute(httpGet);
  responseContent = response.getEntity().getContent();
  return new HttpResponse(response.getStatusCode(),
                          response.getStatusLine().getReasonPhrase(),
                          read(new InputStreamReader(responseContent, "UTF-8")));
}

Q: How do we know this is UTF-8?
A: We don’t.

Page 65: Unicode 101

65

How to solve

• Need to look at the Content-Type header
  – “Content-Type: text/plain; charset=iso-8859-1”
• Occasionally, it can be even harder
  – MIME-type specific rules for deciding charset if not specified in the response
• Best solution (for a generic Get class) is to return the stream (not the reader)
  – and provide enough info for clients to figure out the encoding
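Pulling the charset out of a Content-Type value might look roughly like this. The helper charsetOf is hypothetical and deliberately simplistic: it is not a full MIME header parser, and the caller still has to supply a fallback that follows the rules for the MIME type in question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeDemo {
    /**
     * Returns the charset named in a Content-Type header value, or the
     * fallback if there is no charset parameter or the name is not supported.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the MIME type itself
            String param = parts[i].trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim();
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1); // strip optional quotes
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) { // illegal or unsupported name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=iso-8859-1", StandardCharsets.UTF_8)); // ISO-8859-1
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));                      // UTF-8
    }
}
```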

Page 66: Unicode 101

66

URI vs IRI

• Originally, the character encoding of URIs was not defined
  – characters must be ASCII, or %-escaped
  – however, character set of %-escapes not defined
  – this was RFC 2396
• Then, RFC 3986 defined %-escapes as being UTF-8
  – explicit characters must still be ASCII only
• RFC 3987 introduced IRIs
  – here, everything is UTF-8
  – non-ASCII characters do not need to be escaped
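The ambiguity of %-escapes shows up as soon as you decode them. The sketch below uses java.net.URLEncoder and URLDecoder, which strictly speaking implement HTML form encoding (a space becomes +) rather than generic URI escaping, so take it only as an illustration of the byte-level idea. The class and helper names are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class PercentEscapeDemo {
    /** %-escapes non-ASCII characters as UTF-8 bytes. */
    static String escape(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Decodes %-escapes, interpreting the bytes in the given charset. */
    static String unescape(String s, String charsetName) {
        try {
            return URLDecoder.decode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String word = "bl\u00E5b\u00E6r"; // Norwegian for blueberry
        System.out.println(escape(word)); // bl%C3%A5b%C3%A6r
        // The same word escaped by an 8859-1 based system looks different,
        // and nothing in the escaped form says which charset was used.
        System.out.println(unescape("bl%C3%A5b%C3%A6r", "UTF-8").equals(word)); // true
        System.out.println(unescape("bl%E5b%E6r", "ISO-8859-1").equals(word));  // true
    }
}
```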

Page 67: Unicode 101

67

How to distinguish IRIs and URIs

• Well, uh, you can’t, really...
  – if there are non-ASCII characters it’s probably an IRI
• Specifications can decide to support IRIs
  – for example, in XTM it’s all IRIs
• So the context can tell you, in some cases

If it sounds like a mess, that’s because it is...

Page 68: Unicode 101

68

XML and Unicode

• One of the good things about XML is that it gets Unicode right
  – text coming out of an XML parser is always Unicode
  – unless the author has made a stupid mistake, there will be no character encoding problems
• Does this via
  – syntax for declaring encoding in the document,
  – careful, explicit rules for detecting encoding,
  – escape syntax for Unicode characters, and
  – well-designed rules for Unicode characters in identifiers and elsewhere

Page 69: Unicode 101

69

Representing characters in XML

• Don’t ever use entity references
  – &aring; and similar are the spawn of the devil
  – they require the DOCTYPE to be downloaded
• Use UTF-8 as the character encoding
  – and declare it in the <?xml declaration, or omit the declaration entirely
  – this way all characters can be expressed directly
• If you absolutely must use stupid tricks, use &#XXXX; character references
  – it’s better to avoid these, but in special cases (human authoring) they can be useful
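That an XML parser hands back the same Unicode text whatever the byte encoding can be checked with the parser bundled with the JDK. In this sketch the class name and the rootText helper are illustrative: the same word is parsed from an ISO 8859-1 document, a UTF-8 document, and an all-ASCII document that uses character references.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlEncodingDemo {
    /** Parses the bytes as XML and returns the text content of the root element. */
    static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String word = "bl\u00E5b\u00E6r";
        byte[] latin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>" + word + "</t>")
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?><t>" + word + "</t>")
                .getBytes(StandardCharsets.UTF_8);
        byte[] refs = "<t>bl&#xE5;b&#xE6;r</t>".getBytes(StandardCharsets.US_ASCII);

        // Three different byte sequences, one and the same Unicode string.
        System.out.println(rootText(latin1).equals(word)); // true
        System.out.println(rootText(utf8).equals(word));   // true
        System.out.println(rootText(refs).equals(word));   // true
    }
}
```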