Must Know about Unicode

1

Must Know about Unicode

Vinson Hsieh

2

If you don't know what encoding the string you were handed is in, you really shouldn't be writing code until you do.

3

ASCII, ANSI

Unicode

4

How the world evolved

• When Unix was being invented and K&R (Brian Kernighan and Dennis Ritchie) were writing The C Programming Language, everything was very simple.

• The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII, which was able to represent every character using a number between 32 and 127. This could conveniently be stored in 7 bits.

• Codes below 32 were called unprintable. They were used for control characters, like 7, which made your computer beep, and 12, which caused the current page of paper to go flying out of the printer and a new one to be fed in.

http://www.robelle.com/library/smugbook/ascii.html

5

ASCII

The lower 128 codes (0-127) are the most often used. Early email systems in fact would only allow you to transmit characters 0-127 (i.e. "7-bit text").
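To make "7-bit text" concrete, here is a minimal Python sketch that checks whether a byte string stays within codes 0-127. The helper name `is_seven_bit` is hypothetical and only for illustration.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte is in the ASCII range 0-127."""
    return all(b < 0x80 for b in data)


plain = b"Hello, world"
accented = "caf\u00e9".encode("latin-1")  # the final byte is 0xE9, above 127

print(is_seven_bit(plain))
print(is_seven_bit(accented))
```

A mail system limited to 7-bit text could carry the first string unchanged but not the second.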

6

Plain text = ASCII = Characters 8 bits

• Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare.

• "Gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.

7

• The IBM-PC had something that came to be known as the OEM character set, which provided some accented characters for European languages and a bunch of line-drawing characters (used to draw tables in the DOS era).

IBM PC Code Page 850

8

Buying PCs outside of America

• For example, on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג). In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn’t even reliably interchange Russian documents.
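The same effect can be reproduced with Python's built-in codecs. This is a small sketch, assuming `cp437` and `cp850` stand in for the "some PCs" case and `cp862` for the Hebrew DOS code page; the slide does not name the code pages itself.

```python
raw = bytes([130])  # one byte, value 130 (0x82)

# The byte is identical; only the code page used to interpret it changes.
decoded = {codec: raw.decode(codec) for codec in ("cp437", "cp850", "cp862")}

for codec, char in decoded.items():
    print(codec, char, f"U+{ord(char):04X}")
```

The byte on disk never changes; which letter you see depends entirely on the code page the reader assumes.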

9

ANSI standard

• Eventually this OEM free-for-all got codified in the ANSI standard.

• Everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived.

• These different systems (one per country or organization) were called code pages.

http://www.i18nguy.com/unicode/codepages.html

10

Codes 128 to 255 give only 128 slots. How could that ever be enough for Chinese characters? Big5?

11

DBCS

• Asian alphabets have thousands of letters.

• This was usually solved by the messy system called DBCS, the "double byte character set".

• In Visual C++, MBCS always means DBCS.

• Two bytes give 65,536 combinations, enough for more than sixty thousand characters.

8bits

12

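The three Qin-dynasty primers 《倉頡》, 《博學》 and 《爰歷》 together contained 3,300 characters. In the Han dynasty, Yang Xiong's 《訓纂篇》 had 5,340, and by Xu Shen's 《説文解字》 the count was 9,353. From the Jin and Song periods onward, the number kept growing. According to the Tang-dynasty writer Feng Yan's 《聞見記文字篇》, Lü Chen's 《字林》 (Jin) had 12,824 characters, Yang Chengqing's 《字統》 (Later Wei) had 13,734, and Gu Yewang's 《玉篇》 (Liang) had 16,917. Sun Qiang's enlarged Tang edition of 《玉篇》 had 22,561. Sima Guang's 《類篇》 (Song) reached 31,319, and the Qing-dynasty 《康熙字典》 had more than 47,000. The 1915 《中華大字典》 by Ouyang Bocun and others had more than 48,000. Morohashi Tetsuji's 《大漢和辭典》 (Japan, 1959) collected 49,964. The 1971 《中文大辭典》 edited by Zhang Qiyun had 49,888, and the 1990 《漢語大字典》 edited by Xu Zhongshu had 54,678. The 1994 《中華字海》 by Leng Yulong and others is even more startling, at 85,000 characters. Fortunately, the vast majority of the characters in dictionaries like 《中華字海》 are "dead characters": they existed historically but are no longer used in today's written language.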

13

Shift-JIS Kanji Table

Multibyte character sets take advantage of the fact that only the first 128 characters of the ASCII set are commonly used (codes 0-127 in decimal, or 0x00-0x7f in hex). When parsing Shift-JIS, if you get a byte in the range 0x80-0xff, you know it is the first byte of a two-byte sequence. Otherwise, it is a single byte of regular ASCII. (Strictly speaking, Shift-JIS lead bytes are 0x81-0x9f and 0xe0-0xfc; bytes 0xa1-0xdf are single-byte half-width katakana.)

http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
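The lead-byte rule can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and it assumes the commonly documented Shift-JIS lead-byte ranges 0x81-0x9F and 0xE0-0xFC rather than the whole upper half of the byte range.

```python
def is_sjis_lead_byte(b: int) -> bool:
    """Assumed Shift-JIS lead-byte ranges; other bytes stand alone."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_sjis(data: bytes) -> list:
    """Cut a Shift-JIS byte string into one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        width = 2 if is_sjis_lead_byte(data[i]) else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


encoded = "日本語ABC".encode("shift_jis")
chars = [chunk.decode("shift_jis") for chunk in split_sjis(encoded)]
print(chars)
```

Note that the second byte of a pair may itself fall in the ASCII range, which is why the parser has to walk from a known character boundary instead of inspecting bytes in isolation.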

14

http://www.sqlsnippets.com/en/topic-13410.html

Character based applications use whichever code page is set as the active "OEM" (aka "MS-DOS") code page and Win32 applications use whichever code page is set as the active "ANSI" code page. (Note that Windows "ANSI" code pages do not necessarily map to official ANSI standard character sets.)

cp437

15

ASCII = OEM character sets = MS-DOS

ANSI = MBCS = DBCS = Windows

16

Can’t type Chinese now

17

Python Win32 Console (DBCS)

我愛你 in Big5 should be \xa7\xda\xb7\x52\xa7\x41

In ASCII, 0x52 = R and 0x41 = A, so Python displays it as \xa7\xda\xb7R\xa7A

(Bytes up to 0x7F are displayed as the corresponding ASCII characters 0-127.)

Big5 Code Table (excerpt; the column header is the low hex digit)

Row    0  1  2  3  4  5  6  7  8  9  a
A740   作 你
A7D0   役 忘 忌 志 忍 忱 快 忸 忪 戒 我
B750   感 想 愛

It looks like \x takes the two hex digits that follow it and combines them into a single byte.
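The console output above can be reproduced with Python 3's built-in `big5` codec. This is a sketch assuming Python 3, where the encoded result is a `bytes` object; the slide itself shows a Python 2 Win32 console session.

```python
text = "我愛你"
encoded = text.encode("big5")

# repr() shows printable ASCII bytes as characters, everything else as \xNN.
print(encoded)  # b'\xa7\xda\xb7R\xa7A'

# The same six bytes written out as hex pairs.
hex_pairs = " ".join(f"{b:02x}" for b in encoded)
print(hex_pairs)  # a7 da b7 52 a7 41
```

The `R` and `A` in the first output are not separate letters in the text; they are the second bytes of 愛 and 你 that happen to fall in the ASCII range.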

18

The 「許功蓋」 problem (DBCS)

Most common characters: 功 餐 許 蓋 閱
Next most common: 擺 珮 豹 枯 淚 穀 愧

Second byte 0x5C, ASCII "\":
A45C 么  A55C 功  A65C 吒  A75C 吭  A85C 沔  A95C 坼  AA5C 歿
AB5C 俞  AC5C 枯  AD5C 苒  AE5C 娉  AF5C 珮  B05C 豹  B15C 崤
B25C 淚  B35C 許  B45C 廄  B55C 琵  B65C 跚  B75C 愧  B85C 稞
B95C 鈾  BA5C 暝  BB5C 蓋  BC5C 墦  BD5C 穀  BE5C 閱  BF5C 璞
C05C 餐  C15C 縷  C25C 擺  C35C 黠  C45C 孀  C55C 髏  C65C 躡

Second byte 0x7C, ASCII "|":
A47C 弋  A57C 四  A67C 帆  A77C 坑  A87C 育  A97C 尚  AA7C 泜
AB7C 咽  AC7C 洱  AD7C 迢  AE7C 徑  AF7C 砝  B07C 院  B17C 悴
B27C 琍  B37C 逖  B47C 揉  B57C 稅  B67C 閏  B77C 會  B87C 腮
B97C 頌  BA7C 漏  BB7C 誡  BC7C 慝  BD7C 罵  BE7C 魯  BF7C 糕
C07C 嚐  C17C 舉  C27C 甕  C37C 牘  C47C 辮  C57C 疊  C67C 鸛

http://www.khngai.com/chinese/charmap/tblbig.php

Python displays '\' as '\\', which is nice, because it can be mapped back to 0x5C.
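A short Python sketch of why these characters cause trouble: each of 許, 功 and 蓋 ends in byte 0x5C, so any byte-level code that treats a backslash as special cuts the characters in half. The variable names are illustrative only.

```python
encoded = "許功蓋".encode("big5")

# Every second byte of these three characters is 0x5C, the ASCII backslash.
trail_bytes = [encoded[i] for i in range(1, len(encoded), 2)]
print([hex(b) for b in trail_bytes])

# Splitting the raw bytes on backslash tears every character apart...
naive_parts = encoded.split(b"\\")

# ...while decoding first keeps the text intact, because none of the
# decoded characters is a backslash.
safe_parts = encoded.decode("big5").split("\\")

print(naive_parts)
print(safe_parts)
```

The general rule is to decode to Unicode before doing any character-level processing, and to encode again only at the output boundary.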

19

Shift-JIS Kanji Table 5C/7C

http://www.chi2ko.com/jingyan/shiftjis2uni.htm

20

What about moving strings to another PC?

• Of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down.

• The Win95/98 era

21

Windows 98

It has a 16-bit Windows heritage
– Almost everything uses ANSI strings

22

Unicode

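• Unicode is only a standard for characters and their code numbers. It does not define how they are actually stored on a computer. The Unicode Consortium therefore defined a set of transformation formats for storing Unicode, designed with compatibility with other encodings in mind, called UTF (Unicode/UCS Transformation Format): UTF-8, UTF-16 and UTF-32.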

23

Unicode Code Point Chart

• U+0000 to U+007F: Basic Latin
• U+0080 to U+00FF: Latin-1 Supplement
• U+0100 to U+017F: Latin Extended-A
• U+0180 to U+024F: Latin Extended-B
• U+0250 to U+02AF: IPA Extensions
• U+02B0 to U+02FF: Spacing Modifier Letters
• U+0300 to U+036F: Combining Diacritical Marks
• U+0370 to U+03FF: Greek and Coptic
• U+0400 to U+04FF: Cyrillic
• U+0500 to U+052F: Cyrillic Supplement
• U+0530 to U+058F: Armenian
• U+0590 to U+05FF: Hebrew
• U+0600 to U+06FF: Arabic
• U+0700 to U+074F: Syriac
• U+0750 to U+077F: Arabic Supplement
• U+0780 to U+07BF: Thaana
• U+0900 to U+097F: Devanagari
• …

http://inamidst.com/stuff/unidata/


24

Unicode terminology

Sample Unicode Symbols

03A0 Π Greek Capital Letter Pi

03A3 Σ Greek Capital Letter Sigma

03A9 Ω Greek Capital Letter Omega

notation U+NNNN

uni = {U+03A0} + {U+03A3} + {U+03A9} (ΠΣΩ)
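In Python, the U+NNNN notation maps directly onto `\uNNNN` escapes, and the standard `unicodedata` module can report each symbol's official name. A minimal sketch, assuming Python 3:

```python
import unicodedata

uni = "\u03a0\u03a3\u03a9"  # {U+03A0} + {U+03A3} + {U+03A9}

# Collect (code point notation, symbol, official name) for each character.
rows = [(f"U+{ord(ch):04X}", ch, unicodedata.name(ch)) for ch in uni]

for notation, symbol, name in rows:
    print(notation, symbol, name)
```

At this level `uni` is a sequence of three abstract symbols; no byte representation has been chosen yet.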

25

Now, even though we know exactly what 'uni' represents (ΠΣΩ) note that there is no way to:

1. Print uni to the screen.
2. Save uni to a file.
3. Add uni to another piece of text.
4. Tell me how many bytes it takes to store uni.

26

Valid encodings of Ω

Encoding name                                     Binary representation
ISO-8859-7 (OEM/ASCII), "native" Greek encoding   \xD9
UTF-8                                             \xCE\xA9
UTF-16 (little-endian, with BOM)                  \xFF\xFE\xA9\x03
UTF-32 (little-endian, with BOM)                  \xFF\xFE\x00\x00\xA9\x03\x00\x00

You should think of Unicode as symbols (Ω), not as bytes.
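The byte sequences listed for Ω can be checked with Python's codecs. This sketch pins the UTF-16 and UTF-32 forms to little-endian and prepends the byte order mark explicitly, because the plain `utf-16` and `utf-32` codecs choose the platform's native byte order.

```python
import codecs

omega = "\u03a9"  # Ω

encodings = {
    "ISO-8859-7": omega.encode("iso-8859-7"),
    "UTF-8": omega.encode("utf-8"),
    # BOM + little-endian payload, written out explicitly.
    "UTF-16": codecs.BOM_UTF16_LE + omega.encode("utf-16-le"),
    "UTF-32": codecs.BOM_UTF32_LE + omega.encode("utf-32-le"),
}

for name, data in encodings.items():
    print(f"{name:<12}{len(data)} bytes  {data!r}")
```

One symbol, four different byte strings of four different lengths: the symbol is the constant, the bytes are a choice.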

27

Converting Unicode symbols to Python literals

Pseudocode:
uni = 'abc_' + {U+03A0} + {U+03A3} + {U+03A9} + '.txt'

Here is how you make that string in Python:
uni = u"abc_\u03a0\u03a3\u03a9.txt"

Pseudocode:
uni = {U+1A} + {U+BC3} + {U+1451} + {U+1D10C}

Python (\u takes exactly four hex digits, \U exactly eight):
uni = u'\u001a\u0bc3\u1451\U0001d10c'

Python (the hex digits are case-insensitive):
uni = u'\u001A\u0BC3\u1451\U0001D10C'
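A quick check of the two spellings, assuming Python 3, where every `str` is a Unicode string and the `u` prefix used in the slide is optional:

```python
lower = "\u001a\u0bc3\u1451\U0001d10c"
upper = "\u001A\u0BC3\u1451\U0001D10C"

# Both literals produce the same four code points.
code_points = [f"U+{ord(ch):04X}" for ch in lower]

print(lower == upper)
print(len(lower))
print(code_points)
```

The last character lies above U+FFFF, which is why it needs the eight-digit `\U` form instead of `\u`.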

28

Codecs

• Unicode objects have no fixed computer representation.

• Before a Unicode object can be printed, stored to disk, or sent across a network, it must be encoded into a fixed computer representation. This is done using a codec. Some popular codecs you may have come across in day-to-day work: ASCII, ISO-8859-7, UTF-8, UTF-16.
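A minimal sketch of what "encoding with a codec" looks like in Python 3, using the file name from the earlier slide as sample text. It also shows what happens when the chosen codec cannot represent the text.

```python
text = "abc_\u03a0\u03a3\u03a9.txt"

# The same text, encoded by two different codecs.
as_utf8 = text.encode("utf-8")
as_greek = text.encode("iso-8859-7")

# ASCII has no Greek letters, so this codec refuses the text.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(as_utf8), len(as_greek), ascii_ok)

# Decoding with the matching codec recovers the original text.
print(as_utf8.decode("utf-8") == text)
```

Decoding with a codec other than the one used for encoding either fails or silently yields the wrong characters, which is why the encoding has to travel with the bytes.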

29

The right mental model for conversion

• Converting between ANSI and Unicode
• Big5 → Unicode → UTF-8/16/32
• UTF-8/16/32 → Unicode → Big5
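The two-step path through Unicode can be written as a single helper in Python. This is a sketch; `transcode` is a hypothetical name, not a standard library function.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between encodings by going through Unicode text."""
    text = data.decode(source)   # source encoding -> Unicode
    return text.encode(target)   # Unicode -> target encoding


big5_bytes = b"\xa7\xda\xb7\x52\xa7\x41"  # 我愛你 in Big5

utf8_bytes = transcode(big5_bytes, "big5", "utf-8")
round_trip = transcode(utf8_bytes, "utf-8", "big5")

print(utf8_bytes)
print(round_trip == big5_bytes)
```

There is no direct Big5-to-UTF-8 table involved: Unicode is always the pivot in the middle.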

30

http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G18421

31

Unicode character plane mapping

http://zh.wikipedia.org/wiki/%E5%9F%BA%E6%9C%AC%E5%A4%9A%E6%96%87%E7%A8%AE%E5%B9%B3%E9%9D%A2#.E5.9F.BA.E6.9C.AC.E5.A4.9A.E6.96.87.E7.A7.8D.E5.B9.B3.E9.9D.A2

32

UTF-32 (always 4 bytes)

• Each Unicode code point is represented directly by a single 32-bit code unit.

• UTF-32 is restricted to representation of code points in the range 0..10FFFF (hexadecimal), that is, the Unicode codespace.

• UTF-32 may be a preferred encoding form where memory or disk storage space for characters is no particular concern, but where fixed-width, single code unit access to characters is desired. UTF-32 is also a preferred encoding form for processing characters on most Unix platforms.

33

UTF-16 (2 or 4 bytes)

Code points in the range U+0000..U+FFFF are represented as a single 16-bit code unit; code points in the supplementary planes, in the range U+10000..U+10FFFF, are instead represented as pairs of 16-bit code units. These pairs of special code units are known as surrogate pairs.
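The arithmetic behind a surrogate pair is short enough to write out. The sketch below uses the standard UTF-16 scheme (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); the function name is hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D10C)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own big-endian UTF-16 encoder.
encoded = "\U0001d10c".encode("utf-16-be")
print(encoded.hex())
```

Code points in U+0000..U+FFFF never go through this function; they are stored as a single 16-bit unit.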

34

UTF-8 (1 to 4 bytes)

The UTF-8 encoding form maintains transparency for all of the ASCII code points (0x00..0x7F). That means Unicode code points U+0000..U+007F are converted to single bytes 0x00..0x7F in UTF-8. Code points U+0080..U+07FF are represented by two bytes, all non-surrogate code points between U+0800 and U+FFFF are represented by three bytes, and supplementary code points above U+FFFF require four bytes.

Unihan (unified Han characters) merges the Chinese, Japanese and Korean ideographs and places them in the ranges U+3400 to U+9FFF and U+F900 to U+FAFF.
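The length rule can be expressed as a small lookup and compared with what Python's UTF-8 encoder actually produces. A sketch with a hypothetical helper name; surrogate code points, which are not valid in UTF-8, are not handled.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a (non-surrogate) code point."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


samples = ["A", "\u03a9", "\u6211", "\U0001d10c"]  # A, Ω, 我, U+1D10C

predicted = [utf8_length(ord(ch)) for ch in samples]
actual = [len(ch.encode("utf-8")) for ch in samples]

print(predicted)
print(actual)
```

Chinese text therefore costs three bytes per character in UTF-8, compared with two in Big5 or UTF-16.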

35

Windows 2000 and Unicode

• All of the core functions for creating windows, displaying text, and manipulating strings require Unicode strings.

• An application uses more memory and runs slower if it doesn’t use Unicode from the start, because its ANSI strings have to be converted on every call.

36

Windows CE and Unicode

The machines were going to be sold all over the world
– Windows CE is natively Unicode

A machine with little memory and no disk storage
– The ANSI Windows APIs are not supported

Operating System    Description
Windows 2000        Unicode & ANSI
Windows 98          ANSI only
Windows CE          Unicode only

Since Windows XP, it is recommended that developers build all their applications against the Unicode versions of the APIs. But you may say, "if I do that, my application will not run under Windows 95, 98 and ME, because those Windows versions do not support the Unicode APIs". This is where the Microsoft Layer for Unicode (MSLU) comes in. MSLU is contained in a DLL called "unicows.dll". It is redistributable, so the intention is that you ship it with your executable and place it in the same folder.

37

How to convert between Unicode and ANSI in C++

39

MultiByteToWideChar

http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx

41

Unicode String and ANSI String

44

ANSI Version

45

Converting Big5 to Unicode

46

Convert ANSI to Unicode

47

Glyph Rendering

• Automatic context analysis: There is only one key for Arabic "b". The system automatically selects whether the isolate, initial, medial or final form of "b" is appropriate, and changes this if you e.g. add another character afterwards. Notice that only the letter value "b" is stored on disk, not the form: this is only selected dynamically on display.

http://www.smi.uib.no/ksv/ArabicMac.html#uni

48

Writing Direction (bidirectional)

In Hebrew and Arabic, characters are arranged from right to left into lines, although digits run the other way, making the scripts inherently bidirectional. Left-to-right and right-to-left scripts are frequently used together. In such a case, arranging characters into lines becomes more complex. The Unicode Standard defines an algorithm to determine the layout of a line. See Unicode Standard Annex #9, “The Bidirectional Algorithm,” for more information.
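The bidirectional algorithm starts from a directional class assigned to every character, and Python's `unicodedata` module exposes that class. The sketch below only inspects the classes; it does not implement the layout algorithm itself.

```python
import unicodedata

samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05d0",
    "Arabic beh": "\u0628",
    "digit 1": "1",
}

# L = left-to-right, R = right-to-left, AL = Arabic letter (right-to-left),
# EN = European number.
bidi_classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, bidi_class in bidi_classes.items():
    print(f"{label:<12}{bidi_class}")
```

Digits carry their own class, which is how a renderer knows to lay them out left to right inside a right-to-left line.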

http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G18421

letters, punctuation, symbols, and diacritics

49

http://en.wikipedia.org/wiki/Help:Arabic

50

Sequence of Base Characters and Diacritics

The sequence of Unicode characters U+0061 “a” + U+0308 + U+0075 “u” unambiguously encodes “äu” not “aü”.
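This rule can be verified with Unicode normalization in Python: composing the sequence attaches the diaeresis to the base character before it, not the one after it. A minimal sketch using the standard `unicodedata` module:

```python
import unicodedata

decomposed = "a\u0308u"  # U+0061, U+0308 (combining diaeresis), U+0075

# NFC folds a base character and its following combining marks
# into a single precomposed character where one exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print([f"U+{ord(ch):04X}" for ch in composed])
```

The two strings look the same on screen but compare unequal until they are normalized to the same form.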

54

http://unicode.org/reports/tr9/

Unicode Bidirectional Algorithm

56

我 - U+6211
愛 - U+611B
你 - U+4F60

http://blog.163.com/guoo1230@126/blog/static/321155112011328102542586/

Why?

U+0000 to U+007F: Basic Latin
U+0370 to U+03FF: Greek and Coptic
U+1400 to U+167F: Unified Canadian Aboriginal Syllabics
U+4E00 to U+9FFF: CJK Unified Ideographs
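The "why" comes down to which block a code point falls into. A small Python sketch using only the four ranges listed on this slide; the `BLOCKS` table and `block_of` helper are hypothetical and far from a complete block list.

```python
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x1400, 0x167F, "Unified Canadian Aboriginal Syllabics"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def block_of(ch: str) -> str:
    """Name of the listed block containing ch, or 'unknown'."""
    code_point = ord(ch)
    for first, last, name in BLOCKS:
        if first <= code_point <= last:
            return name
    return "unknown"


code_points = [f"U+{ord(ch):04X}" for ch in "我愛你"]
blocks = [block_of(ch) for ch in "我愛你"]

print(code_points)
print(blocks)
```

All three characters land in the CJK Unified Ideographs block, regardless of which legacy encoding (Big5, GBK, Shift-JIS) the text originally came from.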

57

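• UTF encodings have one advantage: although the number of bytes per character varies, you do not have to scan from the start of the text to locate a character correctly, as you do with GB2312/GBK. In a UTF encoding, a fairly fixed algorithm tells you from the current position whether the current byte is the start of a code point, which makes locating characters relatively simple. The simplest case of all is UTF-32, which needs no scanning whatsoever, at the cost of a considerably larger size.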
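For UTF-8, the "fixed algorithm" is a bit test: continuation bytes always look like 10xxxxxx, so from any offset you can step back to the start of the current code point. A minimal sketch with a hypothetical helper name, assuming the input is valid UTF-8:

```python
def code_point_start(data: bytes, index: int) -> int:
    """Walk back from index to the first byte of the code point covering it."""
    while index > 0 and (data[index] & 0xC0) == 0x80:  # 10xxxxxx = continuation
        index -= 1
    return index


data = "a我Ω".encode("utf-8")  # 61 | E6 88 91 | CE A9

# Map every byte offset to the offset where its code point begins.
starts = [code_point_start(data, i) for i in range(len(data))]
print(starts)

# Landing in the middle of 我 (offset 2) still recovers the whole character.
begin = code_point_start(data, 2)
print(data[begin:begin + 3].decode("utf-8"))
```

The same trick is impossible in Big5 or GBK, where a byte such as 0x5C can be either a whole ASCII character or the second half of a two-byte character.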