Let's Explore Chinese i18n/L10n on GNU/Linux! Anthony Fok, ThizLinux Laboratory Ltd. HKLUG Linux Talk, 13 April 2002

Let's Explore Chinese Internationalization and Localization on GNU/Linux!
Anthony Fok (霍東靈), ThizLinux Laboratory Ltd. (即時系統科研有限公司)
HKLUG Linux Talk, 13 April 2002

Overview

● Introduction to Chinese charsets and encodings
  – GB 18030-2000 and HKSCS-2001
● Chinese i18n/L10n infrastructure on GNU/Linux
● Participating in Chinese i18n/L10n
● Todo list and future developments

Chinese character sets and encodings

In the beginning, there were only 0 and 1

● A computer sees all data as 0s and 1s
● Each “on-off switch” unit is a “bit” (位元、比特)
● 8 bits make up 1 “byte” or “octet” (位元組、字節)
● 0000 0000 to 1111 1111 (0x00 to 0xFF) make up 256 code points
● Initially, each character was stored in 1 byte
  – ASCII (ISO 646 IRV)
  – ISO 8859-1 to ISO 8859-16 (Latin1, Latin2, Greek, Hebrew, Thai, Cyrillic, etc.)
  – 256 code points are NOT enough for Chinese!
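A minimal Python sketch of the single-byte limit described on this slide. The helper name `fits_in_one_byte` is made up for illustration, and ISO 8859-1 is chosen only as a representative single-byte charset.

```python
def fits_in_one_byte(text: str, codec: str = "iso8859-1") -> bool:
    """Return True if every character of text maps to exactly one byte in codec."""
    try:
        encoded = text.encode(codec)
    except UnicodeEncodeError:
        # At least one character has no code point in this charset.
        return False
    return len(encoded) == len(text)


if __name__ == "__main__":
    for sample in ("Linux", "caf\u00e9", "中文"):
        print(sample, fits_in_one_byte(sample))
```

Western European text fits, but the Chinese sample does not, which is why the multi-byte encodings on the following slides exist.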

So many charsets and encodings!

● The number of Chinese (Han) characters that have ever existed exceeds 100,000
  – Unicode 3.2 / ISO 10646 includes over 70,000
  – CCCII includes over 75,000
  – Invented in China; adopted by Japan, Korea, and Vietnam: “CJKV”
  – Sources include:
    ● 漢語大字典 (Hanyu Da Zidian)
    ● 康熙字典 (Kangxi Zidian)
    ● Regional standards (GB, CNS, HKSCS, JIS, KSC)

1 byte not enough? Let's use more!

● If all bits are available:
  – 1 byte, 8 bits, 2^8 = 256 (0x00..0xFF)
  – 2 bytes, 16 bits, 2^16 = 65,536 (0x0000..0xFFFF)
  – 3 bytes, 24 bits, 2^24 = 16,777,216 (0x000000..0xFFFFFF)
  – 4 bytes, 32 bits, 2^32 = 4,294,967,296 (0x00000000..0xFFFFFFFF)
● Most legacy encodings must remain ASCII-compatible, so they cannot use all of that space
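The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical. The restricted range 0xA1..0xFE is just one common way to stay clear of ASCII bytes, assumed here for illustration.

```python
def code_points(n_bytes: int) -> int:
    """Number of distinct values when every bit of n_bytes is usable."""
    return 2 ** (8 * n_bytes)


def restricted_code_points(lo: int = 0xA1, hi: int = 0xFE, n_bytes: int = 2) -> int:
    """Number of values when each byte must stay inside lo..hi (inclusive)."""
    return (hi - lo + 1) ** n_bytes


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, "byte(s):", code_points(n))
    # Two bytes, each limited to 0xA1..0xFE so neither collides with ASCII.
    print("restricted 2-byte space:", restricted_code_points())
```

Keeping both bytes out of the ASCII range shrinks the two-byte space from 65,536 to 94 x 94 = 8,836 positions.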

GB 2312-80

● GB 2312 is a national standard (Guobiao, 國標) of mainland China
  – 《信息技術 信息交換用漢字編碼字符集 基本集》 (Chinese coded character set for information interchange, basic set), published in 1980
  – 2-byte, {0xA1-0xFE}{0xA1-0xFE}, or 94x94, for a total of 8,836 possible 2-byte code points
  – 6,763 Han characters, for a total of 7,445 chars
    ● Sidenote: GB 12345-T provides a Traditional Chinese charset encoded in the same space as GB 2312-80
● Called zh_CN.GB2312 or zh_CN.EUC-CN on GNU/Linux
  – Too few characters! (For example, 鎔 is missing, so the name 朱鎔基 ends up written as 朱容基)
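The missing-character problem can be reproduced with Python's built-in codecs. This is a sketch under two assumptions: `can_encode` is a hypothetical helper, and the name is written in its Simplified form (朱镕基), since GB 2312 is a Simplified Chinese charset.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


name = "朱镕基"  # Simplified form of the name on the slide (assumption)

# Characters of the name that GB 2312 cannot represent.
missing_in_gb2312 = [ch for ch in name if not can_encode(ch, "gb2312")]

if __name__ == "__main__":
    print("missing in GB 2312:", missing_in_gb2312)
    print("encodable in GBK:", can_encode(name, "gbk"))
```

The same check passes once the larger GBK repertoire is used.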

GBK Specification

● China actively participates in ISO 10646
● GB13000.1 = Unicode 2.1 (ISO 10646-1993)
● Too many legacy GB2312 applications
● Need a migration plan, an intermediate solution
● GBK is the first step in that direction (1995)
● Includes the repertoire of the CJK Unified Ideographs in GB13000.1 / Unicode 2.1
  – U+4E00 to U+9FA5, over 20,000 Han ideographs
● Backward compatible with GB2312
● Implemented in Windows 95 (Simplified Chinese) as CP936
● Byte ranges: {0x81-0xFE}{0x40-0x7E, 0x80-0xFE}
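The byte ranges quoted above translate directly into a structural check. The sketch below uses hypothetical names and only tests whether a byte pair falls inside the GBK two-byte code space. It does not say whether a character is actually assigned there.

```python
def is_gbk_double_byte(lead: int, trail: int) -> bool:
    """Check a byte pair against {0x81-0xFE}{0x40-0x7E, 0x80-0xFE}."""
    lead_ok = 0x81 <= lead <= 0xFE
    trail_ok = 0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE
    return lead_ok and trail_ok


# Count every position in the two-byte code space by brute force.
GBK_CODE_SPACE = sum(
    1
    for lead in range(0x100)
    for trail in range(0x100)
    if is_gbk_double_byte(lead, trail)
)

if __name__ == "__main__":
    print("two-byte positions:", GBK_CODE_SPACE)
    encoded = "镕".encode("gbk")
    print(encoded.hex(), is_gbk_double_byte(encoded[0], encoded[1]))
```

The count works out to 126 lead bytes times 190 trail bytes, comfortably more than the 8,836 positions of GB 2312.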

Big-5 (「五大碼」)

● A “round-table” standard made up by the “Big-5” companies in Taiwan
● Implemented by all major Chinese OSes
  – 倚天 (ETen), 零一, 國喬, Traditional Chinese Windows, etc.
● Not very well designed; the selection of characters was not well standardized
  – Two characters are duplicated
  – Missing 「 」 and other chars used in HK
  – In Taiwan, attempts to fix/extend Big5 basically failed (CMEX's Big-5+, Big-5E...)

First steps beyond Big-5

– 倚天 (ETen) added some characters (Hiragana, Katakana, 「裏、銹」, etc.); some call it Big5-ETen. It is the de facto Big5 standard on GNU/Linux
– Microsoft Code Page 950 includes 「裏、銹」 etc., but not all of ETen's extensions
● User-Defined Areas (UDA), Vendor-Defined Areas (VDA), EUDC (End-User Defined Characters), Private Use Areas (PUA)
  – Different people use EUDC differently... a messy situation
  – The demise of CMEX's Big-5+ standard

Unicode / ISO 10646

● Unicode Consortium (Industry)
● ISO/IEC 10646 (Academic/Int'l Standard)
● The two joined efforts to produce Unicode / UCS
  – Universal Multiple-Octet Coded Character Set
  – ISO: Design, adding characters to repertoire
  – Unicode Consortium: Technical implementation
● Code range: U+0000 to U+10FFFF
  – 1,114,112 possible code points

Unicode / ISO 10646

● Think “integers”: UCS2, UCS4
● Think “strings”
  – UTF-7
  – UTF-8
    ● Variable width, 1 to 4 bytes (up to
  – UTF-16
    ● 16-bit code units; characters beyond U+FFFF use a surrogate pair (U+D800-U+DFFF, a high and a low surrogate together), reaching up to U+10FFFF
  – UTF-32
    ● Fixed width 32-bit; the UCS-4 code space extends to 0x7FFFFFFF, but Unicode limits UTF-32 to U+10FFFF
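To make the difference between the encoding forms concrete, here is a small Python sketch. The helper names are made up for illustration, and the big-endian codec variants are used only so that no byte-order mark is added to the counts.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point beyond U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000           # 20 significant bits
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


def encoded_lengths(ch: str) -> dict[str, int]:
    """Byte length of one character in each Unicode encoding form."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


if __name__ == "__main__":
    for ch in ("A", "中", chr(0x20000)):
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))
    print([hex(unit) for unit in surrogate_pair(0x20000)])
```

U+20000, the first CJK Extension B code point, needs four bytes in all three forms; in UTF-16 those four bytes are a surrogate pair.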

Unicode / ISO 10646

● ISO 10646-1:1993
● ISO 10646-1:2000
● ISO 10646-2:2001
● Unicode 3.2 just came out
● More world languages are being researched and added; a truly worldwide effort

HKSCS-2001 (香港增補字符集-2001)

– A brief history
  ● GCCS (政府通用字庫, Government Common Character Set), 1995
  ● HKSCS-1999
    – Official encoding name: BIG5-HKSCS (IANA Registry)
  ● HKSCS-2001
– Actively promoted by ITSD
– ITSD (HKSARG) wishes HKSCS-2001 to be implemented on GNU/Linux too, and actively assists the community by providing guidance and advice
– Excellent official website, open standard (starts from http://www.digital21.gov.hk/eng/hkscs/)

Sample HKSCS Chinese Text

● 大家好!你同我一齊玩!
● 李、仔、魚涌、深水
● 大廈 /有啊!
● (...and there seem to be five swear-word characters as well) Hehe...

GB 18030-2000

● GB 18030-2000 Standard
● Rationale for a new standard: the 70,207+ unified Han ideographs in Unicode 3.1 won't all fit in the 2-byte codespace of the GBK specification
  – Full name: 《信息技術 信息交換用漢字編碼字符集 基本集的擴充》 (Information technology: Chinese ideograms coded character set for information interchange, extension for the basic set) (2000-03-17, 2000-11-30)
  – Further extends GBK to add a 4-byte codespace
    ● More than enough to cover U+0000 to U+10FFFF
    ● Compatible with all future versions of ISO 10646
    ● Backward compatible with GB2312 and GBK

GB 18030-2000

● Why is GB18030 significant?
  – It solves a pressing issue in China. Finally, all people's names, geographic names, and ancient text can be properly processed
  – It is mandatory: all operating systems sold after 2001-08-31 must support GB18030
  – Products must pass GB18030 certification to ensure proper input, editing, screen display, and printing of GB18030 text
  – Thiz Linux Desktop was awarded A+ Grade in GB18030 Certification Test!

GB 18030-2000

● 1-byte = ISO 646-IRV (US-ASCII)
  – {0x00-0x7F}
● 2-byte =~ GBK
  – {0x81-0xFE}{0x40-0x7E, 0x80-0xFE}
● 4-byte
  – Mapped linearly with Unicode while skipping all existing mappings
  – Can be calculated algorithmically
  – {0x81-0xFE}{0x30-0x39}{0x81-0xFE}{0x30-0x39}
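As an illustration of “can be calculated algorithmically”, the sketch below handles only the supplementary planes (U+10000 to U+10FFFF), whose four-byte codes start at 0x90308130 and advance linearly. Four-byte codes for BMP characters additionally depend on the ranges already taken by the two-byte area, which needs a table and is left out here. The function name is hypothetical.

```python
def gb18030_supplementary(code_point: int) -> bytes:
    """Compute the GB18030 four-byte code for a supplementary-plane code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch only covers U+10000..U+10FFFF")
    index = code_point - 0x10000
    # Digits from least to most significant: byte 4 has 10 values (0x30-0x39),
    # byte 3 has 126 (0x81-0xFE), byte 2 has 10, and byte 1 takes what is left.
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


if __name__ == "__main__":
    for cp in (0x10000, 0x20000, 0x10FFFF):
        computed = gb18030_supplementary(cp)
        builtin = chr(cp).encode("gb18030")  # Python's own codec, for comparison
        print(f"U+{cp:X}", computed.hex(), builtin.hex())
```

Comparing against Python's built-in `gb18030` codec gives a quick sanity check of the arithmetic.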

GB 18030-2000

● Official information is hard to find
  – Hard to obtain the printed version of the GB18030 standard outside China
● Fortunately, many early implementers and charset experts have provided info:
  – Dirk Meyer (Adobe) translated the summary
  – Markus Scherer (IBM, Unicode Consortium) provides the gb-18030-2000.xml conversion table
  – Much effort and interest from others, including ThizLinux Laboratory

UnicodeData.txt, Unihan.txt

● UnicodeData.txt
  – Important information on the character repertoires and control codes in Unicode
● Unihan.txt
  – Valuable information (attributes) on over 70,000 CJK Unified Ideographs
    ● Source
    ● Pronunciations in CJKV (+ Cantonese and Mandarin)
    ● Meaning
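A rough sketch of how the Unihan attributes might be read in Python, assuming the tab-separated layout of code point, field name, and value, with `#` comment lines. The sample lines are illustrative and not copied from any Unihan release, and `parse_unihan` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative data in the assumed layout: U+XXXX <tab> field <tab> value
SAMPLE = """\
# sample lines for demonstration only
U+4E2D\tkMandarin\tzhōng
U+4E2D\tkCantonese\tzung1
U+4E2D\tkDefinition\tcentral; center, middle
"""


def parse_unihan(lines):
    """Group Unihan-style records into {character: {field: value}}."""
    table = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        character = chr(int(code[2:], 16))  # drop the leading "U+"
        table[character][field] = value
    return dict(table)


if __name__ == "__main__":
    for character, fields in parse_unihan(SAMPLE.splitlines()).items():
        print(character, fields)
```

For the real file, the same function could be fed an open file object instead of the sample string.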

Difficulties in implementing HKSCS and GB18030

● HKSCS-2001
  – CJK Extension B etc. (U+20000 – U+2FFFF), but not all programs support code points beyond U+FFFF yet
  – Lack of fonts
● GB18030
  – Huge! 4-byte
  – Certification
  – Fonts are available, but expensive (TrueType or bitmap)
● Both are Unicode solutions, so as Unicode support improves, so will HKSCS and GB18030 support

Other Chinese encoding standards

● CCCII (Chinese Character Code for Information Interchange)
  – http://public.ptl.edu.tw/publish/suyan/42/text_07.htm
● CNS 11643
● Big-5+, Big-5E
● Encoding based on Cangjie (倉頡)
● And many more

GNU/Linux and *BSD Chinese localization teams

● CLE (Chinese GNU/Linux Extension)
  – A group of pioneering volunteers originally led by Platin (小虫)
● Debian Chinese Project (Debian 中文計劃)
● FreeBSD Chinese localization group (FreeBSD 中文化小組)
● Translation teams in mainland China, Hong Kong, and Taiwan
● Many more CJKV and i18n/L10n teams worldwide, with both Chinese and non-Chinese members!

Major Chinese GNU/Linux Distributions

● Major Chinese GNU/Linux distributions
  – Thiz Linux Desktop 6.0 (即時 Linux 桌面環境 6.0)
  – Turbolinux 7.0 Chinese edition
  – Chinese 2000 (中文 2000)
  – Xteam (沖浪), Red Flag (紅旗), COSIX (中軟), Happy (幸福), Linpus (百資), XLinux (網虎)
● Well-known international distributions with Chinese support
  – Debian GNU/Linux, Red Hat Linux, Linux Mandrake, (SuSE, Slackware), FreeBSD

GNU C Library (GLIBC)

● Libc5
● Glibc 2.1
● Glibc 2.2
● Conversion tables
  – Big5 (CLE), GBK (Justin Yu, Sean Chen)
  – big5hkscs.c (Roger So, Ulrich Drepper, ThizLinux, James Su)
  – GB18030 (Wu Jian, Ulrich, ThizLinux, James Su, another version by Yu Shao)
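Conversion tables like these are what make charset conversion possible. As a rough analogy only, the sketch below does the same kind of conversion with Python's bundled codecs, which are independent of Glibc's tables. `transcode` is a hypothetical helper.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one charset to another by pivoting through Unicode."""
    return data.decode(source).encode(target)


text = "香港中文"
big5_bytes = text.encode("big5")

# Big5 -> UTF-8, pivoting through Unicode in between.
utf8_bytes = transcode(big5_bytes, "big5", "utf-8")

if __name__ == "__main__":
    print(big5_bytes.hex(), "->", utf8_bytes.hex())
    # GBK text is also valid GB18030, so this conversion leaves the bytes unchanged.
    gbk_bytes = "中文".encode("gbk")
    print(transcode(gbk_bytes, "gbk", "gb18030") == gbk_bytes)
```

Pivoting through Unicode means each charset needs only one table (to and from Unicode) rather than one per pair of charsets.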

XFree86 / X Window System

● XLFD, fontset
● Xrender / Xft (Keith Packard)
● X-TT, “freetype” module
● Addition of Big5-HKSCS encodings (Roger So)
● Addition of GB18030 encoding (James Su et al.)

GTK+ and GNOME

● GNOME 1.x
  – Charset handling based on Glibc and XFree86
  – Good, but not perfect
● GNOME 2.0 (in development)
  – Pango
  – Xft

Qt 3.0.4 and KDE 3.0.1

● Qt comes with its own “codecs” in order to be a multiplatform toolkit
  – Somewhat tedious... the tables already created for Glibc must be re-created for Qt
    ● We cannot directly use Glibc's code because of licensing issues... No big deal, just extra effort
  – Good Unicode support; handles everything as Unicode internally
  – Currently only supports UCS2, which poses challenges for HKSCS-2001

Chinese Input Method Servers

● XCIN
● Chinput
  – miniChinput
  – magicChinput
● 楊春白雪
● MyIM

Chinese input methods

● Cangjie (倉頡)
● Array 30 (行列 30)
● Dayi (大易)
● Wubi (五筆字型)
● Intelligent ABC (智能 ABC), Intelligent Pinyin (智能拼音)
● Mixed (混合)
● Many others

Chinese fonts

● Arphic (文鼎)
  – AR PL Mingti2L Big5
  – AR PL SungtiL GB
  – AR PL KaitiM Big5
  – AR PL KaitiM GB
● DynaLab (華康)
● Founder (方正)
● Ten GNU GPL Chinese fonts by 王漢忠
  – ...unfortunately, the format is not very usable

Web Browsers

● Netscape 4.79
● Mozilla 0.9.9
  – Dillo, Galeon, etc.
● Konqueror

CJK LaTeX and FreeType

● CJK LaTeX, written by Werner Lemberg from Germany
  – Yes, Werner can speak Chinese too! Amazing!
● FreeType 1.3.1 and FreeType 2.0.9:
  – TrueType (and Type1, BDF etc.) font library
  – Main authors: David Turner, Robert Wilhelm, Werner Lemberg

PostScript and PDF

● Ghostscript + CJK (GS-CJK)
● Adobe's CMaps (HKscs, GBK2K, etc.)
● Acrobat Reader 4.05 for Linux does not come with the CMaps (HKscs and GBK2K) that are already in Acrobat Reader 5.0
● Ghostscript and XPDF are constantly improving

Office Suites

– OpenOffice.org family (Thiz Office, Kai Office, Red Office)
  ● Chinese support improving, a joint effort
  ● Excellent i18n/L10n support for all languages
– HancomOffice
  ● Will be based on Qt 3
  ● qbig5hkscscodec.cpp for Qt2 provided by ThizLinux Laboratory; Hancom ported the code for Qt3
– Lightweight: AbiWord and Gnumeric
  ● Quite good too!

How to participate in i18n efforts

● Improve existing infrastructure
● Work on new areas
● Help with localization and translation efforts
● Join a project that you like, whether it is Chinese i18n/L10n related or not
● Help spread the word! :-)

PO Translation

● GNOME 2.0
● KDE 3.0
● GNU Utilities
● Gettext tools
● PO / MO formats
● Usage, encoding issues
● Better to leave a string untranslated than to mistranslate it (寧可不譯,不可誤譯)
● 「非化名的字型」 (literally “non-alias fonts”, apparently a mistranslation of “anti-aliased fonts”; better: 平滑字型, 反鋸齒字型)

Reference websites

– http://cle.linux.org.tw/
– http://xcin.linux.org.tw/
– http://www.debian.org.hk/intl/zh/
– http://linuxfab.cx/
– http://www.linuxforum.net/
– http://www.unicode.org/
– 朱邦復先生工作室 (Mr. Chu Bong-Foo's studio) http://www.cflabs.com/
– http://www.google.com/

TODO

● Some programs still need to be revised to conform to the i18n/L10n infrastructure
● Always room for improvement in terms of ease of use, completeness, and stability
● More participants are welcome

Future Developments and Opportunities

● Handwriting pad (手寫板)
● Voice recognition (語音識別)
● Smarter Cantonese input methods?
● IIIMF to replace XIM?
● OpenType to replace TrueType?
● More interesting Chinese language research based on GNU/Linux systems?

● All skills are useful, even if you are not in CS, CE or EE!
  ● Mathematics, Physics theory
  ● C, C++, Perl, Python, GTK, Qt
    – IPA, Jyutping, Japanese, Korean...
  ● e.g. the author of XCIN studied Physics...
  ● Linguistics (語言學), Phonetics (語音學)
● What we can learn during the process
  – Skills development, learning English, learning other new languages, meeting friends, and many more!

Comments and Suggestions

Questions are welcome! :-)