Unicode: A Grand Tour Character Encodings & Unicode.

Transcript of Unicode: A Grand Tour Character Encodings & Unicode.

  • Slide 1

Unicode: A Grand Tour
Character Encodings & Unicode

Slide 2
This presentation and its associated materials are licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 License. You may use these materials without obtaining permission from the author. Any materials used or redistributed must contain this notice. [Derivative works may be permitted with permission of the author.] This work is copyright 2008 Addison P. Phillips.

Slide 3
Addison Phillips
Globalization Architect, Lab126
This Presentation: Internationalization and Unicode Conference Tutorial

Slide 4
Globalization Architect, Lab126 (Yes, you can touch my Kindle)
Chair, W3C Internationalization WG
Editor, IETF LTRU-WG (BCP 47)
Unicode

Slide 5
"The design and development of a product that is enabled for target audiences that vary in culture, region, or language." [W3C]
A fundamental architectural approach to software development.

Slide 6
I N T E R N A T I O N A L I Z A T I O N: "I", then 18 letters, then "N" = I18N
Opinions differ on capitalization (C12N); choose from: i18N, I18n, I18N
Very geeky; not very internationalized (I19G?)
Localization = L10N
Globalization = G11N
Canonicalization = C14N

Slide 7
The basics of text processing in software.

Slide 8
"Character encodings consume more than 80% of my work day. They are the source of more misinformation and confusion than any other single thing. And developers aren't getting any better educated."
~Glen Perkins, Globalization Architect

Slide 9
Real Jargon: Multibyte; Variable width; Wide character; Character encoding; Coded character set; Bidi or bidirectional; Glyph, character, code unit; Unicode
Potentially Bogus Jargon: kanji; double-byte language; extended ASCII; ANSI, OEM; encoding agnostic

Slide 10
bits: 010000010101101101101000
byte or octet: 01000001 (0x41)
code unit: a unit of physical storage and information interchange
  represent numbers
  come in various sizes (e.g. 7, 8, 16, 32, 64 bits)
How do we map text to the numbers used by computers?

Slide 11
Glyphs: A glyph is a screen unit of text: it's a picture of what users think of as a character. A grapheme is a single visual unit of text.
Characters: A character is a single logical unit of text. A character set is a set of characters. A code point is a number assigned to a character in a character set. A coded character set is a character set where each character has a code point.
Bytes: A character encoding maps a sequence of code points ("characters") to a sequence of code units (such as bytes). A code unit is a single logical unit of storage.
Example: U+00C0 (À) -> 0xC3 0x80

Slide 12
Collection ("repertoire") of characters, that is: a set.
Organized so that each character has a unique numeric (typically integer) value ("code point").
Examples: Unicode; ASCII (ANSI X3.4); ISO 646; JIS X 0208; Latin-1 (ISO 8859-1)

Slide 13
Maps a sequence of code points (characters) to a sequence of code units (e.g. bytes).
Some encodings use another unit instead of the byte. For example, some encodings use a 16-bit, 32-bit, or 64-bit code unit.
U+00C0 -> 0xC3 0x80

Slide 14
All text has a character encoding: in memory, on disk, on the network, etc.
When things go wrong, start by asking what the encoding is, what encoding you expected it to be, and whether the bytes match the encoding.

Slide 15
Tofu: hollow boxes
Mojibake: garbage characters
Question marks: conversion not supported

Slide 16

Slide 17
Can appear as either hollow boxes (empty glyph) or as question marks (Firefox, for example).
Not usually a bug: it's a display problem.
Can mask or masquerade as character corruption.
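As a rough illustration of the code point to code unit mapping on Slides 11 and 13, and of two of the failure modes on Slide 15, here is a minimal Python sketch using only built-in codecs. Choosing Windows-1252 as the "wrong" encoding and ASCII as the limited target encoding are assumptions made for the example; the slides do not name them. Tofu is a font and display problem, so it cannot be reproduced at the byte level.

```python
# Code point -> code units: U+00C0 (À) becomes 0xC3 0x80 in UTF-8.
ch = "\u00C0"
code_point = ord(ch)                     # the number assigned to the character
utf8_bytes = ch.encode("utf-8")          # two 8-bit code units
utf16_bytes = ch.encode("utf-16-be")     # one 16-bit code unit, big-endian

# Mojibake: the bytes are fine, but they are viewed with the wrong encoding
# (assumed here to be Windows-1252).
mojibake = utf8_bytes.decode("cp1252")

# Question marks: the target encoding (assumed here to be ASCII) has no
# slot for the character, so the converter substitutes 0x3F.
question = ch.encode("ascii", errors="replace")

print(hex(code_point), utf8_bytes.hex(), utf16_bytes.hex())
print(ascii(mojibake), question)
```

The point of the sketch is the debugging habit from Slide 14: compare the bytes you actually have against the encoding you believe they are in before blaming the data.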
Slide 18
When Good Characters Go Bad

Slide 19
View text using the wrong encoding
Apply a transfer encoding and forget to remove it
Convert to an encoding twice
Convert to or from the wrong encoding
Overzealous escaping
Conversion to entities ("entitization")
Multiple conversions

Slide 20

Slide 21

Slide 22
7 bits = 2^7 = 128 characters
Enough for U.S. English

Slide 23
ASCII for characters 0x00 through 0x7F
Accented letters and other symbols 0x80 through 0xFF

Slide 24
char   Cp1252   Cp437   Cp850
È      0xC8     ?       0xD4

Slide 25
Windows's encodings (called "code pages") are generally based on standard encodings plus some additional characters.
Example: CP 1252 is based on ISO 8859-1, but includes 27 extra characters in the C1 control range (0x80-0x9F).

Slide 26
"Code page" was originally an IBM character encoding term. IBM numbered their character sets with CCSIDs (coded character set IDs) and numbered the corresponding character encodings as "code pages".
Microsoft borrowed code pages to create PC-DOS.
Microsoft defines two kinds of code pages:
  "ANSI" code pages are the ones used by Windows GUI programs.
  "OEM" code pages are the ones used by command shell/command line programs.
Neither ANSI nor OEM refers to a particular encoding standard or standards body in this context. Avoid the use of ANSI and OEM when referring to encodings.

Slide 27
So far we've been looking at single-byte encodings: one byte per character.
1 byte = 1 character (= 1 glyph?)
256 character maximum
Good enough for most alphabetic languages
Some languages need more characters. What about the "double-byte" languages? Don't those take two bytes per character?
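The code page comparison on Slide 24 and the CP 1252 note on Slide 25 can be checked with Python's built-in codecs. This is a sketch under two assumptions: that the character in the Slide 24 table is È (U+00C8), inferred from its 0xC8 value in Cp1252, and that Python's codec names `cp1252`, `cp437`, and `cp850` correspond to the code pages the slide means.

```python
# One character, three code pages: È has a different byte (or none) in each.
ch = "\u00C8"  # È (assumed to be the character shown on Slide 24)
encoded = {
    name: ch.encode(name, errors="replace")  # b'?' when the page lacks È
    for name in ("cp1252", "cp437", "cp850")
}

# One byte, different characters: 0xC8 read under Cp437 is a box-drawing glyph.
as_cp437 = b"\xC8".decode("cp437")

# Count the byte values in 0x80-0x9F that Cp1252 assigns to characters.
# ISO 8859-1 reserves this whole range for C1 control codes.
extra = 0
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
        extra += 1
    except UnicodeDecodeError:
        pass  # value is unassigned in Cp1252

print(encoded)
print(ascii(as_cp437), extra)
```

Running it shows why "extended ASCII" is not a usable label: everything above 0x7F depends on which code page is in effect.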
Slide 28 Escape sequences to select another character set Example: ISO 2022 uses escape sequences to select various encodings Use a larger code unit ( wide character encoding) Example: IBM DBCS code pages or Unicode UTF-16 2 16 = 64K characters 2 32 = 4.2 billion characters Use a variable-width encoding Variable width encodings use different numbers of code units to represent different types of characters within the same encoding Slide 29 One or more bytes per character 1 byte != 1 character May use 1, 2, 3, or 4 bytes per character May use shift or escape sequences May encode more than one character set In fact, single-byte encodings are a special case of multibyte! Multibyte Encoding: Any variable-width encoding that uses the byte as its code unit. Slide 30 JIS X 213 11,233 characters (2) 94x94 character planes Slide 31 Specific byte ranges encoding characters that take more than one byte. A lead byte One or more trailing bytes Code point != code unit 1-4-1 (code point) 0x82 0xA0 A 1-3-33 (code point) 0x41 Slide 32 In order to reach more characters, Shift_JIS characters start with a limited range of lead bytes These can be followed by a larger range of byte values (trail byte) Slide 33 Slide 34 Lead bytes can be trail byte values Trail bytes include ASCII values Trail bytes include special values such as 0x5C ( \ ) int pos = strchr(mybuf, @); Slide 35 Stateful Encodings ex. IBM MBCS code pages [SI/SO shift between 1- byte and 2-byte characters] ISO 2022 [escape sequence changes character set being encoded] Slide 36 Slide 37 A transfer encoding syntax is a reversible transform of encoded data which may (or may not) include textual data represented in one or more character encoding schemes. Email headers URIs IDN (domain names) Abc =?UTF-8?B?QWJj44K 944O844K5?= Abc Slide 38 Document formats often require a single character encoding be used for all parts of the document. Process Output (HTML, XML, etc.) 
Templates ISO 8859-1 Content UTF-8 Data Shift_JIS When data is merged, the encodings must be merged also (or some of the data will be mojibake). Common Encoding Conversion Tools and Libraries iconv (Unix) ICU (C, C++, Java) perl Encode Java (native2ascii, IO/NIO) (etc.) Slide 39 Encoding conversion acts as a filter Replacement characters ( question marks ) replace characters from the source character set that are not present in the target character set. ISO 8859-1 ?????? ????? ???? UTF-8 ?????? ????? ???? ISO 8859-1 UTF-8 Shift_JIS ? (0x3F) is the replacement character for ISO 8859-1 Slide 40 Need for more converters and conversion maps Difficulty of passing, storing, and processing data in multiple encodings Too many character sets leads to what we call code page hell Slide 41 A Slide 42 Fights mojibake because: characters are from the common repertoire; characters are encoded according to one of the encoding forms; characters are interpreted with Unicode semantics; unknown characters are not corrupted Basic Principles Universal repertoire Logical order Efficiency Unification Characters, not glyphs Dynamic composition Semantics Stability Plain Text Convertibility Slide 43 Unicode is a character set that supports all of the worlds languages and writing systems. Code space of up to 0x10FFFF characters (about 1.1 million) Unicode and ISO 10646 are maintained in sync. Unicode is maintained by an industry consortium. ISO 10646 is maintained by the ISO. Slide 44 Divide Unicode in equal sized regions of code points. 17 planes (0 through 0x10), each with 65,535 characters. Plane 0 is called the Basic Multilingual Plane (BMP). > 99% of text in the wild lives in the BMP Planes 1 through 0x10 are called supplementary planes. Slide 45 An organized collection of characters. 
Each character has a code point, aka Unicode Scalar Value (USV): U+0041

7-bit ASCII is itself
All other characters take 2, 3, or 4 bytes each
  lead bytes have a special pattern
  trailing bytes range from 0x80 -> 0xBF

Code points                 Lead byte   Trail bytes
< 0x80                      0xxxxxxx
< 0x800                     110xxxxx    10xxxxxx
< 0x10000                   1110xxxx    10xxxxxx 10xxxxxx
Supplementary code points   11110xxx    10xxxxxx 10xxxxxx 10xxxxxx

Slide 62
ASCII-compatible
Default or recommended encoding for many Internet standards
Bit pattern highly detectable (over longer runs)
Non-endian
Streaming
C char* friendly
Easy to navigate
Multibyte encoding requires additional processing awareness
Non-shortest form checking needed
Less efficient than UTF-16 for large runs of Asian text

Slide 63
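The lead and trail byte patterns listed before Slide 62 are those of UTF-8. A hand-rolled encoder, sketched here in Python, makes the bit layout concrete. The function name `utf8_encode` is hypothetical and exists only for illustration; in real code the built-in `str.encode("utf-8")` does this job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative sketch only)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 0xxxxxxx: 7-bit ASCII is itself
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # Supplementary code points: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One sample from each row of the table, checked against the built-in codec.
for sample in (0x41, 0xC0, 0x30B9, 0x1F600):
    encoded = utf8_encode(sample)
    assert encoded == chr(sample).encode("utf-8")
    print(f"U+{sample:04X} -> {encoded.hex(' ')}")
```

Because each branch always picks the smallest pattern that fits, this encoder never emits a non-shortest form; the "non-shortest form checking" noted on Slide 62 is a concern for decoders that accept untrusted bytes.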