21st International Unicode Conference
Dublin, Ireland, May 2002

ISO/IEC 10646 & The Unicode® Standard

Mike Ksar
Senior Program Manager, International Standards Strategy
Microsoft Corporation
JTC1/SC2/WG2 Convener

Screenplay by Asmus Freytag



2 20th International Unicode Conference Washington, DC, January 2002

Outline

• Background
  • Relation between Unicode and ISO/IEC 10646
  • What is the same
  • What is different
  • What is being merged
• Synchronization
  • Shared Process and Policies
  • Aligned Program of Work
  • Common publication resources
• Beyond character coding
  • Character properties & Collation
  • Internationalization
  • Products and Standards
• Summary


The Internet

• The Internet pushes the envelope on internationalization
  • Users have easy access to documents worldwide, in any character set
  • Servers can be accessed by users from anywhere, speaking any language
  • Software can no longer be targeted to a single national market
• The need for a single character set standard was never greater. Why do we have two?


Common Charter

Develop a standard of graphic character repertoire and coding for an international graphic character set ... of the written form of the languages of the world.


Organizations

[Organization chart; link labels in the chart: Member, Liaison]

• ISO/IEC
  • JTC 1: Information Technology
    • SC 2: Codes and Character Sets
      • WG 2: ISO/IEC 10646
        • IRG: Ideographic Rapporteur Group
    • SC 22: Programming Languages
      • WG 20: Internationalization
• ANSI (US), NB, … and other National Bodies
  • INCITS: Information Technology
    • L2: Codes, Character Sets, and Internationalization
• The Unicode Consortium
  • UTC: Unicode Technical Committee
    • Bidi and other subcommittees


ISO Framework

Basis for other standards: ISO, JTC1, ECMA, IETF, CEN/TC304 & W3C

Well established and recognized ISO development process of standardization

Worldwide expertise through national standards bodies, industry and liaison organizations

Identified as one of the standards for procurement requirements by major organizations and agencies


Unicode Framework

• Consortium with open membership
• Industry backing
• Direct support from key implementers
• Open to academic and user input
• Cooperation with ISO, JTC1, ECMA, IETF, CEN/TC304 & W3C
• Unicode Technical Committee (UTC)


Development Path

[Diagram: ASCII / ISO 646, ISO 8859-x (Part-1, Part-2, ...), industry 8-bit codes (Windows, IBM, other) and national/industry multibyte codes lead to one universal code: 10646 & Unicode]


Sources of Characters

• International standards
  • JTC1/SC2 coded character sets
  • JTC1/SC18 text formatting and presentation
  • ISO TC46 bibliographic community
• National standards and committees
  • China (GB2312), Japan (JIS 208), Korea (KSC 5601) and many others
• Widely supported vendor character sets
• Regional standards committees
  • ASMO, ECMA ATG & Bidi & SC2/WG2/IRG
• Liaison organizations:
  • Unicode, Inc., ECMA, ITU-TS, AFII, TCA, W3C, CEN/TC304 and others
• User communities
  • STIX


ISO/IEC 10646 Milestones

• 1984: ISO starts developing
• 1991: Convergence with Unicode
• 1993: ISO/IEC 10646-part 1, First edition
  • Architecture & Basic Multilingual Plane
  • Equivalent to Unicode 1.1
• 1998: ISO/IEC TR 15285, An operational model for characters and glyphs
• 1995 – 1999: Technical amendments
  • UTF-8, UTF-16, Korean, Tibetan, Braille, etc.
  • Unicode 2.0 is equivalent through amendment 7
• 2000: ISO/IEC 10646-1, Second edition
  • 3 technical corrigenda
  • 31 amendments since 10646-1: 1993 first edition
  • Equivalent to Unicode 3.0
• 2001: ISO/IEC 10646-2 for Planes 1, 2 & 14
  • Unicode 3.1 includes repertoires of both 10646-1 and 10646-2 plus two additional characters
• 2002: Amd-2 to part 1
  • Equivalent to Unicode 3.2


Unicode 14 Years (1988-2002)

• 1988: First use of name Unicode
• 1991: Unicode Consortium founded
• 1991: Unicode, Version 1
• 1991: First Implementers' Workshop
• 1991: Convergence with ISO/IEC 10646
  • Liaison to ISO/IEC 10646 Working Group
• 1992: First Unicode Technical Reports
• 1993: Unicode, Version 1.1
• 1996: Version 2.0 published
• 2000: Version 3.0 published
  • Dramatic increase in number and scope of Unicode-based implementations
• 2001: Version 3.1 published
• 2002: Version 3.2
• 2002: 20th International Unicode Conference


Outline

• Background
  • Relation between Unicode and ISO/IEC 10646
  • What is the same
  • What is different
  • What is being merged


Code Space & Structure

[Diagram: stack of planes, from Plane 00 (BMP), Plane 01, Plane 02, ... Plane 14, to Plane 15 (Private Use) and Plane 16 (Private Use)]

ISO/IEC 10646 Parts 1 and 2
• Only use code space in planes 0 to 16
• Define characters only in planes 0 (BMP), 1, 2 & 14 so far
• Reserve planes 15, 16 for private use
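The plane layout above can be sketched in a few lines of Python. This is only an illustration: the function names `plane_of` and `plane_status` are hypothetical, and the status strings simply restate the three bullets as they stood at the time of the talk.

```python
def plane_of(code_point):
    """Return the plane number (0..16) that a code point falls in."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside planes 0 to 16")
    # Each plane holds 65,536 (0x10000) code positions.
    return code_point >> 16


def plane_status(code_point):
    """Classify a code point by the plane usage listed on the slide."""
    plane = plane_of(code_point)
    if plane in (15, 16):
        return "private use"
    if plane in (0, 1, 2, 14):
        return "characters defined"
    return "no characters defined so far"


if __name__ == "__main__":
    for cp in (0x0041, 0x1D11E, 0x20000, 0xE0001, 0xF0000):
        print(f"U+{cp:04X}: plane {plane_of(cp)}, {plane_status(cp)}")
```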


A Plane in 10646

[Diagram: a plane (16 bits) divided into rows and cells; 65,536 characters]

• A plane is the basic division of code-space in ISO/IEC 10646
• The first plane (Plane 0) is the Basic Multilingual Plane (BMP)
• Unicode 3.1 matches planes 0-16
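As a rough sketch of the row and cell terms on this slide: the 16 bits that address a position inside a plane split into a row octet and a cell octet, giving 256 rows of 256 cells. The helper name `row_and_cell` is made up for this example.

```python
def row_and_cell(code_point):
    """Split the position inside a plane into (row, cell) octets."""
    within_plane = code_point & 0xFFFF  # low 16 bits address the plane
    return within_plane >> 8, within_plane & 0xFF


if __name__ == "__main__":
    row, cell = row_and_cell(0x20AC)  # EURO SIGN, in the BMP
    print(f"row {row:02X}, cell {cell:02X}")
    print(256 * 256, "positions per plane")
```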


Basic Multilingual Plane

[Diagram: layout of the BMP]

• C0 Controls, C1 Controls
• Alphabets, Symbols, CJK Auxiliary, Hangul, . . .
• Unified Chinese, Japanese, Korean Ideographs
• Reserved for accessing code points outside BMP (2048)
• Private Use (6K), Compatibility Area, Arabic Presentation Forms, . . . (8190)


Adopted Form

• ISO/IEC 10646 is a 16-bit or 32-bit code
  • UCS-2: for accessing code points in BMP, 2 bytes (16 bits)
  • UCS-4: canonical form for accessing any code point using 4 bytes (32 bits)
• Transformation formats
  • UTF-8: for use in 8-bit environments (e.g. HTML, XML) (variable length code, 1 to 6 bytes/character)
  • UTF-16: for use with UCS-2 to access sixteen additional planes beyond the BMP
• Note: Unicode 3.2 supports UTF-8, UTF-16 and UTF-32. UTF-32 is equivalent to UCS-4, with an upper limit of 10FFFF (hexadecimal).
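A minimal sketch of how the three encoding forms treat one code point, using Python's built-in codecs. The helper names are hypothetical. `utf16_code_units` spells out the surrogate-pair arithmetic that lets UTF-16 reach the sixteen planes beyond the BMP. Within the 0 to 10FFFF range, UTF-8 never needs more than 4 bytes; the 5- and 6-byte forms only arise for larger UCS-4 values.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (one or two) for a code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x10000:
        return [code_point]  # BMP: one 16-bit unit, same value as UCS-2
    offset = code_point - 0x10000  # 20 bits, split 10 + 10
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def encoded_sizes(code_point):
    """Byte counts for one code point in each encoding form."""
    ch = chr(code_point)
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


if __name__ == "__main__":
    for cp in (0x0041, 0x20AC, 0x1D11E):
        units = " ".join(f"{u:04X}" for u in utf16_code_units(cp))
        print(f"U+{cp:04X}: UTF-16 units {units}, sizes {encoded_sizes(cp)}")
```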


Implementation Levels

• Implementation level for combining sequences
  • Level 1: only precomposed characters
  • Level 2: restricted combining sequences
  • Level 3: unrestricted combining sequences
• Unicode has no formal restrictions on combining sequences
  • An implementation may choose to support a subset of characters which does not contain any or all combining characters
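To make the difference between precomposed characters and combining sequences concrete, here is a small sketch. It assumes that "combining character" can be approximated by Unicode general categories starting with "M" (marks). The standard defines its levels through its own character lists, so this is a stand-in rather than the normative test, and `has_combining_marks` is a hypothetical helper name.

```python
import unicodedata


def has_combining_marks(text):
    """True if any character is a mark (general category Mn, Mc or Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one character
sequence = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

if __name__ == "__main__":
    print(has_combining_marks(precomposed))  # no marks: Level 1 style text
    print(has_combining_marks(sequence))     # uses a combining sequence
```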


Collections for Subsets

The Unicode declared subset is the whole of the BMP plus planes 1-16 accessible through UTF-16

Collections of coded graphic characters

The collections listed below are ordered by collection number. An * in the “positions” column indicates that the collection is a fixed collection.

Collection number and name      Positions
1  BASIC LATIN                  0020 - 007E *
2  LATIN-1 SUPPLEMENT           00A0 - 00FF *
3  LATIN EXTENDED-A             0100 - 017F *
4  LATIN EXTENDED-B             0180 - 024F
5  IPA EXTENSIONS               0250 - 02AF
6  SPACING MODIFIER LETTERS     02B0 - 02FF
Etc.
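The collection table above is in effect a range lookup. A minimal sketch, covering only the six collections shown on the slide (the full list in the standard is much longer); the names `COLLECTIONS` and `collection_of` are invented for this example.

```python
# (number, name, first position, last position, fixed collection?)
COLLECTIONS = [
    (1, "BASIC LATIN", 0x0020, 0x007E, True),
    (2, "LATIN-1 SUPPLEMENT", 0x00A0, 0x00FF, True),
    (3, "LATIN EXTENDED-A", 0x0100, 0x017F, True),
    (4, "LATIN EXTENDED-B", 0x0180, 0x024F, False),
    (5, "IPA EXTENSIONS", 0x0250, 0x02AF, False),
    (6, "SPACING MODIFIER LETTERS", 0x02B0, 0x02FF, False),
]


def collection_of(code_point):
    """Return (number, name) of the listed collection, or None."""
    for number, name, first, last, _fixed in COLLECTIONS:
        if first <= code_point <= last:
            return number, name
    return None  # not in the excerpt shown on the slide


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u0259"):
        print(f"U+{ord(ch):04X}:", collection_of(ord(ch)))
```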


Unicode

• Implements BMP plus next 16 planes
• Three encoding forms
  • UTF-8
  • UTF-16
  • UTF-32 (0 to 10FFFF)
• Implementation level 3
• No subsets
  • Unicode encourages transparency so that implementations can at least retransmit every character undamaged, but the level of support is otherwise explicitly left to the implementation


Unicode - 10646 Relationship

• ISO/IEC 10646 is a character encoding standard
• Unicode is code for code compatible with ISO/IEC 10646
• Unicode defines additional specifications about behavior and use of characters such as bidi algorithm, ordering, mappings, equivalence algorithm and other semantics
• Conformant implementations of Unicode are conformant implementations of ISO/IEC 10646


Unicode: Beyond 10646

In addition to character codes Unicode specifies:
• Behavior and use of characters
• A complete bidi algorithm
• An equivalence algorithm
• Normalization
• Additional character properties and semantics for spacing, zero-width space, combining characters, numeric, case and casing, directionality, letters, math operators etc.
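Several of the properties named above can be inspected with Python's standard `unicodedata` module, which is built from the Unicode Character Database. The sketch below is illustrative only: `describe` is a hypothetical helper, and the values it returns depend on the Unicode version bundled with the Python interpreter, not on Unicode 3.2.

```python
import unicodedata


def describe(ch):
    """Collect a few character properties for one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),          # letter, mark, symbol...
        "bidi_class": unicodedata.bidirectional(ch),   # directionality
        "combining_class": unicodedata.combining(ch),  # 0 for non-combining
        "numeric_value": unicodedata.numeric(ch, None),
        "uppercase": ch.upper(),                       # casing
    }


if __name__ == "__main__":
    for ch in ("a", "\u05d0", "\u0301", "\u00bd", "+"):
        print(f"U+{ord(ch):04X}", describe(ch))
```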


Unicode: Beyond 10646 (Cont.)

• Which combining marks are non-spacing marks
• Order and use of double-diacritic non-spacing marks
• A mapping for compatibility characters
• Default shaping behavior of cursive scripts
• Default mapping tables for conversion to and from other character set standards
• Rendering for Indic characters
• Line breaking


Outline

• Background
  • Relation between Unicode and ISO/IEC 10646
  • What is the same
  • What is different
  • What is being merged
• Synchronization
  • Shared Process and Policies
  • Aligned Program of Work
  • Common publication resources


Continued Cooperation

• Architecture changes:
  • UTF-32 (Proposed Amendment)
  • Restricts UCS-4 to planes 0 to 16
• Future editorial and technical corrigenda to second edition of ISO/IEC 10646-1: 2000 (will be part of Unicode 3.2)
• Repertoire extensions (included in Unicode 3.2)
  • ISO/IEC 10646-2 (planes 1, 2 & 14)
    • Plane 1, mathematics, hieroglyphs, music symbols, etc.
    • Plane 2, CJKV ideographic extensions
    • Plane 14, language tags
• Support current and future implementers
• Increase awareness and provide technical help
• Continued synchronization of future editions of ISO/IEC 10646 and the Unicode Standard


Going in the Same Direction

• One standard
  • No dialects
  • Common usage
• Common Encoding Forms
  • UTF-8
  • UTF-16
  • UTF-32/UCS-4
• Cooperation with ISO
  • Examples: UTF-8, UTF-16, UTF-32, EURO, collation, tags
• Incorporation into other standards
  • IETF
  • WWW Consortium (W3C)
• Shared expertise for lesser-used and obscure scripts


WG2 Program of Work

• 1st Amendment 10646-1: 2000 (March 2002)
• 2nd Amendment 10646-1: 2000 (December 2002)
• 1st Amendment 10646-2: 2001 (2003)

WG2 future meetings:
• Meeting 42 – Dublin, Ireland, May 2002
• Meeting 43 – Tokyo, Japan, December 2002


Outline

• Background
  • Relation between Unicode and ISO/IEC 10646
  • What is the same
  • What is different
  • What is being merged
• Synchronization
  • Shared Process and Policies
  • Aligned Program of Work
  • Common publication resources
• Beyond character coding
  • Character properties & Collation
  • Internationalization
  • Products and Standards


Collation & Character Properties

• ISO/IEC 14651 Collation Standard
  • Produced by SC22/WG20 Internationalization
  • Matches Unicode Collation Algorithm, Unicode Technical Standard (UTS) #10
• Unicode Character Database
  • Collection of character classification and properties
  • Geared towards the needs of implementers
  • Supports Internationalization
  • http://www.unicode.org/Public/UNIDATA
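The main file of the Unicode Character Database, UnicodeData.txt, is plain text with one semicolon-separated record per character, which is part of what makes it convenient for implementers. A sketch of reading one record follows. The sample line is the entry for U+0041; the record type `UcdRecord` and the choice of which of the 15 fields to keep are assumptions made for this example.

```python
from typing import NamedTuple, Optional


class UcdRecord(NamedTuple):
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    simple_lowercase: Optional[int]


def parse_unicodedata_line(line):
    """Parse one UnicodeData.txt record into a UcdRecord."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError("expected 15 semicolon-separated fields")
    lower = fields[13]  # simple lowercase mapping, may be empty
    return UcdRecord(
        code_point=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        combining_class=int(fields[3]),
        bidi_class=fields[4],
        simple_lowercase=int(lower, 16) if lower else None,
    )


SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

if __name__ == "__main__":
    print(parse_unicodedata_line(SAMPLE))
```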


Language Innovation

SOURCE:        C / C++                            Java / C#
Identifiers    ASCII                              Unicode
Comments       Local charset, byte oriented       Unicode
Literals       L“ ” converts local charset        Unicode

Data Types:
char           Byte oriented                      Unicode
wchar_t        Unicode on some implementations    N/A


Products Are Here!

[Chart: types of products and increased function of products over time (93, 1994, 1995, 1996, 1997, 1998-1999, 2000 and beyond), rising to a full set]

• Phase 1: Deliver a full set of products (Browsers, Development Tools, Fonts, Word Processors, etc.)
• Phase 2: Increased Functionality (More Scripts, Combining Characters, etc.)


Products Are HERE!

• Microsoft: Windows XP, Office XP, Internet Explorer 6.0, ECMAScript, C#/CLI
• Compaq: Tru64 Unix
• HP: HP-UX & Printers
• Netscape: Communicator 6.0, JavaScript, ECMAScript
• Sun: Solaris & Java
• Apple: Cyberdog, Mac OS X
• Lotus: Lotus Suite
• Asian solutions: JustSystems (Ichitaro) and Star+Globe (MASS)
• Databases: Software AG, Sybase, Oracle, DB2, NCR Teradata, Progress Software
• SAP platform
• Fonts: Adobe, Agfa/Monotype, Apple Advanced Typography, Bitstream, OpenType
• Tools and libraries: several vendors


Version 3.2 Is Here!

• Version 3.2 is in sync with both parts of ISO/IEC 10646 and 1st amendment to 10646-1
• Total repertoire of 95156 characters
• Completed math repertoire for MathML and other uses
• Further restriction on ill-formed UTF-8

http://www.unicode.org/unicode/reports/tr28
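What rejecting ill-formed UTF-8 looks like in practice can be sketched with Python's strict decoder. The slide does not list which byte sequences are affected, so the examples below (overlong forms, an encoded surrogate, a value above 10FFFF) are commonly cited cases chosen for illustration, and the behavior shown is that of current Python 3, not something tied to Unicode 3.2 itself. `is_well_formed_utf8` is a hypothetical helper name.

```python
def is_well_formed_utf8(data):
    """True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


EXAMPLES = {
    "euro sign": "\u20ac".encode("utf-8"),
    "overlong NUL": b"\xc0\x80",
    "overlong slash": b"\xe0\x80\xaf",
    "encoded surrogate D800": b"\xed\xa0\x80",
    "beyond 10FFFF": b"\xf4\x90\x80\x80",
}

if __name__ == "__main__":
    for label, raw in EXAMPLES.items():
        print(f"{label}: {is_well_formed_utf8(raw)}")
```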


Outline

• Background
  • Relation between Unicode and ISO/IEC 10646
  • What is the same
  • What is different
  • What is being merged
• Synchronization
  • Shared Process and Policies
  • Aligned Program of Work
  • Common publication resources
• Beyond character coding
  • Character properties & Collation
  • Internationalization
  • Products and Standards
• Summary


Common Repertoire

• The character repertoires of Unicode and ISO/IEC 10646 are identical
• Three matching encoding forms
• There are minor differences in
  • Terminology
  • Publication format
• Any conformant Unicode implementation conforms to ISO/IEC 10646


Unicode Extends...

• Character semantics
  • “Discover and catalogue”
  • Canonical and compatibility equivalence
  • Relate characters to their established use
• Technical reports with implementation guidelines
  • Normalization
  • Script behavior such as bi-directional algorithm
• Active promotion of the standard
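Canonical and compatibility equivalence can be checked by comparing normalized strings. A minimal sketch using Python's `unicodedata.normalize`; the two function names are hypothetical, and comparing NFD (or NFKD) forms is one common way to test equivalence, not the only one.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a, b):
    """Compare two strings after compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


if __name__ == "__main__":
    # A WITH RING ABOVE versus "A" plus COMBINING RING ABOVE
    print(canonically_equivalent("\u00c5", "A\u030a"))
    # The "fi" ligature is only compatibility-equivalent to "f" + "i"
    print(canonically_equivalent("\ufb01", "fi"))
    print(compatibility_equivalent("\ufb01", "fi"))
```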


What Do 10646 and Unicode Do for You?

Global interoperability - write once run everywhere; One source code one binary with user installable/callable locales

Simplified software - one application with one code set versus multiple applications and managing different code sets

Data stability - A single common and widely adopted format

Reduced costs - development, maintenance, training


Great Expectations

• Enhance global interoperability
• Enhance data interchange
• Permit easier development of localizable products
• Reduce development cost of localized application software
• Replace retrofitting with concurrent development


Recommendations

• Buy the international standard (including all published amendments) as well as the Unicode standard
• Watch for updates on the web including Unicode technical reports and ISO amendments
• Join the Unicode consortium, W3C, your national body standards committee or other organization to influence standards development processes
• Define your needs and communicate them to your vendors
• Build products that support ISO/IEC 10646 and The Unicode Standard


Thank You!