cldr_overview

28th Internationalization and Unicode Conference. CLDR 1.3: Overview and What’s New. George Rhoten (IBM), Mark Davis (IBM), Steven Loomis (IBM)

Transcript of cldr_overview

Page 1: cldr_overview

28th Internationalization and Unicode Conference

CLDR 1.3: Overview and What’s New

George Rhoten (IBM)
Mark Davis (IBM)
Steven Loomis (IBM)

Page 2: cldr_overview

28th Internationalization and Unicode Conference, Orlando, Florida, September 2005

Agenda

Background Information
What does CLDR contain?
Samples of CLDR
What is new?
Future plans
How does CLDR get updated?

Page 3: cldr_overview

Common Locale Data Repository

Relatively new project: 2004
Hosted by Unicode Consortium
• http://www.unicode.org/cldr/
Goals:
• Common, necessary software locale data for all world languages
• Collect and maintain locale data
• XML format for effective interchange
• Freely available

Page 4: cldr_overview

Universal Character Encoding

Unicode: Unique character codes for all languages

Page 5: cldr_overview

Direct and Indirect Usage

Companies / Organizations
• Adobe, Apple (Mac OS X), abas Software, Ascential Software, Avaya, BEA, BluePhoenix Solutions, BMC Software (Remedy), Business Objects, caris, CERN, ClearCommerce, Cognos, Debian Linux, D programming language, Gentoo Linux, GNU Classpath, HP, Hyperion, IBM, Inktomi, Innodata Isogen, Isogon, Informatica, Intel, Interlogics, IONA, IXOS, Macromedia, Mathworks, OpenOffice, Language Analysis Systems, Lawson Software, Leica Geosystems GIS & Mapping LLC, Mandrake Linux, Novell (SuSE), Optio Software, PayPal, Progress Software, Python, QNX, Quark, Rogue Wave, SAP, Siebel, SIL, SPSS, Software AG, Sun Microsystems (Solaris, Java), Sybase, Teradata (NCR), Trados, Trend Micro, Virage, webMethods, WMS Gaming, Xerox, Yahoo!, and many more…
Caveats
• Not a complete list: usage is not tracked, so this is an estimate
• CLDR first available in 2004, some may use precursor data

Page 6: cldr_overview

What is Locale Data?

Locale = identifier referring to linguistic and cultural preferences
• en_US, en_GB, ja_JP
Locale doesn’t refer to data like in POSIX
These preferences can change over time due to cultural and political reasons
• Introduction of new currencies, like the Euro
• Standard sorting of Spanish changes
Many of these preferences have varying degrees of standardization
• 12 and 24 hour format in the United States
This is a very broad topic
Scope of data limited to common system applications
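
As a minimal sketch of the "locale = identifier" idea above, the following Python splits identifiers of the form shown on the slide into a language part and a territory part. The names `LocaleId` and `parse_locale_id` are hypothetical, and the sketch assumes only the two-part `language_TERRITORY` shape; it does not attempt to cover any longer identifier forms.

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str
    territory: Optional[str]


def parse_locale_id(identifier: str) -> LocaleId:
    """Split an identifier such as "en_US" into language and territory."""
    parts = identifier.split("_")
    # Assumption: at most two parts, and the language part is alphabetic.
    if not parts[0].isalpha() or len(parts) > 2:
        raise ValueError(f"unsupported locale identifier: {identifier!r}")
    language = parts[0].lower()
    territory = parts[1].upper() if len(parts) == 2 else None
    return LocaleId(language, territory)


for sample in ("en_US", "en_GB", "ja_JP"):
    print(sample, "->", parse_locale_id(sample))
```

The identifier carries no data itself; it is only a key that an application could use to look up the preferences described on the following slides.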

Page 7: cldr_overview

Types of Locale Data

• Dates/time formats
• Number/Currency formats
• Measurement System
• Collation Specification
    Sorting
    Searching
    Matching
• Translated names for language, territory, script, timezones, currencies,…
• Script and characters used by a language

Page 8: cldr_overview

Sample: Languages, Scripts, Territories in Danish

This data can be used for web site preferences

<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abkhasisk</language>…
  <scripts>
    <script type="Arab">Arabisk</script>…
  <territories>
    <territory type="AD">Andorra</territory>
    <territory type="AE">Forenede Arabiske Emirater</territory>…
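
One way such display names might be read for a web site preference list is sketched below in Python with the standard `xml.etree.ElementTree` module. The constant `DANISH_SAMPLE` is a completed copy of the fragment above: the closing tags that the slide elides are added here as an assumption so that the text parses. The helper name `display_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Completed copy of the Danish fragment; closing tags are added because the
# slide elides them with "…".
DANISH_SAMPLE = """
<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abkhasisk</language>
  </languages>
  <scripts>
    <script type="Arab">Arabisk</script>
  </scripts>
  <territories>
    <territory type="AD">Andorra</territory>
    <territory type="AE">Forenede Arabiske Emirater</territory>
  </territories>
</localeDisplayNames>
"""


def display_names(xml_text: str, group: str, element: str) -> dict:
    """Map each code in the "type" attribute to its translated name."""
    root = ET.fromstring(xml_text)
    return {
        node.get("type"): (node.text or "").strip()
        for node in root.findall(f"./{group}/{element}")
    }


territories = display_names(DANISH_SAMPLE, "territories", "territory")
print(territories["AE"])
```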

Page 9: cldr_overview

Sample: Characters / Dates

<characters>
  <exemplarCharacters>[a-z æ å ø á é í ó ú ý]</exemplarCharacters>
</characters>…
<dayContext type="format">
  <dayWidth type="abbreviated">
    <day type="sun">søn</day>
    <day type="mon">man</day>…
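
To illustrate how the exemplar character set above could be used, here is a small Python sketch that expands the bracketed set and reports letters that fall outside it. It assumes a deliberately reduced syntax (single characters and `x-y` ranges separated by spaces), which is enough for this sample but is not a full parser for the set notation. The function names are hypothetical.

```python
def expand_exemplars(pattern: str) -> set:
    """Expand a simple set such as "[a-z æ å ø]" into individual characters."""
    body = pattern.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed set")
    chars = set()
    for token in body[1:-1].split():
        if len(token) == 3 and token[1] == "-":
            # Range such as a-z: include both end points.
            chars.update(chr(cp) for cp in range(ord(token[0]), ord(token[2]) + 1))
        elif len(token) == 1:
            chars.add(token)
        else:
            raise ValueError(f"unsupported token: {token!r}")
    return chars


DANISH_EXEMPLARS = expand_exemplars("[a-z æ å ø á é í ó ú ý]")


def outside_exemplars(text: str) -> set:
    """Return the letters in text that are not in the Danish exemplar set."""
    return {ch for ch in text.lower() if ch.isalpha() and ch not in DANISH_EXEMPLARS}


print(outside_exemplars("søndag"))
print(outside_exemplars("Straße"))
```

A check like this could, for example, flag text that probably needs characters beyond those normally used to write the language.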

Page 10: cldr_overview

Sample: Timezones / Currencies

<timeZoneNames>
  <zone type="America/Los_Angeles">
    <long>
      <standard>Pacific-normaltid</standard>
      <daylight>Pacific-sommertid</daylight>
    </long>…
<currencies>
  <currency type="GAF">
    <displayName>Gabonesisk CFA-franc</displayName>
    <symbol>GAF</symbol>…
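
The nested structure above can be walked in the same way. The Python sketch below looks up a zone's long names and a currency's display name and symbol. The `<sample>` wrapper element and the closing tags are assumptions added only so that the two fragments parse as one document; they say nothing about where these elements sit in a real locale file. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Completed copy of the slide's fragments inside a made-up <sample> wrapper.
SAMPLE = """
<sample>
  <timeZoneNames>
    <zone type="America/Los_Angeles">
      <long>
        <standard>Pacific-normaltid</standard>
        <daylight>Pacific-sommertid</daylight>
      </long>
    </zone>
  </timeZoneNames>
  <currencies>
    <currency type="GAF">
      <displayName>Gabonesisk CFA-franc</displayName>
      <symbol>GAF</symbol>
    </currency>
  </currencies>
</sample>
"""

ROOT = ET.fromstring(SAMPLE)


def zone_name(zone_id: str, kind: str) -> str:
    """Return the long "standard" or "daylight" name for a zone id."""
    for zone in ROOT.iter("zone"):
        if zone.get("type") == zone_id:
            node = zone.find(f"./long/{kind}")
            if node is not None and node.text:
                return node.text.strip()
    raise KeyError((zone_id, kind))


def currency_info(code: str) -> tuple:
    """Return (display name, symbol) for a currency code."""
    for currency in ROOT.iter("currency"):
        if currency.get("type") == code:
            return (
                currency.findtext("displayName", default="").strip(),
                currency.findtext("symbol", default="").strip(),
            )
    raise KeyError(code)


print(zone_name("America/Los_Angeles", "standard"))
print(currency_info("GAF"))
```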

Page 11: cldr_overview

Sample: Collation

<collation type="standard" >
  <settings caseFirst="upper" />
  <rules>
    <reset>D</reset>
    <s>đ</s>
    <t>Đ</t>
    <s>ð</s>
    <t>Ð</t>
    <reset>t</reset>
    …
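
The rules above place đ and ð next to D with only secondary (`<s>`) and tertiary (`<t>`) differences. The Python sketch below is a toy illustration of that kind of multi-level comparison, not the collation algorithm a real implementation would use: the weight table `TAILORED` is an assumption derived from the order of the `<s>` rules, the `caseFirst="upper"` setting and the second `<reset>` are not modeled, and lowercase is arbitrarily ordered before uppercase.

```python
# Secondary weights assumed from the order of the <s> rules after <reset>D</reset>:
# plain d is 0, đ is 1, ð is 2. All three share the primary weight of "d".
TAILORED = {"đ": ("d", 1), "ð": ("d", 2)}


def sort_key(word: str) -> tuple:
    primary, secondary, tertiary = [], [], []
    for ch in word:
        base, accent = TAILORED.get(ch.lower(), (ch.lower(), 0))
        primary.append(base)
        secondary.append(accent)
        tertiary.append(1 if ch.isupper() else 0)
    # Tuples compare left to right, so primary differences win over
    # secondary ones, which in turn win over tertiary (case) ones.
    return (tuple(primary), tuple(secondary), tuple(tertiary))


words = ["ðb", "ea", "db", "đa", "da"]
print(sorted(words, key=sort_key))
```

With these weights "đa" sorts directly after "da" and before "db", because the difference between d and đ is only consulted when the primary letters are otherwise equal.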

Page 12: cldr_overview

Latest Release: CLDR 1.3

Released: June 2, 2005
296 locales: 96 languages and 130 territories
Data
• Unique keys: 3,974
• Actual Values: 52,382
• All data fields: 898,183
(not including collation, aliased data)

Page 13: cldr_overview

CLDR 1.3

Complete POSIX-format data with POSIX conversion tool
More timezone translations
Data for UN M.49 regions, including continents and regions
Addition of ISO 4217 currency code changeovers
Additional number and data tests to verify CLDR implementations
Mappings from language to script and territory
Various other fixes, additions, and extensions
Survey tool for improved collection of data
• http://www.unicode.org/cgi-bin/cldr-survey (read only to non-members)
… and many other minor improvements and bug fixes

Page 14: cldr_overview

Next Release: CLDR 1.4

2005-05-31 Phase 1
• Design
2005-08-31 Phase 2
• Structure, Tools, Documentation
2005-09-30 Phase 2 Beta Release
2005-10-31 Phase 3
• Data Incorporation & Vetting
2006-01-31 Phase 3 Beta Release
2006-03-31 CLDR 1.4 Released

Page 15: cldr_overview

Samples of Possible CLDR 1.4 Features

Data
• Enhance data for existing locales
• Verify coverage level
• Measurement unit names (e.g. metric vs US)?
• Add European Ordering rules to some locales
• Add data/structure to support lenient parsing, formatting; relative dates, etc.
• Enhance Indic sorting

Page 16: cldr_overview

Samples of Possible CLDR 1.4 Features (II)

Structure
• Add structure / data for tracking priority and completeness
• Move weekend data & other country data to country info
• Improved alias structure to reduce data duplication
• Add locale specific linebreak, transforms, etc.

Page 17: cldr_overview

Samples of Possible CLDR 1.4 Features (III)

Tests & Tools
• Enhanced Survey tool for collecting/vetting data
• Enhanced consistency checking, more complete tests
• Improve the Java tool integration, documentation, testing
Actual feature set has not been determined yet!

Page 18: cldr_overview

Committee Process

Designed for most effective participation from people around the world
Meetings
• By phone, never face to face
• Short, frequent
• Allows preparation between meetings
• Resolves conflicts and new feature requests
Written
• Email
• Bug database submissions

Page 19: cldr_overview

Vetting Process for Data

Collect from different participating organizations, experts and submissions: new or revised
• References to external sources strongly encouraged
• Must be given before freeze date for release
• Use CLDR Survey Tool
Enter into the repository
• Mark with draft attribute
• Some may be entered as alternates
• Differences resolved by CLDR committee

Page 20: cldr_overview

Vetting Process (II)

Vet by CLDR committee members
• Consulting with country contacts
• If disagreement, decide in committee
Accept
• As main form: draft attribute removed
• As alternate form: marked with different attributes
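
The draft and alternate markings described in the vetting slides could be separated by a consumer roughly as follows. This Python sketch is illustrative only: the slides name a "draft attribute" but give no values, so the spellings `draft="true"` and `alt="variant"` used here are assumptions, and the sample entries are made up for the example. The function name `classify` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative data: attribute values and entries are assumptions for this sketch.
SAMPLE = """
<territories>
  <territory type="AD">Andorra</territory>
  <territory type="LA" draft="true">Laos</territory>
  <territory type="LA" alt="variant">Republic of Laos</territory>
</territories>
"""


def classify(xml_text: str) -> tuple:
    """Split entries into accepted main forms, drafts, and alternates."""
    main, drafts, alternates = {}, {}, {}
    for node in ET.fromstring(xml_text):
        key, value = node.get("type"), node.text
        if node.get("alt") is not None:
            alternates.setdefault(key, []).append(value)
        elif node.get("draft") == "true":
            drafts[key] = value
        else:
            # No draft marking: treated as an accepted main form.
            main[key] = value
    return main, drafts, alternates


main, drafts, alternates = classify(SAMPLE)
print("accepted:", main)
print("still in draft:", drafts)
print("alternates:", alternates)
```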

Page 21: cldr_overview

Causes of Conflicting Data

Typographical errors
• Canda instead of Canada
Regional differences
• German spelling is different between countries
Context of usage
• Normal German sorting versus German phonebook sorting
Parts of speech
• “март 2004” versus “3 марта” when the Russian word for March is used in a date
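
The parts-of-speech case can be made concrete with a short Python sketch: the month name is chosen according to whether it stands next to a day number. The table `MONTH_NAMES` is hypothetical and holds only March; the context label "format" appears in the day-name sample earlier in this deck, while "stand-alone" is an assumed label for the other context.

```python
# Hypothetical month table for Russian, limited to March (month 3).
MONTH_NAMES = {
    "format": {3: "марта"},      # form used next to a day number
    "stand-alone": {3: "март"},  # form used without a day number
}


def format_date(year: int, month: int, day=None) -> str:
    """Pick the month form that matches how the month is used."""
    if day is None:
        return f"{MONTH_NAMES['stand-alone'][month]} {year}"
    return f"{day} {MONTH_NAMES['format'][month]} {year}"


print(format_date(2004, 3))
print(format_date(2004, 3, day=3))
```

Two contributors who each submit only one of these forms as "the" name for March are both right, which is why such data can look conflicting until the context is recorded.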

Page 22: cldr_overview

Causes of Conflicting Data (II)

Standards versus common use
• “Republic of Laos” versus “Laos”
Misunderstanding
• Translating year format “yyyy” as “jjjj” instead of changing localized pattern characters
Uncommon cases
• Translating the “Interlingua” language name into other languages
Individual preferences
• 24 hour time format versus 12 hour time format
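
The "yyyy" versus "jjjj" misunderstanding is the kind of error a consistency check can catch mechanically. The Python sketch below flags letters in a date pattern that are not in a set of expected pattern letters, skipping text inside single quotes. The set `CANONICAL_LETTERS` is an assumed subset chosen for these examples, not a complete list, and the function name is hypothetical.

```python
# Assumed subset of canonical pattern letters, enough for the examples here.
CANONICAL_LETTERS = set("GyMdEahHmsz")


def unknown_pattern_letters(pattern: str) -> set:
    """Return letters used as pattern fields that are not canonical."""
    unknown = set()
    in_quotes = False
    for ch in pattern:
        if ch == "'":
            # Text between single quotes is literal, not a pattern field.
            in_quotes = not in_quotes
        elif not in_quotes and ch.isalpha() and ch not in CANONICAL_LETTERS:
            unknown.add(ch)
    return unknown


for candidate in ("yyyy-MM-dd", "jjjj-MM-dd", "d. MMM yyyy 'kl.' HH:mm"):
    print(candidate, "->", sorted(unknown_pattern_letters(candidate)))
```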

Page 23: cldr_overview

Challenges

Complex Formats
Experts knowledgeable both in technology and a specific language
• Collation
• Exemplar characters
• Etc…
Require close interaction of CLDR experts with language experts

Page 24: cldr_overview

Getting Involved

Simplest – anyone!
• Use CLDR
• Bug report / feature request
More Involved
• Vetting, Assessment, Tools, Policies, Decisions, …
• Any Unicode member eligible to name representatives including country liaison members

Page 25: cldr_overview

Example Country Process (Finland)

Finnish Ministry of Education made CLDR data a major goal, 2004-06
• Research Institute for the Languages of Finland (“RILF” aka “Kotus”) designated agency
• Documenting the national preferences in the open more important than the implementation mechanism
• Results expected to lead to new/revised national standards

Page 26: cldr_overview

Example Country Process (II)

RILF a Unicode Liaison member, 2004-07
• Set up fully open national group on language and cultural requirements on ICT, 2004-09
• Two official languages (Finnish and Swedish) & four regional / minority languages (three Sámi & Romani as spoken in Finland) to be covered
• Over 30 different parties represented: commercial, non-commercial, individuals
• Public comments to be allowed: http://www.kotoistus.fi/
• Documentation for all controversial issues and deviations from any national standards

Page 27: cldr_overview

For More Information

Unicode
• http://www.unicode.org/
CLDR
• http://www.unicode.org/cldr/
This presentation
• http://www.unicode.org/cldr/data/docs/presentations/cldr_overview.ppt