28th Internationalization and Unicode Conference
CLDR 1.3: Overview and What’s New
George Rhoten (IBM)
Mark Davis (IBM)
Steven Loomis (IBM)
Orlando, Florida, September 2005

Agenda
• Background Information
• What does CLDR contain?
• Samples of CLDR
• What is new?
• Future plans
• How does CLDR get updated?

Common Locale Data Repository
• Relatively new project: 2004
• Hosted by Unicode Consortium
  • http://www.unicode.org/cldr/
• Goals:
  • Common, necessary software locale data for all world languages
  • Collect and maintain locale data
  • XML format for effective interchange
  • Freely available

Universal Character Encoding
• Unicode: Unique character codes for all languages
…

Direct and Indirect Usage
• Companies / Organizations
  • Adobe, Apple (Mac OS X), abas Software, Ascential Software, Avaya, BEA, BluePhoenix Solutions, BMC Software (Remedy), Business Objects, caris, CERN, ClearCommerce, Cognos, Debian Linux, D programming language, Gentoo Linux, GNU Classpath, HP, Hyperion, IBM, Inktomi, Innodata Isogen, Isogon, Informatica, Intel, Interlogics, IONA, IXOS, Macromedia, Mathworks, OpenOffice, Language Analysis Systems, Lawson Software, Leica Geosystems GIS & Mapping LLC, Mandrake Linux, Novell (SuSE), Optio Software, PayPal, Progress Software, Python, QNX, Quark, Rogue Wave, SAP, Siebel, SIL, SPSS, Software AG, Sun Microsystems (Solaris, Java), Sybase, Teradata (NCR), Trados, Trend Micro, Virage, webMethods, WMS Gaming, Xerox, Yahoo!, and many more…
• Caveats
  • Not a complete list: usage is not tracked, so this is an estimate
  • CLDR first available in 2004, some may use precursor data

What is Locale Data?
• Locale = identifier referring to linguistic and cultural preferences
  • en_US, en_GB, ja_JP
• Locale doesn’t refer to data like in POSIX
• These preferences can change over time due to cultural and political reasons
  • Introduction of new currencies, like the Euro
  • Standard sorting of Spanish changes
• Many of these preferences have varying degrees of standardization
  • 12 and 24 hour format in the United States
• This is a very broad topic
• Scope of data limited to common system applications
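
Since a locale is only an identifier, software typically splits it into its parts before looking up any data. The following is a minimal sketch of that step, assuming the simple language_TERRITORY shape of the examples above; the names `LocaleId` and `parse_locale_id` are hypothetical, and real identifiers can carry further parts (script, variant) that this sketch does not handle.

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    """Hypothetical container for the parts of a simple locale identifier."""
    language: str
    territory: Optional[str]


def parse_locale_id(identifier: str) -> LocaleId:
    """Split an identifier such as 'en_US' into language and territory.

    Assumption: only the language_TERRITORY form is handled; script and
    variant subtags are out of scope for this sketch.
    """
    parts = identifier.replace("-", "_").split("_")
    language = parts[0].lower()
    territory = parts[1].upper() if len(parts) > 1 and parts[1] else None
    return LocaleId(language, territory)


for ident in ("en_US", "en_GB", "ja_JP"):
    print(parse_locale_id(ident))
```

The identifier carries no data itself; it only selects which set of preferences to load.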

Types of Locale Data
• Dates/time formats
• Number/Currency formats
• Measurement System
• Collation Specification
  • Sorting
  • Searching
  • Matching
• Translated names for language, territory, script, timezones, currencies,…
• Script and characters used by a language

Sample: Languages, Scripts, Territories in Danish
This data can be used for web site preferences
<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abkhasisk</language>…
  <scripts>
    <script type="Arab">Arabisk</script>…
  <territories>
    <territory type="AD">Andorra</territory>
    <territory type="AE">Forenede Arabiske Emirater</territory>…
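
As a rough illustration of how an application might read such display names, here is a sketch using Python's standard `xml.etree.ElementTree`. The fragment repeats the Danish sample with the elided parts dropped and the closing tags supplied so that it parses; the helper name `display_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The Danish sample from the slide, closed up so that it is well-formed XML.
LDML_FRAGMENT = """
<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abkhasisk</language>
  </languages>
  <scripts>
    <script type="Arab">Arabisk</script>
  </scripts>
  <territories>
    <territory type="AD">Andorra</territory>
    <territory type="AE">Forenede Arabiske Emirater</territory>
  </territories>
</localeDisplayNames>
"""


def display_names(xml_text, group, item):
    """Map each code (the type attribute) to its translated name."""
    root = ET.fromstring(xml_text)
    return {
        el.get("type"): (el.text or "").strip()
        for el in root.findall(f"{group}/{item}")
    }


territories = display_names(LDML_FRAGMENT, "territories", "territory")
print(territories["AE"])
```

A web site could use a mapping like this to fill a country or language picker in the visitor's language.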

Sample: Characters / Dates
<characters>
  <exemplarCharacters>[a-z æ å ø á é í ó ú ý]</exemplarCharacters>
</characters>…
<dayContext type="format">
  <dayWidth type="abbreviated">
    <day type="sun">søn</day>
    <day type="mon">man</day>…

Sample: Timezones / Currencies
<timeZoneNames>
  <zone type="America/Los_Angeles">
    <long>
      <standard>Pacific-normaltid</standard>
      <daylight>Pacific-sommertid</daylight>
    </long>…
<currencies>
  <currency type="GAF">
    <displayName>Gabonesisk CFA-franc</displayName>
    <symbol>GAF</symbol>…

Sample: Collation
<collation type="standard">
  <settings caseFirst="upper" />
  <rules>
    <reset>D</reset>
    <s>đ</s>
    <t>Đ</t>
    <s>ð</s>
    <t>Ð</t>
    <reset>t</reset>
    …
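
One way to read these rules is to flatten them into the compact notation used by ICU-style collation rule strings, where a reset is written `&` and primary, secondary and tertiary differences are written `<`, `<<` and `<<<`. The sketch below does that for the first reset chain of the sample only (the slide elides the rest); the function name `to_rule_string` is hypothetical and only these four element types are covered.

```python
import xml.etree.ElementTree as ET

# First reset chain from the slide, wrapped in a closed <rules> element.
RULES_FRAGMENT = """
<rules>
  <reset>D</reset>
  <s>đ</s>
  <t>Đ</t>
  <s>ð</s>
  <t>Ð</t>
</rules>
"""

# Element name -> operator in the compact rule notation.
OPERATORS = {"reset": "&", "p": "<", "s": "<<", "t": "<<<"}


def to_rule_string(xml_text):
    """Concatenate operator and text for each rule element, in document order."""
    root = ET.fromstring(xml_text)
    return "".join(OPERATORS[el.tag] + (el.text or "") for el in root)


print(to_rule_string(RULES_FRAGMENT))  # &D<<đ<<<Đ<<ð<<<Ð
```

Read left to right: starting at D, đ differs from D at the secondary level, Đ differs from đ at the tertiary (case) level, and so on.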

Latest Release: CLDR 1.3
• Released: June 2, 2005
• 296 locales: 96 languages and 130 territories
• Data
  • Unique keys: 3,974
  • Actual Values: 52,382
  • All data fields: 898,183
  (not including collation, aliased data)

CLDR 1.3
• Complete POSIX-format data with POSIX conversion tool
• More timezone translations
• Data for UN M.49 regions, including continents and regions
• Addition of ISO 4217 currency codes change overs
• Additional number and data tests to verify CLDR implementations
• Mappings from language to script and territory
• Various other fixes, additions, and extensions
• Survey tool for improved collection of data
  • http://www.unicode.org/cgi-bin/cldr-survey (read only to non-members)
• … and many other minor improvements and bug fixes

Next Release: CLDR 1.4
• 2005-05-31 Phase 1
  • Design
• 2005-08-31 Phase 2
  • Structure, Tools, Documentation
• 2005-09-30 Phase 2 Beta Release
• 2005-10-31 Phase 3
  • Data Incorporation & Vetting
• 2006-01-31 Phase 3 Beta Release
• 2006-03-31 CLDR 1.4 Released

Samples of Possible CLDR 1.4 Features
• Data
  • Enhance data for existing locales
  • Verify coverage level
  • Measurement unit names (eg metric vs US)?
  • Add European Ordering rules to some locales
  • Add data/structure to support lenient parsing, formatting; relative dates, etc.
  • Enhance Indic sorting

Samples of Possible CLDR 1.4 Features (II)
• Structure
  • Add structure / data for tracking priority and completeness
  • Move weekend data & other country data to country info
  • Improved alias structure to reduce data duplication
  • Add locale specific linebreak, transforms, etc.

Samples of Possible CLDR 1.4 Features (III)
• Tests & Tools
  • Enhanced Survey tool for collecting/vetting data
  • Enhanced consistency checking, more complete tests
  • Improve the Java tool integration, documentation, testing
• Actual feature set has not been determined yet!

Committee Process
• Designed for most effective participation from people around the world
• Meetings
  • By phone, never face to face
  • Short, frequent
  • Allows preparation between meetings
  • Resolves conflicts and new feature requests
• Written
  • Email
  • Bug database submissions

Vetting Process for Data
• Collect from different participating organizations, experts and submissions: new or revised
  • References to external sources strongly encouraged
  • Must be given before freeze date for release
  • Use CLDR Survey Tool
• Enter into the repository
  • Mark with draft attribute
  • Some may be entered as alternates
  • Differences resolved by CLDR committee

Vetting Process (II)
• Vet by CLDR committee members
  • Consulting with country contacts
  • If disagreement, decide in committee
• Accept
  • As main form: draft attribute removed
  • As alternate form: marked with different attributes
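
The vetting states described on the last two slides can be told apart by attributes on each element. The sketch below groups entries into accepted, draft and alternate. It assumes the draft marker is spelled `draft="true"` and that alternates carry an `alt` attribute; the slides name only a "draft attribute" and "different attributes", so both spellings are assumptions, and the entries themselves are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical entries invented for illustration; not taken from CLDR data.
# Assumed attribute spellings: draft="true" and alt="...".
VETTING_FRAGMENT = """
<territories>
  <territory type="AD">Andorra</territory>
  <territory type="LA" draft="true">Laos</territory>
  <territory type="LA" alt="proposed">Republic of Laos</territory>
</territories>
"""


def classify(xml_text):
    """Group (code, value) pairs by vetting status."""
    groups = {"accepted": [], "draft": [], "alternate": []}
    for el in ET.fromstring(xml_text):
        if el.get("alt") is not None:
            status = "alternate"
        elif el.get("draft") == "true":
            status = "draft"
        else:
            status = "accepted"
        groups[status].append((el.get("type"), el.text))
    return groups


groups = classify(VETTING_FRAGMENT)
print(groups["accepted"])
```

An implementation that wants only vetted data would read the accepted group and ignore the other two.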

Causes of Conflicting Data
• Typographical errors
  • Canda instead of Canada
• Regional differences
  • German spelling is different between countries
• Context of usage
  • Normal German sorting versus German phonebook sorting
• Parts of speech
  • “март 2004” versus “3 марта” when the Russian word for March is used in a date
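
To make the "context of usage" case concrete, the sketch below contrasts the two German orderings with plain sort keys: standard sorting treats ä, ö, ü like a, o, u, while phonebook sorting treats them like ae, oe, ue. This is a deliberate simplification (it ignores the finer tie-breaking levels a real collator applies), and the table and function names are hypothetical.

```python
# Simplified key tables; a real collator also applies secondary and
# tertiary comparisons that this sketch leaves out.
STANDARD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def sort_names(names, table):
    """Sort names by their lowercased, umlaut-expanded key."""
    return sorted(names, key=lambda name: name.lower().translate(table))


names = ["Goldmann", "Götz", "Goethe", "Göbel"]
print(sort_names(names, STANDARD))   # ['Göbel', 'Goethe', 'Goldmann', 'Götz']
print(sort_names(names, PHONEBOOK))  # ['Göbel', 'Goethe', 'Götz', 'Goldmann']
```

Both orders are correct for their context, which is why such data is kept as separate collation types rather than resolved to a single answer.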

Causes of Conflicting Data (II)
• Standards versus common use
  • “Republic of Laos” versus “Laos”
• Misunderstanding
  • Translating year format “yyyy” as “jjjj” instead of changing localized pattern characters
• Uncommon cases
  • Translating the “Interlingua” language name into other languages
• Individual preferences
  • 24 hour time format versus 12 hour time format
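
The “jjjj” misunderstanding is the kind of error a simple consistency check can catch. The sketch below flags letters in a date pattern that are not known pattern letters, skipping text quoted with apostrophes. The set `KNOWN_PATTERN_LETTERS` is an assumed, deliberately small subset chosen for illustration, and the function name is hypothetical.

```python
import re

# Assumed subset of pattern letters, for illustration only.
KNOWN_PATTERN_LETTERS = set("GyMdEaHhmsz")


def unknown_pattern_letters(pattern):
    """Return the letters in a date pattern that are not known pattern letters.

    Text between apostrophes is literal and is skipped.
    """
    unquoted = re.sub(r"'[^']*'", "", pattern)
    return sorted(
        {ch for ch in unquoted if ch.isalpha() and ch not in KNOWN_PATTERN_LETTERS}
    )


print(unknown_pattern_letters("dd.MM.yyyy"))  # []
print(unknown_pattern_letters("dd.MM.jjjj"))  # ['j']
```

A submission with "jjjj" in the pattern itself would be flagged for review, since the year field should stay "yyyy" in the stored pattern.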

Challenges
• Complex Formats
• Experts knowledgeable both in technology and a specific language
  • Collation
  • Exemplar characters
  • Etc…
• Require close interaction of CLDR experts with language experts

Getting Involved
• Simplest – anyone!
  • Use CLDR
  • Bug report / feature request
• More Involved
  • Vetting, Assessment, Tools, Policies, Decisions, …
  • Any Unicode member eligible to name representatives including country liaison members

Example Country Process (Finland)
• Finnish Ministry of Education made CLDR data a major goal, 2004-06
  • Research Institute for the Languages of Finland (“RILF” aka “Kotus”) designated agency
  • Documenting the national preferences in the open more important than the implementation mechanism
  • Results expected to lead to new/revised national standards

Example Country Process (II)
• RILF a Unicode Liaison member, 2004-07
  • Set up fully open national group on language and cultural requirements on ICT, 2004-09
  • Two official languages (Finnish and Swedish) & four regional / minority languages (three Sámi & Romani as spoken in Finland) to be covered
  • Over 30 different parties represented: commercial, non-commercial, individuals
  • Public comments to be allowed: http://www.kotoistus.fi/
  • Documentation for all controversial issues and deviations from any national standards

For More Information
• Unicode
  • http://www.unicode.org/
• CLDR
  • http://www.unicode.org/cldr/
• This presentation
  • http://www.unicode.org/cldr/data/docs/presentations/cldr_overview.ppt