Page 1:

Put ICU to Work!

November 1, 2016

Steven R. Loomis <[email protected]>

San José, California, USA @srl295

Yoshito Umaoka <[email protected]>

Littleton, Massachusetts, USA

© 2016 IBM Corporation

Page 2:

© 2014 IBM Corporation

Can’t I just use Unicode?

1,400+ pages, plus annexes and additional standards

More than 110,000 characters

Significant update about once a year

80+ character properties, many multi-valued

Page 3:

Unicode covers the world…

“Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”

Source: http://www.unicode.org/standard/WhatIsUnicode.html

Photo: NASA Earth Observatory

Page 4:

ICU brings you home

Requirements vary widely across languages & countries

• Sorting
• Text searching
• Bidirectional text processing and complex text layout
• Date/time/number/currency formatting
• Codepage conversion
• …many more

Photo credits: james.thompson, Per Ola Wiberg ~ Powi, Tom Ravenscroft

Page 5:

ICU comes home

• 1999 – IBM Classes for Unicode becomes International Components for Unicode

• 2016 – not just for Unicode, but from Unicode
  • Similar open-source license
  • Hosting transferred from IBM to Unicode (ICU-TC)
  • Build machines accessible to entire ICU-TC
  • 1-click CLA process – anyone can contribute!
  • http://blog.unicode.org/2016/05/icu-joins-unicode-consortium.html

Page 6:

ICU's laundry list

• Breaks: word, line, …
• Formatting
• Date & time
• Durations
• Messages
• Numbers & currencies
• Plurals
• Transforms
• Normalization
• Casing
• Transliterations

• Unicode text handling
• Charset conversions (200+)
• Charset detection
• Collation & Searching
• Locales from CLDR (640+)
• Resource Bundles
• Calendar & Time zones
• Complex-text layout engine
• Unicode Regular Expressions
• …

Page 7:

Benefits of ICU

• Mature, widely used (all IBM brands and operating systems), up-to-date set of C/C++ and Java libraries
• Basis for Java 1.1 internationalization, but goes far beyond Java 1.1
• Team continues to work on improving and monitoring performance
• Very portable – identical results on all platforms/programming languages
• C/C++ (ICU4C): 30+ platforms/compilers
• Java (ICU4J): Oracle and IBM JRE
• Wrappers: D/C#/PHP/Python/…
• Customizable & Modular
• Open source (since 1999) – but non-restrictive
• Contributions from many parties (IBM, Google, Apple, Microsoft, ...)

Page 8:

Where do I get ICU? – http://icu-project.org/

§ Main site: http://icu-project.org/
• Downloads
• API references
• Mailing list
• Bug tracking

§ http://userguide.icu-project.org
• User's guide with examples

Page 9:

ICU Userguide

§ General to Specific

Page 10:

API reference: http://icu-project.org/apiref (C and J)

Page 11:

Mailing Lists

• Mailing lists: http://site.icu-project.org/contacts
• icu-support – technical support and discussion
• icu-design – API proposals by ICU team
• icu-announce – announcements
• icu-bugrfe / icu-commits – track ICU bugs and commits

Page 12:

Bug Tracker - http://bugs.icu-project.org/trac

Page 13:

New Ticket

(security-related bug: suggests hiding the ticket)

Page 14:

Time to jump into ICU itself...

Page 15:

Where we are going: Application Data

Task: Display a list of world regions, with their population figures.

150,000 Ceuta and Melilla

38,087,800 Algeria

15,439,400 Ecuador

Page 16:

ICU for C: First Look

#include "unicode/…"
All ICU headers are in unicode/

UErrorCode status = U_ZERO_ERROR;
The error code is a fill-in parameter and must be initialized.

u_init(&status);
Returns success if ICU data loads OK.

if ( U_SUCCESS(status) ) …
TRUE if no error.

Page 17:

ICU first look (C)

#include <stdio.h>           // puts
#include <unicode/uclean.h>  // u_init

int main(…) {
    UErrorCode status = U_ZERO_ERROR;
    u_init(&status);
    if (U_FAILURE(status)) {
        puts(u_errorName(status));  // print the symbolic name of the error
        return 1;
    }
    // ICU is OK…
    // hereafter: ASSERT_OK
}

Page 18:

Hello, World! (C)

#include <unicode/uclean.h>
#include <unicode/ustream.h>  // operator<< for UnicodeString

UErrorCode status = U_ZERO_ERROR;
u_init(&status);
ASSERT_OK(status);
UnicodeString msg("Hello, ");
msg.append("World");
msg.append(0x2603);  // snowman
std::cout << msg << std::endl;

Hello, World☃

Page 19:

Hello, Welt! (C)

#include <unicode/uldnames.h>  // uldn_* display-name functions

ULocaleDisplayNames *names;
names = uldn_open(NULL, ULDN_DIALECT_NAMES, &status);
UChar result[256];  // UTF-16
int32_t len = uldn_regionDisplayName(names, "001", result, 256, &status);
uldn_close(names);  // clean up!
ASSERT_OK(status);

UnicodeString msg("Hello, ");
msg.append(result, len);
msg.append(0x2603);  // snowman
std::cout << msg << std::endl;

Hello, World☃
Hello, Welt☃
Hello, Mundo☃
Hello, 世界☃

Page 20:

Hello, Welt! (J)

Locale locale = Locale.getDefault();
String world = LocaleDisplayNames
        .getInstance(ULocale.forLocale(locale))
        .regionDisplayName("001");
System.out.println("Hello, " + world + "\u2603");

Hello, World☃
Hello, Welt☃
Hello, Mundo☃
Hello, 世界☃

Page 21:

World hello? (C)

ULocaleDisplayNames *names;
names = uldn_open(NULL, ULDN_DIALECT_NAMES, &status);
UChar result[256];  // UTF-16
int32_t len = uldn_regionDisplayName(names, "001", result, 256, &status);
uldn_close(names);  // clean up
ASSERT_OK(status);

UnicodeString msg("Hello, ");
msg.append(result, len);
msg.append(0x2603);  // snowman
std::cout << msg << std::endl;

Hello, World☃
Hello, Welt☃
Hello, Mundo☃
Hello, 世界☃

Page 22:

MessageFormat (1/2)

Word order differs between languages, so you can't just concatenate strings.

whom + " 's " + what + " is on the " + where

Result:
• English: My Aunt’s pen is on the table.
• Spanish: La pluma de mi tía está en la tabla.

Page 23:

MessageFormat (2/2)

• Pattern syntax:
  • English: {whom}’s {what} is on the {where}.

• Translated pattern:
  • Spanish: {what} de {whom} está en la {where}.

• Result:
  • English: My Aunt’s pen is on the table.
  • Spanish: La pluma de mi tía está en la tabla.

• Note: Use caution, a better solution might be:
  • “Location: table Object: pen Owner: Aunt”
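The reordering shown on this slide can be tried without ICU4J. The following is a minimal sketch that assumes the JDK's java.text.MessageFormat as a stand-in: the JDK class only accepts numbered arguments, so {0}, {1} and {2} play the roles of {whom}, {what} and {where} (named arguments are an ICU feature). The class name MessageOrderDemo and the render helper are illustrative, not part of the talk's sample code.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class MessageOrderDemo {
    // Argument indexes: {0} = whom, {1} = what, {2} = where.
    // In a JDK pattern a literal apostrophe must be doubled ('').
    static final String ENGLISH = "{0}''s {1} is on the {2}.";
    static final String SPANISH = "{1} de {0} est\u00e1 en la {2}.";

    /** Formats the arguments with the given pattern; the translator controls the order. */
    static String render(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        // prints: My Aunt's pen is on the table.
        System.out.println(render(ENGLISH, Locale.ENGLISH, "My Aunt", "pen", "table"));
        // prints: La pluma de mi tía está en la tabla.
        System.out.println(render(SPANISH, Locale.forLanguageTag("es"),
                "mi t\u00eda", "La pluma", "tabla"));
    }
}
```

The calling code is identical for both languages; only the pattern string, which would normally come from a resource bundle, changes.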

Page 24:

world messages: setup

Locale defaultLocaleId = Locale::getDefault();  // “en_US”
int32_t territoryCount;  // == number of territories

LocalPointer<LocaleDisplayNames> ldn(
    LocaleDisplayNames::createInstance(defaultLocaleId, ULDN_DIALECT_NAMES));

UnicodeString myLanguage;
ldn->localeDisplayName(defaultLocaleId, myLanguage);  // “American English”

Page 25:

world messages: format

UnicodeString welcome = resourceBundle.getStringEx("welcome", status);
UnicodeString names[] = { "myLanguage", "today", "territoryCount" };
Formattable args[] = { defaultLocaleName, Calendar::getNow(), territoryCount };
MessageFormat fmt(welcome, defaultLocaleId, status);
fmt.format(names, args, 3, result, status);

myLanguage          today        territoryCount
UnicodeString       UDate        int32_t
“American English”  <right now>  257

(English) My language is {myLanguage} and today is {today, date}. We have {territoryCount, number} territories.

(English) My language is American English and today is Apr 15, 2014. We have 257 territories.

(Spanish) Mi lengua es {myLanguage}, y hoy es {today, date}. Tenemos {territoryCount, number} regiones.

(Spanish) Mi lengua es español (Estados Unidos), y hoy es abr. 15, 2014. Tenemos 257 regiones.

Page 26:

world messages: J

String pattern;  // loaded from resource bundle
MessageFormat fmt = new MessageFormat(pattern, defaultLocaleID);
Map<String, Object> msgargs = new HashMap<String, Object>();
msgargs.put("territoryCount", territoryCount);
msgargs.put("myLanguage", defaultLocaleName);
msgargs.put("today", System.currentTimeMillis());
System.out.println(fmt.format(msgargs, new StringBuffer(), null));

Page 27:

message formats (English U.S. examples)

number                    date / time
(default)  123,456.789    (default)  Oct 22, 2013               12:34:56 PM
integer    123,457        short      10/22/13                   12:34 PM
currency   $123,456.79    medium     Oct 22, 2013               12:34:56 PM
percent    82%            long       October 22, 2013           12:34:56 PM PST
                          full       Tuesday, October 22, 2013  12:34:56 PM Pacific Standard Time
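The number column of this table can be reproduced with a few lines of code. This is a sketch under the assumption that the JDK's java.text.MessageFormat is used in place of ICU4J's class of the same name; both accept the style keywords integer, currency and percent after "number". The class NumberStyleDemo and its style helper are hypothetical names chosen for the illustration. Date and time styles are left out because their exact text varies with the locale data version.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class NumberStyleDemo {
    /**
     * Formats one value for en-US using a MessageFormat number style.
     * An empty style name selects the default number format.
     */
    static String style(String styleName, double value) {
        String pattern = styleName.isEmpty()
                ? "{0,number}"
                : "{0,number," + styleName + "}";
        return new MessageFormat(pattern, Locale.US).format(new Object[] { value });
    }

    public static void main(String[] args) {
        System.out.println(style("", 123456.789));          // 123,456.789
        System.out.println(style("integer", 123456.789));   // 123,457
        System.out.println(style("currency", 123456.789));  // $123,456.79
        System.out.println(style("percent", 0.82));         // 82%
    }
}
```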

Page 28:

resource bundles

English: greeting=“A hello to …”
Spanish: greeting=“¡Hola a …!”

ResourceBundle("myApp", locale,…).getString(“greeting”)…

locale= “en”

locale= “es”

Page 29:

Locale IDs

Script

Language

Territory
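The three fields labelled on this slide can be read back from a locale identifier in code. A minimal sketch, assuming the JDK's java.util.Locale (ICU4J's com.ibm.icu.util.ULocale has getLanguage(), getScript() and getCountry() accessors as well); the tag zh-Hant-TW and the class name LocaleIdParts are illustrative choices, not taken from the slides.

```java
import java.util.Locale;

final class LocaleIdParts {
    /** Returns "language/script/territory" for a BCP 47 language tag. */
    static String describe(String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        return locale.getLanguage() + "/" + locale.getScript() + "/" + locale.getCountry();
    }

    public static void main(String[] args) {
        System.out.println(describe("zh-Hant-TW"));  // zh/Hant/TW
        System.out.println(describe("es-US"));       // es//US (no script subtag)
    }
}
```

A missing field comes back as an empty string, which is why the second call shows nothing between the slashes.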

Page 30:

authoring resource bundles

root.txt
root {
    hello { "Hello!" }
}

en.txt
en {
    // empty…
}

“root” contains English text.

“en” fully inherits from root.

“root” is sent out for translation…

Page 31:

authoring resource bundles

root.txt
root {
    hello { "Hello!" }
}

es.txt
es {
    hello { "¡Hola!" }
}

en.txt
en {
    // empty…
}

“es” is returned from translation with Spanish content.

Page 32:

building resource bundles

$ bldicures --from ./txtfiles --dest ./out --name myapp

Page 33:

using resource bundles: C

u_setDataDirectory("out");  // (see userguide)
ResourceBundle resourceBundle("myapp", locale, status);
UnicodeString thing = resourceBundle.getStringEx("hello", status);
std::cout << thing << std::endl;

Page 34:

using ICU4C resource bundles in J

(Most apps will use Java ListResourceBundle)

Locale locale = Locale.getDefault();
UResourceBundle bundle = UResourceBundle.getBundleInstance(
        Sample30_ResHello.class.getPackage().getName().replace('.', '/') + "/data/reshello",
        locale,
        Sample30_ResHello.class.getClassLoader());
System.out.println(bundle.getString("hello"));

Page 35:

population: who’s counting?

root { // root.txt
    info { "The territory of {territory} has {population} persons." }
}

The territory of Caribbean Netherlands has 20,000 persons.

The territory of Brazil has 201,010,000 persons.

The territory of Bouvet Island has 1 persons.

The territory of Unknown Region has 0 persons.

Bouvet de Lozier

Page 36:

population: the problem of plurals

“{count} territor(ies).”

if (territoryCount == 1) {
    message = message1;  // “one territory…”
} else {
    message = messageN;  // “{territoryCount} territories”
}

Page 37:

plurals are language specific

Language             # cases
Japanese, Hungarian  1
English, Italian     2
Polish, Romanian     3
Arabic, Welsh        6

zero, one, two, few, many, other

Page 38:

population: using plurals

root {
    {population, plural,
        =0{{territory} has a population of zero.}
        one{Only one person lives on {territory}!}
        other{The territory of {territory} has # persons.}}
}

The territory of Caribbean Netherlands has 20,000 persons.

The territory of Brazil has 201,010,000 persons.

Only one person lives on Bouvet Island!

Unknown Region has a population of zero.

Bouvet de Lozier
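For comparison, here is roughly what the same three-way message looks like without ICU. The JDK's MessageFormat has no plural categories (plural formatting is one of the ICU benefits listed later in this deck), so this sketch swaps in java.text.ChoiceFormat, which selects a branch by numeric range. That happens to work for the English rules above, but it is not a substitute for CLDR plural rules in languages with more categories. The class name PopulationMessageDemo and the argument numbering are assumptions made for the example.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class PopulationMessageDemo {
    // {0} = population, {1} = territory.
    // Branches: exactly 0, from 1 up to (but not including) the next double, and above 1.
    static final String PATTERN =
            "{0,choice,"
            + "0#{1} has a population of zero."
            + "|1#Only one person lives on {1}!"
            + "|1<The territory of {1} has {0,number,integer} persons.}";

    static String describe(String territory, long population) {
        return new MessageFormat(PATTERN, Locale.US)
                .format(new Object[] { population, territory });
    }

    public static void main(String[] args) {
        System.out.println(describe("Caribbean Netherlands", 20000));
        System.out.println(describe("Bouvet Island", 1));
        System.out.println(describe("Unknown Region", 0));
    }
}
```

With ICU4J the pattern from the slide can be passed to com.ibm.icu.text.MessageFormat directly, and the branch selection then follows the plural rules of the message's locale rather than fixed numeric ranges.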

Page 39:

population: sorting it all out

150,000 Ceuta and Melilla

38,087,800 Algeria

15,439,400 Ecuador

• binary comparison inadequate
• order varies by language (Danish ‘aa…’ follows ‘z…’)
• need multiple-level collation

Page 40:

collation

Uses:
• comparing
• sorting
• searching

Options:
• case sensitive?
• ignore punctuation?
• UPPERCASE first?
• which variant collator?
• which locale?
• custom tailorings?
• time vs. memory tradeoff?

Page 41:

population: collation helper

// for use with std::sort
// (std::binary_function was removed in C++17; the base class can be dropped there)
class CollatorLessThan :
    public std::binary_function<const TerritoryEntry*, const TerritoryEntry*, bool> {
public:
    CollatorLessThan(Collator &coll, UErrorCode &status)
        : fColl(coll), fStatus(status) {}
    inline bool operator()(const TerritoryEntry* a, const TerritoryEntry* b) {
        return (fColl.compare(a->getTerritoryName(),
                              b->getTerritoryName(), fStatus) == UCOL_LESS);
    }
private:
    Collator &fColl;
    UErrorCode &fStatus;
};

Page 42:

population: applying collation

LocalPointer<Collator> coll(Collator::createInstance(locale, status));
std::sort(territoryList,
          territoryList + territoryCount,
          CollatorLessThan(*coll, status));

The territory of Afghanistan has 31,108,100 persons in it.
The territory of Åland Islands has 26,200 persons in it.
The territory of Albania has 3,011,410 persons in it.
The territory of Algeria has 38,087,800 persons in it.
…

Page 43:

population: collation (J)

Collator collator = Collator.getInstance(locale);
territoryList = PopulationData.getTerritoryEntries(locale,
        new TreeSet<TerritoryEntry>(new Comparator<TerritoryEntry>() {
            public int compare(TerritoryEntry o1, TerritoryEntry o2) {
                return collator.compare(o1.territoryName(), o2.territoryName());
            }
        }));
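To see why binary comparison is inadequate, the two orderings can be put side by side. This sketch assumes the JDK's java.text.Collator rather than com.ibm.icu.text.Collator (the ICU4J class offers the same getInstance and compare calls and is the one this deck recommends); the four territory names come from the sorted output shown earlier, and the class name CollationSortDemo is hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class CollationSortDemo {
    static List<String> sample() {
        List<String> names = new ArrayList<>();
        // \u00C5 is LATIN CAPITAL LETTER A WITH RING ABOVE
        Collections.addAll(names, "Algeria", "\u00C5land Islands", "Albania", "Afghanistan");
        return names;
    }

    /** Code-unit (binary) order, as produced by String.compareTo. */
    static List<String> binarySorted() {
        List<String> names = sample();
        Collections.sort(names);
        return names;
    }

    /** Locale-aware order using a collator for the given locale. */
    static List<String> collated(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // Decompose accented characters so they compare by base letter first.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> names = sample();
        names.sort(collator);
        return names;
    }

    public static void main(String[] args) {
        System.out.println("binary:   " + binarySorted());
        System.out.println("collated: " + collated(Locale.ENGLISH));
    }
}
```

Binary order pushes the name starting with Å behind every ASCII name, while an English collator is expected to place it between Afghanistan and Albania, as in the ICU output on the "applying collation" slide.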

Page 44:

population: other locales

English
The territory of Afghanistan has 22 262 500 persons.
The territory of Albania has 8 221 650 persons.
The territory of Azerbaijan has 9 590 160 persons.
The territory of Аландские о-ва has 26 200 persons in it.
The territory of Албания has 3 011 410 persons in it.

Japanese
アイスランドには、315,281人います。
アイルランドには、4,775,980人います。
アゼルバイジャンには、9,590,160人います。
アセンション島には、940人います。
アフガニスタンには、31,108,100人います。
アメリカ合衆国には、316,439,000人います。

Spanish
En la región de “Afganistán” hay 31.108.100 personas.
En la región de “Albania” hay 3.011.410 personas.
En la región de “Alemania” hay 81.147.300 personas.
En la región de “Andorra” hay 85.293 personas.
En la región de “Angola” hay 18.565.300 personas.

Page 45:

ICU API Stability Policy

Page 46:

API Status

December 3, 2015

Internal: Used by ICU implementation or Technology Preview.

Draft: New API, reviewed and approved by ICU project team. The API might still be changed.

Stable: For public use; the API signature won't be changed in future releases.

Deprecated: Previously Stable, but no longer recommended. The API might be removed after a few releases.

Page 47:

Backward Compatibility

Source code compatible
• Consumer program should be compiled successfully without changes.
• Rare exceptions, documented in readme.

Serialization compatible (ICU4J)
• Newer ICU version should be able to deserialize object data serialized by older ICU version.

Page 48:

Packaging and Dependency Questions

Page 49:

ICU Customization (Problems and Solutions)

§ ICU Is Too Big

• Data
  - Details in the userguide: http://userguide.icu-project.org/icudata
  - Data Customizer tool: http://apps.icu-project.org/datacustom/

• Code size (ICU4C)
  - http://userguide.icu-project.org/packaging
  - Example: #define UCONFIG_NO_LEGACY_CONVERSION 1
  - (May not reduce data size)

Page 50:

ICU data changes from time to time...

Unicode: 4.1, 5.0, 5.1, 5.2, 6.0, 6.2, 6.3

CLDR: 1.6, 1.7, 1.8, 1.9, 2.0, 21, 22, 23, 24, 25, 26

ICU 4.4:
• Polish date format: "08-04-2014"
• isLetter(' '): false (undefined)

ICU 53:
• Polish date format: "8 kwi 2014"
• isLetter(' '): true

(date format data from CLDR)

(isLetter data from Unicode)

Page 51:

Compatibility Concerns

§ Unicode stability
• character type, upper/lower case, normalization, text direction, sorting order...
• policy [http://www.unicode.org/policies/stability_policy.html]
• but, Unicode is still growing

§ locale data
• cultural data can be updated based on community voting
• cultural format results are not suited for serializing data, application protocols and storage

Page 52:

Protecting your application from change-induced problems

§ “08-04-2014” may not parse as “8 kwi 2014”
• DON'T
  - send localized data across the network between programs (other side may parse/format differently)
  - store localized data on disk (later app version may parse/format differently)
• DO send and store non-localized format:
  - Binary: 0x12345678
  - “Neutral”: ISO 8601, “2014-04-08”

§ A character may not be a letter (isLetter()) in one Unicode version, but may later be defined.
• Could cause difficulties if used to validate account names, …
• DO: Think carefully about where Unicode properties are used.
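The "DO" advice above can be illustrated in a few lines: keep a locale-neutral value for storage and transport, and produce localized text only at display time. The sketch below assumes the JDK's java.time package instead of ICU's date formatting; the class NeutralDateDemo and its method names are invented for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

final class NeutralDateDemo {
    /** Locale-neutral ISO 8601 form, safe to store or send between programs. */
    static String toWire(LocalDate date) {
        return date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    /** Parses the neutral form back; this does not depend on any locale data. */
    static LocalDate fromWire(String text) {
        return LocalDate.parse(text, DateTimeFormatter.ISO_LOCAL_DATE);
    }

    /** Localized form for display only; the exact text depends on the locale data version. */
    static String forDisplay(LocalDate date, Locale locale) {
        return date.format(
                DateTimeFormatter.ofLocalizedDate(FormatStyle.MEDIUM).withLocale(locale));
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2014, 4, 8);
        String stored = toWire(date);  // "2014-04-08"
        System.out.println(stored);
        // Display text may differ between runtime versions, so it is never parsed back.
        System.out.println(forDisplay(fromWire(stored), Locale.forLanguageTag("pl")));
    }
}
```

Only the output of toWire is ever persisted; forDisplay can change from release to release without breaking stored data.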

Page 53:

Thanks / Q & A

Info/Support/Download: http://icu-project.org/

Page 54:

ICU4J and JDK classes – 1 – When to use ICU4J vs. JDK

JDK class: java.lang.Character
ICU class: com.ibm.icu.lang.UCharacter
ICU benefits:
• Latest Unicode standard
• More character properties support
Suggestion: JDK OK; ICU as-needed

JDK class: java.math.BigDecimal
ICU class: com.ibm.icu.math.BigDecimal
ICU benefits:
• For backward compatibility only
Suggestion: ICU not recommended in new code

JDK class: java.text.Bidi
ICU class: com.ibm.icu.text.Bidi
ICU benefits:
• Latest Unicode bidi algorithm (UAX#9)
Suggestion: JDK OK; ICU as-needed

JDK class: java.text.BreakIterator
ICU class: com.ibm.icu.text.BreakIterator
ICU benefits:
• Latest Unicode standard (UAX#29)
• Dictionary based word break (Thai, Lao, Chinese/Japanese)
Suggestion: JDK OK; ICU as-needed

JDK class: java.text.Collator, java.text.RuleBasedCollator
ICU class: com.ibm.icu.text.Collator, com.ibm.icu.text.RuleBasedCollator
ICU benefits:
• Unicode collation algorithm (UTS#10)
• Faster comparison
Suggestion: ICU recommended

§ ICU has functionality beyond JDK – not shown on this chart. See userguide.

§ Where there is overlap, in some cases JDK may be used instead of ICU.

Page 55:

ICU4J and JDK classes - 2

JDK class: java.text.DateFormat, java.text.SimpleDateFormat
ICU class: com.ibm.icu.text.DateFormat, com.ibm.icu.text.SimpleDateFormat
ICU benefits:
• Abstract (skeleton) pattern (e.g. year-month only format)
• Patterns for additional calendar types
• More field format types (e.g. narrow weekday, standalone month)
• Capitalization control
• Slower service object creation, format & parse than JDK
Suggestion: JDK OK; ICU as-needed

JDK class: java.text.MessageFormat
ICU class: com.ibm.icu.text.MessageFormat
ICU benefits:
• Plural formatting
• Gender formatting (social applications)
• Named arguments (“{filename}” vs “{4}”)
• Auto apostrophe mode
Suggestion: JDK OK; ICU as-needed

JDK class: java.text.NumberFormat, java.text.DecimalFormat
ICU class: com.ibm.icu.text.NumberFormat, com.ibm.icu.text.DecimalFormat + RuleBasedNumberFormat
ICU benefits:
• More styles (e.g. scientific, currency spell out)
• Parse currency
• Algorithmic numbering systems
• Slower service object creation, format & parse than JDK
Suggestion: JDK OK; ICU as-needed

Page 56:

ICU4J and JDK classes - 3

JDK class: java.util.Calendar, java.util.GregorianCalendar
ICU class: com.ibm.icu.util.Calendar, com.ibm.icu.util.GregorianCalendar + other Calendar implementation classes
ICU benefits:
• More calendar types (e.g. Chinese lunar, Coptic, Hindu)
• Finer control around time zone transition
• Slower service object creation than JDK
• No Java 8 java.time package integration yet
Suggestion: JDK OK; ICU as-needed

JDK class: java.util.Locale
ICU class: com.ibm.icu.util.ULocale
ICU benefits:
• Full BCP 47 language tag syntax support also on Java 6 or older runtime
Suggestion: JDK 7 or newer: preferred. JDK 6 or older: OK, but ICU on as-needed basis.

JDK class: java.util.ResourceBundle
ICU class: com.ibm.icu.util.UResourceBundle
ICU benefits:
• ICU style bundle support
• Bundle name with script (e.g. zh_Hant)
• No JDK ResourceBundle.Control support (no controls for customizing lookup ordering, fallbacks and custom bundle names)
Suggestion: JDK recommended. ICU not recommended unless you want to share the same resource with ICU4C.

JDK class: java.util.TimeZone, java.util.SimpleTimeZone
ICU class: com.ibm.icu.util.TimeZone, com.ibm.icu.util.SimpleTimeZone + other TimeZone implementation classes
ICU benefits:
• Iteration for time zone transitions
• Custom time zone built from multiple different rules
Suggestion: JDK OK; ICU as-needed. JDK java.time.zone classes on Java 8 support zone transitions.

• ICU SPI provider can provide ICU functionality through JDK API in Java 6+
• Don't just change the import statements to use ICU.