Localization and Internationalization

Transcript of Localization and Internationalization

Page 1: Localization and Internationalization

5 — Localization

From Code to Product gidgreen.com/course

Page 2: Localization and Internationalization

Getting it wrong

From Code to Product Lecture 5 — Localization— Slide 2 gidgreen.com/course

Page 3: Localization and Internationalization

Something we should know?

Page 4: Localization and Internationalization

Lecture 5

•  Countries and languages •  Character sets •  Unicode •  Text localization •  Outsourcing translation •  Other localization

Page 5: Localization and Internationalization

Population

China 1,347 M 19.3%

India 1,210 M 17.3%

USA 313 M 4.5%

Indonesia 238 M 3.4%

Brazil 192 M 2.8%

Pakistan 179 M 2.6%

Nigeria 162 M 2.3%

Russia 143 M 2.0%

Bangladesh 142 M 2.0%

Japan 128 M 1.8%

Mandarin 845 M 12.1%

Spanish 329 M 4.7%

English 328 M 4.7%

Hindi-Urdu 240 M 3.4%

Arabic 221 M 3.2%

Bengali 181 M 2.6%

Portuguese 178 M 2.5%

Russian 144 M 2.1%

Japanese 122 M 1.7%

Punjabi 109 M 1.6%

2011-2012 from Wikipedia

Page 6: Localization and Internationalization

Economic weight (nominal)

USA $14.4 T 23.7%

Japan $4.9 T 8.1%

China $4.3 T 7.1%

Germany $3.7 T 6.0%

France $2.9 T 4.7%

UK $2.7 T 4.4%

Italy $2.3 T 3.8%

Russia $1.7 T 2.8%

Spain $1.6 T 2.6%

Brazil $1.6 T 2.6%

English $21.3 T 34.9%

Chinese $5.2 T 8.4%

Japanese $4.9 T 8.1%

German $4.4 T 7.2%

Spanish $4.2 T 6.8%

French $4.0 T 6.5%

Italian $2.5 T 4.1%

Russian $2.2 T 3.7%

Portuguese $1.9 T 3.1%

Arabic $1.9 T 3.1%

2008 from globalization-group.com, IMF

Page 7: Localization and Internationalization

Internet users

China 485 M 36%

USA 245 M 78%

India 100 M 8%

Japan 99 M 78%

Brazil 76 M 37%

Germany 65 M 80%

Russia 60 M 43%

UK 51 M 82%

France 45 M 70%

Nigeria 44 M 28%

English 565 M 43%

Chinese 510 M 37%

Spanish 165 M 39%

Japanese 99 M 78%

Portuguese 83 M 32%

German 75 M 80%

Arabic 65 M 19%

French 60 M 17%

Russian 60 M 43%

Korean 39 M 55%

2011 from internetworldstats.com

Page 8: Localization and Internationalization

Internet penetration

Page 9: Localization and Internationalization

E-commerce volumes

USA $135B
Japan $51B
China $37B
Germany $36B
France $28B
UK $28B
Italy $19B
Canada $16B
Spain $15B
South Korea $13B
Other $123B

2009 from Everis

Page 10: Localization and Internationalization

Multilingual countries

Canada: English 21M, French 8M

Switzerland: German 5.0M, French 1.6M, Italian 0.5M

Page 11: Localization and Internationalization

Language variations

•  US vs UK English – color | colour – vacation | holiday – Where are you (at)?

•  European vs Brazilian Portuguese •  French •  Spanish

Page 12: Localization and Internationalization

Language codes (ISO 639-1, optionally followed by an ISO 3166-1 country code)

ar Arabic

fr French

nl Dutch

de German

he Hebrew

it Italian

ja Japanese

pl Polish

ru Russian

es Spanish

zh-CN Chinese (simplified)

zh-TW Chinese (traditional)

en-GB English (UK)

en-US English (US)

pt-BR Portuguese (Brazilian)

pt-PT Portuguese (Portugal)

es-AR Spanish (Argentina)

es-CL Spanish (Chile)

es-MX Spanish (Mexico)

es-ES Spanish (Spain)
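These codes are typically used to decide which translation to serve. The sketch below is only an illustration: the set of available locales, the default, and the fallback rule (exact tag, then bare language, then default) are assumptions, not something the slides prescribe.

```python
# Hypothetical set of locales the product has been translated into.
AVAILABLE = {"en-US", "en-GB", "es", "es-MX", "pt-BR", "zh-CN"}
DEFAULT = "en-US"


def pick_locale(requested, available=AVAILABLE, default=DEFAULT):
    """Return the best available locale for a tag such as 'es-AR'."""
    language, _, region = requested.partition("-")
    language = language.lower()
    # Normalize case: language lower-case, region upper-case.
    tag = f"{language}-{region.upper()}" if region else language
    if tag in available:
        return tag          # exact match, e.g. es-MX
    if language in available:
        return language     # fall back to the bare language, e.g. es
    return default          # nothing suitable: use the default locale


print(pick_locale("es-MX"))
print(pick_locale("es-AR"))
print(pick_locale("ja"))
```

With this rule a visitor asking for Argentine Spanish still gets generic Spanish rather than English.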

Page 13: Localization and Internationalization

Lecture 5

•  Countries and languages •  Character sets •  Unicode •  Text localization •  Outsourcing translation •  Other localization

Page 14: Localization and Internationalization

Computer representation

Binary 0 1 0 0 0 0 0 1 = decimal 65 = hex 41 = A

Each byte value 0 … 255 (hex 00 … FF) maps to one character: punctuation, a…z, A…Z, 0…9, …
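The same mapping can be checked in a few lines. This is a small sketch using Python's built-in `ord` and `chr`; the variable name is arbitrary.

```python
code = ord("A")            # character -> number
print(code)                # decimal value
print(hex(code))           # hexadecimal value
print(format(code, "08b")) # the eight bits stored in the byte
print(chr(0x41))           # number -> character
```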

Page 15: Localization and Internationalization

US-ASCII

Image from czyborra.com

Page 16: Localization and Internationalization

ISO-8859-1

Page 17: Localization and Internationalization

Windows-1252

Page 18: Localization and Internationalization

ISO-8859-5

Page 19: Localization and Internationalization

ISO-8859-8

Page 20: Localization and Internationalization

Problems with character sets

•  Extra metadata •  Potential for misdisplay •  Mutually exclusive •  Little space to grow - e.g. € •  Ideographic languages – 70,000+ Chinese characters – Multibyte encoding
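The "misdisplay" and "little space to grow" problems are easy to reproduce. The sketch below decodes one and the same byte under three of the character sets shown earlier, then tries to store a euro sign; the byte value 0xE0 is just a convenient example.

```python
raw = bytes([0xE0])  # one byte, three different meanings

for charset in ("iso8859_1", "iso8859_5", "iso8859_8"):
    # Without the right metadata the reader cannot know which one is meant.
    print(charset, raw.decode(charset))

euro = "€"
print(euro.encode("cp1252"))  # Windows-1252 found room for it at 0x80
try:
    euro.encode("iso8859_1")
except UnicodeEncodeError:
    print("ISO-8859-1 has no euro sign")
```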

Page 21: Localization and Internationalization

Lecture 5

•  Countries and languages •  Character sets •  Unicode •  Text localization •  Outsourcing translation •  Other localization

Page 22: Localization and Internationalization

The Unicode solution

•  One global character set – Over 110,000 characters – Over 100 alphabets

•  1,114,112 code points – 0…255 compatible with ISO-8859-1 – U+0041 = A

•  Multiple encodings

Page 23: Localization and Internationalization

U+0000 … U+007F

Page 24: Localization and Internationalization

U+0080 … U+00FF

Page 25: Localization and Internationalization

U+0400 … U+047F

Page 26: Localization and Internationalization

U+0590 … U+060F

Page 27: Localization and Internationalization

U+4E00 … U+4E7F

Page 28: Localization and Internationalization

U+2190 … U+220F

Page 29: Localization and Internationalization

U+2600 … U+267F

Page 30: Localization and Internationalization

UTF-16 encoding

•  2 or 4 bytes per code point •  Simple for U+0000…D7FF and E000…FFFF – “Basic Multilingual Plane”

•  Higher code points use 4 bytes •  U+FEFF = byte-order mark – No well-followed default

•  Windows APIs since Windows 2000 – Also .NET, Android, iOS, Mac OS X
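A short sketch of what this looks like in practice, using Python's built-in `utf-16-be` and `utf-16-le` codecs. The helper name `utf16_units` is made up for this example.

```python
def utf16_units(ch):
    """Return the 16-bit UTF-16 code units of one character as hex strings."""
    data = ch.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


print(utf16_units("A"))           # inside the BMP: one 2-byte unit
print(utf16_units("\u20ac"))      # euro sign, still one unit
print(utf16_units("\U0001F600"))  # above U+FFFF: a surrogate pair, 4 bytes

# Byte order matters, which is why the byte-order mark U+FEFF exists.
print("A".encode("utf-16-le"), "A".encode("utf-16-be"))
print("\ufeff".encode("utf-16-be"))
```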

Page 31: Localization and Internationalization

UTF-8 encoding

•  1 to 4 bytes per code point (the original design allowed up to 6) •  1 byte for U+0000…007F – Perfect compatibility with ASCII

•  2 bytes for U+0080…07FF – etc…

•  Byte order mark allowed – But unnecessary, causes problems

•  Dominant on web, email
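To see the variable length at work, the sketch below encodes four sample characters chosen to need one, two, three and four bytes respectively. The sample list is an arbitrary choice.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# Plain ASCII text is byte-for-byte identical in UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))
```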

Page 32: Localization and Internationalization

UTF-8 encoding

Page 33: Localization and Internationalization

UTF-8 advantages

•  Natural compression for English •  English works in old tools/APIs – HTML tags unaffected

•  No shared values between byte types – Easy to synchronize mid-stream – Easy to search by byte value

•  No zero bytes (good for C) •  Byte-sorting = codepoint-sorting
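Two of these properties can be checked directly. The sketch below compares sort orders and looks at the bytes of multi-byte sequences; the word list is arbitrary and includes two code points picked to expose the difference with UTF-16.

```python
words = ["zebra", "Z\u00fcrich", "\u00e4pfel", "\uffff", "\U00010000", "apple"]

by_code_point = sorted(words)  # Python compares strings by code point
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_utf16_bytes = sorted(words, key=lambda w: w.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)   # byte order matches code point order
print(by_code_point == by_utf16_bytes)  # surrogate pairs break this in UTF-16

# Every byte of a multi-byte sequence is 0x80 or above, so an ASCII byte
# such as "<" or NUL never appears inside another character.
multi = "\u00e9\u20ac".encode("utf-8")
print(all(b >= 0x80 for b in multi))
```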

Page 34: Localization and Internationalization

Unicode on the web

Source: googleblog.blogspot.com

Page 35: Localization and Internationalization

Lecture 5

•  Countries and languages •  Character sets •  Unicode •  Text localization •  Outsourcing translation •  Other localization

Page 36: Localization and Internationalization

The original source code

function Check_Username(username)
  …
  if Username_Taken(username)…
    error="username is taken."
  …
  return error
end function

Page 37: Localization and Internationalization

And now in Spanish…

function Check_Username(username)
  …
  if Username_Taken(username)…
    error="username se toma."
  …
  return error
end function

Page 38: Localization and Internationalization

Internationalized

function Check_Username(username)
  …
  if Username_Taken(username)…
    error=Get_String("un-taken")
  …
  return error
end function

Page 39: Localization and Internationalization

Internationalized

function Check_Username(username)
  …
  if Username_Taken(username)…
    error=Translate("username is taken")
  …
  return error
end function
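Both lookup styles can be sketched with plain dictionaries. Everything here is hypothetical: the catalogs, the Spanish wording, the function names and the fallback behaviour are illustrative assumptions, and a real product would normally load its strings from resource files.

```python
# Hypothetical in-memory catalogs, for illustration only.
STRINGS_BY_ID = {
    "en": {"un-taken": "username is taken."},
    "es": {"un-taken": "el nombre de usuario ya está en uso."},
}
TRANSLATIONS = {
    "es": {"username is taken": "el nombre de usuario ya está en uso"},
}


def get_string(string_id, lang):
    """ID-based lookup: fall back to English if the language lacks the ID."""
    english = STRINGS_BY_ID["en"][string_id]
    return STRINGS_BY_ID.get(lang, {}).get(string_id, english)


def translate(english, lang):
    """English-keyed lookup: untranslated text falls through unchanged."""
    return TRANSLATIONS.get(lang, {}).get(english, english)


print(get_string("un-taken", "es"))
print(translate("username is taken", "es"))
print(translate("username is taken", "fr"))  # no French catalog: English shown
```

Either way the calling code no longer contains any language-specific text.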

Page 40: Localization and Internationalization

IDs vs English strings

IDs | English strings
More compact code | More explicit code
English can be changed | Enforces sync between languages
Less error-prone | Easier for third parties

Page 41: Localization and Internationalization

Concatenation is evil

print Translate("You will travel from ") + from_city + Translate(" to ") + to_city

You will travel from London to Paris

Usted viajará de London a Paris

Sie werden von London nach Paris reisen

Page 42: Localization and Internationalization

Substitutions

raw=Translate("You will travel from %from% to %to%")
raw=replace(raw, "%from%", from_city)
print replace(raw, "%to%", to_city)

You will travel from %from% to %to%
Usted viajará de %from% a %to%
Sie werden von %from% nach %to% reisen
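The same idea in runnable form, as a sketch only. It uses Python's named `str.format` fields in place of the `%from%` markers, and the `TEMPLATES` dictionary and function name are hypothetical stand-ins for a real translation catalog.

```python
# Hypothetical catalog: one whole-sentence template per language.
TEMPLATES = {
    "en": "You will travel from {from_city} to {to_city}",
    "es": "Usted viajará de {from_city} a {to_city}",
    "de": "Sie werden von {from_city} nach {to_city} reisen",
}


def travel_message(lang, from_city, to_city):
    """Fill the named placeholders; each language controls its own word order."""
    return TEMPLATES[lang].format(from_city=from_city, to_city=to_city)


for lang in ("en", "es", "de"):
    print(travel_message(lang, "London", "Paris"))
```

Because the translator sees the whole sentence, the German verb can move to the end without any change to the code.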

Page 43: Localization and Internationalization

Singular/plural

if (credits is 1)
  c_string=translate("1 credit")
else
  c_string=replace(translate("%#% credits"), "%#%", credits)
raw=translate("You have %credits% left")
print replace(raw, "%credits%", c_string)

You have 3 credits left
You have 1 credit left
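One existing tool for this is the `ngettext` call in Python's standard `gettext` module, which takes a singular form, a plural form and the count. The sketch below uses `NullTranslations`, which performs no lookup and simply returns the source strings, so it only demonstrates the calling pattern; the function name is made up.

```python
import gettext

# NullTranslations does no lookup and returns the source strings. A catalog
# loaded with gettext.translation() would apply the target language's rules.
catalog = gettext.NullTranslations()


def credits_message(credits):
    """Pick the singular or plural template, then fill in the number."""
    template = catalog.ngettext(
        "You have %(n)d credit left",
        "You have %(n)d credits left",
        credits,
    )
    return template % {"n": credits}


print(credits_message(1))
print(credits_message(3))
```

Keeping the whole sentence in one template also avoids stitching a translated fragment into another translated sentence.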

Page 44: Localization and Internationalization

Text in images

Page 45: Localization and Internationalization

Width in layouts

أشكركم على الدفع.
感谢您的付款。
Gracias por su pago.
אנו מודים לך על התשלום.
Спасибо за ваш платеж.
Thank you for your payment.
Vielen Dank für Ihre Bezahlung.
Σας ευχαριστούμε για την πληρωμή σας.
Nous vous remercions de votre paiement.
お支払いしていただきありがとうございます。

+57%!

Page 46: Localization and Internationalization

LTR / RTL

Page 47: Localization and Internationalization

Lecture 5

•  Countries and languages •  Character sets •  Unicode •  Text localization •  Outsourcing translation •  Other localization

Page 48: Localization and Internationalization

Outsourcing translation

•  Preparing code •  Collecting (English) assets •  Choosing a provider •  Costs and quotes •  Glossary •  Translation memory •  Independent review

Page 49: Localization and Internationalization

Collecting assets

•  Text files – Simple arrays or resource files – Standard formats, e.g. gettext, XLIFF

•  HTML files – Risk of accidental markup changes

•  Graphics files – Originals, not rendered

•  Think about text expansion

Page 50: Localization and Internationalization

Choosing a provider

•  Problem: you can’t assess quality •  Go by reputation and clients – Examples of previous work

•  Ask who will actually do it – Native speaker of target language – Subject-specific experience

•  Consider future language needs

Page 51: Localization and Internationalization

Cost and quotes

Ibidem-translations.com

•  Add 15-50% for specialized areas •  Clarify how words are counted •  Check for extra costs

Page 52: Localization and Internationalization

Glossary

•  Fixed translation for specific terms – Control over branding – Domain-specific terminology – Consistency

•  Not-to-be-translated terms •  Requires thorough review of product

Page 53: Localization and Internationalization

Glossary

Image from Google Translator Toolkit Help

Page 54: Localization and Internationalization

Translation memory

•  Lots of translation is repetitive – Same text in many places – Small changes between versions

•  Same sentence = same translation – Save time and money – Help ensure consistency – But manual confirmation required

•  Should be owned by you

Page 55: Localization and Internationalization

Translation memory

Image from kilgray.com screenshots

Page 56: Localization and Internationalization

Machine translation

Page 57: Localization and Internationalization

Lecture 5

•  Countries and languages •  Character sets •  Unicode •  Text localization •  Outsourcing translation •  Other localization

Page 58: Localization and Internationalization

Numbers

1,234,567.89 — Japan, UK, USA
1 234 567,89 — France, Central Europe
1.234.567,89 — Germany, Scandinavia
1’234’567.89 — Switzerland
123,4567.89 — China
1’234,567.89 — Mexico
12,34,567.89 — India
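A hand-rolled formatter shows how few parameters separate most of these conventions. This is only a sketch: the `STYLES` table and function name are hypothetical, the Swiss style uses a plain ASCII apostrophe, and production code would normally rely on locale data rather than a table like this.

```python
def format_number(value, group_sep=",", decimal_sep=".", other_groups=3):
    """Format with two decimals; the last group has 3 digits, the rest
    have `other_groups` digits (2 for the Indian style)."""
    integer_part, fraction = f"{abs(value):.2f}".split(".")
    head, tail = integer_part[:-3], integer_part[-3:]
    groups = [tail]
    while head:
        groups.insert(0, head[-other_groups:])
        head = head[:-other_groups]
    sign = "-" if value < 0 else ""
    return sign + group_sep.join(groups) + decimal_sep + fraction


# Hypothetical style table covering some of the conventions listed above.
STYLES = {
    "USA": dict(group_sep=",", decimal_sep="."),
    "France": dict(group_sep=" ", decimal_sep=","),
    "Germany": dict(group_sep=".", decimal_sep=","),
    "Switzerland": dict(group_sep="'", decimal_sep="."),
    "India": dict(group_sep=",", decimal_sep=".", other_groups=2),
}

for country, style in STYLES.items():
    print(country, format_number(1234567.89, **style))
```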

Page 59: Localization and Internationalization

Date and Times

7/21/2012 21/7/2012 21.7.2012 2012-07-12 7. 21. 2012 7-12-2012

15:45 3.45 PM 3:45 pm

Page 60: Localization and Internationalization

Time zones

Map from wikipedia.org

Page 61: Localization and Internationalization

Displaying times online

•  Store times independent of zone •  Options for display – Ask the user for their time zone – Show an explicit time zone – Use “ago” notation

•  Javascript to get from browser
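The server-side half of this can be sketched with Python's standard `datetime` module. The assumptions here are that times are stored as UTC, that the user's zone is a fixed +9 hour offset (a real application would use named zones so daylight saving is handled), and that the wording of the "ago" strings is up to the product.

```python
from datetime import datetime, timedelta, timezone


def time_ago(then, now):
    """Describe a timestamp relative to `now` in coarse, zone-free units."""
    seconds = int((now - then).total_seconds())
    if seconds < 60:
        return "just now"
    if seconds < 3600:
        return f"{seconds // 60} min ago"
    if seconds < 86400:
        return f"{seconds // 3600} h ago"
    return f"{seconds // 86400} d ago"


# Stored value: always UTC, whatever zone the user was in.
stored = datetime(2012, 7, 21, 15, 45, tzinfo=timezone.utc)

# Display option: convert to the zone the user chose and label it explicitly.
user_zone = timezone(timedelta(hours=9), "JST")
local = stored.astimezone(user_zone)
print(local.strftime("%Y-%m-%d %H:%M %Z"))

# Display option: "ago" notation needs no time zone at all.
print(time_ago(stored, stored + timedelta(hours=5)))
```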

Page 62: Localization and Internationalization

Currencies

•  Biggest traded currencies: $ € ¥ £ – But there are almost 200

•  How to display – Number formatting – Symbols: ₪ ₩ ฿ $ – Currency codes: USD EUR JPY GBP CAD AUD

•  Also: currency conversion – Live feed, e.g. from ECB

Page 63: Localization and Internationalization

Names

•  Surname can come first – China, Japan, Korea, Hungary

•  Multiple surnames – José Santos Tavares Melo Silva

•  Middle names/initials •  Double-barrelled names – Sarah-Jane Darlington-Whit

•  No spaces in CJK

Page 64: Localization and Internationalization

Names

Full Name:

What should we call you?

Family name:

Other/given names:

•  Or localize based on language •  Do you need names at all? – Username or email can be enough

Page 65: Localization and Internationalization

Addresses

John Doe
Acme, Inc
Suite 3B-3824
294 W Ronson
Dallas TX 75211
USA

John Smith
Acme, Ltd
Flat 384
33 Walton Road
Birmingham B26 3QJ
UK

〒100-8994
東京都中央区八重洲一丁目5番3号
東京中央郵便局

Tokyo Central Post Office
1-5-3 Yaesu, Chuo-ku
Tokyo 100-8994
Japan

C/Pescadoro, 13, 2°, 3ª
28331 – Madrid
Spain

Page 66: Localization and Internationalization

Addresses

•  Single multi-line field •  Change in response to country •  Generic format

Page 67: Localization and Internationalization

Phone numbers

UK: +44 (0) 123-456-7890
France: +33 1-23-45-67-89
China: +86 10-2345-6789
USA: +1 (123) 456-7890 x123

•  Country selector •  Change in response to country •  Generic format

Page 68: Localization and Internationalization

Indexing, sorting, searching

•  Capitalization and accents – Øyvind matches oyvind?

•  Collation (sort order) – Swedish: a b c … x y z å ä ö – French: cote côte coté côté

•  CJK (ideographic languages) – No spaces between words – Sort based on stroke count
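A common first attempt at accent- and case-insensitive matching is sketched below with Python's standard `unicodedata` module. The function name `fold` is made up, and the approach is deliberately naive: it shows why simple folding and code point sorting are not enough on their own.

```python
import unicodedata


def fold(text):
    """Case-fold, decompose, and drop combining accents for loose matching."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Accented letters that decompose into base letter + accent match as hoped.
print(fold("C\u00f4t\u00e9") == fold("cote"))

# O-with-stroke has no decomposition, so this simple approach does not match.
print(fold("\u00d8yvind") == fold("oyvind"))

# Sorting by code point is not Swedish alphabetical order (which is a-ring,
# a-umlaut, o-umlaut); a real collation needs locale-specific rules.
print(sorted(["\u00e5", "\u00e4", "\u00f6"]))
```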

Page 69: Localization and Internationalization

Paper sizes

US Legal 356 x 216 mm

US Letter 279 x 216 mm

A4 297 x 210 mm

Page 70: Localization and Internationalization

Domain names

•  Country-code top-level domains – .fr .de .uk .in .br .jp .cn

•  Need separate registrar for many •  Some countries have restrictions – .com.au requires registered company – .ca requires nationality/residence – Also restricted: .fr .br .cn .ie .jp …

•  Internationalized domain names
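Internationalized domain names are carried over DNS in an ASCII form. The sketch below uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules, on a made-up example domain.

```python
name = "b\u00fccher.example"  # hypothetical domain containing u-umlaut

ascii_form = name.encode("idna")  # what is actually sent to DNS
print(ascii_form)

round_trip = ascii_form.decode("idna")  # back to the readable form
print(round_trip)
```

Each non-ASCII label becomes an `xn--` label, while plain ASCII labels such as `example` pass through unchanged.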

Page 71: Localization and Internationalization

And there’s more…

•  Units of measurement •  Colors •  Images of people •  Calendars •  Holidays •  Border disputes •  Culture •  Law

Page 72: Localization and Internationalization

Google in China

•  2005: Chinese language google.com •  2006: google.cn under censorship •  2009: China blocks YouTube •  2010: Google claims hacking attack – Redirects google.cn to google.com.hk – China blocks it for a day

•  Today: Baidu 79%, Google 17% – Baidu links to MP3/movie downloads

Page 73: Localization and Internationalization

Getting real

•  It’s time-consuming and costly •  Cheap wins in version 1.0 – Parameterize + functionize – Use Unicode throughout – Flexible layouts

•  See where there is demand •  Identify most important locales

Page 74: Localization and Internationalization

Getting real

•  Don’t skimp on the details – Needs to look native

•  Use serious service providers •  Prepare for tech support – Machine translation an option?

•  It will slow development – So wait for product maturity
