Internationalization Introduction & Overview

Source file: files.meetup.com/3189882/Java i18n- 25th Feb.pdf


Page 1: Internationalization Introduction & Overviewfiles.meetup.com/3189882/Java i18n- 25th Feb.pdf · Author: pardesha Subject: Coproate PowerPoint Template Keywords: Java, Java FY15, PowerPoint

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Internationalization Introduction & Overview

Arvind

27th Feb 2016

Oracle Confidential – Internal/Restricted/Highly Restricted


Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Agenda

1. Why should I care about Internationalization

2. History / Evolution

3. Various encodings

4. I18n & Java

5. References


😁,😂 , 😃 , 😄

✌, ☂, ♫, ♪

☕, 🏥, 🏦, 🏨 , 🏊

Internationalization


Internationalization

First, why should I care?

• Mojibake: text is written one way, but it may actually display like this:

[Figure: example of garbled (mojibake) text]


05/06/07

or

902.300

Internationalization


Internationalization

How 05/06/07 is read:

GB English: 5th June 2007

US English: 6th May 2007

Japanese: 7th June 2005

How 902300 is written:

Germany: 902.300

France: 902 300

United States: 902,300
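The differences above can be reproduced with the JDK's locale-sensitive formatters. The following is a minimal sketch, not taken from the slides: the class name LocaleFormats and its helper methods are made up for illustration, and the exact short date patterns (and the space character used for French grouping) depend on the JDK version and its locale data.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class LocaleFormats {

    // Formats a whole number with the grouping conventions of the given locale.
    static String formatNumber(long value, Locale locale) {
        return NumberFormat.getIntegerInstance(locale).format(value);
    }

    // Formats a date in the short style of the given locale.
    static String formatShortDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.SHORT, locale).format(date);
    }

    public static void main(String[] args) {
        // 5th June 2007, the date used in the example above.
        Date date = new GregorianCalendar(2007, Calendar.JUNE, 5).getTime();
        Locale[] locales = { Locale.UK, Locale.US, Locale.JAPAN, Locale.GERMANY, Locale.FRANCE };
        for (Locale locale : locales) {
            System.out.println(locale + ": " + formatShortDate(date, locale)
                    + "  " + formatNumber(902300, locale));
        }
    }
}
```

The same Date and the same number go in every time; only the Locale passed to the formatter changes what the user sees.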


Measurement confusion causes $125 Million loss

Source: http://edition.cnn.com/TECH/space/9909/30mars.metric.02/

Internationalization


Internationalization

Percentage of English speakers by country.

Source: https://en.wikipedia.org/wiki/List_of_countries_by_English-speaking_population


Internationalization

• Why is it important

– Revenue generation

– Survival

– More adoption / Popularity of the software


Internationalization

Isthisavalidtextyouarelookingat

Character word sentence line


Sort order of the same eight names (numbered by their German position):

German             Swedish
01: Åkersberga     02: Alingsås
02: Alingsås       04: Oskarshamn
03: Äppelbo        07: Utting
04: Oskarshamn     06: Üttfeld
05: Östersund      08: Zwickau
06: Üttfeld        01: Åkersberga
07: Utting         03: Äppelbo
08: Zwickau        05: Östersund

The basic principle to remember is: the position of characters in the Unicode code charts does not specify their sort order.

• UCA collation (Unicode Collation Algorithm)

• Common words also differ by region (lift vs. elevator, the start of the week)

Table 1 shows some examples of cases where sort order differs by language, usage, or another customization.

Language:

Swedish: z < ö

German: ö < z
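In Java, language-specific sort order is available through java.text.Collator. The sketch below is illustrative only: the class name CollationDemo and the helper sortFor are made up, and the Swedish result assumes the JDK in use ships Swedish locale data. Non-ASCII letters are written as Unicode escapes so the source file does not depend on its own encoding.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationDemo {

    // Returns a sorted copy of the names using the collation rules of the given locale.
    static String[] sortFor(Locale locale, String... names) {
        String[] copy = names.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // \u00C5 = A with ring, \u00E5 = a with ring, \u00C4 = A umlaut,
        // \u00D6 = O umlaut, \u00DC = U umlaut
        String[] towns = {
            "Zwickau", "\u00D6stersund", "Utting", "\u00DCttfeld",
            "Oskarshamn", "\u00C4ppelbo", "Alings\u00E5s", "\u00C5kersberga"
        };
        // Assumption: Swedish collation data is present in the running JDK.
        Locale swedish = Locale.forLanguageTag("sv-SE");
        System.out.println("German:  " + Arrays.toString(sortFor(Locale.GERMANY, towns)));
        System.out.println("Swedish: " + Arrays.toString(sortFor(swedish, towns)));
    }
}
```

Sorting the same array with String.compareTo instead would order by code unit value, which matches neither language.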

Internationalization


Internationalization

• Internationalization is the process of designing a software application, or extending existing software, so that it can be easily adapted to and supported in various languages and regions without major engineering changes

• Localization is the process of adapting internationalized software for a specific region or language by adding locale-specific components and translating text

• Globalization: Combination of both i18n & l10n

What is Internationalization


• With i18n enabled, we can localize software quickly

• With the addition of localized data, the same executable can run worldwide

• Textual elements, such as status messages and the GUI component labels, are not hardcoded in the program

• Instead they are stored outside the source code and retrieved dynamically

• Support for new languages does not require recompilation

• Culturally-dependent data, such as dates and currencies, appear in formats that conform to the end user's region and language

• Items to be localized:

– Strings

– Date & time

– Currencies
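A small sketch of text that is kept outside the code and looked up at run time, using java.util.PropertyResourceBundle. Everything named here is hypothetical: the class ExternalizedMessages, the keys, and the in-memory strings that stand in for files such as Messages_en.properties and Messages_fr.properties. A real application would more likely call ResourceBundle.getBundle with properties files on the class path.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class ExternalizedMessages {

    // Stand-ins for the contents of hypothetical per-language properties files.
    private static final Map<String, String> FILES = new HashMap<>();
    static {
        FILES.put("en", "greeting=Hello\nfarewell=Goodbye\n");
        FILES.put("fr", "greeting=Bonjour\nfarewell=Au revoir\n");
    }

    // Loads the bundle for the locale's language, falling back to English.
    static ResourceBundle bundleFor(Locale locale) {
        String content = FILES.getOrDefault(locale.getLanguage(), FILES.get("en"));
        try {
            return new PropertyResourceBundle(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (Locale locale : new Locale[] { Locale.ENGLISH, Locale.FRENCH }) {
            ResourceBundle messages = bundleFor(locale);
            // The code only knows the key; the text comes from the bundle.
            System.out.println(locale + ": " + messages.getString("greeting"));
        }
    }
}
```

Adding a language then means adding another properties entry, with no change to the code that calls getString.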

Internationalization


Evolution

Source: http://www.asciitable.com/


Evolution

Source: http://www.lookuptables.com/ebcdic_scancodes.php


Evolution

Character Set / Repertoire

• A character set or repertoire is an unordered collection of characters that can be represented by numeric values.

• A character repertoire is the full set of abstract characters that a system supports. The repertoire may be closed, i.e. no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series), or it may be open, allowing additions (as is the case with Unicode and to a limited extent the Windows code pages).

• A repertoire is called a coded character set when each character is assigned a particular number, called a code point. In the coded character set called ISO 8859-1 (also known as Latin-1) the decimal code point value for the letter é is 233. However, in ISO 8859-5, the same code point represents the Cyrillic character щ.

Code Point

• Code points are just non-negative integer numbers in a certain range. They do not have an implicit binary representation or a width of 21 or 32 bits. Binary representation and unit widths are defined for encoding forms.

Character encoding scheme (Code Page in Windows)

• A character encoding scheme defines the representation of numeric values from one or more coded character sets containing symbols, letters, digits in bits and bytes
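The é / щ example can be checked directly by decoding the single byte 233 (0xE9) under both encodings. This is a minimal sketch; the class name CodedCharacterSets and the helper decode are made up, and it assumes the running JDK provides the ISO-8859-5 charset (ISO-8859-1 is always available).

```java
import java.nio.charset.Charset;

class CodedCharacterSets {

    // Decodes a single byte using the named character encoding.
    static String decode(int byteValue, String charsetName) {
        byte[] bytes = { (byte) byteValue };
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        int value = 233; // 0xE9
        for (String name : new String[] { "ISO-8859-1", "ISO-8859-5" }) {
            String text = decode(value, name);
            // Print the Unicode code point the byte maps to under each encoding.
            System.out.printf("%s: byte %d -> U+%04X%n", name, value, text.codePointAt(0));
        }
    }
}
```

The byte is identical in both runs; only the coded character set used to interpret it differs, which is exactly how mojibake arises when sender and receiver disagree.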


Evolution

CJK / ISO-8859-X

• While 256-character sets were adequate for the English-speaking world, they were not sufficient for East Asia. This led to double-byte character sets (DBCS) such as Big5 and Shift JIS (SJIS), alongside regional single-byte sets such as the ISO-8859-x series.

Windows Code page

• ANSI code pages (used by native applications with a Windows GUI) and OEM code pages (used by console-based applications)

Information Exchange

• With the advent of the WWW, data is transferred from one system to another. When text in one code page or character encoding is sent to a system that expects another, the receiver sees garbage (^&*$##@!@@), or a great deal of conversion code is needed. The world needed a common format to address this.

• In 1990, there were parallel initiatives by two bodies: one was ISO, the other a group of people from vendors (Apple, early Xerox folks, Sun Microsystems, Microsoft)


Source: http://www.unicode.org/charts/


Unicode


Unicode

• Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language (this number is also referred to as a code point, as stated by http://www.unicode.org/)

• Unicode is a standard that precisely defines a character set as well as a small number of encodings for it. It enables you to handle text in any language efficiently. It allows a single application executable to work for a global audience

• The Unicode Standard contains much more information for implementers, covering in depth topics such as bitwise encoding, bidirectional text (bidi), and normalization (comparison, searching, ...)

• Unicode was originally designed as a fixed-width 16-bit character encoding (UCS-2). However, it turned out that the 65,536 characters possible in a 16-bit encoding are not sufficient to represent all characters that are or have been used on planet Earth

• The code space holds 1,114,112 code points (U+0000 to U+10FFFF)

• It is divided into 17 planes: the Basic Multilingual Plane (BMP) and 16 supplementary planes


Unicode

• Because 16 bits turned out to be too few, the Unicode standard has been extended to allow up to 1,112,064 characters. Those characters that go beyond the original 16-bit limit are called supplementary characters: they lie outside the Basic Multilingual Plane (BMP), in supplementary planes such as the Supplementary Multilingual Plane (SMP).

• Version 2.0 of the Unicode standard was the first to include a design to enable supplementary characters, but it was only in version 3.1 that the first supplementary characters were assigned.

• Version 5.0 of the J2SE is required to support version 4.0 of the Unicode standard, so it had to support supplementary characters.


Encoding forms

Encoding forms that are available:

ASCII, EBCDIC, ISO-8859-1, SJIS

UCS-2 (LE, BE), UCS-4 (LE, BE)

Encoding formats in Unicode: UTF-16 (LE, BE), UTF-32 (LE, BE), UTF-8


Problem: When data is transferred from one computer to another and more than 1 byte represents a character, there could be trouble. Any unit of 2 or more bytes forces us to think about the order in which its bytes are transmitted and represented.

BOM: Byte Order Mark. It indicates the endianness (byte order) of the data and is placed at the beginning of a data file or stream.

List of BOMs:

– UTF-16BE: FE FF

– UTF-16LE: FF FE

– UTF-32BE: 00 00 FE FF

– UTF-32LE: FF FE 00 00

– UTF-8: EF BB BF (UTF-8 has no byte order issue; here the bytes serve only as an encoding signature)
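The BOM list above translates into a simple prefix check. This is a minimal sketch rather than a complete encoding detector: the class BomSniffer and its methods are made up for illustration, data without a BOM yields null, and the bytes FF FE 00 00 are treated as UTF-32LE even though they could also be a UTF-16LE BOM followed by U+0000.

```java
class BomSniffer {

    // Returns the encoding suggested by a leading byte order mark, or null if none is present.
    static String detect(byte[] data) {
        // UTF-32LE must be tested before UTF-16LE because both start with FF FE.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data, as unsigned values, with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // BOM FE FF followed by 'A' (00 41) in UTF-16BE.
        byte[] sample = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        System.out.println(detect(sample));
    }
}
```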


Encoding forms


• UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before UTF-16 was added to Version 2.0 of the standard

• UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to represent supplementary characters

• Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters

Encoding forms


Encoding Forms

Source: http://www.w3.org/


Unicode Planes: BMP, SMP

Why surrogate pairs:

In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.

Unicode reserves these code points for them:

High surrogates: U+D800 - U+DBFF

Low surrogates: U+DC00 - U+DFFF

Encoding forms


Surrogate pairs

Source: https://en.wikipedia.org/wiki/UTF-16

Example of how surrogate pairs work: U+10437 (𐐷)

1) 0x10437 -0x10000 = 0x00437 ( 0000 0000 0100 0011 0111 )

2) Split high 10-bit value & low 10-bit value: 0000000001 and 0000110111

3) 0xD800 + 0x0001 = 0xD801

4) 0xDC00 + 0x0037 = 0xDC37

Representation of U+10437 (𐐷) - 0xD801 0xDC37 or \ud801\udc37
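The four steps above can be sketched in Java and checked against the JDK's own Character.toChars. The class name SurrogatePairs and the method toSurrogates are made up for illustration.

```java
class SurrogatePairs {

    // Splits a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // step 1: 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // steps 2 and 3: upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // steps 2 and 4: lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] manual = toSurrogates(0x10437);
        char[] library = Character.toChars(0x10437);
        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04X %04X%n", (int) library[0], (int) library[1]);
    }
}
```

Both lines print D801 DC37, matching the worked example.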


UTF-8

• UTF-8 story

• Source: https://en.wikipedia.org/wiki/UTF-8

• UTF-8 is the dominant character encoding for the World Wide Web, accounting for 85.1% of all Web pages in September 2015 (with the most popular East Asian encoding, GB 2312, at 1.0%). The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and HTML.

• UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space (1,114,112 code points minus 2,048 surrogate code points) using one to four 8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard). Code points with lower numerical values are encoded using fewer bytes.


UTF-8

Bits of     First        Last          Bytes in
code point  code point   code point    sequence  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
 7          U+0000       U+007F        1         0xxxxxxx
11          U+0080       U+07FF        2         110xxxxx  10xxxxxx
16          U+0800       U+FFFF        3         1110xxxx  10xxxxxx  10xxxxxx
21          U+10000      U+1FFFFF      4         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26          U+200000     U+3FFFFFF     5         111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31          U+4000000    U+7FFFFFFF    6         1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

Note: this table shows the original UTF-8 design. Since RFC 3629, UTF-8 is restricted to U+10FFFF, so only sequences of up to 4 bytes are valid.


UTF-8

0x00..0x7F → 1 byte

0x80..0x7FF → 2 bytes

0x800..0xD7FF, 0xE000..0xFFFF → 3 bytes

0x10000 .. 0x10FFFF → 4 bytes
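The ranges above map directly onto a small function. This is an illustrative sketch (the class Utf8Length and method utf8Length are made up) that also compares its answer with what the JDK's UTF-8 encoder actually produces.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {

    // Number of bytes UTF-8 needs for one code point, following the ranges above.
    static int utf8Length(int codePoint) {
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint < 0 || codePoint > 0x10FFFF || surrogate) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4;
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x10437 };
        for (int cp : samples) {
            // Encode the same code point with the JDK and count the bytes.
            int actual = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: computed %d, encoder %d%n", cp, utf8Length(cp), actual);
        }
    }
}
```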


UTF-8 & UTF-16

1. UTF-8 and UTF-16 are both used for encoding characters

2. UTF-8 uses at least one byte to encode a character, while UTF-16 uses at least two

3. A UTF-8 encoded file tends to be smaller than a UTF-16 encoded file (When using ASCII only characters, a UTF-16 encoded file would be roughly twice as big as the same file encoded with UTF-8)

4. UTF-8 is compatible with ASCII while UTF-16 is incompatible with ASCII

5. UTF-8 is byte oriented while UTF-16 is not

6. UTF-8 is better in recovering from errors compared to UTF-16
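Point 3 is easy to measure. The sketch below (class and method names are made up) counts encoded bytes for an ASCII string and for a short Japanese string; UTF_16BE is used instead of UTF_16 so that no byte order mark is included in the count.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizes {

    // Size in bytes of the text encoded as UTF-8.
    static int utf8Size(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size in bytes of the text encoded as UTF-16, big-endian, without a BOM.
    static int utf16Size(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String japanese = "\u65E5\u672C\u8A9E"; // three BMP characters
        System.out.println("ASCII:    UTF-8 " + utf8Size(ascii) + " bytes, UTF-16 " + utf16Size(ascii) + " bytes");
        System.out.println("Japanese: UTF-8 " + utf8Size(japanese) + " bytes, UTF-16 " + utf16Size(japanese) + " bytes");
    }
}
```

For the ASCII string UTF-16 is twice as large (10 bytes against 5), while for the Japanese string UTF-8 is the larger one (9 bytes against 6), which is why the claim is "tends to be smaller" rather than "always smaller".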


UTF-8 & UTF-16

• UTF-8: no null bytes (except for U+0000 itself), which allows the use of null-terminated strings and gives a great deal of backwards compatibility

• UTF-16: most commonly used characters, such as Latin, Cyrillic, most Chinese (the PRC made support for some code points outside the BMP mandatory) and most Japanese, can be represented with 2 bytes. Unless really exotic characters are needed (for example in names), the 16-bit subset of UTF-16 can be used as a fixed-length encoding, which speeds up indexing.


Java history

JDK 1.3

• Sun took over the Taligent i18n classes maintenance responsibility


Java history linked to i18n

• Java’s char data type holds 16-bit unsigned integer values: UTF-16 code units, which represent Unicode code points in the Basic Multilingual Plane directly. Its default value is the null code point ('\u0000')

• ICU was derived from the JDK 1.3 code and has been released independently since then

• JDK also leverages code from ICU4J & we thank the ICU committee (IBM) for this

• Unicode is an evolving standard, and the Java platform has tracked the standard so that it now supports Unicode 8.0 in JDK 9


Java history linked to i18n

• Version 2.0 of the Unicode standard was the first to include a design to enable supplementary characters, but it was only in version 3.1 that the first supplementary characters were assigned. Version 5.0 of the J2SE is required to support version 4.0 of the Unicode standard, so it had to support supplementary characters (JSR 204: Supplementary characters support)

• Use the primitive type int to represent code points in low-level APIs

• Interpret char sequences in all forms as UTF-16 sequences, and promote their use in higher-level APIs.

• Provide APIs to easily convert between various char and code point-based representations.

With this approach, a char represents a UTF-16 code unit, which is not always sufficient to represent a code point. You'll see that the J2SE specifications now use the terms code point and UTF-16 code unit where the representation is relevant, and the generic term character where the representation is irrelevant to the discussion. APIs usually use the name codePoint for variables of type int that represent code points, while UTF-16 code units of course have type char.
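A short sketch of the char versus code point distinction, using only methods that exist on String and Character since J2SE 5.0. The class name CodePointDemo and the helper countCodePoints are made up for illustration.

```java
class CodePointDemo {

    // Counts Unicode code points rather than UTF-16 code units.
    static int countCodePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String text = "A\uD801\uDC37"; // 'A' followed by U+10437 as a surrogate pair
        System.out.println("UTF-16 code units (length): " + text.length());
        System.out.println("code points: " + countCodePoints(text));

        // Iterate by code point; Character.charCount says how many chars each one occupies.
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            System.out.printf("U+%04X supplementary=%b%n",
                    codePoint, Character.isSupplementaryCodePoint(codePoint));
            i += Character.charCount(codePoint);
        }
    }
}
```

Here length() reports 3 while the string holds 2 code points, so loops that step one char at a time can split a supplementary character in half.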


Unicode & i18n

New Unicode release ≠ New data addition only

– Corrections to existing characters

– Technical reports may be revised/added, too

• Before Unicode 7

– Releases were irregular. Advance notices were unreliable

– Difficult to plan additions to the JDK

• Since Unicode 7

– A new major version is released every year in June

– Very helpful, easy to plan


I18n & Unicode versions

Java version                  Supported Unicode version
Prior to JDK 1.1              1.1.5
JDK 1.1                       2.0
J2SE 1.2                      2.0
J2SE 1.4                      3.0
JDK 5.0                       4.0
JDK 6                         4.0
JDK 7                         6.0
JDK 8                         6.2
JDK 9 (yet to be released)    7.0, 8.0


Locale & i18n

• Locale

– An ID representing each cultural region

– It does not have any data

• A locale consists of:

– ISO 639-1 2-letter language code, e.g., “en”

– ISO 3166 2-letter country code, e.g., “US”

– Variant code (any form). The JDK-supplied ones are:

– “NY”: Norwegian Nynorsk

– “TH”: Thai digits for Thai government use

– “EURO”: Designates “Euro” (obsolete)
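A minimal sketch of building a Locale from a language code and a country code with Locale.Builder (available since Java 7). The class LocaleParts and its helper of are made up; variant codes are left out here because the builder only accepts well-formed variants.

```java
import java.util.Locale;

class LocaleParts {

    // Builds a locale identifier from a 2-letter language code and a 2-letter country code.
    static Locale of(String language, String country) {
        return new Locale.Builder().setLanguage(language).setRegion(country).build();
    }

    public static void main(String[] args) {
        Locale us = of("en", "US");
        // The Locale is only an identifier; display names come from locale data.
        System.out.println(us + " -> language=" + us.getLanguage() + ", country=" + us.getCountry());
        System.out.println(us.getDisplayName(Locale.ENGLISH));
        System.out.println(Locale.forLanguageTag("fr-FR").getDisplayName(Locale.ENGLISH));
    }
}
```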


CLDR

The Common Locale Data Repository Project, often abbreviated as CLDR, is a project of the Unicode Consortium to provide locale data in the XML format for use in computer applications.

It standardizes commonly used locale data. Among the types of data that CLDR includes are the following:

• Translations for language names, territory and country names.

• Translations for currency names, including singular/plural modifications.

• Translations for weekday, month, era, period of day, in full and abbreviated forms.

• Translations for timezones and example cities (or similar) for timezones.

• Translations for calendar fields.

• Patterns for formatting/parsing dates or times of day.

• Exemplar sets of characters used for writing the language.

• Patterns for formatting/parsing numbers.

• Rules for language-adapted collation.


Locale Sensitive Services

• java.text: BreakIterator, Collator, DateFormat, DateFormatSymbols, DecimalFormatSymbols, NumberFormat, Bidi

• java.util: Calendar, Currency, Locale, TimeZone

Support for locale-sensitive behavior in the java.util and java.text packages is entirely platform independent. The only platform-dependent functionality is the setting of the initial default locale and the initial default time zone, based on the host operating system's locale and time zone.

The Java platform does not require you to use the same Locale throughout your program. If you wish, you can assign a different Locale to every locale-sensitive object in your program. This flexibility allows you to develop multilingual applications, which can display information in multiple languages.


Locale Sensitive Services

BreakIterator

• Used in text editors and other text-processing applications.

• Supports four types of text breaking. Example text: “If you can dream it, you can do it. Walt Disney”

– Character: “/I/f/ /y/o/u/ /c/a/n/ /d/r/e/a/m/ /i/t/,/ /…/ /W/a/l/t/ /D/i/s/n/e/y/”

– Word: “/If/ /you/ /can/ /dream/ /it/,/ /you/ /can/ /do/ /it/./ /Walt/ /Disney/”

– Sentence: “/If you can dream it, you can do it. /Walt Disney/”

– Line: “/If /you /can /dream /it, /you /can /do /it. /Walt /Disney/”

• Two implementations exist:

– Rule-based: rules are specified using General_Category in the UCD.

– Dictionary-based: helpful for languages that do not put spaces between words. The JDK has only one dictionary file, used for word and line breaking of Thai.
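The boundaries shown above can be walked with java.text.BreakIterator. This is a small sketch rather than a reference implementation: the class name `BreakIteratorSketch` and the helper `segments` are made up for the example, and Locale.US is an arbitrary choice.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class; not part of the original slides.
public class BreakIteratorSketch {
    // Splits `text` at every boundary reported by the given iterator.
    static List<String> segments(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "If you can dream it, you can do it. Walt Disney";
        // Sentence breaking yields two segments for this text.
        System.out.println(segments(BreakIterator.getSentenceInstance(Locale.US), text));
        // Word breaking also reports spaces and punctuation as separate segments.
        System.out.println(segments(BreakIterator.getWordInstance(Locale.US), text));
    }
}
```

The character and line variants are obtained the same way, through `getCharacterInstance` and `getLineInstance`.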


Normalizer

Normalizes text for many purposes (e.g. comparison, sort, search)

• Supports four Normalization forms.

• NFD: Canonical Decomposition

• NFC: Canonical Decomposition, followed by Canonical Composition

• NFKD: Compatibility Decomposition

• NFKC: Compatibility Decomposition, followed by Canonical Composition
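A short sketch of the four forms using java.text.Normalizer. The class name `NormalizerSketch` and the sample strings are assumptions picked for illustration: U+00E9 is "é" as one code point, "e" + U+0301 is the same letter as a base character plus combining accent, and U+FB01 is the "fi" ligature.

```java
import java.text.Normalizer;

// Hypothetical example class; not part of the original slides.
public class NormalizerSketch {
    static String norm(String s, Normalizer.Form form) {
        return Normalizer.normalize(s, form);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // single code point
        String decomposed = "e\u0301"; // base letter + combining acute accent

        // The two spellings differ as raw strings...
        System.out.println(composed.equals(decomposed));                            // false
        // ...but compare equal once both are brought to the same form.
        System.out.println(norm(decomposed, Normalizer.Form.NFC).equals(composed)); // true
        System.out.println(norm(composed, Normalizer.Form.NFD).equals(decomposed)); // true

        // Compatibility forms also fold characters such as the "fi" ligature.
        System.out.println(norm("\uFB01", Normalizer.Form.NFKC));                   // fi
        System.out.println(norm("\uFB01", Normalizer.Form.NFC).equals("\uFB01"));   // true
    }
}
```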


Collator

• Performs String comparison, which is the basis for sorting.

• The only implementation is the rule-based collator (RuleBasedCollator).

• Example:

• You can choose either sort order:

– “AA”, “aa”, “BA” or “AA”, “Ba”, “aa”

• UTS #10 (Unicode Collation Algorithm) is not yet supported.
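The difference between the two sort orders can be sketched by comparing a Collator with plain String ordering. The class name `CollatorSketch`, the word list, and the choice of Locale.US are assumptions for this example; the exact ordering of case variants depends on the collation rules and strength in use.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical example class; not part of the original slides.
public class CollatorSketch {
    // Sorts with a Collator, which groups letters regardless of case.
    static List<String> sortWith(Collator collator, String... words) {
        List<String> out = new ArrayList<>(Arrays.asList(words));
        out.sort(collator); // Collator implements Comparator<Object>
        return out;
    }

    // Sorts by UTF-16 code unit value: every uppercase letter precedes lowercase.
    static List<String> sortNatural(String... words) {
        List<String> out = new ArrayList<>(Arrays.asList(words));
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        System.out.println(sortWith(collator, "BA", "aa", "AA")); // "BA" sorts last
        System.out.println(sortNatural("BA", "aa", "AA"));        // [AA, BA, aa]

        // At PRIMARY strength, case differences are ignored entirely.
        collator.setStrength(Collator.PRIMARY);
        System.out.println(collator.compare("aa", "AA") == 0);    // true
    }
}
```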


BiDi

Provides information about the bidirectional reordering of text.

Bidirectional Character Types in the UCD are used.

Examples:

'0' (Digit, zero): EN

'٠' (Arabic-Indic digit, zero): AN

'A' (Latin capital letter, A): L

'ا' (Arabic letter, Alef): AL

'א' (Hebrew letter, Alef): R

In total, there are 23 Bidirectional Character Types.

UAX #9 Unicode Bidirectional Algorithm
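The character types listed above can be checked with Character.getDirectionality, and java.text.Bidi analyzes whole strings. A minimal sketch follows; the class name `BidiSketch`, its helpers, and the sample strings are made up for illustration.

```java
import java.text.Bidi;

// Hypothetical example class; not part of the original slides.
public class BidiSketch {
    // Returns the Bidirectional Character Type as a Character.DIRECTIONALITY_* constant.
    static byte type(char c) {
        return Character.getDirectionality(c);
    }

    // True when the text contains both left-to-right and right-to-left runs.
    static boolean isMixed(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT).isMixed();
    }

    public static void main(String[] args) {
        System.out.println(type('0') == Character.DIRECTIONALITY_EUROPEAN_NUMBER);           // EN
        System.out.println(type('\u0660') == Character.DIRECTIONALITY_ARABIC_NUMBER);        // AN
        System.out.println(type('A') == Character.DIRECTIONALITY_LEFT_TO_RIGHT);             // L
        System.out.println(type('\u0627') == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC); // AL
        System.out.println(type('\u05D0') == Character.DIRECTIONALITY_RIGHT_TO_LEFT);        // R

        System.out.println(isMixed("abc"));                    // false: Latin only
        System.out.println(isMixed("abc \u05D0\u05D1\u05D2")); // true: Latin + Hebrew
    }
}
```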


Locale Sensitive Services (Contd)

Supported locales as of JDK 8

http://www.oracle.com/technetwork/java/javase/java8locales-2095355.html

About 71 locales for java.text.* and java.util.*

java.util.spi:

• CurrencyNameProvider

• LocaleServiceProvider

• TimeZoneNameProvider

• CalendarDataProvider

java.text.spi:

• BreakIteratorProvider

• CollatorProvider

• DateFormatProvider

• DateFormatSymbolsProvider

• DecimalFormatSymbolsProvider

• NumberFormatProvider
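Which locales a given service supports on a particular runtime can be queried directly, which also picks up any locales contributed through the SPI providers above. The sketch below assumes a hypothetical class name, `AvailableLocalesSketch`; the count it prints varies by JDK version and configuration.

```java
import java.text.DateFormat;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical example class; not part of the original slides.
public class AvailableLocalesSketch {
    // True if DateFormat has localized data for the given locale on this runtime.
    static boolean supportsDateFormat(Locale locale) {
        return Arrays.asList(DateFormat.getAvailableLocales()).contains(locale);
    }

    public static void main(String[] args) {
        // The number of locales depends on the JDK version and installed providers.
        System.out.println(DateFormat.getAvailableLocales().length);
        System.out.println(supportsDateFormat(Locale.US));
    }
}
```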


FYI: classes related to Unicode

Java version: i18n support in Java classes

• 1.0: java.lang.Character, java.lang.String

• 1.1: java.text.* (BreakIterator, Collator)

• 1.2: java.lang.Character.UnicodeBlock

• 1.4: java.awt.font.NumericShaper, java.text.Bidi *

• 6: java.text.Normalizer *, java.net.IDN

• 7: java.lang.Character.UnicodeScript
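A brief sketch of three of the classes listed above: Character.UnicodeBlock, Character.UnicodeScript, and java.net.IDN. The class name `UnicodeClassesSketch` and the domain `bücher.example` are illustrative assumptions.

```java
import java.net.IDN;

// Hypothetical example class; not part of the original slides.
public class UnicodeClassesSketch {
    public static void main(String[] args) {
        // Block: a contiguous code point range defined by the Unicode Standard.
        System.out.println(Character.UnicodeBlock.of('A'));       // BASIC_LATIN
        // Script: the writing system a character belongs to.
        System.out.println(Character.UnicodeScript.of('\u0627')); // ARABIC

        // IDN converts a Unicode domain label to its ASCII-compatible form and back.
        String ascii = IDN.toASCII("b\u00FCcher.example");
        System.out.println(ascii);                                // xn--bcher-kva.example
        System.out.println(IDN.toUnicode(ascii).equals("b\u00FCcher.example"));
    }
}
```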


Java i18n History (apart from Unicode versions)

JDK 1.0

● char as 16-bit Unicode (code unit)

● Implementation supported ISO 8859-1 only

● Leaked ISO 8859-1 into specs (properties)

● java.util.Date: aligned with C library date-time functions (0-based month numbering, opposite time zone offset)

JDK 1.1

● Added i18n classes (java.text.*, java.util.Locale, etc.), which came from Taligent OS

● Originally written in C++ and ported to Java

● Also ported the date-time classes from Taligent OS

○ java.util.Calendar , GregorianCalendar , TimeZone, SimpleTimeZone

JDK 1.2

● Input method support (API)

● Unicode 2.0

○ Added Character.UnicodeBlock


JDK 1.3

● Sun took over maintenance of the Taligent i18n classes

○ Date-time API maintenance for Y2K

● Reimplemented platform time zone detection code

● Bidirectional text rendering support in Swing

JDK 1.4

● Added Thai support

○ Text break

○ Collator

○ Thai Buddhist calendar support

○ Input method

● Added java.text.Bidi , java.util.Currency support

JDK 5.0

● JSR 204: Supplementary characters support

● Multilingual Font Configuration support

● BigDecimal support in java.text.DecimalFormat
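As an illustration of what the supplementary character support from JSR 204 means in practice, the sketch below contrasts char (UTF-16 code unit) counts with code point counts. The class name `SupplementarySketch` and the chosen code point U+1F600 are assumptions for this example.

```java
// Hypothetical example class; not part of the original slides.
public class SupplementarySketch {
    // Number of Unicode code points, as opposed to UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane,
        // so it needs a surrogate pair (two chars) in a Java String.
        String s = new String(Character.toChars(0x1F600));

        System.out.println(s.length());                                 // 2 code units
        System.out.println(codePoints(s));                              // 1 code point
        System.out.println(Character.isSupplementaryCodePoint(0x1F600)); // true
        System.out.println(Character.isHighSurrogate(s.charAt(0)));      // true
        System.out.println(s.codePointAt(0) == 0x1F600);                 // true
    }
}
```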


JDK 6

● Locale Sensitive Services (a.k.a. pluggable locales)

○ java.text.spi and java.util.spi

● Added java.text.Normalizer

● Added some locales (derived from CLDR)

● Added Japanese calendar support

JDK 7

● Enhanced java.util.Locale

○ Script support (e.g., Hans, Hant)

○ Extensions support
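The script and extension support added to java.util.Locale can be sketched with language tags and Locale.Builder. The class name `LocaleTagSketch` and the specific tags (`zh-Hans-CN`, and a Thai locale with the `nu-thai` numbering keyword) are illustrative assumptions.

```java
import java.util.Locale;

// Hypothetical example class; not part of the original slides.
public class LocaleTagSketch {
    public static void main(String[] args) {
        // Script subtag: Simplified Chinese as used in China.
        Locale zh = Locale.forLanguageTag("zh-Hans-CN");
        System.out.println(zh.getLanguage()); // zh
        System.out.println(zh.getScript());   // Hans
        System.out.println(zh.getCountry());  // CN

        // Unicode locale extension: Thai locale requesting Thai digits.
        Locale th = new Locale.Builder()
                .setLanguage("th")
                .setRegion("TH")
                .setUnicodeLocaleKeyword("nu", "thai")
                .build();
        System.out.println(th.toLanguageTag());             // th-TH-u-nu-thai
        System.out.println(th.getUnicodeLocaleType("nu"));  // thai
    }
}
```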




Quiz

• ASCII: how many bits?

• Variable-length encoding formats: what are FEFF and FFFE?

• Can you register domain names with Unicode characters?

• JDK i18n had one JSR. Which one is it?


Q & A


Thank you


BACKUP


public class NotI18N {
    public static void main(String[] args) {
        // Hard-coded English strings: nothing here can be localized.
        System.out.println("Hello.");
        System.out.println("How are you?");
        System.out.println("Goodbye.");
    }
}


import java.util.Locale;
import java.util.ResourceBundle;

public class I18NSample {
    public static void main(String[] args) {
        String language;
        String country;

        // Default to US English unless both a language and a country are given.
        if (args.length != 2) {
            language = "en";
            country = "US";
        } else {
            language = args[0];
            country = args[1];
        }

        Locale currentLocale = new Locale(language, country);

        // Looks up the MessagesBundle resources that best match the locale.
        ResourceBundle messages = ResourceBundle.getBundle("MessagesBundle", currentLocale);

        System.out.println(messages.getString("greetings"));
        System.out.println(messages.getString("inquiry"));
        System.out.println(messages.getString("farewell"));
    }
}


Properties (one set per locale):

English:

greetings = Hello.
farewell = Goodbye.
inquiry = How are you?

German:

greetings = Hallo.
farewell = Tschüß.
inquiry = Wie geht's?

French:

greetings = Bonjour.
farewell = Au revoir.
inquiry = Comment allez-vous?
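To see how such key/value entries are read without creating any files, the sketch below parses properties text held in a String with PropertyResourceBundle. This is a stand-in for illustration only: the class name `InlineBundleSketch` and its `parse` helper are made up, and a real application would normally let ResourceBundle.getBundle locate the properties files by locale.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical example class; not part of the original slides.
public class InlineBundleSketch {
    // Parses properties-format text into a ResourceBundle (no file lookup involved).
    static ResourceBundle parse(String properties) {
        try {
            return new PropertyResourceBundle(new StringReader(properties));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle french = parse(
                "greetings = Bonjour.\n"
              + "farewell = Au revoir.\n"
              + "inquiry = Comment allez-vous?\n");

        // Same keys as in the sample program, resolved against the French text.
        System.out.println(french.getString("greetings"));
        System.out.println(french.getString("inquiry"));
        System.out.println(french.getString("farewell"));
    }
}
```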

% java I18NSample fr FR
Bonjour.
Comment allez-vous?
Au revoir.

In the next example the language code is en (English) and the country code is US (United States), so the program displays the messages in English:

% java I18NSample en US

Hello.