
Lecture 6 1

Software Localization (L10N) and Internationalization (I18N)

• Localization: customizing software for a particular language/market

• Class discussion: What needs to be customized when Microsoft Word is changed from English to Chinese?

Lecture 6 2

Example: Good Morning

public class GoodMorning {
    public static void main(String[] args) {
        System.out.println("Good morning!");
    }
}

• What if you want to do this for Hong Kong?
• What if you want to do this for China and other places?
• Think of a way to write it without the need to change the source code

Lecture 6 3

Revised: Good Morning

import java.util.*;

public class GoodMorning {
    public static void main(String[] args) {
        try {
            // Look up the bundle named MyData (e.g. MyData.properties) on the classpath
            ResourceBundle resources = ResourceBundle.getBundle("MyData");
            System.out.println(resources.getString("Hi"));
        } catch (MissingResourceException mre) {
            // Thrown when the bundle or the key "Hi" cannot be found
            System.err.println("MyData.properties not found");
        }
    }
}
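The lookup above can be sketched end to end without any files on disk. This is a minimal illustration, assuming the text of the properties files is held in strings and parsed with `PropertyResourceBundle`; the class name `GreetingDemo`, the helpers `load` and `greeting`, the file name `MyData_zh_HK`, and the Hong Kong greeting text (written with Unicode escapes) are hypothetical and not part of the lecture.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class GreetingDemo {
    // Stand-ins for MyData.properties and MyData_zh_HK.properties (hypothetical contents).
    static final Map<String, String> FILES = Map.of(
        "MyData", "Hi=Good morning!\n",
        "MyData_zh_HK", "Hi=\u65e9\u6668\uff01\n");

    // Pick the most specific "file" for the locale, falling back to the base bundle.
    static ResourceBundle load(Locale locale) {
        String specific = "MyData_" + locale.getLanguage() + "_" + locale.getCountry();
        String text = FILES.getOrDefault(specific, FILES.get("MyData"));
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The program logic only knows the key "Hi", never the language.
    static String greeting(Locale locale) {
        return load(locale).getString("Hi");
    }

    public static void main(String[] args) {
        System.out.println(greeting(Locale.US));
        System.out.println(greeting(Locale.forLanguageTag("zh-HK")));
    }
}
```

Adding another market then means adding another entry (in practice, another properties file); `greeting` itself does not change.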

Lecture 6 4

Internationalization (I18N)

• I18N: A software methodology to avoid writing separate application software for different language/cultural environments.
– Change of language environment without change of programming logic (no need to modify source code)

• Why I18N:
– More complicated software design and implementation
– But saves development cost for the global market
– Minimizes localization
– Minimizes exposure of source code

Lecture 6 5

• Principles of I18N:

– Do not hard-code any language-related data/elements (language data) in a program

– Design a well-defined interface to access language data from external sources (files, databases, or even programs)

– Provide clear instructions for localization

Lecture 6 6

How to write an I18N program

• Analyze the language-related elements in the application program and make sure they are not hard-coded in the program

• Design/use a language interface specification (routines to access the language data in a well-defined way)

• Prepare localization instructions or follow a standard (they must be precise so that data can be prepared by following the instructions)

Lecture 6 7

Example

• Bank ATM machines in Hong Kong
• Traditional program:
– Display alternate screens
– Insert card and input password
– Get preferred display
– If English, execute English program
– Else execute Chinese program

• What do the English and the Chinese programs have in common?

• What if we need to add Simplified Chinese?

Lecture 6 8

• I18N-conscious program:
– Display alternate screens
– Insert card and input password
– Get preferred display
– Open preferred display file
– Execute ATM program

Lecture 6 9

Discussion on this example:

public class GoodMorning {
    public static void main(String[] args) {
        // The language name is taken from the first command-line argument
        String language = (args.length > 0) ? args[0] : "";
        int country = 0;
        if (language.equals("English")) {
            country = 1;
        } else if (language.equals("Chinese_HK")) {
            country = 2;
        }
        switch (country) {
            case 1: System.out.println("Good Morning!"); break;
            case 2: System.out.println("早上好！"); break;
            default: System.out.println("Good Morning");
        }
    }
}

Lecture 6 10

Data for User Interface vs. Data for Manipulation

• Data for user interface: resource files
• Data for manipulation may not be in the same language/script as the data displayed in the user interface.
– Example: using an English UI of Microsoft Word to compose a Chinese article, or vice versa
– Not necessarily in resource files

Lecture 6 11

Language/culture Related Issues: Display & processing (basic to all applications)

• Internal representation: codeset
– Different classes of the subgroups in a codeset
• Input: encoding of input strings to internal code
• Output: internal code to glyph association (display)
• Date expression
• Currency symbols
• Fractions & large numbers
• etc.
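As a small illustration of how number, date, and currency conventions differ by locale, the following sketch formats the same values with the standard `java.text` and `java.time` formatters. The class name `LocaleFormatDemo`, its helper methods, and the sample values are assumptions made for this example; the exact date and currency output depends on the locale data shipped with the JDK, so none is shown here.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

public class LocaleFormatDemo {
    // Decimal and thousands separators follow the locale.
    static String number(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Field order, month names, and era conventions follow the locale.
    static String date(LocalDate day, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                                .withLocale(locale)
                                .format(day);
    }

    // Currency symbol and its placement follow the locale.
    static String currency(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2005, 2, 24);
        Locale[] locales = { Locale.US, Locale.GERMANY, Locale.TRADITIONAL_CHINESE };
        for (Locale locale : locales) {
            System.out.println(locale + ": " + number(1234567.891, locale)
                + " | " + date(day, locale)
                + " | " + currency(1234.5, locale));
        }
    }
}
```

The program logic is identical for every locale; only the `Locale` argument changes.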

Lecture 6 12

I18N Issues on Language Related Applications

• Handling of messages in applications (not system messages):
– Write the menu items and messages in resource files
– Provide a language parameter used in the application, or take the locale value, to open the appropriate file (stored either in different directories or under different file names)

• Certain language-specific applications (e.g. spell checking):
– Open it as an API so that different algorithms can be (dynamically) linked to the application

• Data format:
– Example: addresses vary according to location
  USA: Flat No. (incl. bldg), Street, City, ZipCode (incl. State)
  HK: Flat, Floor, Bldg, Estate, Street (may be optional), District
– Database table design is not straightforward.

Lecture 6 13

• Measurement scales:
– Imperial system vs. metric system: can cause rounding problems
• Paper sizes
• Chinese language specific:
– Segmentation
– Lack of morphological rules to indicate tense (time), active/passive voice, etc.
– No need for morphological rules in searching
– More complicated sorting algorithm due to multiple features of Chinese characters

Lecture 6 14

Internationalization Facilities: POSIX

• POSIX: Portable Operating System Interface
• NLS: National Language Support
• Locale: A particular localization setting (C locale, zh_TW, etc.)

/home/staff/csluqin:> date
Thur Feb 24 15:38:25 CST 2005
:> setenv LANG zh_TW
:> echo $LANG
zh_TW
/usr/openwin/lib/locale:> env LANG=zh_TW.BIG5 date
中華民國 94 年 02 月 26 日 15 時 38 分 27 秒 CST
/usr/openwin/lib/locale:> env LANG=fr date
mercredi, 3 avril 2002, 14:30:51 HKT (not available now)

Lecture 6 15

• POSIX locale categories

LC_CTYPE: Controls the behavior of character handling functions, such as isalpha()
LC_TIME: Date and time format and functions
LC_MONETARY: Currency symbol and functions, etc.
LC_NUMERIC: Decimal separator and thousands separator
LC_COLLATE: Controls sorting order and string conversion/comparison
LC_MESSAGES: Controls the choice of message catalogs (user message translation)

:> env LANG=zh_TW LC_MESSAGES=C

Lecture 6 16

• Character class related test functions:
isalpha(c), isupper(c), islower(c), isdigit(c), isxdigit(c), isalnum(c), isspace(c), ispunct(c), isprint(c), iscntrl(c), isascii(c), isgraph(c)

• Character conversion functions:
toupper(c), tolower(c)

• Wide characters vs. multibyte characters

• Wide character handling functions:
mblen(), mbtowc(), wctomb(), mbstowcs(), wcstombs()

National Profile: data prepared for POSIX functions in a particular locale.

Example of NP.GB

Lecture 6 17

NLS and Symbolic Names

• A National Profile is written using symbolic names
• Each locale has a separate file called charmap, which maps the symbolic name of each character to the actual code in that locale

Symbolic Name    Encoding
<A>              \x41
<two>            \x32
<semicolon>      \x3b
<GB16-01>        \xb0\xa1    /* 啊 */

• Why symbolic names:
• Less error prone
• Flexibility
– Language/cultural conventions differ but the codeset is the same
– Same language/cultural convention but different codesets
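To make the role of the charmap concrete, here is a hedged sketch that reads lines in the "symbolic name, encoding" layout shown above into a lookup table. It is not a POSIX tool or API: the class `CharmapDemo` and the methods `parse` and `decode` are hypothetical, and only the simple `\xNN` notation from the table is handled. Decoding uses the JDK's `GB2312` charset, which is assumed to be available in the runtime.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class CharmapDemo {
    // Parse lines such as "<GB16-01> \xb0\xa1" into a map: symbolic name -> code bytes.
    static Map<String, byte[]> parse(String charmap) {
        Map<String, byte[]> table = new LinkedHashMap<>();
        for (String line : charmap.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2 || !parts[0].startsWith("<")) {
                continue; // skip blank or malformed lines
            }
            // Splitting on the two characters backslash + x leaves an empty
            // first element followed by one hex pair per byte.
            String[] hex = parts[1].split("\\\\x");
            byte[] code = new byte[hex.length - 1];
            for (int i = 1; i < hex.length; i++) {
                code[i - 1] = (byte) Integer.parseInt(hex[i], 16);
            }
            table.put(parts[0], code);
        }
        return table;
    }

    // Turn the code bytes back into text under a named codeset.
    static String decode(byte[] code, String codeset) {
        return new String(code, Charset.forName(codeset));
    }

    public static void main(String[] args) {
        String charmap = "<A> \\x41\n<two> \\x32\n<semicolon> \\x3b\n<GB16-01> \\xb0\\xa1\n";
        Map<String, byte[]> table = parse(charmap);
        for (Map.Entry<String, byte[]> e : table.entrySet()) {
            System.out.println(e.getKey() + " -> " + decode(e.getValue(), "GB2312"));
        }
    }
}
```

A profile written against names such as `<GB16-01>` stays the same when the codeset changes; only the charmap table is swapped.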

Lecture 6 18

Making Portable Software for Different Encodings (codeset independent)

• What is the problem with this program?

char s[100];
char *p;
fgets(s, sizeof(s), stdin);   /* get a line of input */
p = strchr(s, 'A');           /* find letter A */
if (p != NULL)                /* if found, */
    *p = '\0';                /* replace with null byte */

• 'A' in EUC encoding is fine: 0x41 (ASCII code) never appears inside a multibyte character. But if this program is ported to a PC Big5 system, 0x41 can also be the second byte of an ideographic character
• 乙丕再你杗呸服隹括耍唧涉… all have 0x41 as the second byte in Big5!
• C language standard guarantee: 0x00 is not part of any multibyte character, so it can safely mark the end of a string
• Use of wide characters
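The false match can be reproduced in a few lines. This sketch, with the hypothetical class `TrailByteDemo`, encodes the single character 乙 (written as the escape `\u4e59`) in Big5 and then searches the bytes for 0x41 the way `strchr` would. It assumes the JDK's `Big5` charset is available.

```java
import java.nio.charset.Charset;

public class TrailByteDemo {
    static final Charset BIG5 = Charset.forName("Big5");

    // Byte-wise search in the spirit of strchr(): index of the first byte
    // equal to target, or -1 if there is none.
    static int indexOfByte(byte[] data, int target) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "\u4e59";              // one ideograph, no letter A
        byte[] bytes = text.getBytes(BIG5);  // two bytes in Big5
        System.out.println("contains 'A' as a character: " + (text.indexOf('A') >= 0));
        System.out.println("byte 0x41 found at index: " + indexOfByte(bytes, 0x41));
    }
}
```

The text contains no letter A, yet the byte search reports a hit on the second byte of the ideograph; overwriting that byte would cut the character in half.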

Lecture 6 19

Wide characters vs. Multi-byte characters

• The terms may refer to the nature of codesets or to data types in programming languages

• Multibyte characters: character lengths vary from character to character. The term can refer to characters in a single codeset (Taiwan's CNS) or to characters in multiple codesets (Big5 with ASCII), stored for example as char in the C language

• Wide characters: fixed-length character encoding, such as wchar_t in the C language, and characters in Java, which are all Unicode (wide characters)

Lecture 6 20

• Multibyte examples (Big5):

學習ABC => total of 7 bytes

學習普通話 => total of 10 bytes

• Problems:

– String length and byte length cannot be calculated directly from each other (context sensitive). Detection of character boundaries is needed.

– It is difficult to jump to an arbitrary position in a string and know whether it is the first byte of a character or not

• Need for conversion between MBC and WC
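The two byte counts, and the need to scan for character boundaries, can be checked with a short sketch. The class `Big5LengthDemo` and the method `countChars` are hypothetical; the boundary rule used (a byte in the range 0x81 to 0xFE starts a two-byte character, anything else is a single byte) is a simplifying assumption about Big5, and the strings are written with Unicode escapes for the characters in the examples above.

```java
import java.nio.charset.Charset;

public class Big5LengthDemo {
    static final Charset BIG5 = Charset.forName("Big5");

    // Count characters by walking the bytes from the start of the string.
    // Assumption: a lead byte in 0x81..0xFE is followed by exactly one trail byte.
    static int countChars(byte[] data) {
        int count = 0;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            i += (b >= 0x81 && b <= 0xFE) ? 2 : 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] mixed = "\u5b78\u7fd2ABC".getBytes(BIG5);                // two ideographs + ABC
        byte[] chinese = "\u5b78\u7fd2\u666e\u901a\u8a71".getBytes(BIG5); // five ideographs
        System.out.println(mixed.length + " bytes, " + countChars(mixed) + " characters");
        System.out.println(chinese.length + " bytes, " + countChars(chinese) + " characters");
    }
}
```

Both strings hold five characters, but the count only comes out right because the scan starts at the first byte; starting in the middle gives no reliable way to tell a trail byte from a single-byte character.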

Lecture 6 21

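• Conversion of MBC to WC

• Note: Unicode is a WC, but WC is not necessarily Unicode

[Figure: a multibyte string of 7 bytes (1 2 3 4 5 6 7) is converted to wide characters (1 2 | 3 4 | 00₁₆ 5 | 00₁₆ 6 | 00₁₆ 7); each single-byte character is padded with a 00₁₆ byte]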

Lecture 6 22

• When to use MBC
– Copying data only
– Comparing for equality
– Searching for control characters
– Single-byte data only: if MB_CUR_MAX == 1

• When to use WC
– Collation: sorting
– Parsing characters: searching and processing
– String editing

Lecture 6 23

• MB_LEN_MAX, <limits.h>: locale (LC) independent
• MB_CUR_MAX, <stdlib.h>: locale (LC) dependent

• Use wide characters:

char s[100];
wchar_t ws[100];
size_t n;
char *p;
wchar_t *wcp;
fgets(s, sizeof(s), stdin);   /* get a line of input */
mbstowcs(ws, s, 100);         /* convert s to ws */
wcp = wcschr(ws, L'A');       /* find "A" */
……..
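The same idea, convert first and search afterwards, can be sketched in Java, where decoding a byte array plays the role of `mbstowcs` and `String.indexOf` plays the role of `wcschr`. The class `WideSearchDemo` and the method `truncateAtA` are hypothetical names for this illustration. One caveat: Java `char` values are UTF-16 code units, so they are fixed-width only for characters in the Basic Multilingual Plane.

```java
import java.nio.charset.Charset;

public class WideSearchDemo {
    static final Charset BIG5 = Charset.forName("Big5");

    // Decode the multibyte input to characters, then cut the text at the
    // first real letter 'A' (the whole text is returned if there is none).
    static String truncateAtA(byte[] multibyte) {
        String wide = new String(multibyte, BIG5);
        int pos = wide.indexOf('A');
        return pos < 0 ? wide : wide.substring(0, pos);
    }

    public static void main(String[] args) {
        // \u4e59 has 0x41 as its second Big5 byte but is not the letter A.
        byte[] noLetterA = "\u4e59\u4e59".getBytes(BIG5);
        byte[] withLetterA = "\u4e59A\u4e59".getBytes(BIG5);
        System.out.println(truncateAtA(noLetterA).length());   // both ideographs kept
        System.out.println(truncateAtA(withLetterA).length()); // cut at the real A
    }
}
```

Because the search runs on characters instead of bytes, a trail byte of 0x41 can no longer be mistaken for the letter A.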
