Date posted: 20-Dec-2015
Lecture 6 1
Software Localization (L10N) and Internationalization (I18N)
• Localization: customizing software for a particular language/market
• Class discussion: What needs to be customized when Microsoft Word is changed from English to Chinese?
Example: Good Morning
public class GoodMorning {
    public static void main(String[] s) {
        System.out.println("Good morning!");
    }
}
• What if you want to do this for Hong Kong?
• What if you want to do this for China and other places?
• Think of a way to write it without the need to change the source code
Revised: Good Morning
import java.util.*;

public class GoodMorning {
    public static void main(String[] s) {
        ResourceBundle resources;
        try {
            // Looks up MyData.properties (or a locale-specific variant) on the classpath
            resources = ResourceBundle.getBundle("MyData");
            System.out.println(resources.getString("Hi"));
        } catch (MissingResourceException mre) {
            System.err.println("MyData.properties not found");
        }
    }
}
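The revised program only works if a MyData.properties file containing a Hi key can be found. The sketch below shows what such resource data might look like. The zh_HK text, the class name and the use of in-memory strings in place of real files are assumptions made to keep the example self-contained; a deployed program would keep MyData.properties and MyData_zh_HK.properties on the classpath and call ResourceBundle.getBundle("MyData", locale).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class GoodMorningBundles {
    // Hypothetical file contents, keyed by locale suffix ("" is the default file).
    static final Map<String, String> FILES = Map.of(
        "", "Hi=Good morning!\n",
        "zh_HK", "Hi=早上好!\n");

    // Picks the locale-specific data if present, else falls back to the default.
    static ResourceBundle load(Locale locale) {
        String text = FILES.getOrDefault(locale.toString(), FILES.get(""));
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The program logic never changes; only the data behind "Hi" does.
    static String greeting(Locale locale) {
        return load(locale).getString("Hi");
    }

    public static void main(String[] args) {
        System.out.println(greeting(Locale.ENGLISH));
        System.out.println(greeting(Locale.forLanguageTag("zh-HK")));
    }
}
```

Supporting another market then means adding one more entry (one more properties file), with no change to greeting().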
Internationalization (I18N)
• I18N: A software methodology to avoid writing separate application software for different language/cultural environments.
– Change of language environment without change of programming logic (no need to modify source code)
• Why I18N:
– More complicated software design and implementation
– But saves development cost for the global market
– Minimizes localization
– Minimizes exposure of source code
• Principles of I18N:
– Do not hard-code any language-related data/elements (language data) in a program
– Design a well-defined interface to access language data from external sources (files, databases, or even programs)
– Provide clear instructions for localization
How to write an I18N program
• Analyze the language-related elements in the application program and make sure they are not hard-coded in the program
• Design/use a language interface specification (routines to access the language data in a well-defined way)
• Prepare localization instructions (or follow a standard); they must be precise so that data can be prepared by following them
Example
• Bank ATM machines in Hong Kong
• Traditional program:
– Display alternate screens
– Insert card and input password
– Get preferred display
– If English, execute English program
– Else execute Chinese program
• What do the English and the Chinese programs have in common?
• What if we need to add Simplified Chinese?
• I18N-conscious program:
– Display alternate screens
– Insert card and input password
– Get preferred display
– Open preferred display file
– Execute ATM program
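The I18N-conscious flow can be sketched as a single ATM routine that reads all of its text from per-language display data. Everything named here (AtmDisplayDemo, runAtm, the message keys and the message texts) is hypothetical, and the in-memory tables stand in for the separate display files mentioned in the list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AtmDisplayDemo {
    // Hypothetical display data; in practice each table would be its own display file.
    static final Map<String, Map<String, String>> DISPLAY_FILES = new LinkedHashMap<>();
    static {
        DISPLAY_FILES.put("en", Map.of("welcome", "Welcome", "enterAmount", "Enter amount"));
        DISPLAY_FILES.put("zh_HK", Map.of("welcome", "歡迎", "enterAmount", "輸入金額"));
        // Adding Simplified Chinese is a data change only; runAtm() is untouched.
        DISPLAY_FILES.put("zh_CN", Map.of("welcome", "欢迎", "enterAmount", "输入金额"));
    }

    // The single ATM program: the same logic runs for every language.
    static String runAtm(String preferredDisplay) {
        Map<String, String> text =
            DISPLAY_FILES.getOrDefault(preferredDisplay, DISPLAY_FILES.get("en"));
        return text.get("welcome") + " / " + text.get("enterAmount");
    }

    public static void main(String[] args) {
        for (String language : DISPLAY_FILES.keySet()) {
            System.out.println(language + ": " + runAtm(language));
        }
    }
}
```

In the traditional design, Simplified Chinese would have required a third copy of the program; here it is one more table.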
Discussion on this example:
public class GoodMorning {
    public static void main(String[] s) {
        // The language is taken from the first command-line argument, if any
        String language = (s.length > 0) ? s[0] : "";
        int country = 0;
        if (language.equals("English")) {
            country = 1;
        } else if (language.equals("Chinese_HK")) {
            country = 2;
        }
        switch (country) {
            case 1:
                System.out.println("Good Morning!");
                break;
            case 2:
                System.out.println("早上好!");
                break;
            default:
                System.out.println("Good Morning");
        }
    }
}
Data for User Interface vs. Data for manipulation
• Data for User Interface: resource files
• Data for manipulation may not be in the same language/script as the data displayed in the user interface.
– Use an English UI of Microsoft Word to compose a Chinese article or vice versa
– Not necessarily in resource files
Language/culture Related Issues: Display & processing (basic to all applications)
• Internal representation: codeset
– Different classes of the subgroups in a codeset
• Input: encoding of input strings to internal code
• Output: internal code to glyph association (display)
• Date expression
• Currency symbols
• Fractions & large numbers
• etc.
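As a small illustration of the currency and large-number items, the sketch below formats the same value under two locales using the standard java.text.NumberFormat class. The class and method names are made up for the example; the output depends on the locale data shipped with the Java runtime.

```java
import java.text.NumberFormat;
import java.util.Locale;

class NumberFormatDemo {
    // Formats a plain number using the locale's grouping and decimal separators.
    static String number(Locale locale, double value) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Formats an amount using the locale's currency symbol and conventions.
    static String currency(Locale locale, double value) {
        return NumberFormat.getCurrencyInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.891;
        System.out.println(number(Locale.US, value));       // comma groups, dot decimal
        System.out.println(number(Locale.GERMANY, value));  // dot groups, comma decimal
        System.out.println(currency(Locale.US, 1234.5));
    }
}
```

The program never mentions a separator or a currency symbol; those come entirely from the locale.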
I18N Issues on Language Related Applications
• Handling of messages in applications (not system messages):
– Write the menu items and messages in resource files
– Provide a language parameter used in the application, or take the locale value, to open the appropriate file (either in different directories or under different file names)
• Certain language-specific applications (e.g. spell checking):
– Open it as an API so that different algorithms can be (dynamically) linked to the application
• Data format:
– Example: Address - varies according to location
USA: Flat No. (incl. bldg), Street, City, ZipCode (incl. State)
HK: Flat, Floor, Bldg, Estate, Street (may be optional), District
– Database table design is not straightforward.
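One possible way to handle the address example is to treat the field order as data rather than code. The following is only a sketch: the region keys, field names and sample values are hypothetical, and a real system would also need per-region validation and a storage design.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

class AddressFormatDemo {
    // Field order per region, following the two layouts listed above.
    // The region keys and field names are labels invented for this sketch.
    static final Map<String, List<String>> FIELD_ORDER = Map.of(
        "US", List.of("flat", "street", "city", "zip"),
        "HK", List.of("flat", "floor", "building", "estate", "street", "district"));

    // Joins the fields that are present, in the region's order; absent
    // (optional) fields such as a Hong Kong street are simply skipped.
    static String format(String region, Map<String, String> address) {
        StringJoiner line = new StringJoiner(", ");
        for (String field : FIELD_ORDER.get(region)) {
            String value = address.get(field);
            if (value != null && !value.isEmpty()) {
                line.add(value);
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, String> hk = Map.of(
            "flat", "Flat A", "floor", "12/F", "building", "Block 3",
            "estate", "Example Estate", "district", "Kowloon");
        System.out.println(format("HK", hk));
    }
}
```

Storing addresses as named fields plus a per-region layout is one answer to the table-design problem; it is not the only one.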
• Measurement scales:
– Imperial system vs. metric system: can cause rounding problems
• Paper sizes
• Chinese language specific:
– Segmentation
– Lack of morphological rules to indicate tense (time), active/passive voice, etc.
– No need for morphological rules in searching
– More complicated sorting algorithm due to multiple features of Chinese characters
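To see why sorting is more involved, note that the same characters come out in a different order depending on which feature is used as the key. The sketch below uses a tiny hand-made table with two features, stroke count and Pinyin reading; the table, class and method names are illustrative only, and a real system would take these features from locale data.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ChineseSortDemo {
    // One character together with two of its features.
    static final class Entry {
        final String ch;
        final int strokes;
        final String pinyin;

        Entry(String ch, int strokes, String pinyin) {
            this.ch = ch;
            this.strokes = strokes;
            this.pinyin = pinyin;
        }
    }

    // Illustrative hand-made table, not taken from any locale database.
    static final List<Entry> TABLE = List.of(
        new Entry("大", 3, "da"),
        new Entry("一", 1, "yi"),
        new Entry("人", 2, "ren"));

    // Sorts a copy of the table by the given order and concatenates the characters.
    static String sortBy(Comparator<Entry> order) {
        List<Entry> copy = new ArrayList<>(TABLE);
        copy.sort(order);
        StringBuilder result = new StringBuilder();
        for (Entry e : copy) {
            result.append(e.ch);
        }
        return result.toString();
    }

    static String byStrokes() {
        return sortBy(Comparator.comparingInt((Entry e) -> e.strokes));
    }

    static String byPinyin() {
        return sortBy(Comparator.comparing((Entry e) -> e.pinyin));
    }

    public static void main(String[] args) {
        System.out.println("By stroke count: " + byStrokes());
        System.out.println("By Pinyin:       " + byPinyin());
    }
}
```

Neither order can be derived from the character codes alone, which is why collation needs locale-specific data.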
Internationalization Facilities: POSIX
• POSIX: Portable Operating System Interface
• NLS: National Language Support
• Locale: A particular localization setting
C locale, zh_TW, etc.
/home/staff/csluqin:> date
Thur Feb 24 15:38:25 CST 2005
:> setenv LANG zh_TW
:> echo $LANG
zh_TW
:> /usr/openwin/lib/locale:> env LANG=zh_TW.BIG5 date
中華民國 94 年 02 月 26 日 15 時 38 分 27 秒 CST
:> /usr/openwin/lib/locale:> env LANG=fr date
mercredi, 3 avril 2002, 14:30:51 HKT (not available now)
• POSIX locale categories
LC_CTYPE: Controls the behavior of character handling functions, such as isalpha()
LC_TIME: Date and time format and functions
LC_MONETARY: Currency symbol and functions, etc.
LC_NUMERIC: Decimal separator and thousands separator
LC_COLLATE: Controls sorting order and string conversion/comparison
LC_MESSAGES: Controls the choice of message catalogs (user message translation)
:> env LANG=zh_TW LC_MESSAGES=C
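Java offers a coarser analogue of mixing categories: java.util.Locale.Category has only two values, FORMAT and DISPLAY, rather than the six POSIX categories. The sketch below is not the POSIX mechanism itself; it only illustrates the same idea of formatting numbers under one locale while showing user-visible text under another. The class and method names are made up for the example.

```java
import java.util.Locale;

class LocaleCategoryDemo {
    // Returns {formatted number, display name} produced under mixed categories.
    static String[] mixedCategories() {
        Locale oldFormat = Locale.getDefault(Locale.Category.FORMAT);
        Locale oldDisplay = Locale.getDefault(Locale.Category.DISPLAY);
        try {
            Locale.setDefault(Locale.Category.FORMAT, Locale.GERMANY);   // numbers, dates
            Locale.setDefault(Locale.Category.DISPLAY, Locale.ENGLISH);  // user-visible names
            String number = String.format("%,.2f", 1234.5);     // uses the FORMAT default
            String name = Locale.GERMANY.getDisplayCountry();   // uses the DISPLAY default
            return new String[] { number, name };
        } finally {
            // Restore the process-wide defaults so the demo has no side effects.
            Locale.setDefault(Locale.Category.FORMAT, oldFormat);
            Locale.setDefault(Locale.Category.DISPLAY, oldDisplay);
        }
    }

    public static void main(String[] args) {
        String[] result = mixedCategories();
        System.out.println(result[0] + " / " + result[1]);
    }
}
```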
• Character class related test functions:
isalpha(c), isupper(c), islower(c), isdigit(c), isxdigit(c), isalnum(c), isspace(c), ispunct(c), isprint(c), iscntrl(c), isascii(c), isgraph(c)
• Character conversion functions:
toupper(c), tolower(c)
• Wide character vs. multi-byte characters
• Multibyte/wide character conversion functions:
mblen(), mbtowc(), wctomb(), mbstowcs(), wcstombs()
National Profile: data prepared for POSIX functions in a particular locale.
Example of NP.GB
NLS and Symbolic Names
• A National Profile is written using symbolic names
• Each locale has a separate file called charmap which maps the symbolic names of each character to the actual code of that locale
Symbolic Name    Encoding
<A>              \x41
<two>            \x32
<semicolon>      \x3b
<GB16-01>        \xb0\xa1    /* 啊 */
• Why symbolic names:
• Less error-prone
• Flexibility
– Language/cultural conventions different but the codeset is the same
– Same language/cultural convention but different codesets
Making Portable software for different encodings (codeset independent)
char s[100];
char *p;
fgets(s, sizeof(s), stdin);   /* get a line of input */
p = strchr(s, 'A');           /* find letter A */
if (p != NULL)                /* if found, */
    *p = '\0';                /* replace with null byte */
• What is the problem with this program?
• 'A' in EUC encoding is fine: 0x41 (ASCII code), but if this program is ported to a PC Big5 system, 0x41 can also be the second byte of an ideographic character
• 乙丕再你杗呸服隹括耍唧涉… all the xx41 in Big5!
• C language standard guarantee:
• 0x00 is not part of any MB character; it marks the end of a string
• Use of wide characters
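The failure can be reproduced on raw bytes. The sketch below assumes the Big5 code 0xA4 0x41 for 乙 (consistent with the xx41 list above) and assumes that any byte of 0x81 or above is a lead byte followed by exactly one trail byte. The class and method names are hypothetical, and the bytes are hard-coded so that no Big5 converter is needed.

```java
class Big5SearchDemo {
    // Naive search: treats every byte as a character, like strchr() on a char array.
    static int naiveIndexOf(byte[] text, int target) {
        for (int i = 0; i < text.length; i++) {
            if ((text[i] & 0xFF) == target) {
                return i;
            }
        }
        return -1;
    }

    // Boundary-aware search. Assumption: a byte >= 0x81 is a lead byte
    // and is always followed by exactly one trail byte.
    static int boundaryAwareIndexOf(byte[] text, int target) {
        int i = 0;
        while (i < text.length) {
            int b = text[i] & 0xFF;
            if (b >= 0x81) {
                i += 2;            // skip the whole two-byte character
            } else if (b == target) {
                return i;          // a genuine single-byte match
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed Big5 bytes for "乙" (0xA4 0x41) followed by ASCII "A" (0x41).
        byte[] text = { (byte) 0xA4, 0x41, 0x41 };
        System.out.println("naive:          " + naiveIndexOf(text, 'A'));
        System.out.println("boundary-aware: " + boundaryAwareIndexOf(text, 'A'));
    }
}
```

The naive search stops inside 乙 and would cut the character in half; the boundary-aware search only ever inspects bytes that start a character.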
Wide characters vs. Multi-byte characters
• The terms may refer to the nature of codesets or to data types in programming languages
• Multibyte characters: character lengths vary from character to character. The term can refer to characters in a single codeset (Taiwan's CNS) or to characters in multiple codesets (Big5 with ASCII), stored in a type such as char in the C language
• Wide characters: fixed-length character encoding, such as wchar_t in the C language and char in Java, which holds a 16-bit Unicode (UTF-16) code unit
• Multibyte examples (Big5):
學習ABC => total of 7 bytes
學習普通話 => total of 10 bytes
• Problems:
– String length and byte length cannot be calculated directly from each other (context sensitive). Detection of character boundaries is needed.
– Difficult to go to any position in a string and know whether it is the first byte of a character or not
• Need for conversion between MBC and WC
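The gap between byte length and character count is easy to observe. The sketch below uses UTF-8 instead of Big5, because UTF-8 is the one multibyte encoding every Java runtime is guaranteed to support; the byte totals therefore differ from the Big5 figures above (Chinese characters in this example take 3 bytes each instead of 2). The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class MultibyteLengthDemo {
    // Length of the multibyte (UTF-8) form, in bytes.
    static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of characters (Unicode code points) in the string.
    static int charCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the byte at pos starts a character; UTF-8 continuation
    // bytes always have the bit pattern 10xxxxxx.
    static boolean isCharacterStart(byte[] bytes, int pos) {
        return (bytes[pos] & 0xC0) != 0x80;
    }

    public static void main(String[] args) {
        String s = "學習ABC";
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes: " + byteLength(s) + ", characters: " + charCount(s));
        for (int i = 0; i < bytes.length; i++) {
            System.out.println("byte " + i + " starts a character: " + isCharacterStart(bytes, i));
        }
    }
}
```

UTF-8 happens to make the boundary test local to one byte. Big5 does not, because a trail byte can look like an ASCII byte, which is the difficulty described in the second problem above.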
• Conversion of MBC to WC
• Note: Unicode is a WC, but WC is not necessarily Unicode
[Figure: a 7-byte multibyte string (bytes 1 2 3 4 5 6 7) converted to five 2-byte wide characters: (1 2) (3 4) (00₁₆ 5) (00₁₆ 6) (00₁₆ 7)]
• When to use MBC
– Copying data only
– Comparing for equality
– Searching for control characters
– Single-byte data only: if MB_CUR_MAX = 1
• When to use WC
– Collation: sorting
– Parsing characters: searching and processing
– String editing
• MB_LEN_MAX, <limits.h>: LC independent
• MB_CUR_MAX, <stdlib.h>: LC dependent
char s[100];
wchar_t ws[100];
size_t n;
char *p;
wchar_t *wcp;
fgets(s, sizeof(s), stdin);   /* get a line of input */
mbstowcs(ws, s, 100);         /* convert s to ws */
wcp = wcschr(ws, L'A');       /* find "A" using a wide character constant */
……..
• Use wide characters