Sophia Antipolis, September 2006 Multilinguality, localization and internationalization Miruna Bădescu Finsiel Romania

Sophia Antipolis, September 2006

Multilinguality, localization and internationalization

Miruna Bădescu

Finsiel Romania

Unicode, encodings and character sets

3

How it all started…

Until recently, most computers used character sets with a maximum of 256 characters (loosely called "ANSI").

The first 128 (ASCII):
- digits
- letters a-z and A-Z
- punctuation marks

The second set of 128 varies:
- in the English-speaking world and Western Europe it contains more punctuation marks, currency symbols (e.g. £) and accented letters (á, é, ñ, ç, ô)
- in places like Egypt, Greece and Russia it contains characters taken from the corresponding alphabet: Arabic, Greek, Cyrillic

4

Code, encoding

Character code – the number (stored as a sequence of bits) that a computer uses to represent a character

Encoding – the rule describing how a sequence of bytes is mapped to characters, and back
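As a small illustration of the two definitions above (not part of the original slides), the Python sketch below takes one character, shows its code as a number, and shows the bytes that two different encodings produce for it. The variable names are chosen here for illustration only.

```python
# One character, its code (a number), and its bytes under two encodings.
char = "\u00e9"  # the letter e with an acute accent

code_point = ord(char)                   # the character code as an integer
latin1_bytes = char.encode("latin-1")    # one byte in ISO 8859-1
utf8_bytes = char.encode("utf-8")        # two bytes in UTF-8

print(hex(code_point), latin1_bytes, utf8_bytes)

# Decoding applies the same rule in the other direction.
round_trip_latin1 = latin1_bytes.decode("latin-1")
round_trip_utf8 = utf8_bytes.decode("utf-8")
```

The same character code is stored as different bytes depending on the encoding, which is why the encoding has to be known in order to read the bytes back.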

5

Problem

These encoding systems also conflict with one another. Two encodings can:
- use the same number for two different characters
- use different numbers for the same character

Data can become incomprehensible when transferred from one place to another
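Both kinds of conflict can be reproduced with the codecs that ship with Python. The sketch below is an illustration added here, using ISO 8859-1, ISO 8859-7, Windows-1251 and code page 437 as example legacy encodings; the slides do not name specific ones.

```python
# Same number, different characters: one byte read under three encodings.
raw = b"\xe9"
readings = {name: raw.decode(name) for name in ("latin-1", "iso8859_7", "cp1251")}

# Same character, different numbers: e-acute under two encodings.
e_acute = "\u00e9"
numbers = {name: e_acute.encode(name) for name in ("latin-1", "cp437")}

# Data turning incomprehensible in transit: text written as UTF-8
# but read by a system that assumes ISO 8859-1.
garbled = "caf\u00e9".encode("utf-8").decode("latin-1")

print(readings)
print(numbers)
print(garbled)
```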

6

Solution

Moving to a system that assigns a unique number to each character in each language of the world

The Unicode standard provides a unique number for every character:
- no matter what the platform
- no matter what the program
- no matter what the language

Unicode (as defined by the Unicode Consortium) has become a universal standard. Its character repertoire is kept synchronized with ISO/IEC 10646, which describes the 'Universal Multiple-Octet Coded Character Set' (UCS)

7

Unicode

The Unicode repertoire can be encoded in more than one way: UTF-8, UTF-16, UTF-32

UTF-8 encodes:
- ASCII characters on 1 byte
- other characters on 2 to 4 bytes (the original design allowed up to 6 bytes; RFC 3629 limits UTF-8 to 4)

Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets

It enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering

It allows data to be transported through many different systems without corruption.
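To make the difference between the three encoding forms concrete, the sketch below (added for illustration) counts how many bytes a few sample characters take in each. The helper name `encoded_lengths` and the sample characters are chosen here and are not from the slides. The little-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_lengths(ch):
    """Return the number of bytes one character takes in each encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = {
    "latin letter A": "A",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "emoji outside the BMP": "\U0001F600",
}

for label, ch in samples.items():
    print(label, encoded_lengths(ch))
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4.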

Internationalization and localization

9

I18n

Internationalization (i18n): modification of an application so that it can handle multiple languages, countries, etc.:
- display content (web pages, files) in the end user’s language
- display messages around the site in the user’s language (e.g. “Home”, “Search”, error messages)
- input characters in the end user’s language
- print out the correct characters
- handle dates, numbers and word sorting using the rules of that language
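Two of the points above, site messages and date handling, can be sketched with plain lookup tables. This is a minimal illustration and not how the portal does it: the catalog contents, the date patterns (the "en" pattern assumes US-style month/day order) and the function names `translate` and `format_date` are all assumptions made here.

```python
import datetime

# Hypothetical message catalog: message key -> text, per language.
MESSAGES = {
    "en": {"home": "Home", "search": "Search", "not_found": "Page not found"},
    "fr": {"home": "Accueil", "search": "Rechercher", "not_found": "Page introuvable"},
}

# Hypothetical date patterns; real applications would use full locale data.
DATE_PATTERNS = {"en": "{m:02d}/{d:02d}/{y}", "fr": "{d:02d}/{m:02d}/{y}"}

DEFAULT_LANG = "en"

def translate(key, lang):
    """Look up a message, falling back to the default language, then to the key."""
    for candidate in (lang, DEFAULT_LANG):
        text = MESSAGES.get(candidate, {}).get(key)
        if text is not None:
            return text
    return key

def format_date(day, lang):
    """Format a date with the pattern of the given language."""
    pattern = DATE_PATTERNS.get(lang, DATE_PATTERNS[DEFAULT_LANG])
    return pattern.format(d=day.day, m=day.month, y=day.year)

print(translate("search", "fr"), format_date(datetime.date(2006, 9, 5), "fr"))
```

Keeping the texts and patterns outside the program logic is what lets a new language be added without touching the code.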

10

L10n

Localization (l10n) involves taking a product and making it linguistically and culturally appropriate to the target locale (country/region and language)

Means to change the language on a Web site:
- user selection
- detecting the browser settings
- automatically, based on the user’s profile

Translation issues:
- identifying un-translated or old translations of terms and phrases
- different roles for translators and content managers
- offering an interface for the content translation
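The three ways of choosing the language can be combined in one function. The sketch below is an assumption-laden illustration: it reads the browser settings from the HTTP `Accept-Language` header, and it assumes a priority order of explicit selection, then profile, then browser, which the slides do not specify. The names `parse_accept_language` and `choose_language` are hypothetical.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # a tag without q= has the highest preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        prefs.append((tag.strip().lower(), quality))
    # sorted() is stable, so equal qualities keep their header order.
    prefs.sort(key=lambda item: item[1], reverse=True)
    return [tag for tag, quality in prefs if quality > 0]

def choose_language(supported, selected=None, profile=None,
                    accept_language="", default="en"):
    """Pick a language: user selection, then profile, then browser, then default."""
    candidates = [selected, profile] + parse_accept_language(accept_language)
    for candidate in candidates:
        if not candidate:
            continue
        base = candidate.lower().split("-")[0]  # "fr-FR" matches "fr"
        if base in supported:
            return base
    return default

print(choose_language({"en", "fr", "ar"}, accept_language="de,ar;q=0.7,fr;q=0.9"))
```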

11

Example of XLIFF translation file coming from the translation service

XLIFF: XML Localization Interchange File Format
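The example file shown on the original slide is not preserved in this transcript. As a stand-in, the sketch below parses a small made-up XLIFF document with Python's standard XML parser. It assumes XLIFF version 1.2 and its namespace; the file actually used may have followed another version. The sample content and the function names `read_xliff` and `untranslated_ids` are hypothetical.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"  # assumed version: 1.2

# Made-up sample; "login" has no <target>, i.e. it is not yet translated.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="portal-messages" source-language="en"
        target-language="fr" datatype="plaintext">
    <body>
      <trans-unit id="home"><source>Home</source><target>Accueil</target></trans-unit>
      <trans-unit id="search"><source>Search</source><target>Rechercher</target></trans-unit>
      <trans-unit id="login"><source>Log in</source></trans-unit>
    </body>
  </file>
</xliff>
"""

def read_xliff(data):
    """Return (id, source text, target text or None) for every trans-unit."""
    root = ET.fromstring(data)
    units = []
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        source = unit.find(f"{{{XLIFF_NS}}}source")
        target = unit.find(f"{{{XLIFF_NS}}}target")
        units.append((
            unit.get("id"),
            source.text if source is not None else None,
            target.text if target is not None else None,
        ))
    return units

def untranslated_ids(units):
    """List the ids of units that have no target text yet."""
    return [unit_id for unit_id, _source, target in units if not target]

units = read_xliff(SAMPLE.encode("utf-8"))
print(units)
print(untranslated_ids(units))
```

Listing the units without a target is one simple way to identify un-translated terms when a file comes back from the translation service.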

Sorting in different languages

13

Sorting in the same language

Strings must be sorted according to that language's sorting rules

Complex characters, ignorable characters and exceptional words have to be considered

Normally done in two steps:

primary sorting
- uppercase and lowercase characters are equivalent
- diacritical marks are ignored
- ignorable characters are not considered

secondary sorting
- uppercase and lowercase are distinguished
- characters with diacritical marks are ranked individually
- ignorable characters influence the sorting
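The two steps can be approximated with a two-part sort key. The sketch below is a simplification added for illustration, not a full collation algorithm: the primary key folds case, strips diacritical marks and drops ignorable characters, and ties are then broken by plain code point order of the original string as a stand-in for real secondary rules. The set of ignorable characters (hyphen and apostrophe) and the function names are assumptions.

```python
import unicodedata

IGNORABLE = {"-", "'"}  # assumed set of ignorable characters

def primary_key(text):
    """Fold case, strip diacritical marks, drop ignorable characters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        if not unicodedata.combining(ch) and ch not in IGNORABLE
    ]
    return "".join(kept).casefold()

def two_step_key(text):
    """Primary key first; the original string breaks ties (secondary step)."""
    return (primary_key(text), text)

def two_step_sort(words):
    return sorted(words, key=two_step_key)

print(two_step_sort(["resume", "Résumé", "résumé", "coop", "co-op", "Zebra", "apple"]))
```

Words that differ only by case, accents or hyphens stay next to each other because they share a primary key; the secondary step only decides their order within that group.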

14

Sorting in different languages

Approaches

1. All strings in the same language should be sorted according to that language’s rules. Sorting is also governed by order among languages or among groups of languages, e.g. English, German, French = Roman group

2. Sort using the sorting rules that are associated with the language chosen by the end user, or with the site language
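The second approach means the same list can come out in a different order depending on the chosen language. The sketch below illustrates this with two heavily simplified rule sets: Swedish, where å, ä and ö are separate letters after z, and German dictionary order, where ä is sorted like a. The rule tables and function names are assumptions made for this example; a production system would rely on full collation data.

```python
import unicodedata

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def strip_marks(text):
    """Remove diacritical marks, e.g. a-umlaut becomes a."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def swedish_key(word):
    # Letters are ranked by their position in the Swedish alphabet;
    # anything else sorts after the alphabet.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in word.casefold()]

def german_key(word):
    # Umlauts sort like their base letter; the original word breaks ties.
    return (strip_marks(word.casefold()), word)

SORT_KEYS = {"sv": swedish_key, "de": german_key}

def sort_for_language(words, lang):
    """Sort with the rules of the language chosen by the user or the site."""
    return sorted(words, key=SORT_KEYS[lang])

words = ["zebra", "\u00e4pple", "apple"]
print(sort_for_language(words, "sv"))
print(sort_for_language(words, "de"))
```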

SEMIDE portal and toolkit - multilinguality issues

16

Multilingual portal – EN, FR, AR, …

17

Features

All pages are encoded in UTF-8: all characters of the world are supported

Default language set at startup: English

18

What aspects are multilingual?

Graphical user interface
- translation from the administrative area: one-by-one, .po, .XLIFF

Content
- individual translation for each item on edit

Glossaries and thesauri
- translation from Zope’s Management Interface

Syndication (RDF channels)
- depends on the selected language

Searches
- multiple language selection by the user

19

Language negotiation

When an item is not translated in the language selected by the end user, the system looks for a translation in:

1. the language from the user's browser settings

2. the default language

…and displays the item’s id if none of these works
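The fallback chain above can be sketched in a few lines. This is an illustration under assumptions, not the portal's actual code: translations are modelled as a plain dictionary from language code to text, and the function name `negotiate` and its parameters are hypothetical.

```python
def negotiate(item_id, translations, selected, browser_langs, default="en"):
    """Return the best available translation of an item, or its id.

    Order tried: the language selected by the user, then the languages
    from the browser settings, then the default language.
    """
    for lang in [selected, *browser_langs, default]:
        text = translations.get(lang)
        if text:
            return text
    return item_id  # nothing found: show the item's id

title = {"en": "Water news", "fr": "Actualités de l'eau"}
print(negotiate("water-news", title, selected="ar", browser_langs=["fr", "en"]))
```

Here the Arabic translation is missing, so the French one from the browser settings is shown; with no usable translation at all, the call returns "water-news".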