Coping with Babel

How to Localize XML

Designing for Localization

• Document design can seriously impact the costs of translation and localization.

• Remember that you are designing for all languages, not just English.

• There are clear do’s and don’ts.

• The overriding principle is good XML practice.

• Always consider the target language implications.

Entity references

Do not use entity references for word substitution:

<para>Use a &tool; to release the catch.</para>

• Cause problems for inflected languages

• Cause problems for parsing/translation tools

• Use boilerplate text instead
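
A rough way to find such word-substitution entities before a document goes to translation is to scan for named entity references that are not among the five predefined by XML. The sketch below is only an illustration: the function name `find_word_entities` is hypothetical, and a plain regular expression will also report matches inside comments or CDATA sections.

```python
import re

# The five entities predefined by XML itself; these are not word substitution.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Named entity references such as &tool; (character references like &#160; do not match).
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")

def find_word_entities(xml_text):
    """Return the names of non-predefined entity references found in xml_text."""
    return [name for name in ENTITY_REF.findall(xml_text) if name not in PREDEFINED]

sample = "<para>Use a &tool; to release the catch &amp; lift.</para>"
print(find_word_entities(sample))
```

Each name reported is a place where a translator would see a placeholder instead of the word that has to be inflected.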

Translatable attributes

Avoid using translatable attributes:

<para>Use a <tool id="a1098" name="claw hammer"/> to release the CPU retention catch.</para>

• Cause problems for inflected languages

• Cause extra burden for translators

• More to go wrong
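
One common alternative is to carry the translatable text as element content rather than as an attribute value. The sketch below assumes the `tool` element and `name` attribute from the example above; the helper `lift_attribute_to_text` is a hypothetical name, not part of any tool mentioned in these slides.

```python
import xml.etree.ElementTree as ET

def lift_attribute_to_text(root, tag, attr):
    """Move `attr` into element content for every empty `tag` element under root."""
    for elem in root.iter(tag):
        is_empty = len(elem) == 0 and not (elem.text or "").strip()
        if attr in elem.attrib and is_empty:
            elem.text = elem.attrib.pop(attr)
    return root

source = ('<para>Use a <tool id="a1098" name="claw hammer"/> '
          'to release the CPU retention catch.</para>')
para = lift_attribute_to_text(ET.fromstring(source), "tool", "name")
result = ET.tostring(para, encoding="unicode")
print(result)
```

The translator then sees "claw hammer" inline with the rest of the sentence. This does not remove the inflection problem on its own, but it keeps all translatable text in one place.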

CDATA sections

Avoid using CDATA sections that may contain translatable text:

<tmpl><![CDATA[Please refer to the index page for further information]]></tmpl>

• Lose syntactical control

• How are translation tools to cope?

Processing instructions

Avoid Processing Instructions in translatable text:

<para>Use a <?tool name="claw hammer"?> to release the CPU retention catch.</para>

• Syntactically weak

• Confuse translation memory operations

Infinite Naming Schemes

Avoid the use of infinite naming schemes:

<resources xml:lang="en">

<err001>Cannot open file $1.</err001>

<hint001>Hint: does file $1 exist?</hint001>

<err002>Incorrect value.</err002>

<hint002>Hint: Must be between $1 and $2.</hint002>

<err003>Connection timeout.</err003>

</resources>

• No clear element definitions
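
A finite alternative is a single message element whose kind and identifier are attributes, so a schema or DTD can define it once. The sketch below rewrites element names such as `err001` into that form. The element name `msg` and the attributes `type` and `id` are assumptions made for illustration; the slides do not prescribe a replacement vocabulary.

```python
import re
import xml.etree.ElementTree as ET

# Matches names built from a kind plus a running number, e.g. err001 or hint002.
NUMBERED_NAME = re.compile(r"^([a-z]+)(\d+)$")

def normalize(resources):
    """Rename numbered child elements to <msg type="..." id="...">."""
    for elem in resources:
        match = NUMBERED_NAME.match(elem.tag)
        if match:
            kind = match.group(1)
            elem.set("type", kind)
            elem.set("id", elem.tag)   # keep the old name as a stable identifier
            elem.tag = "msg"           # hypothetical fixed element name
    return resources

src = """<resources xml:lang="en">
<err001>Cannot open file $1.</err001>
<hint001>Hint: does file $1 exist?</hint001>
</resources>"""
root = normalize(ET.fromstring(src))
print(ET.tostring(root, encoding="unicode"))
```

With a fixed element name, translation tools need only one rule to find the translatable text, however many messages are added.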

Typographical elements

Avoid the use of "typographical" elements:

<para>Do not use type elements.</para>

• Bad XML practice.

• Causes problems for translators.

• Target language text may be in the opposite order.

Do not break sentences

Never break a linguistically complete text unit over more than one non-inline element:

<para>

<line>This text should not be</line>

<line>broken this way – the translated text may well be in a different order.</line>

</para>

XML Translation Standards

• LISA - Localization Industry Standards Association: http://www.lisa.org

• OASIS - Organization for the Advancement of Structured Information Standards: http://www.oasis-open.org

• W3C - World Wide Web Consortium: http://www.w3c.org

• OLIF Consortium: http://www.olif.net

LISA Standards

• TMX - Translation Memory Exchange format: http://www.lisa.org/tmx

• TBX - Termbase Exchange format: http://www.lisa.org/tbx

• SRX - Segmentation Rules Exchange format: http://www.lisa.org/srx

• GMX - GILT Metrics Exchange format: http://www.lisa.org/gmx

OASIS L10n Standards

• XLIFF - XML Localization Interchange File Format: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff

• TransWS - Translation Web Services: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=trans-ws

W3C and OLIF

• W3C to start on Localization Directives standard.

• OLIF - Open Lexicon Interchange Format: http://www.olif.net

xml:tm

XML Text Memory

A radical new approach to translating XML documents

• Machine Translation

• Translation Memory

• Hybrid Linguistic Inferencing Engines

• Terminology

Computational Linguistic Methodologies

Translation memory

• Advent in the early 1980s

• Intermediate format

• Alignment

• Storage

• Leveraged memory

• Fuzzy matching – statistical

• Advantages: cost reduction, consistency

• Drawbacks: proofreading, managing memories

• No significant advances in technology
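
To make the idea of fuzzy matching concrete, the sketch below looks up a new segment in a small in-memory translation memory. It uses the character-based similarity ratio from Python's `difflib` as a stand-in; commercial translation memory tools use their own scoring, and the memory contents, the threshold and the Polish targets here are illustrative assumptions.

```python
from difflib import SequenceMatcher

def best_match(segment, memory, threshold=0.75):
    """Return (source, target, score) for the closest stored segment, or None."""
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best

# Illustrative memory: source segment -> previously approved translation.
memory = {
    "Cannot open file.": "Nie można otworzyć pliku.",
    "Connection timeout.": "Przekroczono limit czasu połączenia.",
}

print(best_match("Cannot open the file.", memory))  # close to a stored segment
print(best_match("12345", memory))                  # nothing similar is stored
```

A match below 100% still has to be proofread, which is the drawback noted in the list above.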

XML namespace

• Major new feature of XML compared to SGML

• Allows the mapping of different ontological entities onto the same representation

• Allows different ways to look at the same data

• Namespaces can be made transparent

xml:tm namespace

• Text Memory namespace

• Can be mapped onto any XML document

• Vertical view of document in terms of ‘text segments’

• Can be totally transparent

xml:tm namespace

Example of the use of namespace in an XML document:

<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
    </section>
  </tm:tm>
</document>
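
Because the text units live in their own namespace, a namespace-aware parser can pull them out without knowing anything about the host vocabulary. The sketch below reads the example above with Python's standard `xml.etree.ElementTree`; the closing tags are added here only so that the fragment parses.

```python
import xml.etree.ElementTree as ET

TM = "urn:xml-Intl-tm"  # namespace URI taken from the example above

doc = """<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
    </section>
  </tm:tm>
</document>"""

root = ET.fromstring(doc)
# Collect every text unit, wherever it sits in the host document.
units = [tu.text.strip() for tu in root.iter("{%s}tu" % TM)]
print(units)
```

The same loop works for any document type, since only the `tm` namespace elements are addressed.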

xml:tm namespace

[Figure: two views of the same document. The original document view is a tree of doc, title, section and para elements holding text; the tm namespace view maps tm, te and tu elements onto that text, with each sentence held in a tu.]
xml:tm namespace

Original document view:

<para>Namespace is very simple. It is easy to use.</para>

tm namespace view:

<para>

<tm:te id="e1">

<tm:tu id="u1.1">Namespace is very simple.</tm:tu>

<tm:tu id="u1.2">It is easy to use.</tm:tu>

</tm:te>

</para>

xml:tm Text Memory

• Author memory

– Maintain memory of source text

– Authoring statistics

– Authoring tool input

• Translation memory

– Automatic alignment

– Maintain perfect link of source and target text

– Reduce translation costs

xml:tm DOM differencing

[Figure: the source document (tu id 1 to 6) compared with the updated source document (tu id 1, 3, 4, 7, 6, 8). One unit is marked deleted, tu id 7 is marked modified with origid="5", and tu id 8 is marked new.]

xml:tm Author Memory

• Namespace aware differencing

• Identify changes from the previous version

• Unique text unit identifiers are maintained

• Modification history

• Text units can be loaded into a database

• Authoring environment integration
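
The bookkeeping behind author memory can be sketched as a comparison of two versions of a document's text units. This is not the xml:tm differencing algorithm itself, which the slides do not spell out. It is a simplified illustration in which unchanged text keeps its identifier, a revised unit gets a new identifier plus the `origid` of the unit it replaces, and `difflib` similarity with an assumed threshold decides what counts as a revision. The sample sentences are invented.

```python
from difflib import SequenceMatcher

def diff_units(old, new_texts, threshold=0.6):
    """old: dict of id -> text; new_texts: texts of the updated document, in order.
    Returns (units, deleted) where units is a list of (id, text, status, origid)."""
    by_text = {text: uid for uid, text in old.items()}  # assumes texts are unique
    next_id = max(old, default=0) + 1
    kept = {by_text[t] for t in new_texts if t in by_text}
    candidates = {uid: text for uid, text in old.items() if uid not in kept}
    units = []
    for text in new_texts:
        if text in by_text:
            units.append((by_text[text], text, "unchanged", None))
            continue
        # Look for a removed unit that this text might be a revision of.
        origid, best = None, threshold
        for uid, old_text in candidates.items():
            score = SequenceMatcher(None, old_text, text).ratio()
            if score >= best:
                origid, best = uid, score
        if origid is not None:
            del candidates[origid]
        units.append((next_id, text, "modified" if origid else "new", origid))
        next_id += 1
    return units, sorted(candidates)

old = {1: "Open the cover.", 2: "Remove the battery.", 3: "Release the catch.",
       4: "Lift the CPU out.", 5: "Insert the new CPU carefully.", 6: "Close the cover."}
new_texts = ["Open the cover.", "Release the catch.", "Lift the CPU out.",
             "Insert the replacement CPU carefully.", "Close the cover.",
             "Dispose of the old CPU safely."]
units, deleted = diff_units(old, new_texts)
for unit in units:
    print(unit)
print("deleted:", deleted)
```

The sample data is arranged to mirror the DOM differencing slide: six original units, one deleted, one modified and one new.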

xml:tm Translation Memory

• The tm namespace can be used to create XLIFF files

• Automatic alignment of source and target languages

• Allows for more focused translation matching

– Perfect matching

– Leveraged matching from document - identical text

– Leveraged matching from database

– Modified text unit matching

– Linguistically enhanced fuzzy matching

– Non translatable text unit identification
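
How text units might be turned into an XLIFF file can be sketched as follows. The slides do not show how xml:tm tooling builds XLIFF, so this is only an assumption-laden illustration: it targets the XLIFF 1.2 vocabulary, reuses the `tm:tu` identifiers as `trans-unit` ids, and the function name `to_xliff` and the file name `manual.xml` are hypothetical.

```python
import xml.etree.ElementTree as ET

TM = "urn:xml-Intl-tm"
XLIFF = "urn:oasis:names:tc:xliff:document:1.2"

def to_xliff(root, original, source_lang, target_lang):
    """Build an XLIFF 1.2 tree with one trans-unit per tm:tu element."""
    xliff = ET.Element("{%s}xliff" % XLIFF, {"version": "1.2"})
    file_el = ET.SubElement(xliff, "{%s}file" % XLIFF, {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, "{%s}body" % XLIFF)
    for tu in root.iter("{%s}tu" % TM):
        unit = ET.SubElement(body, "{%s}trans-unit" % XLIFF, {"id": tu.get("id")})
        ET.SubElement(unit, "{%s}source" % XLIFF).text = (tu.text or "").strip()
    return xliff

doc = """<para xmlns:tm="urn:xml-Intl-tm">
<tm:te id="e1">
<tm:tu id="u1.1">Namespace is very simple.</tm:tu>
<tm:tu id="u1.2">It is easy to use.</tm:tu>
</tm:te>
</para>"""

ET.register_namespace("", XLIFF)  # serialize XLIFF elements without a prefix
xliff = to_xliff(ET.fromstring(doc), "manual.xml", "en", "pl")
print(ET.tostring(xliff, encoding="unicode"))
```

Because the ids travel with the text, a translated `target` can later be merged back into the unit it came from.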

xml:tm translation

[Figure: a source document and a translated document, each with tu id 1 to 6, shown with an XLIFF document containing trans-unit id="1".]

xml:tm translated document

[Figure: the translated document view and the translated tm namespace view. The doc, title, section, para, tm, te and tu structure is the same as in the source; the text is shown in Polish (tekst = text, zdanie = sentence).]

xml:tm perfect alignment

[Figure: source document and translated document, each with tu id 1 to 6, linked unit for unit (perfect alignment).]


xml:tm perfect matching

[Figure: the updated source document (tu id 1, 2, 3, 4, 7, 6, 8, with units marked deleted, modified and new) beside the matched target document (tu id 1, 3, 4, 7, 6, 8). Units are labeled either Perfect Matching or requires translation.]

xml:tm leveraged DB memory

[Figure: two pairs of perfectly aligned source and translated documents (tu id 1 to 6 each) shown with a database (DB); labeled xml:tm contextual memory.]

xml:tm in-document leveraged matching

[Figure: the updated source document (tu id 1, 2, 3, 4, 7, 6, 8, with units marked deleted and modified) beside the target document (tu id 1, 3, 4, 7, 6, 8). The new unit is marked as having the same text as tu id 3 (new:same id="3"). Labels: Perfect Matching, leveraged match, requires proofing.]

xml:tm in-document fuzzy matching

[Figure: the updated source document (tu id 1, 2, 3, 4, 7, 6, 8) beside the target document (tu id 1, 3, 4, 7, 6, 8). The modified unit carries mod:origid="5" and the new unit is marked new:same. Labels: Perfect Matching, fuzzy match, leveraged match, requires proofing.]


xml:tm db leveraged matching

[Figure: the updated source document (tu id 1, 2, 3, 4, 7, 6, 8, 9) beside the target document (tu id 1, 3, 4, 7, 6, 8, 9), with a database (DB). The modified unit carries mod:origid="5" and the new unit is marked new:same. Labels: Perfect Matching, fuzzy match, doc leveraged match, DB leveraged match, requires proofing.]


xml:tm non translatable text

[Figure: the updated source document (tu id 1, 2, 3, 4, 7, 6, 8, 9) beside the target document (tu id 1, 3, 4, 7, 6, 8, 9), with a database (DB). tu id 2 is marked non translatable and requires no translation. Other labels: Perfect Matching, fuzzy match, doc leveraged match, DB leveraged match, requires proofing.]

Traditional Translation Scenario

[Figure: workflow between Publishing and Translation, with the stages: source text, extract, extracted text, tm process, prepared text, translate, translated text, merge, target text, QA.]

xml:tm Translation Scenario

[Figure: workflow between Publishing and Translator, with the stages: xml source text, extract, extracted text, tm process, perfect matching, leveraged matching, prepared text, translate, web interface, QA, merge, xml target text. Two stages are labeled Automatic Process.]

xml:tm matching

• Perfect Matching driven by Author Memory

• Leveraged Matching: 100% same text

– In document Leveraged Matching

– Database Leveraged Matching

• Fuzzy Matching

– Modified Matching

– Linguistically aware Fuzzy Matching

• Non translatable element identification

– Alphanumeric

– Numeric

– Measurements

xml:tm benefits

• Enterprise level scalability

• Totally integrated within the XML framework

• Source text is automatically extracted and matched

• Word counts are controlled by the customer

• Text can be presented for translation via the web

• Online composition

• The most up to date translation is held by the customer

• Data is merged automatically at end of translation cycle

• All memory operations are totally automated

• Can be used transparently for relay translations

• Much cheaper to implement and run

• More accurate – better matching

xml:tm summary

• Can be used to build consistent authoring systems

• Can be used to produce automatic authoring statistics

• Translation Memory generation and alignment is totally automatic

• Memory is held within the documents themselves

• Extraction and merging for translation are automatic

• The system provides much more efficient matching mechanisms

• Structure of the XML document is protected during translation

xml:tm

• Fully specified XML based standard

• http://www.xml-intl.com/docs/specification/xml-tm.html

• Maintained by xml-intl.com

• http://www.xml-intl.com/dtd/tm.dtd

• http://www.xml-intl.com/dtd/tm.xsd

• Detailed article on www.xml.com

• Offered for consideration as a LISA standard

Any questions?