Coping with Babel
How to Localize XML
Designing for Localization
• Document design can seriously impact the costs of translation and localization.
• Remember that you are designing for all languages, not just English.
• There are clear do’s and don’ts.
• Overriding principle is good XML practice.
• Always consider the target language implications.
Entity references
Do not use entity references for word substitution:
<para>Use a &tool; to release the catch.</para>
• Cause problems for inflected languages
• Cause problems for parsing/translation tools
• Use boilerplate text instead
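The parsing problem can be seen with a minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser stands in for a translation tool. The helper name `parses_cleanly` is hypothetical. Without a DTD that declares `&tool;`, the fragment is rejected outright, while the spelled-out sentence parses.

```python
import xml.etree.ElementTree as ET

def parses_cleanly(xml_text):
    """Return True if the fragment parses without error, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The entity is not declared anywhere the parser can see it.
with_entity = "<para>Use a &tool; to release the catch.</para>"
# The same sentence with the word written out in the text.
plain_text = "<para>Use a claw hammer to release the catch.</para>"

print(parses_cleanly(with_entity))
print(parses_cleanly(plain_text))
```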
Translatable attributes
Avoid using translatable attributes:
<para>Use a <tool id="a1098" name="claw hammer"/> to release the CPU retention catch.</para>
• Cause problems for inflected languages
• Cause extra burden for translators
• More to go wrong
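A rough illustration of the extra burden, assuming a naive extractor that collects only element text (the helper names `element_text` and `attribute_values` are hypothetical): the translatable words never appear in the extracted sentence, so a second, attribute-aware pass is needed.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<para>Use a <tool id="a1098" name="claw hammer"/> '
    'to release the CPU retention catch.</para>'
)

def element_text(xml_text):
    """Collect only character data, as a simple text extractor would."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())

def attribute_values(xml_text, attr):
    """Collect the values of one attribute across all elements."""
    root = ET.fromstring(xml_text)
    return [el.get(attr) for el in root.iter() if el.get(attr) is not None]

print(element_text(SOURCE))              # sentence with a hole where the tool name belongs
print(attribute_values(SOURCE, "name"))  # the translatable words, detached from their sentence
```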
CDATA sections
Avoid using CDATA sections that may contain translatable text:
<tmpl><![CDATA[<p>Please refer to the <em>index page</em> for further information</p>]]></tmpl>
• Lose syntactical control
• How are translation tools to cope?
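The loss of syntactical control can be sketched as follows, again using Python's standard parser as a stand-in for a translation tool (the helper name `child_tags` is hypothetical). Inside CDATA the markup is just characters, so the parser sees no `p` or `em` elements to protect.

```python
import xml.etree.ElementTree as ET

CDATA_DOC = (
    "<tmpl><![CDATA[<p>Please refer to the <em>index page</em> "
    "for further information</p>]]></tmpl>"
)
MARKUP_DOC = (
    "<tmpl><p>Please refer to the <em>index page</em> "
    "for further information</p></tmpl>"
)

def child_tags(xml_text):
    """List the element names the parser finds below the root."""
    root = ET.fromstring(xml_text)
    return [el.tag for el in root.iter() if el is not root]

print(child_tags(CDATA_DOC))   # no elements: the tags are plain text
print(child_tags(MARKUP_DOC))  # real elements that tools can recognize
```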
Processing instructions
Avoid Processing Instructions in translatable text:
<para>Use a <?tool name="claw hammer"?> to release the CPU retention catch.</para>
• Syntactically weak
• Confuse translation memory operations
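As a small sketch of the weakness, assuming the default behaviour of Python's `xml.etree.ElementTree` (the helper name `visible_text` is hypothetical): the parser does not keep processing instructions in the tree by default, so the words inside the instruction never reach the extracted text.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<para>Use a <?tool name="claw hammer"?> '
    'to release the CPU retention catch.</para>'
)

def visible_text(xml_text):
    """Return the character data the default parser exposes."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())

# The processing instruction's content is not part of the text.
print("claw hammer" in visible_text(SOURCE))
```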
Infinite Naming Schemes
Avoid the use of infinite naming schemes:
<resources xml:lang="en">
<err001>Cannot open file $1.</err001>
<hint001>Hint: does file $1 exist?</hint001>
<err002>Incorrect value.</err002>
<hint002>Hint: Must be between $1 and $2.</hint002>
<err003>Connection timeout.</err003>
</resources>
• No clear element definitions
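One possible alternative, shown only as a hedged sketch: move the varying part of the name into attributes so the vocabulary stays fixed. The `message` element and its `kind` and `id` attributes are assumptions made for illustration, not something the slides prescribe, and `to_fixed_vocabulary` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

NUMBERED = """<resources xml:lang="en">
<err001>Cannot open file $1.</err001>
<hint001>Hint: does file $1 exist?</hint001>
<err002>Incorrect value.</err002>
</resources>"""

def to_fixed_vocabulary(xml_text):
    """Rewrite <err001>-style elements as <message kind="err" id="001">."""
    src = ET.fromstring(xml_text)
    out = ET.Element("resources", dict(src.attrib))
    for el in src:
        match = re.fullmatch(r"([a-z]+)(\d+)", el.tag)
        if match is None:
            continue  # not part of the numbered scheme
        kind, number = match.groups()
        msg = ET.SubElement(out, "message", {"kind": kind, "id": number})
        msg.text = el.text
    return out

converted = to_fixed_vocabulary(NUMBERED)
print(sorted({el.tag for el in ET.fromstring(NUMBERED)}))  # grows with every new message
print(sorted({el.tag for el in converted}))                # one element name to define
```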
Typographical elements
Avoid the use of "typographical" elements:
<para><b>Do not use</b> <br/> type elements.</para>
• Bad XML practice.
• Causes problems for translators.
• Target language text may be in the opposite order.
Do not break sentences
Never break a linguistically complete text unit over more than one non-inline element:
<para>
<line>This text should not be</line>
<line>broken this way – the translated text may well be in a different order.</line>
</para>
XML Translation Standards
• LISA - Localization Industry Standards Association: http://www.lisa.org
• OASIS - Organization for the Advancement of Structured Information Standards: http://www.oasis-open.org
• W3C - World Wide Web Consortium: http://www.w3c.org
• OLIF Consortium: http://www.olif.net
LISA Standards
• TMX - Translation Memory Exchange format: http://www.lisa.org/tmx
• TBX - Termbase Exchange format: http://www.lisa.org/tbx
• SRX - Segmentation Rules Exchange format: http://www.lisa.org/srx
• GMX - GILT Metrics Exchange format: http://www.lisa.org/gmx
OASIS L10n Standards
• XLIFF - XML Localization Interchange File Format: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff
• TransWS - Translation Web Services: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=trans-ws
W3C and OLIF
• W3C to start on Localization Directives standard.
• OLIF - Open Lexicon Interchange Format: http://www.olif.net
xml:tm
XML Text Memory
A radical new approach to translating XML documents
Computational Linguistic Methodologies
• Machine Translation
• Translation Memory
• Hybrid Linguistic Inferencing Engines
• Terminology
Translation memory
• Advent in the early 1980s
• Intermediate format
• Alignment
• Storage
• Leveraged memory
• Fuzzy matching – statistical
• Advantages: cost reduction, consistency
• Drawbacks: proofreading, managing memories
• No significant advances in technology
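Statistical fuzzy matching can be sketched roughly as below. This assumes Python's `difflib.SequenceMatcher` as a stand-in similarity measure; real translation memory tools use their own scoring. The function name `best_fuzzy_match`, the 0.75 threshold and the placeholder target string are all hypothetical.

```python
from difflib import SequenceMatcher

def best_fuzzy_match(segment, memory, threshold=0.75):
    """Return (source, target, score) for the closest stored segment, or None."""
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment, source).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best

# Source segment -> stored translation (placeholder text, not a real translation).
memory = {
    "Use a claw hammer to release the catch.": "[stored translation A]",
}

near = best_fuzzy_match("Use a claw hammer to release the latch.", memory)
far = best_fuzzy_match("Connection timeout.", memory)
print(near)
print(far)
```

A fuzzy match still needs proofreading by a translator, which is the drawback noted in the list above.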
XML namespace
• Major new feature of XML compared to SGML
• Allows the mapping of different ontological entities onto the same representation
• Allows different ways to look at the same data
• Namespaces can be made transparent
xml:tm namespace
• Text Memory namespace
• Can be mapped onto any XML document
• Vertical view of document in terms of ‘text segments’
• Can be totally transparent
xml:tm namespace
Example of the use of namespace in an XML document:
<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
    </section>
  </tm:tm>
</document>
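To show how the namespace gives a segment-level view of the document, here is a minimal sketch that lists the text units with Python's standard parser. The closing tags in `DOC` are added so that the example is well-formed, and `text_units` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

TM_NS = "urn:xml-Intl-tm"  # namespace URI as it appears in the example above

DOC = """<document xmlns:tm="urn:xml-Intl-tm">
  <tm:tm>
    <section>
      <para>
        <tm:te>
          <tm:tu>Namespace is very flexible.</tm:tu>
          <tm:tu>It is very easy to use.</tm:tu>
        </tm:te>
      </para>
    </section>
  </tm:tm>
</document>"""

def text_units(xml_text):
    """Return the text of every tm:tu element in document order."""
    root = ET.fromstring(xml_text)
    return [(tu.text or "").strip() for tu in root.iter(f"{{{TM_NS}}}tu")]

for segment in text_units(DOC):
    print(segment)
```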
xml:tm namespace
[Figure: the original document view (doc, title, section, para, text) shown alongside the tm namespace view (tm, te, tu), in which each text element (te) is divided into sentence-level text units (tu).]
xml:tm namespace
[Figure: original document view and tm namespace view of the text "Namespace is very simple. It is easy to use."]
<para>
<tm:te id="e1">
<tm:tu id="u1.1">Namespace is very simple.</tm:tu>
<tm:tu id="u1.2">It is easy to use.</tm:tu>
</tm:te>
</para>
xml:tm Text Memory
• Author memory
– Maintain memory of source text
– Authoring statistics
– Authoring tool input
• Translation memory
– Automatic alignment
– Maintain perfect link of source and target text
– Reduce translation costs
xml:tm DOM differencing
[Figure: the source document (tu id="1" to tu id="6") compared with the updated source document (tu id="1", "3", "4", "7", "6", "8"). One unit is deleted, tu id="7" is a modified unit carrying origid="5", and tu id="8" is new.]
xml:tm Author Memory
• Namespace aware differencing
• Identify changes from the previous version
• Unique text unit identifiers are maintained
• Modification history
• Text units can be loaded into a database
• Authoring environment integration
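The idea of maintaining identifiers across versions can be sketched as follows. This is not the xml:tm differencing algorithm, which works on the DOM and is namespace aware; it is a simplified stand-in that compares flat lists of text units with Python's `difflib`. The function name `carry_ids`, the status labels and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher

def carry_ids(old_units, new_texts):
    """old_units: list of (id, text). Returns (new_units, deleted_ids)."""
    old_texts = [text for _, text in old_units]
    next_id = max(uid for uid, _ in old_units) + 1
    result, deleted = [], []
    matcher = SequenceMatcher(None, old_texts, new_texts, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Unchanged units keep their original identifiers.
            for k in range(i2 - i1):
                result.append({"id": old_units[i1 + k][0],
                               "text": new_texts[j1 + k],
                               "status": "unchanged"})
        elif op == "delete":
            deleted.extend(uid for uid, _ in old_units[i1:i2])
        else:  # "replace" or "insert": these units get fresh identifiers
            for k, text in enumerate(new_texts[j1:j2]):
                unit = {"id": next_id, "text": text}
                if op == "replace" and i1 + k < i2:
                    unit["status"] = "modified"
                    unit["origid"] = old_units[i1 + k][0]
                else:
                    unit["status"] = "new"
                result.append(unit)
                next_id += 1
            if op == "replace":
                # Old units with no counterpart in the replacement are deleted.
                deleted.extend(uid for uid, _ in old_units[i1 + (j2 - j1):i2])
    return result, deleted

old = [(1, "Open the cover."), (2, "Remove the fan."), (3, "Locate the CPU."),
       (4, "Find the catch."), (5, "Release the catch."), (6, "Lift the CPU.")]
new = ["Open the cover.", "Locate the CPU.", "Find the catch.",
       "Release the catch carefully.", "Lift the CPU.", "Close the cover."]

units, removed = carry_ids(old, new)
for unit in units:
    print(unit)
print("deleted:", removed)
```

Because unchanged units keep their identifiers, a later matching step can look up their existing translations by id alone.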
xml:tm Translation Memory
• The tm namespace can be used to create XLIFF files
• Automatic alignment of source and target languages
• Allows for more focused translation matching
– Perfect matching
– Leveraged matching from document - identical text
– Leveraged matching from database
– Modified text unit matching
– Linguistically enhanced fuzzy matching
– Non translatable text unit identification
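The matching types above can be read as a cascade, tried from the most to the least reliable. The sketch below is one way to express that ordering under simplifying assumptions: text units are plain dictionaries, memories are plain lookups, and linguistically enhanced fuzzy matching and non translatable identification are left out. The function name `classify`, the labels and all sample data are hypothetical.

```python
def classify(unit, previous_targets, document_memory, database_memory):
    """Pick the strongest available match type for one text unit."""
    if unit["id"] in previous_targets:
        return "perfect match"                 # same id as the previous version
    if unit["text"] in document_memory:
        return "in-document leveraged match"   # identical text elsewhere in the document
    if unit["text"] in database_memory:
        return "database leveraged match"      # identical text stored in the database
    if unit.get("origid") in previous_targets:
        return "modified text unit match"      # edited version of a translated unit
    return "requires translation"

# Translations from the previous version, keyed by text unit id (placeholders).
previous_targets = {1: "T1", 3: "T3", 4: "T4", 5: "T5", 6: "T6"}
# Source text -> translation, from this document and from a database.
document_memory = {"Close the cover.": "T3"}
database_memory = {"Disconnect the power cable.": "DB-T"}

updated = [
    {"id": 1, "text": "Open the cover."},
    {"id": 7, "text": "Release the catch carefully.", "origid": 5},
    {"id": 8, "text": "Close the cover."},
    {"id": 9, "text": "Disconnect the power cable."},
    {"id": 10, "text": "Store the unit in a dry place."},
]

labels = [classify(u, previous_targets, document_memory, database_memory)
          for u in updated]
for unit, label in zip(updated, labels):
    print(unit["id"], label)
```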
xml:tm translation
[Figure: a source document and a translated document, each with tu id="1" to tu id="6", shown alongside an XLIFF document with trans-unit id="1" to trans-unit id="6".]
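How text unit identifiers might carry over into XLIFF can be sketched as below. The slides do not describe the actual generation step, so this is only an assumption: a bare XLIFF 1.2 skeleton built with Python's standard library, with one `trans-unit` per text unit. The function name `to_xliff`, the `xlf` prefix, the file name and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("xlf", XLIFF_NS)  # prefix chosen for readable output

def to_xliff(units, source_lang, original):
    """Build a minimal XLIFF 1.2 document from (id, text) pairs."""
    xliff = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(xliff, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for unit_id, text in units:
        # The text unit id is reused as the trans-unit id.
        tu = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": str(unit_id)})
        ET.SubElement(tu, f"{{{XLIFF_NS}}}source").text = text
    return ET.tostring(xliff, encoding="unicode")

xliff_text = to_xliff(
    [(1, "Namespace is very simple."), (2, "It is easy to use.")],
    source_lang="en",
    original="example.xml",
)
print(xliff_text)
```

Reusing the identifiers is what lets the translated text be merged back into the right place without a separate alignment step.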
xml:tm translated document
[Figure: the translated document view and the translated tm namespace view. The structure (doc, title, section, para, te, tu) is the same as in the source document, with the text translated (tekst, zdanie).]
xml:tm perfect alignment
[Figure: the source document and the translated document carry the same text unit identifiers (tu id="1" to tu id="6"), giving perfect alignment.]
xml:tm perfect matching
[Figure: the updated source document (tu id="1", "2", "3", "4", "7", "6", "8", with labels deleted, modified and new) against the matched target document (tu id="1", "3", "4", "7", "6", "8"). Units with unchanged identifiers are perfect matches; two units are labeled "requires translation".]
xml:tm contextual memory
[Figure: a source document and a translated document (tu id="1" to tu id="6") in perfect alignment.]
xml:tm leveraged DB memory
[Figure: a perfectly aligned source document and translated document (tu id="1" to tu id="6") connected to a database (DB).]
xml:tm in-document leveraged matching
[Figure: the updated source document (tu id="1", "2", "3", "4", "7", "6", "8", with labels deleted, modified, and new with the same text as id="3") against the matched target document (tu id="1", "3", "4", "7", "6", "8"). Labels: perfect matching, requires translation, requires proofing, leveraged match.]
xml:tm in-document fuzzy matching
[Figure: the updated source document (tu id="1", "2", "3", "4", "7", "6", "8", with labels deleted, modified with origid="5", and new with the same text as an existing unit) against the matched target document (tu id="1", "3", "4", "7", "6", "8"). Labels: perfect matching, requires translation, requires proofing, fuzzy match, leveraged match.]
xml:tm db leveraged matching
[Figure: as in the previous figure, with an added tu id="9" in both the updated source document and the matched target document, matched from the database (DB). Labels: perfect matching, requires translation, requires proofing, fuzzy match, doc leveraged match, DB leveraged match (requires proofing).]
xml:tm non translatable text
[Figure: as in the previous figure, with tu id="2" marked as non translatable and requiring no translation. Labels: perfect matching, requires translation, requires proofing, fuzzy match, doc leveraged match, DB leveraged match (requires proofing), non translatable.]
Traditional Translation Scenario
[Figure: workflow split between Publishing and Translation: source text, extract, extracted text, tm process, prepared text, translate, translated text, merge, target text, QA.]
xml:tm Translation Scenario
[Figure: workflow between Publishing and Translator: xml source text, extract, extracted text, tm process (perfect matching, leveraged matching), prepared text, translate through a web interface, QA, merge, xml target text. Two stages are labeled "Automatic Process".]
xml:tm matching
• Perfect Matching driven by Author Memory
• Leveraged Matching:
– 100% same text
– In document Leveraged Matching
– Database Leveraged Matching
• Fuzzy Matching
– Modified Matching
– Linguistically aware Fuzzy Matching
• Non translatable element identification
– Alphanumeric
– Numeric
– Measurements
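Identifying non translatable text units could look roughly like the sketch below. The slides name only the three categories, so the regular expressions, the unit list and the function name `non_translatable_kind` are assumptions chosen for illustration.

```python
import re

# A plain number, optionally signed, with "." or "," separators.
NUMERIC = re.compile(r"[+-]?\d+(?:[.,]\d+)*")
# A number followed by a unit; the unit list here is a small illustrative sample.
MEASUREMENT = re.compile(r"[+-]?\d+(?:[.,]\d+)*\s?(?:mm|cm|m|km|g|kg|V|A|W|Hz|%)")
# A single token without spaces, such as a part number.
CODE = re.compile(r"[A-Za-z0-9._/-]+")

def non_translatable_kind(text):
    """Return 'numeric', 'measurement', 'alphanumeric', or None if translatable."""
    s = text.strip()
    if NUMERIC.fullmatch(s):
        return "numeric"
    if MEASUREMENT.fullmatch(s):
        return "measurement"
    if CODE.fullmatch(s) and any(ch.isdigit() for ch in s):
        return "alphanumeric"
    return None

for sample in ["42", "12.5 mm", "a1098", "Incorrect value."]:
    print(repr(sample), non_translatable_kind(sample))
```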
xml:tm benefits
• Enterprise level scalability
• Totally integrated within the XML framework
• Source text is automatically extracted and matched
• Word counts are controlled by the customer
• Text can be presented for translation via the web
• Online composition
• The most up to date translation is held by the customer
• Data is merged automatically at end of translation cycle
• All memory operations are totally automated
• Can be used transparently for relay translations
• Much cheaper to implement and run
• More accurate – better matching
xml:tm summary
• Can be used to build consistent authoring systems
• Can be used to produce automatic authoring statistics
• Translation Memory generation and alignment is totally automatic
• Memory is held within the documents themselves
• Extraction and merging for translation are automatic
• The system provides much more efficient matching mechanisms
• Structure of the XML document is protected during translation
xml:tm
• Fully specified XML based standard
• http://www.xml-intl.com/docs/specification/xml-tm.html
• Maintained by xml-intl.com
• http://www.xml-intl.com/dtd/tm.dtd
• http://www.xml-intl.com/dtd/tm.xsd
• Detailed article on www.xml.com
• Offered for consideration as a LISA standard
Any questions?