TOPIC 6: MACHINE TRANSLATION
NATURAL LANGUAGE PROCESSING (NLP)
CS-724
Wondwossen Mulugeta (PhD) email: [email protected]
Topics
Topic 6: Machine Translation
• Introduction • Challenges • Applications • Approaches
Why (Machine) Translation?
Languages in the world
• 6,800 living languages
• 600 with a written tradition
• 95% of the world's population speaks 100 languages
Definitions
“Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another.” European Association for Machine Translation
“…Machine Translation (MT) as it is generally known --- the attempt to automate all, or part of the process of translating from one human language to another.” Arnold D J. MACHINE TRANSLATION: An Introductory Guide
Currently, Google Translate supports 103 languages, offering translations between more than 3,000 language pairs:
Amharic
Afrikaans
Albanian
Arabic
Armenian
Azerbaijani
Basque
Belarusian
Bulgarian
Catalan
Chinese
Croatian Czech
Danish
Dutch
English
Estonian
Filipino
Finnish
French
Galician
Georgian
German
Greek
Haitian Creole
Hebrew
Hindi
Hungarian
Icelandic
Indonesian
Irish
Italian
Japanese
Korean
Latvian
Lithuanian
Macedonian
Malay
Maltese
Norwegian
Polish
Portuguese
Romanian
Russian
Serbian
Slovak
Slovenian
Spanish
Swahili
Swedish
Thai
Turkish
Ukrainian
Urdu
Vietnamese
Welsh
Yiddish
English-Amharic in Google Translate
[Screenshot: an English-to-Amharic translation example in Google Translate.]
Why Machine Translation?
• Full translation: domain specific, e.g., weather reports
• Machine-aided translation: requires post-editing
• Cross-lingual NLP applications: cross-language IR, cross-language summarization
Why is translation hard?
Two or three steps are involved, depending on the approach (rule-based vs. statistical):
1. "Understand" the source text
2. Convert that into the target language
3. Generate correct target text
Understanding the source text involves the same problems as any NLP application.
Understanding the source text
Lexical ambiguity
• At the morphological level: ambiguity of word vs. stem+ending (tower, flower); very important and complex for Amharic
• Grammatical category ambiguity (e.g., close)
• Homonymy: identical words but different meanings
• Alternate meanings within the same grammatical category (e.g., bank)
Syntactic ambiguity
• Deep: due to a combination of grammatically ambiguous words, e.g., "Time flies like an arrow; fruit flies like a banana."
• Shallow: due to alternative interpretations of structure, e.g., "The man saw the girl with a telescope" (the girl had the telescope, or the man used the telescope to see the girl)
Multilingual Challenges
Orthographic variations
• Ambiguous spelling: Arabic is usually written without diacritics, e.g., كتب الاولاد اشعارا vs. the vocalized كَتَبَ الأَوْلَادُ أَشْعَارًا ("the boys wrote poems")
• Ambiguous boundaries: Amharic ትመጫለሽ ("you (f.) will come"): where is the boundary that separates the stem from the affixes?
Lexical ambiguity
• bank: بنك (financial) vs. ضفة (river)
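To make the lexical-ambiguity problem concrete, here is a minimal, hypothetical Python sketch of context-based lexical choice for the bank example above; the cue words and the overlap heuristic are our assumptions, not part of the slides.

```python
# Hypothetical sketch: choosing between Arabic translations of "bank"
# using crude context cues (not a real disambiguation algorithm).
SENSES = {
    "bank": [("بنك", {"money", "deposit", "loan"}),   # financial sense
             ("ضفة", {"river", "shore", "water"})],   # riverbank sense
}

def choose(word, context_words):
    """Pick the sense whose cue words overlap the context the most."""
    return max(SENSES[word], key=lambda s: len(s[1] & context_words))[0]

print(choose("bank", {"deposit", "money"}))  # -> بنك
print(choose("bank", {"river", "walk"}))     # -> ضفة
```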
MT Challenges: Ambiguity
Syntactic ambiguity: "I saw the man with the telescope"
[Two parse trees: in one, the PP "with the telescope" attaches to the VP (I used the telescope to see the man); in the other, it attaches to the NP "the man" (the man had the telescope).]
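A minimal sketch of this PP-attachment ambiguity, assuming NLTK is available; the toy grammar is ours, written only so the sentence receives exactly the two analyses described above.

```python
# Show both attachment analyses of the ambiguous sentence with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
PP -> P NP
NP -> 'I' | Det N | NP PP
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "I saw the man with the telescope".split()
for tree in parser.parse(sentence):
    tree.pretty_print()  # prints both the VP- and NP-attachment trees
```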
MT Challenges: Ambiguity
Syntactic ambiguity: "I saw the man on the hill with the telescope"
Lexical ambiguity: E: book (the writing material, or the action of reserving a place?)
Semantic ambiguity
• Homography: ball (E) = pelota, baile (S)
• Polysemy: kill (E) = matar, acabar (S)
• Semantic granularity: esperar (S) = wait, expect, hope (E); be (E) = ser, estar (S); fish (E) = pez, pescado (S)
Approaches to Machine Translation
• Rule-based
• Statistical approaches
• Hybrid systems (using a statistical approach in a rule-based architecture, or …)
MT Approaches: MT Pyramid
[Pyramid diagram: analysis climbs from source word through source syntax to source meaning; generation descends from target meaning through target syntax to target word. Direct word-to-word transfer along the base is labeled "gisting".]
MT Approaches: Gisting Example
Source (Spanish): Sobre la base de dichas experiencias se estableció en 1988 una metodología.
Word-for-word gist: Envelope her basis out speak experiences them settle at 1988 one methodology.
Correct translation: On the basis of these experiences, a methodology was arrived at in 1988.
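Gisting is essentially word-by-word dictionary lookup with no reordering or syntax. Here is a minimal sketch that reproduces the slide's gist; the tiny glossary is hand-built for this one sentence.

```python
# Word-by-word "gisting": look up each word in isolation, keep order.
GLOSSARY = {
    "sobre": "envelope", "la": "her", "base": "basis",
    "de": "out", "dichas": "speak", "experiencias": "experiences",
    "se": "them", "estableció": "settle", "en": "at",
    "1988": "1988", "una": "one", "metodología": "methodology",
}

def gist(sentence):
    """Translate each word independently; pass unknown words through."""
    return " ".join(GLOSSARY.get(w.strip(".,").lower(), w)
                    for w in sentence.split())

print(gist("Sobre la base de dichas experiencias se estableció "
           "en 1988 una metodología."))
# -> envelope her basis out speak experiences them settle at 1988
#    one methodology
```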
MT Approaches: MT Pyramid
[The same pyramid, now with "transfer" marked at the syntax level, above gisting.]
MT Approaches: Transfer Example
A transfer lexicon maps SL structure to TL structure:
poner (:subj X, :obj mantequilla, :mod en (:obj Y)) → butter (:subj X, :obj Y)
X puso mantequilla en Y → X buttered Y
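A minimal sketch of what such a transfer rule does, assuming dependency trees encoded as nested dicts; this representation and the rule format are our simplifications of the slide's diagram.

```python
# Structural transfer: rewrite a Spanish dependency tree as English.
def transfer(tree):
    """Apply the 'poner mantequilla en Y' -> 'butter Y' rule."""
    if (tree.get("head") == "poner"
            and tree.get("obj", {}).get("head") == "mantequilla"
            and tree.get("mod", {}).get("head") == "en"):
        return {
            "head": "butter",
            "subj": tree["subj"],        # X carries over as subject
            "obj": tree["mod"]["obj"],   # Y moves up from en's object
        }
    return tree  # no rule matched: leave the structure unchanged

source = {
    "head": "poner",
    "subj": {"head": "X"},
    "obj": {"head": "mantequilla"},
    "mod": {"head": "en", "obj": {"head": "Y"}},
}
print(transfer(source))
# -> {'head': 'butter', 'subj': {'head': 'X'}, 'obj': {'head': 'Y'}}
```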
MT Approaches: MT Pyramid
[The pyramid again, adding "interlingua" at the apex (the meaning level), above transfer and gisting.]
MT Approaches: Interlingua Example: Lexical Conceptual Structure
[LCS diagram not reproduced in the transcript.]
MT Approaches: MT Pyramid
[The pyramid annotated with the resources each level requires: interlingual lexicons at the meaning level, transfer lexicons at the syntax level, and dictionaries/parallel corpora at the word level.]
Statistical MT
Automatic word alignment
GIZA++: a statistical machine translation toolkit used to train word alignments (it needs parallel text for the language pair). It uses different algorithms and approaches to bootstrap alignments.
[Alignment matrix: "Mary did not slap the green witch" aligned with "Maria no dio una bofetada a la bruja verde".]
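GIZA++ itself is a standalone C++ toolkit, so as a stand-in here is a minimal sketch of the same idea, IBM Model 1 word alignment trained with EM, using NLTK's implementation on a toy fragment of the slide's sentence pair; real training needs a large bitext.

```python
# Toy word-alignment run with NLTK's IBM Model 1 (GIZA++ implements
# this family of models, among others).
from nltk.translate import AlignedSent, IBMModel1

bitext = [
    AlignedSent(["maria", "no", "dio", "una", "bofetada"],
                ["mary", "did", "not", "slap"]),
    AlignedSent(["maria", "no"], ["mary", "did", "not"]),
    AlignedSent(["dio", "una", "bofetada"], ["slap"]),
]

ibm1 = IBMModel1(bitext, 10)  # 10 EM iterations

# Learned lexical translation probability, e.g. P(maria | mary):
print(ibm1.translation_table["maria"]["mary"])
# Training also stores a best word alignment on each sentence pair:
print(bitext[0].alignment)
```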
Phrase-Based Statistical MT
• The foreign input is segmented into phrases
– a "phrase" is any sequence of words
• Each phrase is probabilistically translated into English
– P(to the conference | zur Konferenz)
– P(into the meeting | zur Konferenz)
• Phrases are probabilistically re-ordered
This is state-of-the-art!
Example: Morgen fliege ich nach Kanada zur Konferenz → Tomorrow I will fly to the conference in Canada
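A minimal sketch of the phrase-table idea behind this example; the table entries, probabilities, and greedy segmentation are our toy assumptions (a real decoder also scores reorderings and a language model).

```python
# Toy phrase-based translation: segment into known phrases, pick the
# most probable option for each. The phrase table is hypothetical.
PHRASE_TABLE = {
    ("morgen",): [("tomorrow", 0.9)],
    ("fliege", "ich"): [("i will fly", 0.8)],
    ("nach", "kanada"): [("in canada", 0.6), ("to canada", 0.4)],
    ("zur", "konferenz"): [("to the conference", 0.7),
                           ("into the meeting", 0.3)],
}

def translate(sentence):
    """Greedy left-to-right segmentation, longest phrase first."""
    words, out, i = sentence.lower().split(), [], 0
    while i < len(words):
        for j in range(len(words), i, -1):          # longest match first
            options = PHRASE_TABLE.get(tuple(words[i:j]))
            if options:
                out.append(max(options, key=lambda o: o[1])[0])
                i = j
                break
        else:
            out.append(words[i]); i += 1            # unknown word passes through
    return out

print(translate("Morgen fliege ich nach Kanada zur Konferenz"))
# -> ['tomorrow', 'i will fly', 'in canada', 'to the conference']
# A reordering model would then move 'in canada' after the conference.
```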
MT Approaches: Practical Considerations
• Resource availability
– Parsers and generators: input/output compatibility
– Translation lexicons: word-based vs. transfer/interlingua
– Parallel corpora: domain of interest; bigger is better (more data gives more variety and accuracy)
• Time requirements: statistical training, resource building
Moses Statistical MT
• An open-source MT system developed in C++
• Lets you automatically train translation models for any language pair
• Requirement: a collection of translated texts (a parallel corpus)
• Phrase-based
Example-Based MT
• A long-established approach to empirical MT, first developed in contrast with rule-based MT
• The idea is translation by analogy: translate by adapting previously seen examples rather than by linguistic rule
• "Existing translations contain more solutions to more translation problems than any other available resource."
• In computational terms, it belongs to the family of case-based reasoning approaches
EBMT Basic Idea
Requires a database of translation pairs:
1. Match the input against the example database (like a translation memory)
2. Identify the corresponding translation fragments (align)
3. Recombine the fragments into the target text

Example
Input: He buys a book on international politics.
Matches:
• He buys a notebook. → Kare wa nōto o kau.
• I read a book on international politics. → Watashi wa kokusai seiji nitsuite kakareta hon o yomu.
Result: Kare wa kokusai seiji nitsuite kakareta hon o kau.
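A minimal sketch of the match/align/recombine loop on this exact example; the example database and the hand-listed fragment correspondences are our assumptions (a real EBMT system would derive alignments automatically).

```python
# Toy EBMT: translate by swapping one aligned fragment of a stored example.
EXAMPLES = [
    ("he buys a notebook", "kare wa nōto o kau"),
    ("i read a book on international politics",
     "watashi wa kokusai seiji nitsuite kakareta hon o yomu"),
]

# Hand-coded sub-sentential alignments (source fragment -> target fragment)
FRAGMENTS = {
    "a notebook": "nōto",
    "a book on international politics":
        "kokusai seiji nitsuite kakareta hon",
}

def ebmt(inp):
    inp = inp.lower().strip(".")
    for src, tgt in EXAMPLES:
        for ex_frag, ex_tgt in FRAGMENTS.items():
            for in_frag, in_tgt in FRAGMENTS.items():
                # match: the input equals an example up to one fragment
                if (ex_frag in src and in_frag in inp and
                        src.replace(ex_frag, "<X>") == inp.replace(in_frag, "<X>")):
                    # recombine: splice the new fragment's translation in
                    return tgt.replace(ex_tgt, in_tgt)
    raise ValueError("no example matches closely enough")

print(ebmt("He buys a book on international politics."))
# -> kare wa kokusai seiji nitsuite kakareta hon o kau
```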
EBMT vs. SMT
SMT:
• essentially uses statistical data (parameters, probabilities) derived from the bitext
• pre-processing the data is essential
• even if the input is in the training data, you are not guaranteed to get the same translation
EBMT:
• uses the bitext as its primary data source
• pre-processing the data is optional
• if the input is in the example set, you are guaranteed to get the same translation
MT Evaluation
Translation system evaluation is more of an art than a science: we are concerned with how people feel about the translation.
Wide range of metrics/techniques: interface, …, scalability, …, faithfulness, …, space/time complexity, etc.
Automatic vs. human-based evaluation (dumb machines vs. slow humans)
Automatic Evaluation Example: BLEU Metric
• A statistical evaluation method
• Uses modified n-gram precision with a length penalty (starts from unigrams and goes up to a larger window of words)
• Quick, inexpensive, and language independent (string matching at the word or phrase level)
• Correlates highly with human evaluation (humans process translation by fragments)
• Suffers from a bias against synonyms and inflectional variations

Test sentence: colorless green ideas sleep furiously
Gold-standard references (human translations):
• all dull jade ideas sleep irately
• drab emerald concepts sleep furiously
• colorless immature thoughts nap angrily
There are five words in the test sentence, and four of them are found somewhere in the gold-standard references. Thus:
• Unigram precision = 4/5
Automatic Evaluation Example: BLEU Metric
Test sentence: colorless green ideas sleep furiously (matched n-gram by n-gram against the same three references)
• Unigram precision = 4/5 = 0.8
• Bigram precision = 2/4 = 0.5 (matches: "ideas sleep", "sleep furiously")
• BLEU score = (a1 × a2 × … × an)^(1/n) = (0.8 × 0.5)^(1/2) = (0.4)^(1/2) = 0.6325, i.e., 63.25%
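A minimal sketch that reproduces the slide's numbers: modified n-gram precision with clipping, combined by a geometric mean over n = 1, 2 (the brevity penalty is omitted here, since the test sentence is not shorter than the references).

```python
# BLEU as computed on the slide: clipped n-gram precisions, geometric mean.
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def precision(test, refs, n):
    test_counts = Counter(ngrams(test, n))
    # clip each n-gram count by its maximum count in any single reference
    max_ref = Counter()
    for ref in refs:
        for ng, c in Counter(ngrams(ref, n)).items():
            max_ref[ng] = max(max_ref[ng], c)
    clipped = sum(min(c, max_ref[ng]) for ng, c in test_counts.items())
    return clipped / sum(test_counts.values())

test = "colorless green ideas sleep furiously".split()
refs = ["all dull jade ideas sleep irately".split(),
        "drab emerald concepts sleep furiously".split(),
        "colorless immature thoughts nap angrily".split()]

p1, p2 = precision(test, refs, 1), precision(test, refs, 2)
bleu = (p1 * p2) ** 0.5          # geometric mean over n = 1, 2
print(p1, p2, round(bleu, 4))    # -> 0.8 0.5 0.6325
```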
Human-based Evaluation Example: Accuracy Criteria
Perform this scoring for each sentence of the test cases and take the average.
5: contents of the original sentence conveyed (might need minor corrections)
4: contents of the original sentence conveyed BUT errors in word order
3: contents of the original sentence generally conveyed BUT errors in the relationships between phrases, tense, singular/plural, etc.
2: contents of the original sentence not adequately conveyed; portions of the original sentence incorrectly translated; missing modifiers
1: contents of the original sentence not conveyed; missing verbs, subjects, objects, phrases, or clauses
Human-based Evaluation Example: Fluency Criteria
Perform this scoring for each sentence of the test cases and take the average.
5: clear meaning; good grammar, terminology, and sentence structure
4: clear meaning BUT bad grammar, bad terminology, or bad sentence structure
3: meaning graspable BUT ambiguities due to bad grammar, bad terminology, or bad sentence structure
2: meaning unclear BUT inferable
1: meaning absolutely unclear
End of Topic 6