English-Persian SMT
Transcript of English-Persian SMT
![Page 2: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/2.jpg)
Outline
MT Introduction
SMT Introduction
Requirements for SMT
Evaluation metrics
English-Persian MT challenges
English-Persian SMT: System1, System2
Problems in English-Persian SMT
![Page 3: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/3.jpg)
MT Introduction
Automatic translation of text written in a natural language into another one by the use of computers is referred to as Machine Translation.
There are several ways to do this: dictionary-based, rule-based, example-based, and statistical approaches.
![Page 4: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/4.jpg)
SMT Introduction
The first ideas of statistical machine translation were proposed by Warren Weaver in 1947.
Statistical machine translation tries to learn the translation by examining the translations made by humans.
![Page 5: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/5.jpg)
SMT Introduction (Cont.)
Statistical MT models take the view that every sentence in the target language is a translation of the source language sentence with some probability.
The best translation, of course, is the sentence that has the highest probability.
The key problems in statistical MT are estimating the probability of a translation and efficiently finding the sentence with the highest probability.
![Page 6: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/6.jpg)
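SMT Introduction (Cont.)
Given a source sentence $f$, we seek the target sentence $\hat{e}$ that maximizes $P(e \mid f)$:
$$\hat{e} = \arg\max_e P(e \mid f)$$
By Bayes' rule:
$$P(e \mid f) = \frac{P(e)\,P(f \mid e)}{P(f)}$$
Since $P(f)$ does not depend on $e$, it can be dropped from the maximization:
$$\arg\max_e P(e \mid f) = \arg\max_e P(e)\,P(f \mid e)$$
Intuitively, $P(e \mid f)$ depends on two factors: $P(e)$ measures fluency and $P(f \mid e)$ measures faithfulness.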
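The maximization over P(e) * P(f | e) can be illustrated with a minimal sketch. The candidate sentences and all probability values below are hypothetical toy numbers chosen for illustration (the source sentence is imagined to be the Persian "in khaneh kuchak ast"); a real system obtains them from a trained language model and translation model.

```python
import math

# Hypothetical toy values, not taken from any real model.
# LM[e] stands for P(e), the fluency of candidate e.
LM = {
    "this house is small": 0.02,
    "small this house is": 0.0001,
    "this home is little": 0.005,
}
# TM[e] stands for P(f | e), the faithfulness of e to the source sentence f.
TM = {
    "this house is small": 0.3,
    "small this house is": 0.3,
    "this home is little": 0.2,
}

def best_translation(candidates, lm, tm):
    """Return the candidate e that maximizes log P(e) + log P(f | e)."""
    return max(candidates, key=lambda e: math.log(lm[e]) + math.log(tm[e]))

best = best_translation(list(LM), LM, TM)
print(best)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. The scrambled candidate is as faithful as the winner in this toy setup but loses on fluency.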
![Page 7: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/7.jpg)
SMT Introduction (Cont.)
Philipp Koehn, http://homepages.inf.ed.ac.uk/pkoehn
![Page 8: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/8.jpg)
Why SMT?
Better use of resources
No linguistic knowledge is needed
It can be used for any language pair
But we need a large training corpus
![Page 9: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/9.jpg)
Steps of SMT
![Page 10: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/10.jpg)
Requirements for SMT
Bilingual and monolingual corpus: the bilingual corpus needs two files aligned sentence by sentence (one file for the source language and the other for the target language)
Microsoft Bilingual Sentence Aligner
Language model: we need a tool to compute P(e); this step needs a monolingual corpus in the target language
SRILM: a tool for building N-gram language models
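A minimal sketch of how an N-gram model assigns P(e): a bigram model with add-one smoothing, trained on a hypothetical three-sentence corpus. This is a simplification for illustration only; it is not how SRILM is implemented, and SRILM supports higher orders and better smoothing methods.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count bigrams and their histories over whitespace-tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    # Words the model can predict: all corpus words plus the end marker.
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, len(vocab)

def sentence_prob(sentence, model):
    """P(e) as a product of add-one smoothed bigram probabilities."""
    histories, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= (bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size)
    return p

# Hypothetical toy corpus.
model = train_bigram_lm(["the house is small", "the house is big", "a house is small"])
fluent = sentence_prob("the house is small", model)
scrambled = sentence_prob("small is house the", model)
print(fluent, scrambled)
```

The fluent word order receives a much higher probability than the scrambled one, which is what the language model contributes during decoding.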
![Page 11: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/11.jpg)
LM output
![Page 12: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/12.jpg)
Requirements for SMT
Translation model: we need a tool to compute P(f|e); this step needs a bilingual corpus
GIZA++: produces the word alignments from which the phrase table is built
Decoder: searches for the best translation
Moses
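To give a feel for how lexical translation probabilities can be learned from a sentence-aligned corpus, here is a minimal sketch of IBM Model 1 trained with EM. GIZA++ implements this family of models, but the sketch is a simplification under stated assumptions: it omits the NULL word and the higher IBM models, and the three sentence pairs (transliterated Persian on the f side) are hypothetical.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities t(f | e) with EM."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e
        # E-step: distribute each f word over the e words of its sentence pair.
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate t(f | e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Hypothetical toy corpus: (f sentence, e sentence) as token lists.
pairs = [
    ("in khaneh".split(), "this house".split()),
    ("in ketab".split(), "this book".split()),
    ("yek ketab".split(), "a book".split()),
]
t = ibm_model1(pairs)
print(round(t[("khaneh", "house")], 3), round(t[("in", "house")], 3))
```

Although "house" co-occurs equally often with "in" and "khaneh", EM shifts probability toward "khaneh" because "in" is better explained by "this".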
![Page 13: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/13.jpg)
Phrase table
![Page 14: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/14.jpg)
Moses tool
![Page 15: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/15.jpg)
The training steps
Prepare data
Run GIZA++
Align words
Get lexical translation table
Extract phrases
Score phrases
Build reordering model
Build generation models
Create configuration file
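The "Score phrases" step can be sketched as a relative-frequency estimate over the extracted phrase pairs. The function name and the extracted pairs below are hypothetical, and the sketch computes only one direction, phi(f | e); an actual phrase table typically stores several scores per phrase pair.

```python
from collections import Counter

def score_phrases(extracted_pairs):
    """Relative frequency: phi(f | e) = count(f, e) / count(e)."""
    pair_counts = Counter(extracted_pairs)
    e_counts = Counter(e for _, e in extracted_pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

# Hypothetical phrase pairs (f phrase, e phrase), as if extracted from word alignments.
extracted = [
    ("in khaneh", "this house"),
    ("in khaneh", "this house"),
    ("in manzel", "this house"),
    ("yek ketab", "a book"),
]
table = score_phrases(extracted)
for (f, e), score in sorted(table.items()):
    print(f"{f} ||| {e} ||| {score:.3f}")
```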
![Page 16: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/16.jpg)
Evaluation metrics
BLEU (Bilingual Evaluation Understudy)
Developed at IBM
The closer a machine translation is to a professional human translation, the better it is
NIST
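The idea behind BLEU, closeness to a human reference, can be made concrete with a simplified sketch: clipped n-gram precision combined by a geometric mean and multiplied by a brevity penalty. This sketch assumes a single reference, sentence-level scoring, and no smoothing; standard BLEU is computed over a whole test corpus and may use several references. The example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU without smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat is on the mat"
perfect = bleu(reference, reference)
partial = bleu("the cat sat on the mat", reference, max_n=2)
print(perfect, partial)
```

The partial match is scored with `max_n=2` because, without smoothing, a single sentence with no matching 4-gram would score zero.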
![Page 17: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/17.jpg)
English-Persian MT challenges
The structure of Persian is very different from that of English
The structure of the Persian language is very complex
Little previous work has been done for this language pair
Effective SMT systems rely on very large bilingual corpora, but these are not readily available for the English-Persian language pair
![Page 18: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/18.jpg)
English-Persian SMT
Few English-Persian MT systems have been developed
Most of them are purely rule-based
There are two works on English-Persian SMT:
Mohaghegh and Sarrafzadeh (Massey University)
Pilevar and Faili (Tehran University)
![Page 19: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/19.jpg)
System1
Corpus: BBC news
![Page 20: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/20.jpg)
System1 (Cont.)
Tools: SRILM, GIZA++, Moses
![Page 21: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/21.jpg)
System1: Improved Language Modeling
![Page 22: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/22.jpg)
System2
Corpus: Bilingual (TEP): film subtitles, 3 books, KDE4
![Page 23: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/23.jpg)
System2 (Cont.)
Corpus: Monolingual: Hamshahri, film subtitles
![Page 24: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/24.jpg)
System2 (Cont.)
Tools: SRILM, GIZA++, Moses
PersianSMT with 4-gram Sub-LM
![Page 25: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/25.jpg)
Comparison of PersianSMT with Google Translate
![Page 26: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/26.jpg)
Problems in English-Persian SMT
Compound verbs (alignment problem)
Use a phrase-based SMT system
But inflectional morphology is a problem: the large number of inflected verb forms prevents the system from learning to translate all the individual forms of a compound verb
Persian treats personal pronouns as an optional element of the sentence (alignment problem)
![Page 27: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/27.jpg)
Problems (Cont.)
Failure of the system to place the elements of the sentence in the right order
Use a phrase-based SMT system
Re-rank the n-best output list and/or reorder the output sentences
Prior to translation, the input sentence is reordered using morpho-syntactic information, so that its word order better resembles that of the target language.
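Pre-translation reordering can be illustrated with a toy sketch. It assumes the input is already part-of-speech tagged and applies one hypothetical rule: move verbs to the end of the clause, so that English subject-verb-object order approximates Persian subject-object-verb order. The function name, tag set, and rule are assumptions for illustration; the systems described here would need far richer morpho-syntactic rules.

```python
def reorder_svo_to_sov(tagged):
    """Toy rule: move all VERB tokens to the end, keeping relative order."""
    verbs = [(w, t) for w, t in tagged if t == "VERB"]
    others = [(w, t) for w, t in tagged if t != "VERB"]
    return others + verbs

# Hypothetical pre-tagged input sentence.
tagged = [("Ali", "NOUN"), ("bought", "VERB"), ("a", "DET"), ("book", "NOUN")]
reordered = [w for w, _ in reorder_svo_to_sov(tagged)]
print(" ".join(reordered))
```

The reordered English sentence follows the word order of the Persian "Ali yek ketab kharid", which makes word alignment closer to monotone.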
![Page 28: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/28.jpg)
![Page 29: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/29.jpg)
References
[1] A. Ramanathan, "Statistical Machine Translation", Ph.D. Seminar Report, Department of Computer Science and Engineering, Indian Institute of Technology, 2000.
[2] A. Lopez, "Statistical Machine Translation", ACM Computing Surveys, 2008.
[3] M. Mohaghegh and A. Sarrafzadeh, "The first English-Persian statistical machine translation", New Zealand Postgraduate Conference, 2009.
[4] M. Mohaghegh and A. Sarrafzadeh, "An analysis of the effect of training data variation in English-Persian Statistical Machine Translation", 2009 International Conference on Innovations in Information Technology (IIT 2009).
[5] M. Mohaghegh and A. Sarrafzadeh, "Performance evaluation of various training data in English-Persian statistical machine translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010), Rome, Italy, June 9-11, 2010.
[6] M. Mohaghegh and A. Sarrafzadeh, "Improved Language Modeling for English-Persian Statistical Machine Translation", COLING 2010 / SIGMT Workshop, 23rd International Conference on Computational Linguistics, Beijing, China, 28 August 2010.
![Page 30: English-Persian SMT](https://reader036.fdocuments.net/reader036/viewer/2022081501/56813fae550346895daa94a7/html5/thumbnails/30.jpg)
References (Cont.)
[7] M.T. Pilevar and H. Faili, "PersianSMT: A First Attempt to English-Persian Statistical Machine Translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010).