Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System
Alon Lavie, Language Technologies Institute
Carnegie Mellon University
Joint work with: Shuly Wintner, Yaniv Eytani (University of Haifa); Erik Peterson, Katharina Probst (Carnegie Mellon)
October 4, 2004 TMI-2004 2
Outline
• Hebrew and its Challenges for MT
• CMU Transfer-based MT Framework
• Hebrew-to-English System
• Input Pre-processing and Morphological Analysis
• MT Resources: Lexicon and Grammar
• Performance Evaluation
• Conclusions, Current and Future Work
Modern Hebrew
• Native language of about 3-4 million people in Israel
• Semitic language, closely related to Arabic and with similar linguistic properties
  – Root+Pattern word formation system
  – Rich verb and noun morphology
  – Particles attach as prefixes to the following word: definite article (H), prepositions (B,K,L,M), coordinating conjunction (W), relativizers ($,K$)…
• Unique alphabet and writing system
  – 22 letters represent (mostly) consonants
  – Vowels represented (mostly) by diacritics
  – Modern texts omit the diacritic vowels, adding a level of ambiguity: one “bare” word can correspond to several vocalized words
  – Example: MHGR → mehager, m+hagar, m+h+ger
Modern Hebrew Spelling
• Two main spelling variants
  – “KTIV XASER” (deficient): spelling intended for use with the vowel diacritics; when the diacritics are removed, only the consonants remain
  – “KTIV MALEH” (full): words with I/O/U vowels are written with long vowels, which include a letter
• KTIV MALEH is predominant, but not strictly adhered to, even in newspapers and official publications → inconsistent spelling
• Example:
  – niqud (vowel pointing) can be spelled NIQWD, NQWD, NQD
  – Written as NQD, it could also be niqed, naqed, nuqad
Challenges for Hebrew MT
• Paucity of existing language resources for Hebrew
  – No publicly available broad-coverage morphological analyzer
  – No publicly available bilingual lexicons or dictionaries
  – No POS-tagged corpus or parse tree-bank for Hebrew
  – No large Hebrew/English parallel corpus
• Scenario well suited for the CMU transfer-based MT framework for languages with limited resources
Transfer Engine (system architecture)

Pipeline: Hebrew Input → Preprocessing → Morphology → Transfer Engine → Decoder → English Output

Hebrew Input: בשורה הבאה

Transfer Rules:
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1) (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Translation Lexicon:
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "LINE"))

Translation Output Lattice:
(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)

English Language Model → Decoder → English Output: "in the next line"
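The translation output lattice shown above can be sketched as a set of span-labeled edges. This is a minimal illustrative representation, not the actual XFER data structure; the `Edge` class and `edges_from` helper are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    start: int  # first span index, as printed in the lattice listing
    end: int    # last span index, as printed in the lattice listing
    text: str   # translated fragment
    cat: str    # category label, e.g. "@PREP", "@NP"

lattice = [
    Edge(0, 1, "IN", "@PREP"),
    Edge(1, 1, "THE", "@DET"),
    Edge(2, 2, "LINE", "@N"),
    Edge(1, 2, "THE LINE", "@NP"),
    Edge(0, 2, "IN LINE", "@PP"),
    Edge(0, 4, "IN THE NEXT LINE", "@PP"),
]

def edges_from(lattice, pos):
    """All transferred fragments that start at a given source position."""
    return [e for e in lattice if e.start == pos]
```

The decoder (described later) then searches this edge set for the best-scoring sequence of fragments covering the whole input.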
Transfer Rule Formalism
Components annotated in the rule below:
• Type information
• Part-of-speech/constituent information
• Alignments
• x-side constraints
• y-side constraints
• xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER)))
Transfer Rule Formalism (II)
Constraint types annotated in the rule below:
• Value constraints (e.g. ((X1 AGR) = *3-SING))
• Agreement constraints (e.g. ((Y2 GENDER) = (Y4 GENDER)))

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER)))
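As a reading aid, the rule above can be rendered as plain data: constituent sequences, alignment pairs, and constraints. The dictionary layout and the `reorder` helper below are illustrative only; the engine has its own rule language and unifier.

```python
# Hypothetical Python rendering of the NP rule above.
# Note the one-to-many alignment: X1 (the source DET) maps to both
# Y1 and Y3, since Hebrew repeats the definite article on the adjective.
rule = {
    "x_side": ["DET", "ADJ", "N"],            # source constituents
    "y_side": ["DET", "N", "DET", "ADJ"],     # target constituents
    "alignments": [(1, 1), (1, 3), (2, 4), (3, 2)],  # Xi::Yj pairs
    "x_constraints": [("X1", "AGR", "*3-SING"), ("X1", "DEF", "*DEF"),
                      ("X3", "AGR", "*3-SING"), ("X3", "COUNT", "+")],
    "y_constraints": [("Y1", "DEF", "*DEF"), ("Y3", "DEF", "*DEF"),
                      ("Y2", "AGR", "*3-SING")],
    "xy_constraints": [(("Y2", "GENDER"), ("Y4", "GENDER"))],
}

def reorder(rule, source_tokens):
    """Apply only the alignment part of the rule: place each source
    constituent at its aligned target position(s)."""
    out = [None] * len(rule["y_side"])
    for xi, yj in rule["alignments"]:
        out[yj - 1] = source_tokens[xi - 1]
    return out
```

For example, `reorder(rule, ["the", "old", "man"])` yields the Hebrew-order sequence `["the", "man", "the", "old"]`, mirroring ha-ish ha-zaqen. The constraints are listed but not enforced in this sketch; enforcing them requires unification, as the engine does.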
The Transfer Engine

Analysis: Source text is parsed into its grammatical structure. Determines transfer application ordering.
Example: 他 看 书。 (he read book)
  (S (NP (N 他)) (VP (V 看) (NP 书)))

Transfer: A target language tree is created by reordering, insertion, and deletion.
  (S (NP (N he)) (VP (V read) (NP (DET a) (N book))))
Article “a” is inserted into the object NP. Source words are translated with the transfer lexicon.

Generation: Target language constraints are checked and the final translation is produced.
E.g. “reads” is chosen over “read” to agree with “he”.

Final translation: “He reads a book”
XFER + Decoder
• XFER engine produces a lattice of all possible transferred fragments
• Decoder searches for and selects the best-scoring sequence of fragments as the final translation output
• Main advantages:
  – Very high robustness
    • always some translation output
    • no transfer grammar → word-to-word translation
  – Scoring can take into account word-to-word translation probabilities, transfer rule scores, and a target statistical language model
  – Effective framework for late-stage disambiguation
• Main difficulty: the lattice can get very large → pruning
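The fragment-selection step can be sketched as a forward dynamic program over the lattice. This is a simplification under stated assumptions: each edge is given as `(start, end, text, score)` with an exclusive end position and a single precombined score; the real decoder also applies the target language model across fragment boundaries and prunes.

```python
def best_path(edges, n):
    """edges: list of (start, end_exclusive, text, score);
    n: number of input positions to cover.
    Returns (best_score, fragment_list) covering positions 0..n,
    or None if no covering sequence exists."""
    best = {0: (0.0, [])}  # position -> (score so far, fragments so far)
    for pos in range(n):
        if pos not in best:
            continue
        score, frags = best[pos]
        for s, e, text, w in edges:
            if s == pos:
                cand = (score + w, frags + [text])
                if e not in best or cand[0] > best[e][0]:
                    best[e] = cand
    return best.get(n)
```

For instance, with edges `[(0, 1, "IN", 0.9), (1, 2, "THE", 0.8), (2, 3, "LINE", 0.7), (0, 3, "IN THE NEXT LINE", 2.6)]` the single long fragment (score 2.6) beats the word-by-word path (score 2.4), illustrating late-stage disambiguation by score.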
Hebrew Text Encoding Issues
• Input texts are (most commonly) in the standard Windows encoding for Hebrew, but also Unicode (UTF-8) and others…
• The morphological analyzer and other resources are already set up to work in a romanized “ASCII-like” representation
• A converter script maps the input into the romanized representation – a 1-to-1 mapping!
• All further processing is done in the romanized representation
• Lexicon and grammar rules are also converted into the romanized representation
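A 1-to-1 letter mapping of this kind can be sketched as a simple table lookup. The table below is an assumption: it is reconstructed to be consistent with the transliterations used in these slides (e.g. בשורה → B$WRH, שעה → $&H), and the actual converter script may differ in details. Final letter forms map to the same symbol as their non-final counterparts.

```python
# Romanization table consistent with the examples in these slides
# (reconstructed, not the project's actual table).
HEB2ASCII = {
    "א": "A", "ב": "B", "ג": "G", "ד": "D", "ה": "H", "ו": "W",
    "ז": "Z", "ח": "X", "ט": "@", "י": "I", "כ": "K", "ך": "K",
    "ל": "L", "מ": "M", "ם": "M", "נ": "N", "ן": "N", "ס": "S",
    "ע": "&", "פ": "P", "ף": "P", "צ": "C", "ץ": "C", "ק": "Q",
    "ר": "R", "ש": "$", "ת": "T",
}

def romanize(text):
    """Map each Hebrew letter to its ASCII symbol; pass others through,
    so spaces and punctuation survive unchanged."""
    return "".join(HEB2ASCII.get(ch, ch) for ch in text)
```

Because the mapping is 1-to-1 on letters, it is trivially invertible, which is what lets the lexicon and grammar live entirely in the romanized representation.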
Morphological Analyzer
• An analyzer program developed at the Technion was available; it works on Windows and, with minimal adaptation, on Linux
• Coverage is reasonable (for nouns, verbs, and adjectives)
• Produces all analyses, or a disambiguated analysis, for each word
• Output format includes lexeme (base form), POS, and morphological features
• Output was adapted to our representation needs (POS and feature mappings)
Morphological Processing
• Split attached prefixes and suffixes into separate words for translation
• Produce f-structures as output
• Convert feature-value codes to our conventions
• “All analyses mode”: all possible analyses for each input word are returned, represented in the form of an input lattice
• Analyzer installed as a server, integrated with the input pre-processor
Morphology Example
• Input word: B$WRH
0     1     2     3     4
|--------B$WRH--------|
|-----B-----|$WR|--H--|
|--B--|--H--|--$WRH---|
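The three segmentations in the figure can be checked mechanically: each analysis is a sequence of arcs over the positions 0..4, and a valid segmentation must tile the whole span with no gaps or overlaps. The labels and the `tiles` helper below are illustrative, and the arc spans follow the figure (the analyzer's exact indices may differ).

```python
# The three competing segmentations of B$WRH, as arc lists (start, end, lex).
analyses = {
    "noun":           [(0, 4, "B$WRH")],                       # "gospel/news"
    "prep+noun+poss": [(0, 2, "B"), (2, 3, "$WR"), (3, 4, "H")],
    "prep+det+noun":  [(0, 1, "B"), (1, 2, "H"), (2, 4, "$WRH")],
}

def tiles(segmentation, start, end):
    """True iff the arcs cover start..end contiguously, in order."""
    pos = start
    for s, e, _ in segmentation:
        if s != pos:
            return False
        pos = e
    return pos == end
```

All three segmentations tile 0..4, so all three are passed downstream in the input lattice and disambiguation is deferred to the decoder.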
Morphology Example (output f-structures)

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N)
    (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N)
    (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N)
    (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
Translation Lexicon
• Constructed our own Hebrew-to-English lexicon, based primarily on the existing “Dahan” H-to-E and E-to-H dictionary made available to us
• Coverage is not great, but not bad
  – Dahan H-to-E is about 15K translation pairs
  – Dahan E-to-H is about 7K translation pairs
• POS information on both sides
• No proper names or named entities
• Converted Dahan into our representation; added entries for missing closed-class items (pronouns, prepositions, etc.)
• Issue with spelling conventions
  – The Dahan dictionary uses deficient KTIV XASER
  – Developed conversion scripts for the most common patterns of verbs
  – Added/merged these into the resulting lexicon
• Target-side (English) morphological variants added into the lexicon
Translation Lexicon: Examples

PRO::PRO |: ["ANI"] -> ["I"]
((X1::Y1) ((X0 per) = 1) ((X0 num) = s) ((X0 case) = nom))

PRO::PRO |: ["ATH"] -> ["you"]
((X1::Y1) ((X0 per) = 2) ((X0 num) = s) ((X0 gen) = m) ((X0 case) = nom))

N::N |: ["$&H"] -> ["HOUR"]
((X1::Y1) ((X0 NUM) = s) ((Y0 NUM) = s) ((Y0 lex) = "HOUR"))

N::N |: ["$&H"] -> ["hours"]
((X1::Y1) ((Y0 NUM) = p) ((X0 NUM) = p) ((Y0 lex) = "HOUR"))
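The entries above show the lookup pattern: one Hebrew lexeme ($&H, “hour”) has one entry per English surface form, and feature constraints select between them. A minimal sketch of that selection, using a simplified dictionary form that stands in for the real lexicon format (the field names are illustrative):

```python
# Simplified stand-in for the lexicon entries shown above.
lexicon = [
    {"src": "$&H", "tgt": "HOUR",  "constraints": {"NUM": "s"}},
    {"src": "$&H", "tgt": "hours", "constraints": {"NUM": "p"}},
]

def translate(word, features):
    """Return target forms whose constraints are all satisfied by the
    morphological features of the analyzed source word."""
    return [e["tgt"] for e in lexicon
            if e["src"] == word
            and all(features.get(k) == v
                    for k, v in e["constraints"].items())]
```

With the plural analysis from the morphological analyzer, only the "hours" entry fires; with the singular analysis, only "HOUR" does. In the real system this filtering happens by unification against the f-structure.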
Transfer Grammar (human-developed)
• Written by Alon in a few days…
• Current grammar has 36 rules:
  – 21 NP rules
  – one PP rule
  – 6 verb complexes and VP rules
  – 8 higher-phrase and sentence-level rules
• Captures the most common (mostly local) structural differences between Hebrew and English
Transfer Grammar: Example Rules

{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1) (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1) (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))
Sample Output (dev-data)
maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat
a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police
in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money
Evaluation Results
• Test set of 62 sentences from Haaretz newspaper, 2 reference translations
System BLEU NIST P R METEOR
No Gram 0.0616 3.4109 0.4090 0.4427 0.3298
Learned 0.0774 3.5451 0.4189 0.4488 0.3478
Manual 0.1026 3.7789 0.4334 0.4474 0.3617
Current and Future Work
• Issues specific to the Hebrew-to-English system:
  – Further improvements in the translation lexicon and morphological analyzer
  – Manual grammar development
  – Acquiring/training of word-to-word translation probabilities
  – Acquiring/training of a Hebrew language model at a post-morphology level that can help with disambiguation
• General issues related to the XFER framework:
  – Effective pruning during full lattice construction
  – Effective model for assigning scores to transfer rules
  – Extending the decoder to incorporate rule scores
  – Improved grammar learning
Conclusions
• Test case for the CMU XFER framework for rapid MT prototyping
• A two-month, three-person effort – we were quite happy with the outcome
• Core concept of XFER + Decoder is very powerful and promising
• We experienced the main bottlenecks of knowledge acquisition for MT: morphology, translation lexicons, grammar...
Questions?
Learning Transfer-Rules for Languages with Limited Resources
• Rationale:
  – Large bilingual corpora not available
  – Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using an elicitation tool
  – Elicitation corpus designed to be typologically comprehensive and compositional
  – Transfer-rule engine and new learning approach support acquisition of generalized transfer rules from the data
English-Hindi Example
Rule Learning - Overview
• Goal: acquire syntactic transfer rules
• Use available knowledge from the source side (grammatical structure)
• Three steps:
  1. Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
  2. Compositionality: use previously learned rules to add hierarchical structure
  3. Seeded Version Space Learning: refine rules by learning appropriate feature constraints
Flat Seed Rule Generation
Learning Example: NP
Eng: the big apple
Heb: ha-tapuax ha-gadol
Generated Seed Rule:
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2))
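The seed generation step above can be sketched as a string-building function: given the POS sequences of the two sides and the word alignment, it emits a flat rule with one Xi::Yj pair per alignment link. This is a simplification of the learner described on this slide; the function name and signature are illustrative.

```python
def seed_rule(src_pos, tgt_pos, alignment, cat="NP"):
    """Build a flat seed transfer rule from POS-tagged, word-aligned
    example phrases. alignment is a list of (source_index, target_index)
    pairs, 1-based as in the Xi/Yj notation."""
    x_side = " ".join(src_pos)
    y_side = " ".join(tgt_pos)
    pairs = " ".join(f"(X{i}::Y{j})" for i, j in alignment)
    return f"{cat}::{cat} [{x_side}] -> [{y_side}] ({pairs})"
```

For the example above ("the big apple" / "ha-tapuax ha-gadol"), the alignment links the English article to both Hebrew articles, reproducing the one-to-many (X1::Y1), (X1::Y3) pattern in the generated seed rule.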
Compositionality

Initial Flat Rules:

S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))

NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

NP::NP [ART N] -> [ART N]
((X1::Y1) (X2::Y2))

Generated Compositional Rule:

S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))
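The category-sequence part of this step can be sketched as substring replacement: where the x- and y-sides of a previously learned rule match contiguous sub-sequences of a flat rule, those sub-sequences are replaced by the lower rule's constituent label. This sketch handles only the category sequences; the real learner must also re-index the alignments and constraints.

```python
def compose(flat_x, flat_y, sub_x, sub_y, label):
    """Replace one occurrence of (sub_x, sub_y) in the x- and y-sides
    of a flat rule with the lower rule's constituent label."""
    def sub(seq, pat):
        for i in range(len(seq) - len(pat) + 1):
            if seq[i:i + len(pat)] == pat:
                return seq[:i] + [label] + seq[i + len(pat):]
        return seq  # pattern absent: leave the side unchanged
    return sub(flat_x, sub_x), sub(flat_y, sub_y)
```

Applying the [ART ADJ N] rule and then the [ART N] rule to the flat S rule above reduces its sides to [NP V NP] and [NP V P NP], exactly the generated compositional rule.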
Seeded Version Space Learning

Input: Rules and their Example Sets

S::S [NP V NP] -> [NP V P NP]    {ex1,ex12,ex17,ex26}
((X1::Y1) (X2::Y2) (X3::Y4))

NP::NP [ART ADJ N] -> [ART N ART ADJ]    {ex2,ex3,ex13}
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

NP::NP [ART N] -> [ART N]    {ex4,ex5,ex6,ex8,ex10,ex11}
((X1::Y1) (X2::Y2))

Output: Rules with Feature Constraints:

S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4)
 ((X1 NUM) = (X2 NUM))
 ((Y1 NUM) = (Y2 NUM))
 ((X1 NUM) = (Y1 NUM)))
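The core test behind such constraint learning can be sketched simply: an agreement constraint like ((X1 NUM) = (X2 NUM)) is only justified if the two constituents share that feature's value in every example the rule covers. The helper and the example feature structures below are illustrative, not drawn from the actual elicitation corpus.

```python
def agrees(examples, a, b, feat):
    """True iff constituents a and b have the same value for feat
    in every covered example (the evidence for an agreement constraint)."""
    return all(ex[a].get(feat) == ex[b].get(feat) for ex in examples)

# Hypothetical feature structures for two examples covered by the S rule:
examples = [
    {"X1": {"NUM": "s"}, "X2": {"NUM": "s"}},  # singular subject + verb
    {"X1": {"NUM": "p"}, "X2": {"NUM": "p"}},  # plural subject + verb
]
```

A single covered example where the values diverge is enough to rule the constraint out, which is what drives the generalization from specific seed rules toward constraints that hold across the whole example set.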