Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System
Alon Lavie, Language Technologies Institute
Carnegie Mellon University
Joint work with: Shuly Wintner, Yaniv Eytani (University of Haifa); Erik Peterson, Katharina Probst (Carnegie Mellon)
October 4, 2004 TMI-2004 2
Outline
• Hebrew and its Challenges for MT
• CMU Transfer-based MT Framework
• Hebrew-to-English System
• Input Pre-processing and Morphological Analysis
• MT Resources: Lexicon and Grammar
• Performance Evaluation
• Conclusions, Current and Future Work
Modern Hebrew
• Native language of about 3-4 million people in Israel
• Semitic language, closely related to Arabic and with similar linguistic properties
  – Root+Pattern word formation system
  – Rich verb and noun morphology
  – Particles attach as prefixes to the following word: definite article (H), prepositions (B,K,L,M), coordinating conjunction (W), relativizers ($,K$)…
• Unique alphabet and writing system
  – 22 letters represent (mostly) consonants
  – Vowels represented (mostly) by diacritics
  – Modern texts omit the diacritic vowels, adding a level of ambiguity: one “bare” word can correspond to several vocalized words
  – Example: MHGR → mehager, m+hagar, m+h+ger
Modern Hebrew Spelling
• Two main spelling variants
  – “KTIV XASER” (deficient): spelling intended for use with the vowel diacritics; when the diacritics are removed, only the consonants remain
  – “KTIV MALEH” (full): words with I/O/U vowels are written with long vowels, which include a letter
• KTIV MALEH is predominant, but not strictly adhered to, even in newspapers and official publications → inconsistent spelling
• Example:
  – niqud (vowel pointing) can be spelled NIQWD, NQWD, NQD
  – Written as NQD, it could also be niqed, naqed, nuqad
Challenges for Hebrew MT
• Paucity of existing language resources for Hebrew
  – No publicly available broad-coverage morphological analyzer
  – No publicly available bilingual lexicons or dictionaries
  – No POS-tagged corpus or parse tree-bank for Hebrew
  – No large Hebrew/English parallel corpus
• Scenario well suited for the CMU transfer-based MT framework for languages with limited resources
Transfer Engine (system architecture)

Pipeline: Hebrew Input → Preprocessing → Morphology → Transfer Engine → Decoder → English Output

Hebrew Input: בשורה הבאה

Transfer Rules:
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1) (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Translation Lexicon:
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "LINE"))

Translation Output Lattice:
(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)

English Language Model → Decoder → English Output: "in the next line"
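The translation output lattice shown above can be sketched as a set of span-labeled edges. This is a minimal illustrative representation, not the actual XFER data structure; the `Edge` class and `edges_from` helper are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    start: int  # first span index, as printed in the lattice listing
    end: int    # last span index, as printed in the lattice listing
    text: str   # translated fragment
    cat: str    # category label, e.g. "@PREP", "@NP"

lattice = [
    Edge(0, 1, "IN", "@PREP"),
    Edge(1, 1, "THE", "@DET"),
    Edge(2, 2, "LINE", "@N"),
    Edge(1, 2, "THE LINE", "@NP"),
    Edge(0, 2, "IN LINE", "@PP"),
    Edge(0, 4, "IN THE NEXT LINE", "@PP"),
]

def edges_from(lattice, pos):
    """All transferred fragments that start at a given source position."""
    return [e for e in lattice if e.start == pos]
```

The decoder (described later) then searches this edge set for the best-scoring sequence of fragments covering the whole input.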
Transfer Rule Formalism
Components annotated in the rule below:
• Type information
• Part-of-speech/constituent information
• Alignments
• x-side constraints
• y-side constraints
• xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER)))
Transfer Rule Formalism (II)
Constraint types annotated in the rule below:
• Value constraints (e.g. ((X1 AGR) = *3-SING))
• Agreement constraints (e.g. ((Y2 GENDER) = (Y4 GENDER)))

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER)))
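As a reading aid, the rule above can be rendered as plain data: constituent sequences, alignment pairs, and constraints. The dictionary layout and the `reorder` helper below are illustrative only; the engine has its own rule language and unifier.

```python
# Hypothetical Python rendering of the NP rule above.
# Note the one-to-many alignment: X1 (the source DET) maps to both
# Y1 and Y3, since Hebrew repeats the definite article on the adjective.
rule = {
    "x_side": ["DET", "ADJ", "N"],            # source constituents
    "y_side": ["DET", "N", "DET", "ADJ"],     # target constituents
    "alignments": [(1, 1), (1, 3), (2, 4), (3, 2)],  # Xi::Yj pairs
    "x_constraints": [("X1", "AGR", "*3-SING"), ("X1", "DEF", "*DEF"),
                      ("X3", "AGR", "*3-SING"), ("X3", "COUNT", "+")],
    "y_constraints": [("Y1", "DEF", "*DEF"), ("Y3", "DEF", "*DEF"),
                      ("Y2", "AGR", "*3-SING")],
    "xy_constraints": [(("Y2", "GENDER"), ("Y4", "GENDER"))],
}

def reorder(rule, source_tokens):
    """Apply only the alignment part of the rule: place each source
    constituent at its aligned target position(s)."""
    out = [None] * len(rule["y_side"])
    for xi, yj in rule["alignments"]:
        out[yj - 1] = source_tokens[xi - 1]
    return out
```

For example, `reorder(rule, ["the", "old", "man"])` yields the Hebrew-order sequence `["the", "man", "the", "old"]`, mirroring ha-ish ha-zaqen. The constraints are listed but not enforced in this sketch; enforcing them requires unification, as the engine does.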
The Transfer Engine

Analysis: Source text is parsed into its grammatical structure. Determines transfer application ordering.
Example: 他 看 书。 (he read book)
  (S (NP (N 他)) (VP (V 看) (NP 书)))

Transfer: A target language tree is created by reordering, insertion, and deletion.
  (S (NP (N he)) (VP (V read) (NP (DET a) (N book))))
Article “a” is inserted into the object NP. Source words are translated with the transfer lexicon.

Generation: Target language constraints are checked and the final translation is produced.
E.g. “reads” is chosen over “read” to agree with “he”.

Final translation: “He reads a book”
XFER + Decoder
• XFER engine produces a lattice of all possible transferred fragments
• Decoder searches for and selects the best-scoring sequence of fragments as the final translation output
• Main advantages:
  – Very high robustness
    • always some translation output
    • no transfer grammar → word-to-word translation
  – Scoring can take into account word-to-word translation probabilities, transfer rule scores, and a target statistical language model
  – Effective framework for late-stage disambiguation
• Main difficulty: the lattice can get very large → pruning
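The fragment-selection step can be sketched as a forward dynamic program over the lattice. This is a simplification under stated assumptions: each edge is given as `(start, end, text, score)` with an exclusive end position and a single precombined score; the real decoder also applies the target language model across fragment boundaries and prunes.

```python
def best_path(edges, n):
    """edges: list of (start, end_exclusive, text, score);
    n: number of input positions to cover.
    Returns (best_score, fragment_list) covering positions 0..n,
    or None if no covering sequence exists."""
    best = {0: (0.0, [])}  # position -> (score so far, fragments so far)
    for pos in range(n):
        if pos not in best:
            continue
        score, frags = best[pos]
        for s, e, text, w in edges:
            if s == pos:
                cand = (score + w, frags + [text])
                if e not in best or cand[0] > best[e][0]:
                    best[e] = cand
    return best.get(n)
```

For instance, with edges `[(0, 1, "IN", 0.9), (1, 2, "THE", 0.8), (2, 3, "LINE", 0.7), (0, 3, "IN THE NEXT LINE", 2.6)]` the single long fragment (score 2.6) beats the word-by-word path (score 2.4), illustrating late-stage disambiguation by score.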
Hebrew Text Encoding Issues
• Input texts are (most commonly) in the standard Windows encoding for Hebrew, but also Unicode (UTF-8) and others…
• The morphological analyzer and other resources are already set up to work in a romanized “ASCII-like” representation
• A converter script maps the input into the romanized representation – a 1-to-1 mapping!
• All further processing is done in the romanized representation
• Lexicon and grammar rules are also converted into the romanized representation
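A 1-to-1 letter mapping of this kind can be sketched as a simple table lookup. The table below is an assumption: it is reconstructed to be consistent with the transliterations used in these slides (e.g. בשורה → B$WRH, שעה → $&H), and the actual converter script may differ in details. Final letter forms map to the same symbol as their non-final counterparts.

```python
# Romanization table consistent with the examples in these slides
# (reconstructed, not the project's actual table).
HEB2ASCII = {
    "א": "A", "ב": "B", "ג": "G", "ד": "D", "ה": "H", "ו": "W",
    "ז": "Z", "ח": "X", "ט": "@", "י": "I", "כ": "K", "ך": "K",
    "ל": "L", "מ": "M", "ם": "M", "נ": "N", "ן": "N", "ס": "S",
    "ע": "&", "פ": "P", "ף": "P", "צ": "C", "ץ": "C", "ק": "Q",
    "ר": "R", "ש": "$", "ת": "T",
}

def romanize(text):
    """Map each Hebrew letter to its ASCII symbol; pass others through,
    so spaces and punctuation survive unchanged."""
    return "".join(HEB2ASCII.get(ch, ch) for ch in text)
```

Because the mapping is 1-to-1 on letters, it is trivially invertible, which is what lets the lexicon and grammar live entirely in the romanized representation.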
Morphological Analyzer
• An analyzer program developed at the Technion was available; it works on Windows and, with minimal adaptation, on Linux
• Coverage is reasonable (for nouns, verbs, and adjectives)
• Produces all analyses, or a disambiguated analysis, for each word
• Output format includes lexeme (base form), POS, and morphological features
• Output was adapted to our representation needs (POS and feature mappings)
Morphological Processing
• Split attached prefixes and suffixes into separate words for translation
• Produce f-structures as output
• Convert feature-value codes to our conventions
• “All analyses mode”: all possible analyses for each input word are returned, represented in the form of an input lattice
• Analyzer installed as a server, integrated with the input pre-processor
Morphology Example
• Input word: B$WRH
0     1     2     3     4
|--------B$WRH--------|
|-----B-----|$WR|--H--|
|--B--|--H--|--$WRH---|
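The three segmentations in the figure can be checked mechanically: each analysis is a sequence of arcs over the positions 0..4, and a valid segmentation must tile the whole span with no gaps or overlaps. The labels and the `tiles` helper below are illustrative, and the arc spans follow the figure (the analyzer's exact indices may differ).

```python
# The three competing segmentations of B$WRH, as arc lists (start, end, lex).
analyses = {
    "noun":           [(0, 4, "B$WRH")],                       # "gospel/news"
    "prep+noun+poss": [(0, 2, "B"), (2, 3, "$WR"), (3, 4, "H")],
    "prep+det+noun":  [(0, 1, "B"), (1, 2, "H"), (2, 4, "$WRH")],
}

def tiles(segmentation, start, end):
    """True iff the arcs cover start..end contiguously, in order."""
    pos = start
    for s, e, _ in segmentation:
        if s != pos:
            return False
        pos = e
    return pos == end
```

All three segmentations tile 0..4, so all three are passed downstream in the input lattice and disambiguation is deferred to the decoder.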
Morphology Example (output f-structures)

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N)
    (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N)
    (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N)
    (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
Translation Lexicon
• Constructed our own Hebrew-to-English lexicon, based primarily on the existing “Dahan” H-to-E and E-to-H dictionary made available to us
• Coverage is not great, but not bad
  – Dahan H-to-E is about 15K translation pairs
  – Dahan E-to-H is about 7K translation pairs
• POS information on both sides
• No proper names or named entities
• Converted Dahan into our representation; added entries for missing closed-class items (pronouns, prepositions, etc.)
• Issue with spelling conventions
  – The Dahan dictionary uses deficient KTIV XASER
  – Developed conversion scripts for the most common patterns of verbs
  – Added/merged these into the resulting lexicon
• Target-side (English) morphological variants added into the lexicon
Translation Lexicon: Examples

PRO::PRO |: ["ANI"] -> ["I"]
((X1::Y1) ((X0 per) = 1) ((X0 num) = s) ((X0 case) = nom))

PRO::PRO |: ["ATH"] -> ["you"]
((X1::Y1) ((X0 per) = 2) ((X0 num) = s) ((X0 gen) = m) ((X0 case) = nom))

N::N |: ["$&H"] -> ["HOUR"]
((X1::Y1) ((X0 NUM) = s) ((Y0 NUM) = s) ((Y0 lex) = "HOUR"))

N::N |: ["$&H"] -> ["hours"]
((X1::Y1) ((Y0 NUM) = p) ((X0 NUM) = p) ((Y0 lex) = "HOUR"))
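The entries above show the lookup pattern: one Hebrew lexeme ($&H, “hour”) has one entry per English surface form, and feature constraints select between them. A minimal sketch of that selection, using a simplified dictionary form that stands in for the real lexicon format (the field names are illustrative):

```python
# Simplified stand-in for the lexicon entries shown above.
lexicon = [
    {"src": "$&H", "tgt": "HOUR",  "constraints": {"NUM": "s"}},
    {"src": "$&H", "tgt": "hours", "constraints": {"NUM": "p"}},
]

def translate(word, features):
    """Return target forms whose constraints are all satisfied by the
    morphological features of the analyzed source word."""
    return [e["tgt"] for e in lexicon
            if e["src"] == word
            and all(features.get(k) == v
                    for k, v in e["constraints"].items())]
```

With the plural analysis from the morphological analyzer, only the "hours" entry fires; with the singular analysis, only "HOUR" does. In the real system this filtering happens by unification against the f-structure.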
Transfer Grammar (human-developed)
• Written by Alon in a few days…
• Current grammar has 36 rules:
  – 21 NP rules
  – one PP rule
  – 6 verb complexes and VP rules
  – 8 higher-phrase and sentence-level rules
• Captures the most common (mostly local) structural differences between Hebrew and English
Transfer Grammar: Example Rules

{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1) (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1) (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))
Sample Output (dev-data)
maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat
a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police
in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money
Evaluation Results
• Test set of 62 sentences from Haaretz newspaper, 2 reference translations
System BLEU NIST P R METEOR
No Gram 0.0616 3.4109 0.4090 0.4427 0.3298
Learned 0.0774 3.5451 0.4189 0.4488 0.3478
Manual 0.1026 3.7789 0.4334 0.4474 0.3617
Current and Future Work
• Issues specific to the Hebrew-to-English system:
  – Further improvements in the translation lexicon and morphological analyzer
  – Manual grammar development
  – Acquiring/training of word-to-word translation probabilities
  – Acquiring/training of a Hebrew language model at a post-morphology level that can help with disambiguation
• General issues related to the XFER framework:
  – Effective pruning during full lattice construction
  – Effective model for assigning scores to transfer rules
  – Extending the decoder to incorporate rule scores
  – Improved grammar learning
Conclusions
• Test case for the CMU XFER framework for rapid MT prototyping
• A two-month, three-person effort – we were quite happy with the outcome
• Core concept of XFER + Decoder is very powerful and promising
• We experienced the main bottlenecks of knowledge acquisition for MT: morphology, translation lexicons, grammar...
Questions?
Learning Transfer-Rules for Languages with Limited Resources
• Rationale:
  – Large bilingual corpora not available
  – Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using an elicitation tool
  – Elicitation corpus designed to be typologically comprehensive and compositional
  – Transfer-rule engine and new learning approach support acquisition of generalized transfer rules from the data
English-Hindi Example
Rule Learning - Overview
• Goal: acquire syntactic transfer rules
• Use available knowledge from the source side (grammatical structure)
• Three steps:
  1. Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
  2. Compositionality: use previously learned rules to add hierarchical structure
  3. Seeded Version Space Learning: refine rules by learning appropriate feature constraints
Flat Seed Rule Generation
Learning Example: NP
Eng: the big apple
Heb: ha-tapuax ha-gadol
Generated Seed Rule:
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2))
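The seed generation step above can be sketched as a string-building function: given the POS sequences of the two sides and the word alignment, it emits a flat rule with one Xi::Yj pair per alignment link. This is a simplification of the learner described on this slide; the function name and signature are illustrative.

```python
def seed_rule(src_pos, tgt_pos, alignment, cat="NP"):
    """Build a flat seed transfer rule from POS-tagged, word-aligned
    example phrases. alignment is a list of (source_index, target_index)
    pairs, 1-based as in the Xi/Yj notation."""
    x_side = " ".join(src_pos)
    y_side = " ".join(tgt_pos)
    pairs = " ".join(f"(X{i}::Y{j})" for i, j in alignment)
    return f"{cat}::{cat} [{x_side}] -> [{y_side}] ({pairs})"
```

For the example above ("the big apple" / "ha-tapuax ha-gadol"), the alignment links the English article to both Hebrew articles, reproducing the one-to-many (X1::Y1), (X1::Y3) pattern in the generated seed rule.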
Compositionality

Initial Flat Rules:

S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))

NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

NP::NP [ART N] -> [ART N]
((X1::Y1) (X2::Y2))

Generated Compositional Rule:

S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))
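The category-sequence part of this step can be sketched as substring replacement: where the x- and y-sides of a previously learned rule match contiguous sub-sequences of a flat rule, those sub-sequences are replaced by the lower rule's constituent label. This sketch handles only the category sequences; the real learner must also re-index the alignments and constraints.

```python
def compose(flat_x, flat_y, sub_x, sub_y, label):
    """Replace one occurrence of (sub_x, sub_y) in the x- and y-sides
    of a flat rule with the lower rule's constituent label."""
    def sub(seq, pat):
        for i in range(len(seq) - len(pat) + 1):
            if seq[i:i + len(pat)] == pat:
                return seq[:i] + [label] + seq[i + len(pat):]
        return seq  # pattern absent: leave the side unchanged
    return sub(flat_x, sub_x), sub(flat_y, sub_y)
```

Applying the [ART ADJ N] rule and then the [ART N] rule to the flat S rule above reduces its sides to [NP V NP] and [NP V P NP], exactly the generated compositional rule.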
Seeded Version Space Learning

Input: Rules and their Example Sets

S::S [NP V NP] -> [NP V P NP]    {ex1,ex12,ex17,ex26}
((X1::Y1) (X2::Y2) (X3::Y4))

NP::NP [ART ADJ N] -> [ART N ART ADJ]    {ex2,ex3,ex13}
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

NP::NP [ART N] -> [ART N]    {ex4,ex5,ex6,ex8,ex10,ex11}
((X1::Y1) (X2::Y2))

Output: Rules with Feature Constraints:

S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4)
 ((X1 NUM) = (X2 NUM))
 ((Y1 NUM) = (Y2 NUM))
 ((X1 NUM) = (Y1 NUM)))
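The core test behind such constraint learning can be sketched simply: an agreement constraint like ((X1 NUM) = (X2 NUM)) is only justified if the two constituents share that feature's value in every example the rule covers. The helper and the example feature structures below are illustrative, not drawn from the actual elicitation corpus.

```python
def agrees(examples, a, b, feat):
    """True iff constituents a and b have the same value for feat
    in every covered example (the evidence for an agreement constraint)."""
    return all(ex[a].get(feat) == ex[b].get(feat) for ex in examples)

# Hypothetical feature structures for two examples covered by the S rule:
examples = [
    {"X1": {"NUM": "s"}, "X2": {"NUM": "s"}},  # singular subject + verb
    {"X1": {"NUM": "p"}, "X2": {"NUM": "p"}},  # plural subject + verb
]
```

A single covered example where the values diverge is enough to rule the constraint out, which is what drives the generalization from specific seed rules toward constraints that hold across the whole example set.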