Towards Syntactically Constrained Statistical Word Alignment


Transcript of Towards Syntactically Constrained Statistical Word Alignment

Page 1

Towards Syntactically Constrained Statistical Word Alignment

Greg Hanneman

11-734: Advanced Machine Translation Seminar
April 30, 2008

Page 2

Outline

• The word alignment problem

• Base approaches

• Syntax-based approaches

– Distortion models

– Tree-to-string models

– Tree-to-tree models

• Discussion

Page 3

Word Alignment

• Parallel sentence pair: F and E

• Most general: map a subset of F to a subset of E

Page 4

Word Alignment

• Very large alignment spaces!

– An n-word parallel sentence pair has n² possible links, and therefore 2^(n²) possible alignments

– Restricted to one-to-one alignments: n! possible alignments

• Alignment models restrict this space, or learn a probability distribution over it, in order to find the “best” alignment of a sentence pair
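These counts are easy to verify. A minimal sketch (not from the original slides) for a sentence pair where both sides have n words:

```python
import math

def alignment_spaces(n: int):
    """Sizes of the alignment spaces for an n-word parallel sentence pair."""
    links = n * n                   # every (e_i, f_j) pair is a candidate link
    unconstrained = 2 ** links      # each link is independently on or off
    one_to_one = math.factorial(n)  # permutations: each f_j takes a distinct e_i
    return links, unconstrained, one_to_one

for n in (2, 5, 10):
    links, total, perms = alignment_spaces(n)
    print(f"n={n}: {links} links, 2^{links} = {total:.2e} alignments, n! = {perms}")
```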

Page 5

Outline

• The word alignment problem

• Base approaches

• Syntax-based approaches

– Distortion models

– Tree-to-string models

– Tree-to-tree models

• Discussion

Page 6

A Generative Story [Brown et al. 1990]

[Figure: Brown et al.’s generative story. The English sentence “The proposal will not be implemented” passes through fertility, lexical generation, and distortion steps, yielding the French sentence “Les propositions ne seront pas mises en application”.]

Page 7

The Framework

• F: words f_1 … f_j … f_n

• E: words e_1 … e_i … e_m

• Compute P(F, A | E) for hidden alignment variable A: a_1 … a_j … a_n

– The major step: decomposition, model parameters, EM algorithm, etc.

• a_j = i: word f_j is aligned to word e_i
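To make the decomposition step concrete, here is the standard IBM Model 1 form of P(F, A | E) in the notation above (this equation is not printed on the slide): all alignments are equally likely, so only the lexical translation term remains, with e_0 serving as the NULL word.

```latex
% IBM Model 1 in the notation above (n = |F|, m = |E|, plus a NULL word e_0):
P(F, A \mid E) = \frac{\epsilon}{(m+1)^{n}} \prod_{j=1}^{n} t(f_j \mid e_{a_j})
```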

Page 8

The IBM Models [Brown et al. 1993; Och and Ney 2003]

• Model 1: “Bag of words” — word order doesn’t affect alignment

• Model 2: Position of words being aligned does matter
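As a sketch of how such a model is trained, here is a toy EM loop for Model 1 (my own minimal implementation, with None standing in for the NULL word; the real training pipeline is far more involved):

```python
from collections import defaultdict

def model1_em(corpus, iterations=10):
    """Toy EM training for IBM Model 1's lexical table t(f | e).

    corpus: list of (f_words, e_words) pairs; each e_words list includes
    None to play the role of the NULL word.
    """
    f_vocab = {f for f_words, _ in corpus for f in f_words}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform start

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for f_words, e_words in corpus:
            for f in f_words:
                # E-step: P(f links to e) is proportional to t(f | e).
                z = sum(t[f, e] for e in e_words)
                for e in e_words:
                    p = t[f, e] / z
                    count[f, e] += p
                    total[e] += p
        for (f, e), c in count.items():  # M-step: renormalize
            t[f, e] = c / total[e]
    return t

corpus = [("la maison".split(), [None] + "the house".split()),
          ("la fleur".split(), [None] + "the flower".split())]
t = model1_em(corpus)
print(round(t["maison", "house"], 3))  # climbs well above the uniform start
```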

Page 9

The IBM Models [Brown et al. 1993; Och and Ney 2003]

• Later models use more structural or linguistic information, but only implicitly; there is no overt syntax

– Fertility: P(φ | e_i) of e_i producing φ words in F

– Distortion: P(τ, π | E) for a set of F words τ in a permutation π

– Previous alignments: probabilities for the positions in F of the different words of a fertile e_i

Page 10

The HMM Model [Vogel et al. 1996; Och and Ney 2003]

• Linguistic intuition: words, and their alignments, tend to clump together in clusters

• a_j depends on the absolute size of the “jump” between it and a_{j-1}
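In equation form (standard for the HMM alignment model, in the notation from page 7; not printed on the slide), with c(·) a learned count table over jump widths:

```latex
% HMM alignment model: lexical translation plus a jump-width distortion term.
P(F, A \mid E) = \prod_{j=1}^{n} p(a_j \mid a_{j-1}, m)\; t(f_j \mid e_{a_j}),
\qquad p(i \mid i', m) = \frac{c(i - i')}{\sum_{i''} c(i'' - i')}
```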

Page 11

Discriminative Training

• Consider all possible alignments, score them, and pick the best ones under some set of constraints

• Can incorporate arbitrary features; generative models more fixed

• Generative models’ EM requires lots of unlabeled training data; discriminative requires some labeled data

Page 12

Discriminative Alignment [Taskar et al. 2005]

• Score each candidate link (e_i, f_j) with a weighted feature vector; example features:

– Co-occurrence

– Position difference

– Co-occurrence of following words

– Word-frequency rank

– Model 4 prediction

– …

[Figure: candidate links between “The proposal will not be implemented” and “Les propositions ne seront pas mises en application”, each scored as v(e_i, f_j) = w^T f(e_i, f_j).]
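Under the hood this is a maximum-weight bipartite matching problem. A toy sketch follows; the SEED table, feature functions, and weights here are invented for illustration, and Taskar et al. actually learn the weights with a large-margin method rather than fixing them by hand:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

E = "The proposal will not be implemented".split()
F = "Les propositions ne seront pas mises en application".split()

# Hypothetical stand-ins for learned features: a seed co-occurrence table
# and a position-difference penalty (two of the features listed above).
SEED = {("The", "Les"), ("proposal", "propositions"), ("not", "pas")}
w = np.array([1.0, -0.5])  # one weight per feature

def feats(e, f, i, j):
    return np.array([float((e, f) in SEED), abs(i / len(E) - j / len(F))])

# v(e_i, f_j) = w . f(e_i, f_j) for every candidate link.
S = np.array([[w @ feats(e, f, i, j) for j, f in enumerate(F)]
              for i, e in enumerate(E)])

rows, cols = linear_sum_assignment(-S)  # best one-to-one matching
print([(E[i], F[j]) for i, j in zip(rows, cols) if S[i, j] > 0])
```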

Page 13

Outline

• The word alignment problem

• Base approaches

• Syntax-based approaches

– Distortion models

– Tree-to-string models

– Tree-to-tree models

• Discussion

Page 14

Syntax-Based Approaches

• Constrain alignment space by looking beyond flat text stream: take higher-level sentence structure into account

• Representations

– Constituency structure

– Inversion Transduction Grammar

– Dependency structure

Page 15

An MT Motivation

Page 16

Syntax-Based Distortion [DeNero and Klein 2007]

• Syntax-based MT should start from syntax-aware word alignments

• HMM model + target-language parse trees: prefer alignments that respect tree

• Handled in distortion model: jumps should reflect tree structure

Page 17

Syntax-Based Distortion [DeNero and Klein 2007]

• HMM distortion: size of the jump between a_{j-1} and a_j

• Syntactic distortion: tree path between a_{j-1} and a_j
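One way to picture the syntactic distortion term: walk from the previous aligned word’s leaf up to the lowest common ancestor and back down. A sketch under my own encoding (DeNero and Klein bucket such paths into distortion classes; the exact featurization differs):

```python
def tree_path(path_a, path_b):
    """Tree path between two leaves, given root-to-leaf node lists.

    Node labels are assumed unique (e.g., "NP-1" vs. "NP-2"). Returns the
    internal nodes popped walking up from leaf a and pushed walking down
    to leaf b, relative to their lowest common ancestor.
    """
    k = 0
    while k < min(len(path_a), len(path_b)) and path_a[k] == path_b[k]:
        k += 1
    up = list(reversed(path_a[k:-1]))  # nodes exited (leaf itself excluded)
    down = path_b[k:-1]                # nodes entered
    return up, down

# Leaves of (S (NP-1 (PRP I)) (VP (VB saw) (NP-2 (DT the) (NN dog)))):
leaves = {"I":   ["S", "NP-1", "PRP"],
          "saw": ["S", "VB"] and ["S", "VP", "VB"],
          "the": ["S", "VP", "NP-2", "DT"],
          "dog": ["S", "VP", "NP-2", "NN"]}
print(tree_path(leaves["saw"], leaves["dog"]))  # ([], ['NP-2']): hop into the NP
print(tree_path(leaves["I"], leaves["dog"]))    # (['NP-1'], ['VP', 'NP-2'])
```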

Page 18

Syntax-Based Distortion [DeNero and Klein 2007]

• Training: 100,000 parallel French–English and Chinese–English sentences with English parse trees

• Both E→F and F→E directions; combined with different unions and intersections, plus thresholds

• Test: Hand-aligned Hansards and NIST MT 2002 data

Page 19

Syntax-Based Distortion [DeNero and Klein 2007]

• HMMs roughly equal, better than GIZA++

• Soft union for French; hard union for Chinese; competitive thresholding

Page 20

Tree-to-String Models

Page 21

Tree-to-String Models

• New generative story

• Word-level fertility and distortion replaced with node insertion and sibling reordering

• Lexical translation still the same

• Word alignment produced as a side effect from lexical translations

Page 22

Tree-to-String Alignment [Yamada and Knight 2001]

• Discussed in other sessions this semester

• Training: 2121 short Japanese–English sentences, modified Collins parser output for English

• Test: First 50 sentences of training corpus

• Beat IBM Model 5 on human judgements; perplexity between Model 1 and Model 5

Page 23

Subtree Cloning [Gildea 2003]

• Original tree-to-string model is too strict

– Syntactic divergences, reordering

• Soft constraint: allow alignments that violate tree structure, but at a cost

– Tweak the tree side of the alignment to contain things needed for the string side

– Ex.: SVO to OSV

Page 24

Subtree Cloning [Gildea 2003]

[Figure: English parse tree for “I do entirely understand your language”, with an NP subtree (PRP “I”) shown as the candidate for cloning.]

Page 25

Subtree Cloning [Gildea 2003]

[Figure: the same parse tree with a cloned copy of the NP (PRP “I”) inserted as a new child.]

Page 26

Subtree Cloning [Gildea 2003]

[Figure: the cloned tree aligned word-by-word to the source-language sentence; some source words align to NULL.]

Page 27

Subtree Cloning [Gildea 2003]

• For a node n_p (written out below):

– Probability of cloning something as a new child of n_p: a single EM-learned constant for all n_p

– Probability of making that clone a copy of node n_c: uniform over all n_c

• Surprising that this works…
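Written out, with p_ins and N as my own labels for the slide’s “single EM-learned constant” and the number of candidate source nodes:

```latex
% Gildea's (2003) cloning parameters, as summarized on the slide:
P(\text{insert a clone under } n_p) = p_{\text{ins}}
\qquad
P(\text{the clone copies } n_c \mid \text{insertion}) = \frac{1}{N}
```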

Page 28

Subtree Cloning [Gildea 2003]

• Compared with IBM 1–3, basic tree-to-string, basic tree-to-tree models

• Training: 4982 Korean–English sentence pairs, with manual Korean parse trees

• Test: 101 hand-aligned held-out sentences

Page 29

Subtree Cloning [Gildea 2003]

• Cloning helps: as good or better than IBM

• Tree-to-tree model runs faster

Page 30

Tree-to-Tree Models

• Alignment must conform to tree structure on both sides — space is more constrained

• Requires more transformation operations to handle divergent structures [Gildea 2003]

• Or we could be more permissive…

Page 31

Inversion Transduction Grammar [Wu 1997]

• For bilingual parsing; get one-to-one word alignment as a side effect

• Parallel binary-branching trees with reordering

Page 32

ITG Operations

• A → [A A]

– Produce “A_1 A_2” in both the source and target streams

• A → <A A>

– Produce “A_1 A_2” in the source stream, “A_2 A_1” in the target stream

• A → e / f

– Produce “e” in the source stream, “f” in the target stream

Page 33

ITG Operations

• “Canonical form” ITG produces only one derivation for a given alignment

– S → A | B | C

– A → [A B] | [B B] | [C B] | [A C] | [B C] | [C C]

– B → <A A> | <B A> | <C A> | <A C> | <B C> | <C C>

– C → e / f
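The two composition rules generate exactly the “separable” permutations, so it is easy to measure how much of the permutation space ITG keeps. A small enumeration sketch (mine, not from the slides):

```python
import math

def itg_permutations(n):
    """Reorderings of 0..n-1 reachable by an ITG (separable permutations).

    A straight combination A -> [A A] concatenates two smaller permutations;
    an inverted combination A -> <A A> swaps which value block comes first.
    """
    sep = {1: {(0,)}}
    for size in range(2, n + 1):
        out = set()
        for k in range(1, size):
            for p in sep[k]:
                for q in sep[size - k]:
                    out.add(p + tuple(v + k for v in q))          # [A A]
                    out.add(tuple(v + size - k for v in p) + q)   # <A A>
        sep[size] = out
    return sep[n]

for n in range(1, 5):
    print(n, len(itg_permutations(n)), "of", math.factorial(n))
# n=4 prints "22 of 24": ITG rules out only the inside-out orders 2413 and 3142
```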

Page 34

Alignment with ITG [Zhang and Gildea 2004]

• Compared IBM 1, IBM 4, ITG, and tree-to-string (with and without cloning)

• Training: Chinese–English (18,773) and French–English (20,000) sentences less than 25 words long

• Test: Hand-aligned Chinese–English (48) and French–English (447)

Page 35

Alignment with ITG [Zhang and Gildea 2004]

• ITG best, or at least as good as IBM or tree-to-string plus cloning

• ITG has no linguistic syntax…

Page 36

Dependency Parsing

• Discussed in other sessions this semester

• Notion of violating “phrasal cohesion”

– Usually bad, but not always

Page 37

Dependencies + ITG [Cherry and Lin 2006]

• Find invalid dependency spans; assign score of –∞ if used by the ITG parser

• Simple model: maximize co-occurrence score with penalty for distant words

• ITG reduces AER by 13% relative; dependencies + ITG reduce by 34%

v(e_i, f_j) = φ²(e_i, f_j) − 10⁻⁵ · |i/m − j/n|
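A sketch of the invalid-span test, under my own encoding (a dependency tree as a head array; a span is invalid if it cuts across some subtree’s span):

```python
def subtree_spans(heads):
    """Span of every dependency subtree, as (lo, hi) with hi exclusive.

    heads[i] is the index of word i's head, or -1 for the root.
    """
    n = len(heads)
    lo, hi = list(range(n)), [i + 1 for i in range(n)]
    for i in range(n):          # push each word's position up to all ancestors
        j = heads[i]
        while j != -1:
            lo[j], hi[j] = min(lo[j], i), max(hi[j], i + 1)
            j = heads[j]
    return {(lo[i], hi[i]) for i in range(n)}

def invalid(span, spans):
    """True if an ITG constituent `span` crosses some dependency span."""
    s, e = span
    return any(s < a < e < b or a < s < b < e for a, b in spans)

heads = [1, -1, 3, 1]           # "I saw the dog": root "saw", subtree "the dog"
print(invalid((1, 3), subtree_spans(heads)))  # True: "saw the" splits the NP
```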

Page 38

Dependencies + ITG [Cherry and Lin 2006]

• Discriminative training with an SVM

• Feature vector for each ITG rule instance

– Features from Taskar et al. [2005]

– Feature marking ITG inversion rules

– Feature (penalty) marking invalid spans based on dependency tree

Page 39

Dependencies + ITG [Cherry and Lin 2006]

• Compared Taskar et al. to D-ITG with hard and soft constraints

• Training: 50,000 French–English sentence pairs for counts and probabilities; 100 hand-annotated pairs with derived ITG trees for discriminative training

• Test: 347 hand-annotated sentences from 2003 parallel text workshop

Page 40

Dependencies + ITG [Cherry and Lin 2006]

• Relative improvement smaller in discriminative training scenario with stronger objective function

• Hard constraint starts to hurt recall

Page 41

Outline

• The word alignment problem

• Base approaches

• Syntax-based approaches

– Distortion models

– Tree-to-string models

– Tree-to-tree models

• Discussion

Page 42

All These Tradeoffs…

• Mathematical and statistical correctness vs. computability

• Simple model vs. capturing linguistic phenomena

• Not enough syntactic information vs. too much syntactic information

• Ruling out bad alignments vs. keeping good alignments around

Page 43

Alignment Spaces

• Completely unconstrained: every alignment link (e_i, f_j) either “on” or “off”

• Permutation space: one-to-one alignment with reordering [Taskar et al. 2005]

• ITG space: permutation space satisfying a binary tree constraint [Wu 1997]

• Dependency space: permutation space maintaining phrasal cohesion

Page 44

Alignment Spaces

• D-ITG space: Dependency ∩ ITG space [Cherry and Lin 2006]

• HD-ITG space: D-ITG space where each span must contain a head [Cherry and Lin 2006a]

Page 45

Examining Alignment Spaces [Cherry and Lin 2006a]

• Alignment score

– Learned co-occurrence score

– Gold-standard oracle score

Page 46

Examining Alignment Spaces [Cherry and Lin 2006a]

• Learned co-occurrence score

– More restricted spaces give better results

Page 47

Examining Alignment Spaces [Cherry and Lin 2006a]

• Oracle score: subsets of permutation space

– ITG rules out almost nothing correct

– Beam search in dependency space does worst

Page 48

Conclusions

• Base alignment models are purely mathematical, with limited notions of sentence structure

• Syntax-aware alignment helpful for syntax-aware MT [DeNero and Klein 2007]

• Using structure as a hard constraint is harmful for divergent sentences; tweaking trees [Gildea 2003] or using soft constraints [Cherry and Lin 2006] helps fix this

Page 49

Conclusions

• Surprise winner: ITG

– Computationally straightforward

– Permissive, simple grammar that mostly only rules out bad alignments [Cherry and Lin 2006a]

– Does a lot, even when it’s not the best

• Discriminative framework looks promising and flexible — can incorporate generative models as features [Taskar et al. 2005]

Page 50

Towards the Future

• Easy-to-run GIZA++ made complicated IBM models the norm — promising discriminative or syntax-based models currently lack such a toolkit

• Syntax-based discriminative techniques — morphology, POS, semantic information…

• Any other ideas?

Page 51

References

• Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin, “A statistical approach to machine translation,” Computational Linguistics, 16(2):79-85, 1990.

• Brown, P., S. Della Pietra, V. Della Pietra, and R. Mercer, “The mathematics of statistical machine translation: Parameter estimation,” Computational Linguistics, 19(2):263-311, 1993.

• Cherry, Colin and Dekang Lin, “Soft syntactic constraints for word alignment through discriminative training,” Proceedings of the COLING/ACL Poster Session, 105-112, 2006.

• Cherry, Colin and Dekang Lin, “A comparison of syntactically motivated alignment spaces,” Proceedings of EACL, 145-152, 2006a.

• DeNero, John and Dan Klein, “Tailoring word alignments to syntactic machine translation,” Proceedings of ACL, 17-24, 2007.

• Gildea, Daniel, “Loosely tree-based alignment for machine translation,” Proceedings of ACL, 80-87, 2003.

Page 52

References (continued)

• Och, Franz and Hermann Ney, “A systematic comparison of various statistical alignment models,” Computational Linguistics, 29(1):19-51, 2003.

• Taskar, B., S. Lacoste-Julien, and D. Klein, “A discriminative matching approach to word alignment,” Proceedings of HLT/EMNLP, 73-80, 2005.

• Vogel, S., H. Ney, and C. Tillmann, “HMM-based word alignment in statistical translation,” Proceedings of COLING, 836-841, 1996.

• Wu, Dekai, “Stochastic inversion transduction grammars and bilingual parsing of parallel corpora,” Computational Linguistics, 23(3):377-403, 1997.

• Yamada, Kenji and Kevin Knight, “A syntax-based statistical translation model,” Proceedings of ACL, 523-530, 2001.

• Zhang, Hao and Daniel Gildea, “Syntax-based alignment: Supervised or unsupervised?” Proceedings of COLING, 418-424, 2004.