Statistical NLP Lecture 18: Bayesian grammar induction & machine translation (Roger Levy, Department of Linguistics, UCSD)
Statistical NLP
Lecture 18: Bayesian grammar induction & machine translation
Roger Levy
Department of Linguistics, UCSD
Thanks to Percy Liang, Noah Smith, and Dan Klein for slides
Plan
1. Recent developments in Bayesian unsupervised grammar induction
• Nonparametric grammars
• Non-conjugate priors
2. A bit about machine translation
Nonparametric grammars
• Motivation: how many symbols should a grammar have?
• Really an open question
• “Let the data have a say”
Hierarchical Dirichlet Process PCFG
• Start with the standard Bayesian picture:
(Liang et al., 2007)
Grammar representation
• Liang et al. use Chomsky normal-form (CNF) grammars
• A CNF grammar has no ε-productions, and only has rules of the form:
  • X → Y Z [binary rewrite]
  • X → a [unary terminal production]
(slide contrasts an example CNF tree with a non-CNF tree)
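As a sketch, the CNF conditions can be checked over rules represented as (lhs, rhs) pairs; the uppercase-nonterminal convention below is just for this toy example:

```python
def is_cnf(rules):
    """Check that every rule is X -> Y Z (two nonterminals) or
    X -> a (one terminal). Nonterminals are all-uppercase strings
    purely by the convention of this sketch."""
    for lhs, rhs in rules:
        if len(rhs) == 2 and all(s.isupper() for s in rhs):
            continue  # binary rewrite X -> Y Z
        if len(rhs) == 1 and not rhs[0].isupper():
            continue  # unary terminal production X -> a
        return False  # epsilon, unary X -> Y, ternary rules, etc.
    return True

cnf = [("S", ("NP", "VP")), ("NP", ("the",)), ("VP", ("runs",))]
not_cnf = [("S", ("NP", "VP", "PP"))]
```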
HDP-PCFG defined
• Each grammar has a top-level distribution β over (non-terminal) symbols
• This distribution is a Dirichlet process (stick-breaking distribution; Sethuraman, 1994)
• So really there are infinitely many nonterminals
• Each nonterminal symbol has:
  • an emission distribution
  • a binary rule distribution
  • a distribution over which type of rule to use
The prior over symbols
• The Dirichlet Process controls expectations about symbol distributions
Binary rewrite rules
Inference
• Variational Bayes
• The tractable distribution is factored into data, top-level symbol, and rewrite components
Results
• Simple synthetic grammar (all rule probs equal):
• Successfully recovers sparse symbol structure (standard ML-PCFG fails)
Results on treebank parsing
• Binarize the Penn Treebank and erase category labels
• Try to recover the label structure, then parse sentences with the resulting grammar
• (plot compares against ML estimation)
Dependency grammar induction & other priors
• We’ll now cover work by Noah Smith and colleagues on unsupervised dependency grammar induction
• Highlight on: non-conjugate priors
• What types of priors are interesting to use?
Klein & Manning dependency recap
Klein and Manning’s DMV
• Probabilistic, unlexicalized dependency grammar over part-of-speech sequences, designed for unsupervised learning (Klein and Manning, 2004).
• Left and right arguments are independent; two states to handle valence.
(example: a dependency tree over the POS sequence $ Det Nsing Vpast Prep Adj Nsing .)
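The DMV generative story for one direction can be sketched as follows; the two valence states are "adjacent" (no dependent generated yet) versus not, and all probability tables below are invented for illustration, not taken from Klein and Manning:

```python
import random

# Toy DMV sketch: each head repeatedly decides whether to STOP in a
# direction (probability depends on the valence state), and if not,
# samples a dependent class. All numbers here are made up.
P_STOP = {("Vpast", True): 0.3, ("Vpast", False): 0.8,
          ("Nsing", True): 0.9, ("Nsing", False): 0.95}
P_CHILD = {"Vpast": [("Nsing", 0.7), ("Prep", 0.3)],
           "Nsing": [("Prep", 1.0)]}

def sample(pairs, rng):
    r = rng.random()
    for item, p in pairs:
        r -= p
        if r <= 0:
            return item
    return pairs[-1][0]  # guard against float rounding

def gen_right_dependents(head, rng, max_deps=5):
    deps, adjacent = [], True
    while len(deps) < max_deps:
        if rng.random() < P_STOP[(head, adjacent)]:
            break                      # STOP in this direction
        deps.append(sample(P_CHILD[head], rng))
        adjacent = False               # valence state flips after first dep
    return deps

deps = gen_right_dependents("Vpast", random.Random(1))
```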
Aside: Visual Notation
(plate-diagram legend: grammar G, sentences Xt and trees Yt for t = 1…T; node shading marks whether a variable is maximized over, integrated out, or observed)
EM for Maximum Likelihood Estimation
• E step: calculate exact posterior given current grammar
• M step: calculate best grammar, assuming current posterior
(plate diagram: grammar G maximized over, trees Yt integrated out, sentences Xt observed, t = 1…T)
Convenient Change of Variable
(change of variable: replace each tree Yt with its derivation-event counts Ft,e, one count per grammar event e ∈ E)
EM (Algorithmic View)
• E step: calculate derivation event posteriors given grammar
• M step: calculate best grammar using event posteriors
(plate diagram as before, with event counts Ft,e in place of trees Yt)
Maximum a Posteriori (MAP) Estimation
• The data are not the only source of information about the grammar.
• Robustness: the grammar should not have many zeroes. Smooth.
• This can be accomplished by putting a prior U on the grammar (Chen, 1995; Eisner, 2001, inter alia).
• The most computationally convenient prior is a Dirichlet, with α > 1.
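For a single multinomial, the Dirichlet-MAP M-step reduces to add-(α − 1) smoothing of the expected counts; a minimal sketch (function name and the toy counts are illustrative):

```python
def map_multinomial(expected_counts, alpha):
    """MAP estimate of a multinomial under a symmetric Dirichlet(alpha)
    prior: theta_k is proportional to (count_k + alpha - 1). With
    alpha > 1 every event keeps nonzero probability (smoothing);
    alpha = 1 recovers plain maximum likelihood."""
    smoothed = {k: c + alpha - 1.0 for k, c in expected_counts.items()}
    total = sum(smoothed.values())
    return {k: v / total for k, v in smoothed.items()}

# An unseen rewrite ("VP") keeps a small nonzero probability.
theta = map_multinomial({"NP VP": 8.0, "VP": 0.0}, alpha=1.1)
```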
MAP EM (Algorithmic View)
• E step: calculate derivation event posteriors given grammar
• M step: calculate best grammar using event posteriors
(plate diagram as before, with a prior U on the grammar G)
Experimental Results: EM and MAP EM
• Evaluation of learned grammar on a parsing task (unseen test data).
• Initialization and, for MAP, the smoothing hyperparameter “u” need to be chosen.
• Can do this with unlabeled dev data (modulo infinite cross-entropy), or labeled dev data (shown in blue on the slide).

Language | EM | EM | MAP | MAP
German | 40 | 20 | 54 |
English | 23 | 42 | 42 | 42
Bulgarian | 43 | 45 | 46 |
Mandarin | 40 | 49 | 37 | 50
Turkish | 32 | 42 | 41 | 48
Portuguese | 43 | 43 | 37 | 42
Smith (2006, ch. 8)
Structural Bias and Annealing
• Simple idea: use soft structural constraints to encourage structures that are more plausible.
• This affects the E step only; the final grammar takes the same form as usual.
• Here: “favor short dependencies.”
• Annealing: gradually shift this bias over time.
(plate diagram as before, with a structural-bias term B alongside the prior U)
Algorithmic Issues
• The structural-bias score for a tree needs to factor in such a way that dynamic-programming algorithms are still efficient.
• Equivalently, g and b, taken together, factor into local features.
• Idea explored here: the string distance between a word and its parent is penalized geometrically.
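A geometric distance penalty that factors over individual attachments might look like this; the decay constant and the minus-one offset are illustrative choices, not the exact parameterization in the slides:

```python
def biased_tree_score(grammar_score, head_child_pairs, delta=0.9):
    """Multiply the grammar's score g(tree) by a structural bias b(tree)
    that penalizes each dependency geometrically in its string distance:
    b = prod over attachments of delta ** (|i - j| - 1). Because the
    penalty factors over individual attachments, it drops into the same
    dynamic programs (inside/outside) used in the E step."""
    bias = 1.0
    for head_pos, child_pos in head_child_pairs:
        bias *= delta ** (abs(head_pos - child_pos) - 1)
    return grammar_score * bias

short = biased_tree_score(0.01, [(2, 1), (2, 3)])  # only adjacent deps
long_ = biased_tree_score(0.01, [(5, 1), (5, 9)])  # two long deps
```

With only adjacent dependencies the bias is 1, so the grammar score is unchanged; long dependencies shrink the score.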
Experimental Results: Structural Bias & Annealing
• Labeled dev data used to pick:
  • Initialization
  • Hyperparameter
  • Structural bias strength (for SB)
  • Annealing schedule (for SA)

Language | MAP | CE | SA
German | 54 | 63 | 72
English | 42 | 58 | 67
Bulgarian | 46 | 41 | 59
Mandarin | 50 | 41 | 58
Turkish | 48 | 59 | 62
Portuguese | 42 | 72 | 51
Smith (2006, ch. 8)
Correlating Grammar Events
• Observation by Blei and Lafferty (2006), regarding topic models: a multinomial over states that gives high probability to some states is likely to give high probability to other, correlated states.
• For us: a class that favors one type of dependent is likely to favor similar types of dependents.
  • If Vpast favors Nsing as a subject, it might also favor Nplural.
• In general, certain classes are likely to have correlated child distributions.
• Can we build a grammar-prior that encodes (and learns) these tendencies?
Logistic Normal Distribution over Multinomials
• Given: mean vector μ, covariance matrix Σ
• Draw a vector η from Normal(η; μ, Σ)
• Apply the softmax: p_k = exp(η_k) / Σ_j exp(η_j)
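A two-event draw can be sketched with plain library calls, building the 2×2 covariance from standard deviations and a correlation; all parameter values are illustrative, with the mean taken from the slide's example m = [0.4, 0.6]:

```python
import math
import random

def draw_logistic_normal(mu, sigma, rho, rng):
    """Draw a 2-event multinomial from a logistic normal: sample
    eta ~ Normal(mu, Sigma), where Sigma is built here from standard
    deviations sigma and correlation rho, then apply the softmax.
    Positive rho makes the two logits rise and fall together."""
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    eta1 = mu[0] + sigma[0] * z1
    eta2 = mu[1] + sigma[1] * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    m = max(eta1, eta2)                  # stabilize the exponentials
    e1, e2 = math.exp(eta1 - m), math.exp(eta2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)

p1, p2 = draw_logistic_normal(mu=(0.4, 0.6), sigma=(1.0, 1.0), rho=0.5,
                              rng=random.Random(0))
```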
Logistic Normal Distributions
(figures: densities over the two-event simplex, with axes p1 → 1 and p2 → 1; a draw η around the mean m = [0.4, 0.6] is pushed through the softmax, and varying μ and Σ reshapes the resulting distribution over (p1, p2))
Logistic Normal Grammar
(figure, built up over several slides: one Gaussian draw η1, η2, η3, …, ηn per multinomial in the grammar; each draw is passed through the softmax, and the resulting multinomials together constitute the grammar g)
Learning a Logistic Normal Grammar
• We use variational EM as before to achieve empirical Bayes; the result is a learned μ and Σ corresponding to each multinomial distribution in the grammar.
• The variational model for G also has a logistic normal form.
• Cohen et al. (2009) exploit tricks from Blei and Lafferty (2006), as well as the dynamic-programming trick for trees/derivation events used previously.
Experimental Results: EB
• Single initializer.
• MAP hyperparameter value is fixed at 1.1.
• LN covariance matrix is 1 on the diagonal and 0.5 for tag pairs within the same “family” (thirteen families, designed to be language-independent).

Language | EM | MAP | EB (D) | EB (LN)
English | 46 | 46 | 46 | 59
Mandarin | 38 | 38 | 38 | 47

Cohen, Gimpel, and Smith (NIPS 2008); Cohen and Smith (NAACL-HLT 2009)
Shared Logistic Normals
• Logistic normal softly ties grammar event probabilities within the same distribution.
• What about across distributions?
  • If Vpast is likely to have a noun argument, so is Vpresent.
  • In general, certain classes are likely to have correlated parent distributions.
• We can capture this by combining draws from logistic normal distributions.
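One way to combine the draws, assuming an equal-weight average between a distribution's own logits and a component shared with related distributions (the weighting is an illustrative choice, and all numbers below are made up):

```python
import math

def softmax(eta):
    m = max(eta)
    exps = [math.exp(x - m) for x in eta]
    z = sum(exps)
    return [e / z for e in exps]

def shared_ln_multinomial(own_eta, shared_eta):
    """Shared logistic normal sketch: a distribution's logits are the
    average of its own Gaussian draw and a draw from a component shared
    across related distributions (e.g. all verb tags), then softmaxed."""
    combined = [(a + b) / 2.0 for a, b in zip(own_eta, shared_eta)]
    return softmax(combined)

# Vpast and Vpres each mix their own logits with one shared "verb" draw,
# so both end up favoring the same child class (index 0, e.g. Nsing).
shared_verb = [2.0, 0.0, -1.0]
p_vpast = shared_ln_multinomial([0.5, 0.2, 0.1], shared_verb)
p_vpres = shared_ln_multinomial([0.1, 0.4, 0.0], shared_verb)
```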
Shared Logistic Normal Distributions
(figure, built up over several slides: Gaussian draws η1, η2, η3, …, ηn as before, but draws for related multinomials are averaged with a shared component before the softmax; the averaged-and-softmaxed vectors together constitute the grammar g)
What to Tie?
• All verb tags share components for all six distributions (left children, right children, and stopping in each direction in each state).
• All noun tags likewise share components for all six distributions.
• (Clearly, many more ideas to try!)
Experimental Results: EB
• Single initializer.
• MAP hyperparameter value is fixed at 1.1.
• Tag families used for the logistic normal and shared logistic normal models.
• Verb-as-parent and noun-as-parent distributions each tied in the shared logistic normal models.

Language | EM | MAP | EB (LN) | EB (SLN)
English | 46 | 46 | 59 | 61
Mandarin | 38 | 38 | 47 | 49
Cohen and Smith (NAACL-HLT 2009)
Bayesian grammar induction summary
• This is an exciting (though technical and computationally complex) area!
• Nonparametric models’ ability to scale model complexity with data complexity is attractive
• Since likelihood clearly won’t guide us to the right grammars, exploring a wider variety of priors is also attractive
• Open issue: nonparametric models constrain what types of priors can be used
Machine translation
• Shifting gears…
Machine Translation: Examples
Machine Translation
Source: Madame la présidente, votre présidence de cette institution a été marquante.
Reference: Mrs Fontaine, your presidency of this institution has been outstanding.
MT output: Madam President, president of this house has been discoveries.
MT output: Madam President, your presidency of this institution has been impressive.

Source: Je vais maintenant m'exprimer brièvement en irlandais.
Reference: I shall now speak briefly in Irish.
MT output: I will now speak briefly in Ireland.
MT output: I will now speak briefly in Irish.

Source: Nous trouvons en vous un président tel que nous le souhaitions.
Reference: We think that you are the type of president that we want.
MT output: We are in you a president as the wanted.
MT output: We are in you a president as we the wanted.
History
• 1950s: Intensive research activity in MT
• 1960s: Direct word-for-word replacement
• 1966 (ALPAC): NRC report on MT; conclusion: MT no longer worthy of serious scientific investigation
• 1966-1975: “Recovery period”
• 1975-1985: Resurgence (Europe, Japan)
• 1985-present: Gradual resurgence (US)
http://ourworld.compuserve.com/homepages/WJHutchins/MTS-93.htm
Levels of Transfer
(Vauquois triangle: Source Text → Target Text)
• Direct: word structure → word structure (morphological analysis / morphological generation)
• Syntactic transfer: syntactic analysis → syntactic structure → syntactic structure → syntactic generation
• Semantic transfer: semantic analysis → semantic structure → semantic structure → semantic generation
• Interlingua: semantic composition → interlingua → semantic decomposition
General Approaches
• Rule-based approaches
  • Expert-system-like rewrite systems
  • Interlingua methods (analyze and generate)
  • Lexicons come from humans
  • Can be very fast, and can accumulate a lot of knowledge over time (e.g. Systran)
• Statistical approaches
  • Word-to-word translation
  • Phrase-based translation
  • Syntax-based translation (tree-to-tree, tree-to-string)
  • Trained on parallel corpora
  • Usually noisy-channel (at least in spirit)
The Coding View
• “One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ”
• Warren Weaver (1955:18, quoting a letter he wrote in 1947)
MT System Components
• Source model (language model): P(e)
• Channel (translation model): P(f|e)
• The decoder observes f and finds the best English translation:
  e_best = argmax_e P(e|f) = argmax_e P(f|e) P(e)
• Finds an English translation which is both fluent and semantically faithful to the French source
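The argmax above can be sketched over a toy candidate list, with stand-in probability tables in place of real language and translation models:

```python
def decode(candidates, lm_prob, tm_prob, source):
    """Noisy-channel decoding sketch: choose the English candidate e
    maximizing P(f|e) * P(e)."""
    return max(candidates, key=lambda e: tm_prob(source, e) * lm_prob(e))

# Toy stand-ins: the channel likes both word orders equally, so the
# language model's fluency preference breaks the tie.
lm = lambda e: {"the cat": 0.4, "cat the": 0.01}.get(e, 0.0)
tm = lambda f, e: {("le chat", "the cat"): 0.5,
                   ("le chat", "cat the"): 0.5}.get((f, e), 0.0)
best = decode(["the cat", "cat the"], lm, tm, "le chat")
```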
Overview: Extracting Phrases
• Pipeline: sentence-aligned corpus → directional word alignments → intersected and grown word alignments → phrase table (the translation model)
• Example phrase-table entries:
  cat ||| chat ||| 0.9
  the cat ||| le chat ||| 0.8
  dog ||| chien ||| 0.8
  house ||| maison ||| 0.6
  my house ||| ma maison ||| 0.9
  language ||| langue ||| 0.9
  …
Phrase-Based Decoding
这 7 人 中包括 来自 法国 和 俄罗斯 的 宇航 员 .
(the 7 people include astronauts coming from France and Russia)
Why Syntactic Translation?
Kare ha ongaku wo kiku no ga daisuki desu
He adores listening to music.
(from Yamada and Knight, 2001)
Two Places for Syntax?
• Language model
  • Can use with any translation model
  • Syntactic language models seem to be better for MT than for ASR (why?)
  • Not thoroughly investigated [Charniak et al. 03]
• Translation model
  • Can use any language model
  • A linear LM can complement a tree-based TM (why?)
  • Also not thoroughly explored, but much more work recently
Parse Tree (E) → Sentence (J)
(figure: the English parse tree for “he adores listening to music” passes through three channel operations:
1. Reorder the children of each node
2. Insert Japanese function words (ha, ga, no, desu) at tree nodes
3. Translate the English leaves (he → kare, music → ongaku, to → wo, listening → kiku, adores → daisuki)
yielding the Japanese sentence “kare ha ongaku wo kiku no ga daisuki desu”)
1. Reorder
(figure: reordering transforms the “he adores listening to music” word order into the “he music to listening adores” order)
P(PRP VB1 VB2 → PRP VB2 VB1) = 0.723
P(VB TO → TO VB) = 0.749
P(TO NN → NN TO) = 0.893
Parameter Table: Reorder
Original Order | Reordering | P(reorder|original)
PRP VB1 VB2 | PRP VB1 VB2 | 0.074
PRP VB1 VB2 | PRP VB2 VB1 | 0.723
PRP VB1 VB2 | VB1 PRP VB2 | 0.061
PRP VB1 VB2 | VB1 VB2 PRP | 0.037
PRP VB1 VB2 | VB2 PRP VB1 | 0.083
PRP VB1 VB2 | VB2 VB1 PRP | 0.021
VB TO | VB TO | 0.251
VB TO | TO VB | 0.749
TO NN | TO NN | 0.107
TO NN | NN TO | 0.893
2. Insert
(figure: the reordered tree with the Japanese function words ha, ga, no, desu inserted at tree nodes)
P(none | TOP-VB) = 0.735
P(right | VB-PRP) × P(ha) = 0.652 × 0.219
P(right | VB-VB) × P(ga) = 0.252 × 0.062
P(none | TO-TO) = 0.900
Conditioning features: parent label & node label (for position); none (for word selection)
Parameter Table: Insert
Parent label | Node label | P(none) | P(left) | P(right)
TOP | VB | 0.735 | 0.004 | 0.260
VB | VB | 0.687 | 0.061 | 0.252
VB | TO | 0.344 | 0.004 | 0.652
TO | TO | 0.700 | 0.030 | 0.261
TO | NN | 0.900 | 0.003 | 0.097
TO | NN | 0.800 | 0.096 | 0.104

w | P(insert-w)
ha | 0.219
ta | 0.131
wo | 0.099
no | 0.094
ni | 0.080
te | 0.078
ga | 0.062
desu | 0.0007
3. Translate
(figure: each English leaf is translated in place, yielding kare ha ongaku wo kiku no ga daisuki desu)
P(he → kare) = 0.952
P(music → ongaku) = 0.900
P(to → wo) = 0.038
P(listening → kiku) = 0.333
P(adores → daisuki) = 1.000
Conditioning feature: English word identity
Parameter Table: Translate
adores: daisuki 1.000
he: kare 0.952, NULL 0.016, nani 0.005, da 0.003, shi 0.003
listening: kiku 0.333, kii 0.333, mi 0.333
music: ongaku 0.900, naru 0.100
to: ni 0.216, NULL 0.204, to 0.133, no 0.046, wo 0.038
Note: Translation to NULL = deletion
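Putting the three tables together, the channel probability of one derivation is just the product of the table entries it uses; a sketch with values copied from the example tables (insertion steps would multiply in the same way):

```python
# Yamada & Knight channel sketch: each chosen reorder/translate step
# contributes one table entry to the derivation probability.
reorder = {("PRP VB1 VB2", "PRP VB2 VB1"): 0.723,
           ("VB TO", "TO VB"): 0.749,
           ("TO NN", "NN TO"): 0.893}
translate = {("he", "kare"): 0.952, ("music", "ongaku"): 0.900,
             ("to", "wo"): 0.038, ("listening", "kiku"): 0.333,
             ("adores", "daisuki"): 1.000}

def derivation_prob(reorder_steps, translate_steps):
    p = 1.0
    for step in reorder_steps:
        p *= reorder[step]
    for step in translate_steps:
        p *= translate[step]
    return p  # insertion probabilities would multiply in here too

p = derivation_prob([("PRP VB1 VB2", "PRP VB2 VB1"), ("VB TO", "TO VB")],
                    [("he", "kare"), ("adores", "daisuki")])
```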
Synchronous Grammars
• Multi-dimensional PCFGs (Wu 1995; Melamed 2004)
• Both texts share the same parse tree:
• Formally: have paired expansions:

  S → NP VP   ↔   S → NP VP
  VP → V NP   ↔   VP → NP V

• … with probabilities, of course!
• Distribution over tree pairs
• Strong assumption: constituents in one language are constituents in the other
• Is this a good assumption? Why / why not?
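To make the paired-expansion idea concrete, here is a minimal synchronous-CFG generation sketch. The grammar and lexicon are an invented toy echoing the lecture's example (they are not from the slides), and since the two NPs pick their lexical rules independently, the toy can produce implausible pairings; it only illustrates the mechanics of shared structure with per-rule reordering.

```python
import random

# Each rule pairs a source expansion with a target expansion over the same
# child symbols, possibly reordered. Assumes each nonterminal appears at
# most once per right-hand side (enough for this toy).
SCFG = {
    "S":  [(["NP", "VP"], ["NP", "VP"])],
    "VP": [(["V", "NP"], ["NP", "V"])],   # verb-object flips to object-verb
    "NP": [(["he"], ["kare"]), (["music"], ["ongaku"])],
    "V":  [(["adores"], ["daisuki"])],
}

def generate(sym, rng):
    """Synchronously expand `sym`, returning (source words, target words)."""
    src_rhs, tgt_rhs = rng.choice(SCFG[sym])
    # expand each nonterminal child once, then lay the children out in each
    # side's own order -- this is where the two trees share structure
    kids = {c: generate(c, rng) for c in src_rhs if c in SCFG}
    src = [w for c in src_rhs for w in (kids[c][0] if c in SCFG else [c])]
    tgt = [w for c in tgt_rhs for w in (kids[c][1] if c in SCFG else [c])]
    return src, tgt
```

Every sampled pair has the verb in second position on the source side but sentence-finally on the target side, reflecting the single shared tree with a reordered VP rule.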
Synchronous Derivations

Synchronous Derivations (II)

Hiero Phrases

(the figures for these three slides are not preserved in this transcript)
Top-Down Tree Transducers

[Next slides from Kevin Knight]

Original input — a parse of “he enjoys listening to music” (bracketing reconstructed from the slide):

  (S (NP (PRO he))
     (VP (VBZ enjoys)
         (NP (SBAR (VBG listening)
                   (VP (P to) (NP music))))))

Transformation: initially just the input tree; the transducer rewrites it top-down into a Japanese string over the next slides.
Top-Down Tree Transducers

After the first rule applies at the root, the S node is replaced by a sequence of subtrees and inserted particles (reconstructed from the slide):

  NP(PRO he) , wa , NP(VBG listening … music) , ga , VBZ(enjoys)

The remaining subtrees are rewritten by further rules.
Top-Down Tree Transducers

Next the subject subtree is rewritten, NP(PRO he) → kare, giving (reconstructed):

  kare , wa , NP(VBG listening … music) , ga , VBZ(enjoys)
Top-Down Tree Transducers

Original input: the parse of “he enjoys listening to music”. Final output:

  kare , wa , ongaku , o , kiku , no , ga , daisuki , desu
Top-Down Tree Transducers

The same input and output, now as a general rule shape: variables x0, x1, x2 match whole subtrees, which the rule may reorder and interleave with new output symbols:

  A(x0:B, C(x1:D, x2:E)) → x0 , F , x2 , G , x1
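The rule schema can be sketched in code. This is a hypothetical mini-implementation, not Knight's: trees are nested tuples, and the concrete root rule below (including where "wa", "ga", and "desu" are emitted — on the slides "desu" may instead come from translating the verb) is reconstructed from the example.

```python
# Trees are nested tuples: (label, child1, child2, ...); strings are leaves.
# One rule in the schema A(x0:B, C(x1:D, x2:E)) -> x0, F, x2, G, x1:

def apply_root_rule(tree):
    """S(x0:NP, VP(x1:VBZ, x2:NP)) -> x0 'wa' x2 'ga' x1 'desu'."""
    label, x0, vp = tree
    assert label == "S" and vp[0] == "VP"
    _, x1, x2 = vp
    # reorder the matched subtrees and interleave new output symbols
    return [x0, "wa", x2, "ga", x1, "desu"]

tree = ("S",
        ("NP", ("PRO", "he")),
        ("VP",
         ("VBZ", "enjoys"),
         ("NP", ("VBG", "listening"),
                ("VP", ("P", "to"), ("NP", "music")))))

out = apply_root_rule(tree)
# `out` mixes untranslated subtrees with inserted particles; further rules
# would keep rewriting the remaining subtrees top-down.
```

Note the key property of top-down tree transducers on display: a single rule can inspect a nested pattern (S over VP), reorder the matched variables, and emit new terminals, which plain synchronous CFG rules over sister nonterminals cannot do.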
这 7 人 中包括 来自 法国 和 俄罗斯 的 宇航 员 .
RULE 1:  DT(these) → 这
RULE 2:  VBP(include) → 中包括
RULE 4:  NNP(France) → 法国
RULE 5:  CC(and) → 和
RULE 6:  NNP(Russia) → 俄罗斯
RULE 8:  NP(NNS(astronauts)) → 宇航 , 员
RULE 9:  PUNC(.) → .
RULE 10: NP(x0:DT, CD(7), NNS(people)) → x0 , 7 人
RULE 11: VP(VBG(coming), PP(IN(from), x0:NP)) → 来自 , x0
RULE 13: NP(x0:NNP, x1:CC, x2:NNP) → x0 , x1 , x2
RULE 14: VP(x0:VBP, x1:NP) → x0 , x1
RULE 15: S(x0:NP, x1:VP, x2:PUNC) → x0 , x1 , x2
RULE 16: NP(x0:NP, x1:VP) → x1 , 的 , x0
Derivation Tree

“These 7 people include astronauts coming from France and Russia”

Spans covered as the derivation is built up:

  “these” “Russia” “astronauts” “.” “include” “France” “and”
  “France and Russia”
  “coming from France and Russia”
  “astronauts coming from France and Russia”
  “these 7 people”
  “include astronauts coming from France and Russia”
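The derivation above can be checked mechanically. Below, each rule from the slide is written as a function from its children's translations (x0, x1, …) to a Chinese token list; composing them in the order implied by the derivation tree reproduces the Chinese sentence. The function and variable names are illustrative.

```python
# Non-lexical rules from the slide, as functions over child translations
def rule10(x0): return x0 + ["7", "人"]       # NP(x0:DT, CD(7), NNS(people))
def rule11(x0): return ["来自"] + x0          # VP(VBG(coming), PP(IN(from), x0:NP))
def rule13(x0, x1, x2): return x0 + x1 + x2   # NP(x0:NNP, x1:CC, x2:NNP)
def rule14(x0, x1): return x0 + x1            # VP(x0:VBP, x1:NP)
def rule15(x0, x1, x2): return x0 + x1 + x2   # S(x0:NP, x1:VP, x2:PUNC)
def rule16(x0, x1): return x1 + ["的"] + x0   # NP(x0:NP, x1:VP)

# Lexical rules (1, 2, 4, 5, 6, 8, 9)
these, include, france = ["这"], ["中包括"], ["法国"]
and_, russia, astronauts, period = ["和"], ["俄罗斯"], ["宇航", "员"], ["."]

# Compose bottom-up along the derivation tree
np_subject = rule10(these)                    # "these 7 people"
np_countries = rule13(france, and_, russia)   # "France and Russia"
vp_coming = rule11(np_countries)              # "coming from France and Russia"
np_object = rule16(astronauts, vp_coming)     # "astronauts coming from ..."
vp = rule14(include, np_object)               # "include astronauts ..."
sentence = rule15(np_subject, vp, period)
# " ".join(sentence) reproduces the Chinese sentence at the top of the slide
```

Note how RULE 16 does the real reordering work: the English head noun ("astronauts") moves after its modifier clause, with 的 inserted between them.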
Examples

(the two “Examples” slides were figures and are not preserved in this transcript)