Dual Decomposition Inference for Graphical Models over Strings
Nanyun (Violet) Peng, Ryan Cotterell, Jason Eisner
Johns Hopkins University
Attention!

• Don’t care about phonology?
• Listen anyway. This is a general method for inferring strings from other strings (if you have a probability model).
• So if you haven’t yet observed all the words of your noisy or complex language, try it!
A Phonological Exercise  (rows = verbs, columns = tenses)

         1P Pres. Sg.   3P Pres. Sg.   Past Tense   Past Part.
TALK     [tɔk]          [tɔks]         [tɔkt]       [tɔkt]
THANK    [θeɪŋk]        [θeɪŋks]       [θeɪŋkt]     [θeɪŋkt]
HACK     [hæk]          [hæks]         [hækt]       [hækt]
CRACK    ?              [kɹæks]        [kɹækt]      ?
SLAP     [slæp]         ?              [slæpt]      ?
Matrix Completion: Collaborative Filtering

[Figure: a Users × Movies matrix with some ratings observed (-37, 29, 19, 29; -36, 67, 77, 22; -24, 61, 74, 12; -79, -41, -52, -39) and other cells missing.]

Matrix Completion: Collaborative Filtering

[Figure: the same matrix, now annotated with a latent vector for each user ([-6 -3 2], [9 -2 1], [9 -7 2], [4 3 -2]) and each movie ([4 1 -5], [7 -2 0], [6 -2 3], [-9 1 4], [3 8 -5]).]

Matrix Completion: Collaborative Filtering

Prediction! [Figure: the user and movie vectors fill in the missing cells (59, -80, 6, 46).]

Matrix Completion: Collaborative Filtering

[Figure: each rating is modeled as the dot product of a user vector and a movie vector, plus Gaussian noise: [1,-4,3] · [-5,2,1] = -10, observed as -11.]
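The dot-product-plus-noise model on this slide can be sketched in a few lines. This is only an illustration of the rating model; the function name and noise parameter are my own.

```python
import random

def predict_rating(user_vec, movie_vec, noise_sd=0.0):
    """Model a rating as user · movie plus optional Gaussian noise."""
    dot = sum(u * m for u, m in zip(user_vec, movie_vec))
    return dot + random.gauss(0.0, noise_sd)

# The slide's example: [1, -4, 3] · [-5, 2, 1] = -5 - 8 + 3 = -10;
# with Gaussian noise the observed rating might come out as -11.
print(predict_rating([1, -4, 3], [-5, 2, 1]))  # -> -10.0 (no noise)
```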
A Phonological Exercise

         1P Pres. Sg.   3P Pres. Sg.   Past Tense   Past Part.
TALK     [tɔk]          [tɔks]         [tɔkt]       [tɔkt]
THANK    [θeɪŋk]        [θeɪŋks]       [θeɪŋkt]     [θeɪŋkt]
HACK     [hæk]          [hæks]         [hækt]       [hækt]
CRACK    ?              [kɹæks]        [kɹækt]      ?
SLAP     [slæp]         ?              [slæpt]      ?

A Phonological Exercise

The same table, factored into stems and suffixes:

Suffixes: /Ø/  /s/  /t/  /t/
Stems: /tɔk/  /θeɪŋk/  /hæk/  /kɹæk/  /slæp/
A Phonological Exercise

Prediction! Stems plus suffixes fill in the missing cells:

         1P Pres. Sg.   3P Pres. Sg.   Past Tense   Past Part.
TALK     [tɔk]          [tɔks]         [tɔkt]       [tɔkt]
THANK    [θeɪŋk]        [θeɪŋks]       [θeɪŋkt]     [θeɪŋkt]
HACK     [hæk]          [hæks]         [hækt]       [hækt]
CRACK    [kɹæk]         [kɹæks]        [kɹækt]      [kɹækt]
SLAP     [slæp]         [slæps]        [slæpt]      [slæpt]

Suffixes: /Ø/  /s/  /t/  /t/
Stems: /tɔk/  /θeɪŋk/  /hæk/  /kɹæk/  /slæp/
A Model of Phonology

tɔk + s  →(Concatenate)→  tɔks  (“talks”)
A Phonological Exercise

Now add CODE, BAT, and EAT:

         1P Pres. Sg.   3P Pres. Sg.   Past Tense   Past Part.
TALK     [tɔk]          [tɔks]         [tɔkt]       [tɔkt]
THANK    [θeɪŋk]        [θeɪŋks]       [θeɪŋkt]     [θeɪŋkt]
HACK     [hæk]          [hæks]         [hækt]       [hækt]
CRACK    ?              [kɹæks]        [kɹækt]      ?
SLAP     [slæp]         ?              [slæpt]      ?
CODE     ?              [koʊdz]        [koʊdɪt]     ?
BAT      [bæt]          ?              [bætɪt]      ?
EAT      [it]           ?              [eɪt]        [itən]

Suffixes: /Ø/  /s/  /t/  /t/
Stems: /tɔk/  /θeɪŋk/  /hæk/  /kɹæk/  /slæp/  /koʊd/  /bæt/  /it/

But simple concatenation no longer works everywhere:
z instead of s (codes), ɪt instead of t (batted), eɪt instead of itɪt (ate).
A Model of Phonology

koʊd + s  →(Concatenate)→  koʊd#s  →(Phonology, stochastic)→  koʊdz  (“codes”)
Modeling word forms using latent underlying morphs and phonology. Cotterell et al., TACL 2015.
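The pipeline on this slide can be imitated with a toy sketch. The paper’s phonology is a learned probabilistic FST; the deterministic stand-in below uses two hand-written rules (rough voicing assimilation and ɪ-epenthesis, my own simplification) just to illustrate the UR → SR path.

```python
VOICED = set("bdgvzmnlrw")  # rough voiced-consonant set, illustrative only

def underlying(stem, suffix):
    """Concatenate morphs into an underlying word with a # boundary."""
    return stem + "#" + suffix

def toy_phonology(ur):
    """Toy deterministic stand-in for the paper's stochastic PFST."""
    stem, suffix = ur.split("#")
    if suffix == "s" and stem[-1] in VOICED:
        suffix = "z"      # voicing assimilation: koʊd#s -> koʊdz
    if suffix == "t" and stem[-1] in "td":
        suffix = "ɪt"     # epenthesis: bæt#t -> bætɪt
    return stem + suffix

print(toy_phonology(underlying("koʊd", "s")))  # koʊdz
print(toy_phonology(underlying("bæt", "t")))   # bætɪt
print(toy_phonology(underlying("tɔk", "s")))   # tɔks (no change)
```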
A Model of Phonology

rizaign + ation  →(Concatenate)→  rizaign#ation  →(Phonology, stochastic)→  rεzɪgneɪʃn  (“resignation”)
dæmneɪʃənzrizaign
r,εzɪgn’eɪʃn riz’ajnz d,æmn’eɪʃn d’æmz
rεzɪgn#eɪʃən rizajn#z dæmn#zdæmn#eɪʃən
Fragment of Our Graph for English
18
1) Morphemes
2) Underlying words
3) Surface words
Concatenation
Phonology
“resignation” “resigns”
“damnation” “damns”
3rd-personsingular suffix:very common!
Limited to concatenation? No, could extend to templatic morphology …
19
Outline

● A motivating example: phonology
● General framework:
  o Graphical models over strings
  o Inference on graphical models over strings
● Dual decomposition inference
  o The general idea
  o Substring features and active set
● Experiments and results
Graphical Models over Strings?

● Joint distribution over many strings
● Variables range over Σ*, the infinite set of all strings
● Relations among variables usually specified by (multi-tape) FSTs

A probabilistic approach to language change (Bouchard-Côté et al., NIPS 2008)
Graphical models over multiple strings (Dreyer and Eisner, EMNLP 2009)
Large-scale cognate recovery (Hall and Klein, EMNLP 2011)
Graphical Models over Strings?

● Strings are the basic units in natural languages.
● Use:
  o Orthographic (spelling)
  o Phonological (pronunciation)
  o Latent (intermediate steps not observed directly)
● Size:
  o Morphemes (meaningful subword units)
  o Words
  o Multi-word phrases, including “named entities”
  o URLs
What relationships could you model?

● spelling ↔ pronunciation
● word ↔ noisy word (e.g., with a typo)
● word ↔ related word in another language (loanwords, language evolution, cognates)
● singular ↔ plural (for example)
● root ↔ word
● underlying form ↔ surface form
Factor Graph for Phonology

1) Morpheme URs: rizajgn, eɪʃən, z, dæmn
2) Word URs (Concatenation, e.g.): rεzɪgn#eɪʃən, rizajn#z, dæmn#eɪʃən, dæmn#z
3) Word SRs (Phonology, PFST): r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz

The whole assignment has a log-probability. Let’s maximize it!

The Phonology factor is a contextual stochastic edit process:
Stochastic contextual edit distance and probabilistic FSTs (Cotterell et al., ACL 2014)
Inference on a Factor Graph

We observe the word SRs riz’ajnz, r,εzɪgn’eɪʃn, riz’ajnd, and must recover the unknown morpheme URs (one stem, three suffixes) and the word URs.

Guess stem “bar” with suffixes “foo”, “s”, “da”:
  Word URs: bar#s, bar#foo, bar#da.
  The guessed morphemes are not too improbable a priori (scores 8e-3, 0.01, 0.05, 0.02), but phonology assigns the observed surface words vanishing probability: 6e-1200, 2e-1300, 7e-1100.

Try stem “far”? Stem “size”? … there are infinitely many candidate strings!

Guess stem “rizajn” (suffixes still “foo”, “s”, “da”):
  Word URs: rizajn#s, rizajn#foo, rizajn#da; phonology probabilities 0.01, 2e-5, 0.008.

Fix the suffixes to “s”, “eɪʃn”, “d”:
  Word URs: rizajn#s, rizajn#eɪʃn, rizajn#d; probabilities 0.01, 0.001, 0.015.

Guess stem “rizajgn”:
  Word URs: rizajgn#s, rizajgn#eɪʃn, rizajgn#d; probabilities 0.008, 0.008, 0.013.
  One stem now explains all three surface words reasonably well at once.
Challenges in Inference

• Global discrete optimization problem.
• Variables range over an infinite set … cannot be solved by ILP or even brute force. Undecidable!
• Our previous papers used approximate algorithms: loopy belief propagation or expectation propagation.

Q: Can we do exact inference?
A: If we can live with 1-best rather than marginal inference, then we can use dual decomposition … which is exact (if it terminates! The problem is undecidable in general …).
Outline

● A motivating example: phonology
● General framework:
  o Graphical models over strings
  o Inference on graphical models over strings
● Dual decomposition inference
  o The general idea
  o Substring features and active set
● Experiments and results
Graphical Model for Phonology

Jointly decide the values of the inter-dependent latent variables, which range over an infinite set.

1) Morpheme URs: rizajgn, eɪʃən, z, dæmn
2) Word URs (Concatenation, e.g.): rεzɪgn#eɪʃən, rizajn#z, dæmn#eɪʃən, dæmn#z
3) Word SRs (Phonology, PFST): r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz
General Idea of Dual Decomp

Split the graph into one subproblem per observed word; each subproblem gets its own copies of the morphemes it uses:

Subproblem 1: rεzɪgn, eɪʃən → rεzɪgn#eɪʃən → r,εzɪgn’eɪʃn
Subproblem 2: rizajn, z → rizajn#z → riz’ajnz
Subproblem 3: dæmn, eɪʃən → dæmn#eɪʃən → d,æmn’eɪʃn
Subproblem 4: dæmn, z → dæmn#z → d’æmz

The copies can disagree: Subproblem 1 says “I prefer rεzɪgn” while Subproblem 2 says “I prefer rizajn.”
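The negotiation between subproblems can be made concrete with a deliberately tiny dual-decomposition loop. Everything here is invented for illustration (two subproblems, a two-string candidate set, made-up local scores); the real subproblems optimize over infinitely many strings with FSTs. Still, the mechanics are genuine: each copy maximizes its local score plus (or minus) λ-weighted character-count features, and λ follows the subgradient until the copies agree.

```python
from collections import Counter

def phi(s):
    """Unigram (character) count features of a string."""
    return Counter(s)

def dd_toy(score1, score2, candidates, steps=50, eta=0.3):
    """Tiny dual decomposition for two copies of one string variable.
    Subproblem 1 maximizes score1(x) + lam·phi(x); subproblem 2
    maximizes score2(x) - lam·phi(x); lam follows the subgradient."""
    lam = Counter()
    for _ in range(steps):
        x1 = max(candidates,
                 key=lambda s: score1[s] + sum(lam[c] * n for c, n in phi(s).items()))
        x2 = max(candidates,
                 key=lambda s: score2[s] - sum(lam[c] * n for c, n in phi(s).items()))
        if x1 == x2:               # copies agree: constraints satisfied, MAP found
            return x1
        g = phi(x1)
        g.subtract(phi(x2))        # subgradient = phi(x1) - phi(x2)
        for c, n in g.items():
            lam[c] -= eta * n
    return None                    # did not converge within `steps` iterations

# Invented local scores: subproblem 1 slightly prefers "gris",
# subproblem 2 strongly prefers "griz"; jointly "griz" wins.
best = dd_toy({"gris": 1.0, "griz": 0.9}, {"gris": 0.2, "griz": 1.0},
              ["gris", "griz"])
print(best)  # griz
```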
Outline

● A motivating example: phonology
● General framework:
  o Graphical models over strings
  o Inference on graphical models over strings
● Dual decomposition inference
  o The general idea
  o Substring features and active set
● Experiments and results
Substring Features and Active Set

The same four subproblems; the disagreement is negotiated through substring-count features:

Subproblem 1 (“I prefer rεzɪgn”) is nudged: less ε, ɪ, g; more i, a, j (to match the others).
Subproblem 2 (“I prefer rizajn”) is nudged: less i, a, j; more ε, ɪ, g (to match the others).
Features: “Active set” method

• How many features? Infinitely many possible n-grams!
• Trick: gradually increase the feature set as needed, as in Paul & Eisner (2012) and Cotterell & Eisner (2015):
  1. Only add features on which strings disagree.
  2. Only add abcd once abc and bcd already agree.
  – Exception: add unigrams and bigrams for free.
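Rule 1 plus the unigram/bigram exception can be sketched in a few lines (my own minimal implementation, not the authors’ code; `^` and `$` are boundary markers so that features like `s$` exist):

```python
from collections import Counter

def ngrams(s, n):
    """n-gram counts of s, with ^/$ as boundary markers."""
    t = "^" + s + "$"
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def disagreements(x, y, n):
    """n-grams whose counts differ between strings x and y."""
    cx, cy = ngrams(x, n), ngrams(y, n)
    return {g for g in set(cx) | set(cy) if cx[g] != cy[g]}

# Unigrams and bigrams are added for free, wherever counts disagree:
active = disagreements("gris", "griz", 1) | disagreements("gris", "griz", 2)
print(sorted(active))  # ['is', 'iz', 's', 's$', 'z', 'z$']
```

This reproduces the active set {s, z, is, iz, s$, z$} that appears in the Catalan example below; rule 2 (extend abcd only once abc and bcd agree) would govern when trigrams and longer substrings get added.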
Fragment of Our Graph for Catalan

Four observed surface words share the stem of “grey”: gris, grizos, grize, grizes. The stem UR and the suffix URs are unknown (?).

Separate these 4 words into 4 subproblems as before …
Redraw the graph to focus on the stem …

Separate into 4 subproblems; each subproblem (gris, grizos, grize, grizes) gets its own copy of the stem.
(One stem copy per subproblem, shown in the order gris, grizos, grize, grizes; the feature weights are the dual variables.)

Iteration 1:  stem copies ε, ε, ε, ε; nonzero features: { }
Iteration 3:  stem copies g, g, g, g; nonzero features: { }
Iteration 4:  stem copies gris, griz, griz, griz; nonzero features: {s, z, is, iz, s$, z$}
Iteration 5:  stem copies gris, grizo, griz, griz; nonzero features: {s, z, is, iz, s$, z$, o, zo, o$}
Iterations 6–13: the weights keep adjusting …
Iteration 14: stem copies griz, grizo, griz, griz
Iteration 17: stem copies griz, griz, griz, griz
Iteration 18: stem copies griz, griz, grize, griz; nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$}
Iterations 19–29: …
Iteration 30: stem copies griz, griz, griz, griz. Converged!
Why n-gram features?

• Positional features don’t understand insertion. Comparing giz with griz, position-by-position features would say: “I’ll try to arrange for r not i at position 2, i not z at position 3, z not ∅ at position 4.”
• In contrast, our “z” feature counts the number of “z” phonemes, without regard to position. The solutions giz and griz already agree on the “g”, “i”, “z” counts … they’re only negotiating over the “r” count: “I need more r’s.”
Why n-gram features?

• Adjust weights λ until the “r” counts match: giz vs. griz (“I need more r’s … somewhere.”)
• The next iteration agrees on all our unigram features: girz vs. griz.
  – Oops! Features matched only counts, not positions. But bigram counts are still wrong (“I need more gr, ri, iz; less gi, ir, rz”) …
  so bigram features get activated to save the day.
  – If that’s not enough, add even longer substrings …
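The negotiation on these two slides is just a comparison of n-gram count vectors, which a quick sketch can verify (illustrative helper, my own naming):

```python
from collections import Counter

def count_diff(x, y, n):
    """n-gram count differences phi(x) - phi(y), dropping zeros."""
    d = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    d.subtract(Counter(y[i:i + n] for i in range(len(y) - n + 1)))
    return {g: c for g, c in d.items() if c != 0}

# "giz" vs "griz": unigram counts agree on g, i, z; only "r" differs.
print(count_diff("griz", "giz", 1))   # {'r': 1}

# "girz" vs "griz": all unigram counts agree, but bigram counts differ:
# griz needs more gr, ri, iz and less gi, ir, rz.
print(count_diff("griz", "girz", 2))
```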
Outline

● A motivating example: phonology
● General framework:
  o Graphical models over strings
  o Inference on graphical models over strings
● Dual decomposition inference
  o The general idea
  o Substring features and active set
● Experiments and results
7 Inference Problems (graphs)

EXERCISE (small)
  o 4 languages: Catalan, English, Maori, Tangale
  o 16 to 55 underlying morphemes (= # variables, the unknown strings)
  o 55 to 106 surface words (= # subproblems)
CELEX (large)
  o 3 languages: English, German, Dutch
  o 341 to 381 underlying morphemes
  o 1000 surface words for each language
Experimental Questions

o Is exact inference by DD practical? Does it converge?
o Does it get better results than approximate inference methods?
o Does exact inference help EM?
● DD seeks the best λ via a subgradient algorithm: reducing the dual objective tightens the upper bound on the primal objective.
● If λ gets all subproblems to agree (x1 = … = xK), the constraints are satisfied, so the dual value is also the value of a primal solution, which must be the max primal! (and the min dual)

   primal (a function of strings x)  ≤  dual (a function of weights λ)
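In symbols, using standard dual decomposition notation (my reconstruction of the bound on the slide: K subproblems with copies x_k of the shared strings, substring-count features φ, and weights λ_k constrained to sum to zero):

```latex
\underbrace{\max_{x}\ \sum_{k=1}^{K} f_k(x)}_{\text{primal (function of strings } x\text{)}}
\;\le\;
\underbrace{L(\lambda) \;=\; \sum_{k=1}^{K}\ \max_{x_k}\ \Bigl[\, f_k(x_k) + \lambda_k \cdot \phi(x_k) \,\Bigr]}_{\text{dual (function of weights } \lambda\text{)}}
\qquad \text{with } \sum_{k=1}^{K} \lambda_k = 0 .
```

Because the λ_k sum to zero, any assignment with all copies equal makes the λ terms cancel, so L(λ) upper-bounds the primal; when the subproblems agree, the bound is tight, which is exactly the optimality certificate on the slide.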
Convergence behavior (full graph)

[Plots for Catalan, English, Maori, Tangale: the dual curve decreases (tightening the upper bound) while the primal curve increases (improving the strings); where the two meet, the solution is optimal!]
Comparisons

● Compare DD with two types of belief propagation (BP) inference:

  Exact marginal inference: we don’t know how!
    – via a variational approximation: approximate marginal inference (sum-product BP) (TACL 2015)
    – via the Viterbi approximation: exact MAP inference (dual decomposition) (this paper)
  Approximate MAP inference (max-product BP): the baseline
Inference accuracy

                      max-product BP   sum-product BP   dual decomposition
                      (baseline)       (TACL 2015)      (this paper, exact MAP)
Model 1, EXERCISE     90%              95%              97%
Model 1, CELEX        84%              86%              90%
Model 2S, CELEX       99%              96%              99%
Model 2E, EXERCISE    91%              95%              98%

Model 1 – trivial phonology; Model 2S – oracle phonology; Model 2E – learned phonology (inference used within EM).
Sum-product BP improves over the baseline (though it is worse on Model 2S); exact DD improves more!
Conclusion

• A general DD algorithm for MAP inference on graphical models over strings.
• On the phonology problem, it terminates in practice, guaranteeing the exact MAP solution.
• Improved inference for the supervised model; improved EM training for the unsupervised model.
• Try it for your own problems of generalizing to new strings!