Dual Decomposition Inference for Graphical Models over Strings
Nanyun (Violet) Peng, Ryan Cotterell, Jason Eisner
Johns Hopkins University
Attention!

• Don’t care about phonology?
• Listen anyway. This is a general method for inferring strings from other strings (if you have a probability model).
• So if you haven’t yet observed all the words of your noisy or complex language, try it!
A Phonological Exercise  (rows = verbs, columns = tenses)

         1P Pres. Sg.   3P Pres. Sg.   Past Tense   Past Part.
TALK     [tɔk]          [tɔks]         [tɔkt]       [tɔkt]
THANK    [θeɪŋk]        [θeɪŋks]       [θeɪŋkt]     [θeɪŋkt]
HACK     [hæk]          [hæks]         [hækt]       [hækt]
CRACK    ?              [kɹæks]        [kɹækt]      ?
SLAP     [slæp]         ?              [slæpt]      ?
Matrix Completion: Collaborative Filtering

[Figure: a Users × Movies matrix with some ratings observed (-37, 29, 19, 29; -36, 67, 77, 22; -24, 61, 74, 12; -79, -41, -52, -39) and other cells missing.]

Matrix Completion: Collaborative Filtering

[Figure: the same matrix, now annotated with a latent vector for each user ([-6 -3 2], [9 -2 1], [9 -7 2], [4 3 -2]) and each movie ([4 1 -5], [7 -2 0], [6 -2 3], [-9 1 4], [3 8 -5]).]

Matrix Completion: Collaborative Filtering

Prediction! [Figure: the user and movie vectors fill in the missing cells (59, -80, 6, 46).]

Matrix Completion: Collaborative Filtering

[Figure: each rating is modeled as the dot product of a user vector and a movie vector, plus Gaussian noise: [1,-4,3] · [-5,2,1] = -10, observed as -11.]
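The dot-product-plus-noise model on this slide can be sketched in a few lines. This is only an illustration of the rating model; the function name and noise parameter are my own.

```python
import random

def predict_rating(user_vec, movie_vec, noise_sd=0.0):
    """Model a rating as user · movie plus optional Gaussian noise."""
    dot = sum(u * m for u, m in zip(user_vec, movie_vec))
    return dot + random.gauss(0.0, noise_sd)

# The slide's example: [1, -4, 3] · [-5, 2, 1] = -5 - 8 + 3 = -10;
# with Gaussian noise the observed rating might come out as -11.
print(predict_rating([1, -4, 3], [-5, 2, 1]))  # -> -10.0 (no noise)
```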
A Phonological Exercise

         1P Pres. Sg.   3P Pres. Sg.   Past Tense   Past Part.
TALK     [tɔk]          [tɔks]         [tɔkt]       [tɔkt]
THANK    [θeɪŋk]        [θeɪŋks]       [θeɪŋkt]     [θeɪŋkt]
HACK     [hæk]          [hæks]         [hækt]       [hækt]
CRACK    ?              [kɹæks]        [kɹækt]      ?
SLAP     [slæp]         ?              [slæpt]      ?

A Phonological Exercise

The same table, factored into stems and suffixes:

Suffixes: /Ø/  /s/  /t/  /t/
Stems: /tɔk/  /θeɪŋk/  /hæk/  /kɹæk/  /slæp/
A Phonological Exercise

Prediction! Stems plus suffixes fill in the missing cells:

         1P Pres. Sg.   3P Pres. Sg.   Past Tense   Past Part.
TALK     [tɔk]          [tɔks]         [tɔkt]       [tɔkt]
THANK    [θeɪŋk]        [θeɪŋks]       [θeɪŋkt]     [θeɪŋkt]
HACK     [hæk]          [hæks]         [hækt]       [hækt]
CRACK    [kɹæk]         [kɹæks]        [kɹækt]      [kɹækt]
SLAP     [slæp]         [slæps]        [slæpt]      [slæpt]

Suffixes: /Ø/  /s/  /t/  /t/
Stems: /tɔk/  /θeɪŋk/  /hæk/  /kɹæk/  /slæp/
A Model of Phonology

tɔk + s  →(Concatenate)→  tɔks  (“talks”)
A Phonological Exercise

Now add CODE, BAT, and EAT:

         1P Pres. Sg.   3P Pres. Sg.   Past Tense   Past Part.
TALK     [tɔk]          [tɔks]         [tɔkt]       [tɔkt]
THANK    [θeɪŋk]        [θeɪŋks]       [θeɪŋkt]     [θeɪŋkt]
HACK     [hæk]          [hæks]         [hækt]       [hækt]
CRACK    ?              [kɹæks]        [kɹækt]      ?
SLAP     [slæp]         ?              [slæpt]      ?
CODE     ?              [koʊdz]        [koʊdɪt]     ?
BAT      [bæt]          ?              [bætɪt]      ?
EAT      [it]           ?              [eɪt]        [itən]

Suffixes: /Ø/  /s/  /t/  /t/
Stems: /tɔk/  /θeɪŋk/  /hæk/  /kɹæk/  /slæp/  /koʊd/  /bæt/  /it/

But simple concatenation no longer works everywhere:
z instead of s (codes), ɪt instead of t (batted), eɪt instead of itɪt (ate).
A Model of Phonology

koʊd + s  →(Concatenate)→  koʊd#s  →(Phonology, stochastic)→  koʊdz  (“codes”)
Modeling word forms using latent underlying morphs and phonology. Cotterell et al., TACL 2015.
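The pipeline on this slide can be imitated with a toy sketch. The paper’s phonology is a learned probabilistic FST; the deterministic stand-in below uses two hand-written rules (rough voicing assimilation and ɪ-epenthesis, my own simplification) just to illustrate the UR → SR path.

```python
VOICED = set("bdgvzmnlrw")  # rough voiced-consonant set, illustrative only

def underlying(stem, suffix):
    """Concatenate morphs into an underlying word with a # boundary."""
    return stem + "#" + suffix

def toy_phonology(ur):
    """Toy deterministic stand-in for the paper's stochastic PFST."""
    stem, suffix = ur.split("#")
    if suffix == "s" and stem[-1] in VOICED:
        suffix = "z"      # voicing assimilation: koʊd#s -> koʊdz
    if suffix == "t" and stem[-1] in "td":
        suffix = "ɪt"     # epenthesis: bæt#t -> bætɪt
    return stem + suffix

print(toy_phonology(underlying("koʊd", "s")))  # koʊdz
print(toy_phonology(underlying("bæt", "t")))   # bætɪt
print(toy_phonology(underlying("tɔk", "s")))   # tɔks (no change)
```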
A Model of Phonology

rizaign + ation  →(Concatenate)→  rizaign#ation  →(Phonology, stochastic)→  rεzɪgneɪʃn  (“resignation”)
dæmneɪʃənzrizaign
r,εzɪgn’eɪʃn riz’ajnz d,æmn’eɪʃn d’æmz
rεzɪgn#eɪʃən rizajn#z dæmn#zdæmn#eɪʃən
Fragment of Our Graph for English
18
1) Morphemes
2) Underlying words
3) Surface words
Concatenation
Phonology
“resignation” “resigns”
“damnation” “damns”
3rd-personsingular suffix:very common!
Limited to concatenation? No, could extend to templatic morphology …
19
Outline

● A motivating example: phonology
● General framework:
  o Graphical models over strings
  o Inference on graphical models over strings
● Dual decomposition inference
  o The general idea
  o Substring features and active set
● Experiments and results
Graphical Models over Strings?

● Joint distribution over many strings
● Variables range over Σ*, the infinite set of all strings
● Relations among variables usually specified by (multi-tape) FSTs

A probabilistic approach to language change (Bouchard-Côté et al., NIPS 2008)
Graphical models over multiple strings (Dreyer and Eisner, EMNLP 2009)
Large-scale cognate recovery (Hall and Klein, EMNLP 2011)
Graphical Models over Strings?

● Strings are the basic units in natural languages.
● Use:
  o Orthographic (spelling)
  o Phonological (pronunciation)
  o Latent (intermediate steps not observed directly)
● Size:
  o Morphemes (meaningful subword units)
  o Words
  o Multi-word phrases, including “named entities”
  o URLs
What relationships could you model?

● spelling ↔ pronunciation
● word ↔ noisy word (e.g., with a typo)
● word ↔ related word in another language (loanwords, language evolution, cognates)
● singular ↔ plural (for example)
● root ↔ word
● underlying form ↔ surface form
Factor Graph for Phonology

1) Morpheme URs: rizajgn, eɪʃən, z, dæmn
2) Word URs (Concatenation, e.g.): rεzɪgn#eɪʃən, rizajn#z, dæmn#eɪʃən, dæmn#z
3) Word SRs (Phonology, PFST): r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz

The whole assignment has a log-probability. Let’s maximize it!

The Phonology factor is a contextual stochastic edit process:
Stochastic contextual edit distance and probabilistic FSTs (Cotterell et al., ACL 2014)
Inference on a Factor Graph

We observe the word SRs riz’ajnz, r,εzɪgn’eɪʃn, riz’ajnd, and must recover the unknown morpheme URs (one stem, three suffixes) and the word URs.

Guess stem “bar” with suffixes “foo”, “s”, “da”:
  Word URs: bar#s, bar#foo, bar#da.
  The guessed morphemes are not too improbable a priori (scores 8e-3, 0.01, 0.05, 0.02), but phonology assigns the observed surface words vanishing probability: 6e-1200, 2e-1300, 7e-1100.

Try stem “far”? Stem “size”? … there are infinitely many candidate strings!

Guess stem “rizajn” (suffixes still “foo”, “s”, “da”):
  Word URs: rizajn#s, rizajn#foo, rizajn#da; phonology probabilities 0.01, 2e-5, 0.008.

Fix the suffixes to “s”, “eɪʃn”, “d”:
  Word URs: rizajn#s, rizajn#eɪʃn, rizajn#d; probabilities 0.01, 0.001, 0.015.

Guess stem “rizajgn”:
  Word URs: rizajgn#s, rizajgn#eɪʃn, rizajgn#d; probabilities 0.008, 0.008, 0.013.
  One stem now explains all three surface words reasonably well at once.
Challenges in Inference

• Global discrete optimization problem.
• Variables range over an infinite set … cannot be solved by ILP or even brute force. Undecidable!
• Our previous papers used approximate algorithms: loopy belief propagation or expectation propagation.

Q: Can we do exact inference?
A: If we can live with 1-best rather than marginal inference, then we can use dual decomposition … which is exact (if it terminates! The problem is undecidable in general …).
Outline

● A motivating example: phonology
● General framework:
  o Graphical models over strings
  o Inference on graphical models over strings
● Dual decomposition inference
  o The general idea
  o Substring features and active set
● Experiments and results
Graphical Model for Phonology

Jointly decide the values of the inter-dependent latent variables, which range over an infinite set.

1) Morpheme URs: rizajgn, eɪʃən, z, dæmn
2) Word URs (Concatenation, e.g.): rεzɪgn#eɪʃən, rizajn#z, dæmn#eɪʃən, dæmn#z
3) Word SRs (Phonology, PFST): r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz
General Idea of Dual Decomp

Split the graph into one subproblem per observed word; each subproblem gets its own copies of the morphemes it uses:

Subproblem 1: rεzɪgn, eɪʃən → rεzɪgn#eɪʃən → r,εzɪgn’eɪʃn
Subproblem 2: rizajn, z → rizajn#z → riz’ajnz
Subproblem 3: dæmn, eɪʃən → dæmn#eɪʃən → d,æmn’eɪʃn
Subproblem 4: dæmn, z → dæmn#z → d’æmz

The copies can disagree: Subproblem 1 says “I prefer rεzɪgn” while Subproblem 2 says “I prefer rizajn.”
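The negotiation between subproblems can be made concrete with a deliberately tiny dual-decomposition loop. Everything here is invented for illustration (two subproblems, a two-string candidate set, made-up local scores); the real subproblems optimize over infinitely many strings with FSTs. Still, the mechanics are genuine: each copy maximizes its local score plus (or minus) λ-weighted character-count features, and λ follows the subgradient until the copies agree.

```python
from collections import Counter

def phi(s):
    """Unigram (character) count features of a string."""
    return Counter(s)

def dd_toy(score1, score2, candidates, steps=50, eta=0.3):
    """Tiny dual decomposition for two copies of one string variable.
    Subproblem 1 maximizes score1(x) + lam·phi(x); subproblem 2
    maximizes score2(x) - lam·phi(x); lam follows the subgradient."""
    lam = Counter()
    for _ in range(steps):
        x1 = max(candidates,
                 key=lambda s: score1[s] + sum(lam[c] * n for c, n in phi(s).items()))
        x2 = max(candidates,
                 key=lambda s: score2[s] - sum(lam[c] * n for c, n in phi(s).items()))
        if x1 == x2:               # copies agree: constraints satisfied, MAP found
            return x1
        g = phi(x1)
        g.subtract(phi(x2))        # subgradient = phi(x1) - phi(x2)
        for c, n in g.items():
            lam[c] -= eta * n
    return None                    # did not converge within `steps` iterations

# Invented local scores: subproblem 1 slightly prefers "gris",
# subproblem 2 strongly prefers "griz"; jointly "griz" wins.
best = dd_toy({"gris": 1.0, "griz": 0.9}, {"gris": 0.2, "griz": 1.0},
              ["gris", "griz"])
print(best)  # griz
```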
Outline

● A motivating example: phonology
● General framework:
  o Graphical models over strings
  o Inference on graphical models over strings
● Dual decomposition inference
  o The general idea
  o Substring features and active set
● Experiments and results
Substring Features and Active Set

The same four subproblems; the disagreement is negotiated through substring-count features:

Subproblem 1 (“I prefer rεzɪgn”) is nudged: less ε, ɪ, g; more i, a, j (to match the others).
Subproblem 2 (“I prefer rizajn”) is nudged: less i, a, j; more ε, ɪ, g (to match the others).
Features: “Active set” method

• How many features? Infinitely many possible n-grams!
• Trick: gradually increase the feature set as needed, as in Paul & Eisner (2012) and Cotterell & Eisner (2015):
  1. Only add features on which strings disagree.
  2. Only add abcd once abc and bcd already agree.
  – Exception: add unigrams and bigrams for free.
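Rule 1 plus the unigram/bigram exception can be sketched in a few lines (my own minimal implementation, not the authors’ code; `^` and `$` are boundary markers so that features like `s$` exist):

```python
from collections import Counter

def ngrams(s, n):
    """n-gram counts of s, with ^/$ as boundary markers."""
    t = "^" + s + "$"
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def disagreements(x, y, n):
    """n-grams whose counts differ between strings x and y."""
    cx, cy = ngrams(x, n), ngrams(y, n)
    return {g for g in set(cx) | set(cy) if cx[g] != cy[g]}

# Unigrams and bigrams are added for free, wherever counts disagree:
active = disagreements("gris", "griz", 1) | disagreements("gris", "griz", 2)
print(sorted(active))  # ['is', 'iz', 's', 's$', 'z', 'z$']
```

This reproduces the active set {s, z, is, iz, s$, z$} that appears in the Catalan example below; rule 2 (extend abcd only once abc and bcd agree) would govern when trigrams and longer substrings get added.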
Fragment of Our Graph for Catalan

Four observed surface words share the stem of “grey”: gris, grizos, grize, grizes. The stem UR and the suffix URs are unknown (?).

Separate these 4 words into 4 subproblems as before …
Redraw the graph to focus on the stem …

Separate into 4 subproblems; each subproblem (gris, grizos, grize, grizes) gets its own copy of the stem.
(One stem copy per subproblem, shown in the order gris, grizos, grize, grizes; the feature weights are the dual variables.)

Iteration 1:  stem copies ε, ε, ε, ε; nonzero features: { }
Iteration 3:  stem copies g, g, g, g; nonzero features: { }
Iteration 4:  stem copies gris, griz, griz, griz; nonzero features: {s, z, is, iz, s$, z$}
Iteration 5:  stem copies gris, grizo, griz, griz; nonzero features: {s, z, is, iz, s$, z$, o, zo, o$}
Iterations 6–13: the weights keep adjusting …
Iteration 14: stem copies griz, grizo, griz, griz
Iteration 17: stem copies griz, griz, griz, griz
Iteration 18: stem copies griz, griz, grize, griz; nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$}
Iterations 19–29: …
Iteration 30: stem copies griz, griz, griz, griz. Converged!
Why n-gram features?

• Positional features don’t understand insertion. Comparing giz with griz, position-by-position features would say: “I’ll try to arrange for r not i at position 2, i not z at position 3, z not ∅ at position 4.”
• In contrast, our “z” feature counts the number of “z” phonemes, without regard to position. The solutions giz and griz already agree on the “g”, “i”, “z” counts … they’re only negotiating over the “r” count: “I need more r’s.”
Why n-gram features?

• Adjust weights λ until the “r” counts match: giz vs. griz (“I need more r’s … somewhere.”)
• The next iteration agrees on all our unigram features: girz vs. griz.
  – Oops! Features matched only counts, not positions. But bigram counts are still wrong (“I need more gr, ri, iz; less gi, ir, rz”) …
  so bigram features get activated to save the day.
  – If that’s not enough, add even longer substrings …
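The negotiation on these two slides is just a comparison of n-gram count vectors, which a quick sketch can verify (illustrative helper, my own naming):

```python
from collections import Counter

def count_diff(x, y, n):
    """n-gram count differences phi(x) - phi(y), dropping zeros."""
    d = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    d.subtract(Counter(y[i:i + n] for i in range(len(y) - n + 1)))
    return {g: c for g, c in d.items() if c != 0}

# "giz" vs "griz": unigram counts agree on g, i, z; only "r" differs.
print(count_diff("griz", "giz", 1))   # {'r': 1}

# "girz" vs "griz": all unigram counts agree, but bigram counts differ:
# griz needs more gr, ri, iz and less gi, ir, rz.
print(count_diff("griz", "girz", 2))
```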
Outline

● A motivating example: phonology
● General framework:
  o Graphical models over strings
  o Inference on graphical models over strings
● Dual decomposition inference
  o The general idea
  o Substring features and active set
● Experiments and results
7 Inference Problems (graphs)

EXERCISE (small)
  o 4 languages: Catalan, English, Maori, Tangale
  o 16 to 55 underlying morphemes (= # variables, the unknown strings)
  o 55 to 106 surface words (= # subproblems)
CELEX (large)
  o 3 languages: English, German, Dutch
  o 341 to 381 underlying morphemes
  o 1000 surface words for each language
Experimental Questions

o Is exact inference by DD practical? Does it converge?
o Does it get better results than approximate inference methods?
o Does exact inference help EM?
● DD seeks the best λ via a subgradient algorithm: reducing the dual objective tightens the upper bound on the primal objective.
● If λ gets all subproblems to agree (x1 = … = xK), the constraints are satisfied, so the dual value is also the value of a primal solution, which must be the max primal! (and the min dual)

   primal (a function of strings x)  ≤  dual (a function of weights λ)
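In symbols, using standard dual decomposition notation (my reconstruction of the bound on the slide: K subproblems with copies x_k of the shared strings, substring-count features φ, and weights λ_k constrained to sum to zero):

```latex
\underbrace{\max_{x}\ \sum_{k=1}^{K} f_k(x)}_{\text{primal (function of strings } x\text{)}}
\;\le\;
\underbrace{L(\lambda) \;=\; \sum_{k=1}^{K}\ \max_{x_k}\ \Bigl[\, f_k(x_k) + \lambda_k \cdot \phi(x_k) \,\Bigr]}_{\text{dual (function of weights } \lambda\text{)}}
\qquad \text{with } \sum_{k=1}^{K} \lambda_k = 0 .
```

Because the λ_k sum to zero, any assignment with all copies equal makes the λ terms cancel, so L(λ) upper-bounds the primal; when the subproblems agree, the bound is tight, which is exactly the optimality certificate on the slide.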
Convergence behavior (full graph)

[Plots for Catalan, English, Maori, Tangale: the dual curve decreases (tightening the upper bound) while the primal curve increases (improving the strings); where the two meet, the solution is optimal!]
Comparisons

● Compare DD with two types of belief propagation (BP) inference:

  Exact marginal inference: we don’t know how!
    – via a variational approximation: approximate marginal inference (sum-product BP) (TACL 2015)
    – via the Viterbi approximation: exact MAP inference (dual decomposition) (this paper)
  Approximate MAP inference (max-product BP): the baseline
Inference accuracy

                      max-product BP   sum-product BP   dual decomposition
                      (baseline)       (TACL 2015)      (this paper, exact MAP)
Model 1, EXERCISE     90%              95%              97%
Model 1, CELEX        84%              86%              90%
Model 2S, CELEX       99%              96%              99%
Model 2E, EXERCISE    91%              95%              98%

Model 1 – trivial phonology; Model 2S – oracle phonology; Model 2E – learned phonology (inference used within EM).
Sum-product BP improves over the baseline (though it is worse on Model 2S); exact DD improves more!
Conclusion

• A general DD algorithm for MAP inference on graphical models over strings.
• On the phonology problem, it terminates in practice, guaranteeing the exact MAP solution.
• Improved inference for the supervised model; improved EM training for the unsupervised model.
• Try it for your own problems of generalizing to new strings!