Jason Eisner
NAACL Workshop Keynote – June 2009
Joint Models with Missing Data for Semi-Supervised Learning
2
Outline
1. Why use joint models?
2. Making big joint models tractable: Approximate inference and training by loopy belief propagation
3. Open questions: Semi-supervised training of joint models
3
The standard story
Task: x → y   (p(y|x) model)
Semi-sup. learning: Train on many (x,?) and a few (x,y)
4
Some running examples
Task: x → y   (p(y|x) model)
Semi-sup. learning: Train on many (x,?) and a few (x,y)
sentence → parse
lemma → morph. paradigm
E.g., in low-resource languages
(with David A. Smith)
(with Markus Dreyer)
5
Semi-supervised learning
Semi-sup. learning: Train on many (x,?) and a few (x,y)
Why would knowing p(x) help you learn p(y|x) ??
Shared parameters via joint model e.g., noisy channel:
p(x,y) = p(y) * p(x|y)
Estimate p(x,y) to have appropriate marginal p(x)
This affects the conditional distrib p(y|x)
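In equations (just restating the noisy-channel point made above):

```latex
p(y \mid x) \;=\; \frac{p(y)\,p(x \mid y)}{p(x)},
\qquad
p(x) \;=\; \sum_{y'} p(y')\,p(x \mid y') .
```

Fitting the marginal on the right to the unlabeled sample of p(x) adjusts the same parameters of p(y) and p(x|y) that appear in the numerator, which is why unlabeled x's can move p(y|x).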
6
sample of p(x)
7
For any x, can now recover the cluster c that probably generated it
A few supervised examples may let us predict y from c   (few params)
E.g., if p(x,y) = ∑_c p(x,y,c) = ∑_c p(c) p(y|c) p(x|c)   (joint model!)
sample of p(x)
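To see why recovering the cluster helps, note (a direct consequence of the factorization above, which makes y independent of x given c):

```latex
p(y \mid x) \;=\; \sum_c p(c \mid x)\, p(y \mid c),
\qquad
p(c \mid x) \;=\; \frac{p(c)\, p(x \mid c)}{\sum_{c'} p(c')\, p(x \mid c')} .
```

The unlabeled (x,?) examples estimate the clustering p(c|x); the few supervised (x,y) examples only need to estimate the few parameters of p(y|c).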
8
Semi-supervised learning
Semi-sup. learning: Train on many (x,?) and a few (x,y)
Why would knowing p(x) help you learn p(y|x) ??
Picture is misleading: No need to assume a distance metric (as in TSVM, label propagation, etc.)
But we do need to choose a model family for p(x,y)
Shared parameters via joint model e.g., noisy channel:
p(x,y) = p(y) * p(x|y)
Estimate p(x,y) to have appropriate marginal p(x)
This affects the conditional distrib p(y|x)
9
NLP + ML = ???
Task: x → y   (p(y|x) model)
x: structured input (may be only partly observed, so infer x, too)
y: structured output (so we already need joint inference for decoding, e.g., dynamic programming)
The p(y|x) model depends on features of <x,y> (sparse features?), or features of <x,z,y> where z are latent (so infer z, too)
10
Each task in a vacuum?
Task1: x1 → y1
Task2: x2 → y2
Task3: x3 → y3
Task4: x4 → y4
11
Solved tasks help later ones? (e.g., pipeline)
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
12
Feedback?
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
What if Task3 isn't solved yet and we have little <z2,z3> training data?
13
Feedback?
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
What if Task3 isn't solved yet and we have little <z2,z3> training data? Impute <z2,z3> given x1 and y4!
14
A later step benefits from many earlier ones?
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
15
A later step benefits from many earlier ones?
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
And conversely?
16
We end up with a Markov Random Field (MRF)
[Factor graph: variables x, z1, z2, z3, y connected by factors Φ1, Φ2, Φ3, Φ4]
17
Variable-centric, not task-centric
[Factor graph: variables x, z1, z2, z3, y; factors Φ1, Φ2, Φ3, Φ4, Φ5]
p(x, z1, z2, z3, y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
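To make the factorization concrete, here is a minimal toy sketch in Python; only the factor structure Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y) comes from the slide, while the binary domains and the numeric potentials are made-up placeholders (the real variables are structured objects).

```python
from itertools import product

# Toy domains; in the talk these variables are sentences, parses, strings, etc.
DOM = {"x": [0, 1], "z1": [0, 1], "z2": [0, 1], "z3": [0, 1], "y": [0, 1]}

# One nonnegative potential per factor, matching the factorization above.
def phi1(x, z1):          return 2.0 if x == z1 else 0.5
def phi2(z1, z2):         return 1.5 if z1 == z2 else 1.0
def phi3(x, z1, z2, z3):  return 1.0 + x + z1 + z2 + z3
def phi4(z3, y):          return 3.0 if z3 == y else 0.2
def phi5(y):              return 1.2 if y == 1 else 1.0

def unnormalized(a):
    """Product of all factor values on one joint assignment a."""
    return (phi1(a["x"], a["z1"]) * phi2(a["z1"], a["z2"]) *
            phi3(a["x"], a["z1"], a["z2"], a["z3"]) *
            phi4(a["z3"], a["y"]) * phi5(a["y"]))

# Z sums the product over all joint assignments (tractable only for toy domains).
names = list(DOM)
Z = sum(unnormalized(dict(zip(names, vals)))
        for vals in product(*(DOM[n] for n in names)))

a = {"x": 1, "z1": 1, "z2": 0, "z3": 1, "y": 1}
print("p(assignment) =", unnormalized(a) / Z)
```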
18
First, a familiar example: Conditional Random Field (CRF) for POS tagging
Familiar MRF example
……
find preferred tags
v v v
Possible tagging (i.e., assignment to remaining variables)
Observed input sentence (shaded)
19
Familiar MRF example: Conditional Random Field (CRF) for POS tagging
……
find preferred tags
v a n
Another possible tagging (i.e., assignment to remaining variables)
Observed input sentence (shaded)
20
Familiar MRF example: CRF
… find preferred tags …
"Binary" factor that measures compatibility of 2 adjacent tags (left tag = row, right tag = column):
      v   n   a
  v   0   2   1
  n   2   1   0
  a   0   3   1
Model reuses the same parameters at this position (the same table appears between every adjacent pair)
21
Familiar MRF example: CRF
… find preferred tags …
"Unary" factor evaluates this tag; its values depend on the corresponding word
  v 0.2   n 0.2   a 0   (can't be adj)
22
Familiar MRF example: CRF
… find preferred tags …
"Unary" factor evaluates this tag; its values depend on the corresponding word (could be made to depend on the entire observed sentence)
  v 0.2   n 0.2   a 0
23
Familiar MRF example: CRF
… find preferred tags …
"Unary" factor evaluates this tag; a different unary factor at each position:
  find:       v 0.3   n 0.02   a 0
  preferred:  v 0.3   n 0      a 0.1
  tags:       v 0.2   n 0.2    a 0
24
Familiar MRF example: CRF
… find preferred tags …
Tagging: v a n
Factors: the binary tag-compatibility table between each adjacent pair, plus the per-word unary factors (find: v 0.3, n 0.02, a 0;  preferred: v 0.3, n 0, a 0.1;  tags: v 0.2, n 0.2, a 0)
p(v a n) is proportional to the product of all factors' values on "v a n"
25
Familiar MRF example: CRF
… find preferred tags …
Tagging: v a n
p(v a n) is proportional to the product of all factors' values on "v a n"
  = … 1 * 3 * 0.3 * 0.1 * 0.2 …
NOTE: This is not just a pipeline of single-tag prediction tasks (which might work ok in the well-trained supervised case …)
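Here is a minimal sketch of that arithmetic, using only the three words and the factor tables shown on these slides (a real CRF would also include the factors in the "…" context):

```python
# Unary factor tables from the slides: word -> {tag: value}
UNARY = {
    "find":      {"v": 0.3, "n": 0.02, "a": 0.0},
    "preferred": {"v": 0.3, "n": 0.0,  "a": 0.1},
    "tags":      {"v": 0.2, "n": 0.2,  "a": 0.0},
}
# Binary factor (same table reused at every adjacent pair): (left tag, right tag) -> value
BINARY = {
    ("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
    ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
    ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1,
}

def score(words, tags):
    """Unnormalized probability: product of all factor values on this tagging."""
    s = 1.0
    for w, t in zip(words, tags):
        s *= UNARY[w][t]
    for left, right in zip(tags, tags[1:]):
        s *= BINARY[(left, right)]
    return s

words = ["find", "preferred", "tags"]
print(score(words, ["v", "a", "n"]))  # ~0.018 = 1 * 3 * 0.3 * 0.1 * 0.2, the product above
print(score(words, ["v", "v", "v"]))  # 0.0: adjacent (v, v) pairs get binary value 0
```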
26
Task-centered view of the world
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
27
Variable-centered view of the world
[Factor graph: variables x, z1, z2, z3, y; factors Φ1, Φ2, Φ3, Φ4, Φ5]
p(x, z1, z2, z3, y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
28
Variable-centric, not task-centric
Throw in any variables that might help! Model and exploit correlations
29
[Figure: many variables and relations one could model jointly: lexicon (word types), semantics, sentences (N tokens), discourse context, resources, entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings, typos, formatting, entanglement, annotation]
30
Back to our (simpler!) running examples
sentence → parse
lemma → morph. paradigm
(with David A. Smith)
(with Markus Dreyer)
31
Parser projection
sentence → parse   (little direct training data)
translation → parse of translation   (much more training data)
(with David A. Smith)
32
Parser projection
Auf diese Frage habe ich leider keine Antwort bekommen
I did not unfortunately receive an answer to this question
33
Parser projection
sentence → parse   (little direct training data)
word-to-word alignment
translation → parse of translation   (much more training data)
34
Parser projection
Auf diese Frage habe ich leider keine Antwort bekommen
I did not unfortunately receive an answer to this question
NULL
35
Parser projection
sentence → parse   (little direct training data)
word-to-word alignment
translation → parse of translation   (much more training data)
need an interesting model
36
Parses are not entirely isomorphic
Auf diese Frage habe ich leider keine Antwort bekommen
I did not unfortunately receive an answer to this question
NULL
(alignment configurations: monotonic, null, head-swapping, siblings)
37
Dependency Relations
+ “none of the above”
38
Parser projection
sentence parse
translation
word-to-word alignment
parse of translation
Typical test data (no translation observed):
39
sentence parse
translation
word-to-word alignment
parse of translation
Small supervised training set (treebank):
Parser projection
40
Parser projection
sentence parse
translation
word-to-word alignment
parse of translation
Moderate treebank in other language:
41
sentence parse
translation
word-to-word alignment
parse of translation
Maybe a few gold alignments:
Parser projection
42
sentence parse
translation
word-to-word alignment
parse of translation
Lots of raw bitext:
Parser projection
43
sentence parse
translation
word-to-word alignment
parse of translation
Given bitext,
Parser projection
44
sentence parse
translation
word-to-word alignment
parse of translation
Given bitext, try to impute other variables:
Parser projection
45
sentence parse
translation
word-to-word alignment
parse of translation
Given bitext, try to impute other variables: Now we have more constraints on the parse …
Parser projection
46
sentence parse
translation
word-to-word alignment
parse of translation
Given bitext, try to impute other variables: Now we have more constraints on the parse … which should help us train the parser.
Parser projection
We’ll see how belief propagation naturally handles this.
47
English does help us impute Chinese parse
中国 在 基本 建设 方面 , 开始 利用 国际 金融 组织 的 贷款 进行 国际性 竞争性 招标 采购
In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement
China: in: infrastructure: construction: area:, : has begun: to utilize: international: financial: organizations: ‘s: loans: to implement: international: competitive: bidding: procurement
Seeing noisy output of an English WSJ parser fixes these Chinese links
The corresponding bad versions found without seeing the English parse
Subject attaches to intervening noun
N P J N N , V V N N N ‘s N V J N N N
Complement verbs swap objects
48
Which does help us train a monolingual Chinese parser
49
(Could add a 3rd language …)
sentence → parse
translation → parse of translation
translation' → parse of translation'
alignment (one per language pair)
50
world
sentence parse
translation
word-to-word alignment
parse of translation
(Could add world knowledge …)
51
(Could add bilingual dictionary …)
sentence parse
translation
word-to-word alignment
parse of translation
dict (since incomplete, treat as a partially observed variable)
N
52
sentence → parse
translation → parse of translation
alignment
Auf diese Frage habe ich leider keine Antwort bekommen
I did not unfortunately receive an answer to this question
NULL
Dynamic Markov Random Field
Note: These are structured vars. Each is expanded into a collection of fine-grained variables (words, dependency links, alignment links, …)
Thus, the # of fine-grained variables & factors varies by example (but all examples share a single finite parameter vector)
53
Back to our running examples
sentence → parse   (with David A. Smith)
lemma → morph. paradigm   (with Markus Dreyer)
54
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
Morphological paradigm
55
Morphological paradigm
          Present   Past
inf       werfen
1st Sg    werfe     warf
2nd Sg    wirfst    warfst
3rd Sg    wirft     warf
1st Pl    werfen    warfen
2nd Pl    werft     warft
3rd Pl    werfen    warfen
56
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
Morphological paradigm as MRF
Each factor is a sophisticated weighted FST
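To make this concrete, here is a toy sketch of an MRF over string-valued paradigm cells. The cell names, the choice of which pairs get a factor, and the plain edit-distance potential are illustrative stand-ins; in the actual model each factor is a learned weighted FST.

```python
from math import exp

CELLS = ["inf", "1sg_pres", "2sg_pres", "3sg_pres"]           # a few paradigm cells
FACTORS = [("inf", "1sg_pres"), ("1sg_pres", "2sg_pres"),      # which pairs of cells
           ("2sg_pres", "3sg_pres")]                           # are directly related

def edit_distance(a, b):
    """Plain Levenshtein distance (stand-in for a learned weighted-FST score)."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

def factor(a, b):
    return exp(-edit_distance(a, b))   # similar strings get a higher potential

def joint_score(assignment):
    """Unnormalized probability of one way of filling in the whole paradigm."""
    s = 1.0
    for u, v in FACTORS:
        s *= factor(assignment[u], assignment[v])
    return s

good = {"inf": "werfen", "1sg_pres": "werfe", "2sg_pres": "wirfst", "3sg_pres": "wirft"}
odd  = {"inf": "werfen", "1sg_pres": "gackere", "2sg_pres": "wirfst", "3sg_pres": "wirft"}
print(joint_score(good) > joint_score(odd))   # True: coherent paradigms score higher
```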
57
# observations per form (fine-grained semi-supervision)
          Present   Past
inf       9,393
1st Sg    285       1124
2nd Sg    166       4      (rare!)
3rd Sg    1410      1124
1st Pl    1688      673
2nd Pl    1275      9      (rare!)
3rd Pl    1688      673
(rare forms are undertrained)
Question: Does joint inference help?
58
gelten 'to hold, to apply'
          Present   Past
inf       gelten
1st Sg    gelte     galt
2nd Sg    giltst    galtst (or: galtest)
3rd Sg    gilt      galt
1st Pl    gelten    galten
2nd Pl    geltet    galtet
3rd Pl    gelten    galten
[Figure annotations pairing forms: *geltst / giltst, geltet / geltet, galtst / galtest, *gltt / galtet]
59
abbrechen 'to quit'
          Present                      Past
inf       abbrechen
1st Sg    abbreche (or: breche ab)     abbrach (or: brach ab)
2nd Sg    abbrichst (or: brichst ab)   abbrachst (or: brachst ab)
3rd Sg    abbricht (or: bricht ab)     abbrach (or: brach ab)
1st Pl    abbrechen (or: brechen ab)   abbrachen (or: brachen ab)
2nd Pl    abbrecht (or: brecht ab)     abbracht (or: bracht ab)
3rd Pl    abbrechen (or: brechen ab)   abbrachen (or: brachen ab)
[Figure annotations pairing forms: *abbrachten / abbrachen, abbreche / abbreche, abbracht / abbracht, abbrecht / abbrecht, *atttrachst / abbrachst, *abbrechst / abbrichst, abbricht / abbricht]
60
gackern 'to cackle'
          Present   Past
inf       gackern
1st Sg    gackere   gackerte
2nd Sg    gackerst  gackertest
3rd Sg    gackert   gackerte
1st Pl    gackern   gackerten
2nd Pl    gackert   gackertet
3rd Pl    gackern   gackerten
[Figure annotations pairing forms: *gackrt / gackertet, *gackart / gackertest, gackere / gackere, gackerst / gackerst, gackert / gackert, gackern / gackern, gackerte / gackerte, gackerten / gackerten]
61
werfen 'to throw'
          Present   Past
inf       werfen
1st Sg    werfe     warf
2nd Sg    wirfst    warfst
3rd Sg    wirft     warf
1st Pl    werfen    warfen
2nd Pl    werft     warft
3rd Pl    werfen    warfen
[Figure annotations pairing forms: warft / warft, *werfst / wirfst, werft / werft, warfst / warfst]
62
Preliminary results …
joint inference helps a lot on the rare forms
Hurts on the others. Can we fix?? (Is it because our joint decoder is approx? Or because semi-supervised training is hard and we need a better method for it?)
63
Outline
1. Why use joint models in NLP?
2. Making big joint models tractable: Approximate inference and training by loopy belief propagation
3. Open questions: Semi-supervised training of joint models
64
Key Idea! We're using an MRF to coordinate the solutions to several NLP problems
Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently over many fine-grained variables (individual words, tags, links)
Within a factor, use existing fast exact NLP algorithms. These are the "propagators" that compute outgoing messages, even though the product of factors may be intractable or even undecidable to work with
65
MRFs great for n-way classification (maxent)
Also good for predicting sequences
Also good for dependency parsing
Why we need approximate inference
alas, the forward-backward algorithm only allows n-gram features
alas, our combinatorial algorithms only allow single-edge features (more interactions slow them down or introduce NP-hardness)
…find preferred links…
find preferred tags
v a n
66
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: soldiers in single file; each passes along counts like "1 behind you" … "5 behind you" and "1 before you" … "5 before you", plus "there's 1 of me"]
adapted from MacKay (2003) textbook
67
Great Ideas in ML: Message Passing
Count the soldiers
Incoming messages: "3 behind you", "2 before you"; plus "there's 1 of me"
Belief: Must be 2 + 1 + 3 = 6 of us (only see my incoming messages)
adapted from MacKay (2003) textbook
68
Great Ideas in ML: Message Passing
Count the soldiers
Incoming messages: "4 behind you", "1 before you"; plus "there's 1 of me"
Belief: Must be 1 + 1 + 4 = 6 of us (only see my incoming messages)
adapted from MacKay (2003) textbook
69
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
"7 here" + "3 here" + 1 of me → "11 here" (= 7 + 3 + 1)
adapted from MacKay (2003) textbook
70
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
"3 here" + "3 here" → "7 here" (= 3 + 3 + 1)
adapted from MacKay (2003) textbook
71
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
"7 here" + "3 here" → "11 here" (= 7 + 3 + 1)
adapted from MacKay (2003) textbook
72
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
"7 here" + "3 here" + "3 here" → Belief: Must be 14 of us
adapted from MacKay (2003) textbook
73
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
"7 here" + "3 here" + "3 here" → Belief: Must be 14 of us
wouldn't work correctly with a "loopy" (cyclic) graph
adapted from MacKay (2003) textbook
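A minimal sketch of this counting scheme (my own illustration of MacKay's example, with a made-up 6-soldier tree): each soldier sends each neighbor "1 + the sum of what I heard from my other neighbors", and every soldier's belief comes out to the same total.

```python
from collections import defaultdict
from functools import lru_cache

# Undirected tree of 6 soldiers (any tree works; a loopy graph would break this).
edges = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

@lru_cache(maxsize=None)
def message(src, dst):
    """What src tells dst: 'counting me, this many soldiers are on my side of our edge'."""
    return 1 + sum(message(nbr, src) for nbr in adj[src] if nbr != dst)

# Each soldier's belief = itself + the sum of its incoming messages = total head count.
for node in sorted(adj):
    belief = 1 + sum(message(nbr, node) for nbr in adj[node])
    print(f"soldier {node}: must be {belief} of us")   # 6 for everyone
```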
74
Great ideas in ML: Forward-Backward
… find preferred tags …
In the CRF, message passing = forward-backward
Incoming at the middle variable: messages α and β, (v 2, n 1, a 7) and (v 3, n 1, a 6), and its unary factor (v 0.3, n 0, a 0.1)
Belief = pointwise product of the unary factor and the incoming messages: (v 1.8, n 0, a 4.2)
(the figure also shows other message tables, v 7, n 2, a 1 and v 3, n 6, a 1, and the binary tag-compatibility table)
75
Great ideas in ML: Forward-Backward
Extend CRF to "skip chain" to capture a non-local factor. More influences on belief
… find preferred tags …
The middle variable now also receives a message (v 3, n 1, a 6) from the skip-chain factor, besides the chain messages (v 2, n 1, a 7) and (v 3, n 1, a 6) and its unary factor (v 0.3, n 0, a 0.1)
Belief becomes (v 5.4, n 0, a 25.2)
76
Great ideas in ML: Forward-Backward
Extend CRF to "skip chain" to capture a non-local factor. More influences on belief. Graph becomes loopy
… find preferred tags …
(same messages as before; belief (v 5.4, n 0, a 25.2))
Red messages not independent? Pretend they are!
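A quick numeric check of these slides: the belief at a variable is just the pointwise product of its unary factor and all incoming messages, using the tables shown above.

```python
from math import prod

TAGS = ("v", "n", "a")

def belief(*tables):
    """Unnormalized belief: pointwise product of the unary factor and incoming messages."""
    return {t: prod(tbl[t] for tbl in tables) for t in TAGS}

unary = {"v": 0.3, "n": 0.0, "a": 0.1}   # unary factor at "preferred"
msg1  = {"v": 2.0, "n": 1.0, "a": 7.0}   # message from one chain neighbor
msg2  = {"v": 3.0, "n": 1.0, "a": 6.0}   # message from the other chain neighbor
skip  = {"v": 3.0, "n": 1.0, "a": 6.0}   # extra message once the skip-chain factor is added

print(belief(unary, msg1, msg2))         # ~ (v 1.8, n 0, a 4.2), as on the chain slide
print(belief(unary, msg1, msg2, skip))   # ~ (v 5.4, n 0, a 25.2), as with the skip chain
```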
77
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
MRF over string-valued variables!
Each factor is a sophisticated weighted FST
78
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
Each factor is a sophisticated weighted FST
MRF over string-valued variables!
What are these messages? Probability distributions over strings …
Represented by weighted FSAs; constructed by finite-state operations
Parameters trainable using finite-state methods
Warning: FSAs can get larger and larger; must prune back using k-best or variational approx
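A toy stand-in for what these messages do, using ordinary dictionaries over a few candidate strings instead of real weighted FSAs; the verb forms and weights are illustrative only, and the pruning step mimics the k-best approximation mentioned above.

```python
import heapq

def combine(msg_a, msg_b):
    """Pointwise product of two messages over strings (toy stand-in for
    intersecting two weighted FSAs: keep strings both messages like)."""
    return {s: msg_a[s] * msg_b[s] for s in msg_a.keys() & msg_b.keys()}

def prune_k_best(msg, k):
    """Approximate a message by keeping only its k highest-weighted strings."""
    return dict(heapq.nlargest(k, msg.items(), key=lambda kv: kv[1]))

# Two incoming messages about a 2nd Sg present form (illustrative weights):
m1 = {"geltst": 0.5, "giltst": 0.4, "gelst": 0.1}
m2 = {"giltst": 0.7, "geltst": 0.2, "gildst": 0.1}
print(prune_k_best(combine(m1, m2), k=2))   # ~ {'giltst': 0.28, 'geltst': 0.1}
```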
79
Key Idea! We're using an MRF to coordinate the solutions to several NLP problems
Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently over many fine-grained variables (individual words, tags, links)
Within a factor, use existing fast exact NLP algorithms. These are the "propagators" that compute outgoing messages, even though the product of factors may be intractable or even undecidable to work with
We just saw this for morphology; now let's see it for parsing
80
Back to simple variables … CRF for POS tagging
Now let's do dependency parsing! O(n²) boolean variables for the possible links
v a n
Local factors in a graphical model
find preferred links ……
81
Back to simple variables … CRF for POS tagging
Now let's do dependency parsing! O(n²) boolean variables for the possible links
Local factors in a graphical model
… find preferred links …
[Figure: each possible link variable assigned t or f]
Possible parse, encoded as an assignment to these vars
v a n
82
Back to simple variables … CRF for POS tagging
Now let's do dependency parsing! O(n²) boolean variables for the possible links
Local factors in a graphical model
… find preferred links …
[Figure: a different t/f assignment to the link variables]
Another possible parse
v a n
83
Back to simple variables … CRF for POS tagging
Now let's do dependency parsing! O(n²) boolean variables for the possible links
Local factors in a graphical model
… find preferred links …
[Figure: a t/f assignment whose links contain a cycle]
An illegal parse (cycle)
v a n
84
Back to simple variables … CRF for POS tagging
Now let's do dependency parsing! O(n²) boolean variables for the possible links
Local factors in a graphical model
… find preferred links …
[Figure: a t/f assignment where some word has multiple parents]
Another illegal parse (multiple parents)
v a n
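A minimal sketch of the constraint these slides illustrate: an assignment to the link variables is a legal parse only if every word gets exactly one parent and no cycle appears. The helper below is my own illustration, not code from the talk.

```python
def is_legal_parse(n, links):
    """links: set of (head, child) pairs over words 1..n; 0 is the root.
    Legal iff every word has exactly one parent and there is no cycle."""
    parent = {}
    for head, child in links:
        if child in parent:          # "multiple parents" case from the slide
            return False
        parent[child] = head
    if set(parent) != set(range(1, n + 1)):
        return False                 # some word has no parent at all
    for child in range(1, n + 1):    # walk up; must reach the root, not loop
        seen, node = set(), child
        while node != 0:
            if node in seen:
                return False         # "cycle" case from the slide
            seen.add(node)
            node = parent[node]
    return True

# "find preferred links": word 1 = find, 2 = preferred, 3 = links
print(is_legal_parse(3, {(0, 1), (1, 3), (3, 2)}))           # True: a tree
print(is_legal_parse(3, {(0, 1), (2, 3), (3, 2)}))           # False: 2 and 3 form a cycle
print(is_legal_parse(3, {(0, 1), (1, 2), (3, 2), (1, 3)}))   # False: 2 has two parents
```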
104
So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation
Local factors for parsing
… find preferred links …
[Figure: one unary factor per link, e.g. t 2 / f 1,  t 1 / f 2,  t 1 / f 6,  t 1 / f 3,  t 1 / f 8]
as before, goodness of this link can depend on the entire observed input context; some other links aren't as good given this input sentence
But what if the best assignment isn't a tree??
105
Global factors for parsing
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree
  this is a "hard constraint": factor is either 0 or 1
… find preferred links …
TREE factor (one row per joint assignment of all link variables):
  ffffff  0
  ffffft  0
  fffftf  0
  …
  fftfft  1
  …
  tttttt  0
106
Global factors for parsing
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree
  this is a "hard constraint": factor is either 0 or 1
… find preferred links …
TREE factor table: 64 entries (0/1); the assignment shown is legal ("we're legal!")
So far, this is equivalent to edge-factored parsing (McDonald et al. 2005).
Note: McDonald et al. (2005) don't loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we!
Optionally require the tree to be projective (no crossing links)
107
Local factors for parsing
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree ("hard constraint": 0 or 1)
Second-order effects: factors on 2 variables, e.g. grandparent
… find preferred links …
Grandparent factor on a pair of links:
      f   t
  f   1   1
  t   1   3
108
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree ("hard constraint": 0 or 1)
Second-order effects: factors on 2 variables, e.g. grandparent, no-cross
Local factors for parsing
… find preferred links … by …
No-cross factor on a pair of links:
      f   t
  f   1   1
  t   1   0.2
109
Local factors for parsing
… find preferred links … by …
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree ("hard constraint": 0 or 1)
Second-order effects: factors on 2 variables, e.g. grandparent, no-cross, coordination with other parse & alignment, hidden POS tags, siblings, subcategorization, …
110
Exactly Finding the Best Parse
With arbitrary features, runtime blows up
Projective parsing: O(n³) by dynamic programming; non-projective: O(n²) by minimum spanning tree
  but to allow fast dynamic programming or MST parsing, only use single-edge features
… find preferred links …
  grandparents: O(n⁴)
  grandparents + sibling bigrams: O(n⁵)
  POS trigrams: O(n³g⁶)
  sibling pairs (non-adjacent): … O(2ⁿ)
  NP-hard: any of the above features, soft penalties for crossing links, pretty much anything else!
111
Two great tastes that taste great together
You got dynamic
programming in my belief propagation!
You got belief propagation in my dynamic
programming!
113
Loopy Belief Propagation for Parsing
… find preferred links …
Sentence tells word 3, "Please be a verb"
Word 3 tells the 3→7 link, "Sorry, then you probably don't exist"
The 3→7 link tells the TREE factor, "You'll have to find another parent for 7"
The TREE factor tells the 10→7 link, "You're on!"
The 10→7 link tells word 10, "Could you please be a noun?"
…
114
Loopy Belief Propagation for Parsing
… find preferred links …
Higher-order factors (e.g., Grandparent) induce loops
Let's watch a loop around one triangle … Strong links are suppressing or promoting other links …
115
Loopy Belief Propagation for Parsing
… find preferred links …
Higher-order factors (e.g., Grandparent) induce loops. Let's watch a loop around one triangle …
How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
TREE factor:  ffffff 0,  ffffft 0,  fffftf 0,  …,  fftfft 1,  …,  tttttt 0
116
Loopy Belief Propagation for Parsing
… find preferred links …
How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
But this is the outside probability of the green link!
TREE factor computes all outgoing messages at once (given all incoming messages)
Projective case: total O(n³) time by inside-outside
Non-projective: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007)
117
Loopy Belief Propagation for Parsing
How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
But this is the outside probability of the green link!
TREE factor computes all outgoing messages at once (given all incoming messages)
Projective case: total O(n³) time by inside-outside
Non-projective: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007)
Belief propagation assumes incoming messages to TREE are independent. So outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
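For the non-projective case, the Kirchhoff-matrix computation referred to here is the Matrix-Tree Theorem: the total weight of all spanning trees is a determinant of a minor of the Laplacian built from the link weights, and edge marginals (hence outgoing messages) follow by differentiating log Z. A minimal numpy sketch with made-up weights, not the talk's code:

```python
import numpy as np

def spanning_tree_partition(W):
    """W[h, m] = weight of a potential link h -> m (head -> modifier); node 0 is the root.
    Returns Z = total weight of all spanning trees rooted at 0, by the Matrix-Tree Theorem."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    W[:, 0] = 0.0                    # nothing may point at the root
    L = np.diag(W.sum(axis=0)) - W   # Laplacian: incoming-weight totals minus weights
    return np.linalg.det(L[1:, 1:])  # delete the root's row and column

# Tiny example: root 0 and words 1, 2.
W = np.array([[0.0, 2.0, 1.0],
              [0.0, 0.0, 3.0],
              [0.0, 4.0, 0.0]])
# Trees: {0->1, 0->2}: 2*1;  {0->1, 1->2}: 2*3;  {0->2, 2->1}: 1*4.
print(spanning_tree_partition(W))    # ~ 12.0 = 2 + 6 + 4
```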
118
Some interesting connections …
Parser stacking (Nivre & McDonald 2008; Martins et al. 2008)
Global constraints in arc consistency: ALLDIFFERENT constraint (Régin 1994)
Matching constraint in max-product BP: for computer vision (Duchi et al., 2006); could be used for machine translation
As far as we know, our parser is the first use of global constraints in sum-product BP.
And nearly the first use of BP in natural language processing.
119
Runtimes for each factor type (see paper)
Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(n)
TOTAL                                          O(n³) per iteration
+ = Additive, not multiplicative!
120
Runtimes for each factor type (see paper)
Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(n)
TOTAL                                          O(n³)
+ = Additive, not multiplicative!
Each "global" factor coordinates an unbounded # of variables
Standard belief propagation would take exponential time to iterate over all configurations of those variables
See paper for efficient propagators
121
Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)
                 Danish   Dutch   English
Tree+Link        85.5     87.3    88.6
+NoCross         86.1     88.3    89.1
+Grandparent     86.1     88.6    89.4
+ChildSeq        86.5     88.5    90.1
122
Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)
                                         Danish   Dutch   English
Tree+Link                                85.5     87.3    88.6
+NoCross                                 86.1     88.3    89.1
+Grandparent                             86.1     88.6    89.4
+ChildSeq                                86.5     88.5    90.1
Best projective parse with all factors   86.0     84.5    90.2    (exact, slow)
+hill-climbing                           86.1     87.6    90.2    (doesn't fix enough edges)
123
Time vs. Projective Search Error
[Figure: search error vs. runtime over BP iterations, compared with O(n⁴) DP and O(n⁵) DP]
125
Summary of MRF parsing by BP
Output probability defined as product of local and global factors. Throw in any factors we want! (log-linear model) Each factor must be fast, but they run independently
Let local factors negotiate via "belief propagation". Each bit of syntactic structure is influenced by others. Some factors need combinatorial algorithms to compute messages fast, e.g., existing parsing algorithms using dynamic programming
Each iteration takes total time O(n³) or even O(n²); see paper. Compare reranking or stacking
Converges to a pretty good (but approximate) global parse. Fast parsing for formerly intractable or slow models. Extra features of these models really do help accuracy
126
Outline
1. Why use joint models in NLP?
2. Making big joint models tractable: Approximate inference and training by loopy belief propagation
3. Open questions: Semi-supervised training of joint models
127
Training with missing data is hard!
Semi-supervised learning of HMMs or PCFGs: ouch!
  Merialdo: Just stick with the small supervised training set. Adding unsupervised data tends to hurt
  A stronger model helps (McClosky et al. 2007, Cohen et al. 2009)
So maybe some hope from good models at the factors, and from having lots of factors (i.e., take cues from lots of correlated variables at once; cf. Yarowsky et al.)
Naïve Bayes would be okay …
  Variables with unknown values can't hurt you. They have no influence on training or decoding.
  But they can't help you, either! And the independence assumptions are flaky. So I'd like to keep discussing joint models …
128
Case #1: Missing data that you can’t impute
sentence parse
translation
word-to-word alignment
parse of translation
Treat like multi-task learning? Shared features between 2 tasks: parse Chinese vs. parse Chinese w/ English translation. Or 3 tasks: parse Chinese w/ inferred English gist vs. parse Chinese w/ English translation vs. parse English gist derived from English (supervised)
129
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
Case #2: Missing data you can impute, but maybe badly
Each factor is a sophisticated weighted FST
130
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
Case #2: Missing data you can impute, but maybe badly
Each factor is a sophisticated weighted FST
This is where simple cases of EM go wrong
Could reduce to case #1 and throw away these variables
Or: Damp messages from imputed variables to the extent you're not confident in them
  Requires confidence estimation (cf. strapping)
  Crude versions: Confidence depends in a fixed way on time, or on entropy of belief at that node, or on length of input sentence.
  But could train a confidence estimator on supervised data to pay attention to all sorts of things!
Correspondingly, scale up features for related missing-data tasks since the damped data are "partially missing"
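As one concrete (hypothetical) reading of the damping idea, not necessarily the scheme intended here: interpolate an imputed variable's outgoing message toward the uninformative uniform message, with the interpolation weight set by a confidence estimate.

```python
def damp_message(msg, confidence):
    """Interpolate a message toward uniform: confidence = 1 keeps it as is,
    confidence = 0 makes the imputed variable silent (uninformative)."""
    uniform = 1.0 / len(msg)
    return {v: confidence * p + (1.0 - confidence) * uniform for v, p in msg.items()}

# A low-confidence imputed value barely influences its neighbors:
msg = {"v": 0.7, "n": 0.2, "a": 0.1}
print(damp_message(msg, confidence=0.9))  # close to the original message
print(damp_message(msg, confidence=0.1))  # close to uniform ("partially missing")
```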