Jason Eisner
NAACL Workshop Keynote – June 2009
Joint Models with Missing Data for Semi-Supervised Learning
2
Outline
1. Why use joint models?
2. Making big joint models tractable: Approximate inference and training by loopy belief propagation
3. Open questions: Semi-supervised training of joint models
3
The standard story
Task: x → y   (p(y|x) model)
Semi-sup. learning: Train on many (x,?) and a few (x,y)
4
Some running examples
Task: x → y   (p(y|x) model)
Semi-sup. learning: Train on many (x,?) and a few (x,y)
sentence → parse
lemma → morph. paradigm
E.g., in low-resource languages
(with David A. Smith)
(with Markus Dreyer)
5
Semi-supervised learning
Semi-sup. learning: Train on many (x,?) and a few (x,y)
Why would knowing p(x) help you learn p(y|x) ??
Shared parameters via joint model e.g., noisy channel:
p(x,y) = p(y) * p(x|y)
Estimate p(x,y) to have appropriate marginal p(x)
This affects the conditional distrib p(y|x)
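In equations (just restating the noisy-channel point made above):

```latex
p(y \mid x) \;=\; \frac{p(y)\,p(x \mid y)}{p(x)},
\qquad
p(x) \;=\; \sum_{y'} p(y')\,p(x \mid y') .
```

Fitting the marginal on the right to the unlabeled sample of p(x) adjusts the same parameters of p(y) and p(x|y) that appear in the numerator, which is why unlabeled x's can move p(y|x).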
6
sample of p(x)
7
For any x, can now recover the cluster c that probably generated it
A few supervised examples may let us predict y from c   (few params)
E.g., if p(x,y) = ∑_c p(x,y,c) = ∑_c p(c) p(y|c) p(x|c)   (joint model!)
sample of p(x)
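To see why recovering the cluster helps, note (a direct consequence of the factorization above, which makes y independent of x given c):

```latex
p(y \mid x) \;=\; \sum_c p(c \mid x)\, p(y \mid c),
\qquad
p(c \mid x) \;=\; \frac{p(c)\, p(x \mid c)}{\sum_{c'} p(c')\, p(x \mid c')} .
```

The unlabeled (x,?) examples estimate the clustering p(c|x); the few supervised (x,y) examples only need to estimate the few parameters of p(y|c).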
8
Semi-supervised learning
Semi-sup. learning: Train on many (x,?) and a few (x,y)
Why would knowing p(x) help you learn p(y|x) ??
Picture is misleading: No need to assume a distance metric (as in TSVM, label propagation, etc.)
But we do need to choose a model family for p(x,y)
Shared parameters via joint model e.g., noisy channel:
p(x,y) = p(y) * p(x|y)
Estimate p(x,y) to have appropriate marginal p(x)
This affects the conditional distrib p(y|x)
9
NLP + ML = ???
Task: x → y   (p(y|x) model)
x: structured input (may be only partly observed, so infer x, too)
y: structured output (so we already need joint inference for decoding, e.g., dynamic programming)
The p(y|x) model depends on features of <x,y> (sparse features?), or features of <x,z,y> where z are latent (so infer z, too)
10
Each task in a vacuum?
Task1: x1 → y1
Task2: x2 → y2
Task3: x3 → y3
Task4: x4 → y4
11
Solved tasks help later ones? (e.g., pipeline)
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
12
Feedback?
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
What if Task3 isn't solved yet and we have little <z2,z3> training data?
13
Feedback?
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
What if Task3 isn't solved yet and we have little <z2,z3> training data? Impute <z2,z3> given x1 and y4!
14
A later step benefits from many earlier ones?
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
15
A later step benefits from many earlier ones?
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
And conversely?
16
We end up with a Markov Random Field (MRF)
[Factor graph: variables x, z1, z2, z3, y connected by factors Φ1, Φ2, Φ3, Φ4]
17
Variable-centric, not task-centric
[Factor graph: variables x, z1, z2, z3, y; factors Φ1, Φ2, Φ3, Φ4, Φ5]
p(x, z1, z2, z3, y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
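To make the factorization concrete, here is a minimal toy sketch in Python; only the factor structure Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y) comes from the slide, while the binary domains and the numeric potentials are made-up placeholders (the real variables are structured objects).

```python
from itertools import product

# Toy domains; in the talk these variables are sentences, parses, strings, etc.
DOM = {"x": [0, 1], "z1": [0, 1], "z2": [0, 1], "z3": [0, 1], "y": [0, 1]}

# One nonnegative potential per factor, matching the factorization above.
def phi1(x, z1):          return 2.0 if x == z1 else 0.5
def phi2(z1, z2):         return 1.5 if z1 == z2 else 1.0
def phi3(x, z1, z2, z3):  return 1.0 + x + z1 + z2 + z3
def phi4(z3, y):          return 3.0 if z3 == y else 0.2
def phi5(y):              return 1.2 if y == 1 else 1.0

def unnormalized(a):
    """Product of all factor values on one joint assignment a."""
    return (phi1(a["x"], a["z1"]) * phi2(a["z1"], a["z2"]) *
            phi3(a["x"], a["z1"], a["z2"], a["z3"]) *
            phi4(a["z3"], a["y"]) * phi5(a["y"]))

# Z sums the product over all joint assignments (tractable only for toy domains).
names = list(DOM)
Z = sum(unnormalized(dict(zip(names, vals)))
        for vals in product(*(DOM[n] for n in names)))

a = {"x": 1, "z1": 1, "z2": 0, "z3": 1, "y": 1}
print("p(assignment) =", unnormalized(a) / Z)
```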
18
First, a familiar example: Conditional Random Field (CRF) for POS tagging
Familiar MRF example
……
find preferred tags
v v v
Possible tagging (i.e., assignment to remaining variables)
Observed input sentence (shaded)
19
Familiar MRF example: Conditional Random Field (CRF) for POS tagging
……
find preferred tags
v a n
Another possible tagging (i.e., assignment to remaining variables)
Observed input sentence (shaded)
20
Familiar MRF example: CRF
… find preferred tags …
"Binary" factor that measures compatibility of 2 adjacent tags (left tag = row, right tag = column):
      v   n   a
  v   0   2   1
  n   2   1   0
  a   0   3   1
Model reuses the same parameters at this position (the same table appears between every adjacent pair)
21
Familiar MRF example: CRF
… find preferred tags …
"Unary" factor evaluates this tag; its values depend on the corresponding word
  v 0.2   n 0.2   a 0   (can't be adj)
22
Familiar MRF example: CRF
… find preferred tags …
"Unary" factor evaluates this tag; its values depend on the corresponding word (could be made to depend on the entire observed sentence)
  v 0.2   n 0.2   a 0
23
Familiar MRF example: CRF
… find preferred tags …
"Unary" factor evaluates this tag; a different unary factor at each position:
  find:       v 0.3   n 0.02   a 0
  preferred:  v 0.3   n 0      a 0.1
  tags:       v 0.2   n 0.2    a 0
24
Familiar MRF example: CRF
… find preferred tags …
Tagging: v a n
Factors: the binary tag-compatibility table between each adjacent pair, plus the per-word unary factors (find: v 0.3, n 0.02, a 0;  preferred: v 0.3, n 0, a 0.1;  tags: v 0.2, n 0.2, a 0)
p(v a n) is proportional to the product of all factors' values on "v a n"
25
Familiar MRF example: CRF
… find preferred tags …
Tagging: v a n
p(v a n) is proportional to the product of all factors' values on "v a n"
  = … 1 * 3 * 0.3 * 0.1 * 0.2 …
NOTE: This is not just a pipeline of single-tag prediction tasks (which might work ok in the well-trained supervised case …)
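Here is a minimal sketch of that arithmetic, using only the three words and the factor tables shown on these slides (a real CRF would also include the factors in the "…" context):

```python
# Unary factor tables from the slides: word -> {tag: value}
UNARY = {
    "find":      {"v": 0.3, "n": 0.02, "a": 0.0},
    "preferred": {"v": 0.3, "n": 0.0,  "a": 0.1},
    "tags":      {"v": 0.2, "n": 0.2,  "a": 0.0},
}
# Binary factor (same table reused at every adjacent pair): (left tag, right tag) -> value
BINARY = {
    ("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
    ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
    ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1,
}

def score(words, tags):
    """Unnormalized probability: product of all factor values on this tagging."""
    s = 1.0
    for w, t in zip(words, tags):
        s *= UNARY[w][t]
    for left, right in zip(tags, tags[1:]):
        s *= BINARY[(left, right)]
    return s

words = ["find", "preferred", "tags"]
print(score(words, ["v", "a", "n"]))  # ~0.018 = 1 * 3 * 0.3 * 0.1 * 0.2, the product above
print(score(words, ["v", "v", "v"]))  # 0.0: adjacent (v, v) pairs get binary value 0
```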
26
Task-centered view of the world
Task1: x → z1
Task2: z1 → z2
Task3: z2 → z3
Task4: z3 → y
27
Variable-centered view of the world
[Factor graph: variables x, z1, z2, z3, y; factors Φ1, Φ2, Φ3, Φ4, Φ5]
p(x, z1, z2, z3, y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
28
Variable-centric, not task-centric
Throw in any variables that might help! Model and exploit correlations
29
[Figure: many variables and relations one could model jointly: lexicon (word types), semantics, sentences (N tokens), discourse context, resources, entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings, typos, formatting, entanglement, annotation]
30
Back to our (simpler!) running examples
sentence → parse
lemma → morph. paradigm
(with David A. Smith)
(with Markus Dreyer)
31
Parser projection
sentence → parse   (little direct training data)
translation → parse of translation   (much more training data)
(with David A. Smith)
32
Parser projection
Auf diese Frage habe ich leider keine Antwort bekommen
I did not unfortunately receive an answer to this question
33
Parser projection
sentence → parse   (little direct training data)
word-to-word alignment
translation → parse of translation   (much more training data)
34
Parser projection
Auf diese Frage habe ich leider keine Antwort bekommen
I did not unfortunately receive an answer to this question
NULL
35
Parser projection
sentence → parse   (little direct training data)
word-to-word alignment
translation → parse of translation   (much more training data)
need an interesting model
36
Parses are not entirely isomorphic
Auf diese Frage habe ich leider keine Antwort bekommen
I did not unfortunately receive an answer to this question
NULL
(alignment configurations: monotonic, null, head-swapping, siblings)
37
Dependency Relations
+ “none of the above”
38
Parser projection
sentence parse
translation
word-to-word alignment
parse of translation
Typical test data (no translation observed):
39
sentence parse
translation
word-to-word alignment
parse of translation
Small supervised training set (treebank):
Parser projection
40
Parser projection
sentence parse
translation
word-to-word alignment
parse of translation
Moderate treebank in other language:
41
sentence parse
translation
word-to-word alignment
parse of translation
Maybe a few gold alignments:
Parser projection
42
sentence parse
translation
word-to-word alignment
parse of translation
Lots of raw bitext:
Parser projection
43
sentence parse
translation
word-to-word alignment
parse of translation
Given bitext,
Parser projection
44
sentence parse
translation
word-to-word alignment
parse of translation
Given bitext, try to impute other variables:
Parser projection
45
sentence parse
translation
word-to-word alignment
parse of translation
Given bitext, try to impute other variables: Now we have more constraints on the parse …
Parser projection
46
sentence parse
translation
word-to-word alignment
parse of translation
Given bitext, try to impute other variables: Now we have more constraints on the parse … which should help us train the parser.
Parser projection
We’ll see how belief propagation naturally handles this.
47
English does help us impute Chinese parse
中国 在 基本 建设 方面 , 开始 利用 国际 金融 组织 的 贷款 进行 国际性 竞争性 招标 采购
In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement
China: in: infrastructure: construction: area:, : has begun: to utilize: international: financial: organizations: ‘s: loans: to implement: international: competitive: bidding: procurement
Seeing noisy output of an English WSJ parser fixes these Chinese links
The corresponding bad versions found without seeing the English parse
Subject attaches to intervening noun
N P J N N , V V N N N ‘s N V J N N N
Complement verbs swap objects
48
Which does help us train a monolingual Chinese parser
49
(Could add a 3rd language …)
sentence → parse
translation → parse of translation
translation' → parse of translation'
alignment (one per language pair)
50
world
sentence parse
translation
word-to-word alignment
parse of translation
(Could add world knowledge …)
51
(Could add bilingual dictionary …)
sentence parse
translation
word-to-word alignment
parse of translation
dict (since incomplete, treat as a partially observed variable)
N
52
sentence → parse
translation → parse of translation
alignment
Auf diese Frage habe ich leider keine Antwort bekommen
I did not unfortunately receive an answer to this question
NULL
Dynamic Markov Random Field
Note: These are structured vars. Each is expanded into a collection of fine-grained variables (words, dependency links, alignment links, …)
Thus, the # of fine-grained variables & factors varies by example (but all examples share a single finite parameter vector)
53
Back to our running examples
sentence → parse   (with David A. Smith)
lemma → morph. paradigm   (with Markus Dreyer)
54
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
Morphological paradigm
55
Morphological paradigm
          Present   Past
inf       werfen
1st Sg    werfe     warf
2nd Sg    wirfst    warfst
3rd Sg    wirft     warf
1st Pl    werfen    warfen
2nd Pl    werft     warft
3rd Pl    werfen    warfen
56
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
Morphological paradigm as MRF
Each factor is a sophisticated weighted FST
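To make this concrete, here is a toy sketch of an MRF over string-valued paradigm cells. The cell names, the choice of which pairs get a factor, and the plain edit-distance potential are illustrative stand-ins; in the actual model each factor is a learned weighted FST.

```python
from math import exp

CELLS = ["inf", "1sg_pres", "2sg_pres", "3sg_pres"]           # a few paradigm cells
FACTORS = [("inf", "1sg_pres"), ("1sg_pres", "2sg_pres"),      # which pairs of cells
           ("2sg_pres", "3sg_pres")]                           # are directly related

def edit_distance(a, b):
    """Plain Levenshtein distance (stand-in for a learned weighted-FST score)."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

def factor(a, b):
    return exp(-edit_distance(a, b))   # similar strings get a higher potential

def joint_score(assignment):
    """Unnormalized probability of one way of filling in the whole paradigm."""
    s = 1.0
    for u, v in FACTORS:
        s *= factor(assignment[u], assignment[v])
    return s

good = {"inf": "werfen", "1sg_pres": "werfe", "2sg_pres": "wirfst", "3sg_pres": "wirft"}
odd  = {"inf": "werfen", "1sg_pres": "gackere", "2sg_pres": "wirfst", "3sg_pres": "wirft"}
print(joint_score(good) > joint_score(odd))   # True: coherent paradigms score higher
```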
57
# observations per form (fine-grained semi-supervision)
          Present   Past
inf       9,393
1st Sg    285       1124
2nd Sg    166       4      (rare!)
3rd Sg    1410      1124
1st Pl    1688      673
2nd Pl    1275      9      (rare!)
3rd Pl    1688      673
(rare forms are undertrained)
Question: Does joint inference help?
58
gelten 'to hold, to apply'
          Present   Past
inf       gelten
1st Sg    gelte     galt
2nd Sg    giltst    galtst (or: galtest)
3rd Sg    gilt      galt
1st Pl    gelten    galten
2nd Pl    geltet    galtet
3rd Pl    gelten    galten
[Figure annotations pairing forms: *geltst / giltst, geltet / geltet, galtst / galtest, *gltt / galtet]
59
abbrechen 'to quit'
          Present                      Past
inf       abbrechen
1st Sg    abbreche (or: breche ab)     abbrach (or: brach ab)
2nd Sg    abbrichst (or: brichst ab)   abbrachst (or: brachst ab)
3rd Sg    abbricht (or: bricht ab)     abbrach (or: brach ab)
1st Pl    abbrechen (or: brechen ab)   abbrachen (or: brachen ab)
2nd Pl    abbrecht (or: brecht ab)     abbracht (or: bracht ab)
3rd Pl    abbrechen (or: brechen ab)   abbrachen (or: brachen ab)
[Figure annotations pairing forms: *abbrachten / abbrachen, abbreche / abbreche, abbracht / abbracht, abbrecht / abbrecht, *atttrachst / abbrachst, *abbrechst / abbrichst, abbricht / abbricht]
60
gackern 'to cackle'
          Present   Past
inf       gackern
1st Sg    gackere   gackerte
2nd Sg    gackerst  gackertest
3rd Sg    gackert   gackerte
1st Pl    gackern   gackerten
2nd Pl    gackert   gackertet
3rd Pl    gackern   gackerten
[Figure annotations pairing forms: *gackrt / gackertet, *gackart / gackertest, gackere / gackere, gackerst / gackerst, gackert / gackert, gackern / gackern, gackerte / gackerte, gackerten / gackerten]
61
werfen 'to throw'
          Present   Past
inf       werfen
1st Sg    werfe     warf
2nd Sg    wirfst    warfst
3rd Sg    wirft     warf
1st Pl    werfen    warfen
2nd Pl    werft     warft
3rd Pl    werfen    warfen
[Figure annotations pairing forms: warft / warft, *werfst / wirfst, werft / werft, warfst / warfst]
62
Preliminary results …
joint inference helps a lot on the rare forms
Hurts on the others. Can we fix?? (Is it because our joint decoder is approx? Or because semi-supervised training is hard and we need a better method for it?)
63
Outline
1. Why use joint models in NLP?
2. Making big joint models tractable: Approximate inference and training by loopy belief propagation
3. Open questions: Semi-supervised training of joint models
64
Key Idea! We're using an MRF to coordinate the solutions to several NLP problems
Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently over many fine-grained variables (individual words, tags, links)
Within a factor, use existing fast exact NLP algorithms. These are the "propagators" that compute outgoing messages, even though the product of factors may be intractable or even undecidable to work with
65
MRFs great for n-way classification (maxent)
Also good for predicting sequences
Also good for dependency parsing
Why we need approximate inference
alas, the forward-backward algorithm only allows n-gram features
alas, our combinatorial algorithms only allow single-edge features (more interactions slow them down or introduce NP-hardness)
…find preferred links…
find preferred tags
v a n
66
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: soldiers in single file; each passes along counts like "1 behind you" … "5 behind you" and "1 before you" … "5 before you", plus "there's 1 of me"]
adapted from MacKay (2003) textbook
67
Great Ideas in ML: Message Passing
Count the soldiers
Incoming messages: "3 behind you", "2 before you"; plus "there's 1 of me"
Belief: Must be 2 + 1 + 3 = 6 of us (only see my incoming messages)
adapted from MacKay (2003) textbook
68
Great Ideas in ML: Message Passing
Count the soldiers
Incoming messages: "4 behind you", "1 before you"; plus "there's 1 of me"
Belief: Must be 1 + 1 + 4 = 6 of us (only see my incoming messages)
adapted from MacKay (2003) textbook
69
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
"7 here" + "3 here" + 1 of me → "11 here" (= 7 + 3 + 1)
adapted from MacKay (2003) textbook
70
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
"3 here" + "3 here" → "7 here" (= 3 + 3 + 1)
adapted from MacKay (2003) textbook
71
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
"7 here" + "3 here" → "11 here" (= 7 + 3 + 1)
adapted from MacKay (2003) textbook
72
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
"7 here" + "3 here" + "3 here" → Belief: Must be 14 of us
adapted from MacKay (2003) textbook
73
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
"7 here" + "3 here" + "3 here" → Belief: Must be 14 of us
wouldn't work correctly with a "loopy" (cyclic) graph
adapted from MacKay (2003) textbook
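A minimal sketch of this counting scheme (my own illustration of MacKay's example, with a made-up 6-soldier tree): each soldier sends each neighbor "1 + the sum of what I heard from my other neighbors", and every soldier's belief comes out to the same total.

```python
from collections import defaultdict
from functools import lru_cache

# Undirected tree of 6 soldiers (any tree works; a loopy graph would break this).
edges = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

@lru_cache(maxsize=None)
def message(src, dst):
    """What src tells dst: 'counting me, this many soldiers are on my side of our edge'."""
    return 1 + sum(message(nbr, src) for nbr in adj[src] if nbr != dst)

# Each soldier's belief = itself + the sum of its incoming messages = total head count.
for node in sorted(adj):
    belief = 1 + sum(message(nbr, node) for nbr in adj[node])
    print(f"soldier {node}: must be {belief} of us")   # 6 for everyone
```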
74
Great ideas in ML: Forward-Backward
… find preferred tags …
In the CRF, message passing = forward-backward
Incoming at the middle variable: messages α and β, (v 2, n 1, a 7) and (v 3, n 1, a 6), and its unary factor (v 0.3, n 0, a 0.1)
Belief = pointwise product of the unary factor and the incoming messages: (v 1.8, n 0, a 4.2)
(the figure also shows other message tables, v 7, n 2, a 1 and v 3, n 6, a 1, and the binary tag-compatibility table)
75
Great ideas in ML: Forward-Backward
Extend CRF to "skip chain" to capture a non-local factor. More influences on belief
… find preferred tags …
The middle variable now also receives a message (v 3, n 1, a 6) from the skip-chain factor, besides the chain messages (v 2, n 1, a 7) and (v 3, n 1, a 6) and its unary factor (v 0.3, n 0, a 0.1)
Belief becomes (v 5.4, n 0, a 25.2)
76
Great ideas in ML: Forward-Backward
Extend CRF to "skip chain" to capture a non-local factor. More influences on belief. Graph becomes loopy
… find preferred tags …
(same messages as before; belief (v 5.4, n 0, a 25.2))
Red messages not independent? Pretend they are!
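A quick numeric check of these slides: the belief at a variable is just the pointwise product of its unary factor and all incoming messages, using the tables shown above.

```python
from math import prod

TAGS = ("v", "n", "a")

def belief(*tables):
    """Unnormalized belief: pointwise product of the unary factor and incoming messages."""
    return {t: prod(tbl[t] for tbl in tables) for t in TAGS}

unary = {"v": 0.3, "n": 0.0, "a": 0.1}   # unary factor at "preferred"
msg1  = {"v": 2.0, "n": 1.0, "a": 7.0}   # message from one chain neighbor
msg2  = {"v": 3.0, "n": 1.0, "a": 6.0}   # message from the other chain neighbor
skip  = {"v": 3.0, "n": 1.0, "a": 6.0}   # extra message once the skip-chain factor is added

print(belief(unary, msg1, msg2))         # ~ (v 1.8, n 0, a 4.2), as on the chain slide
print(belief(unary, msg1, msg2, skip))   # ~ (v 5.4, n 0, a 25.2), as with the skip chain
```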
77
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
MRF over string-valued variables!
Each factor is a sophisticated weighted FST
78
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
Each factor is a sophisticated weighted FST
MRF over string-valued variables!
What are these messages? Probability distributions over strings …
Represented by weighted FSAs; constructed by finite-state operations
Parameters trainable using finite-state methods
Warning: FSAs can get larger and larger; must prune back using k-best or variational approx
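A toy stand-in for what these messages do, using ordinary dictionaries over a few candidate strings instead of real weighted FSAs; the verb forms and weights are illustrative only, and the pruning step mimics the k-best approximation mentioned above.

```python
import heapq

def combine(msg_a, msg_b):
    """Pointwise product of two messages over strings (toy stand-in for
    intersecting two weighted FSAs: keep strings both messages like)."""
    return {s: msg_a[s] * msg_b[s] for s in msg_a.keys() & msg_b.keys()}

def prune_k_best(msg, k):
    """Approximate a message by keeping only its k highest-weighted strings."""
    return dict(heapq.nlargest(k, msg.items(), key=lambda kv: kv[1]))

# Two incoming messages about a 2nd Sg present form (illustrative weights):
m1 = {"geltst": 0.5, "giltst": 0.4, "gelst": 0.1}
m2 = {"giltst": 0.7, "geltst": 0.2, "gildst": 0.1}
print(prune_k_best(combine(m1, m2), k=2))   # ~ {'giltst': 0.28, 'geltst': 0.1}
```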
79
Key Idea! We're using an MRF to coordinate the solutions to several NLP problems
Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently over many fine-grained variables (individual words, tags, links)
Within a factor, use existing fast exact NLP algorithms. These are the "propagators" that compute outgoing messages, even though the product of factors may be intractable or even undecidable to work with
We just saw this for morphology; now let's see it for parsing
80
Back to simple variables … CRF for POS tagging
Now let's do dependency parsing! O(n²) boolean variables for the possible links
v a n
Local factors in a graphical model
find preferred links ……
81
Back to simple variables … CRF for POS tagging
Now let's do dependency parsing! O(n²) boolean variables for the possible links
Local factors in a graphical model
… find preferred links …
[Figure: each possible link variable assigned t or f]
Possible parse, encoded as an assignment to these vars
v a n
82
Back to simple variables … CRF for POS tagging
Now let's do dependency parsing! O(n²) boolean variables for the possible links
Local factors in a graphical model
… find preferred links …
[Figure: a different t/f assignment to the link variables]
Another possible parse
v a n
83
Back to simple variables … CRF for POS tagging
Now let's do dependency parsing! O(n²) boolean variables for the possible links
Local factors in a graphical model
… find preferred links …
[Figure: a t/f assignment whose links contain a cycle]
An illegal parse (cycle)
v a n
84
Back to simple variables … CRF for POS tagging
Now let's do dependency parsing! O(n²) boolean variables for the possible links
Local factors in a graphical model
… find preferred links …
[Figure: a t/f assignment where some word has multiple parents]
Another illegal parse (multiple parents)
v a n
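A minimal sketch of the constraint these slides illustrate: an assignment to the link variables is a legal parse only if every word gets exactly one parent and no cycle appears. The helper below is my own illustration, not code from the talk.

```python
def is_legal_parse(n, links):
    """links: set of (head, child) pairs over words 1..n; 0 is the root.
    Legal iff every word has exactly one parent and there is no cycle."""
    parent = {}
    for head, child in links:
        if child in parent:          # "multiple parents" case from the slide
            return False
        parent[child] = head
    if set(parent) != set(range(1, n + 1)):
        return False                 # some word has no parent at all
    for child in range(1, n + 1):    # walk up; must reach the root, not loop
        seen, node = set(), child
        while node != 0:
            if node in seen:
                return False         # "cycle" case from the slide
            seen.add(node)
            node = parent[node]
    return True

# "find preferred links": word 1 = find, 2 = preferred, 3 = links
print(is_legal_parse(3, {(0, 1), (1, 3), (3, 2)}))           # True: a tree
print(is_legal_parse(3, {(0, 1), (2, 3), (3, 2)}))           # False: 2 and 3 form a cycle
print(is_legal_parse(3, {(0, 1), (1, 2), (3, 2), (1, 3)}))   # False: 2 has two parents
```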
104
So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation
Local factors for parsing
… find preferred links …
[Figure: one unary factor per link, e.g. t 2 / f 1,  t 1 / f 2,  t 1 / f 6,  t 1 / f 3,  t 1 / f 8]
as before, goodness of this link can depend on the entire observed input context; some other links aren't as good given this input sentence
But what if the best assignment isn't a tree??
105
Global factors for parsing
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree
  this is a "hard constraint": factor is either 0 or 1
… find preferred links …
TREE factor (one row per joint assignment of all link variables):
  ffffff  0
  ffffft  0
  fffftf  0
  …
  fftfft  1
  …
  tttttt  0
106
Global factors for parsing
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree
  this is a "hard constraint": factor is either 0 or 1
… find preferred links …
TREE factor table: 64 entries (0/1); the assignment shown is legal ("we're legal!")
So far, this is equivalent to edge-factored parsing (McDonald et al. 2005).
Note: McDonald et al. (2005) don't loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we!
Optionally require the tree to be projective (no crossing links)
107
Local factors for parsing
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree ("hard constraint": 0 or 1)
Second-order effects: factors on 2 variables, e.g. grandparent
… find preferred links …
Grandparent factor on a pair of links:
      f   t
  f   1   1
  t   1   3
108
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree ("hard constraint": 0 or 1)
Second-order effects: factors on 2 variables, e.g. grandparent, no-cross
Local factors for parsing
… find preferred links … by …
No-cross factor on a pair of links:
      f   t
  f   1   1
  t   1   0.2
109
Local factors for parsing
… find preferred links … by …
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree ("hard constraint": 0 or 1)
Second-order effects: factors on 2 variables, e.g. grandparent, no-cross, coordination with other parse & alignment, hidden POS tags, siblings, subcategorization, …
110
Exactly Finding the Best Parse
With arbitrary features, runtime blows up
Projective parsing: O(n³) by dynamic programming; non-projective: O(n²) by minimum spanning tree
  but to allow fast dynamic programming or MST parsing, only use single-edge features
… find preferred links …
  grandparents: O(n⁴)
  grandparents + sibling bigrams: O(n⁵)
  POS trigrams: O(n³g⁶)
  sibling pairs (non-adjacent): … O(2ⁿ)
  NP-hard: any of the above features, soft penalties for crossing links, pretty much anything else!
111
Two great tastes that taste great together
You got dynamic
programming in my belief propagation!
You got belief propagation in my dynamic
programming!
113
Loopy Belief Propagation for Parsing
… find preferred links …
Sentence tells word 3, "Please be a verb"
Word 3 tells the 3→7 link, "Sorry, then you probably don't exist"
The 3→7 link tells the TREE factor, "You'll have to find another parent for 7"
The TREE factor tells the 10→7 link, "You're on!"
The 10→7 link tells word 10, "Could you please be a noun?"
…
114
Loopy Belief Propagation for Parsing
… find preferred links …
Higher-order factors (e.g., Grandparent) induce loops
Let's watch a loop around one triangle … Strong links are suppressing or promoting other links …
115
Loopy Belief Propagation for Parsing
… find preferred links …
Higher-order factors (e.g., Grandparent) induce loops. Let's watch a loop around one triangle …
How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
TREE factor:  ffffff 0,  ffffft 0,  fffftf 0,  …,  fftfft 1,  …,  tttttt 0
116
Loopy Belief Propagation for Parsing
… find preferred links …
How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
But this is the outside probability of the green link!
TREE factor computes all outgoing messages at once (given all incoming messages)
Projective case: total O(n³) time by inside-outside
Non-projective: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007)
117
Loopy Belief Propagation for Parsing
How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
But this is the outside probability of the green link!
TREE factor computes all outgoing messages at once (given all incoming messages)
Projective case: total O(n³) time by inside-outside
Non-projective: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007)
Belief propagation assumes incoming messages to TREE are independent. So outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
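For the non-projective case, the Kirchhoff-matrix computation referred to here is the Matrix-Tree Theorem: the total weight of all spanning trees is a determinant of a minor of the Laplacian built from the link weights, and edge marginals (hence outgoing messages) follow by differentiating log Z. A minimal numpy sketch with made-up weights, not the talk's code:

```python
import numpy as np

def spanning_tree_partition(W):
    """W[h, m] = weight of a potential link h -> m (head -> modifier); node 0 is the root.
    Returns Z = total weight of all spanning trees rooted at 0, by the Matrix-Tree Theorem."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    W[:, 0] = 0.0                    # nothing may point at the root
    L = np.diag(W.sum(axis=0)) - W   # Laplacian: incoming-weight totals minus weights
    return np.linalg.det(L[1:, 1:])  # delete the root's row and column

# Tiny example: root 0 and words 1, 2.
W = np.array([[0.0, 2.0, 1.0],
              [0.0, 0.0, 3.0],
              [0.0, 4.0, 0.0]])
# Trees: {0->1, 0->2}: 2*1;  {0->1, 1->2}: 2*3;  {0->2, 2->1}: 1*4.
print(spanning_tree_partition(W))    # ~ 12.0 = 2 + 6 + 4
```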
118
Some interesting connections …
Parser stacking (Nivre & McDonald 2008; Martins et al. 2008)
Global constraints in arc consistency: ALLDIFFERENT constraint (Régin 1994)
Matching constraint in max-product BP: for computer vision (Duchi et al., 2006); could be used for machine translation
As far as we know, our parser is the first use of global constraints in sum-product BP.
And nearly the first use of BP in natural language processing.
119
Runtimes for each factor type (see paper)
Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(n)
TOTAL                                          O(n³) per iteration
+ = Additive, not multiplicative!
120
Runtimes for each factor type (see paper)
Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(n)
TOTAL                                          O(n³)
+ = Additive, not multiplicative!
Each "global" factor coordinates an unbounded # of variables
Standard belief propagation would take exponential time to iterate over all configurations of those variables
See paper for efficient propagators
121
Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)
                 Danish   Dutch   English
Tree+Link        85.5     87.3    88.6
+NoCross         86.1     88.3    89.1
+Grandparent     86.1     88.6    89.4
+ChildSeq        86.5     88.5    90.1
122
Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)
                                         Danish   Dutch   English
Tree+Link                                85.5     87.3    88.6
+NoCross                                 86.1     88.3    89.1
+Grandparent                             86.1     88.6    89.4
+ChildSeq                                86.5     88.5    90.1
Best projective parse with all factors   86.0     84.5    90.2    (exact, slow)
+hill-climbing                           86.1     87.6    90.2    (doesn't fix enough edges)
123
Time vs. Projective Search Error
[Figure: search error vs. runtime over BP iterations, compared with O(n⁴) DP and O(n⁵) DP]
125
Summary of MRF parsing by BP
Output probability defined as product of local and global factors. Throw in any factors we want! (log-linear model) Each factor must be fast, but they run independently
Let local factors negotiate via "belief propagation". Each bit of syntactic structure is influenced by others. Some factors need combinatorial algorithms to compute messages fast, e.g., existing parsing algorithms using dynamic programming
Each iteration takes total time O(n³) or even O(n²); see paper. Compare reranking or stacking
Converges to a pretty good (but approximate) global parse. Fast parsing for formerly intractable or slow models. Extra features of these models really do help accuracy
126
Outline
1. Why use joint models in NLP?
2. Making big joint models tractable: Approximate inference and training by loopy belief propagation
3. Open questions: Semi-supervised training of joint models
127
Training with missing data is hard!
Semi-supervised learning of HMMs or PCFGs: ouch!
  Merialdo: Just stick with the small supervised training set. Adding unsupervised data tends to hurt
  A stronger model helps (McClosky et al. 2007, Cohen et al. 2009)
So maybe some hope from good models at the factors, and from having lots of factors (i.e., take cues from lots of correlated variables at once; cf. Yarowsky et al.)
Naïve Bayes would be okay …
  Variables with unknown values can't hurt you. They have no influence on training or decoding.
  But they can't help you, either! And the independence assumptions are flaky. So I'd like to keep discussing joint models …
128
Case #1: Missing data that you can’t impute
sentence parse
translation
word-to-word alignment
parse of translation
Treat like multi-task learning? Shared features between 2 tasks: parse Chinese vs. parse Chinese w/ English translation. Or 3 tasks: parse Chinese w/ inferred English gist vs. parse Chinese w/ English translation vs. parse English gist derived from English (supervised)
129
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
Case #2: Missing data you can impute, but maybe badly
Each factor is a sophisticated weighted FST
130
inf xyz
1st Sg
2nd Sg
3rd Sg
1st Pl
2nd Pl
3rd Pl
Present Past
Case #2: Missing data you can impute, but maybe badly
Each factor is a sophisticated weighted FST
This is where simple cases of EM go wrong
Could reduce to case #1 and throw away these variables
Or: Damp messages from imputed variables to the extent you're not confident in them
  Requires confidence estimation (cf. strapping)
  Crude versions: Confidence depends in a fixed way on time, or on entropy of belief at that node, or on length of input sentence.
  But could train a confidence estimator on supervised data to pay attention to all sorts of things!
Correspondingly, scale up features for related missing-data tasks since the damped data are "partially missing"
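As one concrete (hypothetical) reading of the damping idea, not necessarily the scheme intended here: interpolate an imputed variable's outgoing message toward the uninformative uniform message, with the interpolation weight set by a confidence estimate.

```python
def damp_message(msg, confidence):
    """Interpolate a message toward uniform: confidence = 1 keeps it as is,
    confidence = 0 makes the imputed variable silent (uninformative)."""
    uniform = 1.0 / len(msg)
    return {v: confidence * p + (1.0 - confidence) * uniform for v, p in msg.items()}

# A low-confidence imputed value barely influences its neighbors:
msg = {"v": 0.7, "n": 0.2, "a": 0.1}
print(damp_message(msg, confidence=0.9))  # close to the original message
print(damp_message(msg, confidence=0.1))  # close to uniform ("partially missing")
```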