Structured Belief Propagation for NLP

Structured Belief Propagation for NLP. Matthew R. Gormley & Jason Eisner. ACL ‘15 Tutorial, July 26, 2015. For the latest version of these slides, please visit: http://www.cs.jhu.edu/~mrg/bp-tutorial/


Transcript of Structured Belief Propagation for NLP

Page 1: Structured Belief Propagation for NLP

Structured Belief Propagation for NLPMatthew R. Gormley & Jason Eisner

ACL ‘15 TutorialJuly 26, 2015

1For the latest version of these slides, please visit:

http://www.cs.jhu.edu/~mrg/bp-tutorial/

Page 2: Structured Belief Propagation for NLP

2

Language has a lot going on at once

Structured representations of utterancesStructured knowledge of the language

Many interacting partsfor BP to reason about!

Page 3: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

3

Page 4: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

4

Page 5: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

5

Page 6: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

6

Page 7: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

7

Page 8: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

8

Page 9: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

9

Page 10: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

10

Page 11: Structured Belief Propagation for NLP

Section 1:Introduction

Modeling with Factor Graphs

11

Page 12: Structured Belief Propagation for NLP

Sampling from a Joint Distribution

12
time flies like an arrow

X1 ψ2 X2 ψ4 X3 ψ6 X4 ψ8 X5

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0X0

<START>

Sample 1: n v p d n
Sample 2: n n v d n
Sample 3: n v p d n
Sample 4: v n p d n
Sample 5: v n v d n
Sample 6: n v p d n

A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

Page 13: Structured Belief Propagation for NLP

Sampling from a Joint Distribution

13

[Figure: a factor graph over variables X1-X7 connected by factors ψ1-ψ12, drawn four times, once for each of four sampled assignments (Sample 1 through Sample 4).]

A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

Page 14: Structured Belief Propagation for NLP

Sample 2: n n v d n (time flies like an arrow)

Sampling from a Joint Distribution

14W1 W2 W3 W4 W5

X1 ψ2 X2 ψ4 X3 ψ6 X4 ψ8 X5

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0X0

<START>

Sample 1: n v p d n (time flies like an arrow)
Sample 4: p n n v v (with time you will see)
Sample 3: n v p n n (flies fly with their wings)

A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

Page 15: Structured Belief Propagation for NLP

W1 W2 W3 W4 W5

X1 ψ2 X2 ψ4 X3 ψ6 X4 ψ8 X5

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0X0

<START>

Factors have local opinions (≥ 0)

15

Each black box looks at some of the tags Xi and words Wi

Tag-tag factor:
        v     n     p     d
  v     1     6     3     4
  n     8     4     2     0.1
  p     1     3     1     3
  d     0.1   8     0     0

Tag-word factor:
        time  flies  like  …
  v     3     5      3
  n     4     5      2
  p     0.1   0.1    3
  d     0.1   0.2    0.1

Note: We chose to reuse the same factors at

different positions in the sentence.

Page 16: Structured Belief Propagation for NLP

Factors have local opinions (≥ 0)

16

time flies like an arrow

n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0<START>

Each black box looks at some of the tags Xi and words Wi

Tag-tag factor:
        v     n     p     d
  v     1     6     3     4
  n     8     4     2     0.1
  p     1     3     1     3
  d     0.1   8     0     0

Tag-word factor:
        time  flies  like  …
  v     3     5      3
  n     4     5      2
  p     0.1   0.1    3
  d     0.1   0.2    0.1

p(n, v, p, d, n, time, flies, like, an, arrow) = ?

Page 17: Structured Belief Propagation for NLP

Global probability = product of local opinions

17

time flies like an arrow

n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0<START>

Each black box looks at some of the tags Xi and words Wi

p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

Tag-tag factor:
        v     n     p     d
  v     1     6     3     4
  n     8     4     2     0.1
  p     1     3     1     3
  d     0.1   8     0     0

Tag-word factor:
        time  flies  like  …
  v     3     5      3
  n     4     5      2
  p     0.1   0.1    3
  d     0.1   0.2    0.1

Uh-oh! The probabilities of the various assignments sum up to Z > 1. So divide them all by Z.
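To make "product of local opinions" concrete, here is a minimal Python sketch (mine, not from the tutorial) that multiplies one factor value per black box for a single assignment. The table values are the ones shown above; the function name and the shortened three-word sentence are my own, and the <START> factor and the factors for "an"/"arrow" are omitted because their values are not shown in this transcript.

# Tag-tag and tag-word factor values copied from the tables on this slide.
TRANS = {"v": {"v": 1,   "n": 6, "p": 3, "d": 4},
         "n": {"v": 8,   "n": 4, "p": 2, "d": 0.1},
         "p": {"v": 1,   "n": 3, "p": 1, "d": 3},
         "d": {"v": 0.1, "n": 8, "p": 0, "d": 0}}
EMIT = {"v": {"time": 3,   "flies": 5,   "like": 3},
        "n": {"time": 4,   "flies": 5,   "like": 2},
        "p": {"time": 0.1, "flies": 0.1, "like": 3},
        "d": {"time": 0.1, "flies": 0.2, "like": 0.1}}

def unnormalized_score(tags, words):
    """Product of one factor value per black box touching this assignment."""
    score = EMIT[tags[0]][words[0]]               # 4 = tag-word factor at "time"
    for i in range(1, len(words)):
        score *= TRANS[tags[i - 1]][tags[i]]      # 8, then 3, ... (tag-tag factors)
        score *= EMIT[tags[i]][words[i]]          # 5, then 3, ... (tag-word factors)
    return score

print(unnormalized_score(["n", "v", "p"], ["time", "flies", "like"]))
# 4 * 8 * 5 * 3 * 3 = 1440 -- an unnormalized score, so it must still be divided by Z.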

Page 18: Structured Belief Propagation for NLP

Markov Random Field (MRF)

18

time flies like an arrow

n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0<START>

p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

Tag-tag factor:
        v     n     p     d
  v     1     6     3     4
  n     8     4     2     0.1
  p     1     3     1     3
  d     0.1   8     0     0

Tag-word factor:
        time  flies  like  …
  v     3     5      3
  n     4     5      2
  p     0.1   0.1    3
  d     0.1   0.2    0.1

Joint distribution over tags Xi and words Wi

The individual factors aren’t necessarily probabilities.

Page 19: Structured Belief Propagation for NLP

time flies like an arrow

<START> n v p d n

Hidden Markov Model

19

But sometimes we choose to make them probabilities. Constrain each row of a factor to sum to one. Now Z = 1.

Transition probabilities (each row sums to one):
        v     n     p     d
  v     .1    .4    .2    .3
  n     .8    .1    .1    0
  p     .2    .3    .2    .3
  d     .2    .8    0     0

Emission probabilities:
        time  flies  like  …
  v     .2    .5     .2
  n     .3    .4     .2
  p     .1    .1     .3
  d     .1    .2     .1

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)

Page 20: Structured Belief Propagation for NLP

Markov Random Field (MRF)

20

time flies like an arrow

n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0<START>

p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

Tag-tag factor:
        v     n     p     d
  v     1     6     3     4
  n     8     4     2     0.1
  p     1     3     1     3
  d     0.1   8     0     0

Tag-word factor:
        time  flies  like  …
  v     3     5      3
  n     4     5      2
  p     0.1   0.1    3
  d     0.1   0.2    0.1

Joint distribution over tags Xi and words Wi

Page 21: Structured Belief Propagation for NLP

Conditional Random Field (CRF)

21time flies like an arrow

n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0<START>

Unary factor at "time":
  v    3
  n    4
  p    0.1
  d    0.1

Tag-tag factor:
        v     n     p     d
  v     1     6     3     4
  n     8     4     2     0.1
  p     1     3     1     3
  d     0.1   8     0     0

Unary factor at "flies":
  v    5
  n    5
  p    0.1
  d    0.2

Conditional distribution over tags Xi given words wi. The factors and Z are now specific to the sentence w.

p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

Page 22: Structured Belief Propagation for NLP

How General Are Factor Graphs?

• Factor graphs can be used to describe
  – Markov Random Fields (undirected graphical models)
    • i.e., log-linear models over a tuple of variables
  – Conditional Random Fields
  – Bayesian Networks (directed graphical models)
• Inference treats all of these interchangeably.
  – Convert your model to a factor graph first.
  – Pearl (1988) gave key strategies for exact inference:
    • Belief propagation, for inference on acyclic graphs
    • Junction tree algorithm, for making any graph acyclic (by merging variables and factors: blows up the runtime)

Page 23: Structured Belief Propagation for NLP

Object-Oriented Analogy
• What is a sample?
  A datum: an immutable object that describes a linguistic structure.
• What is the sample space?
  The class of all possible sample objects.
• What is a random variable?
  An accessor method of the class, e.g., one that returns a certain field.
  – Will give different values when called on different random samples.

23

class Tagging:
    int n;       // length of sentence
    Word[] w;    // array of n words (values wi)
    Tag[] t;     // array of n tags (values ti)

    Word W(int i) { return w[i]; }    // random var Wi
    Tag T(int i) { return t[i]; }     // random var Ti
    String S(int i) {                 // random var Si
        return suffix(w[i], 3);
    }

Random variable W5 takes value w5 == “arrow” in this sample

Page 24: Structured Belief Propagation for NLP

Object-Oriented Analogy
• What is a sample?
  A datum: an immutable object that describes a linguistic structure.
• What is the sample space?
  The class of all possible sample objects.
• What is a random variable?
  An accessor method of the class, e.g., one that returns a certain field.
• A model is represented by a different object. What is a factor of the model?
  A method of the model that computes a number ≥ 0 from a sample, based on the sample’s values of a few random variables, and parameters stored in the model.
• What probability does the model assign to a sample?
  A product of its factors (rescaled). E.g., uprob(tagging) / Z().
• How do you find the scaling factor?
  Add up the probabilities of all possible samples. If the result Z != 1, divide the probabilities by that Z.

24

class TaggingModel:
    float transition(Tagging tagging, int i) {   // tag-tag bigram
        return tparam[tagging.t(i-1)][tagging.t(i)];
    }
    float emission(Tagging tagging, int i) {     // tag-word bigram
        return eparam[tagging.t(i)][tagging.w(i)];
    }
    float uprob(Tagging tagging) {               // unnormalized prob
        float p = 1;
        for (i = 1; i <= tagging.n; i++) {
            p *= transition(tagging, i) * emission(tagging, i);
        }
        return p;
    }
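As a companion to the pseudocode above, here is a hedged Python sketch (mine, not the tutorial's code) of the same model object, with a Z() that literally "adds up the probabilities of all possible samples" by enumerating the sample space. All parameter values in the demo are invented.

import itertools

class TaggingModel:
    def __init__(self, tparam, eparam, tags):
        self.tparam, self.eparam, self.tags = tparam, eparam, tags

    def uprob(self, words, tags):
        """Unnormalized probability: product of transition and emission factors."""
        p, prev = 1.0, "<START>"
        for w, t in zip(words, tags):
            p *= self.tparam[prev][t] * self.eparam[t][w]
            prev = t
        return p

    def Z(self, words):
        """Scaling factor: sum of uprob over every possible tagging of the sentence."""
        return sum(self.uprob(words, tags)
                   for tags in itertools.product(self.tags, repeat=len(words)))

    def prob(self, words, tags):
        return self.uprob(words, tags) / self.Z(words)

# Tiny demo with made-up parameters (two tags, two words).
tparam = {"<START>": {"n": 1.0, "v": 1.0}, "n": {"n": 1.0, "v": 2.0}, "v": {"n": 2.0, "v": 1.0}}
eparam = {"n": {"time": 3.0, "flies": 1.0}, "v": {"time": 1.0, "flies": 2.0}}
m = TaggingModel(tparam, eparam, ["n", "v"])
print(m.prob(["time", "flies"], ("n", "v")))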

Page 25: Structured Belief Propagation for NLP

Modeling with Factor Graphs
• Factor graphs can be used to model many linguistic structures.
• Here we highlight a few example NLP tasks.
  – People have used BP for all of these.
• We’ll describe how variables and factors were used to describe structures and the interactions among their parts.

25

Page 26: Structured Belief Propagation for NLP

Annotating a Tree

26

Given: a sentence and unlabeled parse tree.

n v p d n

time likeflies an arrow

np

vp

pp

s

Page 27: Structured Belief Propagation for NLP

Annotating a Tree

27

Given: a sentence and unlabeled parse tree.

Construct a factor graph which mimics the tree structure, to predict the tags / nonterminals.

X1

ψ1

ψ2 X2

ψ3

ψ4 X3

ψ5

ψ6 X4

ψ7

ψ8 X5

ψ9

time likeflies an arrow

X6

ψ10

X8

ψ12

X7

ψ11

X9

ψ13

Page 28: Structured Belief Propagation for NLP

Annotating a Tree

28

Given: a sentence and unlabeled parse tree.

Construct a factor graph which mimics the tree structure, to predict the tags / nonterminals.

n

ψ1

ψ2 v

ψ3

ψ4 p

ψ5

ψ6 d

ψ7

ψ8 n

ψ9

time likeflies an arrow

npψ10

vpψ12

ppψ11

sψ13

Page 29: Structured Belief Propagation for NLP

Annotating a Tree

29

n

ψ1

ψ2 v

ψ3

ψ4 p

ψ5

ψ6 d

ψ7

ψ8 n

ψ9

time likeflies an arrow

npψ10

vpψ12

ppψ11

sψ13

Given: a sentence and unlabeled parse tree.

Construct a factor graph which mimics the tree structure, to predict the tags / nonterminals. We could add a linear

chain structure between tags.(This creates cycles!)

Page 30: Structured Belief Propagation for NLP

Constituency Parsing

30

n

ψ1

ψ2 v

ψ3

ψ4 p

ψ5

ψ6 d

ψ7

ψ8 n

ψ9

time likeflies an arrow

npψ10

vpψ12

ppψ11

sψ13

What if we needed to predict the tree structure too?

Use more variables: Predict the nonterminal of each substring, or ∅ if it’s not a constituent.

∅ψ10

∅ψ10

∅ψ10

∅ψ10

∅ψ10

∅ψ10

Page 31: Structured Belief Propagation for NLP

Constituency Parsing

31

n

ψ1

ψ2 v

ψ3

ψ4 p

ψ5

ψ6 d

ψ7

ψ8 n

ψ9

time likeflies an arrow

npψ10

vpψ12

ppψ11

sψ13

What if we needed to predict the tree structure too?

Use more variables: Predict the nonterminal of each substring, or ∅ if it’s not a constituent.

But nothing prevents non-tree structures.

∅ψ10

∅ψ10

sψ10

∅ψ10

∅ψ10

∅ψ10

Page 32: Structured Belief Propagation for NLP

Constituency Parsing

32

n

ψ1

ψ2 v

ψ3

ψ4 p

ψ5

ψ6 d

ψ7

ψ8 n

ψ9

time likeflies an arrow

npψ10

vpψ12

ppψ11

sψ13

What if we needed to predict the tree structure too?

Use more variables: Predict the nonterminal of each substring, or ∅ if it’s not a constituent.

But nothing prevents non-tree structures.

∅ψ10

∅ψ10

sψ10

∅ψ10

∅ψ10

∅ψ10

Page 33: Structured Belief Propagation for NLP

Constituency Parsing

33

n

ψ1

ψ2 v

ψ3

ψ4 p

ψ5

ψ6 d

ψ7

ψ8 n

ψ9

time likeflies an arrow

npψ10

vpψ12

ppψ11

sψ13

What if we needed to predict the tree structure too?

Use more variables: Predict the nonterminal of each substring, or ∅ if it’s not a constituent.

But nothing prevents non-tree structures.

∅ψ10

∅ψ10

sψ10

∅ψ10

∅ψ10

∅ψ10

Add a factor which multiplies in 1 if the variables form a tree and 0 otherwise.

Page 34: Structured Belief Propagation for NLP

Constituency Parsing

34

n

ψ1

ψ2 v

ψ3

ψ4 p

ψ5

ψ6 d

ψ7

ψ8 n

ψ9

time likeflies an arrow

npψ10

vpψ12

ppψ11

sψ13

What if we needed to predict the tree structure too?

Use more variables: Predict the nonterminal of each substring, or ∅ if it’s not a constituent.

But nothing prevents non-tree structures.

∅ψ10

∅ψ10

∅ψ10

∅ψ10

∅ψ10

∅ψ10

Add a factor which multiplies in 1 if the variables form a tree and 0 otherwise.

Page 35: Structured Belief Propagation for NLP

Constituency Parsing
• Variables:
  – Constituent type (or ∅) for each of O(n²) substrings
• Interactions:
  – Constituents must describe a binary tree
  – Tag bigrams
  – Nonterminal triples (parent, left-child, right-child) [these factors not shown]

35

Example Task:

(Naradowsky, Vieira, & Smith, 2012)

n v p d n

time likeflies an arrow

np

vp

pp

s

n

ψ1

ψ2 v

ψ3

ψ4 p

ψ5

ψ6 d

ψ7

ψ8 n

ψ9

time likeflies an arrow

np

ψ10

vp

ψ12

pp

ψ11

s

ψ13

ψ10

ψ10

ψ10

ψ10

ψ10

ψ10

Page 36: Structured Belief Propagation for NLP

Dependency Parsing
• Variables:
  – POS tag for each word
  – Syntactic label (or ∅) for each of O(n²) possible directed arcs
• Interactions:
  – Arcs must form a tree
  – Discourage (or forbid) crossing edges
  – Features on edge pairs that share a vertex

36 (Smith & Eisner, 2008)

Example Task:

time likeflies an arrow

*Figure from Burkett & Klein (2012)

• Learn to discourage a verb from having 2 objects, etc.

• Learn to encourage specific multi-arc constructions

Page 37: Structured Belief Propagation for NLP

Joint CCG Parsing and Supertagging
• Variables:
  – Spans
  – Labels on non-terminals
  – Supertags on pre-terminals
• Interactions:
  – Spans must form a tree
  – Triples of labels: parent, left-child, and right-child
  – Adjacent tags

37 (Auli & Lopez, 2011)

Example Task:

Page 38: Structured Belief Propagation for NLP

38  Figure thanks to Markus Dreyer

Example task: Transliteration or Back-Transliteration
• Variables (string):
  – English and Japanese orthographic strings
  – English and Japanese phonological strings
• Interactions:
  – All pairs of strings could be relevant

Page 39: Structured Belief Propagation for NLP

Example task: Morphological Paradigms
• Variables (string):
  – Inflected forms of the same verb
• Interactions:
  – Between pairs of entries in the table (e.g. infinitive form affects present-singular)

39 (Dreyer & Eisner, 2009)

Page 40: Structured Belief Propagation for NLP

Word Alignment / Phrase Extraction
• Variables (boolean):
  – For each (Chinese phrase, English phrase) pair, are they linked?
• Interactions:
  – Word fertilities
  – Few “jumps” (discontinuities)
  – Syntactic reorderings
  – “ITG constraint” on alignment
  – Phrases are disjoint (?)

40 (Burkett & Klein, 2012)

Application:

Page 41: Structured Belief Propagation for NLP

Congressional Voting

41 (Stoyanov & Eisner, 2012)

Application:
• Variables:
  – Representative’s vote
  – Text of all speeches of a representative
  – Local contexts of references between two representatives
• Interactions:
  – Words used by representative and their vote
  – Pairs of representatives and their local context

Page 42: Structured Belief Propagation for NLP

Semantic Role Labeling with Latent Syntax

• Variables:
  – Semantic predicate sense
  – Semantic dependency arcs
  – Labels of semantic arcs
  – Latent syntactic dependency arcs
• Interactions:
  – Pairs of syntactic and semantic dependencies
  – Syntactic dependency arcs must form a tree

42 (Naradowsky, Riedel, & Smith, 2012) (Gormley, Mitchell, Van Durme, & Dredze, 2014)

Application:

[Figure: "time flies like an arrow" with semantic arcs arg0 and arg1, and the sentence "<WALL> The barista made coffee" (positions 0-4) with one variable L_{i,j} or R_{i,j} for each possible left- or right-pointing dependency arc, including arcs from the wall (L_{0,j}).]

Page 43: Structured Belief Propagation for NLP

Joint NER & Sentiment Analysis
• Variables:
  – Named entity spans
  – Sentiment directed toward each entity
• Interactions:
  – Words and entities
  – Entities and sentiment

43 (Mitchell, Aguilar, Wilson, & Van Durme, 2013)

Application:

I love Mark Twain

PERSON

POSITIVE

Page 44: Structured Belief Propagation for NLP

44

Variable-centric view of the world

When we deeply understand language, what representations (type and token) does that understanding comprise?

Page 45: Structured Belief Propagation for NLP

45

[Concept map of the variables involved: lexicon (word types), semantics, sentences, discourse context, resources, entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings, typos, formatting, entanglement, annotation, N tokens.]

To recover variables, model and exploit their correlations

Page 46: Structured Belief Propagation for NLP

Section 2:Belief Propagation Basics

46

Page 47: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

47

Page 48: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

48

Page 49: Structured Belief Propagation for NLP

Factor Graph Notation

49

• Variables:

• Factors:

Joint Distribution

[Figure: a factor graph over variables X1-X4 and X7-X9 (words "time flies like an"), with unary factors ψ{1}, ψ{2}, ψ{3}, pairwise factors ψ{1,2}, ψ{2,3}, ψ{3,4}, and higher-order factors ψ{1,8,9}, ψ{2,7,8}, ψ{3,6,7}. The joint distribution is the normalized product of all the factors: p(x) = (1/Z) ∏_α ψ_α(x_α).]

Page 50: Structured Belief Propagation for NLP

[Figure: the same factor graph as on the previous slide, over X1-X4 and X7-X9 with its unary, pairwise, and higher-order factors.]

Factors are Tensors

50

• Factors:

Unary factor (a vector):
  v    3
  n    4
  p    0.1
  d    0.1

Binary factor (a matrix):
        v     n     p     d
  v     1     6     3     4
  n     8     4     2     0.1
  p     1     3     1     3
  d     0.1   8     0     0

Ternary factor (a 3rd-order tensor, drawn as a stack of matrices over nonterminals):
        s     vp    pp    …
  s     0     2     .3
  vp    3     4     2
  pp    .1    2     1
  …

Page 51: Structured Belief Propagation for NLP

Inference
Given a factor graph, two common tasks …
– Compute the most likely joint assignment, x* = argmax_x p(X=x)
– Compute the marginal distribution of variable Xi: p(Xi=xi) for each value xi

Both consider all joint assignments.
Both are NP-Hard in general.
So, we turn to approximations.

51

p(Xi=xi) = sum of p(X=x) over joint assignments with Xi=xi

Page 52: Structured Belief Propagation for NLP

Marginals by Sampling on Factor Graph

52
time flies like an arrow

X1 ψ2 X2 ψ4 X3 ψ6 X4 ψ8 X5

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0X0

<START>

Sample 1: n v p d n
Sample 2: n n v d n
Sample 3: n v p d n
Sample 4: v n p d n
Sample 5: v n v d n
Sample 6: n v p d n

Suppose we took many samples from the distribution over taggings:

Page 53: Structured Belief Propagation for NLP

Marginals by Sampling on Factor Graph

53
time flies like an arrow

X1 ψ2 X2 ψ4 X3 ψ6 X4 ψ8 X5

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0X0

<START>

Sample 1: n v p d n
Sample 2: n n v d n
Sample 3: n v p d n
Sample 4: v n p d n
Sample 5: v n v d n
Sample 6: n v p d n

The marginal p(Xi = xi) gives the probability that variable Xi takes value xi in a random sample

Page 54: Structured Belief Propagation for NLP

Marginals by Sampling on Factor Graph

54
time flies like an arrow

X1 ψ2 X2 ψ4 X3 ψ6 X4 ψ8 X5

ψ1 ψ3 ψ5 ψ7 ψ9

ψ0X0

<START>

Sample 1: n v p d n
Sample 2: n n v d n
Sample 3: n v p d n
Sample 4: v n p d n
Sample 5: v n v d n
Sample 6: n v p d n

Estimate the marginals as:
  X1:  n 4/6,  v 2/6
  X2:  n 3/6,  v 3/6
  X3:  p 4/6,  v 2/6
  X4:  d 6/6
  X5:  n 6/6
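A tiny Python sketch (mine) of exactly this estimator: the marginal for each variable is just the frequency of each value across the samples listed above.

from collections import Counter

samples = [
    "n v p d n".split(),
    "n n v d n".split(),
    "n v p d n".split(),
    "v n p d n".split(),
    "v n v d n".split(),
    "n v p d n".split(),
]

def estimated_marginal(samples, i):
    """Fraction of samples in which the i-th variable takes each value."""
    counts = Counter(s[i] for s in samples)
    return {value: count / len(samples) for value, count in counts.items()}

print(estimated_marginal(samples, 0))   # {'n': 0.666..., 'v': 0.333...}, i.e. 4/6 and 2/6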

Page 55: Structured Belief Propagation for NLP

• Sampling one joint assignment is also NP-hard in general.
  – In practice: Use MCMC (e.g., Gibbs sampling) as an anytime algorithm.
  – So draw an approximate sample fast, or run longer for a “good” sample.
• Sampling finds the high-probability values xi efficiently. But it takes too many samples to see the low-probability ones.
  – How do you find p(“The quick brown fox …”) under a language model?
    • Draw random sentences to see how often you get it? Takes a long time.
    • Or multiply factors (trigram probabilities)? That’s what BP would do.

55

Why not just sample?
How do we get marginals without sampling?
That’s what Belief Propagation is all about!

Page 56: Structured Belief Propagation for NLP

Great Ideas in ML: Message Passing

3 behind you

2 behind you

1 behind you

4 behind you

5 behind you

1 beforeyou

2 beforeyou

there's1 of me

3 beforeyou

4 beforeyou

5 beforeyou

Count the soldiers

56adapted from MacKay (2003) textbook

Page 57: Structured Belief Propagation for NLP

Great Ideas in ML: Message Passing

3 behind you

2 beforeyou

there's1 of me

Belief:Must be2 + 1 + 3 = 6 of us

only seemy incomingmessages

2 31

Count the soldiers

57adapted from MacKay (2003) textbook

2 beforeyou

Page 58: Structured Belief Propagation for NLP

Great Ideas in ML: Message Passing

4 behind you

1 beforeyou

there's1 of me

only seemy incomingmessages

Count the soldiers

58adapted from MacKay (2003) textbook

Belief:Must be2 + 1 + 3 = 6 of us2 31

Belief:Must be1 + 1 + 4 = 6 of us1 41

Page 59: Structured Belief Propagation for NLP

Great Ideas in ML: Message Passing

7 here

3 here

11 here(= 7+3+1)

1 of me

Each soldier receives reports from all branches of tree

59adapted from MacKay (2003) textbook

Page 60: Structured Belief Propagation for NLP

Great Ideas in ML: Message Passing

3 here

3 here

7 here(= 3+3+1)

Each soldier receives reports from all branches of tree

60adapted from MacKay (2003) textbook

Page 61: Structured Belief Propagation for NLP

Great Ideas in ML: Message Passing

7 here

3 here

11 here(= 7+3+1)

Each soldier receives reports from all branches of tree

61adapted from MacKay (2003) textbook

Page 62: Structured Belief Propagation for NLP

Great Ideas in ML: Message Passing

7 here

3 here

3 here

Belief:Must be14 of us

Each soldier receives reports from all branches of tree

62adapted from MacKay (2003) textbook

Page 63: Structured Belief Propagation for NLP

Great Ideas in ML: Message PassingEach soldier receives reports from all branches of tree

7 here

3 here

3 here

Belief:Must be14 of us

wouldn't work correctlywith a 'loopy' (cyclic) graph

63adapted from MacKay (2003) textbook

Page 64: Structured Belief Propagation for NLP

Message Passing in Belief Propagation

64

X

Ψ…

My other factors think I’m a noun

But my other variables and I think you’re a verb

v 1, n 6, a 3      (message from X’s other factors)
v 6, n 1, a 3      (message from the factor Ψ)
v 6, n 6, a 9      (their product)

Both of these messages judge the possible values of variable X.Their product = belief at X = product of all 3 messages to X.
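A minimal sketch (mine, not from the slides) of the two rules this picture illustrates: the belief at a variable is the pointwise product of all incoming messages, and the message the variable sends back to one factor is the product of all the other incoming messages. The numbers reuse the ones on this slide; the factor names are invented.

from math import prod

def pointwise_product(messages):
    """Elementwise product of several messages over the same set of values."""
    return {v: prod(m[v] for m in messages) for v in messages[0]}

# Messages arriving at variable X, keyed by which factor sent them (names invented).
incoming = {"other_factors": {"v": 1, "n": 6, "a": 3},
            "psi":           {"v": 6, "n": 1, "a": 3}}

# Belief at X = product of ALL incoming messages.
print(pointwise_product(list(incoming.values())))       # {'v': 6, 'n': 6, 'a': 9}

# Message from X back to factor psi = product of all the OTHER incoming messages.
others = [m for name, m in incoming.items() if name != "psi"]
print(pointwise_product(others))                         # {'v': 1, 'n': 6, 'a': 3}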

Page 65: Structured Belief Propagation for NLP

Sum-Product Belief Propagation

65

Beliefs
Messages

Variables Factors

X2

ψ1X1 X3X1

ψ2

ψ3ψ1

X1

ψ2

ψ3ψ1

X2

ψ1X1 X3

Page 66: Structured Belief Propagation for NLP

X1

ψ2

ψ3ψ1

Sum-Product Belief Propagation

66

Variable Belief
Incoming messages to X1: (v 0.1, n 3, p 1), (v 1, n 2, p 2), (v 4, n 1, p 0)
Belief at X1 = their pointwise product: (v 0.4, n 6, p 0)

Page 67: Structured Belief Propagation for NLP

X1

ψ2

ψ3ψ1

Sum-Product Belief Propagation

67

Variable Message
Incoming messages from the other two factors: (v 0.1, n 3, p 1), (v 1, n 2, p 2)
Message from X1 to the remaining factor = their product: (v 0.1, n 6, p 2)

Page 68: Structured Belief Propagation for NLP

Sum-Product Belief Propagation

68

Factor Belief

ψ1X1 X3

Message from X1: (v 8, n 0.2)
Message from X3: (p 4, d 1, n 0)

Factor ψ1(X3, X1):
        v     n
  p     0.1   8
  d     3     0
  n     1     1

Factor belief = ψ1 × both incoming messages:
        v     n
  p     3.2   6.4
  d     24    0
  n     0     0

Page 69: Structured Belief Propagation for NLP

Sum-Product Belief Propagation

69

Factor Belief

ψ1X1 X3

Factor belief for ψ1:
        v     n
  p     3.2   6.4
  d     24    0
  n     0     0

Page 70: Structured Belief Propagation for NLP

70

Sum-Product Belief Propagation

ψ1X1 X3

Message from X1: (v 8, n 0.2)

Factor ψ1(X3, X1):
        v     n
  p     0.1   8
  d     3     0
  n     1     1

Factor Message to X3 = Σ over x1 of ψ1(x3, x1) · (message from X1)(x1):
  p    0.8 + 1.6
  d    24 + 0
  n    8 + 0.2

Page 71: Structured Belief Propagation for NLP

Sum-Product Belief PropagationFactor Message

71

ψ1X1 X3

matrix-vector product

(for a binary factor)
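Here is a small Python sketch (mine) of that matrix-vector product: the message a binary factor sends to one neighbor sums, over the other neighbor's values, the factor entry times the incoming message. The numbers are the ψ1 example from the preceding slides.

# Factor psi1 over (X3, X1), written as a nested dict: psi1[x3][x1].
psi1 = {"p": {"v": 0.1, "n": 8},
        "d": {"v": 3,   "n": 0},
        "n": {"v": 1,   "n": 1}}

msg_from_X1 = {"v": 8, "n": 0.2}

def factor_to_variable_message(factor, incoming):
    """For each value of the receiving variable, sum over the sender's values
    of factor(value, sender_value) * incoming(sender_value)."""
    return {x3: sum(row[x1] * incoming[x1] for x1 in incoming)
            for x3, row in factor.items()}

print(factor_to_variable_message(psi1, msg_from_X1))
# {'p': 2.4..., 'd': 24.0, 'n': 8.2}   i.e. 0.8 + 1.6,  24 + 0,  8 + 0.2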

Page 72: Structured Belief Propagation for NLP

Input: a factor graph with no cycles
Output: exact marginals for each variable and factor

Algorithm:
1. Initialize the messages to the uniform distribution.
2. Choose a root node.
3. Send messages from the leaves to the root.
   Send messages from the root to the leaves.
4. Compute the beliefs (unnormalized marginals).
5. Normalize beliefs and return the exact marginals.

Sum-Product Belief Propagation

72

Page 73: Structured Belief Propagation for NLP

Sum-Product Belief Propagation

73

Beliefs
Messages

Variables Factors

X2

ψ1X1 X3X1

ψ2

ψ3ψ1

X1

ψ2

ψ3ψ1

X2

ψ1X1 X3

Page 74: Structured Belief Propagation for NLP

Sum-Product Belief Propagation

74

Beliefs
Messages

Variables Factors

X2

ψ1X1 X3X1

ψ2

ψ3ψ1

X1

ψ2

ψ3ψ1

X2

ψ1X1 X3

Page 75: Structured Belief Propagation for NLP

X2 X3X1

CRF Tagging Model

75

find preferred tags
Could be adjective or verb. Could be noun or verb. Could be verb or noun.

Page 76: Structured Belief Propagation for NLP

76

……

find preferred tags

CRF Tagging by Belief Propagation

[Figure: the chain CRF for "find preferred tags", with α messages flowing left-to-right, β messages flowing right-to-left, and a belief at the middle variable. Per-tag vectors shown: (v 0.3, n 0, a 0.1), (v 1.8, n 0, a 4.2), (v 2, n 1, a 7), (v 7, n 2, a 1), (v 3, n 1, a 6), (v 3, n 6, a 1); tag-tag factor (shown twice):
        v   n   a
  v     0   2   1
  n     2   1   0
  a     0   3   1 ]

• Forward-backward is a message passing algorithm.
• It’s the simplest case of belief propagation.

Forward algorithm = message passing (matrix-vector products)
Backward algorithm = message passing (matrix-vector products)

Page 77: Structured Belief Propagation for NLP

X2 X3X1

77

find preferred tags
Could be adjective or verb. Could be noun or verb. Could be verb or noun.

So Let’s Review Forward-Backward …

Page 78: Structured Belief Propagation for NLP

X2 X3X1

So Let’s Review Forward-Backward …

78

v

n

a

v

n

a

v

n

a

START END

• Show the possible values for each variablefind preferred tags

Page 79: Structured Belief Propagation for NLP

X2 X3X1

79

v

n

a

v

n

a

v

n

a

START END

• Let’s show the possible values for each variable

• One possible assignment

find preferred tags

So Let’s Review Forward-Backward …

Page 80: Structured Belief Propagation for NLP

X2 X3X1

80

v

n

a

v

n

a

v

n

a

START END

• Let’s show the possible values for each variable• One possible assignment• And what the 7 factors think of it …

find preferred tags

So Let’s Review Forward-Backward …

Page 81: Structured Belief Propagation for NLP

X2 X3X1

Viterbi Algorithm: Most Probable Assignment

81

v

n

a

v

n

a

v

n

a

START END

find preferred tags
• So p(v a n) = (1/Z) * product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product

ψ {0,1}(START,v)

ψ{1,2} (v,a)

ψ {2,3}(a,n)

ψ{3,4}(a,END)ψ{1}(v)

ψ{2}(a)

ψ{3}(n)

Page 82: Structured Belief Propagation for NLP

X2 X3X1

Viterbi Algorithm: Most Probable Assignment

82

v

n

a

v

n

a

v

n

a

START END

find preferred tags• So p(v a n) = (1/Z) * product weight of one path

ψ {0,1}(START,v)

ψ{1,2} (v,a)

ψ {2,3}(a,n)

ψ{3,4}(a,END)ψ{1}(v)

ψ{2}(a)

ψ{3}(n)

Page 83: Structured Belief Propagation for NLP

X2 X3X1

Forward-Backward Algorithm: Finds Marginals

83

v

n

a

v

n

a

v

n

a

START END

find preferred tags• So p(v a n) = (1/Z) * product weight of one

path• Marginal probability p(X2 = a)

= (1/Z) * total weight of all paths through

a

Page 84: Structured Belief Propagation for NLP

X2 X3X1

Forward-Backward Algorithm: Finds Marginals

84

v

n

a

v

n

a

v

n

a

START END

find preferred tags• So p(v a n) = (1/Z) * product weight of one

path• Marginal probability p(X2 = a)

= (1/Z) * total weight of all paths through

n

Page 85: Structured Belief Propagation for NLP

X2 X3X1

Forward-Backward Algorithm: Finds Marginals

85

v

n

a

v

n

a

v

n

a

START END

find preferred tags• So p(v a n) = (1/Z) * product weight of one

path• Marginal probability p(X2 = a)

= (1/Z) * total weight of all paths through

v

Page 86: Structured Belief Propagation for NLP

X2 X3X1

Forward-Backward Algorithm: Finds Marginals

86

v

n

a

v

n

a

v

n

a

START END

find preferred tags• So p(v a n) = (1/Z) * product weight of one path• Marginal probability p(X2 = a)

= (1/Z) * total weight of all paths through

n

Page 87: Structured Belief Propagation for NLP

α2(n) = total weight of these path prefixes

(found by dynamic programming: matrix-vector products)

X2 X3X1

Forward-Backward Algorithm: Finds Marginals

87

v

n

a

v

n

a

v

n

a

START END

find preferred tags

Page 88: Structured Belief Propagation for NLP

β2(n) = total weight of these path suffixes

X2 X3X1

Forward-Backward Algorithm: Finds Marginals

88

v

n

a

v

n

a

v

n

a

START END

find preferred tags
β2(n)

(found by dynamic programming: matrix-vector products)

Page 89: Structured Belief Propagation for NLP

α2(n) = total weight of these path prefixes

β2(n) = total weight of these path suffixes

X2 X3X1

Forward-Backward Algorithm: Finds Marginals

89

v

n

a

v

n

a

v

n

a

START END

find preferred tags
α2(n) = (a + b + c)     β2(n) = (x + y + z)

Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths

Page 90: Structured Belief Propagation for NLP

total weight of all paths through=

X2 X3X1

Forward-Backward Algorithm: Finds Marginals

90

v

n

a

v

n

a

v

n

a

START END

find preferred tags
n
ψ{2}(n)
α2(n) · β2(n)
α2(n) · ψ{2}(n) · β2(n)

“belief that X2 = n”

Oops! The weight of a path through a state

also includes a weight at that state.

So α(n)∙β(n) isn’t enough.

The extra weight is the opinion of the unigram factor at this variable.

Page 91: Structured Belief Propagation for NLP

X2 X3X1

Forward-Backward Algorithm: Finds Marginals

91

v

n

a

n

a

v

n

a

START END

find preferred tags

ψ{2}(v)
α2(v) · β2(v)
“belief that X2 = v”
“belief that X2 = n”

α2(v) · ψ{2}(v) · β2(v) = total weight of all paths through v

Page 92: Structured Belief Propagation for NLP

X2 X3X1

Forward-Backward Algorithm: Finds Marginals

92

v

n

a

v

n

a

v

n

a

START END

find preferred tags

ψ{2}(a)
α2(a) · β2(a)
“belief that X2 = a”
“belief that X2 = v”
“belief that X2 = n”

sum = Z (total probability of all paths)

Beliefs at X2:  v 1.8,  n 0,  a 4.2
Divide by Z = 6 to get marginal probs:  v 0.3,  n 0,  a 0.7

α2(a) · ψ{2}(a) · β2(a) = total weight of all paths through a
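To tie α, β, the unigram factor, and Z together, here is a compact forward-backward sketch in Python (mine, not the tutorial's). The transition and unary tables are small invented examples rather than the slide's numbers; the point is the three-way product αi(x)·ψ{i}(x)·βi(x) and the shared normalizer Z.

TAGS = ["v", "n", "a"]
# Hypothetical potentials (not the slide's values).
trans = {"v": {"v": 1, "n": 2, "a": 1},          # ψ{i,i+1}(s, t)
         "n": {"v": 2, "n": 1, "a": 1},
         "a": {"v": 1, "n": 3, "a": 1}}
unary = [{"v": 2.0, "n": 1.0, "a": 3.0},          # ψ{i}(t) at each position
         {"v": 1.0, "n": 1.0, "a": 2.0},
         {"v": 3.0, "n": 1.0, "a": 1.0}]

n = len(unary)
alpha = [{t: 1.0 if i == 0 else 0.0 for t in TAGS} for i in range(n)]
beta  = [{t: 1.0 if i == n - 1 else 0.0 for t in TAGS} for i in range(n)]

# Forward pass: alpha[i][t] = total weight of path prefixes arriving at tag t.
for i in range(1, n):
    for t in TAGS:
        alpha[i][t] = sum(alpha[i-1][s] * unary[i-1][s] * trans[s][t] for s in TAGS)

# Backward pass: beta[i][t] = total weight of path suffixes leaving tag t.
for i in range(n - 2, -1, -1):
    for t in TAGS:
        beta[i][t] = sum(trans[t][s] * unary[i+1][s] * beta[i+1][s] for s in TAGS)

# Belief at position i multiplies in the unigram factor too; every position shares Z.
beliefs = [{t: alpha[i][t] * unary[i][t] * beta[i][t] for t in TAGS} for i in range(n)]
Z = sum(beliefs[0].values())
marginals = [{t: b[t] / Z for t in TAGS} for b in beliefs]
print(Z, marginals[1])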

Page 93: Structured Belief Propagation for NLP

(Acyclic) Belief Propagation

93

In a factor graph with no cycles:
1. Pick any node to serve as the root.
2. Send messages from the leaves to the root.
3. Send messages from the root to the leaves.
A node computes an outgoing message along an edge only after it has received incoming messages along all its other edges.

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

time likeflies an arrow

X6

ψ10

X8

ψ12

X7

ψ11

X9

ψ13

Page 94: Structured Belief Propagation for NLP

(Acyclic) Belief Propagation

94

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

time likeflies an arrow

X6

ψ10

X8

ψ12

X7

ψ11

X9

ψ13

In a factor graph with no cycles:
1. Pick any node to serve as the root.
2. Send messages from the leaves to the root.
3. Send messages from the root to the leaves.
A node computes an outgoing message along an edge only after it has received incoming messages along all its other edges.

Page 95: Structured Belief Propagation for NLP

Acyclic BP as Dynamic Programming

95

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

time likeflies an arrow

X6

ψ10

ψ12

Xi

ψ14

X9

ψ13

ψ11

F

G H

Figure adapted from Burkett & Klein (2012)

Subproblem:Inference using just the factors in subgraph H

Page 96: Structured Belief Propagation for NLP

Acyclic BP as Dynamic Programming

96

X1 X2 X3 X4

ψ7

X5

ψ9

time likeflies an arrow

X6

ψ10

Xi

X9

ψ11

H

Subproblem:Inference using just the factors in subgraph H

The marginal of Xi in that smaller model is the message sent to Xi from subgraph H.
Message to a variable

Page 97: Structured Belief Propagation for NLP

Acyclic BP as Dynamic Programming

97

X1 X2 X3

ψ5

X4 X5

time likeflies an arrow

X6

Xi

ψ14

X9

G

Subproblem:Inference using just the factors in subgraph H

The marginal of Xi in that smaller model is the message sent to Xi from subgraph H.
Message to a variable

Page 98: Structured Belief Propagation for NLP

Acyclic BP as Dynamic Programming

98

X1

ψ1

X2

ψ3

X3 X4 X5

time likeflies an arrow

X6

X 8

ψ12

Xi

X9

ψ13

F

Subproblem:Inference using just the factors in subgraph H

The marginal of Xi in that smaller model is the message sent to Xi from subgraph H.
Message to a variable

Page 99: Structured Belief Propagation for NLP

Acyclic BP as Dynamic Programming

99

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

time likeflies an arrow

X6

ψ10

ψ12

Xi

ψ14

X9

ψ13

ψ11

F

H

Subproblem: Inference using just the factors in subgraph F ∪ H

The marginal of Xi in that smaller model is the message sent by Xi out of subgraph F ∪ H.
Message from a variable

Page 100: Structured Belief Propagation for NLP

• If you want the marginal pi(xi) where Xi has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs.

• Each subgraph is obtained by cutting some edge of the tree.
• The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.

Acyclic BP as Dynamic Programming

100time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

Page 101: Structured Belief Propagation for NLP

• If you want the marginal pi(xi) where Xi has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs.

• Each subgraph is obtained by cutting some edge of the tree.
• The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.

Acyclic BP as Dynamic Programming

101time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

Page 102: Structured Belief Propagation for NLP

• If you want the marginal pi(xi) where Xi has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs.

• Each subgraph is obtained by cutting some edge of the tree.
• The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.

Acyclic BP as Dynamic Programming

102time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

Page 103: Structured Belief Propagation for NLP

• If you want the marginal pi(xi) where Xi has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs.

• Each subgraph is obtained by cutting some edge of the tree.
• The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.

Acyclic BP as Dynamic Programming

103time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

Page 104: Structured Belief Propagation for NLP

• If you want the marginal pi(xi) where Xi has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs.

• Each subgraph is obtained by cutting some edge of the tree.
• The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.

Acyclic BP as Dynamic Programming

104time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

Page 105: Structured Belief Propagation for NLP

• If you want the marginal pi(xi) where Xi has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs.

• Each subgraph is obtained by cutting some edge of the tree.
• The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.

Acyclic BP as Dynamic Programming

105time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

Page 106: Structured Belief Propagation for NLP

• If you want the marginal pi(xi) where Xi has degree k, you can think of that summation as a product of k marginals computed on smaller subgraphs.

• Each subgraph is obtained by cutting some edge of the tree.
• The message-passing algorithm uses dynamic programming to compute the marginals on all such subgraphs, working from smaller to bigger. So you can compute all the marginals.

Acyclic BP as Dynamic Programming

106time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

Page 107: Structured Belief Propagation for NLP

Loopy Belief Propagation

107time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

• Messages from different subgraphs are no longer independent!
  – Dynamic programming can’t help. It’s now #P-hard in general to compute the exact marginals.
• But we can still run BP -- it's a local algorithm so it doesn't "see the cycles."

ψ2 ψ4 ψ6 ψ8

What if our graph has cycles?

Page 108: Structured Belief Propagation for NLP

What can go wrong with loopy BP?

108

F
All 4 factors on cycle enforce equality

F

F

F

Page 109: Structured Belief Propagation for NLP

What can go wrong with loopy BP?

109

T
All 4 factors on cycle enforce equality

T

T

T

This factor says the upper variable is twice as likely to be true as false (and that’s the true marginal!)

Page 110: Structured Belief Propagation for NLP

This is an extreme example. Often in practice, the cyclic influences are weak. (As cycles are long or include at least one weak correlation.)

• BP incorrectly treats this message as separate evidence that the variable is T.

• Multiplies these two messages as if they were independent.

• But they don’t actually come from independent parts of the graph.

• One influenced the other (via a cycle).

What can go wrong with loopy BP?

110

All 4 factors on cycle enforce equality.
Message at each of the four variables: (T 2, F 1).

This factor says the upper variable is twice as likely to be T as F (and that’s the true marginal!)

Later in the run the messages have grown to (T 4, F 1), (T 4, F 1), …

• Messages loop around and around …

• 2, 4, 8, 16, 32, ... More and more convinced that these variables are T!

• So beliefs converge to marginal distribution (1, 0) rather than (2/3, 1/3).

Page 111: Structured Belief Propagation for NLP

Kathy says so. Bob says so. Charlie says so. Alice says so.

Your prior doesn’t think Obama owns it.But everyone’s saying he does. Under a Naïve Bayes model, you therefore believe it.

What can go wrong with loopy BP?

111

Obama owns it
Prior: T 1, F 99
Each “says so” message: T 2, F 1
Belief: T 2048, F 99

A lie told often enough becomes truth.-- Lenin

A rumor is circulating that Obama secretly owns an insurance company. (Obamacare is actually designed to maximize his profit.)

Page 112: Structured Belief Propagation for NLP

Kathy says so. Bob says so. Charlie says so. Alice says so.

Better model ... Rush can influence conversation.

– Now there are 2 ways to explain why everyone’s repeating the story: it’s true, or Rush said it was.

– The model favors one solution (probably Rush).

– Yet BP has 2 stable solutions. Each solution is self-reinforcing around cycles; no impetus to switch.

What can go wrong with loopy BP?

112

Obama owns it
Prior: T 1, F 99
Belief: T ???, F ???

Rush

says so

A lie told often enough becomes truth.-- Lenin

If everyone blames Obama, then no one has to blame Rush.

But if no one blames Rush, then everyone has to continue to blame Obama (to explain the gossip).

T 1F 2

4

Actually 4 ways: but “both” has a low prior and “neither” has a low likelihood, so only 2 good ways.

Page 113: Structured Belief Propagation for NLP

Loopy Belief Propagation Algorithm

• Run the BP update equations on a cyclic graph
  – Hope it “works” anyway (good approximation)
    • Though we multiply messages that aren’t independent
    • No interpretation as dynamic programming
  – If largest element of a message gets very big or small,
    • Divide the message by a constant to prevent over/underflow
• Can update messages in any order
  – Stop when the normalized messages converge
• Compute beliefs from final messages
  – Return normalized beliefs as approximate marginals

113e.g., Murphy, Weiss & Jordan (1999)

Page 114: Structured Belief Propagation for NLP

Input: a factor graph with cycles
Output: approximate marginals for each variable and factor

Algorithm:
1. Initialize the messages to the uniform distribution.
2. Send messages until convergence.
   Normalize them when they grow too large.
3. Compute the beliefs (unnormalized marginals).
4. Normalize beliefs and return the approximate marginals.

Loopy Belief Propagation

114
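A sketch (mine) of that loop for a generic factor graph: keep variable-to-factor and factor-to-variable message tables, update them repeatedly, rescale each message so its entries sum to one, and stop when nothing changes much. The data structures and names are invented, and a real implementation would also choose a message-update order.

from math import prod

def loopy_bp(variables, factors, max_iters=50, tol=1e-6):
    """variables: dict var -> list of values.
       factors:   dict fac -> (list of neighbor vars, table mapping value-tuples to weights >= 0)."""
    v2f = {(v, f): {x: 1.0 for x in variables[v]}          # variable -> factor messages
           for f, (nbrs, _) in factors.items() for v in nbrs}
    f2v = {(f, v): {x: 1.0 for x in variables[v]}          # factor -> variable messages
           for f, (nbrs, _) in factors.items() for v in nbrs}

    def normalize(msg):
        z = sum(msg.values())
        return {x: val / z for x, val in msg.items()} if z > 0 else msg

    for _ in range(max_iters):
        delta = 0.0
        # Variable -> factor: product of the other incoming factor messages.
        for (v, f) in v2f:
            other = [f2v[(g, u)] for (g, u) in f2v if u == v and g != f]
            new = normalize({x: prod(m[x] for m in other) if other else 1.0
                             for x in variables[v]})
            delta = max(delta, max(abs(new[x] - v2f[(v, f)][x]) for x in new))
            v2f[(v, f)] = new
        # Factor -> variable: sum over the other variables of factor * their messages.
        for (f, v) in f2v:
            nbrs, table = factors[f]
            i = nbrs.index(v)
            new = {x: 0.0 for x in variables[v]}
            for assignment, val in table.items():
                if val == 0.0:
                    continue
                new[assignment[i]] += val * prod(v2f[(u, f)][assignment[j]]
                                                 for j, u in enumerate(nbrs) if u != v)
            new = normalize(new)
            delta = max(delta, max(abs(new[x] - f2v[(f, v)][x]) for x in new))
            f2v[(f, v)] = new
        if delta < tol:
            break

    # Beliefs: product of all incoming factor messages at each variable, normalized.
    return {v: normalize({x: prod(f2v[(f, u)][x] for (f, u) in f2v if u == v)
                          for x in vals})
            for v, vals in variables.items()}

On an acyclic graph this flooding schedule reaches the exact marginals; with cycles it returns the approximate marginals described above.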

Page 115: Structured Belief Propagation for NLP

Section 2 Appendix

Tensor Notation for BP

115

Page 116: Structured Belief Propagation for NLP

Tensor Notation for BPIn section 2, BP was introduced with a notation which defined messages and beliefs as functions.

This Appendix includes an alternate (and very concise) notation for the Belief Propagation algorithm using tensors.

Page 117: Structured Belief Propagation for NLP

Tensor Notation• Tensor multiplication:

• Tensor marginalization:

117

Page 118: Structured Belief Propagation for NLP

Tensor Notation

118

A rank-r tensor is…
  = a real function with r keyword arguments
  = an axis-labeled array with arbitrary indices
  = a database with column headers

Tensor multiplication (vector outer product):
  A over X:  1 → 3,  2 → 5
  B over Y:  red → 4,  blue → 6
  Product over (X, Y):
          red   blue
    X=1   12    18
    X=2   20    30
  As a database: (X=1, Y=red, 12), (X=2, Y=red, 20), (X=1, Y=blue, 18), (X=2, Y=blue, 30)

Page 119: Structured Belief Propagation for NLP

Tensor Notation

119

A rank-r tensor is…
  = a real function with r keyword arguments
  = an axis-labeled array with arbitrary indices
  = a database with column headers

Tensor multiplication (vector pointwise product):
  A over X:  a → 3,  b → 5
  B over X:  a → 4,  b → 6
  Product over X:  a → 12,  b → 30

Page 120: Structured Belief Propagation for NLP

Tensor Notation

120

A rank-r tensor is…
  = a real function with r keyword arguments
  = an axis-labeled array with arbitrary indices
  = a database with column headers

Tensor multiplication (matrix-vector product):
  A over (X, Y):
          red   blue
    X=1   3     4
    X=2   5     6
  B over X:  1 → 7,  2 → 8
  Product over (X, Y):
          red   blue
    X=1   21    28
    X=2   40    48

Page 121: Structured Belief Propagation for NLP

Tensor Notation

121

A rank-r tensor is…
  = a real function with r keyword arguments
  = an axis-labeled array with arbitrary indices
  = a database with column headers

Tensor marginalization (sum out X):
  A over (X, Y):
          red   blue
    X=1   3     4
    X=2   5     6
  Σ_X A, over Y:  red → 8,  blue → 10
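For readers who think in arrays, here is a short NumPy sketch (mine, not part of the tutorial) reproducing the examples above: tensor multiplication is a broadcasted elementwise product, and marginalization is a sum over one axis. The axis order (X, Y) is an arbitrary choice.

import numpy as np

# Outer product: A over X times B over Y gives a tensor over (X, Y).
A = np.array([3, 5])                     # X = 1, 2
B = np.array([4, 6])                     # Y = red, blue
print(np.einsum('x,y->xy', A, B))        # [[12 18] [20 30]]

# Pointwise product: two tensors over the same axis X.
print(np.array([3, 5]) * np.array([4, 6]))   # [12 30]

# "Matrix-vector" multiplication in this notation: no summation, just an
# elementwise product broadcast along the shared axis X.
M = np.array([[3, 4], [5, 6]])           # rows: X = 1, 2; cols: Y = red, blue
v = np.array([7, 8])                     # over X
print(M * v[:, None])                    # [[21 28] [40 48]]

# Marginalization: sum out the X axis.
print(M.sum(axis=0))                     # [ 8 10]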

Page 122: Structured Belief Propagation for NLP

Input: a factor graph with no cycles
Output: exact marginals for each variable and factor

Algorithm:
1. Initialize the messages to the uniform distribution.
2. Choose a root node.
3. Send messages from the leaves to the root.
   Send messages from the root to the leaves.
4. Compute the beliefs (unnormalized marginals).
5. Normalize beliefs and return the exact marginals.

Sum-Product Belief Propagation

122

Page 123: Structured Belief Propagation for NLP

Sum-Product Belief Propagation

123

Beliefs
Messages

Variables Factors

X2

ψ1X1 X3X1

ψ2

ψ3ψ1

X1

ψ2

ψ3ψ1

X2

ψ1X1 X3

Page 124: Structured Belief Propagation for NLP

Sum-Product Belief Propagation

124

Beliefs
Messages

Variables Factors

X2

ψ1X1 X3X1

ψ2

ψ3ψ1

X1

ψ2

ψ3ψ1

X2

ψ1X1 X3

Page 125: Structured Belief Propagation for NLP

X1

ψ2

ψ3ψ1

Sum-Product Belief Propagation

125

Variable Belief
Incoming messages to X1: (v 0.1, n 3, p 1), (v 1, n 2, p 2), (v 4, n 1, p 0)
Belief at X1 = their pointwise product: (v 0.4, n 6, p 0)

Page 126: Structured Belief Propagation for NLP

X1

ψ2

ψ3ψ1

Sum-Product Belief Propagation

126

Variable Message
Incoming messages from the other two factors: (v 0.1, n 3, p 1), (v 1, n 2, p 2)
Message from X1 to the remaining factor = their product: (v 0.1, n 6, p 2)

Page 127: Structured Belief Propagation for NLP

Sum-Product Belief Propagation

127

Factor Belief

ψ1X1 X3

Message from X1: (v 8, n 0.2)
Message from X3: (p 4, d 1, n 0)

Factor ψ1(X3, X1):
        v     n
  p     0.1   8
  d     3     0
  n     1     1

Factor belief = ψ1 × both incoming messages:
        v     n
  p     3.2   6.4
  d     24    0
  n     0     0

Page 128: Structured Belief Propagation for NLP

Sum-Product Belief Propagation

128

Factor Message

ψ1X1 X3

Message from X1: (v 8, n 0.2)

Factor ψ1(X3, X1):
        v     n
  p     0.1   8
  d     3     0
  n     1     1

Factor message to X3 = Σ over x1 of ψ1(x3, x1) · (message from X1)(x1):
  p    0.8 + 1.6
  d    24 + 0
  n    8 + 0.2

Page 129: Structured Belief Propagation for NLP

Input: a factor graph with cycles
Output: approximate marginals for each variable and factor

Algorithm:
1. Initialize the messages to the uniform distribution.
2. Send messages until convergence.
   Normalize them when they grow too large.
3. Compute the beliefs (unnormalized marginals).
4. Normalize beliefs and return the approximate marginals.

Loopy Belief Propagation

129

Page 130: Structured Belief Propagation for NLP

130

Section 3:Belief Propagation Q&A

Methods like BP and in what sense they work

Page 131: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

131

Page 132: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

132

Page 133: Structured Belief Propagation for NLP

133

Q&A
Q: Forward-backward is to the Viterbi algorithm as sum-product BP is to __________ ?
A: max-product BP

Page 134: Structured Belief Propagation for NLP

Max-product Belief Propagation
• Sum-product BP can be used to compute the marginals, pi(Xi)
• Max-product BP can be used to compute the most likely assignment, X* = argmax_X p(X)

134

Page 135: Structured Belief Propagation for NLP

Max-product Belief Propagation
• Change the sum to a max:
• Max-product BP computes max-marginals
  – The max-marginal bi(xi) is the (unnormalized) probability of the MAP assignment under the constraint Xi = xi.
  – For an acyclic graph, the MAP assignment (assuming there are no ties) is given by:

135

Page 136: Structured Belief Propagation for NLP

Max-product Belief Propagation
• Change the sum to a max:
• Max-product BP computes max-marginals
  – The max-marginal bi(xi) is the (unnormalized) probability of the MAP assignment under the constraint Xi = xi.
  – For an acyclic graph, the MAP assignment (assuming there are no ties) is given by:

136
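The equations missing from these two slides are the sum-product updates with the summation replaced by a maximization. As a sketch (mine), here is the factor-to-variable message from the earlier binary-factor example in both flavors; only the reduction changes.

psi1 = {"p": {"v": 0.1, "n": 8},
        "d": {"v": 3,   "n": 0},
        "n": {"v": 1,   "n": 1}}
msg_from_X1 = {"v": 8, "n": 0.2}

def message(factor, incoming, reduce):
    """Factor-to-variable message; reduce is sum (sum-product) or max (max-product)."""
    return {x3: reduce(row[x1] * incoming[x1] for x1 in incoming)
            for x3, row in factor.items()}

print(message(psi1, msg_from_X1, sum))   # {'p': 2.4..., 'd': 24.0, 'n': 8.2}
print(message(psi1, msg_from_X1, max))   # {'p': 1.6, 'd': 24.0, 'n': 8.0}

Recovering the argmax assignment itself additionally needs backpointers (or the decoding rule mentioned above), which this sketch omits.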

Page 137: Structured Belief Propagation for NLP

Deterministic Annealing
Motivation: Smoothly transition from sum-product to max-product
1. Incorporate inverse temperature parameter into each factor:
2. Send messages as usual for sum-product BP
3. Anneal T from 1 to 0:
4. Take resulting beliefs to power T

137

Annealed Joint Distribution
T = 1: Sum-product
T → 0: Max-product
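A sketch (mine) of steps 1-4: raise every factor value to the power 1/T, run ordinary sum-product BP, and lower T across runs. Here run_sum_product_bp is a placeholder for whatever BP routine you use (for instance the loopy sketch earlier), and the schedule is an arbitrary example.

def annealed_factors(factors, T):
    """Step 1: each factor value psi(x) becomes psi(x) ** (1 / T)."""
    return {f: (nbrs, {assignment: value ** (1.0 / T)
                       for assignment, value in table.items()})
            for f, (nbrs, table) in factors.items()}

def annealed_beliefs(variables, factors, run_sum_product_bp, schedule=(1.0, 0.5, 0.1)):
    beliefs = None
    for T in schedule:                                   # Step 3: anneal T from 1 toward 0.
        beliefs = run_sum_product_bp(variables, annealed_factors(factors, T))   # Step 2
    # Step 4: take the final beliefs to the power of the last (smallest) T.
    return {v: {x: b ** T for x, b in bel.items()} for v, bel in beliefs.items()}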

Page 138: Structured Belief Propagation for NLP

138

Q&A
Q: This feels like Arc Consistency… Any relation?
A: Yes, BP is doing (with probabilities) what people were doing in AI long before.

Page 139: Structured Belief Propagation for NLP

139

From Arc Consistency to BP
Goal: Find a satisfying assignment
Algorithm: Arc Consistency
1. Pick a constraint
2. Reduce domains to satisfy the constraint
3. Repeat until convergence

Variables: X, Y, U, T ∈ {1, 2, 3}, with pairwise constraints: one between X and Y, Y = U, one between T and U, and X < T.
[Figure: the constraint graph; each node is annotated with its current domain {1, 2, 3}.]

Propagation completely solved the problem!

Note: These steps can occur in

somewhat arbitrary order

Slide thanks to Rina Dechter (modified)

Page 140: Structured Belief Propagation for NLP

140

From Arc Consistency to BP
Goal: Find a satisfying assignment
Algorithm: Arc Consistency
1. Pick a constraint
2. Reduce domains to satisfy the constraint
3. Repeat until convergence

Variables: X, Y, U, T ∈ {1, 2, 3}, with pairwise constraints: one between X and Y, Y = U, one between T and U, and X < T.
[Figure: the constraint graph; each node is annotated with its current domain {1, 2, 3}.]

Propagation completely solved the problem!

Note: These steps can occur in

somewhat arbitrary order

Slide thanks to Rina Dechter (modified)

Arc Consistency is a special case of Belief Propagation.

Page 141: Structured Belief Propagation for NLP

From Arc Consistency to BP
Solve the same problem with BP:
• Constraints become "hard" factors with only 1's or 0's
• Send messages until convergence

141

Variables: X, Y, U, T ∈ {1, 2, 3}, with pairwise constraints: one between X and Y, Y = U, one between T and U, and X < T.
[Figure: the constraint factor graph; each node shows its current domain {1, 2, 3}.]

Slide thanks to Rina Dechter (modified)

The four constraints' 0/1 potential tables (rows and columns indexed by the values 1, 2, 3):

0 1 1      0 1 1      0 0 0      1 0 0
0 0 1      0 0 1      1 0 0      0 1 0
0 0 0      0 0 0      1 1 0      0 0 1

(The identity table encodes the "=" constraint; the triangular tables encode the inequality constraints.)

Page 142: Structured Belief Propagation for NLP

From Arc Consistency to BP
Solve the same problem with BP:
• Constraints become "hard" factors with only 1's or 0's
• Send messages until convergence

142

Variables: X, Y, U, T ∈ {1, 2, 3}, with pairwise constraints: one between X and Y, Y = U, one between T and U, and X < T.
[Figure: the constraint factor graph; each node shows its current domain, and BP messages are shown as tables over the values 1, 2, 3.]

Slide thanks to Rina Dechter (modified)

Page 143: Structured Belief Propagation for NLP

From Arc Consistency to BP

143

Variables: X, Y, U, T ∈ {1, 2, 3}, with pairwise constraints: one between X and Y, Y = U, one between T and U, and X < T.
[Figure: the constraint factor graph; each node shows its current domain, and BP messages are shown as tables over the values 1, 2, 3.]

Slide thanks to Rina Dechter (modified)

Solve the same problem with BP:
• Constraints become "hard" factors with only 1's or 0's
• Send messages until convergence


Page 145: Structured Belief Propagation for NLP

From Arc Consistency to BP

145

Variables: X, Y, U, T ∈ {1, 2, 3}, with pairwise constraints: one between X and Y, Y = U, one between T and U, and X < T.
[Figure: the constraint factor graph; each node shows its current domain, and BP messages are shown as tables over the values 1, 2, 3.]

Slide thanks to Rina Dechter (modified)

Solve the same problem with BP:
• Constraints become "hard" factors with only 1's or 0's
• Send messages until convergence



Page 148: Structured Belief Propagation for NLP

From Arc Consistency to BP

148

Variables: X, Y, U, T ∈ {1, 2, 3}, with pairwise constraints: one between X and Y, Y = U, one between T and U, and X < T.
[Figure: the constraint factor graph after message passing; each node shows its reduced domain.]

Slide thanks to Rina Dechter (modified)

Solve the same problem with BP:
• Constraints become "hard" factors with only 1's or 0's
• Send messages until convergence

Loopy BP will converge to the equivalent solution!

Page 149: Structured Belief Propagation for NLP

From Arc Consistency to BP

149

[Figure: the constraint factor graph over X, Y, U, T; each node shows its reduced domain.]

Slide thanks to Rina Dechter (modified)

Loopy BP will converge to the equivalent solution!

Takeaways:
• Arc Consistency is a special case of Belief Propagation.

• Arc Consistency will only rule out impossible values.

• BP rules out those same values (belief = 0).
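A minimal sketch of the hard-constraint view (my own toy encoding of two of the constraints above, not the tutorial's code): a "hard" factor is just a 0/1 potential table, and a factor-to-variable message zeroes out exactly the values that arc consistency would prune.

```python
import numpy as np

EQ = np.eye(3)                          # psi(y, u) = 1 iff y == u   (the "Y = U" constraint)
LT = np.triu(np.ones((3, 3)), k=1)      # psi(x, t) = 1 iff x < t    (the "X < T" constraint)

def factor_to_var(psi, incoming_from_other):
    # message to the row variable, summing out the column variable
    return (psi * incoming_from_other[None, :]).sum(axis=1)

# With a uniform incoming message, X = 3 gets message value 0: it is ruled out,
# just as arc consistency rules it out for the constraint X < T.
print(factor_to_var(LT, np.ones(3)))    # -> [2. 1. 0.]
```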

Page 150: Structured Belief Propagation for NLP

150

Q&A
Q: Is BP totally divorced from sampling?
A: Gibbs Sampling is also a kind of message passing algorithm.

Page 151: Structured Belief Propagation for NLP

151

From Gibbs Sampling to Particle BP to BP

Message Representation:
A. Belief Propagation: full distribution (# of particles: +∞)
B. Gibbs sampling: single particle (# of particles: 1)
C. Particle BP: multiple particles (# of particles: k)

Page 152: Structured Belief Propagation for NLP

152

From Gibbs Sampling to Particle BP to BP

[Figure: chain … W — ψ2 — X — ψ3 — Y …, with candidate values such as "meant to type", "man too tight", "meant two taipei", "mean to type".]

Page 153: Structured Belief Propagation for NLP

153

From Gibbs Sampling to Particle BP to BP

[Figure: chain … W — ψ2 — X — ψ3 — Y …, with candidate values such as "meant to type", "man too tight", "meant two taipei", "mean to type".]

Approach 1: Gibbs Sampling
• For each variable, resample the value by conditioning on all the other variables
  – Called the "full conditional" distribution
  – Computationally easy because we really only need to condition on the Markov Blanket
• We can view the computation of the full conditional in terms of message passing
  – Each message puts all its probability mass on the current particle (i.e., the current value)
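A minimal sketch for the chain above (indices and function names are mine): conditioning X on its Markov blanket just means picking out one row of ψ2 and one column of ψ3, i.e., multiplying the two point-mass messages into the factor tables.

```python
import numpy as np

def full_conditional_x(psi2, psi3, w_idx, y_idx):
    # psi2 is indexed [W, X], psi3 is indexed [X, Y]; w_idx, y_idx are the current particles.
    p = psi2[w_idx, :] * psi3[:, y_idx]
    return p / p.sum()

def gibbs_resample_x(psi2, psi3, w_idx, y_idx, rng=np.random.default_rng(0)):
    p = full_conditional_x(psi2, psi3, w_idx, y_idx)
    return rng.choice(len(p), p=p)
```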

Page 154: Structured Belief Propagation for NLP

154

From Gibbs Sampling to Particle BP to BP

[Figure: chain … W — ψ2 — X — ψ3 — Y …, with candidate values such as "meant to type", "man too tight", "meant two taipei", "mean to type".]

Approach 1: Gibbs Sampling• For each variable, resample the value by conditioning on all the other variables

– Called the “full conditional” distribution– Computationally easy because we really only need to condition on the Markov Blanket

• We can view the computation of the full conditional in terms of message passing– Message puts all its probability mass on the current particle (i.e. current value)

mean 1 type 1

Factor ψ2(W, X) (columns: abacus, …, to, too, two, …, zythum):
  abacus   0.1   0.2   0.1   0.1   0.1
  …
  man      0.1   2     4     0.1   0.1
  mean     0.1   7     1     2     0.1
  meant    0.2   8     1     3     0.1
  …
  zythum   0.1   0.1   0.2   0.2   0.1

Factor ψ3(X, Y) (columns: abacus, …, type, tight, taipei, …, zythum):
  abacus   0.1   0.1   0.2   0.1   0.1
  …
  to       0.2   8     3     2     0.1
  too      0.1   7     6     1     0.1
  two      0.2   0.1   3     1     0.1
  …
  zythum   0.1   0.2   0.2   0.1   0.1


Page 156: Structured Belief Propagation for NLP

156

From Gibbs Sampling to Particle BP to BP

W ψ2 X ψ3 Y… …meant to type

man too tightmeant

two taipeimean

to type

too

tight

to

to

Page 157: Structured Belief Propagation for NLP

157

From Gibbs Sampling to Particle BP to BP

W ψ2 X ψ3 Y… …meant to type

man too tightmeant

two taipeimean

to type

too

tight

to

to

Approach 2: Multiple Gibbs Samplers
• Run each Gibbs Sampler independently
• Full conditionals computed independently
  – k separate messages that are each a point-mass distribution

mean 1 taipei 1

meant 1 type 1

Page 158: Structured Belief Propagation for NLP

158

From Gibbs Sampling to Particle BP to BP

W ψ2 X ψ3 Y… …meant to type

man too tightmeant

two taipeimean

to type

too

tight

to

to

Approach 3: Gibbs Sampling w/ Averaging
• Keep k samples for each variable
• Resample from the average of the full conditionals for each possible pair of variables
  – Message is a uniform distribution over the current particles

Page 159: Structured Belief Propagation for NLP

159

From Gibbs Sampling to Particle BP to BP

W ψ2 X ψ3 Y… …meant to type

man too tightmeant

two taipeimean

to type

too

tight

to

to

Approach 3: Gibbs Sampling w/ Averaging
• Keep k samples for each variable
• Resample from the average of the full conditionals for each possible pair of variables
  – Message is a uniform distribution over the current particles

mean 1, meant 1

taipei 1, type 1

[Tables: the factor potentials ψ2(W, X) and ψ3(X, Y), as shown above.]


Page 161: Structured Belief Propagation for NLP

161

From Gibbs Sampling to Particle BP to BP

[Figure: chain … W — ψ2 — X — ψ3 — Y …, with each variable's current set of particles shown.]

Approach 4: Particle BP
• Similar in spirit to Gibbs Sampling w/ Averaging
• Messages are a weighted distribution over k particles

mean 3, meant 4

taipei 2, type 6

(Ihler & McAllester, 2009)

Page 162: Structured Belief Propagation for NLP

162

From Gibbs Sampling to Particle BP to BP

[Figure: chain … W — ψ2 — X — ψ3 — Y …]

Approach 5: BP
• In Particle BP, as the number of particles goes to +∞, the estimated messages approach the true BP messages
• Belief propagation represents messages as the full distribution
  – This assumes we can store the whole distribution compactly

Message (full distribution over W's values): abacus 0.1, …, man 3, mean 4, meant 5, …, zythum 0.1

Message (full distribution over Y's values): abacus 0.1, …, type 2, tight 2, taipei 1, …, zythum 0.1

(Ihler & McAllester, 2009)

Page 163: Structured Belief Propagation for NLP

163

From Gibbs Sampling to Particle BP to BP

Message Representation:
A. Belief Propagation: full distribution (# of particles: +∞)
B. Gibbs sampling: single particle (# of particles: 1)
C. Particle BP: multiple particles (# of particles: k)

Page 164: Structured Belief Propagation for NLP

164

From Gibbs Sampling to Particle BP to BP

Sampling values or combinations of values:
• quickly get a good estimate of the frequent cases
• may take a long time to estimate probabilities of infrequent cases
• may take a long time to draw a sample (mixing time)
• exact if you run forever

Enumerating each value and computing its probability exactly:
• have to spend time on all values
• but only spend O(1) time on each value (don't sample frequent values over and over while waiting for infrequent ones)
• runtime is more predictable
• lets you trade off exactness for greater speed (brute force exactly enumerates exponentially many assignments; BP approximates this by enumerating local configurations)

Tension between approaches…

Page 165: Structured Belief Propagation for NLP

165

Background: Convergence

When BP is run on a tree-shaped factor graph, the beliefs converge to the marginals of the distribution after two passes.

Page 166: Structured Belief Propagation for NLP

166

Q&A
Q: How long does loopy BP take to converge?
A: It might never converge. Could oscillate.

ψ1

ψ2

ψ1

ψ2

ψ2 ψ2

Page 167: Structured Belief Propagation for NLP

167

Q&A
Q: When loopy BP converges, does it always get the same answer?
A: No. Sensitive to initialization and update order.

ψ1

ψ2

ψ1

ψ2

ψ2 ψ2ψ1

ψ2

ψ1

ψ2

ψ2 ψ2

Page 168: Structured Belief Propagation for NLP

168

Q&A
Q: Are there convergent variants of loopy BP?
A: Yes. It's actually trying to minimize a certain differentiable function of the beliefs, so you could just minimize that function directly.

Page 169: Structured Belief Propagation for NLP

169

Q&A
Q: But does that function have a unique minimum?
A: No, and you'll only be able to find a local minimum in practice. So you're still dependent on initialization.

Page 170: Structured Belief Propagation for NLP

170

Q&A
Q: If you could find the global minimum, would its beliefs give the marginals of the true distribution?
A: No.
"We've found the bottom!!"

Page 171: Structured Belief Propagation for NLP

171

Q&A
Q: Is it finding the marginals of some other distribution (as mean field would)?
A: No, just a collection of beliefs. Might not be globally consistent in the sense of all being views of the same elephant.

*Cartoon by G. Renee Guzlas

Page 172: Structured Belief Propagation for NLP

172

Q&A
Q: Does the global minimum give beliefs that are at least locally consistent?
A: Yes.

Example: a factor ψα connects X1 (values p, d, n) and X2 (values v, n).

Factor belief bα(X1, X2):      v    n
                          p    2    3
                          d    1    1
                          n    4    6

Marginalizing bα over X2 gives (p 5, d 2, n 10), which matches the belief at X1.
Marginalizing bα over X1 gives (v 7, n 10), which matches the belief at X2.

A variable belief and a factor belief are locally consistent if the marginal of the factor’s belief equals the variable’s belief.
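Written out (a standard statement of the constraint; the notation here is mine):

```latex
b_i(x_i) \;=\; \sum_{\mathbf{x}_\alpha \,:\, [\mathbf{x}_\alpha]_i = x_i} b_\alpha(\mathbf{x}_\alpha)
\qquad \text{for every factor } \alpha,\ \text{every variable } i \text{ attached to } \alpha,\ \text{and every value } x_i .
```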

Page 173: Structured Belief Propagation for NLP

173

Q&A
Q: In what sense are the beliefs at the global minimum any good?
A: They are the global minimum of the Bethe Free Energy.
"We've found the bottom!!"

Page 174: Structured Belief Propagation for NLP

174

Q&A
Q: When loopy BP converges, in what sense are the beliefs any good?
A: They are a local minimum of the Bethe Free Energy.

Page 175: Structured Belief Propagation for NLP

175

Q&A
Q: Why would you want to minimize the Bethe Free Energy?
A: 1) It's easy to minimize* because it's a sum of functions on the individual beliefs.
[*] Though we can't just minimize each function separately – we need message passing to keep the beliefs locally consistent.
2) On an acyclic factor graph, it measures KL divergence between beliefs and true marginals, and so is minimized when beliefs = marginals. (For a loopy graph, we close our eyes and hope it still works.)

Page 176: Structured Belief Propagation for NLP

176

Section 3: Appendix

BP as an Optimization Algorithm

Page 177: Structured Belief Propagation for NLP

177

BP as an Optimization Algorithm

This Appendix provides a more in-depth study of BP as an optimization algorithm.

Our focus is on the Bethe Free Energy and its relation to KL divergence, Gibbs Free Energy, and the Helmholtz Free Energy.

We also include a discussion of the convergence properties of max-product BP.

Page 178: Structured Belief Propagation for NLP

178

KL and Free Energies

Gibbs Free Energy

Kullback–Leibler (KL) divergence

Helmholtz Free Energy
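For reference, these are the standard quantities being compared (writing the model as p(x) = (1/Z) ∏α ψα(xα) and letting b be any candidate distribution; the equations below are the textbook definitions, reconstructed here rather than copied from the slide):

```latex
\mathrm{KL}(b \,\|\, p) = \sum_{\mathbf{x}} b(\mathbf{x}) \log \frac{b(\mathbf{x})}{p(\mathbf{x})},
\qquad
G_{\mathrm{Gibbs}}(b) = -\sum_{\mathbf{x}} b(\mathbf{x}) \sum_{\alpha} \log \psi_\alpha(\mathbf{x}_\alpha)
  + \sum_{\mathbf{x}} b(\mathbf{x}) \log b(\mathbf{x}),
```

```latex
F_{\mathrm{Helmholtz}} = -\log Z,
\qquad
\mathrm{KL}(b \,\|\, p) \;=\; G_{\mathrm{Gibbs}}(b) - F_{\mathrm{Helmholtz}} \;\ge\; 0 ,
```

with equality exactly when b = p.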

Page 179: Structured Belief Propagation for NLP

179

Minimizing KL Divergence
• If we find the distribution b that minimizes the KL divergence, then b = p
• The same is true of the minimum of the Gibbs Free Energy
• But what if b is not (necessarily) a probability distribution?

Page 180: Structured Belief Propagation for NLP

180

True distribution:

BP on a 2 Variable ChainX

ψ1

Y

Beliefs at the end of BP:

We successfully minimized the KL

divergence!

*where U(x) is the uniform distribution

Page 181: Structured Belief Propagation for NLP

181

True distribution:

BP on a 3 Variable ChainW

ψ1

X

ψ2

Y

KL decomposes over the marginals

Define the joint belief to have the same form:

The true distribution can be expressed in terms of its marginals:

Page 182: Structured Belief Propagation for NLP

182

True distribution:

BP on a 3 Variable ChainW

ψ1

X

ψ2

Y

Gibbs Free Energydecomposes over

the marginals

Define the joint belief to have the same form:

The true distribution can be expressed in terms of its marginals:

Page 183: Structured Belief Propagation for NLP

183

True distribution:

BP on an Acyclic Graph

KL decomposes over the marginals

time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

Define the joint belief to have the same form:

The true distribution can be expressed in terms of its marginals:

Page 184: Structured Belief Propagation for NLP

184

True distribution:

BP on an Acyclic Graph

The true distribution can be expressed in terms of its marginals:

Define the joint belief to have the same form:

time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

Gibbs Free Energydecomposes over

the marginals

Page 185: Structured Belief Propagation for NLP

185

True distribution:

BP on a Loopy Graph

Construct the joint belief as before:

time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

ψ2 ψ4 ψ6 ψ8

KL is no longer well defined, because the joint belief is not a proper distribution.

This might not be a distribution!

1. The beliefs are distributions: are non-negative and sum-to-one.

2. The beliefs are locally consistent:

So add constraints…

Page 186: Structured Belief Propagation for NLP

186

True distribution:

BP on a Loopy Graph

Construct the joint belief as before:

time likeflies an arrow

X1

ψ1

X2

ψ3

X3

ψ5

X4

ψ7

X5

ψ9

X6

ψ10

X8

ψ12

X7

ψ14

X9

ψ13

ψ11

This is called the Bethe Free Energy and decomposes over the marginals

ψ2 ψ4 ψ6 ψ8

But we can still optimize the same objective as before, subject to our belief constraints:This might not be a

distribution!

1. The beliefs are distributions: are non-negative and sum-to-one.

2. The beliefs are locally consistent:

So add constraints…

Page 187: Structured Belief Propagation for NLP

187

BP as an Optimization Algorithm
• The Bethe Free Energy, a function of the beliefs:
• BP minimizes a constrained version of the Bethe Free Energy
  – BP is just one local optimization algorithm: fast but not guaranteed to converge
  – If BP converges, the beliefs are called fixed points
  – The stationary points of a function have a gradient of zero
The fixed points of BP are local stationary points of the Bethe Free Energy (Yedidia, Freeman, & Weiss, 2000).
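For reference, the Bethe Free Energy being minimized has the standard form (with di denoting the number of factors neighboring variable i):

```latex
F_{\mathrm{Bethe}}(b) \;=\;
\sum_{\alpha} \sum_{\mathbf{x}_\alpha} b_\alpha(\mathbf{x}_\alpha) \log \frac{b_\alpha(\mathbf{x}_\alpha)}{\psi_\alpha(\mathbf{x}_\alpha)}
\;-\; \sum_{i} (d_i - 1) \sum_{x_i} b_i(x_i) \log b_i(x_i)
```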

Page 188: Structured Belief Propagation for NLP

188

BP as an Optimization Algorithm
• The Bethe Free Energy, a function of the beliefs:
• BP minimizes a constrained version of the Bethe Free Energy
  – BP is just one local optimization algorithm: fast but not guaranteed to converge
  – If BP converges, the beliefs are called fixed points
  – The stationary points of a function have a gradient of zero
The stable fixed points of BP are local minima of the Bethe Free Energy (Heskes, 2003).

Page 189: Structured Belief Propagation for NLP

189

BP as an Optimization Algorithm
For graphs with no cycles:
– The minimizing beliefs = the true marginals
– BP finds the global minimum of the Bethe Free Energy
– This global minimum is –log Z (the "Helmholtz Free Energy")
For graphs with cycles:
– The minimizing beliefs only approximate the true marginals
– Attempting to minimize may get stuck at a local minimum or other critical point
– Even the global minimum only approximates –log Z

Page 190: Structured Belief Propagation for NLP

190

Convergence of Sum-product BP
• The fixed point beliefs:
  – Do not necessarily correspond to marginals of any joint distribution over all the variables (MacKay, Yedidia, Freeman, & Weiss, 2001; Yedidia, Freeman, & Weiss, 2005)
• Unbelievable probabilities
  – Conversely, the true marginals for many joint distributions cannot be reached by BP (Pitkow, Ahmadian, & Miller, 2011)

[Figure adapted from (Pitkow, Ahmadian, & Miller, 2011): a two-dimensional slice of the Bethe Free Energy GBethe(b), plotted over b1(x1) and b2(x2), for a binary graphical model with pairwise interactions.]

Page 191: Structured Belief Propagation for NLP

191

Convergence of Max-product BPIf the max-marginals bi(xi) are a fixed point of BP, and x* is the corresponding assignment (assumed unique), then p(x*) > p(x) for every x ≠ x* in a rather large neighborhood around x* (Weiss & Freeman, 2001).

Figure from (Weiss & Freeman, 2001)

Informally: If you take the fixed-point solution x* and arbitrarily change the values of the dark nodes in the figure, the overall probability of the configuration will decrease.

The neighbors of x* are constructed as follows: For any set of vars S of disconnected trees and single loops, set the variables in S to arbitrary values, and the rest to x*.



Page 194: Structured Belief Propagation for NLP

194

Section 4: Incorporating Structure into Factors and Variables

Page 195: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What's going on here? Can we trust BP's estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

195


Page 197: Structured Belief Propagation for NLP

197

BP for Coordination of Algorithms
• Each factor is tractable by dynamic programming
• Overall model is no longer tractable, but BP lets us pretend it is

[Figure: a joint model in which a tagger and a parser over a Spanish sentence ("la", "blanca", "casa"), a tagger and a parser over its English translation ("the", "white", "house"), and a word aligner between the two sentences are each encoded as tractable factors (ψ2, ψ4, …) over T/F variables, coordinated by BP.]


Page 199: Structured Belief Propagation for NLP

199

Sending Messages: Computational Complexity

[Figure: a variable X1 with neighboring factors ψ1, ψ2, ψ3, and a factor ψ1 with neighboring variables X1, X2, X3.]

From Variables (variable → factor): O(d·k), where d = # of neighboring factors and k = # of possible values for Xi
To Variables (factor → variable): O(d·k^d), where d = # of neighboring variables and k = maximum # of possible values for a neighboring variable
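A minimal brute-force sketch of where these costs come from (the representation and names are mine): the factor-to-variable message enumerates all k^d joint assignments of the factor's neighbors, while the variable-to-factor message just multiplies d incoming vectors of length k.

```python
import numpy as np
from itertools import product

def factor_to_variable(psi, incoming, i):
    # psi: a d-dimensional table with k values per axis; incoming[j]: message from the
    # j-th neighboring variable. Cost: O(d * k^d), since we loop over every assignment.
    out = np.zeros(psi.shape[i])
    for xs in product(*(range(k) for k in psi.shape)):
        val = psi[xs]
        for j, m in enumerate(incoming):
            if j != i:
                val *= m[xs[j]]
        out[xs[i]] += val
    return out

def variable_to_factor(incoming_from_factors, a):
    # incoming_from_factors[b]: message from the b-th neighboring factor. Cost: O(d * k).
    out = np.ones_like(incoming_from_factors[0])
    for b, m in enumerate(incoming_from_factors):
        if b != a:
            out = out * m
    return out
```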


Page 201: Structured Belief Propagation for NLP

201

INCORPORATING STRUCTURE INTO FACTORS

Page 202: Structured Belief Propagation for NLP

202

Unlabeled Constituency Parsing

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Tψ12

Tψ11

Tψ13

Given: a sentence.Predict: unlabeled parse.

We could predict whether each span is present T or not F.

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

(Naradowsky, Vieira, & Smith, 2012)

Page 203: Structured Belief Propagation for NLP

203

Unlabeled Constituency Parsing

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Tψ12

Tψ11

Tψ13

Given: a sentence.Predict: unlabeled parse.

We could predict whether each span is present T or not F.

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

(Naradowsky, Vieira, & Smith, 2012)

Page 204: Structured Belief Propagation for NLP

204

Unlabeled Constituency Parsing

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Tψ12

Tψ11

Tψ13

Given: a sentence.Predict: unlabeled parse.

We could predict whether each span is present T or not F.

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

Tψ10

(Naradowsky, Vieira, & Smith, 2012)

Page 205: Structured Belief Propagation for NLP

205

Unlabeled Constituency Parsing

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Tψ12

Tψ11

Tψ13

Given: a sentence.Predict: unlabeled parse.

We could predict whether each span is present T or not F.

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

(Naradowsky, Vieira, & Smith, 2012)

Page 206: Structured Belief Propagation for NLP

206

Unlabeled Constituency Parsing

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Tψ12

Tψ11

Tψ13

Given: a sentence.Predict: unlabeled parse.

We could predict whether each span is present T or not F.

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

(Naradowsky, Vieira, & Smith, 2012)

Page 207: Structured Belief Propagation for NLP

207

Unlabeled Constituency Parsing

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Fψ12

Tψ11

Tψ13

Given: a sentence.Predict: unlabeled parse.

We could predict whether each span is present T or not F.

Fψ10

Fψ10

Tψ10

Fψ10

Fψ10

Fψ10

(Naradowsky, Vieira, & Smith, 2012)

Page 208: Structured Belief Propagation for NLP

208

Unlabeled Constituency Parsing

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Tψ12

Tψ11

Tψ13

Given: a sentence.Predict: unlabeled parse.

We could predict whether each span is present T or not F.

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

Sending a message to a variable from its unary factors takes only O(d·k^d) time, where k = 2 and d = 1.

(Naradowsky, Vieira, & Smith, 2012)

Page 209: Structured Belief Propagation for NLP

209

Unlabeled Constituency Parsing

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Tψ12

Tψ11

Tψ13

Given: a sentence.Predict: unlabeled parse.

We could predict whether each span is present T or not F.

But nothing prevents non-tree structures.

Fψ10

Fψ10

Tψ10

Fψ10

Fψ10

Fψ10

Sending a message to a variable from its unary factors takes only O(d·k^d) time, where k = 2 and d = 1.

(Naradowsky, Vieira, & Smith, 2012)

Page 210: Structured Belief Propagation for NLP

210

Unlabeled Constituency Parsing

time likeflies an arrow

Given: a sentence.Predict: unlabeled parse.

We could predict whether each span is present T or not F.

But nothing prevents non-tree structures.

Add a CKYTree factor which multiplies in 1 if the variables form a tree and 0 otherwise.

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Tψ12

Tψ11

Tψ13

Fψ10

Fψ10

Tψ10

Fψ10

Fψ10

Fψ10

(Naradowsky, Vieira, & Smith, 2012)

Page 211: Structured Belief Propagation for NLP

211

Unlabeled Constituency Parsing

time likeflies an arrow

Given: a sentence.Predict: unlabeled parse.

We could predict whether each span is present T or not F.

But nothing prevents non-tree structures.

Add a CKYTree factor which multiplies in 1 if the variables form a tree and 0 otherwise.

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Tψ12

Tψ11

Tψ13

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

(Naradowsky, Vieira, & Smith, 2012)

Page 212: Structured Belief Propagation for NLP

212

Unlabeled Constituency Parsing

time likeflies an arrow

Add a CKYTree factor which multiplies in 1 if the variables form a tree and 0 otherwise.

T

ψ1

ψ2 T

ψ3

ψ4 T

ψ5

ψ6 T

ψ7

ψ8 T

ψ9

time likeflies an arrow

Tψ10

Tψ12

Tψ11

Tψ13

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

Fψ10

How long does it take to send a message to a variable from the CKYTree factor?

For the given sentence, O(d·k^d) time, where k = 2 and d = 15.

For a length-n sentence, this will be O(2^(n·n)).

But we know an algorithm (inside-outside) to compute all the marginals in O(n³)… So can't we do better?

(Naradowsky, Vieira, & Smith, 2012)

Page 213: Structured Belief Propagation for NLP

213

Example: The Exactly1 Factor

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

(Smith & Eisner, 2008)

Variables: d binary variables X1, …, Xd

Global Factor: 1 if exactly one of the d binary variables Xi is on,

0 otherwise

Exactly1(X1, …, Xd) =

Page 214: Structured Belief Propagation for NLP

214

Example: The Exactly1 Factor

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

(Smith & Eisner, 2008)

Variables: d binary variables X1, …, Xd

Global Factor: 1 if exactly one of the d binary variables Xi is on,

0 otherwise

Exactly1(X1, …, Xd) =

Page 215: Structured Belief Propagation for NLP

215

Example: The Exactly1 Factor

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

(Smith & Eisner, 2008)

Variables: d binary variables X1, …, Xd

Global Factor: 1 if exactly one of the d binary variables Xi is on,

0 otherwise

Exactly1(X1, …, Xd) =

Page 216: Structured Belief Propagation for NLP

216

Example: The Exactly1 Factor

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

(Smith & Eisner, 2008)

Variables: d binary variables X1, …, Xd

Global Factor: 1 if exactly one of the d binary variables Xi is on,

0 otherwise

Exactly1(X1, …, Xd) =

Page 217: Structured Belief Propagation for NLP

217

Example: The Exactly1 Factor

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

(Smith & Eisner, 2008)

Variables: d binary variables X1, …, Xd

Global Factor: 1 if exactly one of the d binary variables Xi is on,

0 otherwise

Exactly1(X1, …, Xd) =

Page 218: Structured Belief Propagation for NLP

218

Example: The Exactly1 Factor

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

(Smith & Eisner, 2008)

Variables: d binary variables X1, …, Xd

Global Factor: 1 if exactly one of the d binary variables Xi is on,

0 otherwise

Exactly1(X1, …, Xd) =

Page 219: Structured Belief Propagation for NLP

219

Messages: The Exactly1 Factor
From Variables (variable → factor): O(d·2), where d = # of neighboring factors
To Variables (factor → variable): O(d·2^d), where d = # of neighboring variables

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

Page 220: Structured Belief Propagation for NLP

220

Messages: The Exactly1 FactorFrom Variables To Variables

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

O(d*2d)d = # of neighboring variables

O(d*2) d = # of neighboring factors

Fast!

Page 221: Structured Belief Propagation for NLP

221

Messages: The Exactly1 FactorTo Variables

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

O(d*2d)d = # of neighboring variables

Page 222: Structured Belief Propagation for NLP

222

Messages: The Exactly1 FactorTo Variables

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

But the outgoing messages from the Exactly1 factor are defined as a sum over the 2^d possible assignments to X1, …, Xd.

Conveniently, ψE1(xa) is 0 for all but d values – so the sum is sparse!

So we can compute all the outgoing messages from ψE1 in O(d) time!

O(d·2^d), where d = # of neighboring variables
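A minimal sketch of that trick (my own code, assuming strictly positive incoming messages so that division is safe; a careful implementation handles zeros as a special case): precompute the product of all "off" values and the sum of on/off ratios once, and then each outgoing message costs O(1).

```python
import numpy as np

def exactly1_outgoing(incoming):
    # incoming[i] = (mu_i(off), mu_i(on)): message from X_i to the Exactly1 factor.
    off = np.array([m[0] for m in incoming], dtype=float)
    on = np.array([m[1] for m in incoming], dtype=float)
    P = off.prod()                # weight of "every variable off"
    ratios = on / off
    S = ratios.sum()              # sum over which single variable is on
    msgs = []
    for i in range(len(incoming)):
        others_off = P / off[i]
        msg_on = others_off                          # X_i is the single "on" variable
        msg_off = others_off * (S - ratios[i])       # some other X_j is the "on" one
        msgs.append((msg_off, msg_on))
    return msgs
```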

Page 223: Structured Belief Propagation for NLP

223

Messages: The Exactly1 FactorTo Variables

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

But the outgoing messages from the Exactly1 factor are defined as a sum over the 2d possible assignments to X1, …, Xd.

Conveniently, ψE1(xa) is 0 for all but d values – so the sum is sparse!

So we can compute all the outgoing messages from ψE1 in O(d) time!

O(d*2d)d = # of neighboring variables

Fast!

Page 224: Structured Belief Propagation for NLP

224

Messages: The Exactly1 Factor

X1 X2 X3 X4

ψE1

ψ2 ψ3 ψ4ψ1

(Smith & Eisner, 2008)

A factor has a belief about each of its variables.

An outgoing message from a factor is the factor's belief with the incoming message divided out.

We can compute the Exactly1 factor’s beliefs about each of its variables efficiently. (Each of the parenthesized terms needs to be computed only once for all the variables.)

Page 225: Structured Belief Propagation for NLP

225

Example: The CKYTree Factor
Variables: O(n²) binary variables Sij

Global Factor:

0 21 3 4the madebarista coffee

S01 S12 S23 S34

S02 S13 S24

S03 S14

S04

1 if the span variables form a constituency tree,

0 otherwise

CKYTree(S01, S12, …, S04) =

(Naradowsky, Vieira, & Smith, 2012)

Page 226: Structured Belief Propagation for NLP

226

Messages: The CKYTree FactorFrom Variables To Variables

0 21 3 4the madebarista coffee

S01 S12 S23 S34

S02 S13 S24

S03 S14

S04

0 21 3 4the madebarista coffee

S01 S12 S23 S34

S02 S13 S24

S03 S14

S04

O(d*2d)d = # of neighboring variables

O(d*2) d = # of neighboring factors

Page 227: Structured Belief Propagation for NLP

227

Messages: The CKYTree FactorFrom Variables To Variables

0 21 3 4the madebarista coffee

S01 S12 S23 S34

S02 S13 S24

S03 S14

S04

0 21 3 4the madebarista coffee

S01 S12 S23 S34

S02 S13 S24

S03 S14

S04

O(d*2d)d = # of neighboring variables

O(d*2) d = # of neighboring factors

Fast!

Page 228: Structured Belief Propagation for NLP

228

Messages: The CKYTree FactorTo Variables

0 21 3 4the madebarista coffee

S01 S12 S23 S34

S02 S13 S24

S03 S14

S04

O(d*2d)d = # of neighboring variables

Page 229: Structured Belief Propagation for NLP

229

0 21 3 4the madebarista coffee

S01 S12 S23 S34

S02 S13 S24

S03 S14

S04

To Variables

But the outgoing messages from the CKYTree factor are defined as a sum over the O(2^(n·n)) possible assignments to {Sij}.

Messages: The CKYTree Factor

ψCKYTree(xa) is 1 for exponentially many values in the sum – but they all correspond to trees!

With inside-outside we can compute all the outgoing messages from CKYTree in O(n³) time!

O(d·2^d), where d = # of neighboring variables

Page 230: Structured Belief Propagation for NLP

230

0 21 3 4the madebarista coffee

S01 S12 S23 S34

S02 S13 S24

S03 S14

S04

To Variables

But the outgoing messages from the CKYTree factor are defined as a sum over the O(2n*n) possible assignments to {Sij}.

Messages: The CKYTree Factor

ψCKYTree(xa) is 1 for exponentially many values in the sum – but they all correspond to trees!

With inside-outside we can compute all the outgoing messages from CKYTree in O(n3) time!

O(d*2d)d = # of neighboring variables

Fast!

Page 231: Structured Belief Propagation for NLP

231

Example: The CKYTree Factor

0 21 3 4the madebarista coffee

S01 S12 S23 S34

S02 S13 S24

S03 S14

S04

(Naradowsky, Vieira, & Smith, 2012)

For a length-n sentence, define an anchored weighted context-free grammar (WCFG). Each span is weighted by the ratio of the incoming messages from the corresponding span variable.

Run the inside-outside algorithm on the sentence a1, a2, …, an with the anchored WCFG.

Page 232: Structured Belief Propagation for NLP

232

Example: The TrigramHMM Factor

(Smith & Eisner, 2008)

time likeflies an arrow

X1 X2 X3 X4 X5

W1 W2 W3 W4 W5

Factors can compactly encode the preferences of an entire sub-model. Consider the joint distribution of a trigram HMM over 5 variables:
– It's traditionally defined as a Bayes Network
– But we can represent it as a (loopy) factor graph
– We could even pack all those factors into a single TrigramHMM factor

(Smith & Eisner, 2008)

Page 233: Structured Belief Propagation for NLP

233

Example: The TrigramHMM Factor

(Smith & Eisner, 2008)

time likeflies an arrow

X1

ψ1

ψ2 X2

ψ3

ψ4 X3

ψ5

ψ6 X4

ψ7

ψ8 X5

ψ9

ψ10 ψ11 ψ12

Factors can compactly encode the preferences of an entire sub-model.Consider the joint distribution of a trigram HMM over 5 variables:

– It’s traditionally defined as a Bayes Network– But we can represent it as a (loopy) factor graph– We could even pack all those factors into a single TrigramHMM factor

(Smith & Eisner, 2008)

Page 234: Structured Belief Propagation for NLP

234

Example: The TrigramHMM Factor

(Smith & Eisner, 2008)

time likeflies an arrow

X1

ψ1

ψ2 X2

ψ3

ψ4 X3

ψ5

ψ6 X4

ψ7

ψ8 X5

ψ9

ψ10 ψ11 ψ12

Factors can compactly encode the preferences of an entire sub-model.Consider the joint distribution of a trigram HMM over 5 variables:

– It’s traditionally defined as a Bayes Network– But we can represent it as a (loopy) factor graph– We could even pack all those factors into a single TrigramHMM factor

Page 235: Structured Belief Propagation for NLP

235

Example: The TrigramHMM FactorFactors can compactly encode the preferences of an entire sub-model.Consider the joint distribution of a trigram HMM over 5 variables:

– It’s traditionally defined as a Bayes Network– But we can represent it as a (loopy) factor graph– We could even pack all those factors into a single TrigramHMM factor

(Smith & Eisner, 2008)

time likeflies an arrow

X1 X2 X3 X4 X5

Page 236: Structured Belief Propagation for NLP

236

Example: The TrigramHMM Factor

(Smith & Eisner, 2008)

time likeflies an arrow

X1 X2 X3 X4 X5

Variables: d discrete variables X1, …, Xd

Global Factor: p(X1, …, Xd) according to a trigram HMM model

TrigramHMM (X1, …, Xd) =

Page 237: Structured Belief Propagation for NLP

237

Example: The TrigramHMM Factor

(Smith & Eisner, 2008)

time likeflies an arrow

X1 X2 X3 X4 X5

Variables: d discrete variables X1, …, Xd

Global Factor: p(X1, …, Xd) according to a trigram HMM model

TrigramHMM (X1, …, Xd) =

Compute outgoing messages efficiently with the standard trigram HMM dynamic programming algorithm (junction tree)!

Page 238: Structured Belief Propagation for NLP

238

Combinatorial Factors
• Usually, it takes O(k^d) time to compute outgoing messages from a factor over d variables with k possible values each.
• But not always:
  1. Factors like Exactly1 with only polynomially many nonzeroes in the potential table
  2. Factors like CKYTree with exponentially many nonzeroes, but in a special pattern
  3. Factors like TrigramHMM with all nonzeroes, but which factor further

Page 239: Structured Belief Propagation for NLP

239

Combinatorial Factors
Factor graphs can encode structural constraints on many variables via constraint factors.
Example NLP constraint factors:
– Projective and non-projective dependency parse constraint (Smith & Eisner, 2008)
– CCG parse constraint (Auli & Lopez, 2011)
– Labeled and unlabeled constituency parse constraint (Naradowsky, Vieira, & Smith, 2012)
– Inversion transduction grammar (ITG) constraint (Burkett & Klein, 2012)

Page 240: Structured Belief Propagation for NLP

241

Structured BP vs. Dual Decomposition

Sum-product BP:
– Output: approximate marginals
– Structured variant: coordinates marginal inference algorithms
– Example embedded algorithms: inside-outside, forward-backward

Max-product BP:
– Output: approximate MAP assignment
– Structured variant: coordinates MAP inference algorithms
– Example embedded algorithms: CKY, Viterbi algorithm

Dual Decomposition:
– Output: true MAP assignment (with branch-and-bound)
– Structured variant: coordinates MAP inference algorithms
– Example embedded algorithms: CKY, Viterbi algorithm

(Duchi, Tarlow, Elidan, & Koller, 2006)

(Koo et al., 2010; Rush et al., 2010)

Page 241: Structured Belief Propagation for NLP

242

Additional Resources
See the NAACL 2012 / ACL 2013 tutorial by Burkett & Klein, "Variational Inference in Structured NLP Models", for…
– An alternative approach to efficient marginal inference for NLP: Structured Mean Field
– Also includes Structured BP
http://nlp.cs.berkeley.edu/tutorials/variational-tutorial-slides.pdf

Page 242: Structured Belief Propagation for NLP

243

Sending Messages: Computational Complexity

[Figure: a variable X1 with neighboring factors ψ1, ψ2, ψ3, and a factor ψ1 with neighboring variables X1, X2, X3.]

From Variables (variable → factor): O(d·k), where d = # of neighboring factors and k = # of possible values for Xi
To Variables (factor → variable): O(d·k^d), where d = # of neighboring variables and k = maximum # of possible values for a neighboring variable


Page 244: Structured Belief Propagation for NLP

245

INCORPORATING STRUCTURE INTO VARIABLES

Page 245: Structured Belief Propagation for NLP

246

BP for Coordination of Algorithms• Each factor is tractable

by dynamic programming• Overall model is no

longer tractable, but BP lets us pretend it is

Tψ 2

Tψ 4

T

FF

F

labl

anca

casa

T ψ2 T ψ4 T

FF

F

the housewhite

T T T

T T T

T T T

aligner

tagger

parser

tagg

er

pars

er

Page 246: Structured Belief Propagation for NLP

247

BP for Coordination of Algorithms• Each factor is tractable

by dynamic programming• Overall model is no

longer tractable, but BP lets us pretend it is

Tψ 2

Tψ 4

T

FF

F

laca

sa

T ψ2 T ψ4 T

FF

F

the housewhite

T T T

T T T

T T T

aligner

tagger

parser

tagg

er

pars

er

T

Page 247: Structured Belief Propagation for NLP

248

BP for Coordination of Algorithms• Each factor is tractable

by dynamic programming• Overall model is no

longer tractable, but BP lets us pretend it is

Tψ 2

Tψ 4

T

FF

F

laca

sa

T ψ2 T ψ4 T

FF

F

the housewhite

T T T

T T T

T T T

aligner

tagger

parser

tagg

er

pars

er

T

Page 248: Structured Belief Propagation for NLP

249

BP for Coordination of Algorithms• Each factor is tractable

by dynamic programming• Overall model is no

longer tractable, but BP lets us pretend it is

Tψ 2

Tψ 4

T

FF

F

laca

sa

T ψ2 T ψ4 T

FF

F

the housewhite

F T F

T T T

T T T

aligner

tagger

parser

tagg

er

pars

er

blan

ca

Page 249: Structured Belief Propagation for NLP

250

String-Valued Variables
Consider two examples from Section 1:

• Variables (string): English and Japanese orthographic strings; English and Japanese phonological strings
  Interactions: all pairs of strings could be relevant

• Variables (string): inflected forms of the same verb
  Interactions: between pairs of entries in the table (e.g., the infinitive form affects the present-singular)

Page 250: Structured Belief Propagation for NLP

251

Graphical Models over Strings

ψ1

X2

X1

(Dreyer & Eisner, 2009)

ring 1rang 2rung 2

ring 10.2

rang 13rung 16

ring

rang

rung

ring 2 4 0.1

rang 7 1 2rung 8 1 3

• Most of our problems so far:
  – Used discrete variables
  – Over a small finite set of string values
  – Examples: POS tagging, labeled constituency parsing, dependency parsing
• We use tensors (e.g., vectors, matrices) to represent the messages and factors

Page 251: Structured Belief Propagation for NLP

252

Graphical Models over Strings

ψ1

X2

X1

(Dreyer & Eisner, 2009)

ring 1rang 2rung 2

ring 10.2

rang 13rung 16

ring

rang

rung

ring 2 4 0.1

rang 7 1 2rung 8 1 3

ψ1

X2

X1

abacus 0.1… …

rang 3ring 4rung 5

… …zymurg

y 0.1

abacus

… rang

ring

rung …

zymurgy

abacus 0.1 0.2 0.1 0.1 0.1…

rang 0.1 2 4 0.1 0.1ring 0.1 7 1 2 0.1rung 0.2 8 1 3 0.1

…zymurg

y 0.1 0.1 0.2 0.2 0.1

What happens as the # of possible values for a variable, k, increases? We can still keep the computational complexity down by including only low arity factors (i.e. small d).

Time Complexity:
  factor → variable: O(d·k^d)
  variable → factor: O(d·k)

Page 252: Structured Belief Propagation for NLP

253

Graphical Models over Strings

ψ1

X2

X1

(Dreyer & Eisner, 2009)

ring 1rang 2rung 2

ring 10.2

rang 13rung 16

ring

rang

rung

ring 2 4 0.1

rang 7 1 2rung 8 1 3

ψ1

X2

X1

abacus 0.1… …

rang 3ring 4rung 5

… …

abacus

… rang

ring

rung …

abacus 0.1 0.2 0.1 0.1…

rang 0.1 2 4 0.1ring 0.1 7 1 2rung 0.2 8 1 3

But what if the domain of a variable is Σ*, the infinite set of all possible strings?

How can we represent a distribution over one or more infinite sets?

Page 253: Structured Belief Propagation for NLP

254

Graphical Models over Strings

ψ1

X2

X1

(Dreyer & Eisner, 2009)

ring 1rang 2rung 2

ring 10.2

rang 13rung 16

ring

rang

rung

ring 2 4 0.1

rang 7 1 2rung 8 1 3

ψ1

X2

X1

abacus 0.1… …

rang 3ring 4rung 5

… …

abacus

… rang

ring

rung …

abacus 0.1 0.2 0.1 0.1…

rang 0.1 2 4 0.1ring 0.1 7 1 2rung 0.2 8 1 3

ψ1

X2

X1r i n g

ue ε ee

s e ha

s i n gr a n guaeε εa

rs au

r i n gue ε

s e ha

Finite State Machines let us represent something infinite in finite space!

Page 254: Structured Belief Propagation for NLP

255

Graphical Models over Strings

(Dreyer & Eisner, 2009)

ψ1

X2

X1

abacus 0.1… …

rang 3ring 4rung 5

… …

abacus

… rang

ring

rung …

abacus 0.1 0.2 0.1 0.1…

rang 0.1 2 4 0.1ring 0.1 7 1 2rung 0.2 8 1 3

ψ1

X2

X1r i n g

ue ε ee

s e ha

s i n gr a n guaeε εa

rs au

r i n gue ε

s e ha

Finite State Machines let us represent something infinite in finite space!

messages and beliefs are Weighted Finite State Acceptors (WFSA)

factors are Weighted Finite State Transducers (WFST)

Page 255: Structured Belief Propagation for NLP

256

Graphical Models over Strings

(Dreyer & Eisner, 2009)

ψ1

X2

X1

abacus 0.1… …

rang 3ring 4rung 5

… …

abacus

… rang

ring

rung …

abacus 0.1 0.2 0.1 0.1…

rang 0.1 2 4 0.1ring 0.1 7 1 2rung 0.2 8 1 3

ψ1

X2

X1r i n g

ue ε ee

s e ha

s i n gr a n guaeε εa

rs au

r i n gue ε

s e ha

Finite State Machines let us represent something infinite in finite space!

messages and beliefs are Weighted Finite State Acceptors (WFSA)

factors are Weighted Finite State Transducers (WFST)

That solves the problem of representation. But how do we manage the problem of computation? (We still need to compute messages and beliefs.)

Page 256: Structured Belief Propagation for NLP

257

Graphical Models over Strings

(Dreyer & Eisner, 2009)

ψ1

X2

X1r i n g

ue ε ee

s e ha

s i n gr a n guaeε εa

rs

au

r i n gue ε

s e ha

ψ1

X2

r i n gue ε ee

s e ha

r i n gue ε

s e ha

ψ1ψ1

r i n gue ε ee

s e ha

All the message and belief computations simply reuse standard FSM dynamic programming algorithms.

Page 257: Structured Belief Propagation for NLP

258

Graphical Models over Strings

(Dreyer & Eisner, 2009)

ψ1

X2

r i n gue ε ee

s e ha

r i n gue ε

s e ha

ψ1ψ1

r i n gue ε ee

s e ha

The pointwise product of two WFSAs is…

…their intersection.

Compute the product of (possibly many) messages μαi (each of which is a WFSA) via WFSA intersection
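A minimal sketch of what "pointwise product = intersection" means operationally (a toy representation of my own, for epsilon-free acceptors in the real semiring; real systems would use a weighted FST toolkit): states of the product machine are pairs of states, matching arcs multiply their weights, and the resulting WFSA assigns each string x the product μ1(x)·μ2(x).

```python
def intersect(A, B):
    # A WFSA here is (start, finals, arcs): finals maps state -> final weight,
    # arcs maps state -> list of (label, weight, next_state). Epsilon-free.
    startA, finalsA, arcsA = A
    startB, finalsB, arcsB = B
    start = (startA, startB)
    arcs, finals = {}, {}
    stack, seen = [start], {start}
    while stack:
        qa, qb = q = stack.pop()
        out = arcs.setdefault(q, [])
        for (la, wa, ra) in arcsA.get(qa, []):
            for (lb, wb, rb) in arcsB.get(qb, []):
                if la == lb:                      # both machines read the same symbol
                    r = (ra, rb)
                    out.append((la, wa * wb, r))  # weights multiply pointwise
                    if r not in seen:
                        seen.add(r)
                        stack.append(r)
        if qa in finalsA and qb in finalsB:
            finals[q] = finalsA[qa] * finalsB[qb]
    return (start, finals, arcs)
```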

Page 258: Structured Belief Propagation for NLP

259

Graphical Models over Strings

(Dreyer & Eisner, 2009)

ψ1

X2

X1r i n g

ue ε ee

s e ha

s i n gr a n guaeε εa

rs au

r i n gue ε

s e ha

Compute marginalized product of WFSA message μkα and WFST factor ψα, with: domain(compose(ψα, μkα))

– compose: produces a new WFST with a distribution over (Xi, Xj)

– domain: marginalizes over Xj to obtain a WFSA over Xi only

Page 259: Structured Belief Propagation for NLP

260

Graphical Models over Strings

(Dreyer & Eisner, 2009)

ψ1

X2

X1r i n g

ue ε ee

s e ha

s i n gr a n guaeε εa

rs

au

r i n gue ε

s e ha

ψ1

X2

r i n gue ε ee

s e ha

r i n gue ε

s e ha

ψ1ψ1

r i n gue ε ee

s e ha

All the message and belief computations simply reuse standard FSM dynamic programming algorithms.

Page 260: Structured Belief Propagation for NLP

261

The usual NLP toolbox
• WFSA: weighted finite state automata
• WFST: weighted finite state transducer
• k-tape WFSM: weighted finite state machine jointly mapping between k strings
They each assign a score to a set of strings. We can interpret them as factors in a graphical model. The only difference is the arity of the factor.

Page 261: Structured Belief Propagation for NLP

262

• WFSA: weighted finite state automata• WFST: weighted finite state transducer• k-tape WFSM: weighted finite state machine

jointly mapping between k strings

WFSA as a Factor Graph

X1

ψ1

x1 = "brechen"
[Figure: ψ1 is a WFSA over the alphabet a, b, c, …, z.]
ψ1(x1) = 4.25

A WFSA is a function which maps a string to a score.

Page 262: Structured Belief Propagation for NLP

263

• WFSA: weighted finite state automata• WFST: weighted finite state transducer• k-tape WFSM: weighted finite state machine

jointly mapping between k strings

WFST as a Factor Graph

X2

X1

ψ1

(Dreyer, Smith, & Eisner, 2008)

x1 = "brechen"
x2 = "bracht"
[Figure: ψ1 is a WFST whose arcs align the characters of the two strings (including ε for insertions and deletions).]
ψ1(x1, x2) = 13.26

A WFST is a function that maps a pair of strings to a score.

Page 263: Structured Belief Propagation for NLP

264

k-tape WFSM as a Factor Graph• WFSA: weighted finite state automata• WFST: weighted finite state transducer• k-tape WFSM: weighted finite state machine

jointly mapping between k strings

X2X1

ψ1

X4X3

[Figure: ψ1 is a 4-tape WFSM; each arc is a 4-way edit operation over (x1, x2, x3, x4), e.g., (b,b,b,b), (r,r,r,r), (e,ε,ε,ε), (ε,a,a,a), (c,c,c,c), (h,h,h,h), (e,ε,e,ε), (n,ε,n,ε), (ε,t,ε,ε).]

ψ1(x1, x2, x3, x4) = 13.26

A k-tape WFSM is a function that maps k strings to a score.

What's wrong with a 100-tape WFSM for jointly modeling the 100 distinct forms of a Polish verb?
– Each arc represents a 100-way edit operation
– Too many arcs!

Page 264: Structured Belief Propagation for NLP

265

Factor Graphs over Multiple Strings
P(x1, x2, x3, x4) = (1/Z) ψ1(x1, x2) ψ2(x1, x3) ψ3(x1, x4) ψ4(x2, x3) ψ5(x3, x4)

ψ1ψ2

ψ3

ψ4

ψ5

X2

X1

X4

X3

(Dreyer & Eisner, 2009)

Instead, just build factor graphs with WFST factors (i.e. factors of arity 2)

Page 265: Structured Belief Propagation for NLP

266

[Figure: a verb paradigm (infinitive; present and past tense; 1st/2nd/3rd person; singular and plural), with one string variable per inflected form.]

Factor Graphs over Multiple StringsP(x1, x2, x3, x4) = 1/Z ψ1(x1, x2) ψ2(x1, x3) ψ3(x1, x4) ψ4(x2, x3) ψ5(x3, x4)

ψ1ψ2

ψ3

ψ4

ψ5

X2

X1

X4

X3

(Dreyer & Eisner, 2009)

Instead, just build factor graphs with WFST factors (i.e. factors of arity 2)

Page 266: Structured Belief Propagation for NLP

267

BP for Coordination of Algorithms• Each factor is tractable

by dynamic programming• Overall model is no

longer tractable, but BP lets us pretend it is

Tψ 2

Tψ 4

T

FF

F

laca

sa

T ψ2 T ψ4 T

FF

F

the housewhite

F T F

T T T

T T T

aligner

tagger

parser

tagg

er

pars

er

blan

ca


Page 268: Structured Belief Propagation for NLP

Section 5: What if even BP is slow?

Computing fewer message updates

Computing them faster

269

Page 269: Structured Belief Propagation for NLP

Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What's going on here? Can we trust BP's estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

270


Page 271: Structured Belief Propagation for NLP

Loopy Belief Propagation Algorithm

1. For every directed edge, initialize its message to the uniform distribution.
2. Repeat until all normalized beliefs converge:
   a. Pick a directed edge u → v.
   b. Update its message: recompute u → v from its "parent" messages v' → u for v' ≠ v.

Or if u has high degree, we can share work for speed:
• Compute all outgoing messages u → … at once, based on all incoming messages … → u.

Page 272: Structured Belief Propagation for NLP

Loopy Belief Propagation Algorithm

1. For every directed edge, initialize its message to the uniform distribution.
2. Repeat until all normalized beliefs converge:
   a. Pick a directed edge u → v.
   b. Update its message: recompute u → v from its "parent" messages v' → u for v' ≠ v.

Which edge do we pick and recompute? A "stale" edge?

Page 273: Structured Belief Propagation for NLP

Message Passing in Belief Propagation

274

X

Ψ…

My other factors think I’m a noun

But my other variables and I think you’re a

verb

v 1n 6a 3

v 6n 1a 3

Page 274: Structured Belief Propagation for NLP

Stale Messages

275

X

Ψ…

We update this message from its antecedents. Now it's "fresh." We don't need to update it again.

antecedents

Page 275: Structured Belief Propagation for NLP

Stale Messages

276

X

Ψ…

antecedents

But it again becomes "stale" – out of sync with its antecedents – if they change. Then we do need to revisit.

We update this message from its antecedents. Now it's "fresh." We don't need to update it again.

The edge is very stale if its antecedents have changed a lot since its last update. Especially in a way that might make this edge change a lot.

Page 276: Structured Belief Propagation for NLP

Stale Messages

277

Ψ…

For a high-degree node that likes to update all its outgoing messages at once …We say that the whole node is very stale if its incoming messages have changed a lot.

Page 277: Structured Belief Propagation for NLP

Stale Messages

278

Ψ…

For a high-degree node that likes to update all its outgoing messages at once …We say that the whole node is very stale if its incoming messages have changed a lot.

Page 278: Structured Belief Propagation for NLP

Maintain a Queue of Stale Messages to Update

X1 X2 X3 X4 X5

time likeflies an arrow

X6

X8

X7

X9

Messages from factors are stale. Messages from variables are actually fresh (in sync with their uniform antecedents).

Initially all messages are

uniform.

Page 279: Structured Belief Propagation for NLP

Maintain a Queue of Stale Messages to Update

X1 X2 X3 X4 X5

time likeflies an arrow

X6

X8

X7

X9

Page 280: Structured Belief Propagation for NLP

Maintain a Queue of Stale Messages to Update

X1 X2 X3 X4 X5

time likeflies an arrow

X6

X8

X7

X9

Page 281: Structured Belief Propagation for NLP

Maintain a Queue of Stale Messages to Update

a priority queue! (heap)

• Residual BP: Always update the message that is most stale (would be most changed by an update).

• Maintain a priority queue of stale edges (& perhaps variables).
– Each step of residual BP: “Pop and update.”
– Prioritize by degree of staleness.
– When something becomes stale, put it on the queue.
– If it becomes staler, move it earlier on the queue.
– Need a measure of staleness (see the sketch below).

• So, process biggest updates first.
• Dramatically improves speed of convergence.

– And chance of converging at all. (Elidan et al., 2006)
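A minimal sketch of that “pop and update” loop (not the implementation from Elidan et al., 2006). The hooks recompute(edge) and dependents(edge), and the current message table, are hypothetical stand-ins for the BP machinery.

```python
import heapq
import itertools
import numpy as np

def residual_bp(edges, recompute, dependents, current, max_updates=10000):
    """current: dict edge -> last sent message (NumPy array).
    recompute(e): freshly computed message for edge e.
    dependents(e): edges whose antecedents include e."""
    heap, latest, counter = [], {}, itertools.count()

    def push(e):
        new = recompute(e)
        residual = float(np.abs(new - current[e]).max())   # measure of staleness
        tag = next(counter)
        latest[e] = tag
        heapq.heappush(heap, (-residual, tag, e, new))      # max-heap via negation

    for e in edges:
        push(e)
    for _ in range(max_updates):
        if not heap:
            break
        neg_res, tag, e, new = heapq.heappop(heap)
        if tag != latest[e] or -neg_res < 1e-8:
            continue                     # outdated heap entry, or nothing left to do
        current[e] = new                 # "pop and update" the most stale edge
        for d in dependents(e):          # downstream edges just became staler
            push(d)
    return current
```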

Page 282: Structured Belief Propagation for NLP

But what about the topology?

283

In a graph with no cycles:
1. Send messages from the leaves to the root.
2. Send messages from the root to the leaves.
Each outgoing message is sent only after all its incoming messages have been received.

X1 X2 X3 X4 X5

time flies like an arrow

X6

X8

X7

X9

Page 283: Structured Belief Propagation for NLP

A bad update order for residual BP!

time flies like an arrow

X0 X1 X2 X3 X4 X5

<START>

Page 284: Structured Belief Propagation for NLP

Try updating an entire acyclic subgraph

285

X1 X2 X3 X4 X5

time flies like an arrow

X6

X8

X7

X9

Tree-based Reparameterization (Wainwright et al. 2001); also see Residual Splash

Page 285: Structured Belief Propagation for NLP

Try updating an entire acyclic subgraph

286

X1 X2 X3 X4 X5

time flies like an arrow

X6

X8

X7

X9

Tree-based Reparameterization (Wainwright et al. 2001); also see Residual Splash

Pick this subgraph; update leaves to root, then root to leaves

Page 286: Structured Belief Propagation for NLP

Try updating an entire acyclic subgraph

287

X1 X2 X3 X4 X5

time flies like an arrow

X6

X8

X7

X9

Tree-based Reparameterization (Wainwright et al. 2001); also see Residual Splash

Another subgraph; update leaves to root, then root to leaves

Page 287: Structured Belief Propagation for NLP

Try updating an entire acyclic subgraph

288

X1 X2 X3 X4 X5

time flies like an arrow

X6

X8

X7

X9

Tree-based Reparameterization (Wainwright et al. 2001); also see Residual Splash

Another subgraph; update leaves to root, then root to leaves

At every step, pick a spanning tree (or spanning forest) that covers many stale edges

As we update messages in the tree, it affects staleness of messages outside the tree

Page 288: Structured Belief Propagation for NLP

Acyclic Belief Propagation

289

In a graph with no cycles:
1. Send messages from the leaves to the root.
2. Send messages from the root to the leaves.
Each outgoing message is sent only after all its incoming messages have been received. (A schedule for this two-pass order is sketched below.)

X1 X2 X3 X4 X5

time flies like an arrow

X6

X8

X7

X9
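A small sketch of that two-pass order, assuming a hypothetical adjacency map neighbors over the variable and factor nodes of a tree-shaped graph:

```python
def two_pass_schedule(neighbors, root):
    """Return directed message edges: first leaves -> root, then root -> leaves."""
    order = []
    def visit(node, parent):
        for nb in neighbors[node]:
            if nb != parent:
                visit(nb, node)
        if parent is not None:
            order.append((node, parent))             # upward pass: child -> parent
    visit(root, None)
    downward = [(v, u) for (u, v) in reversed(order)]  # replay the pass in reverse
    return order + downward

# Example on a tiny chain X1 - f12 - X2 (invented node names):
schedule = two_pass_schedule({'X1': ['f12'], 'f12': ['X1', 'X2'], 'X2': ['f12']}, 'X2')
# [('X1','f12'), ('f12','X2'), ('X2','f12'), ('f12','X1')]
```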

Page 289: Structured Belief Propagation for NLP

Don’t update a message if its antecedents will get a big update. Otherwise, we will have to re-update it.

Summary of Update Ordering

• Asynchronous
– Pick a directed edge: update its message
– Or, pick a vertex: update all its outgoing messages at once

290

In what order do we send messages for Loopy BP?

X1 X2 X3 X4 X5

time flies like an arrow

X6

X8

X7

X9

• Size. Send big updates first. Forces other messages to wait for them.
• Topology. Use graph structure.
• E.g., in an acyclic graph, a message can wait for all updates before sending.

Wait for your antecedents.

Page 290: Structured Belief Propagation for NLP

Message Scheduling

• Synchronous (bad idea)
– Compute all the messages
– Send all the messages

• Asynchronous
– Pick an edge: compute and send that message

• Tree-based Reparameterization
– Successively update embedded spanning trees
– Choose spanning trees such that each edge is included in at least one

• Residual BP
– Pick the edge whose message would change the most if sent: compute and send that message

291 Figure from (Elidan, McGraw, & Koller, 2006)

The order in which messages are sent has a significant effect on convergence

Page 291: Structured Belief Propagation for NLP

Message Scheduling

292 (Jiang, Moon, Daumé III, & Eisner, 2013)

Even better dynamic scheduling may be possible by reinforcement learning of a problem-specific heuristic for choosing which edge to update next.

Page 292: Structured Belief Propagation for NLP

Section 5: What if even BP is slow?

Computing fewer message updates

Computing them faster

293

A variable has k possible values. What if k is large or infinite?

Page 293: Structured Belief Propagation for NLP

Computing Variable Beliefs

294

X

ring 0.1, rang 3, rung 1

ring 0.4, rang 6, rung 0

ring 1, rang 2, rung 2

ring 4, rang 1, rung 0

Suppose…
– Xi is a discrete variable
– Each incoming message is a Multinomial

Pointwise product is easy when the variable’s domain is small and discrete
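For example, multiplying the four unnormalized messages shown on this slide and normalizing (a quick NumPy check):

```python
import numpy as np

msgs = np.array([[0.1, 3.0, 1.0],    # scores for ring, rang, rung
                 [0.4, 6.0, 0.0],
                 [1.0, 2.0, 2.0],
                 [4.0, 1.0, 0.0]])
belief = msgs.prod(axis=0)           # pointwise product of the incoming messages
belief /= belief.sum()               # normalize; rang gets nearly all the mass,
print(dict(zip(['ring', 'rang', 'rung'], belief)))   # rung is killed by the zeros
```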

Page 294: Structured Belief Propagation for NLP

Computing Variable Beliefs

Suppose…

– Xi is a real-valued variable

– Each incoming message is a Gaussian

The pointwise product of n Gaussians is… …a Gaussian! (See the identity below.)

295

X
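The identity relied on here is the standard one-dimensional Gaussian product rule (a textbook fact, stated for reference): precisions add, and the mean is the precision-weighted average.

```latex
\mathcal{N}(x;\mu_1,\sigma_1^2)\,\mathcal{N}(x;\mu_2,\sigma_2^2)
  \;\propto\;
  \mathcal{N}\!\left(x;\;
    \frac{\mu_1/\sigma_1^2+\mu_2/\sigma_2^2}{1/\sigma_1^2+1/\sigma_2^2},\;
    \Big(\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}\Big)^{-1}\right)
```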

Page 295: Structured Belief Propagation for NLP

Computing Variable Beliefs

Suppose…

– Xi is a real-valued variable

– Each incoming message is a mixture of k Gaussians

The pointwise product explodes!

296

X

p(x) = p1(x) p2(x) … pn(x) = (0.3 q1,1(x) + 0.7 q1,2(x)) (0.5 q2,1(x) + 0.5 q2,2(x)) …

Page 296: Structured Belief Propagation for NLP

Computing Variable Beliefs

297

X

Suppose…
– Xi is a string-valued variable (i.e. its domain is the set of all strings)
– Each incoming message is an FSA

The pointwise product explodes!

Page 297: Structured Belief Propagation for NLP

Example: String-valued Variables

• Messages can grow larger when sent through a transducer factor

• Repeatedly sending messages through a transducer can cause them to grow to unbounded size!

298

X2

X1

ψ2

a

a

εa

aa

a

(Dreyer & Eisner, 2009)

ψ1

aa

Page 298: Structured Belief Propagation for NLP

Example: String-valued Variables

• Messages can grow larger when sent through a transducer factor

• Repeatedly sending messages through a transducer can cause them to grow to unbounded size!

299

X2

X1

ψ2

a

εa

aa

a

(Dreyer & Eisner, 2009)

ψ1

aa

a a

Page 299: Structured Belief Propagation for NLP

Example: String-valued Variables

• Messages can grow larger when sent through a transducer factor

• Repeatedly sending messages through a transducer can cause them to grow to unbounded size!

300

X2

X1

ψ2εa

aa

(Dreyer & Eisner, 2009)

ψ1

aa

a a

a a a

Page 300: Structured Belief Propagation for NLP

Example: String-valued Variables

• Messages can grow larger when sent through a transducer factor

• Repeatedly sending messages through a transducer can cause them to grow to unbounded size!

301

X2

X1

ψ2εa

aa

(Dreyer & Eisner, 2009)

ψ1

aa

a a a

a a a

Page 301: Structured Belief Propagation for NLP

Example: String-valued Variables

• Messages can grow larger when sent through a transducer factor

• Repeatedly sending messages through a transducer can cause them to grow to unbounded size!

302

X2

X1

ψ2εa

aa

(Dreyer & Eisner, 2009)

ψ1

aa

a a a a a a a a a a a a

a a

Page 302: Structured Belief Propagation for NLP

Example: String-valued Variables

• Messages can grow larger when sent through a transducer factor

• Repeatedly sending messages through a transducer can cause them to grow to unbounded size!

303

X2

X1

ψ2εa

aa

(Dreyer & Eisner, 2009)

ψ1

aa

a a a a a a a a a a a a

a a

• The domain of these variables is infinite (i.e. Σ*);

• A WFSA’s representation is finite – but the size of the representation can grow

• In cases where the domain of each variable is small and finite this is not an issue

Page 303: Structured Belief Propagation for NLP

Message Approximations

Three approaches to dealing with complex messages:

1. Particle Belief Propagation (see Section 3)

2. Message pruning
3. Expectation propagation

304

Page 304: Structured Belief Propagation for NLP

For real variables, try a mixture of K Gaussians:
– E.g., true product is a mixture of K^d Gaussians
– Prune back: Randomly keep just K of them
– Chosen in proportion to weight in full mixture
– Gibbs sampling to efficiently choose them
– What if incoming messages are not Gaussian mixtures?
– Could be anything sent by the factors…
– Can extend technique to this case.

Message Pruning

• Problem: Product of d messages = complex distribution.
– Solution: Approximate with a simpler distribution.
– For speed, compute approximation without computing full product.

305

(Sudderth et al., 2002 –“Nonparametric BP”)

X

Page 305: Structured Belief Propagation for NLP

• Problem: Product of d messages = complex distribution.
– Solution: Approximate with a simpler distribution.
– For speed, compute approximation without computing full product.

For string variables, use a small finite set:
– Each message µi gives positive probability to…
– …every word in a 50,000 word vocabulary
– …every string in Σ* (using a weighted FSA)
– Prune back to a list L of a few “good” strings
– Each message adds its own K best strings to L
– For each x ∈ L, let µ(x) = ∏i µi(x) – each message scores x
– For each x ∉ L, let µ(x) = 0
(A sketch of this recipe appears below.)

Message Pruning

306

(Dreyer & Eisner, 2009)

X
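A minimal sketch of this pruning recipe, with plain dicts of invented strings and scores standing in for the weighted FSAs:

```python
def prune_messages(messages, K=3):
    """messages: list of dicts mapping string -> positive score."""
    L = set()
    for mu in messages:
        best = sorted(mu, key=mu.get, reverse=True)[:K]
        L.update(best)                    # each message adds its own K best strings
    pruned = {}
    for x in L:
        score = 1.0
        for mu in messages:
            # In the real setting every message assigns positive score to every
            # string; a dict stand-in needs a small floor for unlisted strings.
            score *= mu.get(x, 1e-12)
        pruned[x] = score                 # mu(x) = product of the messages' scores
    return pruned                         # strings outside L implicitly get 0

msgs = [{'ring': 0.1, 'rang': 3.0, 'rung': 1.0},
        {'ring': 0.4, 'rang': 6.0, 'rung': 0.5}]
print(prune_messages(msgs, K=2))
```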

Page 306: Structured Belief Propagation for NLP

• Problem: Product of d messages = complex distribution.
– Solution: Approximate with a simpler distribution.
– For speed, compute approximation without computing full product.

EP provides four special advantages over pruning:
1. General recipe that can be used in many settings.
2. Efficient. Uses approximations that are very fast.
3. Conservative. Unlike pruning, never forces b(x) to 0.
• Never kills off a value x that had been possible.
4. Adaptive. Approximates µ(x) more carefully if x is favored by the other messages.
• Tries to be accurate on the most “plausible” values.

Expectation Propagation (EP)

307 (Minka, 2001; Heskes & Zoeter, 2002)

Page 307: Structured Belief Propagation for NLP

Expectation Propagation (EP)

X1

X2

X3

X4

X7

X5

exponential-family approximations inside

Belief at X3 will be simple!

Messages to and from X3 will be simple!

Page 308: Structured Belief Propagation for NLP

Key idea: Approximate variable X’s incoming messages µ. We force them to have a simple parametric form: µ(x) = exp(θ ∙ f(x)), an unnormalized “log-linear model”, where f(x) extracts a feature vector from the value x. For each variable X, we’ll choose a feature function f.

Expectation Propagation (EP)

309

So by storing a few parameters θ, we’ve defined µ(x) for all x. Now the messages are super-easy to multiply:

µ1(x) µ2(x) = exp(θ1 ∙ f(x)) exp(θ2 ∙ f(x)) = exp((θ1+θ2) ∙ f(x))

Represent a message by its parameter vector θ. To multiply messages, just add their θ vectors! So beliefs and outgoing messages also have this simple form. (See the sketch below.)

Maybe unnormalizable: e.g., the initial message θ = 0 is a uniform “distribution”.
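A tiny numerical sketch of this message algebra; the feature function and θ values below are arbitrary and only check the identity.

```python
import numpy as np

def message(theta, f):
    """Unnormalized log-linear message mu(x) = exp(theta . f(x))."""
    return lambda x: np.exp(theta @ f(x))

f = lambda x: np.array([x, x**2])              # Gaussian-style features (x, x^2)
theta1, theta2 = np.array([0.5, -1.0]), np.array([1.5, 2.0])

prod_theta = theta1 + theta2                   # product of messages: add the thetas
x = 0.7
lhs = message(theta1, f)(x) * message(theta2, f)(x)
rhs = message(prod_theta, f)(x)
assert np.isclose(lhs, rhs)                    # same value, computed two ways
```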

Page 309: Structured Belief Propagation for NLP

Expectation Propagation

X1

X2

X3

X4

X7

exponential-family approximations inside

X5

• Form of messages/beliefs at X3?
– Always µ(x) = exp(θ ∙ f(x))
• If x is real:
– Gaussian: Take f(x) = (x, x^2)
• If x is string:
– Globally normalized trigram model: Take f(x) = (count of aaa, count of aab, …, count of zzz)
• If x is discrete:
– Arbitrary discrete distribution (can exactly represent original message, so we get ordinary BP)

– Coarsened discrete distribution, based on features of x

• Can’t use mixture models, or other models that use latent variables to define µ(x) = ∑y p(x, y)

Page 310: Structured Belief Propagation for NLP

Expectation Propagation

X1

X2

X3

X4

X7

exponential-family approximations inside

X5

• Each message to X3 is µ(x) = exp(θ ∙ f(x)) for some θ. We only store θ.
• To take a product of such messages, just add their θ vectors.
– Easily compute belief at X3 (sum of incoming θ vectors)
– Then easily compute each outgoing message (belief minus one incoming θ)

• All very easy …

Page 311: Structured Belief Propagation for NLP

Expectation Propagation

X1

X2

X3

X4

X7

X5

• But what about messages from factors?
– Like the message M4.
– This is not exponential family! Uh-oh!
– It’s just whatever the factor happens to send.
• This is where we need to approximate, by µ4.

M4, µ4

Page 312: Structured Belief Propagation for NLP

X5

Expectation Propagation

X1

X2

X3

X4

X7 µ1

µ2

µ3

µ4M4

• blue = arbitrary distribution, green = simple distribution exp (θ ∙ f(x))

• The belief at x “should” be p(x) = µ1(x) ∙ µ2(x) ∙ µ3(x) ∙ M4(x)
• But we’ll be using b(x) = µ1(x) ∙ µ2(x) ∙ µ3(x) ∙ µ4(x)
• Choose the simple distribution b that minimizes KL(p || b).
• Then, work backward from belief b to message µ4.
– Take the θ vector of b and subtract off the θ vectors of µ1, µ2, µ3.
– Chooses µ4 to preserve the belief well.

That is, choose b that assigns high probability to samples from p. Find b’s params θ in closed form, or follow the gradient: E_{x~p}[f(x)] – E_{x~b}[f(x)]
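Stated in symbols (a standard exponential-family identity, added here for reference): the KL objective is minimized by moment matching, and the approximate factor message is recovered by subtracting parameters.

```latex
\nabla_\theta \,\mathrm{KL}\!\left(p \,\|\, b_\theta\right)
  \;=\; \mathbb{E}_{x\sim b_\theta}[f(x)] \;-\; \mathbb{E}_{x\sim p}[f(x)],
\qquad
\theta_{\mu_4} \;=\; \theta_b \;-\; \theta_{\mu_1} \;-\; \theta_{\mu_2} \;-\; \theta_{\mu_3}
```

So following E_{x~p}[f(x)] – E_{x~b}[f(x)], as on the slide, is a descent direction for this KL, and at the optimum the moments match.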

Page 313: Structured Belief Propagation for NLP

fo 2.6, foo -0.5, bar 1.2, az 3.1, xy -6.0

Fit model that predicts same counts

Broadcast n-gram counts

ML Estimation = Moment Matching

fo = 3

bar = 2

az = 4

foo = 1

Page 314: Structured Belief Propagation for NLP

FSA Approx. = Moment Matching

(can compute with forward-backward)

xx = 0.1
zz = 0.1
fo = 3.1
bar = 2.2
az = 4.1
foo = 0.9

fo 2.6, foo -0.5, bar 1.2, az 3.1, xy -6.0

[Figure: weighted FSAs over strings]

A distribution over strings

Fit model that predicts same fractional counts

Page 315: Structured Belief Propagation for NLP

FSA Approx. = Moment Matching

minθ KL( [weighted FSA belief] || [model with features fo, bar, az, …] )

Finds parameters θ that minimize KL “error” in belief

Page 316: Structured Belief Propagation for NLP

fo bar az ?

KL( || )

How to approximate a message?

fo 1.2, bar 0.5, az 4.3
fo 0.2, bar 1.1, az -0.3
fo 1.2, bar 0.5, az 4.3

Finds message parameters θ that minimize KL “error” of resulting belief

θ

fo 0.2, bar 1.1, az -0.3
fo 0.6, bar 0.6, az -1.0
fo 2.0, bar -1.0, az 3.0

Wisely, KL doesn’t insist on good approximations for values that are low-probability in the belief

Page 317: Structured Belief Propagation for NLP

KL( || )

Analogy: Scheduling by approximate email messages

Wisely, KL doesn’t insist on good approximations for values that are low-probability in the belief

“I prefer Tue/Thu, and I’m away the last week of July.”

Tue 1.0, Thu 1.0, last -3.0

(This is an approximation to my true schedule. I’m not actually free on all Tue/Thu, but the bad Tue/Thu dates have already been ruled out by messages from other folks.)

Page 318: Structured Belief Propagation for NLP

Expectation Propagation

319 (Hall & Klein, 2012)

Example: Factored PCFGs

• Task: Constituency parsing, with factored annotations
– Lexical annotations
– Parent annotations
– Latent annotations
• Approach:
– Sentence-specific approximation is an anchored grammar: q(A → B C, i, j, k)
– Sending messages is equivalent to marginalizing out the annotations

adaptive approximation

Page 319: Structured Belief Propagation for NLP

320

Section 6: Approximation-aware Training

Page 320: Structured Belief Propagation for NLP

Outline

• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

321

Page 321: Structured Belief Propagation for NLP

Outline

• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

322

Page 322: Structured Belief Propagation for NLP

323

Modern NLP

Linguistics

Mathematical Modeling

Machine Learning

Combinatorial Optimization

NLP

Page 323: Structured Belief Propagation for NLP

324

Machine Learning for NLP

Linguistics

No semantic interpretation

Linguistics inspires the structures we want to predict

Page 324: Structured Belief Propagation for NLP

325

Machine Learning for NLP

Linguistics    Mathematical Modeling

pθ( ) = 0.50

pθ( ) = 0.25

pθ( ) = 0.10

pθ( ) = 0.01

Our model defines a score for each structure

Page 325: Structured Belief Propagation for NLP

326

Machine Learning for NLP

Linguistics    Mathematical Modeling

pθ( ) = 0.50

pθ( ) = 0.25

pθ( ) = 0.10

pθ( ) = 0.01

It also tells us what to optimize

Our model defines a score for each structure

Page 326: Structured Belief Propagation for NLP

327

Machine Learning for NLP

Linguistics    Mathematical Modeling

Machine Learning

Learning tunes the parameters of the model

x1:y1:

x2:y2:

x3:y3:

Given training instances {(x1, y1), (x2, y2), …, (xn, yn)}

Find the best model parameters, θ

Page 327: Structured Belief Propagation for NLP

328

Machine Learning for NLP

Linguistics    Mathematical Modeling

Machine Learning

Learning tunes the parameters of the model

x1:y1:

x2:y2:

x3:y3:

Given training instances {(x1, y1), (x2, y2), …, (xn, yn)}

Find the best model parameters, θ

Page 328: Structured Belief Propagation for NLP

329

Machine Learning for NLP

Linguistics    Mathematical Modeling

Machine Learning

Learning tunes the parameters of the model

x1:y1:

x2:y2:

x3:y3:

Given training instances {(x1, y1), (x2, y2), …, (xn, yn)}

Find the best model parameters, θ

Page 329: Structured Belief Propagation for NLP

330

xnew:y*:

Machine Learning for NLP

Linguistics    Mathematical Modeling    Machine Learning    Combinatorial Optimization

Inference finds the best structure y* for a new sentence xnew

• Given a new sentence, xnew

• Search over the set of all possible structures (often exponential in size of xnew)

• Return the highest scoring structure, y*

(Inference is usually called as a subroutine in learning)

Page 330: Structured Belief Propagation for NLP

331

xnew:y*:

Machine Learning for NLP

Linguistics    Mathematical Modeling    Machine Learning    Combinatorial Optimization

Inference finds the best structure y* for a new sentence xnew

• Given a new sentence, xnew

• Search over the set of all possible structures (often exponential in size of xnew)

• Return the Minimum Bayes Risk (MBR) structure, y*

(Inference is usually called as a subroutine in learning)

Page 331: Structured Belief Propagation for NLP

332

Machine Learning for NLP

Linguistics    Mathematical Modeling    Machine Learning    Combinatorial Optimization

Inference finds the best structure for a new sentence

• Given a new sentence, xnew

• Search over the set of all possible structures (often exponential in size of xnew)

• Return the Minimum Bayes Risk (MBR) structure, y*

(Inference is usually called as a subroutine in learning) 332

Easy Polynomial time NP-hard

Page 332: Structured Belief Propagation for NLP

333

Modern NLP

Linguistics inspires the structures we want to predict. It also tells us what to optimize.

Our model defines a score for each structure.

Learning tunes the parameters of the model.

Inference finds the best structure for a new sentence.

Linguistics    Mathematical Modeling    Machine Learning    Combinatorial Optimization    NLP

(Inference is usually called as a subroutine in learning)

Page 333: Structured Belief Propagation for NLP

334

An Abstraction for Modeling

Mathematical Modeling

Now we can work at this level of abstraction.

Page 334: Structured Belief Propagation for NLP

335

Training

Thus far, we’ve seen how to compute (approximate) marginals, given a factor graph…

…but where do the potential tables ψα come from?

– Some have a fixed structure (e.g. Exactly1, CKYTree)

– Others could be trained ahead of time (e.g. TrigramHMM)

– For the rest, we define them parametrically and learn the parameters!

Two ways to learn:

1. Standard CRF Training (very simple; often yields state-of-the-art results)

2. ERMA (less simple; but takes approximations and loss function into account)

Page 335: Structured Belief Propagation for NLP

336

Standard CRF Parameterization

Define each potential function in terms of a fixed set of feature functions (see the sketch below):

Observed variables; predicted variables
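A minimal sketch of one such potential, ψ(y, x) = exp(θ ∙ f(y, x)), with an invented two-feature emission function for the running example (the weights are arbitrary):

```python
import numpy as np

def potential(theta, f, y, x):
    """Log-linear potential: psi(y, x) = exp(theta . f(y, x))."""
    return np.exp(theta @ f(y, x))

def f(tag, word):
    # Hypothetical binary features for an emission factor.
    return np.array([float(tag == 'n' and word == 'flies'),
                     float(tag == 'v' and word == 'flies')])

theta = np.array([0.3, 1.2])
print(potential(theta, f, 'v', 'flies'))   # exp(1.2)
```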

Page 336: Structured Belief Propagation for NLP

337

Standard CRF Parameterization

Define each potential function in terms of a fixed set of feature functions:

time flies like an arrow

n ψ2 v ψ4 p ψ6 d ψ8 n

ψ1 ψ3 ψ5 ψ7 ψ9

Page 337: Structured Belief Propagation for NLP

338

Standard CRF Parameterization

Define each potential function in terms of a fixed set of feature functions:

n

ψ1

ψ2 v

ψ3

ψ4 p

ψ5

ψ6 d

ψ7

ψ8 n

ψ9

time flies like an arrow

np ψ10

vp ψ12

pp ψ11

s ψ13

Page 338: Structured Belief Propagation for NLP

339

What is Training?

That’s easy:

Training = picking good model parameters!

But how do we know if the model parameters are any “good”?

Page 339: Structured Belief Propagation for NLP

340

Machine Learning

Conditional Log-likelihood Training

1. Choose model (such that derivative in #3 is easy)
2. Choose objective: Assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule

4. Replace exact inference by approximate inference

Page 340: Structured Belief Propagation for NLP

341

Conditional Log-likelihood Training

1. Choose model (such that derivative in #3 is easy)
2. Choose objective: Assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule

4. Replace exact inference by approximate inference

Machine Learning

We can approximate the factor marginals by the (normalized) factor beliefs from BP!
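Written out for a log-linear factor graph (standard CRF identities, included for reference), the gradient is observed features minus expected features; the per-factor expectations over the true marginals are exactly what the normalized factor beliefs from BP stand in for:

```latex
\ell(\theta) \;=\; \sum_{\alpha} \theta \cdot f_\alpha(y_\alpha, x) \;-\; \log Z(x;\theta),
\qquad
\frac{\partial \ell}{\partial \theta}
  \;=\; \sum_{\alpha} f_\alpha(y_\alpha, x)
  \;-\; \sum_{\alpha} \mathbb{E}_{y'_\alpha \sim p_\theta(\cdot \mid x)}\!\left[f_\alpha(y'_\alpha, x)\right]
```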

Page 341: Structured Belief Propagation for NLP

343

What’s wrong with the usual approach?

If you add too many factors, your predictions might get worse!

• The model might be richer, but we replace the true marginals with approximate marginals (e.g. beliefs computed by BP)

• Approximate inference can cause gradients for structured learning to go awry! (Kulesza & Pereira, 2008).

Page 342: Structured Belief Propagation for NLP

344

What’s wrong with the usual approach?

Mistakes made by Standard CRF Training:
1. Using BP (approximate)
2. Not taking loss function into account
3. Should be doing MBR decoding

Big pile of approximations… …which has tunable parameters.

Treat it like a neural net, and run backprop!

Page 343: Structured Belief Propagation for NLP

345

Modern NLP

Linguistics inspires the structures we want to predict. It also tells us what to optimize.

Our model defines a score for each structure.

Learning tunes the parameters of the model.

Inference finds the best structure for a new sentence.

Linguistics    Mathematical Modeling    Machine Learning    Combinatorial Optimization    NLP

(Inference is usually called as a subroutine in learning)

Page 344: Structured Belief Propagation for NLP

346

Empirical Risk Minimization

1. Given training data:

2. Choose each of these:

– Decision function

– Loss function

Examples: Linear regression, Logistic regression, Neural Network

Examples: Mean-squared error, Cross Entropy

x1:y1:

x2:y2:

x3:y3:

Page 345: Structured Belief Propagation for NLP

347

Empirical Risk Minimization

1. Given training data:

3. Define goal:

2. Choose each of these:

– Decision function

– Loss function

4. Train with SGD: (take small steps opposite the gradient)
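Step 4 as a minimal sketch; grad_loss is a hypothetical per-example gradient of the chosen loss composed with the decision function.

```python
def sgd(theta, data, grad_loss, lr=0.1, epochs=10):
    """Take small steps opposite the (per-example) gradient."""
    for _ in range(epochs):
        for x, y in data:
            theta = theta - lr * grad_loss(theta, x, y)
    return theta
```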

Page 346: Structured Belief Propagation for NLP

348

1. Given training data:

3. Define goal:

2. Choose each of these:

– Decision function

– Loss function

4. Train with SGD: (take small steps opposite the gradient)

Empirical Risk Minimization

Page 347: Structured Belief Propagation for NLP

349

Conditional Log-likelihood Training

1. Choose model (such that derivative in #3 is easy)
2. Choose objective: Assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule

4. Replace true inference by approximate inference

Machine Learning

Page 348: Structured Belief Propagation for NLP

350

What went wrong?

How did we compute these approximate marginal probabilities anyway?

By Belief Propagation, of course!

Machine Learning

Page 349: Structured Belief Propagation for NLP

Error Back-Propagation

351 Slide from (Stoyanov & Eisner, 2012)

Page 350: Structured Belief Propagation for NLP

Error Back-Propagation

352 Slide from (Stoyanov & Eisner, 2012)

Page 351: Structured Belief Propagation for NLP

Error Back-Propagation

353 Slide from (Stoyanov & Eisner, 2012)

Page 352: Structured Belief Propagation for NLP

Error Back-Propagation

354 Slide from (Stoyanov & Eisner, 2012)

Page 353: Structured Belief Propagation for NLP

Error Back-Propagation

355 Slide from (Stoyanov & Eisner, 2012)

Page 354: Structured Belief Propagation for NLP

Error Back-Propagation

356 Slide from (Stoyanov & Eisner, 2012)

Page 355: Structured Belief Propagation for NLP

Error Back-Propagation

357 Slide from (Stoyanov & Eisner, 2012)

Page 356: Structured Belief Propagation for NLP

Error Back-Propagation

358 Slide from (Stoyanov & Eisner, 2012)

Page 357: Structured Belief Propagation for NLP

Error Back-Propagation

359 Slide from (Stoyanov & Eisner, 2012)

Page 358: Structured Belief Propagation for NLP

Error Back-Propagation

360

y3

P(y3=noun|x)

μ(y1→y2) = μ(y3→y1) * μ(y4→y1)

ϴ

Slide from (Stoyanov & Eisner, 2012)

Page 359: Structured Belief Propagation for NLP

Error Back-Propagation

• Applying the chain rule of differentiation over and over.
• Forward pass:
– Regular computation (inference + decoding) in the model (+ remember intermediate quantities).
• Backward pass:
– Replay the forward pass in reverse, computing gradients.

361

Page 360: Structured Belief Propagation for NLP

362

Background: Backprop through time

Recurrent neural network:

BPTT: 1. Unroll the computation over time

(Robinson & Fallside, 1987)

(Werbos, 1988)

(Mozer, 1995)

a xt

bt

xt+1

yt+1

a x1

b1

x2

b2

x3

b3

x4

y4

2. Run backprop through the resulting feed-forward network

Page 361: Structured Belief Propagation for NLP

363

What went wrong?

How did we compute these approximate marginal probabilities anyway?

By Belief Propagation, of course!

Machine Learning

Page 362: Structured Belief Propagation for NLP

364

ERMA

Empirical Risk Minimization under Approximations (ERMA):
• Apply Backprop through time to Loopy BP
• Unrolls the BP computation graph
• Includes inference, decoding, loss, and all the approximations along the way
(A toy sketch follows this slide.)

(Stoyanov, Ropson, & Eisner, 2011)
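A toy sketch of this idea, not the authors' ERMA system: the training loss is defined through an unrolled BP computation (a two-variable chain here, where BP converges after one sweep; on a loopy graph the unrolled sweeps would genuinely differ), and a finite-difference gradient stands in for the true backward pass. All names and factor choices below are invented.

```python
import numpy as np

def loss_via_unrolled_bp(theta, n_sweeps=3):
    # Two binary tags y1, y2: unary potentials exp(theta[0:2]), exp(theta[2:4]),
    # and a pairwise "agreement" factor exp(theta[4] * I[y1 == y2]).
    unary1, unary2 = np.exp(theta[0:2]), np.exp(theta[2:4])
    pair = np.exp(theta[4] * np.eye(2))
    m_to_y2 = np.full(2, 0.5)                  # messages start uniform
    m_to_y1 = np.full(2, 0.5)
    for _ in range(n_sweeps):                  # unrolled BP iterations
        m_to_y2 = pair.T @ unary1; m_to_y2 /= m_to_y2.sum()
        m_to_y1 = pair @ unary2;   m_to_y1 /= m_to_y1.sum()
    b1 = unary1 * m_to_y1; b1 /= b1.sum()      # beliefs = approximate marginals
    b2 = unary2 * m_to_y2; b2 /= b2.sum()
    gold = (0, 1)                              # supervised tags for y1, y2
    return -(np.log(b1[gold[0]]) + np.log(b2[gold[1]]))   # loss on the beliefs

theta = np.zeros(5)
eps, grad = 1e-5, np.zeros(5)
for i in range(5):                             # finite differences stand in for
    e = np.zeros(5); e[i] = eps                # the true backward pass of backprop
    grad[i] = (loss_via_unrolled_bp(theta + e)
               - loss_via_unrolled_bp(theta - e)) / (2 * eps)
print(grad)
```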

Page 363: Structured Belief Propagation for NLP

365

ERMA

1. Choose model to be the computation with all its approximations
2. Choose objective to likewise include the approximations
3. Compute derivative by backpropagation (treating the entire computation as if it were a neural network)
4. Make no approximations! (Our gradient is exact)

Machine Learning

Key idea: Open up the black box!

(Stoyanov, Ropson, & Eisner, 2011)

Page 364: Structured Belief Propagation for NLP

366

ERMA

Empirical Risk Minimization

Machine Learning

Key idea: Open up the black box!

Minimum Bayes Risk (MBR) Decoder

(Stoyanov, Ropson, & Eisner, 2011)

Page 365: Structured Belief Propagation for NLP

367

Approximation-aware Learning

What if we’re using Structured BP instead of regular BP?
• No problem, the same approach still applies!
• The only difference is that we embed dynamic programming algorithms inside our computation graph.

Machine Learning

Key idea: Open up the black box!

(Gormley, Dredze, & Eisner, 2015)

Page 366: Structured Belief Propagation for NLP

368

Connection to Deep Learning

y

exp(Θyf(x))

(Gormley, Yu, & Dredze, In submission)

Page 367: Structured Belief Propagation for NLP

369

Empirical Risk Minimization under Approximations (ERMA)

Approximation Aware (No / Yes) × Loss Aware (No / Yes):
– Not loss aware, not approximation aware: MLE
– Loss aware, not approximation aware: SVMstruct [Finley and Joachims, 2008], M3N [Taskar et al., 2003], Softmax-margin [Gimpel & Smith, 2010]
– Loss aware and approximation aware: ERMA

Figure from (Stoyanov & Eisner, 2012)

Page 368: Structured Belief Propagation for NLP

370

Section 7: Software

Page 369: Structured Belief Propagation for NLP

Outline

• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

371

Page 370: Structured Belief Propagation for NLP

Outline

• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

372

Page 371: Structured Belief Propagation for NLP

373

Pacaya

Features:
– Structured Loopy BP over factor graphs with:
• Discrete variables
• Structured constraint factors (e.g. projective dependency tree constraint factor)
• ERMA training with backpropagation
• Backprop through structured factors (Gormley, Dredze, & Eisner, 2015)

Language: Java
Authors: Gormley, Mitchell, & Wolfe
URL: http://www.cs.jhu.edu/~mrg/software/

(Gormley, Mitchell, Van Durme, & Dredze, 2014) (Gormley, Dredze, & Eisner, 2015)

Page 372: Structured Belief Propagation for NLP

374

ERMA

Features: ERMA performs inference and training on CRFs and MRFs with arbitrary model structure over discrete variables. The training regime, Empirical Risk Minimization under Approximations, is loss-aware and approximation-aware. ERMA can optimize several loss functions such as Accuracy, MSE and F-score.

Language: Java
Authors: Stoyanov
URL: https://sites.google.com/site/ermasoftware/

(Stoyanov, Ropson, & Eisner, 2011) (Stoyanov & Eisner, 2012)

Page 373: Structured Belief Propagation for NLP

375

Graphical Models Libraries• Factorie (McCallum, Shultz, & Singh, 2012) is a Scala library allowing modular specification of

inference, learning, and optimization methods. Inference algorithms include belief propagation and MCMC. Learning settings include maximum likelihood learning, maximum margin learning, learning with approximate inference, SampleRank, pseudo-likelihood.http://factorie.cs.umass.edu/

• LibDAI (Mooij, 2010) is a C++ library that supports inference, but not learning, via Loopy BP, Fractional BP, Tree-Reweighted BP, (Double-loop) Generalized BP, variants of Loop Corrected Belief Propagation, Conditioned Belief Propagation, and Tree Expectation Propagation.http://www.libdai.org

• OpenGM2 (Andres, Beier, & Kappes, 2012) provides a C++ template library for discrete factor graphs with support for learning and inference (including tie-ins to all LibDAI inference algorithms).http://hci.iwr.uni-heidelberg.de/opengm2/

• FastInf (Jaimovich, Meshi, Mcgraw, Elidan) is an efficient Approximate Inference Library in C++.http://compbio.cs.huji.ac.il/FastInf/fastInf/FastInf_Homepage.html

• Infer.NET (Minka et al., 2012) is a .NET language framework for graphical models with support for Expectation Propagation and Variational Message Passing.http://research.microsoft.com/en-us/um/cambridge/projects/infernet

Page 374: Structured Belief Propagation for NLP

376

References

Page 375: Structured Belief Propagation for NLP

377

• M. Auli and A. Lopez, “A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, 2011, pp. 470–480.

• M. Auli and A. Lopez, “Training a Log-Linear Parser with Loss Functions via Softmax-Margin,” in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK., 2011, pp. 333–343.

• Y. Bengio, “Training a neural network with a financial criterion rather than a prediction criterion,” in Decision Technologies for Financial Engineering: Proceedings of the Fourth International Conference on Neural Networks in the Capital Markets (NNCM’96), World Scientific Publishing, 1997, pp. 36–48.

• D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Prentice-Hall, Inc., 1989.

• D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Athena Scientific, 1997.

• L. Bottou and P. Gallinari, “A Framework for the Cooperation of Learning Algorithms,” in Advances in Neural Information Processing Systems, vol. 3, D. Touretzky and R. Lippmann, Eds. Denver: Morgan Kaufmann, 1991.

• R. Bunescu and R. J. Mooney, “Collective information extraction with relational Markov networks,” 2004, p. 438–es.

• C. Burfoot, S. Bird, and T. Baldwin, “Collective Classification of Congressional Floor-Debate Transcripts,” presented at the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Techologies, 2011, pp. 1506–1515.

• D. Burkett and D. Klein, “Fast Inference in Phrase Extraction Models with Belief Propagation,” presented at the Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012, pp. 29–38.

• T. Cohn and P. Blunsom, “Semantic Role Labelling with Tree Conditional Random Fields,” presented at the Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), 2005, pp. 169–172.

Page 376: Structured Belief Propagation for NLP

378

• F. Cromierès and S. Kurohashi, “An Alignment Algorithm Using Belief Propagation and a Structure-Based Distortion Model,” in Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), Athens, Greece, 2009, pp. 166–174.

• M. Dreyer, “A non-parametric model for the discovery of inflectional paradigms from plain text using graphical models over strings,” Johns Hopkins University, Baltimore, MD, USA, 2011.

• M. Dreyer and J. Eisner, “Graphical Models over Multiple Strings,” presented at the Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 101–110.

• M. Dreyer and J. Eisner, “Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model,” presented at the Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 616–627.

• J. Duchi, D. Tarlow, G. Elidan, and D. Koller, “Using Combinatorial Optimization within Max-Product Belief Propagation,” Advances in neural information processing systems, 2006.

• G. Durrett, D. Hall, and D. Klein, “Decentralized Entity-Level Modeling for Coreference Resolution,” presented at the Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2013, pp. 114–124.

• G. Elidan, I. McGraw, and D. Koller, “Residual belief propagation: Informed scheduling for asynchronous message passing,” in Proceedings of the Twenty-second Conference on Uncertainty in AI (UAI), 2006.

• K. Gimpel and N. A. Smith, “Softmax-Margin CRFs: Training Log-Linear Models with Cost Functions,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, 2010, pp. 733–736.

• J. Gonzalez, Y. Low, and C. Guestrin, “Residual splash for optimally parallelizing belief propagation,” in International Conference on Artificial Intelligence and Statistics, 2009, pp. 177–184.

• D. Hall and D. Klein, “Training Factored PCFGs with Expectation Propagation,” presented at the Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 1146–1156.

Page 377: Structured Belief Propagation for NLP

379

• T. Heskes, “Stable fixed points of loopy belief propagation are minima of the Bethe free energy,” Advances in Neural Information Processing Systems, vol. 15, pp. 359–366, 2003.

• T. Heskes and O. Zoeter, “Expectation propagation for approximate inference in dynamic Bayesian networks,” Uncertainty in Artificial Intelligence, 2002, pp. 216-233.

• A. T. Ihler, J. W. Fisher III, A. S. Willsky, and D. M. Chickering, “Loopy belief propagation: convergence and effects of message errors.,” Journal of Machine Learning Research, vol. 6, no. 5, 2005.

• A. T. Ihler and D. A. McAllester, “Particle belief propagation,” in International Conference on Artificial Intelligence and Statistics, 2009, pp. 256–263.

• J. Jancsary, J. Matiasek, and H. Trost, “Revealing the Structure of Medical Dictations with Conditional Random Fields,” presented at the Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 2008, pp. 1–10.

• J. Jiang, T. Moon, H. Daumé III, and J. Eisner, “Prioritized Asynchronous Belief Propagation,” in ICML Workshop on Inferning, 2013.

• A. Kazantseva and S. Szpakowicz, “Linear Text Segmentation Using Affinity Propagation,” presented at the Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 284–293.

• T. Koo and M. Collins, “Hidden-Variable Models for Discriminative Reranking,” presented at the Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 2005, pp. 507–514.

• A. Kulesza and F. Pereira, “Structured Learning with Approximate Inference.,” in NIPS, 2007, vol. 20, pp. 785–792.

• J. Lee, J. Naradowsky, and D. A. Smith, “A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, 2011, pp. 885–894.

• S. Lee, “Structured Discriminative Model For Dialog State Tracking,” presented at the Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 442–451.

Page 378: Structured Belief Propagation for NLP

380

• X. Liu, M. Zhou, X. Zhou, Z. Fu, and F. Wei, “Joint Inference of Named Entity Recognition and Normalization for Tweets,” presented at the Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2012, pp. 526–535.

• D. J. C. MacKay, J. S. Yedidia, W. T. Freeman, and Y. Weiss, “A Conversation about the Bethe Free Energy and Sum-Product,” MERL, TR2001-18, 2001.

• A. Martins, N. Smith, E. Xing, P. Aguiar, and M. Figueiredo, “Turbo Parsers: Dependency Parsing by Approximate Variational Inference,” presented at the Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010, pp. 34–44.

• D. McAllester, M. Collins, and F. Pereira, “Case-Factor Diagrams for Structured Probabilistic Modeling,” in In Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI’04), 2004.

• T. Minka, “Divergence measures and message passing,” Technical report, Microsoft Research, 2005.
• T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in Uncertainty in Artificial Intelligence, 2001, vol. 17, pp. 362–369.
• M. Mitchell, J. Aguilar, T. Wilson, and B. Van Durme, “Open Domain Targeted Sentiment,” presented at the Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1643–1654.

• K. P. Murphy, Y. Weiss, and M. I. Jordan, “Loopy belief propagation for approximate inference: An empirical study,” in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, 1999, pp. 467–475.

• T. Nakagawa, K. Inui, and S. Kurohashi, “Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables,” presented at the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 786–794.

• J. Naradowsky, S. Riedel, and D. Smith, “Improving NLP through Marginalization of Hidden Syntactic Structure,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 810–820.

• J. Naradowsky, T. Vieira, and D. A. Smith, Grammarless Parsing for Joint Inference. Mumbai, India, 2012.
• J. Niehues and S. Vogel, “Discriminative Word Alignment via Alignment Matrix Modeling,” presented at the Proceedings of the Third Workshop on Statistical Machine Translation, 2008, pp. 18–25.

Page 379: Structured Belief Propagation for NLP

381

• J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.

• X. Pitkow, Y. Ahmadian, and K. D. Miller, “Learning unbelievable probabilities,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 738–746.

• V. Qazvinian and D. R. Radev, “Identifying Non-Explicit Citing Sentences for Citation-Based Summarization.,” presented at the Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 555–564.

• H. Ren, W. Xu, Y. Zhang, and Y. Yan, “Dialog State Tracking using Conditional Random Fields,” presented at the Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 457–461.

• D. Roth and W. Yih, “Probabilistic Reasoning for Entity & Relation Recognition,” presented at the COLING 2002: The 19th International Conference on Computational Linguistics, 2002.

• A. Rudnick, C. Liu, and M. Gasser, “HLTDI: CL-WSD Using Markov Random Fields for SemEval-2013 Task 10,” presented at the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 2013, pp. 171–177.

• T. Sato, “Inside-Outside Probability Computation for Belief Propagation.,” in IJCAI, 2007, pp. 2605–2610.

• D. A. Smith and J. Eisner, “Dependency Parsing by Belief Propagation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Honolulu, 2008, pp. 145–156.

• V. Stoyanov and J. Eisner, “Fast and Accurate Prediction via Evidence-Specific MRF Structure,” in ICML Workshop on Inferning: Interactions between Inference and Learning, Edinburgh, 2012.

• V. Stoyanov and J. Eisner, “Minimum-Risk Training of Approximate CRF-Based NLP Systems,” in Proceedings of NAACL-HLT, 2012, pp. 120–130.

Page 380: Structured Belief Propagation for NLP

382

• V. Stoyanov, A. Ropson, and J. Eisner, “Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, 2011, vol. 15, pp. 725–733.

• E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” MIT, Technical Report 2551, 2002.

• E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” in In Proceedings of CVPR, 2003.

• E. B. Sudderth, A. T. Ihler, M. Isard, W. T. Freeman, and A. S. Willsky, “Nonparametric belief propagation,” Communications of the ACM, vol. 53, no. 10, pp. 95–103, 2010.

• C. Sutton and A. McCallum, “Collective Segmentation and Labeling of Distant Entities in Information Extraction,” in ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields, 2004.

• C. Sutton and A. McCallum, “Piecewise Training of Undirected Models,” in Conference on Uncertainty in Artificial Intelligence (UAI), 2005.

• C. Sutton and A. McCallum, “Improved dynamic schedules for belief propagation,” UAI, 2007.
• M. J. Wainwright, T. Jaakkola, and A. S. Willsky, “Tree-based reparameterization for approximate inference on loopy graphs,” in NIPS, 2001, pp. 1001–1008.
• Z. Wang, S. Li, F. Kong, and G. Zhou, “Collective Personal Profile Summarization with Social Networks,” presented at the Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 715–725.

• Y. Watanabe, M. Asahara, and Y. Matsumoto, “A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields,” presented at the Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, pp. 649–657.

Page 381: Structured Belief Propagation for NLP

383

• Y. Weiss and W. T. Freeman, “On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs,” Information Theory, IEEE Transactions on, vol. 47, no. 2, pp. 736–744, 2001.

• J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Bethe free energy, Kikuchi approximations, and belief propagation algorithms,” MERL, TR2001-16, 2001.

• J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free-energy approximations and generalized belief propagation algorithms,” IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2282–2312, Jul. 2005.

• J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Generalized belief propagation,” in NIPS, 2000, vol. 13, pp. 689–695.

• J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Understanding belief propagation and its generalizations,” Exploring artificial intelligence in the new millennium, vol. 8, pp. 236–239, 2003.

• J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing Free Energy Approximations and Generalized Belief Propagation Algorithms,” MERL, TR-2004-040, 2004.

• J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free-energy approximations and generalized belief propagation algorithms,” Information Theory, IEEE Transactions on, vol. 51, no. 7, pp. 2282–2312, 2005.