Dual Decomposition for Parsing with Non-Projective Head Automata

Terry Koo, Alexander M. Rush, Michael Collins, David Sontag, and Tommi Jaakkola
The Cost of Model Complexity

We are always looking for better ways to model natural language.

Tradeoff: Richer models ⇒ Harder decoding

Added complexity is both computational and implementational.

Tasks with challenging decoding problems:
- Speech Recognition
- Sequence Modeling (e.g. extensions to HMM/CRF)
- Parsing
- Machine Translation

Decoding:  y* = argmax_y f(y)
Non-Projective Dependency Parsing

[Figure: dependency parse of "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8" with crossing arcs]

Important problem in many languages.

Problem is NP-Hard for all but the simplest models.
Dual Decomposition

A classical technique for constructing decoding algorithms.

Solve complicated models

    y* = argmax_y f(y)

by decomposing into smaller problems.

Upshot: Can utilize a toolbox of combinatorial algorithms:
- Dynamic programming
- Minimum spanning tree
- Shortest path
- Min-cut
- ...
A Dual Decomposition Algorithm for Non-Projective Dependency Parsing

Simple - Uses basic combinatorial algorithms

Efficient - Faster than previously proposed algorithms

Strong Guarantees - Gives a certificate of optimality when exact;
solves 98% of examples exactly, even though the problem is NP-Hard

Widely Applicable - Similar techniques extend to other problems
Roadmap
Algorithm
Experiments
Derivation
Non-Projective Dependency Parsing

[Figure: dependency parse of "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8"]

- Starts at the root symbol *
- Each word has exactly one parent word
- Produces a tree structure (no cycles)
- Dependencies can cross
Algorithm Outline

[Figure: the Sibling Model and the Arc-Factored Model applied to "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8", combined via Dual Decomposition]
Arc-Factored

[Figure: dependency parse of "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8"]

f(y) = score(head = *0, mod = saw2) + score(saw2, John1)
     + score(saw2, movie4) + score(saw2, today5)
     + score(movie4, a3) + ...

e.g. score(*0, saw2) = log p(saw2 | *0)     (generative model)
or   score(*0, saw2) = w · φ(saw2, *0)      (CRF/perceptron model)

y* = argmax_y f(y)  ⇐  Minimum Spanning Tree Algorithm
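For concreteness, here is a minimal sketch (not the authors' code) of arc-factored decoding phrased as a maximum spanning arborescence problem, assuming the networkx library; the scorer is a hypothetical stand-in for a trained model:

import networkx as nx

def score(head: int, mod: int) -> float:
    # Hypothetical arc score, standing in for w · φ(head, mod).
    return 10.0 - abs(head - mod)  # toy preference for short arcs

def mst_decode(n: int):
    """Return a highest-scoring dependency tree over words 1..n,
    rooted at the artificial root node 0."""
    G = nx.DiGraph()
    for head in range(n + 1):
        for mod in range(1, n + 1):
            if head != mod:
                G.add_edge(head, mod, weight=score(head, mod))
    # Chu-Liu/Edmonds: maximum spanning arborescence of the arc-score graph.
    # Node 0 has no incoming arcs, so the result is rooted at 0.
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())

print(mst_decode(4))  # prints one maximum-weight tree as (head, mod) pairs

Any objective that is a sum of independent arc scores can be decoded this way, which is why the arc-factored subproblem stays tractable even for non-projective trees.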
Sibling Models

[Figure: dependency parse of "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8"]

f(y) = score(head = *0, prev = NULL, mod = saw2)
     + score(saw2, NULL, John1) + score(saw2, NULL, movie4)
     + score(saw2, movie4, today5) + ...

e.g. score(saw2, movie4, today5) = log p(today5 | saw2, movie4)
or   score(saw2, movie4, today5) = w · φ(saw2, movie4, today5)

y* = argmax_y f(y)  ⇐  NP-Hard
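To make the decomposition of f(y) concrete, here is a small sketch of sibling-model scoring; the tree encoding, the inside-out ordering on each side, and the score function are illustrative assumptions, not the authors' code:

NULL = None

def score(head, prev, mod) -> float:
    # Hypothetical sibling score, standing in for w · φ(head, prev, mod).
    return 0.0

def sibling_score(tree) -> float:
    """tree maps each head index to its modifiers, sorted by position.
    Each side of the head is scored inside-out, chaining consecutive
    siblings; the first modifier on each side has prev = NULL."""
    total = 0.0
    for head, mods in tree.items():
        left = [m for m in reversed(mods) if m < head]   # closest first
        right = [m for m in mods if m > head]
        for side in (left, right):
            prev = NULL
            for mod in side:
                total += score(head, prev, mod)
                prev = mod
    return total

# One plausible tree for the example sentence (hypothetical attachments):
# *0 -> saw2; saw2 -> John1, movie4, today5; movie4 -> a3, liked8;
# liked8 -> that6, he7
print(sibling_score({0: [2], 2: [1, 4, 5], 4: [3, 8], 8: [6, 7]}))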
Thought Experiment: Individual Decoding

[Figure: candidate modifier sets for the head word saw2 in "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8"]

score(saw2, NULL, John1) + score(saw2, NULL, movie4) + score(saw2, movie4, today5)
score(saw2, NULL, John1) + score(saw2, NULL, that6)
score(saw2, NULL, a3) + score(saw2, a3, he7)

2^(n−1) possibilities

Under the Sibling Model, we can solve for each word with Viterbi decoding.
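A sketch of that per-head subproblem: with sibling scores, the best right-side modifier sequence for a single head can be found by dynamic programming over the identity of the last sibling (the left side is symmetric); the scorer here is a hypothetical illustration:

def decode_head_right(head, n, score):
    """Best-scoring set of right-side modifiers for one head under a
    sibling model, via O(n^2) dynamic programming over the last sibling.
    `score(head, prev, mod)` is a hypothetical trained scorer."""
    best = {None: (0.0, [])}          # last modifier -> (score, modifiers)
    for mod in range(head + 1, n + 1):
        # Best way to make `mod` the next modifier after some earlier one
        # (or the first modifier, when prev is None).
        prev, (s, mods) = max(
            best.items(),
            key=lambda kv: kv[1][0] + score(head, kv[0], mod),
        )
        best[mod] = (s + score(head, prev, mod), mods + [mod])
    return max(best.values(), key=lambda v: v[0])

# Toy scorer (hypothetical): mild preference for words near the head.
sc = lambda h, p, m: 0.5 - 0.1 * (m - h)
print(decode_head_right(2, 8, sc))

Running this once per head word searches all 2^(n−1) modifier sets implicitly, which is the "individual decoding" subroutine the algorithm relies on.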
Thought Experiment Continued

[Figure: per-head solutions for every word of "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8", assembled into one structure]

Idea: Do individual decoding for each head word using dynamic programming.

If we're lucky, we'll end up with a valid final tree.

But we might violate some constraints.
Dual Decomposition Idea

                  No Constraints         Tree Constraints
Arc-Factored                             Minimum Spanning Tree
Sibling Model     Individual Decoding    Dual Decomposition
Dual Decomposition Structure

Goal:     y* = argmax_{y ∈ Y} f(y)

Rewrite:  argmax_{z ∈ Z, y ∈ Y}  f(z) + g(y)    such that z = y

where f is the Sibling model over Z (all possible structures), g is the Arc-Factored model over Y (valid trees), and z = y is the agreement constraint.
Algorithm Sketch

Set penalty weights equal to 0 for all edges.

For k = 1 to K:
    z(k) ← Decode (f(z) + penalty) by Individual Decoding
    y(k) ← Decode (g(y) − penalty) by Minimum Spanning Tree
    If y(k)(i,j) = z(k)(i,j) for all i,j: Return (y(k), z(k))
    Else: Update penalty weights based on y(k)(i,j) − z(k)(i,j)
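Putting the sketch into code: a minimal version of the loop, assuming hypothetical subroutines `sibling_decode(u)` (individual decoding with penalties u added) and `mst_decode(u)` (arc-factored MST decoding with penalties u subtracted), each returning a set of arcs {(head, mod), ...}:

from collections import defaultdict

def dual_decompose(sibling_decode, mst_decode, K=5000, alpha=1.0):
    u = defaultdict(float)          # penalties u(i, j), all start at 0
    y = None
    for k in range(1, K + 1):
        z = sibling_decode(u)       # argmax_z f(z) + sum_ij u(i,j) z(i,j)
        y = mst_decode(u)           # argmax_y g(y) - sum_ij u(i,j) y(i,j)
        if z == y:
            return y, True          # agreement: certificate of optimality
        # Subgradient step on the dual: u <- u - alpha * (z - y), per arc.
        for arc in z - y:
            u[arc] -= alpha         # arcs only in z become less attractive
        for arc in y - z:
            u[arc] += alpha         # arcs only in y become more attractive
    return y, False                 # no certificate: approximate answer

A fixed step size alpha is used here for simplicity; in practice a decaying step size is a common choice for subgradient methods.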
Individual Decoding

    z* = argmax_{z ∈ Z} ( f(z) + ∑_{i,j} u(i,j) z(i,j) )

Minimum Spanning Tree

    y* = argmax_{y ∈ Y} ( g(y) − ∑_{i,j} u(i,j) y(i,j) )

[Figure: both decoders run on "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8" as the penalties evolve]

Penalties: u(i,j) = 0 for all i,j initially.

Iteration 1:  u(8,1) = −1   u(4,6) = −1   u(2,6) = 1   u(8,7) = 1
Iteration 2:  u(8,1) = −1   u(4,6) = −2   u(2,6) = 2   u(8,7) = 1

Converged: y* = argmax_{y ∈ Y} f(y) + g(y)

Key: f(z) ⇐ Sibling Model; g(y) ⇐ Arc-Factored Model;
Z ⇐ No Constraints; Y ⇐ Tree Constraints;
y(i,j) = 1 if y contains dependency (i,j).
Guarantees

Theorem: If at any iteration y(k) = z(k), then (y(k), z(k)) is the global optimum.

In experiments, we find the global optimum on 98% of examples.

If we do not converge to a match, we can still return an approximate solution (more in the paper).
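Why agreement certifies optimality (a sketch, using the dual bound L(u) ≥ f(y*) + g(y*) derived later in the talk): when z(k) = y(k), the penalty terms cancel and the dual value is attained by a feasible solution:

\begin{align*}
L(u) &= \Big(f(z^{(k)}) + \sum_{i,j} u(i,j)\,z^{(k)}(i,j)\Big)
      + \Big(g(y^{(k)}) - \sum_{i,j} u(i,j)\,y^{(k)}(i,j)\Big) \\
     &= f(y^{(k)}) + g(y^{(k)}) \;\le\; f(y^*) + g(y^*) \;\le\; L(u),
\end{align*}

so all inequalities are tight and y(k) is the global optimum.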
Extensions

- Grandparent Models

  [Figure: dependency parse of the example sentence with grandparent relations]

  f(y) = ... + score(gp = *0, head = saw2, prev = movie4, mod = today5)

- Head Automata (Eisner, 2000)

  A generalization of sibling models: allow arbitrary automata as local scoring functions.
Roadmap
Algorithm
Experiments
Derivation
Experiments

Properties:
- Exactness
- Parsing Speed
- Parsing Accuracy
- Comparison to Individual Decoding
- Comparison to LP/ILP

Training:
- Averaged Perceptron (more details in paper)

Experiments on:
- CoNLL Datasets
- English Penn Treebank
- Czech Dependency Treebank
How often do we exactly solve the problem?

[Bar chart, y-axis 90-100%: Cze, Eng, Dan, Dut, Por, Slo, Swe, Tur]

- Percentage of examples where the dual decomposition finds an exact solution.
Parsing Speed

[Bar charts of sentences parsed per second for Cze, Eng, Dan, Dut, Por, Slo, Swe, Tur: sibling model (y-axis 0-50) and grandparent model (y-axis 0-25)]

- Number of sentences parsed per second
- Comparable to dynamic programming for projective parsing
Accuracy

        Arc-Factored   Prev Best   Grandparent
Dan     89.7           91.5        91.8
Dut     82.3           85.6        85.8
Por     90.7           92.1        93.0
Slo     82.4           85.6        86.2
Swe     88.9           90.6        91.4
Tur     75.7           76.4        77.6
Eng     90.1           —           92.5
Cze     84.4           —           87.3

Prev Best - Best reported results for the CoNLL-X data set, including:
- Approximate search (McDonald and Pereira, 2006)
- Loopy belief propagation (Smith and Eisner, 2008)
- (Integer) Linear Programming (Martins et al., 2009)
Comparison to Subproblems

[Bar chart, y-axis 88-93, on Eng: Individual decoding vs. MST vs. Dual decomposition]

F1 for dependency accuracy
Comparison to LP/ILP

Martins et al. (2009) propose two linear programming relaxations of non-projective dependency parsing, as well as an exact ILP:
- LP (1)
- LP (2)
- ILP

Use an LP/ILP solver for decoding.

We compare:
- Accuracy
- Exactness
- Speed

Both LP and dual decomposition methods use the same model, features, and weights w.
Comparison to LP/ILP: Accuracy

[Bar chart, y-axis 80-100: dependency accuracy for LP(1), LP(2), ILP, and Dual]

- All decoding methods have comparable accuracy
Comparison to LP/ILP: Exactness and Speed

[Bar charts for LP(1), LP(2), ILP, and Dual: percentage of examples with an exact solution (y-axis 80-100) and sentences parsed per second (y-axis 0-14)]
Roadmap
Algorithm
Experiments
Derivation
Deriving the Algorithm

Goal:
    y* = argmax_{y ∈ Y} f(y)

Rewrite:
    argmax_{z ∈ Z, y ∈ Y} f(z) + g(y)   s.t. z(i,j) = y(i,j) for all i,j

Lagrangian:
    L(u, y, z) = f(z) + g(y) + ∑_{i,j} u(i,j) (z(i,j) − y(i,j))

The dual problem is to find min_u L(u), where

    L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z)
         = max_{z ∈ Z} ( f(z) + ∑_{i,j} u(i,j) z(i,j) )
           + max_{y ∈ Y} ( g(y) − ∑_{i,j} u(i,j) y(i,j) )

The dual is an upper bound: L(u) ≥ f(z*) + g(y*) for any u.
A Subgradient Algorithm for Minimizing L(u)

L(u) = max_{z ∈ Z} ( f(z) + ∑_{i,j} u(i,j) z(i,j) ) + max_{y ∈ Y} ( g(y) − ∑_{i,j} u(i,j) y(i,j) )

L(u) is convex, but not differentiable. A subgradient of L(u) at u is a vector g_u such that for all v,

    L(v) ≥ L(u) + g_u · (v − u)

Subgradient methods use updates u′ = u − α g_u.

In fact, for our L(u), g_u(i,j) = z*(i,j) − y*(i,j).
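To check that final claim (a sketch): the maximizers z*, y* at u remain feasible, generally suboptimal, candidates inside each max at any other point v, so

\begin{align*}
L(v) &\ge \Big(f(z^*) + \sum_{i,j} v(i,j)\,z^*(i,j)\Big)
       + \Big(g(y^*) - \sum_{i,j} v(i,j)\,y^*(i,j)\Big) \\
     &= L(u) + \sum_{i,j} \big(v(i,j) - u(i,j)\big)\big(z^*(i,j) - y^*(i,j)\big)
      = L(u) + g_u \cdot (v - u),
\end{align*}

which is exactly the subgradient inequality.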
Related Work

- Methods that use general-purpose linear programming or integer linear programming solvers (Martins et al. 2009; Riedel and Clarke 2006; Roth and Yih 2005)
- Dual decomposition/Lagrangian relaxation in combinatorial optimization (Dantzig and Wolfe, 1960; Held and Karp, 1970; Fisher 1981)
- Dual decomposition for inference in MRFs (Komodakis et al., 2007; Wainwright et al., 2005)
- Methods that incorporate combinatorial solvers within loopy belief propagation (Duchi et al. 2007; Smith and Eisner 2008)
Summary

y* = argmax_y f(y)  ⇐  NP-Hard

[Figure: Sibling Model + Arc-Factored Model applied to "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8", combined via Dual Decomposition]
Other Applications

- Dual decomposition can be applied to other decoding problems.
- Rush et al. (2010) focuses on integrated dynamic programming algorithms:
  - Integrated Parsing and Tagging
  - Integrated Constituency and Dependency Parsing
Parsing and Tagging

y* = argmax_y f(y)  ⇐  Slow

[Figure: HMM Model (tag sequence "S N V D N A D N V" over "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8") + CFG Model (constituency tree with S, NP, VP, ADVP nodes), combined via Dual Decomposition]
Dependency and Constituency

y* = argmax_y f(y)  ⇐  Slow

[Figure: Dependency Model (dependency parse of "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8") + Lexicalized CFG (constituency tree), combined via Dual Decomposition]
Future Directions

There is much more to explore around dual decomposition in NLP.

- Known Techniques
  - Generalization to more than two models
  - K-best decoding
  - Approximate subgradient
  - Heuristic for branch-and-bound type search
- Possible NLP Applications
  - Machine Translation
  - Speech Recognition
  - "Loopy" Sequence Models
- Open Questions
  - Can we speed up subalgorithms when running repeatedly?
  - What are the trade-offs of different decompositions?
  - Are there better methods for optimizing the dual?
Appendix
Training the Model

[Figure: dependency parse of "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8"]

f(y) = ... + score(saw2, movie4, today5) + ...

- score(saw2, movie4, today5) = w · φ(saw2, movie4, today5)
- Weight vector w trained using the averaged perceptron.
- (More details in the paper.)
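As an illustration of the training loop (a sketch, not the authors' implementation), assuming a hypothetical sparse feature map `phi(x, y)` and a `decode(x, w)` routine such as the dual decomposition decoder above:

from collections import defaultdict

def averaged_perceptron(data, phi, decode, epochs=5):
    """Structured averaged perceptron: update toward gold features when
    the decoder's prediction is wrong, and return the averaged weights."""
    w = defaultdict(float)        # current weights
    w_sum = defaultdict(float)    # running sum for averaging
    t = 0
    for _ in range(epochs):
        for x, y_gold in data:
            t += 1
            y_hat = decode(x, w)
            if y_hat != y_gold:
                for feat, val in phi(x, y_gold).items():
                    w[feat] += val
                for feat, val in phi(x, y_hat).items():
                    w[feat] -= val
            # Naive averaging: accumulate every step (fine for a sketch;
            # efficient implementations use lazy/delayed updates).
            for feat, val in w.items():
                w_sum[feat] += val
    return {feat: val / t for feat, val in w_sum.items()}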
Early Stopping

[Line plot, y-axis 50-100%, x-axis 0-1000 (maximum number of dual decomposition iterations): curves for % validation UAS, % certificates, and % match with K=5000]
Caching

[Line plot, y-axis 0-30 (% of head automata recomputed), x-axis 0-5000 (iterations of dual decomposition): curves for % recomputed with grandparent+sibling (g+s) and sibling (sib) models, illustrating caching speed]