Efficient Decomposed Learning for Structured Prediction


Transcript of Efficient Decomposed Learning for Structured Prediction

Page 1: Efficient Decomposed Learning for Structured Prediction

Efficient Decomposed Learning for Structured Prediction

Rajhans Samdani, joint work with Dan Roth

University of Illinois at Urbana-Champaign


Page 2: Efficient Decomposed Learning for Structured Prediction

Structured Prediction

Structured prediction: predicting a structured output variable y based on the input variable x

The variables y = {y1, y2, …, yn} form a structure; the structure comes from interactions between the output variables through mutual correlations and constraints.

Such problems occur frequently in:

NLP – e.g. predicting the tree-structured parse of a sentence, predicting the entity-relation structure from a document
Computer vision – scene segmentation, body-part identification
Speech processing – capturing relations between phonemes
Computational Biology – protein folding and interactions between different sub-structures
Etc.


Page 3: Efficient Decomposed Learning for Structured Prediction

Example Problem: Information Extraction

Given citation text, extract author, booktitle, title, etc.
Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages…

Given ad text, extract features, size, neighborhood, etc.
Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines …

Structure is introduced by correlations between words, e.g. if treated as sequence tagging.

Structure is also introduced by declarative constraints that define the set of feasible assignments, e.g.:
The 'author' tokens are likely to appear together in a single block
A paper should have at most one 'title'


Page 4: Efficient Decomposed Learning for Structured Prediction

Example Problem: Body Part Identification

Count the number of people

Predict the body parts

Correlations: the positions of shoulders and heads are correlated; the positions of the torso and legs are correlated


Page 5: Efficient Decomposed Learning for Structured Prediction

Structured Prediction: Inference

Predict the variables in y = {y1, y2, …, yn} ∈ Y together to leverage dependencies (e.g. entity-relation, shoulders-head, information fields, document labels, etc.) between these variables.

Inference constitutes predicting the best scoring structure:

y* = argmax_{y ∈ Y} f(x, y) = argmax_{y ∈ Y} w · φ(x, y)

f(x, y) = w · φ(x, y) is called the scoring function; Y is the set of allowed structures, often specified by constraints; w are the weight parameters (to be estimated during learning); φ(x, y) are the features on the input-output pair.
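For a small number of binary outputs, this argmax can be spelled out by brute force. A minimal sketch, assuming a hypothetical feature map `phi` and feasibility test `feasible` standing in for Y (placeholders, not the paper's actual model):

```python
# A minimal brute-force sketch of the argmax above, assuming n binary output
# variables and a linear scoring function f(x, y) = w . phi(x, y).
# `phi` (feature map) and `feasible` (membership test for Y) are
# hypothetical placeholders.
import itertools
import numpy as np

def predict(x, w, phi, feasible, n):
    """Return the best-scoring feasible assignment y in {0,1}^n."""
    best_y, best_score = None, -np.inf
    for y in itertools.product([0, 1], repeat=n):
        if not feasible(y):                       # enforce y in Y
            continue
        score = float(np.dot(w, phi(x, y)))       # f(x, y) = w . phi(x, y)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```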

Page 6: Efficient Decomposed Learning for Structured Prediction

Structural Learning: Quick Overview

Consider a big monolithic structured prediction problem

Given labeled data pairs (xj, yj = {yj1, yj2, …, yjn}), how do we learn w and perform inference?


[Figure: output variables y1–y6 connected in a graph]

Page 7: Efficient Decomposed Learning for Structured Prediction

Learning w: Two Extreme Styles

Global Learning (GL): consider all the variables together
Collins'02; Taskar et al.'04; Tsochantiridis et al.'04

Local Learning (LL): ignore hard-to-learn structural aspects (e.g. global constraints); consider variables in isolation
Punyakanok et al.'05; Roth and Yih'05; Koo et al.'10…


[Figure: GL trains with inference over the full graph of y1–y6; LL trains on each variable in isolation]

LL+C: apply constraints, if available, only at test-time inference

Trade-off: GL is expensive; LL(+C) is inconsistent, since training ignores structure that is used at test time.

Page 8: Efficient Decomposed Learning for Structured Prediction

Our Contribution: Decomposed Learning

We consider learning with subsets of variables at a time

We give conditions under which this decomposed learning is actually identical to global learning and exhibit the advantage of our learning paradigm experimentally.


[Figure: DecL varies subsets of y1–y6 at a time, enumerating the assignments within each subset, e.g. {00, 01, 10, 11} for a pair of variables and {000, …, 111} for a triple]

Related work: Pseudolikelihood – Besag, 77; Piecewise Pseudolikelihood – Sutton and McCallum, 07; Pseudomax – Sontag et al, 10

Page 9: Efficient Decomposed Learning for Structured Prediction

Outline

Existing global structural learning algorithms

Decomposed Learning (DecL): efficient structural learning (intuition and formalization)

Theoretical properties of DecL

Experimental evaluation

Page 10: Efficient Decomposed Learning for Structured Prediction

Supervised Structural Learning

We focus on structural SVM style algorithms which learn w by minimizing regularized structured-hinge loss

Literature: Taskar et al.'04; Tsochantiridis et al.'04

min_w λ/2 ‖w‖² + Σj max_{y ∈ Y} [ f(xj, y; w) + Δ(yj, y) − f(xj, yj; w) ]

Structured hinge loss: the score of a non-ground-truth y, plus the loss-based margin Δ(yj, y), minus the score of the ground truth yj. Computing the max requires global inference over all the variables.
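For reference, a brute-force sketch of this objective, reusing the placeholder `phi` and `feasible` from the earlier inference sketch; `hamming` stands in for the loss-based margin Δ:

```python
# A sketch of the regularized structured hinge loss above, under the same
# brute-force setup as the earlier inference sketch. All names are
# illustrative placeholders.
import itertools
import numpy as np

def hamming(y_gold, y):
    return sum(a != b for a, b in zip(y_gold, y))

def ssvm_objective(w, examples, phi, feasible, n, lam=1.0):
    """examples: list of (x, y_gold) pairs; returns the regularized hinge loss."""
    obj = 0.5 * lam * float(np.dot(w, w))
    for x, y_gold in examples:
        # loss-augmented global inference: max over all feasible y
        best = max(float(np.dot(w, phi(x, y))) + hamming(y_gold, y)
                   for y in itertools.product([0, 1], repeat=n) if feasible(y))
        obj += best - float(np.dot(w, phi(x, y_gold)))
    return obj
```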

Page 11: Efficient Decomposed Learning for Structured Prediction

Limitations of Global Learning

Requires exact global inference as an intermediate step

Expressive models don't admit exact and efficient (poly-time) inference algorithms, e.g. HMMs with global constraints, arbitrary Pairwise Markov Networks

Hence Global Learning is expensive for expressive features (φ(x,y)) and constraints (y ∈ Y)

The problem is using inference as a black box during learning

Our proposal: change the inference-during-learning to inference over a smaller output space: Decomposed inference for learning


Page 12: Efficient Decomposed Learning for Structured Prediction

Outline

Existing structural learning algorithms

Decomposed Learning (DecL): efficient structural learning (intuition and formalization)

Theoretical properties of DecL

Experimental evaluation

Page 13: Efficient Decomposed Learning for Structured Prediction

Decomposed Structural Learning (DecL)

GENERAL IDEA: For (xj, yj), reduce the argmax inference from the intractable output space Y to a "neighborhood" around yj: nbr(yj) ⊆ Y

A small and tractable nbr(yj) ⇒ efficient learning

Use domain knowledge to create neighborhoods which preserve the structure of the problem

[Figure: nbr(y) ⊆ Y ⊆ {0,1}^n for the n outputs in y]

Page 14: Efficient Decomposed Learning for Structured Prediction

Neighborhoods via Decompositions

Generate nbr(yj) by varying a subset of the output variables, while fixing the rest of them to their gold labels in yj …

… and repeat the same for different subsets of the output variables

A decomposition is a collection of different (non-inclusive, possibly overlapping) sets of variables which vary together

Sj = {s1, …, sl | ∀i, si ⊆ {1,…,n}; ∀i,k, si ⊄ sk}

Inference could be exponential in the size of the sets; smaller set sizes yield efficient learning. Under some conditions, DecL with small set sizes is identical to Global Learning. (A neighborhood-generation sketch follows below.)

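A minimal sketch of generating nbr(yj) from a decomposition, assuming binary variables; `decomposition` is a list of index tuples, and filtering to the constrained set Y is omitted for brevity:

```python
# A minimal sketch of nbr(y_j): for each set s in the decomposition S_j,
# vary the variables indexed by s over all assignments while fixing the
# rest to their gold values in y_gold.
import itertools

def neighborhood(y_gold, decomposition):
    nbr = set()
    for s in decomposition:
        for vals in itertools.product([0, 1], repeat=len(s)):
            y = list(y_gold)
            for idx, v in zip(s, vals):
                y[idx] = v
            nbr.add(tuple(y))
    return nbr
```

DecL then runs the same loss-augmented max as GL, but over nbr(yj) instead of Y.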

Page 15: Efficient Decomposed Learning for Structured Prediction

Creating Decompositions

Allow different decompositions Sj for different training instances yj

Aim to get results close to doing exact inference: we need decompositions which yield exactness (next few slides)

Example: learning with decompositions in which all subsets of size k are considered, called DecL-k. DecL-1 is the same as Pseudomax (Sontag et al., 2010), which is similar to Pseudolikelihood (Besag, 1977) learning. (A DecL-k sketch follows at the end of this slide.)

In practice, decompositions should be based on domain knowledge – put highly coupled variables in the same set
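As a minimal sketch, the DecL-k decomposition is just the size-k subsets (hypothetical helper, pairing with the `neighborhood` sketch above):

```python
# DecL-k as a decomposition: all size-k subsets of the n output variables
# (k = 1 reproduces a Pseudomax-style neighborhood).
import itertools

def decl_k(n, k):
    return list(itertools.combinations(range(n), k))
```

Plugged into the `neighborhood` sketch, this gives at most C(n,k) · 2^k candidate assignments, polynomial in n for fixed k, versus 2^n for global inference.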


Page 16: Efficient Decomposed Learning for Structured Prediction

Outline

Existing structural learning algorithms

DecL: efficient decomposed structural learning (intuition and formalization)

Theoretical results: exactness

Experimental evaluation

Page 17: Efficient Decomposed Learning for Structured Prediction

Theoretical Results: Assume Separability

Ideally we want Decomposed Learning with decompositions having small sets to give the same results as Global Learning

For analyzing the equivalence between DecL and GL, we assume that the training data is separable

Separability: existence of a set of weights W* that satisfy
W* = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀ y ∈ Y}

(the score of the ground truth yj must beat the score of any non-ground-truth y by the loss-based margin Δ(yj, y))

Separating weights for DecL:
Wdecl = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀ y ∈ nbr(yj)}

Naturally: W* ⊆ Wdecl

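To make the definition concrete, a sketch of the Wdecl margin check for a single example (names hypothetical, carried over from the earlier sketches; `delta` is the loss-based margin):

```python
# A sketch checking whether a candidate w satisfies the W_decl constraints
# for one example: the gold score must beat every neighborhood assignment
# by the loss-based margin.
import numpy as np

def separates(w, x, y_gold, nbr, phi, delta):
    gold = float(np.dot(w, phi(x, y_gold)))
    return all(gold >= float(np.dot(w, phi(x, y))) + delta(y_gold, y)
               for y in nbr)
```

Replacing nbr(yj) with all of Y gives the corresponding check for W*; since W* imposes a superset of the constraints, W* ⊆ Wdecl is immediate.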

Page 18: Efficient Decomposed Learning for Structured Prediction

Theoretical Results: Exactness

The property we desire is Exactness: Wdecl = W*, as a property of the constraints, the ground truth yj, and the globally separating weights W*

Different from asymptotic consistency results of Pseudolikelihood/Pseudomax!

Exactness is much more useful – learning with DecL yields the same weights as GL

The main theorem in the paper provides a general exactness condition


Page 19: Efficient Decomposed Learning for Structured Prediction

One Example of Exactness: Pairwise Markov Networks

The scoring function is defined over a graph with edges E, with singleton (vertex) components and pairwise (edge) components:

f(x, y; w) = Σi φi(yi) + Σ(i,k)∈E φi,k(yi, yk)

Assume domain knowledge on W*: we know that for correct (separating) w, each pairwise component φi,k(·; w) is either

Submodular: φi,k(0,0) + φi,k(1,1) > φi,k(0,1) + φi,k(1,0), or

Supermodular: φi,k(0,0) + φi,k(1,1) < φi,k(0,1) + φi,k(1,0)
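A small sketch of this classification, given a pairwise component tabulated as a 2x2 score table (an illustrative helper, not the paper's code):

```python
# Classify a binary pairwise component phi_ik[a][b] according to the
# submodular/supermodular condition above.
def modularity(phi_ik):
    diag = phi_ik[0][0] + phi_ik[1][1]
    off = phi_ik[0][1] + phi_ik[1][0]
    if diag > off:
        return "submodular"
    if diag < off:
        return "supermodular"
    return "neither"
```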

Page 20: Efficient Decomposed Learning for Structured Prediction

Decomposition for PMNs

Define Ej = {(i,k) ∈ E : |yj,i − yj,k| = 1 if φi,k is submodular, and |yj,i − yj,k| = 0 if φi,k is supermodular}

Theorem: the Spair decomposition, consisting of the connected components of Ej, yields Exactness
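A sketch of the construction, under the Ej reading reconstructed above: keep the edges whose gold labels conflict with the potential's preference, then take connected components of that subgraph. The edge-to-polarity map is assumed to come from domain knowledge (e.g. via the `modularity` sketch); any variable outside every component can still be varied as a singleton set.

```python
def s_pair(edges, y_gold, modularity_of):
    # Edges where the gold labels conflict with the potential's preference.
    ej = [(i, k) for (i, k) in edges
          if (modularity_of[(i, k)] == "submodular") == (y_gold[i] != y_gold[k])]
    parent = {}
    def find(a):                          # union-find with path halving
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, k in ej:
        parent[find(i)] = find(k)         # union the edge's endpoints
    comps = {}
    for i, k in ej:
        comps.setdefault(find(i), set()).update((i, k))
    return [tuple(sorted(c)) for c in comps.values()]
```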

Page 21: Efficient Decomposed Learning for Structured Prediction

Outline

Existing structural learning algorithms

DecL: efficient decomposed structural learning (intuition and formalization)

Theoretical properties of DecL

Experimental evaluation

Page 22: Efficient Decomposed Learning for Structured Prediction

Experiments

Experimentally compare Decomposed Learning (DecL) to Global Learning (GL), Local Learning (LL), and Local Learning + Constraints (LL+C; constraints applied, if available, during test-time inference)

Study the robustness of DecL in conditions where our theoretical assumptions may not hold


Page 23: Efficient Decomposed Learning for Structured Prediction

Synthetic Experiments

Experiments on random synthetic data with 10 binary variables

Labels assigned with random singleton scoring functions and random linear constraints

Compared: Local Learning (LL) baselines; Global Learning (GL) and Decomposed Learning DecL-2 and DecL-3; DecL-1 (aka Pseudomax)

[Chart: average Hamming loss vs. number of training examples for each method]

Page 24: Efficient Decomposed Learning for Structured Prediction

Multi-label Document Classification

Experiments on multi-label document classification: documents carry multiple labels such as corn, crude, earn, grain, interest…

Modeled as a Pairwise Markov Network over a complete graph on the labels – singleton and pairwise components

LL – local learning baseline that ignores pairwise interactions


Page 25: Efficient Decomposed Learning for Structured Prediction

Results: Per Instance F1 and training time (hours)

[Chart: per-instance F1 scores (roughly 75–83) vs. training time in hours (0–140) for Local Learning (LL), Global Learning (GL), Dec. Learning-2 (DecL-2), and Dec. Learning-3 (DecL-3)]

Page 26: Efficient Decomposed Learning for Structured Prediction

Results: Per Instance F1 and training time (hours)

[Chart: per-instance F1 scores (roughly 50–85) vs. training time in hours (0–140) for Local Learning (LL), Global Learning (GL), Dec. Learning-2 (DecL-2), Dec. Learning-3 (DecL-3), and Dec. Learning-1 (DecL-1)]

Page 27: Efficient Decomposed Learning for Structured Prediction

Example Problem: Information Extraction

Given citation text, extract author, booktitle, title, etc.Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages….

Given ad text, extract features, size, neighborhood, etc.

Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines …

Constraints like: the 'title' tokens are likely to appear together in a single block; a paper should have at most one 'title'


Page 28: Efficient Decomposed Learning for Structured Prediction

Information Extraction: Modeling

Modeled as an HMM with additional constraints; the constraints make inference with the HMM hard

Local Learning (LL) in this case is the HMM with no constraints

Domain knowledge: the HMM transition matrix is likely to be diagonal heavy – a generalization of submodular pairwise potentials from Pairwise Markov Networks ⇒ use the decomposition Spair (a sketch of this check follows below)

Bottom line: DecL is 2 to 8 times faster than GL and gives the same accuracies
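A sketch of the diagonal-heavy check mentioned above, with the condition written as the multi-label analogue of submodularity; this exact form is an assumption, and the paper's precise condition may differ in detail:

```python
# "Diagonal heavy" test on a k x k transition score matrix T: every pair of
# diagonal entries should outscore the matching off-diagonal pair.
def diagonal_heavy(T):
    k = len(T)
    return all(T[a][a] + T[b][b] >= T[a][b] + T[b][a]
               for a in range(k) for b in range(k))
```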

Page 29: Efficient Decomposed Learning for Structured Prediction

Citation Info. Extraction: Accuracy and Training Time

[Chart: F1 scores (roughly 70–100) vs. training time in hours (0–45) for Cit-HMM/LL, Cit-LL+C, Cit-GL, and Cit-DecL]

Page 30: Efficient Decomposed Learning for Structured Prediction

Ads. Info. Extraction: Accuracy and Training Time

[Chart: F1 scores (roughly 70–82) vs. training time in hours (0–80) for Ads-HMM/LL, Ads-LL+C, Ads-GL, and Ads-DecL]

Page 31: Efficient Decomposed Learning for Structured Prediction

Take Home: Efficient Structural Learning with DecL

We presented Decomposed Learning (DecL): efficient learning by reducing the inference to a small output space

Exactness: Provided conditions for when DecL is provably identical to global structural learning (GL)

Experiments: DecL performs as well as GL on real-world data, at significantly lower cost (50%–90% reduction in training time)

QUESTIONS?
