Efficient Decomposed Learning for Structured Prediction #icml2012

Transcript of Efficient Decomposed Learning for Structured Prediction #icml2012

Page 1: Efficient Decomposed Learning for Structured Prediction #icml2012

Efficient Decomposed Learning for Structured Prediction
Rajhans Samdani, Dan Roth (Illinois)
Presenter: Yoh Okuno

Page 2:

Abstract

• Structured learning is important for NLP and CV
• The enormous output space is often intractable
• Proposal: DecL (Decomposed Learning)
• DecL restricts the output space to a limited part
• Efficient and accurate, both in experiments and in theory

Page 3:

Introduction
• What is structured learning?
  – Predict output variables that mutually depend on each other
  – Problem: enormous (exponential) output space
• Applications: NLP, CV, and bioinformatics
  – Multi-label document classification (binary) [Crammer+ 02]
  – Information extraction (sequence) [Lafferty+ 01]
  – Dependency parsing (tree) [Koo+ 10]

Page 4:

Example: Conditional Random Fields
Output space [Lafferty+ 01]

Page 5:

Example: Markov Random Fields
Output space [Boykov+ 98]

Page 6:

Related Work
• There are two major approaches:
1. Global Learning (GL): exact but slow [Tsochantaridis+ 04]
  – Searches the entire output space during the learning phase
  – Often implemented with ILP (Integer Linear Programming)
2. Local Learning (LL): inaccurate but fast
  – Ignores the structure of the output for fast search

• DecL is exact under some assumptions, yet much faster than GL

Page 7:

Problem Setting
• Given training data: D = {(x1, y1), ..., (xm, ym)}
• The output y is represented as binary variables: y = {y1, ..., yn} ∈ {0, 1}^n
• The model f(x, y; w) is a linear combination of features
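Concretely, the linear model on this slide can be sketched as follows; the feature map `phi` and the example dimensions are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Hypothetical linear structured model: the score of output y for input x is
# a linear combination of joint features, f(x, y; w) = w . phi(x, y).

def phi(x, y):
    # Toy joint feature map: outer product of input and binary output, flattened.
    return np.outer(x, y).ravel()

def score(w, x, y):
    # f(x, y; w) = w^T phi(x, y)
    return float(w @ phi(x, y))

# Example: 3-dimensional input, 2 binary output variables.
x = np.array([1.0, 0.5, -1.0])
y = np.array([1, 0])
w = np.ones(6)
```

With these toy numbers, `phi(x, y)` is the 6-vector `[1, 0, 0.5, 0, -1, 0]`, so `score(w, x, y)` is 0.5.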

Page 8:

Structured SVM [Tsochantaridis+ 04]
• Minimize the loss function below:

  l(w) = Σ_{j=1}^{m} [ max_{y ∈ Y} ( f(x_j, y; w) + Δ(y_j, y) ) − f(x_j, y_j; w) ]

• The hinge loss generalized to multiple dimensions
• The regularization term is omitted for space
• See [Tokunaga 2011] for more information
• The Δ(y_j, y) term inside the max rewards incorrect outputs
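As a minimal sketch of what this loss costs to evaluate, the inner max can be computed by brute-force enumeration of the output space, which is only feasible for tiny n; the choice of Hamming distance for Δ and the toy scoring function are assumptions for illustration:

```python
import itertools

def hamming(y_gold, y):
    # Delta(y_gold, y): Hamming distance, one common choice of structured loss.
    return sum(a != b for a, b in zip(y_gold, y))

def structured_hinge(score, y_gold, n):
    # Global Learning's loss for one example: maximize f(y) + Delta(y_gold, y)
    # over the ENTIRE output space {0,1}^n, then subtract f(y_gold).
    # Enumerating 2^n outputs is exactly the exponential search that makes
    # GL slow; DecL will shrink this search space.
    best = max(score(y) + hamming(y_gold, y)
               for y in itertools.product([0, 1], repeat=n))
    return best - score(tuple(y_gold))

# Toy scoring function that prefers outputs with many 1s.
loss = structured_hinge(lambda y: sum(y), (1, 1, 0), 3)  # -> 2
```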

Page 9:

Figure 1: GL and DecL
• Search a neighborhood around the gold output rather than the entire output space

Page 10:

DecL: Decomposed Learning
• Define a neighborhood around the gold output:

  l(w) = Σ_{j=1}^{m} [ max_{y ∈ nbr(y_j)} ( f(x_j, y; w) + Δ(y_j, y) ) − f(x_j, y_j; w) ]

• Note: the prediction phase still needs a global search
• How can we define the neighborhood for learning?

Page 11:

Subgradient Descent for DecL
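The slide content is not in the transcript; one subgradient step on the DecL loss from page 10 can be sketched as below. The function names, signature, and toy feature map are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def decl_subgradient_step(w, phi, x, y_gold, neighborhood, delta, eta=0.1):
    # One subgradient step on the DecL loss for a single example:
    #   y* = argmax_{y in nbr(y_gold)}  w . phi(x, y) + Delta(y_gold, y)
    #   w  <- w - eta * (phi(x, y*) - phi(x, y_gold))
    # The argmax runs only over the finite set nbr(y_gold), not over {0,1}^n.
    # If y* == y_gold, the update is zero and w is unchanged.
    y_star = max(neighborhood,
                 key=lambda y: float(w @ phi(x, y)) + delta(y_gold, y))
    return w - eta * (phi(x, y_star) - phi(x, y_gold))

# Toy example: features are just the output bits, loss is Hamming distance.
phi = lambda x, y: np.asarray(y, dtype=float)
hamming = lambda yg, y: sum(a != b for a, b in zip(yg, y))
w = decl_subgradient_step(np.zeros(2), phi, None, (1, 0),
                          [(1, 0), (0, 0), (1, 1)], hamming)
```

Starting from w = 0, every neighbor scores 0, so the Δ term picks an incorrect neighbor and the step moves w toward the gold output's features.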

Page 12:

DecL-k: a Special Case of DecL
• Restrict the output space to k dimensions
  – Take all subsets of size k from the indices of y
  – The other dimensions are fixed to the gold output
• In general, domain knowledge can be used
  – Group coupled variables into the same groups
  – Complexity depends on the size of the decomposition
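The DecL-k neighborhood described above can be enumerated as follows; this is a sketch of the construction as stated on the slide, with the helper name `nbr_k` being an assumption:

```python
import itertools

def nbr_k(y_gold, k):
    # DecL-k neighborhood: all outputs that differ from the gold output only
    # on some subset of k indices; every other coordinate keeps its gold value.
    # Its size is at most C(n, k) * 2^k, polynomial in n for fixed k, versus
    # the full 2^n output space searched by Global Learning.
    n = len(y_gold)
    out = set()
    for subset in itertools.combinations(range(n), k):
        for values in itertools.product([0, 1], repeat=k):
            y = list(y_gold)
            for i, v in zip(subset, values):
                y[i] = v
            out.add(tuple(y))
    return out
```

For example, with gold output (0, 0, 0) and k = 2 the neighborhood contains 7 of the 8 possible outputs; only (1, 1, 1), which differs in all three coordinates, is excluded.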

Page 13:

Experiments on Synthetic Data
• Compared DecL, LL, and GL (oracle)
• Synthetic training data:
  – 10 binary outputs with random linear constraints
  – 20-dimensional input, 320 training examples
• Running time in seconds:

 

Page 14:
Page 15:

Multi-Label Document Classification
• Dataset: Reuters corpus
• Size: 6,000 documents and 30 labels
• DecL performs as well as GL and is 6x faster

Page 16:

Information Extraction: Sequence Tagging
• Data 1: citation recognition
  – Recognize author, title, etc. from citation text
• Data 2: real-estate advertisements
  – Recognize facilities, roommates, etc. from ads

Page 17:

Conclusion
• Structured learning has a tradeoff between speed and accuracy
• Decomposed Learning (DecL) splits the output space into small spaces for fast inference
• Fast and accurate on real-world datasets
• Theoretical guarantee of exact search under some assumptions (skipped)

Page 18:

References
• [Collins+ 02] Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms.
• [Lafferty+ 01] Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
• [Koo+ 10] Dual decomposition for parsing with non-projective head automata.
• [Boykov+ 98] Markov random fields with efficient approximations.
• [Tsochantaridis+ 04] Support vector machine learning for interdependent and structured output spaces.
• [Crammer+ 02] On the algorithmic implementation of multiclass kernel-based vector machines.