Efficient Decomposed Learning for Structured Prediction #icml2012

Transcript of Efficient Decomposed Learning for Structured Prediction #icml2012

Page 1: Efficient Decomposed Learning for Structured Prediction #icml2012

Efficient Decomposed Learning for Structured Prediction
Rajhans Samdani, Dan Roth (Illinois)
Presenter: Yoh Okuno

Page 2:

Abstract

• Structured learning is important for NLP and CV
• The enormous output space is often intractable
• Proposal: DecL (Decomposed Learning)
• DecL restricts the output space to a limited part
• Efficient and accurate, both in experiments and in theory

Page 3:

Introduction
• What is structured learning?
  – Predict output variables that mutually depend on each other
  – Problem: enormous (exponential) output space
• Applications: NLP, CV, and bioinformatics
  – Multi-label document classification (binary) [Crammer+ 02]
  – Information extraction (sequence) [Lafferty+ 01]
  – Dependency parsing (tree) [Koo+ 10]

Page 4:

Example: Conditional Random Fields
Output space [Lafferty+ 01]

Page 5:

Example: Markov Random Fields
Output space [Boykov+ 98]

Page 6:

Related Work
• There are two major approaches:
1. Global Learning (GL): exact but slow [Tsochantaridis+ 04]
  – Searches the entire output space during the learning phase
  – Often implemented with ILP (Integer Linear Programming)
2. Local Learning (LL): inaccurate but fast
  – Ignores the structure of the output for fast search

• DecL is exact under some assumptions, yet much faster than GL

Page 7:

Problem Setting
• Given training data: D = {(x1, y1), ..., (xm, ym)}
• The output y is represented as binary variables: y = {y1, ..., yn} ∈ {0, 1}^n
• The model f(x, y; w) is a linear combination of features
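Concretely, the linear model on this slide can be sketched as follows; the feature map `phi` and the example dimensions are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Hypothetical linear structured model: the score of output y for input x is
# a linear combination of joint features, f(x, y; w) = w . phi(x, y).

def phi(x, y):
    # Toy joint feature map: outer product of input and binary output, flattened.
    return np.outer(x, y).ravel()

def score(w, x, y):
    # f(x, y; w) = w^T phi(x, y)
    return float(w @ phi(x, y))

# Example: 3-dimensional input, 2 binary output variables.
x = np.array([1.0, 0.5, -1.0])
y = np.array([1, 0])
w = np.ones(6)
```

With these toy numbers, `phi(x, y)` is the 6-vector `[1, 0, 0.5, 0, -1, 0]`, so `score(w, x, y)` is 0.5.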

Page 8:

Structured SVM [Tsochantaridis+ 04]
• Minimize the loss function below:

  l(w) = Σ_{j=1}^{m} [ max_{y ∈ Y} ( f(x_j, y; w) + Δ(y_j, y) ) − f(x_j, y_j; w) ]

• The hinge loss generalized to multiple dimensions
• The regularization term is omitted for space
• See [Tokunaga 2011] for more information
• The Δ(y_j, y) term inside the max rewards incorrect outputs
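As a minimal sketch of what this loss costs to evaluate, the inner max can be computed by brute-force enumeration of the output space, which is only feasible for tiny n; the choice of Hamming distance for Δ and the toy scoring function are assumptions for illustration:

```python
import itertools

def hamming(y_gold, y):
    # Delta(y_gold, y): Hamming distance, one common choice of structured loss.
    return sum(a != b for a, b in zip(y_gold, y))

def structured_hinge(score, y_gold, n):
    # Global Learning's loss for one example: maximize f(y) + Delta(y_gold, y)
    # over the ENTIRE output space {0,1}^n, then subtract f(y_gold).
    # Enumerating 2^n outputs is exactly the exponential search that makes
    # GL slow; DecL will shrink this search space.
    best = max(score(y) + hamming(y_gold, y)
               for y in itertools.product([0, 1], repeat=n))
    return best - score(tuple(y_gold))

# Toy scoring function that prefers outputs with many 1s.
loss = structured_hinge(lambda y: sum(y), (1, 1, 0), 3)  # -> 2
```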

Page 9:

Figure 1: GL and DecL
• Search a neighborhood around the gold output rather than the entire output space

Page 10:

DecL: Decomposed Learning
• Define a neighborhood around the gold output:

  l(w) = Σ_{j=1}^{m} [ max_{y ∈ nbr(y_j)} ( f(x_j, y; w) + Δ(y_j, y) ) − f(x_j, y_j; w) ]

• Note: the prediction phase still needs a global search
• How can we define the neighborhood for learning?

Page 11:

Subgradient Descent for DecL
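The slide content is not in the transcript; one subgradient step on the DecL loss from page 10 can be sketched as below. The function names, signature, and toy feature map are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def decl_subgradient_step(w, phi, x, y_gold, neighborhood, delta, eta=0.1):
    # One subgradient step on the DecL loss for a single example:
    #   y* = argmax_{y in nbr(y_gold)}  w . phi(x, y) + Delta(y_gold, y)
    #   w  <- w - eta * (phi(x, y*) - phi(x, y_gold))
    # The argmax runs only over the finite set nbr(y_gold), not over {0,1}^n.
    # If y* == y_gold, the update is zero and w is unchanged.
    y_star = max(neighborhood,
                 key=lambda y: float(w @ phi(x, y)) + delta(y_gold, y))
    return w - eta * (phi(x, y_star) - phi(x, y_gold))

# Toy example: features are just the output bits, loss is Hamming distance.
phi = lambda x, y: np.asarray(y, dtype=float)
hamming = lambda yg, y: sum(a != b for a, b in zip(yg, y))
w = decl_subgradient_step(np.zeros(2), phi, None, (1, 0),
                          [(1, 0), (0, 0), (1, 1)], hamming)
```

Starting from w = 0, every neighbor scores 0, so the Δ term picks an incorrect neighbor and the step moves w toward the gold output's features.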

Page 12:

DecL-k: a Special Case of DecL
• Restrict the output space to k dimensions
  – Take all subsets of size k from the indices of y
  – The other dimensions are fixed to the gold output
• In general, domain knowledge can be used
  – Group coupled variables into the same groups
  – Complexity depends on the size of the decomposition
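The DecL-k neighborhood described above can be enumerated as follows; this is a sketch of the construction as stated on the slide, with the helper name `nbr_k` being an assumption:

```python
import itertools

def nbr_k(y_gold, k):
    # DecL-k neighborhood: all outputs that differ from the gold output only
    # on some subset of k indices; every other coordinate keeps its gold value.
    # Its size is at most C(n, k) * 2^k, polynomial in n for fixed k, versus
    # the full 2^n output space searched by Global Learning.
    n = len(y_gold)
    out = set()
    for subset in itertools.combinations(range(n), k):
        for values in itertools.product([0, 1], repeat=k):
            y = list(y_gold)
            for i, v in zip(subset, values):
                y[i] = v
            out.add(tuple(y))
    return out
```

For example, with gold output (0, 0, 0) and k = 2 the neighborhood contains 7 of the 8 possible outputs; only (1, 1, 1), which differs in all three coordinates, is excluded.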

Page 13:

Experiments on Synthetic Data
• Compared DecL, LL, and GL (oracle)
• Synthetic training data:
  – 10 binary outputs with random linear constraints
  – 20-dimensional input, 320 training examples
• Running time in seconds:

 

Page 14:
Page 15:

Multi-Label Document Classification
• Dataset: Reuters corpus
• Size: 6,000 documents and 30 labels
• DecL performs as well as GL and is 6x faster

Page 16:

Information Extraction: Sequence Tagging
• Data 1: citation recognition
  – Recognize author, title, etc. from citation text
• Data 2: real-estate advertisements
  – Recognize facilities, roommates, etc. from ads

Page 17:

Conclusion
• Structured learning has a tradeoff between speed and accuracy
• Decomposed Learning (DecL) splits the output space into small spaces for fast inference
• Fast and accurate on real-world datasets
• Theoretical guarantee of exact search under some assumptions (skipped)

Page 18:

References
• [Collins+ 02] Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms.
• [Lafferty+ 01] Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
• [Koo+ 10] Dual decomposition for parsing with non-projective head automata.
• [Boykov+ 98] Markov random fields with efficient approximations.
• [Tsochantaridis+ 04] Support vector machine learning for interdependent and structured output spaces.
• [Crammer+ 02] On the algorithmic implementation of multiclass kernel-based vector machines.