Second Order Learning
Koby Crammer, Department of Electrical Engineering
ECML PKDD 2013 Prague
Thanks
• Mark Dredze
• Alex Kulesza
• Avihai Mejer
• Edward Moroshko
• Francesco Orabona
• Fernando Pereira
• Yoram Singer
• Nina Vaitz
Tutorial Context
(Diagram: this tutorial sits at the intersection of Online Learning, Optimization Theory, SVMs, and Real-World Data.)
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
Online Learning (illustrative slides)
Examples arrive one at a time – Tyrannosaurus rex, Triceratops, Velociraptor – and the learner predicts a label for each before seeing the true one.
Formal Setting – Binary Classification
• Instances – images, sentences
• Labels – parse trees, names
• Prediction rule – linear prediction rules
• Loss – number of mistakes
Predictions
• Discrete predictions – hard to optimize
• Continuous predictions:
  – Label
  – Confidence
Loss Functions
• Natural loss – zero-one loss: $\ell_{0/1} = 1$ if $y \ne \hat{y}$, else $0$
• Real-valued-prediction losses:
  – Hinge loss: $\max(0,\, 1 - y\,\hat{y})$
  – Exponential loss (Boosting): $\exp(-y\,\hat{y})$
  – Log loss (Max Entropy, Boosting): $\log(1 + \exp(-y\,\hat{y}))$
Loss Functions
(Plot: the zero-one loss and the hinge loss as functions of the margin.)
Online Learning
Maintain a model M. For each round:
– Get instance x
– Predict label ŷ = M(x)
– Get true label y, suffer loss ℓ(y, ŷ)
– Update model M
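A minimal sketch of this loop in Python (illustrative only: the model object with predict/update methods and the stream of (x, y) pairs are assumptions, not part of the slides):

```python
def online_learning(model, stream):
    """Run the online protocol: predict, observe the true label, suffer loss, update."""
    mistakes = 0
    for x, y in stream:                 # get instance x, then its true label y
        y_hat = model.predict(x)        # predict label y_hat = M(x)
        mistakes += int(y_hat != y)     # suffer the zero-one loss l(y, y_hat)
        model.update(x, y)              # update the model M
    return mistakes
```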
Linear Classifiers (notation abuse)
• Any features
• W.l.o.g.
• Binary classifiers of the form $\hat{y} = \mathrm{sign}(w \cdot x)$
Linear Classifiers (cont.)
• Prediction: $\hat{y} = \mathrm{sign}(w \cdot x)$
• Confidence in prediction: $|w \cdot x|$
Linear Classifiers
(Diagram: an input instance x to be classified and the weight vector w of the classifier.)
Margin
• Margin of an example (x, y) with respect to the classifier w: $y\,(w \cdot x)$
• Note: the margin is positive iff the example is classified correctly
• The set is separable iff there exists a $w$ such that $y_i\,(w \cdot x_i) > 0$ for every example
Geometrical Interpretation
(Diagrams: examples with margin > 0 and margin >> 0 lie on the correct side of the separator, with small and large confidence; examples with margin < 0 and margin << 0 lie on the wrong side.)
Hinge Loss
(Plot.)
Why Online Learning?
• Fast
• Memory efficient – processes one example at a time
• Simple to implement
• Formal guarantees – mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive
• But: not as good as well-designed batch algorithms
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
The Perceptron Algorithm
• If no mistake: do nothing
• If mistake: update $w \leftarrow w + y\,x$
• Margin after the update: $y\,(w + y\,x) \cdot x = y\,(w \cdot x) + \|x\|^2$, i.e. the margin on this example increases
Rosenblatt, 1958
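A short Python sketch of the mistake-driven update above (names are illustrative; labels are assumed to be in {-1, +1}):

```python
import numpy as np

def perceptron(X, y, epochs=1):
    """Mistake-driven Perceptron: w <- w + y_i * x_i only when the margin is non-positive."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # mistake
                w = w + y_i * x_i           # margin on this example grows by ||x_i||^2
    return w
```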
Geometrical Interpretation
(Diagram of the Perceptron update.)
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
Gradient Descent
• Consider the batch problem: $\min_w \sum_i \ell(w; x_i, y_i)$
• Simple algorithm:
  – Initialize $w_1$
  – Iterate, for $t = 1, 2, \dots$
  – Compute the gradient $g_t = \nabla_w \sum_i \ell(w_t; x_i, y_i)$
  – Set $w_{t+1} = w_t - \eta_t\, g_t$
Stochastic Gradient Descent
• Consider the batch problem: $\min_w \sum_i \ell(w; x_i, y_i)$
• Simple algorithm:
  – Initialize $w_1$
  – Iterate, for $t = 1, 2, \dots$
  – Pick a random index $i$
  – Compute the gradient $g_t = \nabla_w \ell(w_t; x_i, y_i)$
  – Set $w_{t+1} = w_t - \eta_t\, g_t$
Stochastic Gradient Descent
• “Hinge” loss: $\ell(w; x, y) = \max(0,\, -y\,(w \cdot x))$
• The gradient: $-y\,x$ when $y\,(w \cdot x) \le 0$, and $0$ otherwise
• Simple algorithm:
  – Initialize $w_1$
  – Iterate, for $t = 1, 2, \dots$
  – Pick a random index $i$
  – If $y_i\,(w_t \cdot x_i) > 0$ then $w_{t+1} = w_t$, else $w_{t+1} = w_t + \eta\, y_i\, x_i$
The Perceptron is a stochastic gradient descent algorithm with a sum of “hinge” losses and a specific order of examples.
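A sketch of SGD on this “hinge” loss; with a unit step size each active step is exactly a Perceptron update (the step count and learning rate are illustrative assumptions):

```python
import numpy as np

def sgd_hinge(X, y, steps=1000, eta=1.0, seed=0):
    """SGD on the loss max(0, -y * (w . x)) over randomly drawn examples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(X))            # pick a random index
        if y[i] * np.dot(w, X[i]) <= 0:     # loss is active: (sub)gradient is -y_i * x_i
            w = w + eta * y[i] * X[i]
        # otherwise the gradient is zero and w is unchanged
    return w
```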
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
Motivation
• Perceptron: no guarantee on the margin after the update
• PA: enforce a minimal non-zero margin after the update
• In particular:
  – If the margin is large enough (at least one unit), do nothing
  – If the margin is less than one unit, update so that the margin after the update is exactly one unit
Input Space
(Diagram.)
Input Space vs. Version Space
• Input Space:
  – Points are input data
  – One constraint is induced by the weight vector
  – Primal space
  – Half-space = all input examples that are classified correctly by a given predictor (weight vector)
• Version Space:
  – Points are weight vectors
  – One constraint is induced by the input data
  – Dual space
  – Half-space = all predictors (weight vectors) that correctly classify a given input example
Weight Vector (Version) Space
The algorithm forces the weight vector to reside in this region.
Passive Step
Nothing to do: the weight vector already resides on the desired side.
Aggressive Step
The algorithm projects the weight vector onto the desired half-space.
Aggressive Update Step
• Set the new weight vector to be the solution of the following optimization problem:
  $w_{t+1} = \arg\min_w \tfrac{1}{2}\|w - w_t\|^2$ subject to $y_t\,(w \cdot x_t) \ge 1$
• Solution: $w_{t+1} = w_t + \tau_t\, y_t\, x_t$ with $\tau_t = \dfrac{\max\left(0,\, 1 - y_t\,(w_t \cdot x_t)\right)}{\|x_t\|^2}$
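A sketch of this closed-form passive-aggressive step in Python (unregularized PA; variable names are illustrative):

```python
import numpy as np

def pa_update(w, x, y):
    """Project w onto {v : y * (v . x) >= 1}; returns the updated weight vector."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss at the current w
    if loss == 0.0:
        return w                              # passive step: margin constraint already holds
    tau = loss / np.dot(x, x)                 # aggressive step: smallest change enforcing margin 1
    return w + tau * y * x
```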
Perceptron vs. PA
• Common update: $w_{t+1} = w_t + \tau_t\, y_t\, x_t$
• Perceptron: $\tau_t = 1$ (only on mistakes)
• Passive-Aggressive: $\tau_t = \dfrac{\max\left(0,\, 1 - y_t\,(w_t \cdot x_t)\right)}{\|x_t\|^2}$
Perceptron vs. PA
(Plots: the update step size as a function of the margin, with three regimes – error, no error with small margin, and no error with large margin.)
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
Geometrical Assumption
• All examples are bounded in a ball of radius R
46
Separability
• There exists a unit vector that classifies the data correctly with margin $\gamma > 0$
• Simple case: a set of positive points and a set of negative points with a separating hyperplane
• Bound is: (evaluated for this simple case)

Perceptron's Mistake Bound
• The number of mistakes the algorithm makes is bounded by $(R/\gamma)^2$
Geometrical Motivation
(Plots: the behaviour of SGD on such data.)
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
Second Order Perceptron
• Assume all inputs are given in advance
• Compute a “whitening” matrix from the inputs
• Run the Perceptron on the whitened data
• New “whitening” matrix (updated as examples arrive)
Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
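A sketch of the whitening preprocessing, assuming all inputs are available up front (the ridge term a·I is an assumption that keeps the matrix invertible):

```python
import numpy as np

def whiten(X, a=1.0):
    """Multiply every input by (a*I + sum_i x_i x_i^T)^(-1/2); the Perceptron is then run on the result."""
    d = X.shape[1]
    C = a * np.eye(d) + X.T @ X                     # regularized correlation matrix
    vals, vecs = np.linalg.eigh(C)                  # symmetric positive definite
    C_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return X @ C_inv_sqrt
```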
Second Order Perceptron
• Bound:
• Same simple case:
• Thus:
• Bound is:
Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
Second Order Perceptron
• If no mistake: do nothing
• If mistake: update $v \leftarrow v + y_t\, x_t$ and add $x_t x_t^\top$ to the correlation matrix
Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
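A sketch of the online version in Python; this simplified variant keeps only past mistake examples in the matrix (the original algorithm also folds in the current instance), and updates the inverse with the Sherman-Morrison formula. The parameter a and the names are illustrative:

```python
import numpy as np

def second_order_perceptron(X, y, a=1.0):
    """Predict with sign(v . A^{-1} x); on a mistake, v <- v + y*x and A <- A + x x^T."""
    d = X.shape[1]
    A_inv = np.eye(d) / a                # inverse of A = a*I + sum of x x^T over mistake rounds
    v = np.zeros(d)
    mistakes = 0
    for x, label in zip(X, y):
        if label * np.dot(v, A_inv @ x) <= 0:        # mistake
            mistakes += 1
            Ax = A_inv @ x                           # Sherman-Morrison rank-one update of A_inv
            A_inv = A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
            v = v + label * x
    return v, A_inv, mistakes
```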
SGD on whitened data (plots)
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
Span-based Update Rules
• The weight vector is a linear combination of the examples
• Per-feature form of the update: $w_f \leftarrow w_f + \eta\, y\, x_f$, where $x_f$ is the feature value of the input instance, $y$ is the target label (either -1 or 1), $\eta$ is the learning rate, and $w_f$ is the weight of feature $f$
• Two rate schedules (many, many others exist):
  – Perceptron algorithm, conservative: $\eta = 1$ on mistakes only
  – Passive-Aggressive: $\eta = \tau_t$ from the PA closed form
Sentiment Classification
• “Who needs this Simpsons book? You DOOOOOOOO. This is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended!”
Pang, Lee, Vaithyanathan, EMNLP 2002
Sentiment Classification
• Many positive reviews with the word best increase $w_{\text{best}}$
• Later, a negative review: “boring book – best if you want to sleep in seconds”
• A linear update will reduce both $w_{\text{best}}$ and $w_{\text{boring}}$
• But best appeared more often than boring, so the model knows more about best than about boring
• Better to reduce the two weights at different rates: reduce $w_{\text{boring}}$ more than $w_{\text{best}}$
Natural Language Processing
• Big datasets, large number of features
• Many features are only weakly correlated with target label
• Linear classifiers: features are associated with word-counts
• Heavy-tailed feature distribution
(Plot: feature counts vs. feature rank, showing a heavy-tailed distribution.)
New Prediction Models
• Gaussian distributions over weight vectors
• The covariance is either full or diagonal
• In NLP we have many features and use a diagonal covariance
62
Classification
• Given a new example $x$
• Stochastic:
  – Draw a weight vector $w \sim \mathcal{N}(\mu, \Sigma)$
  – Make a prediction with it
• Collective:
  – Average weight vector
  – Average margin
  – Average prediction
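A small Python sketch of the two prediction modes, with a diagonal covariance as used for NLP (the names and the diagonal choice are illustrative):

```python
import numpy as np

def predict_stochastic(mu, sigma_diag, x, rng):
    """Draw w ~ N(mu, diag(sigma_diag)) and predict with the drawn weight vector."""
    w = rng.normal(mu, np.sqrt(sigma_diag))
    return np.sign(w @ x)

def predict_collective(mu, x):
    """Predict with the mean weight vector (the sign of the average margin)."""
    return np.sign(mu @ x)
```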
The Margin is a Random Variable
• The signed margin $y\,(w \cdot x)$, with $w \sim \mathcal{N}(\mu, \Sigma)$, is a random 1-d Gaussian
• Thus:
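A standard reconstruction of this statement, using the notation above ($\Phi$ denotes the standard normal CDF):

```latex
\[
  y\,(w \cdot x) \;\sim\; \mathcal{N}\!\left( y\,(\mu \cdot x),\; x^\top \Sigma\, x \right)
  \quad\text{for } w \sim \mathcal{N}(\mu, \Sigma),
  \qquad
  \Pr\!\left[ y\,(w \cdot x) \ge 0 \right]
  \;=\; \Phi\!\left( \frac{y\,(\mu \cdot x)}{\sqrt{x^\top \Sigma\, x}} \right).
\]
```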
Linear Model vs. Distribution over Linear Models
(Diagram: an example and the mean weight vector.)
Weight Vector (Version) Space
The algorithm forces most of the probability mass of the weight vector to reside in this region.
Passive Step
Nothing to do: most of the weight vectors already classify the example correctly.
Aggressive Step
The algorithm projects the current Gaussian distribution onto the half-space: the mean is moved beyond the mistake line (large margin), and the covariance is shrunk in the direction of the input example.
Projection Update
• Vectors (aka PA): $w_{t+1} = \arg\min_w \tfrac12 \|w - w_t\|^2$ s.t. $y_t\,(w \cdot x_t) \ge 1$
• Distributions (new update): $(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu, \Sigma} \mathrm{D_{KL}}\!\left(\mathcal{N}(\mu,\Sigma) \,\|\, \mathcal{N}(\mu_t,\Sigma_t)\right)$ s.t. $\Pr_{w \sim \mathcal{N}(\mu,\Sigma)}\!\left[y_t\,(w \cdot x_t) \ge 0\right] \ge \eta$, where $\eta$ is the confidence parameter
Divergence
• The KL divergence above is a sum of two divergences of the parameters: a Mahalanobis distance between the means and a matrix Itakura-Saito divergence between the covariances
• Convex in both arguments simultaneously
Constraint
• Probabilistic constraint: $\Pr_{w \sim \mathcal{N}(\mu,\Sigma)}\!\left[y\,(w \cdot x) \ge 0\right] \ge \eta$
• Equivalent margin constraint: $y\,(\mu \cdot x) \ge \phi \sqrt{x^\top \Sigma\, x}$, where $\phi = \Phi^{-1}(\eta)$
• Convex in $\mu$, concave in $\Sigma$
• Solutions:
  – Linear approximation
  – Change variables to get a convex formulation
  – Relax (AROW)
Dredze, Crammer, Pereira. ICML 2008
Crammer, Dredze, Pereira. NIPS 2008
Crammer, Dredze, Kulesza. NIPS 2009
Convexity
• Change variables (e.g. parameterize the covariance by its matrix square root)
• Equivalent convex formulation
Crammer, Dredze, Pereira. NIPS 2008
AROW
• PA: $\min_w \tfrac12\|w - w_t\|^2 + C\,\ell_h\!\left(y_t,\, w \cdot x_t\right)$
• CW: KL projection onto the probabilistic margin constraint
• AROW relaxes the constraint into the objective:
  $\min_{\mu,\Sigma}\; \mathrm{D_{KL}}\!\left(\mathcal{N}(\mu,\Sigma) \,\|\, \mathcal{N}(\mu_t,\Sigma_t)\right) + \lambda_1\, \ell_h^2\!\left(y_t,\, \mu \cdot x_t\right) + \lambda_2\, x_t^\top \Sigma\, x_t$
• Similar update form as CW
Crammer, Dredze, Kulesza. NIPS 2009
The Update
• The optimization update can be solved analytically: $\mu_{t+1} = \mu_t + \alpha_t\, y_t\, \Sigma_t x_t$, $\;\Sigma_{t+1} = \Sigma_t - \beta_t\, \Sigma_t x_t x_t^\top \Sigma_t$
• The coefficients $\alpha_t$ and $\beta_t$ depend on the specific algorithm
Updates
• CW (linearization)
• CW (change of variables)
• AROW
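A sketch of the AROW step in this form (full covariance; r is the regularization parameter; a hedged reconstruction of the usual closed form rather than a copy of the slide):

```python
import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    """AROW: mu <- mu + alpha * y * Sigma x,  Sigma <- Sigma - beta * (Sigma x)(Sigma x)^T."""
    margin = y * np.dot(mu, x)
    conf = x @ Sigma @ x                     # variance of the margin on this example
    beta = 1.0 / (conf + r)
    alpha = max(0.0, 1.0 - margin) * beta    # hinge loss scaled by beta
    Sigma_x = Sigma @ x
    mu = mu + alpha * y * Sigma_x
    Sigma = Sigma - beta * np.outer(Sigma_x, Sigma_x)
    return mu, Sigma
```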
Per-feature Learning Rate
(Plots: the per-feature learning rate, and the reduction of the learning rate / eigenvalues of the covariance matrix.)
Diagonal Matrix
• Given a matrix $\Sigma$, define $\mathrm{diag}(\Sigma)$ to be only the diagonal part of the matrix
• Option 1: make the matrix diagonal
• Option 2: make the inverse diagonal
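A tiny sketch of both options (the random covariance is purely illustrative):

```python
import numpy as np

def diag_part(M):
    """Return only the diagonal part of a square matrix."""
    return np.diag(np.diag(M))

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + np.eye(4)                       # an example covariance matrix

Sigma_diag = diag_part(Sigma)                     # option 1: make the matrix diagonal
Sigma_inv_diag = diag_part(np.linalg.inv(Sigma))  # option 2: make the inverse diagonal
```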
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
(Back to) Stochastic Gradient Descent
• Consider the batch problem: $\min_w \sum_i \ell(w; x_i, y_i)$
• Simple algorithm:
  – Initialize $w_1$
  – Iterate, for $t = 1, 2, \dots$
  – Pick a random index $i$
  – Compute the gradient $g_t = \nabla_w \ell(w_t; x_i, y_i)$
  – Set $w_{t+1} = w_t - \eta_t\, g_t$
Adaptive Stochastic Gradient Descent
• Consider the batch problem: $\min_w \sum_i \ell(w; x_i, y_i)$
• Simple algorithm:
  – Initialize $w_1$ and $A_0$
  – Iterate, for $t = 1, 2, \dots$
  – Pick a random index $i$
  – Compute the gradient $g_t = \nabla_w \ell(w_t; x_i, y_i)$
  – Set $A_t = A_{t-1} + g_t g_t^\top$
  – Set $w_{t+1} = w_t - \eta\, A_t^{-1/2} g_t$
Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
Adaptive Stochastic Gradient Descent
• Very general! Can be used to solve with various regularizations
• The matrix A can be either full or diagonal
• Comes with convergence and regret bounds
• Similar performance to AROW
Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
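A sketch of diagonal AdaGrad applied to the hinge loss (the loss choice, learning rate, and epsilon are illustrative assumptions):

```python
import numpy as np

def adagrad_hinge(X, y, steps=1000, eta=0.1, eps=1e-8, seed=0):
    """Diagonal AdaGrad: per-feature step size eta / sqrt(accumulated squared gradient)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    G = np.zeros(X.shape[1])                 # running sum of squared gradients, per feature
    for _ in range(steps):
        i = rng.integers(len(X))
        if y[i] * np.dot(w, X[i]) < 1.0:     # hinge loss max(0, 1 - y w.x) is active
            g = -y[i] * X[i]                 # (sub)gradient
            G += g * g
            w -= eta * g / np.sqrt(G + eps)  # adaptive per-coordinate step
    return w
```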
Adaptive Stochastic Gradient Descent
(Plots: SGD vs. AdaGrad.)
Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
Kernels
Proof (sketch)
• Show that the weight vector can be written as a linear combination of the input examples, so the algorithm touches the data only through inner products
• By induction: the claim holds at initialization, and by the update rule each step adds a multiple of the current example, so the representation is preserved
• Thus the algorithm can be kernelized
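For the first-order updates the representation can be written explicitly; a hedged reconstruction (the second-order case is analogous, with the covariance expressed through outer products of the examples):

```latex
\[
  w_{t+1} = w_t + \tau_t\, y_t\, x_t,\; w_1 = 0
  \;\Longrightarrow\;
  w_{t+1} = \sum_{i \le t} \tau_i\, y_i\, x_i,
  \qquad
  w_{t+1} \cdot x = \sum_{i \le t} \tau_i\, y_i\, (x_i \cdot x)
                  = \sum_{i \le t} \tau_i\, y_i\, K(x_i, x).
\]
```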
92
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
Statistical Interpretation
• Margin constraint: $y\,(\mu \cdot x) \ge \phi \sqrt{x^\top \Sigma\, x}$
• Distribution over weight vectors: $w \sim \mathcal{N}(\mu, \Sigma)$
• Assume the input is corrupted with Gaussian noise
Statistical Interpretation
(Diagram: version space and input space, showing an input instance, the mean weight vector, the linear separator, and good vs. bad realizations of the noise.)
Mistake Bound
• For any reference weight vector, the number of mistakes made by AROW is upper bounded by a quantity that depends on:
  – the set of example indices with a mistake
  – the set of example indices with an update but not a mistake
Orabona and Crammer, NIPS 2010
Comment I
• Separable case and no updates:
where
98
Comment II
• For a large regularization parameter the bound becomes:
• When no updates are performed (other than on mistakes), the bound reduces to the Perceptron's
Bound for Diagonal Algorithm
• The number of mistakes is bounded by a quantity that is low when, for each feature, either the feature is rare or it is non-informative
• Exactly as in NLP …
Orabona and Crammer, NIPS 2010
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
Synthetic Data
• 20 features
• 2 informative (rotated, skewed Gaussian)
• 18 noisy
• Using a single feature is as good as a random prediction
Synthetic Data (cntd.)
Distribution after 50 examples (x1)
103
Synthetic Data (no noise)
(Plots comparing Perceptron, PA, SOP, CW-full, and CW-diag.)

Synthetic Data (10% noise)
(The same comparison with 10% noise.)
Outline
• Background: online learning + notation, Perceptron, stochastic gradient descent, passive-aggressive
• Second-Order Algorithms: second-order Perceptron, Confidence-Weighted and AROW, AdaGrad
• Properties: kernels, analysis
• Empirical Evaluation: synthetic and real data
Data
• Sentiment
  – Reviews from 6 Amazon domains (Blitzer et al.)
  – Classify a product review as either positive or negative
• Reuters, pairs of labels
  – Three divisions: Insurance (Life vs. Non-Life), Business Services (Banking vs. Financial), Retail Distribution (Specialist Stores vs. Mixed Retail)
  – Bag-of-words representation with binary features
• 20 Newsgroups, pairs of labels
  – Three divisions: comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, sci.electronics vs. sci.med, and talk.politics.guns vs. talk.politics.mideast
  – Bag-of-words representation with binary features
Experimental Design
• Online-to-batch:
  – Multiple passes over the training data
  – Evaluate on a separate test set after each pass
  – Compute error/accuracy
• Set parameters using held-out data
• 10-fold cross-validation
• ~2000 instances per problem
• Balanced class labels
Results vs. Online – Sentiment
• StdDev and Variance – always better than the baseline
• Variance – 5/6 significantly better

Results vs. Online – 20NG + Reuters
• StdDev and Variance – always better than the baseline
• Variance – 4/6 significantly better

Results vs. Batch – Sentiment
• Always better than the batch methods
• 3/6 significantly better

Results vs. Batch – 20NG + Reuters
• 5/6 better than the batch methods
• 3/5 significantly better, 1/1 significantly worse
Results – Sentiment
• CW is better (5/6 cases), statistically significant (4/6)
• CW benefits less from many passes
(Plots: accuracy vs. passes of training data, PA vs. CW.)
Results – Reuters + 20NG
• CW is better (5/6 cases), statistically significant (4/6)
• CW benefits less from many passes
(Plots: accuracy vs. passes of training data, PA vs. CW.)
Error Reduction by Multiple Passes
• PA benefits more from multiple passes (8/12)
• Amount of benefit is data dependent
Bayesian Logistic Regression vs. CW/AROW
• BLR: covariance and mean updates, based on the variational approximation (T. Jaakkola and M. Jordan, 1997)
• CW/AROW: covariance and mean updates; conceptually decoupled, with the mean update a function of the margin / hinge loss
Algorithms Summary
• 1st-order algorithms and their 2nd-order counterparts:
  – Perceptron → SOP
  – PA → CW + AROW
  – SGD → AdaGrad
  – Logistic Regression (LR) → Bayesian Logistic Regression
• Different motivation, similar algorithms
• All algorithms can be kernelized
• Work well for data that is NOT isotropic / symmetric
• State-of-the-art results in various domains
• Accompanied by theory