Online Learning Algorithms
Transcript of Online Learning Algorithms
Outline
• Online learning framework
• Design principles of online learning algorithms (additive updates):
  Perceptron, Passive-Aggressive, and Confidence-Weighted classification
  Classification – binary, multi-class, and structured prediction
  Hypothesis averaging and regularization
• Multiplicative updates:
  Weighted Majority, Winnow, and connections to Gradient Descent (GD) and Exponentiated Gradient Descent (EGD)
Formal Setting – Classification
• Instances: images, sentences
• Labels: parse trees, names
• Prediction rule: linear prediction rule
• Loss: number of mistakes
Predictions
• Continuous predictions:
  Label: the sign of the prediction
  Confidence: the magnitude of the prediction
• Linear classifiers:
  Prediction: sign(w · x)
  Confidence: |w · x|
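As a minimal sketch of the linear prediction rule above (the weight vector and instance here are hypothetical, chosen only for illustration):

```python
import numpy as np

def predict(w, x):
    """Linear classifier: the label is the sign of the score w.x,
    the confidence is the score's magnitude."""
    score = float(np.dot(w, x))
    return (1 if score >= 0 else -1), abs(score)

# hypothetical weight vector and instance
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 1.0])
label, confidence = predict(w, x)   # score = 2.5
```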
Loss Functions
• Natural loss – zero-one loss: 1 if the predicted label differs from the true label, 0 otherwise
• Real-valued-prediction losses:
  Hinge loss: max(0, 1 − y(w · x))
  Exponential loss (boosting): exp(−y(w · x))
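The three losses above can be sketched directly, taking y ∈ {−1, +1} and a real-valued score as input:

```python
import math

def zero_one_loss(y, score):
    # 1 whenever the sign of the score disagrees with the label
    return 1 if y * score <= 0 else 0

def hinge_loss(y, score):
    # zero once the margin y*score reaches 1; linear penalty below that
    return max(0.0, 1.0 - y * score)

def exponential_loss(y, score):
    # used by boosting; decays exponentially with the margin
    return math.exp(-y * score)
```

Note that the hinge loss penalizes correct-but-low-confidence predictions (margin between 0 and 1), which the zero-one loss does not.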
Loss Functions
[Figure: zero-one loss and hinge loss plotted against the margin y(w · x); the hinge loss upper-bounds the zero-one loss.]
Online Framework
• Initialize classifier
• Algorithm works in rounds
• On each round the online algorithm:
  Receives an input instance
  Outputs a prediction
  Receives a feedback label
  Computes loss
  Updates the prediction rule
• Goal: suffer small cumulative loss
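The round-by-round protocol above can be sketched as a loop with a pluggable update rule; the perceptron-style rule and the two-example stream here are hypothetical illustrations:

```python
import numpy as np

def online_loop(stream, update, dim):
    """Online protocol: predict, receive the label, suffer loss, update.
    `update` is any rule mapping (w, x, y) -> new weight vector."""
    w = np.zeros(dim)                            # initialize classifier
    cumulative_loss = 0
    for x, y in stream:                          # rounds t = 1, 2, ...
        y_hat = 1 if np.dot(w, x) >= 0 else -1   # output a prediction
        cumulative_loss += int(y_hat != y)       # receive label, compute loss
        w = update(w, x, y)                      # update the prediction rule
    return w, cumulative_loss

# hypothetical mistake-driven rule plugged into the loop
perceptron = lambda w, x, y: w + y * x if y * np.dot(w, x) <= 0 else w
stream = [(np.array([1.0, 0.0]), 1), (np.array([-1.0, 0.0]), -1)]
w, loss = online_loop(stream, perceptron, dim=2)
```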
Margin
• Margin of an example (x, y) with respect to the classifier w: y(w · x)
• Note: the margin is positive iff the classifier predicts the correct label
• The set is separable iff there exists u such that every example has positive margin with respect to u
Geometrical Interpretation
[Figure: points on either side of a separating hyperplane, with margin > 0 and margin >> 0 on the correct side, margin < 0 and margin << 0 on the wrong side.]
Hinge Loss
Why Online Learning?
• Fast
• Memory efficient – processes one example at a time
• Simple to implement
• Formal guarantees – mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive
Update Rules
• Online algorithms are based on an update rule which defines the next weight vector from the current one (and possibly other information)
• Linear classifiers: find the new weight vector from the old one based on the input
• Some update rules:
  Perceptron (Rosenblatt)  ALMA (Gentile)  ROMMA (Li & Long)  NORMA (Kivinen et al.)
  MIRA (Crammer & Singer)  EG (Littlestone & Warmuth)  Bregman-based (Warmuth)  CW (Dredze et al.)
Design Principles of Algorithms
• If the learner suffers non-zero loss at any round, then we want to balance two goals:
  Corrective: change the weights enough so that we don't make this error again (1)
  Conservative: don't change the weights too much (2)
• How do we define "too much"?
Design Principles of Algorithms
• If we use Euclidean distance to measure the change between the old and new weights:
  Enforcing (1) and minimizing (2)
  e.g., the Perceptron; for squared loss, Widrow-Hoff (Least Mean Squares)
• Passive-Aggressive algorithms do exactly the same, except (1) is much stronger – we want a correct classification with a margin of at least 1
• Confidence-Weighted classifiers:
  Maintain a distribution over weight vectors
  (1) is the same as Passive-Aggressive, with a probabilistic notion of margin
  Change is measured by the KL divergence between the two distributions
Design Principles of Algorithms
• If we assume all weights are positive, we can use the (unnormalized) KL divergence to measure the change
  Multiplicative update, or the EG algorithm (Kivinen and Warmuth)
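A minimal sketch of an EG-style multiplicative step, assuming (for illustration only) squared loss and weights normalized to a probability simplex; the learning rate and example values are hypothetical:

```python
import numpy as np

def eg_update(w, x, y, eta=0.1):
    """Exponentiated-gradient style multiplicative update: positive weights
    are scaled by exponentials of the negative gradient, then renormalized."""
    grad = -2.0 * (y - np.dot(w, x)) * x   # gradient of (y - w.x)^2 w.r.t. w
    w = w * np.exp(-eta * grad)            # multiplicative (not additive) step
    return w / w.sum()                     # weights stay positive and sum to 1

w = np.ones(3) / 3
w = eg_update(w, np.array([1.0, 0.0, 0.0]), 1.0)
```

Because the step multiplies rather than adds, weights can never change sign, which is what lets the KL divergence measure the change.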
The Perceptron Algorithm
• If no mistake:
  Do nothing
• If mistake:
  Update by adding y x to the weight vector
• Margin after the update: the margin on (x, y) grows by the squared norm of x
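The mistake-driven update above can be sketched as follows (treating a zero margin as a mistake, as is conventional):

```python
import numpy as np

def perceptron_update(w, x, y):
    """Update only on a mistake: add y*x to w, otherwise do nothing."""
    if y * np.dot(w, x) <= 0:   # mistake (or zero margin)
        return w + y * x
    return w

w = perceptron_update(np.zeros(2), np.array([1.0, 2.0]), 1)
# after an update, the margin y*(w.x) on this example grows by ||x||^2
```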
Passive-Aggressive Algorithms
Passive-Aggressive: Motivation
• Perceptron: no guarantee of margin after the update
• PA: enforce a minimal non-zero margin after the update
• In particular:
  If the margin is large enough (≥ 1), then do nothing
  If the margin is less than unit, update such that the margin after the update is enforced to be unit
Aggressive Update Step
• Set the new weight vector to be the solution of the following optimization problem:
  minimize the squared Euclidean distance to the old weights (2), subject to a margin of at least 1 on the current example (1)
• Closed-form update: add τ y x to the weights,
  where τ is the hinge loss on the current example divided by the squared norm of x
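The closed-form step can be sketched as follows (the example values are hypothetical; this is the basic unconstrained-τ variant, without the slack-capped PA-I/PA-II step sizes):

```python
import numpy as np

def pa_update(w, x, y):
    """Closed-form PA step: the smallest change to w that yields margin >= 1."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss, goal (1)
    if loss > 0.0:
        tau = loss / np.dot(x, x)             # step size from the closed form
        w = w + tau * y * x                   # minimal Euclidean change, goal (2)
    return w

w = pa_update(np.zeros(2), np.array([1.0, 1.0]), 1)
```

Right after an aggressive step the margin on the triggering example is exactly 1, which is the constraint (1) made tight.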
Passive-Aggressive Update
Unrealizable Case
Confidence-Weighted Classification
Confidence-Weighted Classification: Motivation
• Many positive reviews with the word "best" increase w_best
• Later, a negative review: "boring book – best if you want to sleep in seconds"
• A linear update will reduce both w_best and w_boring
• But "best" appeared more often than "boring"
• How to adjust different weights at different rates?
Update Rules
• The weight vector is a linear combination of the examples
• Two rate schedules (among others):
  Perceptron algorithm (conservative)
  Passive-Aggressive
Distributions in Version Space
[Figure: a distribution over weight vectors in version space; an example and the mean weight-vector are marked.]
Margin as a Random Variable
• If the weight vector is drawn from a Gaussian distribution, the signed margin y(w · x) is a Gaussian-distributed variable
• Thus the probability of a correct prediction has a closed form
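Spelling out the distribution of the margin under a Gaussian weight vector (a standard derivation; the notation μ, Σ for the mean and covariance follows the surrounding slides):

```latex
% w ~ N(mu, Sigma) implies the signed margin is Gaussian:
M = y\,(w \cdot x) \sim \mathcal{N}\!\left(y\,(\mu \cdot x),\; x^{\top}\Sigma\,x\right),
\qquad
\Pr[M \ge 0] = \Phi\!\left(\frac{y\,(\mu \cdot x)}{\sqrt{x^{\top}\Sigma\,x}}\right)
```

Here Φ is the standard normal CDF, so the probabilistic margin constraint becomes a condition on the mean margin in units of its standard deviation.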
PA-like Update
• PA: choose the new weights to minimize the Euclidean distance to the old weights, subject to a margin of at least 1
• New update: choose the new Gaussian to minimize the KL divergence to the old Gaussian, subject to the probabilistic margin constraint that the example is classified correctly with high probability
Weight Vector (Version) Space
[Figure: place most of the probability mass in the region that classifies the example correctly.]
Passive Step
Nothing to do – most weight vectors already classify the example correctly
Aggressive Step
• Project the current Gaussian distribution onto the half-space
• The covariance is shrunk in the direction of the new example
• The mean is moved past the mistake line (large margin)
Extensions: Multi-class and Structured Prediction
Multiclass Representation I
• k prototypes, one weight vector per class
• New instance x
• Compute the score of each class
• Prediction: the class achieving the highest score

Class r    Score
  1        -1.08
  2         1.66
  3         0.37
  4        -2.09
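The prototype-per-class prediction can be sketched as a single argmax; the weight matrix and instance here are hypothetical stand-ins:

```python
import numpy as np

def predict_multiclass(W, x):
    """Representation I: one prototype weight vector (row of W) per class;
    predict the class whose prototype gives the highest score."""
    scores = W @ x
    return int(np.argmax(scores)) + 1   # classes numbered from 1, as in the table

# hypothetical prototypes for k = 3 classes
W = np.array([[ 1.0, 0.0],
              [ 0.0, 1.0],
              [-1.0, 0.0]])
r = predict_multiclass(W, np.array([0.2, 0.9]))
```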
Multiclass Representation II
• Map all inputs and labels into a joint vector space F(x, y)
• Score labels by projecting the corresponding feature vector

Example (sequence tagging):
Estimated volume was a light 2.4 million ounces .
B I O B I I I I O
F(x, y) = (0 1 1 0 …)
Multiclass Representation II
• Predict the label with the highest score (inference)
• Naïve search is expensive if the set of possible labels is large:
  No. of labelings = 3^(no. of words)
  B I O B I I I I O
  Estimated volume was a light 2.4 million ounces .
• Efficient Viterbi decoding for sequences!
Two Representations
• Weight-vector per class (Representation I):
  Intuitive
  Improved algorithms
• Single weight-vector (Representation II):
  Generalizes Representation I
  Allows complex interactions between input and output
  e.g., F(x, 4) = (0 0 0 x 0)
Margin for Multi-Class
• Binary: y(w · x)
• Multi-class: the score of the correct label minus the highest score of any other label
Margin for Multi-Class
• But different mistakes cost differently (a loss function) – so use it!
• Margin scaled by the loss function
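One common way to write a loss-scaled margin requirement (as in Passive-Aggressive variants for structured prediction; the slide's exact scaling may differ):

```latex
w \cdot F(x, y) \;-\; w \cdot F(x, y') \;\ge\; \rho(y, y')
\quad \text{for all } y' \ne y
```

Here ρ(y, y′) is the cost of predicting y′ when the correct label is y, so costlier confusions must be separated by larger margins.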
Perceptron Multiclass Online Algorithm
• Initialize the classifier
• For each round:
  Receive an input instance
  Output a prediction
  Receive a feedback label
  Compute loss
  Update the prediction rule
PA Multiclass Online Algorithm
• Initialize the classifier
• For each round:
  Receive an input instance
  Output a prediction
  Receive a feedback label
  Compute loss
  Update the prediction rule
Regularization
• Key idea: if an online algorithm works well on a sequence of i.i.d. examples, then an ensemble of the online hypotheses should generalize well.
• Popular choices:
  the averaged hypothesis
  the majority vote
  use a validation set to make a choice
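The first choice above, the averaged hypothesis, can be sketched in a few lines; the two-vector history below is a hypothetical stand-in for the sequence of weight vectors an online run produces:

```python
import numpy as np

def averaged_hypothesis(history):
    """Average the sequence of online weight vectors w_1 .. w_T into a
    single linear classifier for batch use."""
    return np.mean(np.asarray(history), axis=0)

w_avg = averaged_hypothesis([np.zeros(2), np.array([2.0, 2.0])])
```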