Efficient Bandit Algorithms for Online Multiclass Prediction
Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
ICML, July 2008
Motivation
Online web advertisement systems
User submits a query
System (the learner) places an ad
User either “clicks” or ignores
Goal: Maximize number of “clicks”
Modeling?
Not the common online learning setting -- if the user ignores the ad, we don't get told the "correct" ad
Not the common multi-armed bandit -- we are also provided with a query
Outline
Online Bandit Multi-class Categorization
Background: The Multi-class Perceptron
The Banditron
Analysis
Experiments
The Separable Case
Extensions and Open Problems
Online Bandit Multiclass Categorization
For t = 1, 2, ..., T:
  Receive $x^t \in \mathbb{R}^d$ (the query)
  Predict $\hat{y}^t \in \{1, \ldots, k\}$ (the ad)
  Pay $1[\hat{y}^t \neq y^t]$ (the click feedback)
  $y^t$ is not revealed
Linear Hypotheses
A hypothesis is a mapping $h : \mathbb{R}^d \to \{1, \ldots, k\}$
Linear hypothesis: there exists a $k \times d$ matrix $W$ such that $h(x) = \arg\max_r (W x)_r$
The Multiclass Perceptron
For t = 1, 2, ..., T:
  Receive $x^t \in \mathbb{R}^d$
  Predict $\hat{y}^t = \arg\max_r (W^t x^t)_r$
  Receive $y^t$
  Update: $W^{t+1} = W^t + U^t$, where $U^t$ is the $k \times d$ matrix that is zero everywhere except row $y^t$, which is set to $x^t$, and row $\hat{y}^t$, which is set to $-x^t$ (so $U^t = 0$ whenever $\hat{y}^t = y^t$)
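A minimal NumPy sketch of this update rule (labels taken as 0, ..., k-1; the function and variable names are my own, not from the talk). It updates only on mistakes, which is equivalent because $U^t$ vanishes when $\hat{y}^t = y^t$:

```python
import numpy as np

def multiclass_perceptron(examples, k, d):
    """Multiclass Perceptron over a stream of (x, y) pairs.

    examples: iterable of (x, y) with x a length-d vector and y in {0, ..., k-1}.
    Returns the final k x d weight matrix and the number of mistakes.
    """
    W = np.zeros((k, d))
    mistakes = 0
    for x, y in examples:
        y_hat = int(np.argmax(W @ x))   # predict the highest-scoring class
        if y_hat != y:                  # U^t is the zero matrix when y_hat == y
            mistakes += 1
            W[y] += x                   # +x^t in the row of the correct label
            W[y_hat] -= x               # -x^t in the row of the predicted label
    return W, mistakes
```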
Perceptron in the Bandit Setting
The same update matrix $U^t$ (with $x^t$ in row $y^t$ and $-x^t$ in row $\hat{y}^t$) cannot be formed here.
Problem: we are blind to the value of $y^t$
Solution: randomization can help!
Exploration
Explore: instead of predicting $\hat{y}^t$, guess some other label $\tilde{y}^t$
Suppose we get the feedback "correct", i.e. $\tilde{y}^t = y^t$
Then we know that $\hat{y}^t \neq y^t$ and that $y^t = \tilde{y}^t$
So we can update $W$ using the matrix $U^t$
Exploration vs. Exploitation
But if our current model is correct, i.e. $\hat{y}^t = y^t$
And we guess some other $\tilde{y}^t$
Then we both suffer a loss and do not know how to update $W$
In this case, it is better to Exploit the quality of the current model
We control the exploration-exploitation tradeoff using randomization
The Banditron
For t = 1, 2, ..., T:
  Receive $x^t \in \mathbb{R}^d$
  Set $\hat{y}^t = \arg\max_r (W^t x^t)_r$
  Define: $P(r) = (1 - \gamma)\,1[r = \hat{y}^t] + \frac{\gamma}{k}$
  Randomly sample $\tilde{y}^t$ according to $P$
  Predict $\tilde{y}^t$ and receive feedback $1[\tilde{y}^t = y^t]$
  Update: $W^{t+1} = W^t + \tilde{U}^t$, where $\tilde{U}^t$ is the $k \times d$ matrix that is zero everywhere except row $\tilde{y}^t$, which is set to $\frac{1[\tilde{y}^t = y^t]}{P(\tilde{y}^t)}\, x^t$, and row $\hat{y}^t$, which is set to $-x^t$
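A matching sketch of the Banditron loop under the same conventions (gamma, the seed, and the function name are illustrative choices, not from the talk):

```python
import numpy as np

def banditron(examples, k, d, gamma=0.05, seed=0):
    """Banditron sketch: the true label is used only to emit the feedback bit.

    examples: iterable of (x, y) with x a length-d vector and y in {0, ..., k-1}.
    gamma: exploration-exploitation parameter in (0, 1).
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((k, d))
    for x, y in examples:
        y_hat = int(np.argmax(W @ x))
        # P(r) = (1 - gamma) * 1[r == y_hat] + gamma / k
        P = np.full(k, gamma / k)
        P[y_hat] += 1.0 - gamma
        y_tilde = int(rng.choice(k, p=P))            # sample the label to show
        feedback = 1.0 if y_tilde == y else 0.0      # the only information revealed
        # \tilde{U}^t: importance-weighted +x^t in row y_tilde, -x^t in row y_hat
        W[y_tilde] += (feedback / P[y_tilde]) * x
        W[y_hat] -= x
    return W
```

Note that the exploited label $\hat{y}^t$ is still the one shown with probability at least $1 - \gamma$.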
The Banditron Expected Update
$\mathbb{E}[\tilde{U}^t] \;=\; \sum_r P(r) \cdot \Big(\text{the matrix with row } r \text{ set to } \tfrac{1[y^t = r]}{P(r)}\, x^t \text{ and row } \hat{y}^t \text{ set to } -x^t\Big) \;=\; U^t$

That is, the Banditron update is an unbiased estimate of the Perceptron update.
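Spelled out entrywise (a short verification using the definitions of $\tilde{U}^t$ and $P$ above):

\[
\mathbb{E}\big[\tilde{U}^t_{r,j}\big]
= \sum_{r'} P(r')\left(\frac{1[y^t = r']\,1[r = r']}{P(r')}\, x^t_j - 1[r = \hat{y}^t]\, x^t_j\right)
= 1[r = y^t]\, x^t_j - 1[r = \hat{y}^t]\, x^t_j
= U^t_{r,j}.
\]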
Analysis: The Hinge-Loss
$\ell^t(W) = \max_{r \neq y^t} \big[\,1 - (W x^t)_{y^t} + (W x^t)_r\,\big]_+ \;\geq\; 1[\hat{y}^t \neq y^t]$

[Figure: the hinge loss as a function of the score difference $(W x^t)_{y^t} - (W x^t)_r$.]

The separable case: there exists $W^\star$ with $\ell^t(W^\star) = 0$ for all $t$.
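For concreteness, a small helper that evaluates this loss for a single example (the function name and 0-indexed labels are my conventions):

```python
import numpy as np

def multiclass_hinge_loss(W, x, y):
    """max over r != y of [1 - (Wx)_y + (Wx)_r]_+ for a single example (x, y)."""
    scores = W @ x
    margins = 1.0 - scores[y] + np.delete(scores, y)   # one value per r != y
    return max(float(margins.max()), 0.0)              # [.]_+ clipping
```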
Mistake Bounds
Perceptron: $M \leq L + D + \sqrt{L D}$

Banditron: $\mathbb{E}[M] \leq L + \gamma T + 3 \max\!\left\{ \frac{k D}{\gamma},\, \sqrt{D \gamma T} \right\} + \sqrt{\frac{k D L}{\gamma}}$

Symbols:
  $M$ = # mistakes
  $L$ = competitor loss, $\sum_t \ell^t(W^\star)$
  $D$ = competitor margin, $\|W^\star\|_F^2$
  $k$ = # classes
  $T$ = # rounds
  $\gamma$ = exploration-exploitation parameter
Mistake Bounds (cont.)
(Symbols as in the table above.)

No noise ($L = 0$): Perceptron $D$, Banditron $\sqrt{kDT}$
Low noise ($L = O(\sqrt{kDT})$): Perceptron $\sqrt{kDT}$, Banditron $\sqrt{kDT}$
Noisy: Perceptron $L + T^{1/2}$, Banditron $L + T^{2/3}$
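As a quick sanity check of the no-noise row: setting $L = 0$ in the Banditron bound and choosing $\gamma = \sqrt{kD/T}$ (an illustrative choice of the exploration rate, not stated on the slide) gives

\[
\mathbb{E}[M] \;\le\; \gamma T + 3\max\!\left\{\frac{kD}{\gamma},\, \sqrt{D\gamma T}\right\}
\;=\; \sqrt{kDT} + 3\max\!\left\{\sqrt{kDT},\; k^{1/4} D^{3/4} T^{1/4}\right\}
\;=\; O\big(\sqrt{kDT}\big)
\]

once $T \ge D/k$, matching the $\sqrt{kDT}$ entry in the table.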
Experiments
Reuters RCV1
~700k documents
Bag-of-words (d ~ 350k)
4 labels {CCAT, ECAT, GCAT, MCAT}
Synthetic separable data set
9 classes, d = 400, one million instances
A simple simulation of text-document generation
Synthetic non-separable data set
separable + 5% label noise
Experimental Results -- Reuters
[Figure: log-log plot of error rate vs. number of examples for the Perceptron and the Banditron ($\gamma = 0.05$) on Reuters; the two curves have a similar slope.]
Experimental Results -- Separable Data
[Figure: log-log plot of error rate vs. number of examples on the separable data ($\gamma = 0.014$); the Perceptron curve has slope $-1$ and the Banditron curve has slope $-0.55$.]
Experimental Results -- 5% label noise
[Figure: log-log plot of error rate vs. number of examples with 5% label noise ($\gamma = 0.006$); the two curves level off at roughly 10% and 13% error.]
Exploration-Exploitation Parameter
[Figure: Banditron error rate as a function of the exploration-exploitation parameter $\gamma$ (from $10^{-3}$ to $1$), compared with the Perceptron, on Reuters and on the 5% label-noise data.]
The Separable Case
Halving:
  Discretized hypothesis space
  Predict by majority vote
  Remove "wrong" hypotheses
  Note: can be applied in the Bandit setting (a sketch follows this list)
  Mistake bound: $O(k^2 d \log(D d))$
  Using the JL lemma we can also obtain $O\!\big(k^2 D \log\!\tfrac{T+k}{\delta}\, \log(D)\big)$
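A minimal sketch of this bandit version of Halving over a finite (e.g. discretized) hypothesis class; representing hypotheses as plain Python callables and the function name are my own simplifications, and exact separability is assumed:

```python
from collections import Counter

def bandit_halving(hypotheses, examples):
    """Halving with bandit feedback over a finite hypothesis class.

    hypotheses: list of callables h(x) -> label; assumed to contain a perfect one.
    examples: iterable of (x, y); y is only used to produce the feedback bit.
    """
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in examples:
        votes = Counter(h(x) for h in version_space)
        y_pred = votes.most_common(1)[0][0]       # predict by majority vote
        if y_pred == y:
            # feedback "correct" reveals the true label: keep only agreeing hypotheses
            version_space = [h for h in version_space if h(x) == y_pred]
        else:
            mistakes += 1
            # feedback "wrong" only rules out y_pred: drop the hypotheses that voted for it
            version_space = [h for h in version_space if h(x) != y_pred]
    return mistakes, version_space
```

Since at least a $1/k$ fraction of the surviving hypotheses vote for the majority label, every mistake removes at least that fraction of the version space, which is roughly what drives the $O(k^2 d \log(D d))$ bound.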
Extensions and Open Problems
Label Ranking
Predicting a “label ranking”
How to interpret the feedback?
Multiplicative and Margin-based updates
Bandit versions of “Winnow” and “Passive-Aggressive”
Deterministic vs. Randomized strategies
Achievable rates?
Efficient algorithms for the separable case?