Bandit Learning with switching costs


Transcript of Bandit Learning with switching costs

Page 1: Bandit Learning with switching costs

Bandit Learning with switching costs

Jian Ding, University of Chicago

joint with: Ofer Dekel (MSR), Tomer Koren (Technion) and

Yuval Peres (MSR)

June 2016, Harvard University

Page 2: Bandit Learning with switching costs

Online Learning with k-Actions

- player (a.k.a. learner)
- k actions (a.k.a. arms, experts)
- adversary (a.k.a. environment)

[Figure: the adversary assigns a loss in [0,1] to each of the k actions, e.g. 0.9, 0.2, 0.6, 0.2, 0.2]


Page 5: Bandit Learning with switching costs

Round 1

- player (a.k.a. learner)
- k actions (a.k.a. arms, experts)
- adversary (a.k.a. environment)

[Figure: in round 1 the adversary assigns losses (0.9, 0.2, 0.6, 0.2, 0.2) to the k actions and the player picks one of them]


Page 7: Bandit Learning with switching costs

Finite-Action Online Learning

[Figure: a hardness spectrum running from "easy" to "unlearnable", with "hard" in between; problems get harder with less feedback and a more powerful adversary]

Goal: a complete characterization of "learning hardness".

Page 8: Bandit Learning with switching costs

Round t

- randomized player
- adversary

[Figure: the randomized player holds a distribution over the k actions while the adversary assigns a loss to each action]

Page 9: Bandit Learning with switching costs

Round t

Two Types of Adversaries

An adaptive adversary takes the player's past actions into account when setting loss values.

An oblivious adversary ignores the player's past actions when setting loss values.


Page 11: Bandit Learning with switching costs

Round t

Two Feedback Models

In the bandit feedback model, the player only sees the loss associated with his action (one number).

In the full feedback model, the player also sees the losses associated with the other actions (k numbers).

Page 12: Bandit Learning with switching costs

Examples

bandit feedback: display one of k news articles to maximize user clicks.

full feedback: invest in one stock each day.

Page 13: Bandit Learning with switching costs

More Formally

Setting: a T-round repeated game between a randomized player and a deterministic adaptive adversary.

Notation: the player's actions are X = {1, . . . , k}.

Before the game: the adversary chooses a sequence of loss functions f_1, . . . , f_T, where f_t : X^t → [0,1].

The game: for t = 1, . . . , T
- the player chooses a distribution µ_t over X and draws X_t ∼ µ_t
- the player suffers and observes the loss f_t(X_1, . . . , X_t)
- under full feedback, the player also observes f_t(X_1, . . . , X_{t−1}, x) for all x
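To make the protocol concrete, here is a minimal Python sketch of the T-round game. It is an illustration rather than anything from the talk; the `player` and `adversary` objects and their methods are hypothetical stand-ins for the player's strategy and the loss functions f_1, . . . , f_T.

```python
import random

def play_game(T, k, player, adversary, full_feedback=False):
    """Run the T-round game between a randomized player and an
    adaptive adversary over actions X = {0, ..., k-1}."""
    history, total_loss = [], 0.0
    for t in range(T):
        mu = player.distribution(t)                    # player picks mu_t over X
        x = random.choices(range(k), weights=mu)[0]    # draw X_t ~ mu_t
        history.append(x)
        # the adaptive adversary's loss may depend on the whole history
        loss = adversary.loss(t, history)              # f_t(X_1, ..., X_t)
        total_loss += loss
        if full_feedback:
            # losses the player would have suffered for every action x
            feedback = [adversary.loss(t, history[:-1] + [a]) for a in range(k)]
        else:
            feedback = loss                            # bandit feedback: one number
        player.update(t, x, feedback)
    return total_loss
```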

Page 14: Bandit Learning with switching costs

Adaptive vs. Oblivious

- Adaptive: f_t : X^t → [0,1] can be any function.
- Oblivious: the adversary chooses ℓ_1, . . . , ℓ_T, where ℓ_t : X → [0,1], and sets f_t(x_1, . . . , x_t) = ℓ_t(x_t).

[Diagram: oblivious adversaries as a subset of adaptive adversaries]

Page 15: Bandit Learning with switching costs

Loss, Regret

Definition: the player's expected cumulative loss is

E[∑_{t=1}^T f_t(X_1, . . . , X_t)].

Definition: the player's regret w.r.t. the best fixed action is

R(T) = E[∑_{t=1}^T f_t(X_1, . . . , X_t)] − min_{x∈X} ∑_{t=1}^T f_t(x, . . . , x).

Interpretation: R(T) = o(T) ⇒ the player gets better with time.
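As a small illustration (not from the slides), this Python helper computes the regret of one realized action sequence against the best fixed action, assuming each f_t can be queried on arbitrary prefixes:

```python
def regret(loss_fns, actions, k):
    """Regret for one realization: cumulative loss of the played actions
    minus the cumulative loss of the best single action played every round.

    loss_fns[t] is f_{t+1}, a function of the whole action prefix.
    """
    T = len(loss_fns)
    played = sum(loss_fns[t](actions[: t + 1]) for t in range(T))
    best_fixed = min(
        sum(loss_fns[t]([x] * (t + 1)) for t in range(T))   # f_t(x, ..., x)
        for x in range(k)
    )
    return played - best_fixed
```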


Page 17: Bandit Learning with switching costs

Minimax Regret

- Regret measures a specific player's performance.
- We want to measure the inherent difficulty of the problem.

Definition: the minimax regret R*(T) is the infimum over randomized player strategies of the supremum over adversary loss sequences of the resulting expected regret.

- R*(T) = Θ(√T) ⇒ the problem is easy.
- R*(T) = Θ(T) ⇒ the problem is unlearnable.

Page 18: Bandit Learning with switching costs

Full + Oblivious

- a.k.a. "Predicting with Expert Advice"
- Littlestone & Warmuth (1994), Freund & Schapire (1997)

The Multiplicative Weights Algorithm: sample X_t from µ_t, where

µ_t(i) ∝ exp(−γ ∑_{j=1}^{t−1} ℓ_j(i)).

Theorem: γ = 1/√T yields R(T) = O(√T log(k)).
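A minimal Python sketch of the multiplicative weights sampling rule above; this is an illustration rather than the speakers' code, and `losses[j]` is assumed to hold the full feedback vector (ℓ_j(1), . . . , ℓ_j(k)):

```python
import math
import random

def multiplicative_weights(losses, k, gamma):
    """Play T rounds of Predicting with Expert Advice.

    losses: list of T lists, losses[j][i] = loss of action i in round j.
    Returns the sequence of actions X_1, ..., X_T.
    """
    cum = [0.0] * k                                        # cumulative loss of each action
    actions = []
    for loss_t in losses:
        weights = [math.exp(-gamma * c) for c in cum]      # mu_t(i) ∝ exp(-gamma * sum_j l_j(i))
        actions.append(random.choices(range(k), weights=weights)[0])
        cum = [c + l for c, l in zip(cum, loss_t)]         # full feedback: observe every l_t(i)
    return actions
```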

Page 19: Bandit Learning with switching costs

Bandit + Oblivious

- a.k.a. "The Adversarial Multiarmed Bandit Problem"
- Auer, Cesa-Bianchi, Freund, Schapire (2002)

The EXP3 Algorithm: run the weighted majority algorithm with estimates of the full feedback vectors,

ℓ̂_t(i) = ℓ_t(i)/µ_t(i) if i = X_t, and 0 otherwise.

Theorem: E[ℓ̂_t(i)] = ℓ_t(i) ⇒ R(T) = O(√(Tk)).
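A matching Python sketch of EXP3 with the importance-weighted loss estimates; the `bandit_loss` callback standing in for the oblivious adversary is a hypothetical interface, not part of the talk:

```python
import math
import random

def exp3(T, k, gamma, bandit_loss):
    """Bandit feedback: only the loss of the chosen action is observed.

    bandit_loss(t, x) returns l_t(x) in [0,1] for the oblivious adversary.
    """
    cum_est = [0.0] * k                        # cumulative estimated losses
    for t in range(T):
        weights = [math.exp(-gamma * c) for c in cum_est]
        total = sum(weights)
        mu = [w / total for w in weights]      # mu_t
        x = random.choices(range(k), weights=mu)[0]
        loss = bandit_loss(t, x)               # one number (bandit feedback)
        cum_est[x] += loss / mu[x]             # unbiased estimate: E[l̂_t(i)] = l_t(i)
    return cum_est
```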

Page 20: Bandit Learning with switching costs

Adaptive

Obstacle (Arora, Dekel, Tewari 2012): R*(T) = Θ(T) in any feedback model.

Proof: w.l.o.g. assume µ_1(1) ≥ 1/k (some action has probability at least 1/k under µ_1; relabel it action 1). Define

f_t(x_1, . . . , x_t) = 1 if x_1 = 1, and 0 otherwise.

This loss sequence yields expected regret µ_1(1) · T ≥ T/k.

Page 21: Bandit Learning with switching costs

The Characterization (so far)

[Figure: a grid of feedback model (full, bandit) versus adversary (oblivious, adaptive); problems get harder with less feedback and a more powerful adversary]

- oblivious adversary: Θ(√T), easy (under both full and bandit feedback)
- adaptive adversary: Θ(T), unlearnable

- Boring.
- The feedback models seem to be equivalent (when k = 2, say).

Page 22: Bandit Learning with switching costs

Adding a Switching Cost

- The switching cost adversary chooses ℓ_1, . . . , ℓ_T, where ℓ_t : X → [0,1], and sets
  f_t(x_1, . . . , x_t) = ½ (ℓ_t(x_t) + 1[x_t ≠ x_{t−1}]).
- The "Follow the Lazy Leader" algorithm (Kalai-Vempala 05) guarantees R(T) = O(√T) under full information; see also "Shrinking the dartboard" (Geulen-Vocking-Winkler 10).

[Diagram: oblivious ⊂ switching cost ⊂ adaptive adversaries]
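To pin down the definition, a tiny Python helper implementing the switching-cost loss f_t; this is a sketch under the convention, assumed here, that no switch is charged in round 1:

```python
def switching_cost_loss(ell_t, x_t, x_prev):
    """f_t(x_1, ..., x_t) = (l_t(x_t) + 1[x_t != x_{t-1}]) / 2.

    ell_t: function giving the oblivious per-action loss l_t(x) in [0,1].
    x_prev: the previous action (None in round 1, so no switch is charged).
    """
    switch = 1.0 if (x_prev is not None and x_t != x_prev) else 0.0
    return 0.5 * (ell_t(x_t) + switch)
```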

Page 23: Bandit Learning with switching costs

m-Memory Adversary, Counterfactual Feedback

- The m-memory adversary defines loss functions that depend only on the m + 1 most recent actions:

f_t(x_1, . . . , x_t) = f_t(x_{t−m}, . . . , x_t).

A Third Feedback Model: in the counterfactual feedback model, the player receives the entire loss function f_t.

- Merhav et al. (2002) proved R(T) = O(T^{2/3}).
- Gyorgy & Neu (2011) improved this to R(T) = O(√T).

Page 24: Bandit Learning with switching costs

Adversaries and Feedbacks

[Diagram: adversary classes nested as oblivious ⊂ switching cost ⊂ m-memory ⊂ adaptive, crossed with the feedback models bandit, full, and counterfactual]

Page 25: Bandit Learning with switching costs

The Characterization (so far)

[Figure: a table of feedback model (counterfactual, full, bandit) versus adversary (oblivious, switching cost, m-memory, adaptive); problems get harder with less feedback and a more powerful adversary]

- Θ(√T): easy
- Θ(T): unlearnable (adaptive adversary; Arora, Dekel, Tewari 2012)
- Ω(√T), O(T^{2/3}): easy? hard? (bandit feedback with switching costs, previously open)
- Θ(T^{2/3}): hard (Cesa-Bianchi, Dekel, Shamir (2013), unbounded losses; Dekel, D., Koren, Peres (2013))


Page 29: Bandit Learning with switching costs

Bandit + Switching: Upper Bound

Algorithm:
- split the T rounds into T/B blocks of length B
- use EXP3 to choose one action x_j for each entire block
- the feedback to EXP3 is the average loss in the block

x1 x1 x1 x1 | x2 x2 x2 x2 | x3 x3 x3 x3 | x4 x4 x4 x4

Regret Analysis:

R(T) ≤ T/B [switches] + B · O(√(T/B)) [loss] = O(T/B + √(TB)).

This is minimized by selecting B = T^{1/3}, yielding regret R(T) = O(T^{2/3}).
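A Python sketch of this blocking reduction, reusing the hypothetical `bandit_loss` interface from the EXP3 sketch above; mini-batching is shown with the same simplified exponential-weights update:

```python
import math
import random

def blocked_exp3(T, k, B, gamma, bandit_loss):
    """Play each EXP3 decision for a whole block of B rounds and feed
    EXP3 the average loss of the block; at most T/B switches occur.
    """
    cum_est = [0.0] * k
    for block in range(T // B):
        weights = [math.exp(-gamma * c) for c in cum_est]
        total = sum(weights)
        mu = [w / total for w in weights]
        x = random.choices(range(k), weights=mu)[0]        # action for the whole block
        rounds = range(block * B, (block + 1) * B)
        avg_loss = sum(bandit_loss(t, x) for t in rounds) / B
        cum_est[x] += avg_loss / mu[x]                      # importance-weighted block feedback
    # choosing B ~ T**(1/3) balances T/B switches against the O(sqrt(T*B)) loss regret
    return cum_est
```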

Page 30: Bandit Learning with switching costs

Bandit + Switching: Lower Bound

Yao's Minimax Principle (1975): the expected regret of the best deterministic algorithm on a random loss sequence lower-bounds the expected regret of any randomized algorithm on the worst deterministic loss sequence.

Goal: find a random loss sequence for which all deterministic algorithms have expected regret Ω(T^{2/3}).

For simplicity, assume k = 2.

Page 31: Bandit Learning with switching costs

Bandit + Switching: Lower Bound

Cesa-Bianchi, Dekel, Shamir 2013: random walk construction

- Let (S_t) be a Gaussian random walk, and ε = 1/√T.
- Randomly choose an action and assign to it the loss sequence (S_t), and to the other action the loss sequence (S_t + ε).

Key: 1/ε² switches are required before determining which action is worse!

Drawback: unbounded loss functions. Is hard learning an artifact of unboundedness?

Page 32: Bandit Learning with switching costs

Multi-Scale Random Walk

Define the loss of action 1:

- Draw independent Gaussians ξ_1, . . . , ξ_T ∼ N(0, σ²).
- For each t, define a parent ρ(t) ∈ {0, . . . , t − 1}.
- Define recursively: L_0 = 1/2, L_t = L_ρ(t) + ξ_t.

[Figure: the tree induced by ρ on L_0, L_1, . . . , L_7, with each node L_t attached to its parent L_ρ(t) by an edge labelled ξ_t; the sequence L_1, . . . , L_T is the loss of action 1]
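A Python sketch of this recursive construction (illustrative only); `rho` can be any parent function with ρ(t) ∈ {0, . . . , t − 1}:

```python
import random

def multiscale_walk(T, sigma, rho):
    """Loss of action 1: L_0 = 1/2 and L_t = L_{rho(t)} + xi_t, with
    xi_t ~ N(0, sigma^2) independent. rho(t) must lie in {0, ..., t-1}.
    """
    L = [0.5]                                   # L_0
    for t in range(1, T + 1):
        xi_t = random.gauss(0.0, sigma)
        L.append(L[rho(t)] + xi_t)              # L_t = L_{rho(t)} + xi_t
    return L                                    # L[1], ..., L[T] are the losses
```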

Page 33: Bandit Learning with switching costs

Examples

- ρ(t) = t − gcd(t, 2^T): the multi-scale tree over L_0, L_1, . . . , L_7
- ρ(t) = 0: a star rooted at L_0 ("wide")
- ρ(t) = t − 1: a path L_0, L_1, . . . , L_7 ("deep")

Page 34: Bandit Learning with switching costs

The Second Action

Define the loss of action 2:

- Draw a random sign χ with Pr(χ = +1) = Pr(χ = −1) = 1/2.
- Define L′_t = L_t + χε, where ε = T^{−1/3}.

[Figure: the two loss sequences, L_t for action 1 and L′_t for action 2, plotted over t = 1, . . . , 18 and separated by the gap ε]

- Choosing the worse action Θ(T) times ⇒ R(T) = Ω(T^{2/3}).

Page 35: Bandit Learning with switching costs

The Information in One Sample

To avoid choosing the worse action Θ(T) times, the algorithm must identify the value of χ.

Fact: Q: how many samples are needed to estimate the mean of a Gaussian to accuracy ε? A: (σ/ε)².

[Figure: the tree over L_0, . . . , L_7 with the player's action sequence (2, 1, 1, 1, 2, 2, 2) written below; some edges give no information about χ, while others (the red edges) each give one sample]

How many red edges are there? The player needs at least (σ/ε)² = σ²T^{2/3} of them.


Page 39: Bandit Learning with switching costs

Counting the Information

Definition: width(ρ) is the maximum size of a vertical cut in the graph induced by ρ.

[Figure: the tree over L_0, . . . , L_7 with a vertical cut crossing 3 edges, so width(ρ) = 3]

Lemma: a switch contributes ≤ width(ρ) samples.

Page 40: Bandit Learning with switching costs

Depth

Definition: depth(ρ) is the length of the longest path in the graph induced by ρ.

[Figure: the tree over L_0, . . . , L_7 with its longest path highlighted]

The loss should remain bounded in [0,1] ⇒ set σ ∼ 1/depth(ρ).

Page 41: Bandit Learning with switching costs

Putting it All Together

- σ²T^{2/3} samples are needed
- each switch gives ≤ width(ρ) samples
- loss bounded in [0,1] ⇒ σ ∼ 1/depth(ρ)

Conclusion: the number of switches needed to determine the better action is ∼ T^{2/3} / (width(ρ) · depth(ρ)²).

Choose ρ(t) = t − gcd(t, 2^T).

Lemma: depth(ρ) ≤ log(T) and width(ρ) ≤ log(T) + 1.
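A small Python check of this choice of ρ (illustrative; `depth_and_width` computes the longest path and the largest vertical cut directly from the definitions above):

```python
from math import gcd, log2

def rho(t, T):
    """rho(t) = t - gcd(t, 2^T): strip the largest power of two dividing t."""
    return t - gcd(t, 2 ** T)

def depth_and_width(T):
    """Longest path length and largest vertical cut of the graph induced by rho."""
    parent = [0] + [rho(t, T) for t in range(1, T + 1)]
    d = [0] * (T + 1)
    for t in range(1, T + 1):
        d[t] = d[parent[t]] + 1                         # path length back to round 0
    width = max(sum(1 for t in range(s + 1, T + 1) if parent[t] <= s)
                for s in range(T))                      # edges crossing the cut at s + 1/2
    return max(d), width

print(depth_and_width(1024), log2(1024))                # both quantities are O(log T)
```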


Page 43: Bandit Learning with switching costs

Corollaries & Extensions

Corollary: exploration requires switching; e.g., EXP3 switches Θ(T) times.

Dependence on k: the minimax regret of the multi-armed bandit with switching costs is Θ(T^{2/3} k^{1/3}).

Implications for other models: the minimax regret of learning an adversarial deterministic MDP is Θ(T^{2/3}).

Page 44: Bandit Learning with switching costs

Summary

- A complete characterization of "learning hardness".
- There exist online learning problems that are hard yet learnable.
- Learning with bandit feedback can be strictly harder than learning with full feedback.
- Exploration requires extensive switching.

Page 45: Bandit Learning with switching costs

The End

[Figure: the hardness spectrum from "easy" to "unlearnable", with "hard" in between]