Bandit Algorithms
Tor Lattimore & Csaba Szepesvári


Bandits

Time: 1 2 3 4 5 6 7 8 9 10 11 12 (12 rounds in total)
Rewards observed so far – Left arm: $1, $0, $1, $1, $0; Right arm: $1, $0

Five rounds to go. Which arm would you play next?

Overview

• What are bandits, and why you should care
• Finite-armed stochastic bandits
• A brief intro to finite-armed adversarial bandits (if time)
• Break
• Contextual and linear bandits
• Summary and discussion
• Details for the core ideas, rather than a broad overview
• Plenty of references on where to find more
• Please ask questions!

What’s in a name? A tiny bit of history
First bandit algorithm proposed by Thompson (1933)

Bush and Mosteller (1953) were interested in how mice behaved in a T-maze

Why care about bandits?

1. Many applications
2. They isolate an important component of reinforcement learning: exploration-vs-exploitation
3. Rich and beautiful (we think) mathematically

Applications
• Clinical trials/dose discovery
• Recommendation systems (movies/news/etc)
• Advert placement
• A/B testing
• Network routing
• Dynamic pricing (eg., for Amazon products)
• Waiting problems (when to auto-logout your computer)
• Ranking (eg., for search)
• A component of game-playing algorithms (MCTS)
• Resource allocation
• A way of isolating one interesting part of reinforcement learning


Lots for you to do!

Finite-armed Bandits
• K actions
• n rounds
• In each round t the learner chooses an action

  A_t ∈ {1, 2, . . . , K} .

• Observes reward X_t ∼ P_{A_t}, where P_1, P_2, . . . , P_K are unknown distributions
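To make the interaction protocol concrete, here is a minimal Python sketch of a finite-armed bandit environment with unit-variance Gaussian rewards. The class name GaussianBandit and its interface are our own illustrative choices, not part of the tutorial.

```python
import numpy as np

class GaussianBandit:
    """K-armed bandit environment with unit-variance Gaussian rewards (illustrative sketch)."""

    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)  # the unknown P_1, ..., P_K are N(mu_i, 1)
        self.rng = rng if rng is not None else np.random.default_rng()

    @property
    def K(self):
        return len(self.means)

    def pull(self, arm):
        """Play arm A_t in {0, ..., K-1} and return the reward X_t ~ N(mu_arm, 1)."""
        return self.rng.normal(self.means[arm], 1.0)

# Interaction protocol: over n rounds the learner picks A_t and observes only X_t.
bandit = GaussianBandit([0.5, 0.0])   # a two-armed instance; the means are hidden from the learner
x = bandit.pull(0)                    # reward from the first arm
```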

Distributional assumptions
While P_1, P_2, . . . , P_K are not known in advance, we make some assumptions:

• P_i is Bernoulli with unknown bias µ_i ∈ [0, 1]
• P_i is Gaussian with unit variance and unknown mean µ_i ∈ R
• P_i is subgaussian
• P_i is supported on [0, 1]
• P_i has variance less than one
• ...

As usual, stronger assumptions lead to stronger bounds

This tutorial  All reward distributions are Gaussian (or subgaussian) with unit variance

What makes a bandit problem?

How to tell if your problem is a bandit problem?
Three core properties:

1. Sequentially taking actions of unknown quality
2. The feedback provides information about the quality of the chosen action
3. There is no state

Things are considerably easier if the problem is close to stationary, but it is not a defining feature of a bandit problem

Example: A/B testing
• A business wants to optimize their webpage
• Actions correspond to ‘A’ and ‘B’
• Users arrive at the webpage sequentially
• The algorithm chooses either ‘A’ or ‘B’
• Receives activity feedback (the reward)

Measuring performance – the regret

• Let µ_i be the mean reward of distribution P_i
• µ* = max_i µ_i is the maximum mean
• The regret is

  R_n = n µ* − E[ ∑_{t=1}^n X_t ]

• Policies for which the regret is sublinear are learning
• Of course we would like to make it as ‘small as possible’

Measuring performance – the regret

• A learner minimising the regret tries to collect as much reward as possible
• Sometimes you only care about finding the best action after n rounds
• Captured by the simple regret

  R_n^simple = E[ ∆_{A_n} ]

• Learners shooting for this objective are solving the pure exploration problem
• We don't focus on this here though

Measuring performance – the regret

R_n = n µ* − E[ ∑_{t=1}^n X_t ]

• The regret is an expectation
• It does not take risk into account

Measuring performance – the regret

Let ∆_i = µ* − µ_i be the suboptimality gap for the ith arm and T_i(n) be the number of times arm i is played over all n rounds.

Lemma  R_n = ∑_{i=1}^K ∆_i E[T_i(n)]

Proof  Let E_t[·] = E[· | A_1, X_1, . . . , X_{t−1}, A_t]. Then

R_n = n µ* − E[ ∑_{t=1}^n X_t ] = n µ* − ∑_{t=1}^n E[ E_t[X_t] ] = n µ* − ∑_{t=1}^n E[ µ_{A_t} ]
    = ∑_{t=1}^n E[ ∆_{A_t} ] = E[ ∑_{t=1}^n ∆_{A_t} ] = E[ ∑_{t=1}^n ∑_{i=1}^K 1(A_t = i) ∆_i ]
    = E[ ∑_{i=1}^K ∆_i ∑_{t=1}^n 1(A_t = i) ] = E[ ∑_{i=1}^K ∆_i T_i(n) ] = ∑_{i=1}^K ∆_i E[T_i(n)]
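The identity is easy to sanity-check numerically. Below is a small Monte Carlo sketch (our own toy example, not from the tutorial): a deliberately naive uniform policy on a three-armed Gaussian bandit, comparing the two sides of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([1.0, 0.5, 0.0])    # mu_1 >= mu_2 >= mu_3
gaps = means.max() - means           # Delta_i
n, runs = 200, 2000

total_reward = 0.0
pulls = np.zeros(len(means))
for _ in range(runs):
    for t in range(n):
        a = rng.integers(len(means))             # a naive uniformly random policy
        total_reward += rng.normal(means[a], 1.0)
        pulls[a] += 1

lhs = n * means.max() - total_reward / runs      # n*mu_star - E[sum_t X_t]
rhs = np.dot(gaps, pulls / runs)                 # sum_i Delta_i E[T_i(n)]
print(lhs, rhs)   # the two regret expressions agree up to Monte Carlo error
```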

A simple policy: Explore-Then-Commit

1 Choose each action m times
2 Find the empirically best action I ∈ {1, 2, . . . , K}
3 Choose A_t = I for all remaining rounds
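A minimal Python sketch of the three steps above on a unit-variance Gaussian bandit; the function name, the returned realised pseudo-regret and the example parameters are our own illustrative choices (the sketch assumes n ≥ mK).

```python
import numpy as np

def explore_then_commit(means, n, m, rng=None):
    """Explore-Then-Commit on a unit-variance Gaussian bandit (sketch; assumes n >= m*K)."""
    rng = rng if rng is not None else np.random.default_rng()
    K = len(means)
    sums = np.zeros(K)
    total = 0.0
    # 1. Choose each action m times (round-robin).
    for t in range(m * K):
        a = t % K
        x = rng.normal(means[a], 1.0)
        sums[a] += x
        total += x
    # 2. Find the empirically best action I.
    best = int(np.argmax(sums / m))
    # 3. Choose A_t = I for all remaining rounds.
    for _ in range(n - m * K):
        total += rng.normal(means[best], 1.0)
    return n * max(means) - total  # realised pseudo-regret of this run

# Example: two arms with gap Delta = 0.5, averaged over a few runs.
print(np.mean([explore_then_commit([0.5, 0.0], n=1000, m=50) for _ in range(200)]))
```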

In order to analyse this policy we need to bound the probability of committing to a suboptimal action

A Crash Course in Concentration
Let Z, Z_1, Z_2, . . . , Z_n be a sequence of independent and identically distributed random variables with mean µ ∈ R and variance σ² < ∞

  empirical mean = µ̂_n = (1/n) ∑_{t=1}^n Z_t

How close is µ̂_n to µ?
Classical statistics says:

1. (law of large numbers)  lim_{n→∞} µ̂_n = µ almost surely
2. (central limit theorem)  √n (µ̂_n − µ) →_d N(0, σ²)
3. (Chebyshev's inequality)  P(|µ̂_n − µ| ≥ ε) ≤ σ² / (n ε²)

We need something nonasymptotic and stronger than Chebyshev's.
Not possible without assumptions.

A random variable Z is σ-subgaussian if for all λ ∈ R,

  M_Z(λ) := E[exp(λZ)] ≤ exp(λ² σ² / 2)

Lemma  If Z, Z_1, . . . , Z_n are independent and σ-subgaussian, then
• aZ is |a|σ-subgaussian for any a ∈ R
• ∑_{t=1}^n Z_t is √n σ-subgaussian
• µ̂_n is n^{−1/2} σ-subgaussian
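A one-line check of the second bullet, using independence and the definition of σ-subgaussian:

  E[ exp( λ ∑_{t=1}^n Z_t ) ] = ∏_{t=1}^n E[ exp(λ Z_t) ] ≤ exp( n λ² σ² / 2 ) = exp( λ² (√n σ)² / 2 ),

so ∑_{t=1}^n Z_t is √n σ-subgaussian; combining with the first bullet (a = 1/n) gives that µ̂_n is n^{−1/2} σ-subgaussian.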

A Crash Course in Concentration
Theorem  If Z_1, . . . , Z_n are independent and σ-subgaussian, then

  P( µ̂_n ≥ √( 2 σ² log(1/δ) / n ) ) ≤ δ

Proof  We use Chernoff's method. Let ε > 0 and λ = εn/σ².

  P(µ̂_n ≥ ε) = P( exp(λ µ̂_n) ≥ exp(λ ε) )
             ≤ E[exp(λ µ̂_n)] exp(−λ ε)          (Markov's inequality)
             ≤ exp( σ² λ² / (2n) − λ ε )
             = exp( −n ε² / (2σ²) )

Setting ε = √( 2 σ² log(1/δ) / n ) makes the right-hand side equal to δ.
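A quick Monte Carlo sanity check of the theorem for standard Gaussian samples (σ = 1); the particular n, δ and trial count below are arbitrary choices of ours.

```python
import numpy as np

# Empirical check that P( mean >= sqrt(2 log(1/delta) / n) ) <= delta for standard
# Gaussian samples (sigma = 1). A sanity check by simulation, not a proof.
rng = np.random.default_rng(1)
n, delta, trials = 50, 0.05, 100_000
threshold = np.sqrt(2 * np.log(1 / delta) / n)
means = rng.normal(0.0, 1.0, size=(trials, n)).mean(axis=1)
print(np.mean(means >= threshold), "<=", delta)   # observed frequency vs the bound
```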

A Crash Course in Concentration
• Which distributions are σ-subgaussian? Gaussian, Bernoulli, bounded support.
• And not: exponential, power law
• Comparing Chebyshev's with the subgaussian bound:

  Chebyshev's:  √( σ² / (nδ) )        Subgaussian:  √( 2 σ² log(1/δ) / n )

• Typically δ ≪ 1/n in our use-cases

The results that follow hold when the distribution associated with each arm is 1-subgaussian

Analysing Explore-Then-Commit

• Standard convention  Assume µ_1 ≥ µ_2 ≥ · · · ≥ µ_K
• Algorithms are symmetric and do not exploit this fact
• Means that the first arm is optimal
• Remember, Explore-Then-Commit chooses each arm m times
• Then commits to the arm with the largest payoff
• We consider only K = 2

Analysing Explore-Then-Commit
Step 1  Let µ̂_i be the average reward of arm i after exploring.
The algorithm commits to the wrong arm if

  µ̂_2 ≥ µ̂_1  ⇔  µ̂_2 − µ_2 + µ_1 − µ̂_1 ≥ ∆

Observation  µ̂_2 − µ_2 + µ_1 − µ̂_1 is √(2/m)-subgaussian

Step 2  The regret is

R_n = E[ ∑_{t=1}^n ∆_{A_t} ] = E[ ∑_{t=1}^{2m} ∆_{A_t} ] + E[ ∑_{t=2m+1}^n ∆_{A_t} ]
    = m∆ + (n − 2m) ∆ · P(commit to the wrong arm)
    = m∆ + (n − 2m) ∆ · P( µ̂_2 − µ_2 + µ_1 − µ̂_1 ≥ ∆ )
    ≤ m∆ + n∆ exp( −m∆²/4 )

Analysing Explore-Then-Commit

R_n ≤ m∆ + n∆ exp(−m∆²/4)

Call the first term (A) = m∆ and the second term (B) = n∆ exp(−m∆²/4).
(A) is monotone increasing in m while (B) is monotone decreasing in m.
Exploration/Exploitation dilemma  Exploring too much (m large) makes (A) big, while exploring too little makes (B) large.
The bound is minimised by m = ⌈ (4/∆²) log(n∆²/4) ⌉, leading to

  R_n ≤ ∆ + (4/∆) log(n∆²/4) + 4/∆
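A tiny sketch evaluating the tuned commitment time m and the bound as reconstructed above; the helper name and the example values of n and ∆ are ours, and the bound expression should be read with that caveat.

```python
import numpy as np

def etc_tuning(n, gap):
    """Tuned m and regret bound from the slide above, as reconstructed (assumes n * gap**2 > 4)."""
    m = int(np.ceil(4 / gap**2 * np.log(n * gap**2 / 4)))
    bound = gap + (4 / gap) * np.log(n * gap**2 / 4) + 4 / gap
    return m, bound

print(etc_tuning(n=1000, gap=0.5))  # roughly m = 67, bound ~ 41.6
```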

Analysing Explore-Then-Commit
Last slide:  R_n ≤ ∆ + (4/∆) log(n∆²/4) + 4/∆

What happens when ∆ is very small?

  R_n ≤ min{ n∆,  ∆ + (4/∆) log(n∆²/4) + 4/∆ }

[Figure: the bound above plotted as a function of ∆ ∈ [0, 1]; vertical axis: Regret, 0 to 30.]

Analysing Explore-Then-Commit
Does this figure make sense? Why is the regret largest when ∆ is small, but not too small?

  R_n ≤ min{ n∆,  ∆ + (4/∆) log(n∆²/4) + 4/∆ }

[Figure: the same bound plotted as a function of ∆ ∈ [0, 1]; vertical axis: Regret, 0 to 30.]

Small ∆ makes identification hard, but the cost of failure is low.
Large ∆ makes the cost of failure high, but identification easy.
The worst case is when ∆ ≈ √(1/n), with R_n ≈ √n

Limitations of Explore-Then-Commit

• Need advance knowledge of the horizon n
• Optimal tuning depends on ∆
• Does not behave well with K > 2
• These issues can be overcome by using data to adapt the commitment time
• All variants of Explore-Then-Commit are at least a factor of 2 from being optimal
• Better approaches now exist, but Explore-Then-Commit is often a good place to start when analysing a bandit problem

Optimism principle

Informal illustration
Visiting a new region. Shall I try the local cuisine?
Optimist: Yes!
Pessimist: No!
Optimism leads to exploration, pessimism prevents it.
Exploration is necessary, but how much?

Optimism Principle

• Let µ̂_i(t) = (1/T_i(t)) ∑_{s=1}^t 1(A_s = i) X_s
• Formalise the intuition using confidence intervals
• Optimistic estimate of the mean of an arm = ‘largest value it could plausibly be’
• Suggests

  optimistic estimate = µ̂_i(t−1) + √( 2 log(1/δ) / T_i(t−1) )

• δ ∈ (0, 1) determines the level of optimism

Upper Confidence Bound Algorithm
1 Choose each action once
2 Choose the action maximising

  A_t = argmax_i  µ̂_i(t−1) + √( 2 log(t³) / T_i(t−1) )

3 Go to 2

Corresponds to δ = 1/t³. This is quite a conservative choice; more on this later.
The algorithm does not depend on the horizon n (it is anytime).
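A minimal Python sketch of the algorithm above on a unit-variance Gaussian bandit; the function name and the example parameters are ours.

```python
import numpy as np

def ucb(means, n, rng=None):
    """UCB with index mu_hat_i + sqrt(2 log(t^3) / T_i) on a unit-variance Gaussian bandit (sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    K = len(means)
    sums = np.zeros(K)
    counts = np.zeros(K)
    total = 0.0
    for t in range(1, n + 1):
        if t <= K:
            a = t - 1                                             # 1. choose each action once
        else:
            index = sums / counts + np.sqrt(2 * np.log(t**3) / counts)
            a = int(np.argmax(index))                             # 2. maximise the index
        x = rng.normal(means[a], 1.0)                             # observe X_t
        sums[a] += x
        counts[a] += 1
        total += x
    return n * max(means) - total  # realised pseudo-regret

print(np.mean([ucb([0.5, 0.0, 0.0], n=2000) for _ in range(20)]))
```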

Demonstration

Regret of UCB
Theorem  The regret of UCB is at most

  R_n = O( ∑_{i:∆_i>0} ( ∆_i + log(n)/∆_i ) )

Furthermore,

  R_n = O( √(K n log(n)) )

Bounds of the first kind are called problem dependent or instance dependent.
Bounds like the second are called distribution free or worst case.

UCB Analysis

Rewrite the regret:  R_n = ∑_{i=1}^K ∆_i E[T_i(n)]

Only need to show that E[T_i(n)] is not too large for suboptimal arms

UCB Analysis
Key insight  Arm i is only played if its index is larger than the index of the optimal arm.
Need to show two things:
(A) The index of the optimal arm is larger than its actual mean with high probability
(B) The index of suboptimal arms falls below the mean of the optimal arm after only a few plays

  γ_i(t−1) = µ̂_i(t−1) + √( 2 log(t³) / T_i(t−1) )      (the index of arm i in round t)

UCB Analysis Intuition

[Figure: confidence intervals for Arm 1 and Arm 2, showing the true means and the empirical means.]

UCB Analysis
To make this intuition a reality we decompose the ‘pull-count’:

E[T_i(n)] = E[ ∑_{t=1}^n 1(A_t = i) ] = ∑_{t=1}^n P(A_t = i)
          = ∑_{t=1}^n P( A_t = i and (γ_1(t−1) ≤ µ_1 or γ_i(t−1) ≥ µ_1) )
          ≤ ∑_{t=1}^n P( γ_1(t−1) ≤ µ_1 )                 [index of optimal arm too small?]
            + ∑_{t=1}^n P( A_t = i and γ_i(t−1) ≥ µ_1 )   [index of suboptimal arm large?]

UCB Analysis
We want to show that P(γ_1(t−1) ≤ µ_1) is small.
Tempting to use the concentration theorem...

  P(γ_1(t−1) ≤ µ_1) = P( µ̂_1(t−1) + √( 2 log(t³) / T_1(t−1) ) ≤ µ_1 )  ≤?  1/t³

What's wrong with this? T_1(t−1) is a random variable!
Writing µ̂_{1,s} for the average of the first s rewards from arm 1,

  P( µ̂_1(t−1) + √( 2 log(t³) / T_1(t−1) ) ≤ µ_1 )
    ≤ P( ∃ s < t : µ̂_{1,s} + √( 2 log(t³) / s ) ≤ µ_1 )
    ≤ ∑_{s=1}^{t−1} P( µ̂_{1,s} + √( 2 log(t³) / s ) ≤ µ_1 )
    ≤ ∑_{s=1}^{t−1} 1/t³  ≤  1/t² .

UCB Analysis
For the second term (note that 2 log(t³) = 6 log(t)),

∑_{t=1}^n P( A_t = i and γ_i(t−1) ≥ µ_1 )
  = E[ ∑_{t=1}^n 1( A_t = i and γ_i(t−1) ≥ µ_1 ) ]
  = E[ ∑_{t=1}^n 1( A_t = i and µ̂_i(t−1) + √( 6 log(t) / T_i(t−1) ) ≥ µ_1 ) ]
  ≤ E[ ∑_{t=1}^n 1( A_t = i and µ̂_i(t−1) + √( 6 log(n) / T_i(t−1) ) ≥ µ_1 ) ]
  ≤ E[ ∑_{s=1}^n 1( µ̂_{i,s} + √( 6 log(n) / s ) ≥ µ_1 ) ]
  = ∑_{s=1}^n P( µ̂_{i,s} + √( 6 log(n) / s ) ≥ µ_1 )

UCB Analysis
Let u = 24 log(n) / ∆_i². Then

∑_{s=1}^n P( µ̂_{i,s} + √( 6 log(n) / s ) ≥ µ_1 )
  ≤ u + ∑_{s=u+1}^n P( µ̂_{i,s} + √( 6 log(n) / s ) ≥ µ_1 )
  ≤ u + ∑_{s=u+1}^n P( µ̂_{i,s} ≥ µ_i + ∆_i / 2 )
  ≤ u + ∑_{s=u+1}^∞ exp( −s ∆_i² / 8 )
  ≤ 1 + u + 8/∆_i² .

UCB Analysis

Combining the two parts we have

  E[T_i(n)] ≤ 3 + 8/∆_i² + 24 log(n)/∆_i²

So the regret is bounded by

  R_n = ∑_{i:∆_i>0} ∆_i E[T_i(n)] ≤ ∑_{i:∆_i>0} ( 3∆_i + 8/∆_i + 24 log(n)/∆_i )

Distribution free bounds

Let ∆ > 0 be some constant to be chosen later.

  R_n = ∑_{i:∆_i>0} ∆_i E[T_i(n)] ≤ n∆ + ∑_{i:∆_i>∆} ∆_i E[T_i(n)]
      ≲ n∆ + ∑_{i:∆_i>∆} log(n)/∆_i ≤ n∆ + K log(n)/∆ ≲ √( n K log(n) )

where in the last line we tuned ∆ = √( K log(n)/n )

Improvements
• The constants in the algorithm/analysis can be improved quite significantly:

  A_t = argmax_i  µ̂_i(t−1) + √( 2 log(t) / T_i(t−1) )

• With this choice:

  lim_{n→∞} R_n / log(n) = ∑_{i:∆_i>0} 2/∆_i

• The distribution-free regret is also improvable:

  A_t = argmax_i  µ̂_i(t−1) + √( (4 / T_i(t−1)) log( 1 + t/(K T_i(t−1)) ) )

• With this index we save a log factor in the distribution-free bound:

  R_n = O( √(nK) )

Improvements

• Warning  Pushing the expected regret too hard results in high variance

Lower bounds

• Two kinds of lower bound: distribution free (worst case) and instance-dependent
• What could an instance-dependent lower bound look like?
• Algorithms that always choose a fixed action?

Worst case lower bound
Theorem  For every algorithm and every n and K ≤ n there exists a K-armed Gaussian bandit such that R_n ≥ √( (K − 1) n ) / 27

Proof sketch
• µ = (∆, 0, . . . , 0)
• i = argmin_{i>1} E_µ[T_i(n)]
• E[T_i(n)] ≤ n/(K − 1)
• µ′ = (∆, 0, . . . , 0, 2∆, 0, . . . , 0)  (the 2∆ in coordinate i)
• The two environments are indistinguishable if ∆ ≈ √(K/n)
• The algorithm suffers about n∆ regret on one of them

Instance-dependent lower bounds
An algorithm is consistent on a class of bandits E if R_n = o(n) for all bandits in E.

Theorem  If an algorithm is consistent for the class of Gaussian bandits, then

  lim inf_{n→∞} R_n / log(n) ≥ ∑_{i:∆_i>0} 2/∆_i

• Consistency rules out stupid algorithms like the algorithm that always chooses a fixed action
• Consistency is asymptotic, so it is not surprising that the lower bound we derive from it is asymptotic
• A non-asymptotic version of consistency leads to non-asymptotic lower bounds

What else is there?
• All kinds of variants of UCB for different noise models: Bernoulli, exponential families, heavy tails, Gaussian with unknown mean and variance, ...
• A twist on UCB that replaces classical confidence bounds with Bayesian confidence bounds – offers empirical improvements
• Thompson sampling: each round, sample a mean from the posterior for each arm and choose the arm with the largest sample
• All manner of twists on the setup: non-stationarity, delayed rewards, playing multiple arms each round, moving beyond expected regret (high-probability bounds)
• Different objectives: simple regret, measures of risk

The adversarial viewpoint
• Replace random rewards with an adversary
• At the start of the game the adversary secretly chooses losses y_1, y_2, . . . , y_n where y_t ∈ [0, 1]^K
• The learner chooses actions A_t and suffers loss y_{tA_t}
• The regret is

  R_n = E[ ∑_{t=1}^n y_{tA_t} ]  −  min_i ∑_{t=1}^n y_{ti}
        (learner's loss)            (loss of best arm)

• Mission  Make the regret small, regardless of the adversary
• There exists an algorithm such that

  R_n ≤ 2 √(Kn)

The adversarial viewpoint
• The trick is in the definition of regret
• The adversary cannot be too mean

  R_n = E[ ∑_{t=1}^n y_{tA_t} ]  −  min_i ∑_{t=1}^n y_{ti}
        (learner's loss)            (loss of best arm)

  y = ( 1 · · · 1  0 · · · 0 )
      ( 0 · · · 0  1 · · · 1 )

• The following alternative objective is hopeless:

  R′_n = E[ ∑_{t=1}^n y_{tA_t} ]  −  ∑_{t=1}^n min_i y_{ti}
         (learner's loss)            (loss of best sequence)

• Randomisation is crucial in adversarial bandits

Tackling the adversarial bandit
• The learner chooses a distribution P_t ∈ ∆_K over the K actions
• Samples A_t ∼ P_t
• Observes Y_t = y_{tA_t}
• The expected regret is

  R_n = max_i E[ ∑_{t=1}^n ( y_{tA_t} − y_{ti} ) ] = max_{p∈∆_K} E[ ∑_{t=1}^n ⟨P_t − p, y_t⟩ ]

• This looks a lot like online linear optimisation on a simplex
• Only y_t is not observed
• The idea is to find an unbiased estimator ŷ_t

Tackling the adversarial bandit

A simple estimator of y_t is the importance-weighted estimator

  ŷ_{ti} = 1(A_t = i) y_{ti} / P_{ti}

We can see that E[ ŷ_{ti} | A_1, Y_1, . . . , A_{t−1}, Y_{t−1} ] = y_{ti}

  R_n = max_{p∈∆_K} E[ ∑_{t=1}^n ⟨P_t − p, y_t⟩ ] = max_{p∈∆_K} E[ ∑_{t=1}^n ⟨P_t − p, ŷ_t⟩ ]

Now we have an online linear optimisation problem!
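A quick numerical check (a toy example of our own) that the importance-weighted estimates average to the true losses, even though only one coordinate of y_t is observed per round.

```python
import numpy as np

# Check that the importance-weighted estimator is unbiased: averaging
# hat_y_ti = 1(A_t = i) y_ti / P_ti over many rounds recovers y_t.
rng = np.random.default_rng(2)
P = np.array([0.2, 0.5, 0.3])     # learner's distribution P_t (kept fixed here)
y = np.array([0.9, 0.1, 0.4])     # adversary's losses y_t (only y[A_t] is ever observed)
T = 200_000

a = rng.choice(3, size=T, p=P)    # A_1, ..., A_T ~ P_t
est = np.zeros(3)
np.add.at(est, a, y[a] / P[a])    # accumulate the importance-weighted estimates
print(est / T, "vs", y)           # the averages approach y
```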

Tackling the adversarial bandit
Classic algorithm:

  P_t = argmin_{p∈∆_K}  η ∑_{s=1}^{t−1} ⟨p, ŷ_s⟩ + F(p)

where η > 0 is called the learning rate and F is the regulariser.

Theorem  If F(p) = ∑_i p_i log(p_i) − p_i is the negentropy regulariser, then

  ∑_{t=1}^n ⟨P_t − p, ŷ_t⟩ ≤ log(K)/η + (η/2) ∑_{t=1}^n ∑_{i=1}^K P_{ti} ŷ_{ti}²

Taking the expectation and using the definition of ŷ_t,

  R_n ≤ log(K)/η + (η/2) E[ ∑_{t=1}^n ∑_{i=1}^K P_{ti} ( 1(A_t = i) y_{ti} / P_{ti} )² ]
      ≤ log(K)/η + (η/2) E[ ∑_{t=1}^n ∑_{i=1}^K E_t[1(A_t = i)] / P_{ti} ]
      = log(K)/η + η n K / 2 = √( 2 n K log(K) ) ,

where the last equality uses the tuning η = √( 2 log(K) / (nK) ).
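With the negentropy regulariser the argmin above has the closed form P_{ti} ∝ exp(−η ∑_{s<t} ŷ_{si}), i.e. exponential weights (Exp3). Below is a minimal Python sketch with the tuning η = √(2 log(K)/(nK)); the function name and the toy adversary are our own.

```python
import numpy as np

def exp3(losses, rng=None):
    """Exponential weights with importance-weighted loss estimates (Exp3-style sketch).

    `losses` is an (n, K) array of adversarial losses in [0, 1], fixed in advance.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n, K = losses.shape
    eta = np.sqrt(2 * np.log(K) / (n * K))         # learning rate from the bound above
    L_hat = np.zeros(K)                            # cumulative loss estimates sum_s hat_y_s
    total_loss = 0.0
    for t in range(n):
        w = np.exp(-eta * (L_hat - L_hat.min()))   # subtract the min for numerical stability
        P = w / w.sum()                            # P_t proportional to exp(-eta * L_hat)
        a = rng.choice(K, p=P)                     # A_t ~ P_t
        total_loss += losses[t, a]                 # observe only Y_t = y_{t A_t}
        L_hat[a] += losses[t, a] / P[a]            # importance-weighted estimate
    return total_loss - losses.sum(axis=0).min()   # regret against the best fixed arm

# Example: a stochastic-looking adversary with one slightly better arm.
rng = np.random.default_rng(3)
losses = rng.uniform(0, 1, size=(5000, 5))
losses[:, 0] -= 0.05                               # arm 0 is better on average
print(exp3(np.clip(losses, 0, 1), rng))
```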

Adversarial bandits

• Instance-dependence?
• Moving beyond expected regret (high-probability bounds)
• Why bother with stochastic bandits?
• Best of both worlds? Bubeck and Slivkins (2012); Seldin and Lugosi (2017); Auer and Chiang (2016)
• Big myth  Adversarial bandits do not address nonstationarity

Resources

• Book by Bubeck and Cesa-Bianchi (2012)
• Book by Cesa-Bianchi and Lugosi (2006)
• The Bayesian books by Gittins et al. (2011) and Berry and Fristedt (1985). Both worth reading.
• Our online notes: http://banditalgs.com
• Notes by Aleksandrs Slivkins: http://slivkins.com/work/MAB-book.pdf
• We will soon release a 450 page book (“Bandit Algorithms”, to be published by Cambridge)

Historical notes

• The first paper on bandits is by Thompson (1933). He proposed an algorithm for two-armed Bernoulli bandits and ran some simulations by hand (Thompson sampling)
• Popularised enormously by Robbins (1952)
• Confidence bounds were first used by Lai and Robbins (1985) to derive an asymptotically optimal algorithm
• UCB by Katehakis and Robbins (1995) and Agrawal (1995). Finite-time analysis by Auer et al. (2002)
• Adversarial bandits: Auer et al. (1995)
• Minimax optimal algorithm by Audibert and Bubeck (2009)

References I
Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078.
Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of Conference on Learning Theory (COLT), pages 217–226.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322–331. IEEE.
Auer, P. and Chiang, C. (2016). An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 116–120.
Berry, D. and Fristedt, B. (1985). Bandit problems: sequential allocation of experiments. Chapman and Hall, London; New York.
Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated.
Bubeck, S. and Slivkins, A. (2012). The best of both worlds: Stochastic and adversarial bandits. In COLT, pages 42.1–42.23.
Bush, R. R. and Mosteller, F. (1953). A stochastic model with applications to learning. The Annals of Mathematical Statistics, pages 559–585.
Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.

References II

Gittins, J., Glazebrook, K., and Weber, R. (2011). Multi-armed bandit allocation indices. John Wiley & Sons.
Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584.
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.
Seldin, Y. and Lugosi, G. (2017). An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In COLT, pages 1743–1759.
Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Random concentration failure
Let X_1, X_2, . . . be a sequence of independent and identically distributed standard Gaussian random variables. For any fixed n we have

  P( ∑_{t=1}^n X_t ≥ √( 2 n log(1/δ) ) ) ≤ δ

We want to show this can fail if n is replaced by a random variable T.
The law of the iterated logarithm says that

  lim sup_{n→∞}  ( ∑_{t=1}^n X_t ) / √( 2 n log log(n) ) = 1  almost surely

Let T = min{ n : ∑_{t=1}^n X_t ≥ √( 2 n log(1/δ) ) }. Then P(T < ∞) = 1 and

  P( ∑_{t=1}^T X_t ≥ √( 2 T log(1/δ) ) ) = 1 .

Contradiction! (the fixed-n bound does still work if T is independent of X_1, X_2, . . . though)
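A small simulation sketch of the failure (our own illustration; δ, the horizon cap and the number of runs are arbitrary choices): for each fixed n the crossing probability is at most δ, yet the fraction of sample paths that ever cross the threshold within the horizon is much larger, and it tends to one as the horizon grows since P(T < ∞) = 1.

```python
import numpy as np

rng = np.random.default_rng(4)
delta, max_n, runs = 0.1, 100_000, 200
# Threshold sqrt(2 n log(1/delta)) for each n = 1, ..., max_n.
threshold = np.sqrt(2 * np.arange(1, max_n + 1) * np.log(1 / delta))
crossed = 0
for _ in range(runs):
    s = np.cumsum(rng.normal(size=max_n))   # partial sums S_1, ..., S_max_n
    if np.any(s >= threshold):              # did the walk ever beat the fixed-n bound?
        crossed += 1
print(crossed / runs)   # well above delta = 0.1; tends to 1 as max_n grows
```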