Lecture 22
Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3
COS 402 – Machine Learning and Artificial Intelligence
Fall 2016
How to balance exploration and exploitation in reinforcement learning
• Exploration: try out each action/option to find the best one; gather more information for long-term benefit.
• Exploitation: take the action/option currently believed to give the best reward/payoff; get the maximum immediate reward given current information.
• Question: what are some exploration-exploitation problems in the real world?
  – Selecting a restaurant for dinner
  – Medical treatment in clinical trials
  – Online advertisement
  – Oil drilling
  – …
What's in the picture?
(Picture taken from Wikipedia)
Multi-armed bandit problem
• A gambler faces a row of slot machines. At each time step, he chooses one of the slot machines to play and receives a reward. The goal is to maximize his return.
(Figure caption: a row of slot machines in Las Vegas)
Multi-armed bandit problem
• Stochastic bandits
  – Problem formulation
  – Algorithms
• Adversarial bandits
  – Problem formulation
  – Algorithms
• Contextual bandits
Multi-armed bandit problem
• Stochastic bandits:
  – K possible arms/actions: 1 ≤ i ≤ K
  – Rewards x_i(t) at each arm i are drawn i.i.d. with expectation/mean μ_i, unknown to the agent/gambler
  – x_i(t) is a bounded real-valued reward
  – Goal: maximize the return (the cumulative reward), or equivalently minimize the expected regret:
    • Regret = μ*·T − E[ Σ_{t=1..T} x_{i_t}(t) ], where
    • μ* = max_i μ_i is the expected reward of the best action, and i_t is the arm pulled at step t
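To make the setting concrete, here is a minimal Python sketch of a stochastic bandit environment and the regret quantity defined above (class and method names are illustrative, not from the lecture):

    import numpy as np

    class StochasticBandit:
        def __init__(self, means, rng=None):
            self.means = np.asarray(means, dtype=float)   # mu_i, unknown to the agent
            self.best_mean = self.means.max()             # mu* = max_i mu_i
            self.rng = rng if rng is not None else np.random.default_rng()

        def pull(self, arm):
            # Reward drawn i.i.d. with mean mu_arm (Gaussian noise, as one example).
            return self.rng.normal(self.means[arm], 1.0)

        def expected_regret(self, chosen_arms):
            # mu* * T minus the sum of the means of the arms actually chosen.
            chosen_arms = np.asarray(chosen_arms)
            return self.best_mean * len(chosen_arms) - self.means[chosen_arms].sum()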
Multi-armed bandit problem: algorithms?
• Stochastic bandits:
  – Example: 10-armed bandits
  – Question: what is your strategy? Which arm to pull at each time step t?
Multi-armed bandit problem: algorithms
• Stochastic bandits: 10-armed bandits
• Question: what is your strategy? Which arm to pull at each time step t?
• 1. Greedy method:
  – At time step t, estimate a value for each action:
    • Q_t(a) = (sum of rewards when action a was taken prior to t) / (number of times action a was taken prior to t)
  – Select the action with the maximum value:
    • A_t = argmax_a Q_t(a)
• Weaknesses of the greedy method:
  – Always exploits current knowledge; no exploration.
  – Can get stuck with a suboptimal action.
Multi-armed bandit problem: algorithms
• 2. ε-Greedy methods:
  – At time step t, estimate a value for each action:
    • Q_t(a) = (sum of rewards when action a was taken prior to t) / (number of times action a was taken prior to t)
  – With probability 1 − ε, select the action with the maximum value:
    • A_t = argmax_a Q_t(a)
  – With probability ε, select an action uniformly at random from all the actions.
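A minimal Python sketch of ε-greedy selection with sample-average estimates Q_t(a) (helper names are illustrative):

    import numpy as np

    def epsilon_greedy_action(Q, epsilon, rng):
        # With probability epsilon explore uniformly at random, otherwise exploit argmax Q.
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))
        return int(np.argmax(Q))

    def update_estimate(Q, counts, arm, reward):
        # Incremental form of the sample average: Q_new = Q_old + (r - Q_old) / n.
        counts[arm] += 1
        Q[arm] += (reward - Q[arm]) / counts[arm]

Setting ε = 0 recovers the plain greedy method from the previous slide.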
Experiments:
• Set-up (one run):
  – 10-armed bandits
  – Draw μ_i from Gaussian(0, 1), i = 1, …, 10: the expectation/mean of rewards for action i
  – Reward of action i at time t: x_i(t) ~ Gaussian(μ_i, 1)
  – Play 2000 rounds/steps
  – Record the average return at each time step
• Average over 1000 runs
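A sketch of one run of this testbed, assuming the StochasticBandit environment and the ε-greedy helpers from the earlier sketches are in scope (averaging many such runs gives curves like those on the next slide):

    import numpy as np

    def run_testbed(num_arms=10, num_steps=2000, epsilon=0.1, seed=0):
        rng = np.random.default_rng(seed)
        bandit = StochasticBandit(rng.normal(0.0, 1.0, size=num_arms), rng)  # mu_i ~ N(0, 1)
        Q = np.zeros(num_arms)                 # value estimates
        counts = np.zeros(num_arms, dtype=int)
        rewards = np.zeros(num_steps)
        for t in range(num_steps):
            arm = epsilon_greedy_action(Q, epsilon, rng)
            r = bandit.pull(arm)               # x_arm(t) ~ N(mu_arm, 1)
            update_estimate(Q, counts, arm, r)
            rewards[t] = r
        return rewards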
Experimental results: average over 1000 runs
Observations? What will happen if we run more steps?
Observations:
• The greedy method improves faster at the very beginning, but levels off at a lower level.
• ε-Greedy methods continue to explore and eventually perform better.
• The ε = 0.01 method improves more slowly, but eventually performs better than the ε = 0.1 method.
Improve the greedy method with optimistic initialization
Observations:
• Big initial Q values force the greedy method to explore more in the beginning.
• No exploration afterwards.
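For example, a purely greedy agent can be pushed to explore early simply by starting its estimates above any plausible reward (the value 5.0 below is an illustrative choice, not taken from the slide):

    import numpy as np

    Q = np.full(10, 5.0)                 # optimistic initial estimates for 10 arms
    counts = np.zeros(10, dtype=int)
    # With epsilon = 0 (pure greedy), every pull drags Q[arm] down toward the arm's
    # true mean, so the still-untried, still-optimistic arms get chosen next;
    # once the optimism has worn off there is no further exploration.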
Improve ε-greedy with an ε that decreases over time
Decreasing ε over time:
• e.g. ε_t = ε_0 · α / (t + α)
• Improves faster in the beginning, and also outperforms fixed-ε greedy methods in the long run.
Weaknesses of ε-Greedy methods:
• ε-Greedy methods:
  – At time step t, estimate a value Q_t(a) for each action.
  – With probability 1 − ε, select the action with the maximum value: A_t = argmax_a Q_t(a).
  – With probability ε, select an action uniformly at random from all the actions.
• Weaknesses:
  – Exploration picks an action uniformly at random; it does not explore more "promising" actions first.
  – It ignores confidence: if an action has already been taken many times, there is little need to keep exploring it.
Upper Confidence Bound algorithm
• Take each action once.
• At any time t > K, choose
  – A_t = argmax_a ( Q_t(a) + √(2·ln t / n_a(t)) ), where
    • Q_t(a) is the average reward obtained from action a so far,
    • n_a(t) is the number of times action a has been taken.
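A Python sketch of this UCB1 selection rule (t is the current time step, assumed large enough that every arm has been tried at least once; names are illustrative):

    import numpy as np

    def ucb_action(Q, counts, t):
        # Take each action once before applying the UCB formula.
        untried = np.where(counts == 0)[0]
        if untried.size > 0:
            return int(untried[0])
        bonus = np.sqrt(2.0 * np.log(t) / counts)   # confidence radius per arm
        return int(np.argmax(Q + bonus))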
Upper Confidence Bounds
• A_t = argmax_a ( Q_t(a) + √(2·ln t / n_a(t)) )
• √(2·ln t / n_a(t)) is the confidence interval (radius) for the average reward:
  – Small n_a(t) → large confidence interval (Q_t(a) is more uncertain)
  – Large n_a(t) → small confidence interval (Q_t(a) is more accurate)
• Select the action maximizing the upper confidence bound:
  – Explore actions that are more uncertain; exploit actions with high average rewards obtained so far.
  – UCB balances exploration and exploitation.
  – As t → infinity, it selects the optimal action.
How good is UCB? Experimental results
Observations: after the initial 10 steps, UCB improves quickly and outperforms the ε-Greedy methods.
How good is UCB? Theoretical guarantees
• The expected regret after t steps is bounded by
  [ 8 · Σ_{i: μ_i < μ*} (ln t)/(μ* − μ_i) ] + (1 + π²/3) · Σ_{i=1..K} (μ* − μ_i)
• Achieves the optimal regret up to a multiplicative constant.
• Expected regret grows as O(ln t).
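For readability, the same bound written out as a displayed formula:

    \mathbb{E}[\mathrm{regret}(t)] \;\le\;
        8 \sum_{i:\, \mu_i < \mu^*} \frac{\ln t}{\mu^* - \mu_i}
        \;+\; \left(1 + \frac{\pi^2}{3}\right) \sum_{i=1}^{K} \left(\mu^* - \mu_i\right)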
How is the UCB derived?
• Hoeffding's Inequality Theorem:
  – Let X_1, …, X_t be i.i.d. random variables in [0, 1], and let X̄_t be the sample mean. Then
    P[ E[X] ≥ X̄_t + u ] ≤ e^(−2·t·u²)
• Apply this to the bandit rewards:
  – P[ Q(a) ≥ Q_t(a) + V_t(a) ] ≤ e^(−2·n_a(t)·V_t(a)²)
  – 2·V_t(a) is the size of the confidence interval for Q(a)
  – p = e^(−2·n_a(t)·V_t(a)²) is the probability of making an error
How is the UCB derived?
• Apply this to the bandit rewards:
  – P[ Q(a) > Q_t(a) + V_t(a) ] ≤ e^(−2·n_a(t)·V_t(a)²)
• Let p = e^(−2·n_a(t)·V_t(a)²) = t^(−4)
• Then V_t(a) = √(2·ln t / n_a(t))  (the algebra is spelled out after this slide)
• Variations of UCB1 (the basic version shown here):
  – UCB1-tuned, UCB3, UCB-V, etc.
  – Provide and prove different upper bounds on the expected regret.
  – Add a parameter to control the confidence level.
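Spelling out the algebra behind V_t(a) (a worked step, not on the original slide): setting the error probability to t^(−4) and solving for V_t(a) gives

    e^{-2\, n_a(t)\, V_t(a)^2} = t^{-4}
    \;\Longrightarrow\; 2\, n_a(t)\, V_t(a)^2 = 4 \ln t
    \;\Longrightarrow\; V_t(a) = \sqrt{\frac{2 \ln t}{n_a(t)}}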
Multi-armed bandit problem
• Stochastic bandits
  – Problem formulation
  – Algorithms:
    • Greedy, ε-greedy methods, UCB
    • Boltzmann exploration (Softmax)
• Adversarial bandits
  – Problem formulation
  – Algorithms
• Contextual bandits
Multi-armed bandit problem
• Boltzmann exploration (Softmax):
  – Pick an action with probability proportional to e^(its average reward / τ), so actions with greater average rewards are picked with higher probability.
• The algorithm:
  – Given initial empirical means μ_1(0), …, μ_K(0),
  – p_i(t+1) = e^(μ_i(t)/τ) / Σ_{j=1..K} e^(μ_j(t)/τ), for i = 1, …, K
• What is the use of τ? What happens as τ → infinity?
  – τ (the temperature) controls the choice.
  – As τ → infinity, actions are selected uniformly at random; as τ → 0, the selection becomes greedy.
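A Python sketch of Boltzmann (softmax) action selection over the empirical means (names are illustrative; the max is subtracted only for numerical stability):

    import numpy as np

    def boltzmann_action(Q, tau, rng):
        # p_i proportional to exp(Q_i / tau); tau is the temperature.
        prefs = np.asarray(Q, dtype=float) / tau
        prefs -= prefs.max()                 # numerical stability; does not change p
        probs = np.exp(prefs)
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

Large τ flattens the probabilities toward uniform; small τ concentrates them on the arm with the largest empirical mean.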
Multi-armed bandit problem
• Stochastic bandits
  – Problem formulation
  – Algorithms:
    • ε-greedy methods, UCB
    • Boltzmann exploration (Softmax)
• Adversarial bandits
  – Problem formulation
  – Algorithms
• Contextual bandits
Non-stochastic/adversarial multi-armed bandit problem
• Setting:
  – K possible arms/actions: 1 ≤ i ≤ K
  – The rewards are given by a sequence of vectors x(1), x(2), …, with x(t) = (x_1(t), x_2(t), …, x_K(t))
  – x_i(t) ∈ [0, 1]: reward received if action i is chosen at time step t (can be generalized to the range [a, b])
  – Goal: maximize the return (the cumulative reward)
• Example: K = 3, an adversary controls the rewards.
  k    t=1   t=2   t=3   t=4   …
  1    0.2   0.7   0.4   0.3
  2    0.5   0.1   0.6   0.2
  3    0.8   0.4   0.5   0.9
Adversarial multi-armed bandit problem
• Weak regret = G_max(T) − G_A(T)
• G_max(T) = max_j Σ_{t=1..T} x_j(t): the return of the single globally best action over the time horizon T
• G_A(T) = Σ_{t=1..T} x_{i_t}(t): the return over the time horizon T of algorithm A choosing actions i_1, i_2, …
• Example: G_max(4) = ?  G_A(4) = ?
  (* indicates the reward received by algorithm A at time step t)
  k \ t   t=1    t=2    t=3    t=4    …
  1       0.2    0.7    0.4    0.3*
  2       0.5*   0.1    0.6*   0.2
  3       0.8    0.4*   0.5    0.9
Adversarial multi-armed bandit problem
• Example: G_max(4) = ?  G_A(4) = ?
  – G_max(4) = max_j Σ_{t=1..4} x_j(t) = max{G_1(4), G_2(4), G_3(4)} = max{1.6, 1.4, 2.6} = 2.6
  – G_A(4) = Σ_{t=1..4} x_{i_t}(t) = 0.5 + 0.4 + 0.6 + 0.3 = 1.8
• Evaluating a randomized algorithm A: bound the expected weak regret G_max(T) − E[G_A(T)]
  k \ t   t=1    t=2    t=3    t=4    …
  1       0.2    0.7    0.4    0.3*
  2       0.5*   0.1    0.6*   0.2
  3       0.8    0.4*   0.5    0.9
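The example above can be checked with a few lines of Python (the starred choices are encoded 0-indexed as arms 1, 2, 1, 0):

    import numpy as np

    # Reward table from the example: rows = arms, columns = time steps t = 1..4.
    x = np.array([[0.2, 0.7, 0.4, 0.3],
                  [0.5, 0.1, 0.6, 0.2],
                  [0.8, 0.4, 0.5, 0.9]])
    chosen = np.array([1, 2, 1, 0])            # arms picked by algorithm A (starred)
    G_max = x.sum(axis=1).max()                # 2.6, best single arm in hindsight
    G_A = x[chosen, np.arange(4)].sum()        # 0.5 + 0.4 + 0.6 + 0.3 = 1.8
    weak_regret = G_max - G_A                  # 0.8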
Exp3: exponential-weight algorithm for exploration and exploitation
• Parameter: γ ∈ (0, 1]
• Initialize: w_i(1) = 1 for i = 1, …, K
• For t = 1, 2, …
  – Set p_i(t) = (1 − γ) · w_i(t) / Σ_{j=1..K} w_j(t) + γ/K, for i = 1, …, K
  – Draw i_t randomly according to p_1(t), …, p_K(t)
  – Receive reward x_{i_t}(t) ∈ [0, 1]
  – For j = 1, …, K:
    • x̂_j(t) = x_j(t) / p_j(t) if j = i_t; x̂_j(t) = 0 otherwise
    • w_j(t+1) = w_j(t) · exp(γ · x̂_j(t) / K)
• Questions: what happens at time step 1? At time step 2? What happens as γ → 1? Why divide by p_j(t) in x̂_j(t)?
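A compact Python sketch of the Exp3 loop described above (reward_fn is an illustrative stand-in for whatever the adversary does; it must return values in [0, 1]):

    import numpy as np

    def exp3(reward_fn, K, T, gamma, seed=0):
        rng = np.random.default_rng(seed)
        w = np.ones(K)                                  # w_i(1) = 1
        total = 0.0
        for t in range(1, T + 1):
            p = (1.0 - gamma) * w / w.sum() + gamma / K
            arm = int(rng.choice(K, p=p))               # draw i_t from p(t)
            x = reward_fn(arm, t)                       # only the chosen arm's reward is observed
            total += x
            x_hat = x / p[arm]                          # importance-weighted reward estimate
            w[arm] *= np.exp(gamma * x_hat / K)         # all other weights stay unchanged
        return total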
Exp3: Balancing exploration and exploitation
• p_i(t) = (1 − γ) · w_i(t) / Σ_{j=1..K} w_j(t) + γ/K, for i = 1, …, K
• Balancing exploration and exploitation in Exp3:
  – The distribution p(t) is a mixture of the uniform distribution and a distribution that assigns each action a probability mass exponential in its estimated cumulative reward.
  – The uniform component encourages exploration.
  – The exponential-weight component encourages exploitation.
  – The parameter γ controls the amount of exploration.
  – Dividing by p_j(t) in x̂_j(t) compensates the rewards of actions that are unlikely to be chosen, so the estimate is unbiased.
Exp3: How good is it?
• Theorem: for any K > 0 and any γ ∈ (0, 1],
    G_max − E[G_Exp3] ≤ (e − 1)·γ·G_max + K·ln(K)/γ
  holds for any assignment of rewards and any T > 0.
• Corollary: for any T > 0, assume that g ≥ G_max and that Exp3 is run with the parameter
    γ = min{ 1, √( K·ln K / ((e − 1)·g) ) }.
  Then G_max − E[G_Exp3] ≤ 2·√(e − 1)·√(g·K·ln K) ≤ 2.63·√(g·K·ln K).
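A tiny Python helper for the tuned γ in the corollary (g is an assumed upper bound on G_max; for rewards in [0, 1], g = T always works):

    import numpy as np

    def exp3_gamma(g, K):
        # gamma = min{1, sqrt(K ln K / ((e - 1) g))}
        return min(1.0, float(np.sqrt(K * np.log(K) / ((np.e - 1.0) * g))))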
A family of Exp3 algorithms
• Exp3, Exp3.1, Exp3.P, Exp4, …
• Better performance:
  – tighter upper bounds,
  – small weak regret with high probability, etc.
• Changes relative to Exp3:
  – how γ is chosen,
  – how the weights w are initialized and updated.
Summary: Exploration & exploitation in Reinforcement learning
• Stochastic bandits
  – Problem formulation
  – Algorithms: ε-greedy methods, UCB, Boltzmann exploration (Softmax)
• Adversarial bandits
  – Problem formulation
  – Algorithms: Exp3
• Contextual bandits
  – At each time step t, the learner observes a feature/context vector.
  – It uses both the context vectors and the rewards observed in the past to make a decision.
  – It learns how context vectors and rewards relate to each other.
References:
• "Reinforcement Learning: An Introduction" by Sutton and Barto
  – https://webdocs.cs.ualberta.ca/~sutton/book/the-book-2nd.html
• "The non-stochastic multi-armed bandit problem" by Auer, Cesa-Bianchi, Freund, and Schapire
  – https://cseweb.ucsd.edu/~yfreund/papers/bandits.pdf
• "Multi-armed bandits" by Michael
  – http://blog.thedataincubator.com/2016/07/multi-armed-bandits-2/
• …
Take-home questions
• 1. What is the difference between MABs (multi-armed bandits) and MDPs (Markov Decision Processes)?
• 2. How are they related?