
Transcript
Page 1: Lecture 22

Lecture 22

Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3

COS 402 – Machine Learning and Artificial Intelligence

Fall 2016

Page 2: Lecture 22

How to balance exploration and exploitation in reinforcement learning

• Exploration: try out each action/option to find the best one; gather more information for long-term benefit.

• Exploitation: take the action/option believed to give the best reward/payoff; get the maximum immediate reward given current information.

• Question: what are some exploration-exploitation problems in the real world?

– Select a restaurant for dinner

– …

Page 3: Lecture 22

Balancing exploration and exploitation

• Exploration: try out each action/option to find the best one; gather more information for long-term benefit.

• Exploitation: take the action/option believed to give the best reward/payoff; get the maximum immediate reward given current information.

• Question: what are some exploration-exploitation problems in the real world?

– Select a restaurant for dinner

– Medical treatment in clinical trials

– Online advertisement

– Oil drilling

– …

Page 4: Lecture 22

The picture is taken from Wikipedia.

What's in the picture?

Page 5: Lecture 22

Multi-armed bandit problem

• A gambler faces a row of slot machines. At each time step, he chooses one of the slot machines to play and receives a reward. The goal is to maximize his total return.

A row of slot machines in Las Vegas

Page 6: Lecture 22

Multi-armed bandit problem

• Stochastic bandits

– Problem formulation

– Algorithms

• Adversarial bandits

– Problem formulation

– Algorithms

• Contextual bandits

Page 7: Lecture 22

Multi-armed bandit problem

• Stochastic bandits:

– K possible arms/actions: 1 ≤ i ≤ K

– Rewards x_i(t) at each arm i are drawn i.i.d. with expectation/mean u_i, unknown to the agent/gambler

– x_i(t) is a bounded real-valued reward.

– Goal: maximize the return (the cumulative reward), or equivalently minimize the expected regret:

– Regret = u*·T − E[ Σ_{t=1}^{T} x_{i_t}(t) ], where

• u* = max_i u_i, the expected reward of the best action

Page 8: Lecture 22

Multi-armed bandit problem: algorithms?

• Stochastic bandits:

– Example: 10-armed bandits

– Question: what is your strategy? Which arm to pull at each time step t?

Page 9: Lecture 22

Multi-armed bandit problem: algorithms

• Stochastic bandits: 10-armed bandits

• Question: what is your strategy? Which arm to pull at each time step t?

• 1. Greedy method:

– At time step t, estimate a value for each action:

• Q_t(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)

– Select the action with the maximum value:

• A_t = argmax_a Q_t(a)

– Weaknesses?

Page 10: Lecture 22

Multi-armed bandit problem: algorithms

• 1. Greedy method:

– At time step t, estimate a value for each action:

• Q_t(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)

– Select the action with the maximum value:

• A_t = argmax_a Q_t(a)

• Weaknesses of the greedy method:

– Always exploits current knowledge; no exploration.

– Can get stuck with a suboptimal action.

Page 11: Lecture 22

Multi-armed bandit problem: algorithms

• 2. ε-Greedy methods:

– At time step t, estimate a value for each action:

• Q_t(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)

– With probability 1 − ε, select the action with the maximum value:

• A_t = argmax_a Q_t(a)

– With probability ε, select an action uniformly at random from all the actions.
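As a concrete illustration, here is a minimal sketch of the ε-greedy rule (my own code, not from the course; the class and method names are illustrative). Setting epsilon = 0 recovers the plain greedy method above.

```python
import numpy as np

# Hedged sketch of an epsilon-greedy agent for a K-armed bandit (illustrative names).
class EpsilonGreedy:
    def __init__(self, k, epsilon=0.1):
        self.epsilon = epsilon
        self.q = np.zeros(k)   # Q_t(a): sample-average reward estimate per arm
        self.n = np.zeros(k)   # number of times each arm has been taken

    def select(self):
        # With probability epsilon explore uniformly; otherwise exploit argmax_a Q_t(a).
        if np.random.rand() < self.epsilon:
            return int(np.random.randint(len(self.q)))
        return int(np.argmax(self.q))

    def update(self, a, reward):
        # Incremental sample-average update of Q_t(a).
        self.n[a] += 1
        self.q[a] += (reward - self.q[a]) / self.n[a]
```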

Page 12: Lecture 22

Experiments:

• Setup (one run):

– 10-armed bandits

– Draw u_i from Gaussian(0, 1), i = 1, …, 10

• u_i is the expectation/mean of rewards for action i

– Rewards of action i at time t: x_i(t)

• x_i(t) ~ Gaussian(u_i, 1)

– Play 2000 rounds/steps

– Average return at each time step

• Average over 1000 runs
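A minimal simulation of this testbed might look like the sketch below (my own code, not the course code; it reuses the EpsilonGreedy sketch from earlier and follows the run/step counts on the slide).

```python
import numpy as np

# Hedged sketch of the 10-armed testbed described above.
def run_bandit(agent_factory, n_runs=1000, n_steps=2000, k=10):
    avg_reward = np.zeros(n_steps)
    for _ in range(n_runs):
        u = np.random.normal(0.0, 1.0, size=k)   # true means u_i ~ Gaussian(0, 1)
        agent = agent_factory(k)
        for t in range(n_steps):
            a = agent.select()
            x = np.random.normal(u[a], 1.0)       # reward x_i(t) ~ Gaussian(u_i, 1)
            agent.update(a, x)
            avg_reward[t] += x
    return avg_reward / n_runs                    # average return at each time step

# Example: compare greedy (epsilon = 0) with epsilon-greedy (epsilon = 0.1).
# curve_greedy = run_bandit(lambda k: EpsilonGreedy(k, epsilon=0.0))
# curve_eps    = run_bandit(lambda k: EpsilonGreedy(k, epsilon=0.1))
```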

Page 13: Lecture 22

Experimental results: average over 1000 runs

Observations? What will happen if we run more steps?

Page 14: Lecture 22

Observations:

• The greedy method improves faster at the very beginning, but levels off at a lower level.

• ε-greedy methods continue to explore and eventually perform better.

• The ε = 0.01 method improves more slowly, but eventually performs better than the ε = 0.1 method.

Page 15: Lecture 22

Improve the Greedy method with optimistic initialization

Observations:

Page 16: Lecture 22

Greedy with optimistic initialization

Observations:

• Big initial Q values force the greedy method to explore more in the beginning.

• No exploration afterwards.
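One way to realize this is sketched below, under my own assumptions: the initial value 5.0 and the constant step size 0.1 are illustrative choices, since the slide does not state which constants were used.

```python
import numpy as np

# Hedged sketch: greedy action selection with optimistic initial values.
class OptimisticGreedy:
    def __init__(self, k, initial_q=5.0, step=0.1):
        self.q = np.full(k, initial_q)   # optimistic estimates drive early exploration
        self.step = step

    def select(self):
        return int(np.argmax(self.q))    # always greedy; the optimism does the exploring

    def update(self, a, reward):
        # A constant step size lets the optimism decay gradually instead of vanishing
        # after a single pull, as a pure sample average would.
        self.q[a] += self.step * (reward - self.q[a])
```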

Page 17: Lecture 22

Improve ε-greedy with a decreasing ε over time

Decreasing ε over time:

• ε_t = ε_0 · α / (t + α)

• Improves faster in the beginning, and also outperforms fixed-ε greedy methods in the long run.
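A minimal sketch of such a schedule follows (my own code; the slide's formula is partly garbled, and ε_t = ε_0·α/(t+α) is one plausible reading). It reuses the EpsilonGreedy sketch from earlier.

```python
# Hedged sketch: epsilon-greedy with a decaying exploration rate.
class DecayingEpsilonGreedy(EpsilonGreedy):
    def __init__(self, k, epsilon0=0.1, alpha=100.0):
        super().__init__(k, epsilon=epsilon0)
        self.epsilon0, self.alpha, self.t = epsilon0, alpha, 0

    def select(self):
        self.t += 1
        # Assumed schedule: epsilon_t = epsilon_0 * alpha / (t + alpha).
        self.epsilon = self.epsilon0 * self.alpha / (self.t + self.alpha)
        return super().select()
```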

Page 18: Lecture 22

Weaknesses of ε-Greedy methods:

• ε-Greedy methods:

– At time step t, estimate a value Q_t(a) for each action

– With probability 1 − ε, select the action with the maximum value:

• A_t = argmax_a Q_t(a)

– With probability ε, select an action uniformly at random from all the actions.

• Weaknesses:

Page 19: Lecture 22

Weaknesses of ε-Greedy methods:

• ε-Greedy methods:

– At time step t, estimate a value Q_t(a) for each action

– With probability 1 − ε, select the action with the maximum value:

• A_t = argmax_a Q_t(a)

– With probability ε, select an action uniformly at random from all the actions.

• Weaknesses:

– Randomly selects an action to explore; does not preferentially explore more "promising" actions.

– Does not consider confidence intervals: if an action has already been taken many times, there is little need to keep exploring it.

Page 20: Lecture 22

Upper Confidence Bound algorithm

• Take each action once.

• At any time t > K,

– A_t = argmax_a ( Q_t(a) + sqrt( 2·ln t / N_a(t) ) ), where

• Q_t(a) is the average reward obtained from action a,

• N_a(t) is the number of times action a has been taken.

Page 21: Lecture 22

Upper Confidence Bounds

• A_t = argmax_a ( Q_t(a) + sqrt( 2·ln t / N_a(t) ) )

• sqrt( 2·ln t / N_a(t) ) is the confidence interval for the average reward:

• Small N_a(t) → large confidence interval (Q_t(a) is more uncertain); large N_a(t) → small confidence interval (Q_t(a) is more accurate).

• Select the action maximizing the upper confidence bound.

– Explore actions that are more uncertain; exploit actions with high average rewards obtained so far.

– UCB balances exploration and exploitation.

– As t → infinity, it selects the optimal action.

Page 22: Lecture 22

How good is UCB? Experimental results

Observations:

Page 23: Lecture 22

How good is UCB? Experimental results

Observations: after the initial 10 steps, UCB improves quickly and outperforms the ε-greedy methods.

Page 24: Lecture 22

How good is UCB? Theoretical guarantees

• The expected regret is bounded by

8 · Σ_{i: u_i < u*} ( ln t / (u* − u_i) ) + (1 + π²/3) · Σ_{i=1}^{K} (u* − u_i)

• Achieves the optimal regret up to a multiplicative constant.

• The expected regret grows as O(ln t).

Page 25: Lecture 22

How is the UCB derived?

• Hoeffding's Inequality Theorem:

• Let X_1, …, X_t be i.i.d. random variables in [0, 1], and let X̄_t be the sample mean. Then

– P[ E[X̄_t] ≥ X̄_t + u ] ≤ e^(−2·t·u²)

• Applied to the bandit rewards:

– P[ Q(a) ≥ Q_t(a) + V_t(a) ] ≤ e^(−2·N_t(a)·V_t(a)²),

– 2·V_t(a) is the size of the confidence interval for Q(a),

– p = e^(−2·N_t(a)·V_t(a)²) is the probability of making an error.

Page 26: Lecture 22

How is the UCB derived?

• Applied to the bandit rewards:

– P[ Q(a) > Q_t(a) + V_t(a) ] ≤ e^(−2·N_t(a)·V_t(a)²)

• Let p = e^(−2·N_t(a)·V_t(a)²) = t^(−4).

• Then V_t(a) = sqrt( 2·ln t / N_t(a) ).

• Variations of UCB1 (the basic version shown here):

– UCB1-Tuned, UCB3, UCB-V, etc.

– They provide and prove different upper bounds for the expected regret.

– Some add a parameter to control the confidence level.

Page 27: Lecture 22

Multi-armed bandit problem

• Stochastic bandits

– Problem formulation

– Algorithms:

• Greedy, ε-greedy methods, UCB

• Boltzmann exploration (Softmax)

• Adversarial bandits

– Problem formulation

– Algorithms

• Contextual bandits

Page 28: Lecture 22

Multi-armed bandit problem

• Boltzmann exploration (Softmax)

– Pick an action with a probability that increases with its estimated average reward (via a softmax).

– Actions with greater average rewards are picked with higher probability.

• The algorithm:

– Given initial empirical means u_1(0), …, u_K(0)

– p_i(t+1) = e^(u_i(t)/τ) / Σ_{j=1}^{K} e^(u_j(t)/τ), i = 1, …, K

– What is the use of τ?

– What happens as τ → infinity?

Page 29: Lecture 22

Multi-armed bandit problem

• Boltzmann exploration (Softmax)

– Pick an action with a probability that increases with its estimated average reward (via a softmax).

– Actions with greater average rewards are picked with higher probability.

• The algorithm:

– Given initial empirical means u_1(0), …, u_K(0)

– p_i(t+1) = e^(u_i(t)/τ) / Σ_{j=1}^{K} e^(u_j(t)/τ), i = 1, …, K

– The temperature τ controls how strongly the choice favors high-reward actions.

– As τ → infinity, actions are selected uniformly at random.

Page 30: Lecture 22

Multi-armed bandit problem

• Stochastic bandits

– Problem formulation

– Algorithms:

• ε-greedy methods, UCB

• Boltzmann exploration (Softmax)

• Adversarial bandits

– Problem formulation

– Algorithms

• Contextual bandits

Page 31: Lecture 22

Non-stochastic/adversarial multi-armed bandit problem

• Setting:

– K possible arms/actions: 1 ≤ i ≤ K

– The rewards are determined by an assignment x(1), x(2), … of vectors x(t) = (x_1(t), x_2(t), …, x_K(t))

– x_i(t) ∈ [0, 1]: reward received if action i is chosen at time step t (can be generalized to the range [a, b])

– Goal: maximize the return (the cumulative reward).

• Example: K = 3, an adversary controls the rewards.

k    t=1   t=2   t=3   t=4   …
1    0.2   0.7   0.4   0.3
2    0.5   0.1   0.6   0.2
3    0.8   0.4   0.5   0.9

Page 32: Lecture 22

Adversarial multi-armed bandit problem

• Weak regret = G_max(T) − G_A(T)

• G_max(T) = max_j Σ_{t=1}^{T} x_j(t), the return of the single globally best action at time horizon T

• G_A(T) = Σ_{t=1}^{T} x_{i_t}(t), the return at time horizon T of algorithm A choosing actions i_1, i_2, …

• Example: G_max(4) = ?, G_A(4) = ?

– (* indicates the reward received by algorithm A at time step t)

K/T  t=1    t=2    t=3    t=4   …
1    0.2    0.7    0.4    0.3*
2    0.5*   0.1    0.6*   0.2
3    0.8    0.4*   0.5    0.9

Page 33: Lecture 22

Adversarial multi-armed bandit problem

• Example: G_max(4) = ?, G_A(4) = ?

– G_max(4) = max_j Σ_{t=1}^{4} x_j(t) = max{G_1(4), G_2(4), G_3(4)} = max{1.6, 1.4, 2.6} = 2.6

– G_A(4) = Σ_{t=1}^{4} x_{i_t}(t) = 0.5 + 0.4 + 0.6 + 0.3 = 1.8

• Evaluating a randomized algorithm A: bound the expected weak regret G_max(T) − E[G_A(T)].

K/T  t=1    t=2    t=3    t=4   …
1    0.2    0.7    0.4    0.3*
2    0.5*   0.1    0.6*   0.2
3    0.8    0.4*   0.5    0.9
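As a quick sanity check of the arithmetic in this example (a small sketch of my own, not course code):

```python
# Rewards from the example table above and the arms chosen by algorithm A (starred entries).
rewards = {1: [0.2, 0.7, 0.4, 0.3],
           2: [0.5, 0.1, 0.6, 0.2],
           3: [0.8, 0.4, 0.5, 0.9]}
chosen = [(2, 1), (3, 2), (2, 3), (1, 4)]             # (arm i_t, time t)

g_per_arm = {j: sum(x) for j, x in rewards.items()}   # G_j(4) for each arm
g_max = max(g_per_arm.values())                       # 2.6, achieved by arm 3
g_a = sum(rewards[i][t - 1] for i, t in chosen)       # 0.5 + 0.4 + 0.6 + 0.3 = 1.8
print(g_per_arm, g_max, g_a, g_max - g_a)             # weak regret = 0.8
```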

Page 34: Lecture 22

Exp3: exponential-weight algorithm for exploration and exploitation

• Parameter: γ ∈ (0, 1]

• Initialize:

– w_i(1) = 1 for i = 1, …, K

• For t = 1, 2, …

– Set p_i(t) = (1 − γ) · w_i(t) / Σ_{j=1}^{K} w_j(t) + γ/K, i = 1, …, K

– Draw i_t randomly according to p_1(t), …, p_K(t)

– Receive reward x_{i_t}(t) ∈ [0, 1]

– For j = 1, …, K

• x̂_j(t) = x_j(t) / p_j(t) if j = i_t

• x̂_j(t) = 0 otherwise

• w_j(t+1) = w_j(t) · exp( γ · x̂_j(t) / K )

Question: what happens at time step 1? At time step 2? What happens as γ → 1? Why x_j(t) / p_j(t)?
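A compact sketch of the loop above (my own code with illustrative names; reward_fn is a stand-in for the adversary's reward assignment):

```python
import numpy as np

# Hedged sketch of Exp3 as described above.
def exp3(K, gamma, reward_fn, T):
    """reward_fn(i, t) returns x_i(t) in [0, 1]."""
    w = np.ones(K)                                   # w_i(1) = 1
    total = 0.0
    for t in range(1, T + 1):
        p = (1 - gamma) * w / w.sum() + gamma / K    # mixture of weights and the uniform distribution
        i = int(np.random.choice(K, p=p))
        x = reward_fn(i, t)
        total += x
        x_hat = x / p[i]                             # importance-weighted reward estimate
        w[i] *= np.exp(gamma * x_hat / K)            # other arms keep their weights (x_hat = 0)
    return total                                     # G_Exp3(T)
```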

Page 35: Lecture 22

Exp3: Balancing exploration and exploitation

• p_i(t) = (1 − γ) · w_i(t) / Σ_{j=1}^{K} w_j(t) + γ/K, i = 1, …, K

• Balancing exploration and exploitation in Exp3:

– The distribution p(t) is a mixture of the uniform distribution and a distribution that assigns to each action a probability mass exponential in the estimated cumulative reward for that action.

– The uniform distribution encourages exploration.

– The other distribution encourages exploitation.

– The parameter γ controls the amount of exploration.

– Dividing by p_j(t) (i.e., x̂_j(t) = x_j(t)/p_j(t)) compensates the rewards of actions that are unlikely to be chosen.

Page 36: Lecture 22

Exp3: How good is it?

• Theorem: for any K > 0 and for any γ ∈ (0, 1],

G_max − E[G_Exp3] ≤ (e − 1)·γ·G_max + K·ln(K)/γ

Holds for any assignment of rewards and for any T > 0.

• Corollary: for any T > 0, assume that g ≥ G_max and that algorithm Exp3 is run with the parameter

• γ = min{ 1, sqrt( K·ln K / ((e − 1)·g) ) }

• Then G_max − E[G_Exp3] ≤ 2·sqrt( (e − 1)·g·K·ln K ) ≤ 2.63·sqrt( g·K·ln K )

Page 37: Lecture 22

A family of Exp3 algorithms

• Exp3, Exp3.1, Exp3.P, Exp4, …

• Better performance:

– tighter upper bound,

– achieve small weak regret with high probability, etc.

• Changes based on Exp3:

– How to choose γ

– How to initialize and update the weights w

Page 38: Lecture 22

Summary: Exploration & Exploitation in Reinforcement Learning

• Stochastic bandits

– Problem formulation

– Algorithms: ε-greedy methods, UCB, Boltzmann exploration (Softmax)

• Adversarial bandits

– Problem formulation

– Algorithms: Exp3

• Contextual bandits

– At each time step t, the agent observes a feature/context vector.

– Uses both the context vector and past rewards to make a decision.

– Learns how context vectors and rewards relate to each other.


Page 39: Lecture 22

References:

• "Reinforcement Learning: An Introduction" by Sutton and Barto – https://webdocs.cs.ualberta.ca/~sutton/book/the-book-2nd.html

• "The non-stochastic multi-armed bandit problem" by Auer, Cesa-Bianchi, Freund, and Schapire – https://cseweb.ucsd.edu/~yfreund/papers/bandits.pdf

• "Multi-armed bandits" by Michael – http://blog.thedataincubator.com/2016/07/multi-armed-bandits-2/

• …

Page 40: Lecture 22

Take-home questions

• 1. What is the difference between MABs (multi-armed bandits) and MDPs (Markov Decision Processes)?

• 2. How are they related?

