Bayesian Sparse Sampling for On-line Reward Optimization


Transcript of Bayesian Sparse Sampling for On-line Reward Optimization

Page 1: Bayesian Sparse Sampling  for On-line Reward Optimization

Bayesian Sparse Sampling for On-line Reward Optimization

Presented in the Value of Information Seminar at NIPS 2005

Based on a previous paper from ICML 2005, written by Dale Schuurmans et al.

Page 2: Bayesian Sparse Sampling  for On-line Reward Optimization

Background Perspective

Be Bayesian about reinforcement learning

Ideal representation of uncertainty for action selection

Computational barriers

Page 3: Bayesian Sparse Sampling  for On-line Reward Optimization

Exploration vs. Exploitation

Bayes decision theory: the value of information is measured by its ultimate return in reward

Choose actions to maximize expected value

The exploration/exploitation tradeoff is handled implicitly, as a side effect

Page 4: Bayesian Sparse Sampling  for On-line Reward Optimization

Bayesian Approach

Conceptually clean but computationally disastrous

versus

Conceptually disastrous but computationally clean


Page 6: Bayesian Sparse Sampling  for On-line Reward Optimization

Overview

Efficient lookahead search for Bayesian RL

Sparser sparse sampling

Controllable computational cost

Higher quality action selection than current methods:

Greedy
Epsilon-greedy
Boltzmann (Luce 1959)
Thompson Sampling (Thompson 1933)
Bayes optimal (Hee 1978)
Interval estimation (Lai 1987, Kaelbling 1994)
Myopic value of perfect info. (Dearden, Friedman, Andre 1999)
Standard sparse sampling (Kearns, Mansour, Ng 2001)
Péret & Garcia (Péret & Garcia 2004)

Page 7: Bayesian Sparse Sampling  for On-line Reward Optimization

Sequential Decision Making

How to make an optimal decision?

General case: fixed point equations:

V(s) = sup_a Q(s, a)
Q(s, a) = E_{r,s'|s,a}[ r + V(s') ]

[Figure: full lookahead tree rooted at state s. Levels alternate MAX over actions (a, then a') with expectations E_{r,s'|s,a} and E_{r',s''|s',a'} over rewards and successor states (r, s', r', s''); Q(s', a') and V(s') values at the inner nodes back up into Q(s, a) and V(s) at the root.]

“Planning”

Requires model P(r,s’|s,a)

This is the finite horizon, finite action, finite reward case
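Where the model P(r,s'|s,a) is available, these fixed-point equations can be solved by backward induction over the horizon. Below is a minimal planning sketch for the finite horizon, finite action case; the dictionary-based model encoding and the function name are illustrative assumptions, not anything from the slides.

```python
# Minimal finite-horizon planning sketch via backward induction.
# Assumed (hypothetical) encoding: model[(s, a)] is a list of
# (probability, reward, next_state) triples representing P(r, s' | s, a).

def plan(model, states, actions, horizon):
    """Return V[t][s] and Q[t][(s, a)] for t = 0 .. horizon-1."""
    V = [dict() for _ in range(horizon + 1)]
    Q = [dict() for _ in range(horizon)]
    for s in states:
        V[horizon][s] = 0.0  # no reward collected beyond the horizon
    for t in reversed(range(horizon)):
        for s in states:
            for a in actions:
                # Q(s, a) = E_{r, s' | s, a}[ r + V(s') ]
                Q[t][(s, a)] = sum(p * (r + V[t + 1][sp])
                                   for p, r, sp in model[(s, a)])
            # V(s) = max_a Q(s, a)
            V[t][s] = max(Q[t][(s, a)] for a in actions)
    return V, Q
```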

Page 8: Bayesian Sparse Sampling  for On-line Reward Optimization

Reinforcement Learning

[Figure: the same lookahead tree over s, a, r, s', a', r', s'', with Q(s', a'), V(s'), Q(s, a), and V(s) at the nodes.]

Do not have the model P(r,s'|s,a)

Page 9: Bayesian Sparse Sampling  for On-line Reward Optimization

Reinforcement Learning

[Figure: the same lookahead tree as on the previous slide.]

Do not have the model P(r,s'|s,a)

Cannot compute the expectation E_{r,s'|s,a}[ · ]

Page 10: Bayesian Sparse Sampling  for On-line Reward Optimization

Bayesian Reinforcement Learning

[Figure: meta-level MDP chain. Meta-level state (s, b) → actions → (s, b) + a → rewards → (s', b') → actions → (s', b') + a' → rewards → (s'', b''). Each step is a decision a with outcome (r, s', b') governed by the meta-level model P(r, s' b' | s b, a), then a decision a' with outcome (r', s'', b'') governed by P(r', s'' b'' | s' b', a').]

Meta-level MDP

Meta-level state: (s, b), where the belief state b = P(θ)

Prior P(θ) on the base-level model P(r, s' | s, a, θ)

Choose actions to maximize long-term reward

Have a model for meta-level transitions! Based on posterior update and expectations over base-level MDPs
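To make the meta-level transition concrete, here is a minimal sketch for the Bernoulli-bandit case used later in the talk: the belief b is a list of independent Beta(alpha, beta) posteriors, one per action, and observing a reward is exactly a posterior update. The representation and helper names are assumptions for illustration, not the authors' code.

```python
import random

# Belief state b = P(theta): an independent Beta(alpha, beta) per action.
def initial_belief(num_actions, alpha=1.0, beta=1.0):
    return [(alpha, beta)] * num_actions

def belief_update(belief, action, reward):
    """Meta-level transition of the belief: Beta posterior update after
    observing a Bernoulli reward (0 or 1) for `action`."""
    a, b = belief[action]
    new = list(belief)
    new[action] = (a + reward, b + (1 - reward))
    return new

def sample_model(belief):
    """Draw one base-level model theta ~ b (success probability per action)."""
    return [random.betavariate(a, b) for a, b in belief]

def mean_model(belief):
    """Mean base-level model under the current belief."""
    return [a / (a + b) for a, b in belief]
```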

Page 11: Bayesian Sparse Sampling  for On-line Reward Optimization

Bayesian RL Decision Making

[Figure: one-step lookahead in the meta-level MDP: V(s b) at the root, Q(s b, a) for each action a, sampled rewards r, and V(s' b') at the successor belief states.]

How to make an optimal decision?

Solve the planning problem in the meta-level MDP:

- Optimal Q, V values → Bayes optimal action selection

- Problem: the meta-level MDP is much larger than the base-level MDP → impractical

Page 12: Bayesian Sparse Sampling  for On-line Reward Optimization

Bayesian RL Decision Making

[Figure: one-step lookahead trees for the meta-level MDP (root s b, nodes s b, a and successors s' b') and for a base-level MDP (root s, nodes s, a and successors s'), with actions a and sampled rewards r.]

Current approximation strategies:

Greedy approach: consider the current belief state b → mean base-level MDP model → point estimate for Q, V → choose the greedy action.
But this doesn't consider uncertainty.

Thompson approach: current b → draw a base-level MDP model from the belief → point estimate for Q, V → choose each action proportional to the probability that it is the max-Q action.
☺ Exploration is based on uncertainty, but it is still myopic.
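With the Beta-Bernoulli belief sketched earlier, the two strategies differ only in which base-level model is plugged in before acting greedily. A minimal sketch (sample_model and mean_model are the hypothetical helpers from the earlier snippet):

```python
def greedy_action(belief):
    # Mean base-level model -> point estimate of action values -> greedy choice.
    q = mean_model(belief)
    return max(range(len(q)), key=lambda i: q[i])

def thompson_action(belief):
    # One sampled base-level model -> act greedily in it; over repeated draws
    # each action is chosen with the probability that it is the maximizer.
    q = sample_model(belief)
    return max(range(len(q)), key=lambda i: q[i])
```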

Page 13: Bayesian Sparse Sampling  for On-line Reward Optimization

Their Approach

Try to better approximate Bayes optimal action selection by performing lookahead

Adapt “sparse sampling” (Kearns, Mansour, Ng)

Make some practical improvements

Page 14: Bayesian Sparse Sampling  for On-line Reward Optimization

Sparse Sampling (Kearns, Mansour, Ng 2001)
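The body of this slide (presumably a lookahead-tree figure) did not survive the transcript. As a reminder of the baseline, here is a minimal sketch of uniform sparse sampling in the spirit of Kearns, Mansour, and Ng (2001): estimate Q(s, a) by drawing a fixed number of outcomes per action from a generative model and recursing to a fixed depth. The generative-model interface simulate(s, a) -> (reward, next_state) is an assumption for illustration.

```python
def sparse_sampling_q(simulate, s, actions, depth, width, gamma=1.0):
    """Estimate Q(s, a) for every action by uniform sparse sampling:
    `width` sampled outcomes per action, recursion to `depth`."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):
            r, s_next = simulate(s, a)  # one sampled outcome of taking a in s
            q_next = sparse_sampling_q(simulate, s_next, actions,
                                       depth - 1, width, gamma)
            total += r + gamma * max(q_next.values())  # back up V(s_next)
        q[a] = total / width
    return q
```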

Page 15: Bayesian Sparse Sampling  for On-line Reward Optimization

Bayesian Sparse Sampling: Observation 1

Action value estimates are not equally important: we need better Q-value estimates for some actions, but not all

Preferentially expand the tree under actions that might be optimal → biased tree growth

Use Thompson sampling to select which actions to expand

Page 16: Bayesian Sparse Sampling  for On-line Reward Optimization

Bayesian Sparse Sampling: Observation 2

Correct leaf value estimates to the same depth

Use the mean-MDP Q-value multiplied by the remaining depth

[Figure: leaves expanded to depths t = 1, 2, 3 in a tree with effective horizon N = 3.]

Leaf correction term: Q̂ · (N - t), the mean-MDP Q-value times the remaining depth.
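A minimal illustration of the correction, assuming q_mean holds the per-step Q-value of the mean MDP at the leaf (hypothetical names):

```python
def corrected_leaf_value(q_mean, t, N):
    # Pad a leaf expanded only to depth t so it is comparable with leaves
    # at the effective horizon N: mean-MDP Q-value times the remaining depth.
    return (N - t) * q_mean
```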

Page 17: Bayesian Sparse Sampling  for On-line Reward Optimization

Bayesian Sparse Sampling: Tree-growing procedure

Descend the sparse tree from the root: Thompson-sample actions and sample outcomes until a new node is added

Repeat until the tree size limit is reached

Control computation by controlling the tree size

[Figure: tree fragment (s, b) → (s, b, a) → (s', b'). At a belief node (s, b): 1. sample the prior/belief for a model; 2. solve the action values; 3. select the optimal action. Then execute the action and observe the reward.]
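Putting the pieces together, here is a minimal sketch of the tree-growing loop described on this slide, specialized to the Bernoulli-bandit belief used in the experiments: each descent Thompson-samples an action, samples an outcome under the sampled model, and stops as soon as a new node is added, until the tree-size limit is hit; leaves are padded with the mean-MDP value for the remaining depth (Observation 2). The node layout and helper names are assumptions, not the authors' implementation; the Beta-Bernoulli helpers are repeated so the snippet is self-contained.

```python
import random

def sample_model(belief):      # theta ~ belief, one success probability per action
    return [random.betavariate(a, b) for a, b in belief]

def mean_model(belief):        # mean model under the belief
    return [a / (a + b) for a, b in belief]

def belief_update(belief, action, reward):
    a, b = belief[action]
    new = list(belief)
    new[action] = (a + reward, b + (1 - reward))
    return new

class Node:
    """A belief node of the sparse lookahead tree (bandit case: state = belief)."""
    def __init__(self, belief, depth):
        self.belief = belief
        self.depth = depth
        self.children = {}     # (action, reward) -> Node

def grow_tree(root_belief, horizon, size_limit, max_descents=10000):
    """Grow a biased sparse tree of at most `size_limit` nodes."""
    root = Node(root_belief, 0)
    size, descents = 1, 0
    while size < size_limit and descents < max_descents:
        descents += 1
        node = root
        while node.depth < horizon:
            theta = sample_model(node.belief)              # Thompson-sample a model
            action = max(range(len(theta)), key=lambda i: theta[i])
            reward = 1 if random.random() < theta[action] else 0  # sample an outcome
            key = (action, reward)
            if key not in node.children:                   # a new node: stop this descent
                node.children[key] = Node(belief_update(node.belief, action, reward),
                                          node.depth + 1)
                size += 1
                break
            node = node.children[key]
    return root

def action_values(node, horizon):
    """Average the sampled returns per action at this node (Q estimates)."""
    returns = {}
    for (action, reward), child in node.children.items():
        returns.setdefault(action, []).append(reward + node_value(child, horizon))
    return {a: sum(v) / len(v) for a, v in returns.items()}

def node_value(node, horizon):
    if not node.children:
        # Observation 2: pad the leaf with the mean-MDP value for the remaining depth.
        return (horizon - node.depth) * max(mean_model(node.belief))
    return max(action_values(node, horizon).values())

def select_action(root, horizon):
    q = action_values(root, horizon)
    return max(q, key=q.get)
```

At each real time step one would call grow_tree on the current belief, execute the action returned by select_action, observe the reward, and update the belief before planning again.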

Page 18: Bayesian Sparse Sampling  for On-line Reward Optimization

Simple experiments

Five Bernoulli bandits, actions a1, ..., a5

Beta priors

Sample a model from the prior

Run the action selection strategies

Repeat 3000 times

Report the average accumulated reward per step
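A minimal sketch of this protocol, reusing the hypothetical Beta-Bernoulli representation from the earlier snippets; `strategy` is any function mapping a belief to an action (e.g. thompson_action, or a wrapper around grow_tree and select_action from the previous sketch).

```python
import random

def run_experiment(strategy, num_actions=5, horizon=20, repeats=3000,
                   alpha=1.0, beta=1.0):
    """Average accumulated reward per step over `repeats` bandits sampled from the prior."""
    total = 0.0
    for _ in range(repeats):
        # Sample a Bernoulli bandit model from the Beta prior.
        theta = [random.betavariate(alpha, beta) for _ in range(num_actions)]
        belief = [(alpha, beta)] * num_actions
        accumulated = 0
        for _ in range(horizon):
            action = strategy(belief)
            reward = 1 if random.random() < theta[action] else 0
            accumulated += reward
            a, b = belief[action]
            belief = belief[:action] + [(a + reward, b + (1 - reward))] + belief[action + 1:]
        total += accumulated / horizon
    return total / repeats
```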

Page 19: Bayesian Sparse Sampling  for On-line Reward Optimization

Five Bernoulli Bandits

[Figure: average reward per step (roughly 0.45 to 0.75) versus horizon (5 to 20) on the five Bernoulli bandits, comparing eps-Greedy, Boltzmann, Interval Est., Thompson, MVPI, Sparse Samp., and Bayes Samp.]

Page 20: Bayesian Sparse Sampling  for On-line Reward Optimization

Five Gaussian Bandits

[Figure: average reward per step (roughly 0 to 0.9) versus horizon (5 to 20) on five Gaussian bandits, comparing the same seven strategies.]

Page 21: Bayesian Sparse Sampling  for On-line Reward Optimization