Bayesian Sparse Sampling for On-line Reward Optimization


Presented in the Value of Information Seminar at NIPS 2005

Based on a previous ICML 2005 paper by Dale Schuurmans et al.

Background Perspective

Be Bayesian about reinforcement learning:

Ideal representation of uncertainty for action selection

Computational barriers

Exploration vs. exploitation

Bayes decision theory:

Value of information measured by the ultimate return in reward

Choose actions to maximize expected value

Exploration/exploitation tradeoff implicitly handled as a side effect

Bayesian Approach

Conceptually clean, but computationally disastrous

versus

Conceptually disastrous, but computationally clean


Overview

Efficient lookahead search for Bayesian RL

Sparser sparse sampling

Controllable computational cost

Higher quality action selection than current methods

Greedy
Epsilon-greedy
Boltzmann (Luce 1959)
Thompson sampling (Thompson 1933)
Bayes optimal (Hee 1978)
Interval estimation (Lai 1987; Kaelbling 1994)
Myopic value of perfect information (Dearden, Friedman, Andre 1999)
Standard sparse sampling (Kearns, Mansour, Ng 2001)
Péret & Garcia (Péret & Garcia 2004)

Sequential Decision Making

[Figure: lookahead tree rooted at state s. Actions a lead to (s, a) nodes; rewards r and next states s' follow; then actions a', rewards r', and states s''. State nodes carry values V(s), V(s'); action nodes carry Q(s, a), Q(s', a'); values are backed up with E_{r,s'|s,a}, max over a', and E_{r',s''|s',a'}.]

How to make an optimal decision?

General case: fixed point equations:

V(s) = sup_a Q(s, a)

Q(s, a) = E_{r,s'|s,a}[ r + V(s') ]

"Planning": requires the model P(r, s' | s, a)

This is the finite-horizon, finite-action, finite-reward case.
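Read operationally, the fixed point equations unroll backwards from the horizon. Below is a minimal planning sketch of that recursion; the `plan` function and its `model`/`states`/`actions` interface are hypothetical names chosen here for illustration, not anything from the talk.

```python
def plan(model, states, actions, horizon):
    """Finite-horizon backward recursion for V and Q, given the model P(r, s' | s, a).

    model: dict mapping (s, a) to a list of (prob, reward, next_state) triples.
    Returns V[(s, t)] and Q[(s, a, t)] for all states, actions, and stages t.
    """
    V = {(s, horizon): 0.0 for s in states}          # no reward beyond the horizon
    Q = {}
    for t in reversed(range(horizon)):
        for s in states:
            for a in actions:
                Q[(s, a, t)] = sum(p * (r + V[(s2, t + 1)])
                                   for p, r, s2 in model[(s, a)])
            V[(s, t)] = max(Q[(s, a, t)] for a in actions)
    return V, Q
```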

Reinforcement Learning

[Same lookahead tree as above, over states s, s', s'' and actions a, a', with rewards r, r' and values Q(s, a), Q(s', a'), V(s), V(s')]

Do not have the model P(r, s' | s, a)

Cannot compute E_{r,s'|s,a}[ r + V(s') ]

Bayesian Reinforcement Learning

[Figure: the meta-level MDP. From meta-level state (s, b), taking action a yields outcome (r, s', b') under the meta-level model P(r, s', b' | s, b, a); from (s', b'), taking a' yields (r', s'', b'') under P(r', s'', b'' | s', b', a').]

Meta-level MDP

Meta-level state: (s, b), where the belief state is b = P(θ)

Prior P(θ) on the base-level model P(r, s' | s, a, θ)

Choose actions to maximize long-term reward

We have a model for the meta-level transitions, based on the posterior update and expectations over base-level MDPs.
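For a Bernoulli bandit with Beta beliefs, the meta-level transition model is simply the predictive distribution followed by a conjugate posterior update. A minimal sketch under that assumption (the function and argument names are hypothetical):

```python
import random

def meta_transition(belief, arm):
    """Sample one meta-level transition (r, b') for the chosen arm.

    belief: list of (alpha, beta) Beta parameters, one per Bernoulli arm.
    The reward is drawn from the predictive distribution implied by the belief,
    and the belief is then updated with the conjugate Beta posterior update.
    """
    alpha, beta = belief[arm]
    p_success = alpha / (alpha + beta)               # predictive P(r = 1 | belief, arm)
    r = 1 if random.random() < p_success else 0
    new_belief = list(belief)
    new_belief[arm] = (alpha + r, beta + (1 - r))    # posterior update
    return r, new_belief
```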

Bayesian RL Decision Making

[Figure: one-step lookahead in the meta-level MDP, with V(s b) and Q(s b, a) at the root and V(s' b') at the sampled outcomes]

How to make an optimal decision?

Solve the planning problem in the meta-level MDP: the optimal Q, V values give Bayes-optimal action selection.

Problem: the meta-level MDP is much larger than the base-level MDP, so this is impractical.

Bayesian RL Decision Making

[Figure: one-step lookahead trees, one in the meta-level MDP over states (s, b) and one in a point-estimate base-level MDP over states s]

Current approximation strategies:

Greedy approach: current b → mean base-level MDP model → point estimate for Q, V → choose the greedy action.
But this doesn't consider uncertainty.

Thompson approach: consider the current belief state b → draw (sample) a base-level MDP model → point estimate for Q, V → choose each action proportional to the probability that it is the max-Q action.
☺ Exploration is based on uncertainty, but it is still myopic.
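To make the Thompson approach concrete for the Bernoulli-bandit case, here is a minimal sketch assuming independent Beta(α, β) beliefs per arm; it is an illustrative stand-in, not the authors' code.

```python
import random

def thompson_action(belief):
    """Sample one base-level model from the belief and act greedily on it.

    belief: list of (alpha, beta) Beta parameters, one per Bernoulli arm.
    Because the model is sampled, each arm is chosen with probability equal to
    its posterior probability of having the highest mean (i.e. the max Q value).
    """
    sampled_means = [random.betavariate(a, b) for a, b in belief]
    return max(range(len(belief)), key=lambda arm: sampled_means[arm])
```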

Their Approach

Try to better approximate Bayes optimal action selection by performing lookahead

Adapt “sparse sampling” (Kearns, Mansour, Ng)

Make some practical improvements

Sparse Sampling (Kearns, Mansour, Ng 2001)

Bayesian Sparse Sampling: Observation 1

Action value estimates are not equally important: we need better Q-value estimates for some actions, but not all.

Preferentially expand the tree under actions that might be optimal (biased tree growth).

Use Thompson sampling to select which actions to expand.

Bayesian Sparse Sampling: Observation 2

Correct leaf value estimates to the same depth: use the mean-MDP Q-value multiplied by the remaining depth, Q̂ · (N − t).

[Figure: tree levels t = 1, 2, 3 with effective horizon N = 3]

Bayesian Sparse Sampling: Tree-growing procedure

Descend the sparse tree from the root, Thompson-sampling actions and sampling outcomes, until a new node is added. Repeat until the tree size limit is reached.

Control computation by controlling the tree size.

[Figure: descent from node (s, b) through (s, b, a) to (s', b')]

At node (s, b): 1. sample a model from the prior (current belief), 2. solve its action values, 3. select the optimal action. Then execute the action and observe the reward.
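Putting the two observations and the procedure together, the following is a rough, self-contained sketch of the tree-growing loop for the Bernoulli-bandit case. All names and the node budget/guard are hypothetical choices made here; leaves are scored with the mean-model value times the remaining depth, in the spirit of Observation 2.

```python
import random

class Node:
    """One meta-level state in the sparse tree: a belief plus its depth."""
    def __init__(self, belief, depth):
        self.belief = belief            # list of (alpha, beta) Beta parameters, one per arm
        self.depth = depth
        self.children = {}              # (arm, sampled reward) -> child Node

def thompson_arm(belief):
    """Draw each arm's mean from its Beta posterior and pick the best draw."""
    draws = [random.betavariate(a, b) for a, b in belief]
    return max(range(len(belief)), key=lambda i: draws[i])

def grow_tree(prior_belief, horizon, node_budget):
    """Grow a sparse lookahead tree by repeated Thompson-sampled descents."""
    root = Node(list(prior_belief), 0)
    n_nodes = 1
    for _ in range(50 * node_budget):                # bounded number of descents (sketch guard)
        if n_nodes >= node_budget:
            break
        node = root
        while node.depth < horizon:
            arm = thompson_arm(node.belief)          # biased action selection (Observation 1)
            a, b = node.belief[arm]
            r = 1 if random.random() < a / (a + b) else 0   # sample an outcome
            if (arm, r) not in node.children:        # add exactly one new node, then restart
                child_belief = list(node.belief)
                child_belief[arm] = (a + r, b + (1 - r))
                node.children[(arm, r)] = Node(child_belief, node.depth + 1)
                n_nodes += 1
                break
            node = node.children[(arm, r)]
    return root

def value(node, horizon):
    """Back up values; leaves use the mean-model value times the remaining depth (Observation 2)."""
    remaining = horizon - node.depth
    if not node.children or remaining == 0:
        return remaining * max(a / (a + b) for a, b in node.belief)
    q = {}                                           # arm -> sampled estimates of r + V(child)
    for (arm, r), child in node.children.items():
        q.setdefault(arm, []).append(r + value(child, horizon))
    return max(sum(v) / len(v) for v in q.values())  # simple average over sampled outcomes
```

At the real root one would execute the arm with the largest backed-up Q estimate; the uniform averaging over sampled outcomes is a simplification of this sketch, not a claim about the paper's exact weighting.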

Simple experiments

Five Bernoulli bandits (arms a_1, ..., a_5) with Beta priors.

Sample a model from the prior, run the action selection strategies, repeat 3000 times, and report the average accumulated reward per step.
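A schematic version of this evaluation loop is sketched below; the function name and interface are hypothetical, and the uniform Beta(1, 1) priors are an assumption made here for concreteness rather than a detail from the slides.

```python
import random

def run_bandit_experiment(select_action, n_arms=5, horizon=20, n_runs=3000, seed=0):
    """Average accumulated reward per step for one action-selection strategy.

    select_action(belief) -> arm index, where belief is a list of (alpha, beta) pairs.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        true_means = [rng.betavariate(1, 1) for _ in range(n_arms)]   # model sampled from the prior
        belief = [(1.0, 1.0) for _ in range(n_arms)]                  # agent starts from the same prior
        for _ in range(horizon):
            arm = select_action(belief)
            r = 1 if rng.random() < true_means[arm] else 0
            a, b = belief[arm]
            belief[arm] = (a + r, b + (1 - r))                        # conjugate posterior update
            total += r
    return total / (n_runs * horizon)
```

For example, `run_bandit_experiment(thompson_action)` would evaluate the Thompson sketch given earlier.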

Five Bernoulli Bandits

[Figure: average reward per step (0.45 to 0.75) vs. horizon (5, 10, 15, 20) for eps-Greedy, Boltzmann, Interval Est., Thompson, MVPI, Sparse Samp., and Bayes Samp.]

Five Gaussian Bandits

[Figure: average reward per step (0 to 0.9) vs. horizon (5, 10, 15, 20) for eps-Greedy, Boltzmann, Interval Est., Thompson, MVPI, Sparse Samp., and Bayes Samp.]