Reinforcement Learning Generalization and Function Approximation


Transcript of Reinforcement Learning Generalization and Function Approximation

Page 1: Reinforcement Learning Generalization and Function Approximation

Reinforcement Learning

Generalization and Function Approximation

Subramanian Ramamoorthy, School of Informatics

28 February, 2012

Page 2: Reinforcement Learning Generalization and Function Approximation

Function Approximation in RL: Why?


Tabular $V^\pi$:

What should we do when the state spaces are very large and this becomes computationally expensive?

Represent $V^\pi$ as a surface; use function approximation.

Page 3: Reinforcement Learning Generalization and Function Approximation

Generalization


… how experience with a small part of the state space is used to produce good behaviour over a large part of the state space

Page 4: Reinforcement Learning Generalization and Function Approximation

Rich Toolbox

• Neural networks, decision trees, multivariate regression, ...
• Use of statistical clustering for state aggregation
• Can swap in/out your favourite function approximation method, as long as it can deal with:
  – learning while interacting, online
  – non-stationarity induced by policy changes
• So, combining gradient-descent methods with RL requires care
• May use generalization methods to approximate states, actions, value functions, Q-value functions, policies


Page 5: Reinforcement Learning Generalization and Function Approximation

Value Prediction with FA

As usual: Policy Evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$.

The value function estimate at time $t$, $V_t$, depends only on a parameter vector $\theta_t$, and only the parameter vector is updated.

e.g., $\theta_t$ could be the vector of connection weights of a neural network.


Page 6: Reinforcement Learning Generalization and Function Approximation

Adapt Supervised Learning Algorithms

[Diagram: Inputs → Parameterized Function → Outputs]

Training Info = desired (target) outputs

Error = (target output – actual output)

Training example = {input, target output}


Page 7: Reinforcement Learning Generalization and Function Approximation

Backups as Training Examples

e.g., the TD(0) backup:

$$V(s_t) \leftarrow V(s_t) + \alpha\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]$$

As a training example:

input: description of $s_t$; target output: $r_{t+1} + \gamma V(s_{t+1})$
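For concreteness, a minimal sketch (my own illustration, not from the slides) of packaging such a backup as a supervised (input, target) pair; here `phi` (the state description) and `V` (the current value estimate) are assumed placeholders:

    def td0_training_example(s_t, r_tp1, s_tp1, V, phi, gamma=1.0):
        # input: some description (feature vector) of s_t
        x = phi(s_t)
        # target: the bootstrapped TD(0) return r_{t+1} + gamma * V(s_{t+1})
        y = r_tp1 + gamma * V(s_tp1)
        return x, y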


Page 8: Reinforcement Learning Generalization and Function Approximation

Feature Vectors


This gives you a summary of the state, e.g., the state represented as a weighted linear combination of features.

Page 9: Reinforcement Learning Generalization and Function Approximation

Linear Methods

Represent states as feature vectors: for each $s \in S$:

$$\phi_s = \big(\phi_s(1), \phi_s(2), \ldots, \phi_s(n)\big)^\top$$

$$V_t(s) = \theta_t^\top \phi_s = \sum_{i=1}^{n} \theta_t(i)\,\phi_s(i)$$

$$\nabla_{\theta_t} V_t(s) = \ ?$$
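As an illustration (not part of the slides), a minimal numpy sketch of the linear value estimate and its gradient, assuming `theta` and `phi_s` are equal-length vectors:

    import numpy as np

    def linear_value(theta, phi_s):
        # V_t(s) = theta_t^T phi_s: a weighted linear combination of the features
        return float(np.dot(theta, phi_s))

    def linear_value_gradient(theta, phi_s):
        # For a linear approximator, the gradient w.r.t. theta is simply phi_s
        return np.asarray(phi_s)

This also answers the question on the slide: for linear methods the gradient is just the feature vector.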


Page 10: Reinforcement Learning Generalization and Function Approximation

What are we learning? From what?

$$\theta_t = \big(\theta_t(1), \theta_t(2), \ldots, \theta_t(n)\big)^\top$$

Assume $V_t$ is a (sufficiently smooth) differentiable function of $\theta_t$, for all $s \in S$.

Assume, for now, training examples of this form:

$$\big\{\text{description of } s_t,\ V^\pi(s_t)\big\}$$


Page 11: Reinforcement Learning Generalization and Function Approximation

Performance Measures

• Many are applicable, but…
• a common and simple one is the mean-squared error (MSE) over a distribution $P$:

$$\mathrm{MSE}(\theta_t) = \sum_{s \in S} P(s)\,\big[V^\pi(s) - V_t(s)\big]^2$$

• Why $P$?
• Why minimize MSE?
• Let us assume that $P$ is always the distribution of states at which backups are done.
• The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.


Page 12: Reinforcement Learning Generalization and Function Approximation

RL with FA – Basic Setup

• For the policy evaluation (i.e., value prediction) problem, we are minimizing:

• This could be solved by gradient descent (updating the parameter vector that specifies function V):


$$\mathrm{MSE}(\theta_t) = \sum_{s \in S} P(s)\,\big[V^\pi(s) - V_t(s)\big]^2$$

Here $V^\pi(s)$ is the value obtained, e.g., by backup updates (an unbiased estimator), and $V_t(s)$ is the approximated ‘surface’.

Page 13: Reinforcement Learning Generalization and Function Approximation

Gradient Descent


Let $f$ be any function of the parameter space. Its gradient at any point $\theta_t$ in this space is:

$$\nabla_\theta f(\theta_t) = \left(\frac{\partial f(\theta_t)}{\partial \theta(1)},\ \frac{\partial f(\theta_t)}{\partial \theta(2)},\ \ldots,\ \frac{\partial f(\theta_t)}{\partial \theta(n)}\right)^\top$$

[Figure: a two-dimensional parameter space with axes $\theta(1)$ and $\theta(2)$ and the point $\theta_t = \big(\theta_t(1), \theta_t(2)\big)^\top$.]

Iteratively move down the gradient:

$$\theta_{t+1} = \theta_t - \alpha\,\nabla_\theta f(\theta_t)$$

Page 14: Reinforcement Learning Generalization and Function Approximation

Gradient Descent

For the MSE given earlier and using the chain rule:

$$\theta_{t+1} = \theta_t - \tfrac{1}{2}\alpha\,\nabla_\theta \mathrm{MSE}(\theta_t)
= \theta_t - \tfrac{1}{2}\alpha\,\nabla_\theta \sum_{s \in S} P(s)\,\big[V^\pi(s) - V_t(s)\big]^2
= \theta_t + \alpha \sum_{s \in S} P(s)\,\big[V^\pi(s) - V_t(s)\big]\,\nabla_\theta V_t(s)$$


Page 15: Reinforcement Learning Generalization and Function Approximation

Gradient Descent

Use just the sample gradient instead:

$$\theta_{t+1} = \theta_t - \tfrac{1}{2}\alpha\,\nabla_\theta \big[V^\pi(s_t) - V_t(s_t)\big]^2
= \theta_t + \alpha\,\big[V^\pi(s_t) - V_t(s_t)\big]\,\nabla_\theta V_t(s_t)$$

Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if $\alpha$ decreases appropriately with $t$:

$$E\Big\{\big[V^\pi(s_t) - V_t(s_t)\big]\,\nabla_\theta V_t(s_t)\Big\} = \sum_{s \in S} P(s)\,\big[V^\pi(s) - V_t(s)\big]\,\nabla_\theta V_t(s)$$


Page 16: Reinforcement Learning Generalization and Function Approximation

But We Don’t have these Targets

Suppose we just have targets $v_t$ instead:

$$\theta_{t+1} = \theta_t + \alpha\,\big[v_t - V_t(s_t)\big]\,\nabla_\theta V_t(s_t)$$

If each $v_t$ is an unbiased estimate of $V^\pi(s_t)$, i.e., $E\{v_t\} = V^\pi(s_t)$, then gradient descent converges to a local minimum (provided $\alpha$ decreases appropriately).

e.g., the Monte Carlo target $v_t = R_t$:

$$\theta_{t+1} = \theta_t + \alpha\,\big[R_t - V_t(s_t)\big]\,\nabla_\theta V_t(s_t)$$
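A minimal sketch of this update for the Monte Carlo target, under the assumption of linear features so that $\nabla_\theta V_t(s) = \phi_s$:

    import numpy as np

    def mc_gradient_update(theta, phi_s, R_t, alpha=0.01):
        # theta <- theta + alpha * [R_t - V_t(s_t)] * grad V_t(s_t), with grad = phi_s
        v_hat = float(np.dot(theta, phi_s))
        return theta + alpha * (R_t - v_hat) * phi_s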


Page 17: Reinforcement Learning Generalization and Function Approximation

RL with FA, TD style


In practice, how do we obtain this?
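The algorithm box on this slide is not reproduced in the transcript; as a stand-in, here is a minimal sketch of gradient-descent TD(0) with linear function approximation. The `env` interface (reset/step returning next state, reward, done) and the feature function `phi` are assumptions of the sketch, not part of the original slide:

    import numpy as np

    def td0_linear(env, phi, policy, theta, alpha=0.01, gamma=0.99, n_steps=10000):
        # theta <- theta + alpha * [r + gamma*V(s') - V(s)] * phi(s), with V(s) = theta^T phi(s)
        s = env.reset()
        for _ in range(n_steps):
            a = policy(s)
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else np.dot(theta, phi(s_next))
            delta = r + gamma * v_next - np.dot(theta, phi(s))
            theta = theta + alpha * delta * phi(s)
            s = env.reset() if done else s_next
        return theta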

Page 18: Reinforcement Learning Generalization and Function Approximation

Nice Properties of Linear FA Methods

• The gradient is very simple: $\nabla_{\theta_t} V_t(s) = \phi_s$
• For MSE, the error surface is simple: a quadratic surface with a single minimum.
• Linear gradient-descent TD($\lambda$) converges:
  – Step size decreases appropriately
  – On-line sampling (states sampled from the on-policy distribution)
  – Converges to a parameter vector $\theta_\infty$ with the property

$$\mathrm{MSE}(\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\,\mathrm{MSE}(\theta^*)$$

where $\theta^*$ is the best parameter vector.


Page 19: Reinforcement Learning Generalization and Function Approximation

On-Line Gradient-Descent TD(λ)

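The algorithm box for this slide is also missing from the transcript; below is a minimal sketch of on-line gradient-descent TD(λ) with linear features and accumulating eligibility traces (again, `env`, `phi`, and `policy` are assumed interfaces):

    import numpy as np

    def td_lambda_linear(env, phi, policy, theta, alpha=0.01, gamma=0.99, lam=0.9,
                         n_episodes=100):
        for _ in range(n_episodes):
            s, done = env.reset(), False
            e = np.zeros_like(theta)              # one eligibility trace per parameter
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                v_next = 0.0 if done else np.dot(theta, phi(s_next))
                delta = r + gamma * v_next - np.dot(theta, phi(s))
                e = gamma * lam * e + phi(s)      # accumulating traces
                theta = theta + alpha * delta * e
                s = s_next
        return theta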

Page 20: Reinforcement Learning Generalization and Function Approximation

On Basis Functions: Coarse Coding


Page 21: Reinforcement Learning Generalization and Function Approximation

Shaping Generalization in Coarse Coding


Page 22: Reinforcement Learning Generalization and Function Approximation

Learning and Coarse Coding


Page 23: Reinforcement Learning Generalization and Function Approximation

Tile Coding


• Binary feature for each tile
• Number of features present at any one time is constant
• Binary features mean the weighted sum is easy to compute
• Easy to compute the indices of the features present (see the sketch below)
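A minimal sketch (my own, with illustrative tiling sizes) of 2-D grid tile coding that returns the indices of the active binary features:

    def tile_indices(x, y, n_tilings=8, tiles_per_dim=10, lo=0.0, hi=1.0):
        # Each tiling is a uniform grid shifted by a small offset; exactly one tile
        # per tiling is active, so the number of active features is constant.
        width = (hi - lo) / tiles_per_dim
        active = []
        for t in range(n_tilings):
            offset = t * width / n_tilings
            ix = min(int((x - lo + offset) / width), tiles_per_dim)
            iy = min(int((y - lo + offset) / width), tiles_per_dim)
            active.append(t * (tiles_per_dim + 1) ** 2 + iy * (tiles_per_dim + 1) + ix)
        return active

With binary features, the weighted sum on this slide is just sum(theta[i] for i in tile_indices(x, y)), and the active indices are cheap to compute.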


Page 24: Reinforcement Learning Generalization and Function Approximation

Tile Coding, Contd.

Irregular tilings

Hashing


Page 25: Reinforcement Learning Generalization and Function Approximation

Radial Basis Functions (RBFs)

e.g., Gaussians:

$$\phi_s(i) = \exp\!\left(-\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2}\right)$$
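A minimal numpy sketch of these Gaussian radial basis features; the centres `centers` and widths `sigmas` are assumed to be chosen by the user (e.g., on a grid over the state space):

    import numpy as np

    def rbf_features(s, centers, sigmas):
        # phi_s(i) = exp(-||s - c_i||^2 / (2 * sigma_i^2))
        s = np.asarray(s, dtype=float)
        sq_dist = np.sum((np.asarray(centers) - s) ** 2, axis=1)
        return np.exp(-sq_dist / (2.0 * np.asarray(sigmas) ** 2))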


Page 26: Reinforcement Learning Generalization and Function Approximation

Beating the “Curse of Dimensionality”

• Can you keep the number of features from going up exponentially with the dimension?
• Function complexity, not dimensionality, is the problem.
• Kanerva coding (sketched below):
  – Select a set of binary prototypes
  – Use Hamming distance as the distance measure
  – Dimensionality is no longer a problem, only complexity
• “Lazy learning” schemes:
  – Remember all the data
  – To get a new value, find nearest neighbors & interpolate
  – e.g., (nonparametric) locally-weighted regression
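A minimal sketch of Kanerva coding with binary prototypes and Hamming distance; the prototype set and the activation radius are assumptions of the sketch:

    import numpy as np

    def kanerva_features(state_bits, prototypes, radius=2):
        # Feature i is active (1.0) when the Hamming distance between the
        # binary-coded state and prototype i is at most `radius`.
        state_bits = np.asarray(state_bits, dtype=np.uint8)
        hamming = np.sum(np.asarray(prototypes, dtype=np.uint8) != state_bits, axis=1)
        return (hamming <= radius).astype(float)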


Page 27: Reinforcement Learning Generalization and Function Approximation

Going from Value Prediction to GPI

• So far, we’ve only discussed policy evaluation, where the value function is represented as an approximated function
• In order to extend this to a GPI-like setting:
  1. Firstly, we need to use the action-value functions
  2. Combine that with the policy improvement and action selection steps
  3. For exploration, we need to think about on-policy vs. off-policy methods


Page 28: Reinforcement Learning Generalization and Function Approximation

Gradient Descent Update for Action-Value Function Prediction

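The update itself is not reproduced in the transcript; the standard gradient-descent form (as in Sutton & Barto) mirrors the state-value case:

$$\theta_{t+1} = \theta_t + \alpha\,\big[v_t - Q_t(s_t, a_t)\big]\,\nabla_\theta Q_t(s_t, a_t)$$

where $v_t$ is an approximation of $Q^\pi(s_t, a_t)$, e.g., the SARSA target $r_{t+1} + \gamma\,Q_t(s_{t+1}, a_{t+1})$.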

Page 29: Reinforcement Learning Generalization and Function Approximation

How to Plug-in Policy Improvement or Action Selection?

• If spaces are very large, or continuous, this is an active research topic and there are no conclusive answers
• For small-ish discrete spaces:
  – For each action $a$ available at a state $s_t$, compute $Q_t(s_t, a)$ and find the greedy action according to it
  – Then, one could use this as part of an ε-greedy action selection, or as the estimation policy in off-policy methods (see the sketch below)
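A minimal sketch of such ε-greedy selection over a linearly approximated action-value function; the state-action feature function `phi_sa` is an assumed placeholder:

    import numpy as np

    def epsilon_greedy(theta, phi_sa, s, actions, epsilon=0.1, rng=None):
        # With probability epsilon explore uniformly; otherwise act greedily w.r.t.
        # the approximate action values Q_t(s, a) = theta^T phi_sa(s, a).
        rng = rng or np.random.default_rng()
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]
        q_values = [np.dot(theta, phi_sa(s, a)) for a in actions]
        return actions[int(np.argmax(q_values))]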


Page 30: Reinforcement Learning Generalization and Function Approximation

On Eligibility Traces with Function Approx.

• The formulation of control with function approx., so far, mainly assumes accumulating traces
• Replacing traces have advantages over this
• However, they do not directly extend to the case of function approximation
  – Notice that we now maintain a column vector of eligibility traces, one trace for each component of the parameter vector
  – So, we can’t update separately for a state (as in tabular methods)
• A practical way to get around this is to do the replacing-traces procedure for features rather than states
  – Could also utilize an optional clearing procedure, over features


Page 31: Reinforcement Learning Generalization and Function Approximation

On-policy, SARSA(λ), control with function approximation


Let’s discuss it using a grid world…
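The algorithm box itself is not in the transcript; as a stand-in, a minimal sketch of episodic SARSA(λ) with linear state-action features, accumulating traces, and ε-greedy selection. The `env`, `phi_sa`, and `actions` interfaces are assumptions of this sketch:

    import numpy as np

    def sarsa_lambda_linear(env, phi_sa, actions, theta, alpha=0.1, gamma=0.99,
                            lam=0.9, epsilon=0.1, n_episodes=500):
        rng = np.random.default_rng()

        def q(s, a):
            return float(np.dot(theta, phi_sa(s, a)))

        def select(s):
            if rng.random() < epsilon:
                return actions[rng.integers(len(actions))]
            return max(actions, key=lambda a: q(s, a))

        for _ in range(n_episodes):
            s, done = env.reset(), False
            a = select(s)
            e = np.zeros_like(theta)                  # one trace per parameter component
            while not done:
                s_next, r, done = env.step(a)
                delta = r - q(s, a)
                if not done:
                    a_next = select(s_next)
                    delta += gamma * q(s_next, a_next)
                e = gamma * lam * e + phi_sa(s, a)    # accumulating traces over features
                theta = theta + alpha * delta * e
                if not done:
                    s, a = s_next, a_next
        return theta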

Page 32: Reinforcement Learning Generalization and Function Approximation

Off-policy, Q-learning, control with function approximation


Let’s discuss it using the same grid world…

Page 33: Reinforcement Learning Generalization and Function Approximation

Example: Mountain-car Task

• Drive an underpowered car up a steep mountain road

• Gravity is stronger than the engine (as in the cart-pole example)

• Example of a continuous control task where the system must first move away from the goal, then converge to it

• Reward of -1 on each step until the car ‘escapes’

• Actions: full throttle forward (+), full throttle reverse (-), zero throttle (0); see the dynamics sketch below
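For concreteness, a sketch of the usual mountain-car dynamics (constants as commonly specified, e.g., in Sutton & Barto; treat them as assumptions of this sketch rather than part of the slide):

    import math

    def mountain_car_step(position, velocity, action):
        # action in {-1, 0, +1}: full reverse, no throttle, full forward
        velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
        velocity = min(max(velocity, -0.07), 0.07)
        position = min(max(position + velocity, -1.2), 0.5)
        if position <= -1.2:          # hitting the left boundary zeroes the velocity
            velocity = 0.0
        done = position >= 0.5        # the car has 'escaped' over the right hilltop
        return position, velocity, -1.0, done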


Page 34: Reinforcement Learning Generalization and Function Approximation

Mountain-car Example: Cost-to-go Function (SARSA(λ) solution)


Page 35: Reinforcement Learning Generalization and Function Approximation

Mountain Car Solution with RBFs


[Computed by M. Kretchmar]

Page 36: Reinforcement Learning Generalization and Function Approximation

Parameter Choices: Mountain Car Example


Page 37: Reinforcement Learning Generalization and Function Approximation

Off-Policy Bootstrapping

• All methods except MC make use of an existing value estimate in order to update the value function
• If the value function is being separately approximated, what is the combined effect of the two iterative processes?
• As we already saw, value prediction with linear gradient-descent function approximation converges to a sub-optimal point (as measured by MSE)
  – The further $\lambda$ strays from 1, the more the deviation from optimality
• The real issue is that V is learned from an on-policy distribution
  – Off-policy bootstrapping with function approximation can lead to divergence (MSE tending to infinity)


Page 38: Reinforcement Learning Generalization and Function Approximation

Some Issues with Function Approximation: Baird’s Counterexample


Reward = 0 on all transitions. True V(s) = 0 for all s. Parameter updates as below.

Page 39: Reinforcement Learning Generalization and Function Approximation

Baird’s Counterexample


If we back up according to a uniform off-policy distribution (as opposed to the on-policy distribution), the estimates diverge for some parameters!

- similar counterexamples exist for Q-learning as well…

Page 40: Reinforcement Learning Generalization and Function Approximation

Another Example


Baird’s counterexample has a simple fix: instead of taking small steps towards expected one-step returns, change the value function to the best least-squares approximation.

However, even this is not a general rule!

Reward = 0 on all transitions. True V(s) = 0 for all s.

Page 41: Reinforcement Learning Generalization and Function Approximation

Tsitsiklis and Van Roy’s Counterexample


Some function approximation methods (that do not extrapolate), such as nearest neighbors or locally weighted regression, can avoid this.
– Another solution is to change the objective (e.g., minimize the 1-step expected error)

Page 42: Reinforcement Learning Generalization and Function Approximation

Some Case Studies

Positive examples (instead of just counterexamples)…


Page 43: Reinforcement Learning Generalization and Function Approximation

TD Gammon

Tesauro 1992, 1994, 1995, ...

• White has just rolled a 5 and a 2, so can move one of his pieces 5 steps and one (possibly the same) 2 steps
• Objective is to advance all pieces to points 19-24
• Hitting
• Doubling
• 30 pieces, 24 locations implies an enormous number of configurations
• Effective branching factor of 400


Page 44: Reinforcement Learning Generalization and Function Approximation

A Few Details

• Reward: 0 at all times except those in which the game is won, when it is 1

• Episodic (game = episode), undiscounted

• Gradient-descent TD(λ) with a multi-layer neural network
  – weights initialized to small random numbers
  – backpropagation of the TD error
  – four input units for each point; unary encoding of the number of white pieces, plus other features

• Use of afterstates, learning during self-play


Page 45: Reinforcement Learning Generalization and Function Approximation

Multi-layer Neural Network


Page 46: Reinforcement Learning Generalization and Function Approximation

Summary of TD-Gammon Results


Page 47: Reinforcement Learning Generalization and Function Approximation

Samuel’s Checkers Player

Arthur Samuel 1959, 1967

• Score board configurations by a “scoring polynomial” (after Shannon, 1950)
• Minimax to determine the “backed-up score” of a position
• Alpha-beta cutoffs
• Rote learning: save each board configuration encountered together with its backed-up score
  – needed a “sense of direction”: like discounting
• Learning by generalization: similar to the TD algorithm


Page 48: Reinforcement Learning Generalization and Function Approximation

Samuel’s Backups


Page 49: Reinforcement Learning Generalization and Function Approximation

The Basic Idea

“. . . we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play.”

A. L. Samuel

Some Studies in Machine Learning Using the Game of Checkers, 1959
