Reinforcement learning

This is mostly taken from Dayan and Abbott, ch. 9

Reinforcement learning differs from supervised learning in that there is no all-knowing teacher: the reinforcement signal carries less information.

Central problem – temporal credit assignment.

Example: Spatial learning is impaired by block of NMDA receptors (Morris, 1989)

[Figure: Morris water maze, showing the rat and the platform]

Solving this problem comprises two separate tasks:

1. Predicting reward

2. Choosing the correct action

or

1. Policy evaluation (critic)

2. Policy improvement (actor)

Classical vs. instrumental conditioning

Classical conditioning: think of Pavlov's dog.

In instrumental conditioning the animal is rewarded for “correct” actions and is not rewarded, or is even punished, for incorrect ones.

In instrumental (operant) conditioning, what the animal does (its policy) matters.

Predicting reward – Rescorla-Wagner rule

Notation

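u – stimulus
r – reward
v – expected reward
w – weight (filter)

The expected reward is

$$v = w u$$

and the weight is updated by

$$w \to w + \epsilon\, \delta\, u$$

With: $\delta = r - v$, where $\epsilon$ is the learning rate.

For more than one stimulus: $v = \mathbf{w} \cdot \mathbf{u}$ and $\mathbf{w} \to \mathbf{w} + \epsilon\, \delta\, \mathbf{u}$

[Figure panels: Learning, r=1; Extinction, r=0; Random reward]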
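The learning and extinction cases can be sketched in a few lines of Python. This is only an illustration of the rule w → w + εδu with δ = r − v for a single stimulus; the learning rate ε = 0.05, the 100 trials per phase, and the function name `rescorla_wagner` are assumptions made for the example, not values from the slides.

```python
def rescorla_wagner(stimuli, rewards, epsilon=0.05, w0=0.0):
    """Return the weight after each trial under the Rescorla-Wagner rule."""
    w = w0
    history = []
    for u, r in zip(stimuli, rewards):
        v = w * u                  # expected reward, v = w u
        delta = r - v              # prediction error
        w += epsilon * delta * u   # w -> w + epsilon * delta * u
        history.append(w)
    return history


# Assumed protocol: 100 learning trials (r = 1), then 100 extinction trials (r = 0).
n = 100
stimuli = [1.0] * (2 * n)
rewards = [1.0] * n + [0.0] * n
w_trace = rescorla_wagner(stimuli, rewards)

print(f"w at end of learning:   {w_trace[n - 1]:.3f}")
print(f"w at end of extinction: {w_trace[-1]:.3f}")
```

With the stimulus always present, w rises exponentially toward r during learning and decays back toward zero during extinction.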

Predicting future reward: Temporal Difference learning

In more realistic conditions, especially in operant conditioning, the actual reward might come some time after the signal for the reward. What we care about is then not the immediate reward at this time point, but the total reward predicted given the choice made at this time. How can we estimate the total reward?

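Total average future reward at time t:

$$\left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle$$

Assume that we estimate this with a linear estimator:

$$v(t) = \sum_{\tau=0}^{t} w(\tau)\, u(t-\tau)$$

Use the δ rule at time t:

$$w(\tau) \to w(\tau) + \epsilon\, \delta(t)\, u(t-\tau)$$

Where δ is the difference between the actual future rewards and the prediction of these rewards:

$$\delta(t) = \sum_{\tau=0}^{T-t} r(t+\tau) - v(t)$$

But we do not know $\sum_{\tau=0}^{T-t} r(t+\tau)$ at time t. Instead we can approximate this by:

$$\sum_{\tau=0}^{T-t} r(t+\tau) \approx r(t) + v(t+1)$$

Which gives us:

$$\delta(t) = r(t) + v(t+1) - v(t) \qquad (1)$$

The temporal difference learning rule then becomes:

$$w(\tau) \to w(\tau) + \epsilon\, \delta(t)\, u(t-\tau) \qquad (2)$$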
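A minimal Python sketch of this temporal difference rule, using the linear estimator v(t) = Σ w(τ) u(t−τ) and δ(t) = r(t) + v(t+1) − v(t). The trial layout (25 time steps, a stimulus pulse at t = 5, a reward pulse at t = 15), ε = 0.2, and the helper name `td_trial` are all assumptions chosen for illustration; v is taken to be zero after the trial ends.

```python
def td_trial(w, u, r, epsilon):
    """Run one trial of TD learning, updating w in place; return delta(t)."""
    T = len(u)

    def v(t):
        # Linear prediction v(t) = sum_{tau=0}^{t} w(tau) u(t - tau).
        # Assumed to be zero once the trial is over.
        if t >= T:
            return 0.0
        return sum(w[tau] * u[t - tau] for tau in range(t + 1))

    deltas = []
    for t in range(T):
        delta = r[t] + v(t + 1) - v(t)              # TD error
        for tau in range(t + 1):
            w[tau] += epsilon * delta * u[t - tau]  # weight update
        deltas.append(delta)
    return deltas


# Hypothetical trial: stimulus pulse at t = 5, reward pulse at t = 15.
T, t_stim, t_reward = 25, 5, 15
u = [1.0 if t == t_stim else 0.0 for t in range(T)]
r = [1.0 if t == t_reward else 0.0 for t in range(T)]

w = [0.0] * T
first = td_trial(w, u, r, epsilon=0.2)
for _ in range(500):
    last = td_trial(w, u, r, epsilon=0.2)

print("first trial: delta is largest at t =", first.index(max(first)))
print("after learning: delta is largest at t =", last.index(max(last)))
```

In this sketch the prediction error starts at the time of the reward and, over trials, moves back to the arrival of the stimulus. Because δ(t) looks one step ahead through v(t+1), the late peak sits one step before the stimulus time.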

Dopamine and predicted reward

Activity of VTA dopaminergic neurons in a monkey. A. Top: before learning; bottom: after learning. B. After learning. Top: with reward; bottom: no reward.

Generalization of TD(0)

1. u can be a vector $\mathbf{u}$, so w is also a vector $\mathbf{w}$. This handles more complex stimuli, or multiple possible stimuli.

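2. A decay term. Here:

$$\delta = r_a(u) + \gamma\, v(u') - v(u)$$

where $u$ is the current location, $u'$ is the location moved to after action $a$, and $0 \le \gamma \le 1$ is the decay factor.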

This has the effect of putting a stronger emphasis on rewards that take fewer steps to reach.
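To see the effect of the decay term numerically, here is a small Python sketch on a hypothetical corridor of five locations with a single reward of 1 on leaving the last one. The corridor, the decay factor (called `gamma` here), ε = 0.1, and the function name are assumptions for illustration only. Each location stores its own value, and the update uses δ = r + γ v(u') − v(u).

```python
def discounted_chain_values(n_states, reward, gamma, epsilon=0.1, sweeps=2000):
    """Learn values along a one-way corridor with a decayed TD error."""
    v = [0.0] * (n_states + 1)      # extra terminal entry, value fixed at 0
    for _ in range(sweeps):
        for u in range(n_states):
            # Reward is delivered only on the step out of the last location.
            r = reward if u == n_states - 1 else 0.0
            delta = r + gamma * v[u + 1] - v[u]
            v[u] += epsilon * delta
    return v[:n_states]


values = discounted_chain_values(n_states=5, reward=1.0, gamma=0.9)
for steps_away, value in zip(range(5, 0, -1), values):
    print(f"{steps_away} steps from reward: v = {value:.4f}")
```

Each extra step between a location and the reward multiplies its learned value by the decay factor, so nearer rewards count for more. With a decay factor of 1 every location would end up with the same value.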

Until now we asked how to predict a reward. We still need to see how to decide which path to take, or which policy to use.

Bee foraging example:

[Figure: bee foraging, a choice between two flowers]

Different reward for each flower, drawn from P(rb) and P(ry).

Learn “action values” mb and my (the actor); these will determine which choice to make.

Assume rb=1, ry=2. What is the best choice we can make?

The average reward is:

$$\langle r \rangle = P[b]\, r_b + P[y]\, r_y$$

What will maximize this reward?

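Use softmax:

$$P[b] = \frac{\exp(\beta m_b)}{\exp(\beta m_b) + \exp(\beta m_y)}; \qquad P[y] = \frac{\exp(\beta m_y)}{\exp(\beta m_b) + \exp(\beta m_y)}$$

This is a stochastic choice; β is a variability parameter (a larger β makes the choice less variable). A good choice for the “action values” is to set them to the mean reward:

$$m_b = \langle r_b \rangle; \qquad m_y = \langle r_y \rangle$$

This is also called the “indirect actor” (???)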

How good is this choice?

Assume β=1, rb=1, ry=2. What is <r>?

$$P[b] = \frac{\exp(r_b)}{\exp(r_b) + \exp(r_y)}; \qquad P[y] = \frac{\exp(r_y)}{\exp(r_b) + \exp(r_y)}$$

Here the action values are set to the rewards: $m_b = r_b$; $m_y = r_y$.

>> rb=1; ry=2;
>> pb=exp(rb)/(exp(rb)+exp(ry))
pb = 0.2689
>> py=exp(ry)/(exp(rb)+exp(ry))
py = 0.7311
>> r_av=rb*pb+ry*py
r_av = 1.7311
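The same calculation can be repeated for other values of β. The Python sketch below is an illustration only: the function names are made up for this example, the action values are set equal to the rewards rb=1 and ry=2 as above, and the particular β values are arbitrary.

```python
import math


def softmax_choice_probs(m_b, m_y, beta):
    """Softmax probabilities P[b], P[y] for action values m_b, m_y."""
    top = max(m_b, m_y)                   # subtract the max for numerical stability
    e_b = math.exp(beta * (m_b - top))
    e_y = math.exp(beta * (m_y - top))
    return e_b / (e_b + e_y), e_y / (e_b + e_y)


def average_reward(r_b, r_y, beta):
    """<r> = P[b] r_b + P[y] r_y with action values equal to the rewards."""
    p_b, p_y = softmax_choice_probs(r_b, r_y, beta)
    return p_b * r_b + p_y * r_y


for beta in (0.0, 1.0, 5.0, 50.0):
    print(f"beta = {beta:5.1f}   <r> = {average_reward(1.0, 2.0, beta):.4f}")
```

At β = 0 both flowers are chosen equally often, and as β grows the choice approaches always picking the better flower, so the average reward approaches ry.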

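This choice can be learned using a delta rule:

$$m_x \to m_x + \epsilon\, \delta; \qquad \delta = r_x - m_x$$

[Figure: choices learned with β=1 and with β=50. Rewards: t<100: rb=1, ry=2; t>100: rb=2, ry=1]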
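A rough Python sketch of this experiment, assuming deterministic rewards that switch after visit 100, a learning rate ε = 0.1, and 200 visits in total. These settings and the function name `run_indirect_actor` are assumptions for illustration; only the visited flower x has its action value updated, using m_x → m_x + ε(r_x − m_x).

```python
import math
import random


def run_indirect_actor(beta, epsilon=0.1, n_visits=200, seed=0):
    """Softmax choice between flowers b and y with delta-rule action values."""
    rng = random.Random(seed)
    m = {"b": 0.0, "y": 0.0}
    choices = []
    for t in range(n_visits):
        # Assumed reward schedule: the two rewards swap after visit 100.
        r = {"b": 1.0, "y": 2.0} if t < 100 else {"b": 2.0, "y": 1.0}
        # Two-option softmax written as a logistic function of m_b - m_y.
        p_b = 1.0 / (1.0 + math.exp(-beta * (m["b"] - m["y"])))
        x = "b" if rng.random() < p_b else "y"
        m[x] += epsilon * (r[x] - m[x])   # delta rule on the chosen flower
        choices.append(x)
    return m, choices


for beta in (1.0, 50.0):
    m, choices = run_indirect_actor(beta)
    late_b = choices[100:].count("b")
    print(f"beta={beta}: m_b={m['b']:.2f}, m_y={m['y']:.2f}, "
          f"b chosen on {late_b} of the last 100 visits")
```

A small β keeps sampling both flowers, so the action values can follow the switch. A large β is close to greedy and may stay with whichever flower it happened to prefer early on.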

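Another option (direct actor ???) is to set the action values to maximize the expected reward:

$$\langle r \rangle = P[b]\, r_b + P[y]\, r_y$$

This can be done by stochastic gradient ascent on <r>.

For example:

$$\frac{\partial \langle r \rangle}{\partial m_b} = \beta \left( P[b]\,(1 - P[b])\, r_b - P[y]\, P[b]\, r_y \right)$$

So that, generally, for the action value mx given that action a was taken:

$$m_x \to m_x + \epsilon\, (\delta_{xa} - P[x])\, (r_a - r_0)$$

where $\delta_{xa}$ is 1 if x = a and 0 otherwise. A good choice for r0 is the mean of rx over all possible choices. (See D&A book pg 344)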
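As a sketch of the direct actor, the Python below applies the update m_x → m_x + ε(δ_xa − P[x])(r_a − r0) to the two-flower problem with rb=1 and ry=2. The values β = 1, ε = 0.1, r0 = 1.5 (the mean of the two rewards), the number of visits, and the function name are assumptions for the example.

```python
import math
import random


def run_direct_actor(beta=1.0, epsilon=0.1, r0=1.5, n_visits=2000, seed=0):
    """Direct actor on two flowers; returns action values and final P[y]."""
    rng = random.Random(seed)
    rewards = {"b": 1.0, "y": 2.0}
    m = {"b": 0.0, "y": 0.0}
    for _ in range(n_visits):
        p_b = 1.0 / (1.0 + math.exp(-beta * (m["b"] - m["y"])))
        p = {"b": p_b, "y": 1.0 - p_b}
        a = "b" if rng.random() < p_b else "y"      # action actually taken
        for x in m:
            kronecker = 1.0 if x == a else 0.0      # delta_{xa}
            m[x] += epsilon * (kronecker - p[x]) * (rewards[a] - r0)
    p_y = 1.0 / (1.0 + math.exp(-beta * (m["y"] - m["b"])))
    return m, p_y


m_direct, p_y_final = run_direct_actor()
print(f"m_b = {m_direct['b']:.3f}, m_y = {m_direct['y']:.3f}, P[y] = {p_y_final:.3f}")
```

Unlike the indirect actor, the action values here do not estimate the rewards. Only their difference matters, and it keeps growing in favour of the better flower.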

The Maze task and sequential action choice

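With a random policy (left and right turns equally likely at every location), the values of the three locations are:

$$v(B) = \tfrac{1}{2}(0 + 5) = 2.5, \qquad v(C) = \tfrac{1}{2}(0 + 2) = 1, \qquad v(A) = \tfrac{1}{2}\big(v(B) + v(C)\big) = 1.75$$

Policy evaluation: Initial random policy

$$w(u) \to w(u) + \epsilon\, \delta; \qquad \delta(t) = r(t) + v(t+1) - v(t)$$

where $v(u) = w(u)$ is the stored value of location u.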
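Policy evaluation under the random policy can be sketched as follows. The maze layout is an assumption consistent with the values quoted above: from A, left leads to B and right leads to C with no reward; at B the two exits pay 0 and 5; at C they pay 2 and 0. Which exit is "left" or "right" at B and C, the small learning rate ε = 0.005, and the number of trials are choices made for this illustration.

```python
import random

# (next location, reward) for each location and action; None ends the trial.
MAZE = {
    "A": {"left": ("B", 0.0), "right": ("C", 0.0)},
    "B": {"left": (None, 0.0), "right": (None, 5.0)},
    "C": {"left": (None, 2.0), "right": (None, 0.0)},
}


def evaluate_random_policy(epsilon=0.005, n_trials=20000, seed=0):
    """TD policy evaluation with left/right chosen at random everywhere."""
    rng = random.Random(seed)
    w = {"A": 0.0, "B": 0.0, "C": 0.0}     # v(u) = w(u)
    for _ in range(n_trials):
        u = "A"
        while u is not None:
            action = rng.choice(["left", "right"])
            u_next, r = MAZE[u][action]
            v_next = w[u_next] if u_next is not None else 0.0
            delta = r + v_next - w[u]      # TD error
            w[u] += epsilon * delta
            u = u_next
    return w


w_random = evaluate_random_policy()
for location in ("A", "B", "C"):
    print(f"v({location}) is about {w_random[location]:.2f}")
```

Because the rewards are sampled, the learned values fluctuate around 1.75, 2.5 and 1 rather than settling exactly. A larger ε learns faster but fluctuates more.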

Policy evaluation

What would it be for an ideal policy?

Policy improvement

Using the direct actor, learn to improve the policy:

$$m_x \to m_x + \epsilon\, (\delta_{xa} - P[x])\, (r_a - r_0)$$

Note: policy improvement and policy evaluation are best carried out sequentially: evaluate, improve, evaluate, improve, …

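At A:

$$\delta = 0 + v(B) - v(A) = 0.75 \quad \text{for a left turn}$$

$$\delta = 0 + v(C) - v(A) = -0.75 \quad \text{for a right turn}$$

The positive δ for the left turn raises its action value, so the policy at A shifts toward B.

[Figure: maze with v(A)=1.75, v(B)=2.5, v(C)=1]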
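The alternation of evaluation and improvement can be sketched in Python as below. Two things here are assumptions rather than statements from the slides: the maze layout (left from A leads to B, right to C; B pays 0 or 5, C pays 2 or 0), and the use of the TD error δ in place of (r_a − r0) in the direct actor update, following the worked example at A. The parameters and function names are illustrative only.

```python
import math
import random

MAZE = {
    "A": {"left": ("B", 0.0), "right": ("C", 0.0)},
    "B": {"left": (None, 0.0), "right": (None, 5.0)},
    "C": {"left": (None, 2.0), "right": (None, 0.0)},
}
ACTIONS = ("left", "right")


def action_probs(m_u, beta):
    """Softmax over the action values stored at one location."""
    top = max(m_u.values())
    e = {a: math.exp(beta * (m_u[a] - top)) for a in ACTIONS}
    z = sum(e.values())
    return {a: e[a] / z for a in ACTIONS}


def run_trial(v, m, rng, beta, eps_critic, eps_actor):
    """One pass through the maze; a learning rate of 0 freezes that part."""
    u = "A"
    while u is not None:
        p = action_probs(m[u], beta)
        a = "left" if rng.random() < p["left"] else "right"
        u_next, r = MAZE[u][a]
        delta = r + (v[u_next] if u_next is not None else 0.0) - v[u]
        v[u] += eps_critic * delta                        # critic
        for x in ACTIONS:                                 # actor
            m[u][x] += eps_actor * ((1.0 if x == a else 0.0) - p[x]) * delta
        u = u_next


def actor_critic(n_rounds=50, trials_per_phase=200, beta=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    v = {u: 0.0 for u in MAZE}
    m = {u: {a: 0.0 for a in ACTIONS} for u in MAZE}
    for _ in range(n_rounds):
        for _ in range(trials_per_phase):    # evaluate the current policy
            run_trial(v, m, rng, beta, epsilon, 0.0)
        for _ in range(trials_per_phase):    # improve it with the critic frozen
            run_trial(v, m, rng, beta, 0.0, epsilon)
    return v, m


v_final, m_final = actor_critic()
for u in MAZE:
    p_left = action_probs(m_final[u], 1.0)["left"]
    print(f"{u}: v = {v_final[u]:.2f}, P[left] = {p_left:.2f}")
```

Under these assumptions the policy comes to favour the left turn at A, the exit paying 5 at B, and the exit paying 2 at C.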

Reinforcement learning - summary
