Reinforcement learning
This is mostly taken from Dayan and Abbott, ch. 9
Reinforcement learning differs from supervised learning in that there is no all-knowing teacher; the reinforcement signal carries less information than a supervised target.
Central problem – temporal credit assignment.
Example: Spatial learning is impaired by blockade of NMDA receptors (Morris, 1989)
[Figure: Morris water maze, showing the rat and the platform]
Solving this problem comprises two separate tasks:
1. Predicting reward
2. Choosing the correct action
or
1. Policy evaluation (critic)
2. Policy improvement (actor)
Classical vs. instrumental conditioning
Classical: think of Pavlov's dog.
In instrumental conditioning the animal is rewarded for “correct” actions and is not rewarded, or is even punished, for incorrect ones.
In instrumental (operant) conditioning, what the animal does (its policy) matters.
Predicting reward: the Rescorla-Wagner rule
Notation
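u: stimulus
r: reward
v: expected reward
w: weight (filter)
$$v = w u$$
$$w \to w + \epsilon \delta u$$
With:
$$\delta = r - v$$
For more than one stimulus:
$$v = \mathbf{w} \cdot \mathbf{u}, \qquad \mathbf{w} \to \mathbf{w} + \epsilon \delta \mathbf{u}$$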
[Figure: weight w over trials. Panels: learning (r=1), extinction (r=0), random reward]
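The Rescorla-Wagner rule can be tried out numerically. The following is a minimal Python sketch, not taken from the slides: the function name `rescorla_wagner`, the learning rate `eps=0.1`, and the trial counts are all assumptions chosen for illustration. It applies v = wu, δ = r − v, w → w + εδu on each trial for a single stimulus.

```python
import numpy as np

def rescorla_wagner(rewards, u=1.0, eps=0.1, w0=0.0):
    """Apply the Rescorla-Wagner rule once per trial; return w after each trial.

    rewards : sequence of rewards r, one per trial
    u       : stimulus value (assumed constant across trials)
    eps     : learning rate (hypothetical value)
    w0      : initial weight
    """
    w = w0
    trace = []
    for r in rewards:
        v = w * u            # expected reward
        delta = r - v        # prediction error
        w += eps * delta * u # weight update
        trace.append(w)
    return np.array(trace)

# Learning: reward on every trial (r = 1).
acquisition = rescorla_wagner(np.ones(100))
# Extinction: reward withheld (r = 0), starting from the learned weight.
extinction = rescorla_wagner(np.zeros(100), w0=acquisition[-1])
# Random reward: r = 1 on about half of the trials (seed is arbitrary).
rng = np.random.default_rng(0)
random_reward = rescorla_wagner((rng.random(200) < 0.5).astype(float))
```

With r = 1 the weight climbs toward 1, with r = 0 it decays back toward 0, and with random reward it fluctuates around the mean reward.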
Predicting future reward: Temporal Difference learning
In more realistic conditions, especially in operant conditioning, the actual reward might come some time after the signal for the reward. What we care about is then not the immediate reward at this time point but the total reward predicted given the choice made at this time. How can we estimate the total reward?
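Total average future reward at time t:
$$\left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle$$
Assume that we estimate this with a linear estimator:
$$v(t) = \sum_{\tau=0}^{t} w(\tau)\, u(t-\tau)$$
Use the δ rule at time t:
$$w(\tau) \to w(\tau) + \epsilon\, \delta(t)\, u(t-\tau)$$
where δ is the difference between the actual future rewards and the prediction of these rewards:
$$\delta(t) = \sum_{\tau=0}^{T-t} r(t+\tau) - v(t)$$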
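But we do not know the future rewards $\sum_{\tau=0}^{T-t} r(t+\tau)$ at time t.
Instead we can approximate them by:
$$\sum_{\tau=0}^{T-t} r(t+\tau) \approx r(t) + v(t+1)$$
Which gives us:
$$\delta(t) = r(t) + v(t+1) - v(t) \qquad (1)$$
The temporal difference learning rule then becomes:
$$w(\tau) \to w(\tau) + \epsilon\, \delta(t)\, u(t-\tau) \qquad (2)$$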
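The temporal difference rule can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the slides' own code: the trial length, stimulus time `t_stim`, reward time `t_rew`, learning rate, and the convention v(T) = 0 past the end of the trial are all chosen here for illustration, and all updates within a trial use the prediction computed at the start of that trial.

```python
import numpy as np

def td_trial(w, u, r, eps):
    """Run one trial of TD learning; update w in place, return (v, delta)."""
    T = len(u)
    # Linear estimator: v(t) = sum_{tau=0}^{t} w(tau) u(t - tau)
    v = np.array([np.dot(w[:t + 1], u[t::-1]) for t in range(T)])
    # v(t+1), with v assumed to be 0 after the trial ends
    v_next = np.append(v[1:], 0.0)
    # Equation (1): delta(t) = r(t) + v(t+1) - v(t)
    delta = r + v_next - v
    # Equation (2): w(tau) -> w(tau) + eps * delta(t) * u(t - tau)
    for t in range(T):
        w[:t + 1] += eps * delta[t] * u[t::-1]
    return v, delta

# Hypothetical trial: a brief stimulus followed later by a unit reward.
T, t_stim, t_rew = 30, 5, 20
u = np.zeros(T); u[t_stim] = 1.0
r = np.zeros(T); r[t_rew] = 1.0

w = np.zeros(T)
for trial in range(200):
    v, delta = td_trial(w, u, r, eps=0.5)
```

On the first trial δ is nonzero only at the time of the reward. After training, the prediction v is close to 1 from the stimulus until the reward, and the nonzero δ has moved to the step just before the stimulus, which is the pattern compared with dopamine activity below.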
Dopamine and predicted reward
Activity of VTA dopaminergic neurons in a monkey.
A. Top: before learning; bottom: after learning.
B. After learning. Top: with reward; bottom: no reward.
Generalization of TD(0)
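1. u can be a vector $\mathbf{u}$, so w is also a vector $\mathbf{w}$. This handles more complex stimuli or multiple possible stimuli.
2. A decay term $\gamma$, with $0 \le \gamma \le 1$. Here:
$$\delta = r_a(u) + \gamma\, v(u') - v(u)$$
where u is the current location and u' is the location moved to after action a.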
This has the effect of putting a stronger emphasis on rewards that take fewer steps to reach.
Until now we asked how to predict a reward. We still need to see how to decide which path to take, or what policy to use.
Bee foraging example: a bee chooses between blue (b) and yellow (y) flowers.
Different reward for each flower, drawn from $P(r_b)$ and $P(r_y)$.
Learn “action values” $m_b$ and $m_y$ (the actor); these will determine which choice to make.
Assume $r_b = 1$, $r_y = 2$. What is the best choice we can make?
The average reward is:
$$\langle r \rangle = P[b]\, r_b + P[y]\, r_y$$
What will maximize this reward?
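Use softmax:
$$P[b] = \frac{\exp(\beta m_b)}{\exp(\beta m_b) + \exp(\beta m_y)}; \qquad P[y] = \frac{\exp(\beta m_y)}{\exp(\beta m_b) + \exp(\beta m_y)}$$
This is a stochastic choice; β is a variability parameter (large β makes the choice nearly deterministic, small β makes it more random).
A good choice for the “action values” is to set them to the mean reward:
$$m_b = \langle r_b \rangle; \qquad m_y = \langle r_y \rangle$$
This is also called the “indirect actor”.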
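How good is this choice?
Assume $\beta = 1$, $r_b = 1$, $r_y = 2$, with $m_b = \langle r_b \rangle$ and $m_y = \langle r_y \rangle$. What is $\langle r \rangle$?
$$P[b] = \frac{\exp(r_b)}{\exp(r_b) + \exp(r_y)}; \qquad P[y] = \frac{\exp(r_y)}{\exp(r_b) + \exp(r_y)}$$
>> rb=1; ry=2;
>> pb=exp(rb)/(exp(rb)+exp(ry))
pb = 0.2689
>> py=exp(ry)/(exp(rb)+exp(ry))
py = 0.7311
>> r_av=rb*pb+ry*py
r_av = 1.7311
So $\langle r \rangle \approx 1.73$, below the maximum of 2 obtained by always choosing yellow.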
This choice can be learned using a delta rule:
$$m_x \to m_x + \epsilon \delta; \qquad \delta = r_x - m_x$$
[Figure: learned choices for β=1 and β=50. For t<100: r_b=1, r_y=2. For t>100: r_b=2, r_y=1]
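The indirect actor with its delta rule can be sketched as follows. This is an illustrative Python sketch, not the code behind the slides' figure: the function names, the learning rate, the zero initial action values, the fixed (noise-free) rewards, and the random seed are assumptions. The reward switch at t = 100 follows the figure description.

```python
import numpy as np

def softmax_prob_b(m_b, m_y, beta):
    """P[b] under softmax; algebraically equal to exp(b*m_b)/(exp(b*m_b)+exp(b*m_y))."""
    return 1.0 / (1.0 + np.exp(-beta * (m_b - m_y)))

def run_indirect_actor(beta, eps=0.1, n_steps=200, seed=0):
    """Simulate flower choices; return final action values and the choice sequence."""
    rng = np.random.default_rng(seed)
    m = {"b": 0.0, "y": 0.0}   # action values, assumed to start at 0
    choices = []
    for t in range(n_steps):
        # Rewards swap halfway through, as in the figure description.
        r = {"b": 1.0, "y": 2.0} if t < 100 else {"b": 2.0, "y": 1.0}
        p_b = softmax_prob_b(m["b"], m["y"], beta)
        x = "b" if rng.random() < p_b else "y"
        # Delta rule, applied only to the flower actually visited.
        m[x] += eps * (r[x] - m[x])
        choices.append(x)
    return m, choices

m_low, choices_low = run_indirect_actor(beta=1.0)
m_high, choices_high = run_indirect_actor(beta=50.0)
```

Comparing `choices_low` and `choices_high` shows the trade-off that β controls: a small β keeps sampling both flowers, while a large β commits strongly to whichever flower currently has the higher action value.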
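Another option (the direct actor) is to set the action values to maximize the expected reward:
$$\langle r \rangle = P[b]\, r_b + P[y]\, r_y$$
This can be done by stochastic gradient ascent on $\langle r \rangle$.
For example:
$$\frac{\partial \langle r \rangle}{\partial m_b} = \beta \left( P[b]\,(1 - P[b])\, r_b - P[y]\, P[b]\, r_y \right)$$
So that generally, for action value $m_x$ given action a:
$$m_x \to m_x + \epsilon\, (\delta_{xa} - P[x])\, (r_a - r_0)$$
where $\delta_{xa} = 1$ if $x = a$ and 0 otherwise.
A good choice for $r_0$ is the mean of $r_x$ over all possible choices (see D&A, p. 344).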
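A minimal Python sketch of the direct actor update is given here for illustration. The function name `direct_actor`, the parameter values, fixed noise-free rewards, and the seed are assumptions; r0 is set to the mean of the rewards over the choices, as suggested above.

```python
import numpy as np

def direct_actor(rewards, beta=1.0, eps=0.1, n_steps=2000, seed=0):
    """Direct actor: m_x -> m_x + eps * (delta_xa - P[x]) * (r_a - r0)."""
    rng = np.random.default_rng(seed)
    r = np.asarray(rewards, dtype=float)
    m = np.zeros_like(r)   # action values, assumed to start at 0
    r0 = r.mean()          # baseline: mean reward over all choices
    for _ in range(n_steps):
        # Softmax over action values (max subtracted for numerical stability).
        p = np.exp(beta * (m - m.max()))
        p /= p.sum()
        a = rng.choice(len(r), p=p)   # sample an action
        onehot = np.zeros_like(r)
        onehot[a] = 1.0               # delta_xa as a vector over x
        m += eps * (onehot - p) * (r[a] - r0)
    return m, p

# Index 0 is the blue flower (r_b = 1), index 1 the yellow flower (r_y = 2).
m, p = direct_actor([1.0, 2.0])
```

Because the update terms (δ_xa − P[x]) sum to zero over x, the action values only move relative to one another; in this example the yellow flower's value rises above the blue one's and the policy shifts toward yellow.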
The Maze task and sequential action choice
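Under a random policy, where each turn is taken with probability 1/2:
$$v(B) = \tfrac{1}{2}(0 + 5) = 2.5, \qquad v(C) = \tfrac{1}{2}(0 + 2) = 1, \qquad v(A) = \tfrac{1}{2}\left(v(B) + v(C)\right) = 1.75$$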
Policy evaluation: Initial random policy
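$$w(u) \to w(u) + \epsilon \delta, \qquad \delta(t) = r(t) + v(t+1) - v(t)$$
where $v(u) = w(u)$ is the stored value of location u.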
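Policy evaluation under the random policy can be checked with a small simulation. The sketch below is an assumption-laden illustration: the maze layout (B yields 0 for left and 5 for right, C yields 2 for left and 0 for right) is inferred from the value formulas, and the learning rate, trial count, seed, and the averaging over the second half of the trials are choices made here to smooth out the noise of a fixed learning rate.

```python
import numpy as np

# Assumed rewards at the end arms: (left, right) for each second-level location.
REWARDS = {"B": (0.0, 5.0), "C": (2.0, 0.0)}

def evaluate_random_policy(eps=0.05, n_trials=4000, seed=0):
    """TD policy evaluation with every turn chosen with probability 1/2."""
    rng = np.random.default_rng(seed)
    w = {"A": 0.0, "B": 0.0, "C": 0.0}   # v(u) = w(u)
    history = []
    for _ in range(n_trials):
        # Step 1: at A, turn left to B or right to C; no immediate reward.
        nxt = "B" if rng.random() < 0.5 else "C"
        delta = 0.0 + w[nxt] - w["A"]
        w["A"] += eps * delta
        # Step 2: at B or C, turn at random and collect the reward;
        # the trial then ends, so the next value is 0.
        turn = rng.integers(2)
        delta = REWARDS[nxt][turn] + 0.0 - w[nxt]
        w[nxt] += eps * delta
        history.append((w["A"], w["B"], w["C"]))
    # Average over the second half of the trials to reduce fluctuations.
    mean_v = np.mean(history[n_trials // 2:], axis=0)
    return w, mean_v

w_final, (vA, vB, vC) = evaluate_random_policy()
```

The averaged values come out close to v(A) = 1.75, v(B) = 2.5, and v(C) = 1, while the final weights in `w_final` keep fluctuating around them because ε is fixed.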
Policy evaluation
What would the values be for an ideal policy?
Policy improvement
Use the direct actor to improve the policy:
$$m_x \to m_x + \epsilon\, (\delta_{xa} - P[x])\, (r_a - r_0)$$
Note: policy improvement and policy evaluation are best carried out sequentially: evaluate, improve, evaluate, improve, ...
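At A, with the TD error δ playing the role of $(r_a - r_0)$ and with v(A) = 1.75, v(B) = 2.5, v(C) = 1:
For a left turn: $\delta = 0 + v(B) - v(A) = 0.75$
For a right turn: $\delta = 0 + v(C) - v(A) = -0.75$
So the policy at A shifts toward turning left.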
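The combination of critic and actor on the maze can be sketched in Python. This is an illustration under assumptions, not the slides' procedure: the maze rewards are inferred from the value formulas, the names and parameter values are hypothetical, and for brevity evaluation and improvement are interleaved at every step rather than run as strictly separate phases.

```python
import numpy as np

# Assumed rewards at the end arms: (left, right).
REWARDS = {"B": (0.0, 5.0), "C": (2.0, 0.0)}

def softmax(m, beta):
    p = np.exp(beta * (m - m.max()))
    return p / p.sum()

def actor_critic(eps=0.1, beta=1.0, n_trials=3000, seed=0):
    """Learn values v(u) (critic) and action values m(u) (actor) on the maze."""
    rng = np.random.default_rng(seed)
    v = {s: 0.0 for s in "ABC"}
    m = {s: np.zeros(2) for s in "ABC"}   # index 0 = left, 1 = right

    def step(state):
        p = softmax(m[state], beta)
        a = rng.choice(2, p=p)
        if state == "A":
            nxt = "B" if a == 0 else "C"
            r, v_next = 0.0, v[nxt]
        else:
            nxt = None                     # trial ends after this turn
            r, v_next = REWARDS[state][a], 0.0
        delta = r + v_next - v[state]      # TD error
        v[state] += eps * delta            # critic: policy evaluation
        onehot = np.eye(2)[a]
        m[state] += eps * (onehot - p) * delta   # actor: policy improvement
        return nxt

    for _ in range(n_trials):
        step(step("A"))
    policy = {s: softmax(m[s], beta) for s in "ABC"}
    return v, policy

v, policy = actor_critic()
```

After training, `policy["A"]` favors the left turn toward B, `policy["B"]` favors the right turn (reward 5), and `policy["C"]` favors the left turn (reward 2).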
Reinforcement learning - summary