Summary of part I: prediction and RL

Prediction is important for action selection

• The problem: prediction of future reward

• The algorithm: temporal difference learning

• Neural implementation: dopamine dependent learning in BG

⇒ A precise computational model of learning allows one to look in the brain for “hidden variables” postulated by the model

⇒ Precise (normative!) theory for generation of dopamine firing patterns

⇒ Explains anticipatory dopaminergic responding, second order conditioning

⇒ Compelling account for the role of dopamine in classical conditioning: prediction error acts as signal driving learning in prediction areas

Prediction error hypothesis of dopamine

[figure: measured firing rate vs. model prediction error; Bayer & Glimcher (2005)]

at end of trial: $\delta_t = r_t - V_t$ (just like R-W)

$V_t = \eta \sum_{i=1}^{t} (1-\eta)^{t-i} r_i$
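A minimal sketch in Python (the Bernoulli reward sequence and the value of $\eta$ are illustrative) showing that the closed-form exponentially weighted average above is exactly what the recursive Rescorla-Wagner update computes:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1                                    # learning rate
rewards = rng.binomial(1, 0.5, size=200)     # hypothetical reward sequence

# Recursive R-W/TD update at the end of each trial: V <- V + eta * (r - V)
V = 0.0
for r in rewards:
    delta = r - V                            # prediction error: delta_t = r_t - V_t
    V += eta * delta

# Closed form: V_t = eta * sum_{i=1}^{t} (1 - eta)^(t - i) * r_i
t = len(rewards)
V_closed = eta * sum((1 - eta) ** (t - i) * rewards[i - 1] for i in range(1, t + 1))
print(V, V_closed)                           # identical (given V_0 = 0)
```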

Global plan

• Reinforcement learning I:

– prediction

– classical conditioning

– dopamine

• Reinforcement learning II:

– dynamic programming; action selection

– Pavlovian misbehaviour

– vigor

• Chapter 9 of Theoretical Neuroscience

Action Selection

• Evolutionary specification
• Immediate reinforcement:
  – leg flexion
  – Thorndike puzzle box
  – pigeon; rat; human matching
• Delayed reinforcement:
  – these tasks
  – mazes
  – chess

Bandler; Blanchard

Immediate Reinforcement

• stochastic policy: $P[L;\mathbf{m}] = \sigma\big(\beta(m_L - m_R)\big)$

• based on action values: $m_L,\ m_R$

Indirect Actor

• use RW rule (update $m_L, m_R$ towards the rewards obtained)

• $p_L = 0.05;\ p_R = 0.25$; $r_L, r_R$ switch every 100 trials
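A minimal indirect-actor sketch in Python under the set-up above. The slide gives only the reward parameters 0.05 and 0.25 and the 100-trial switching schedule; treating them as Bernoulli reward probabilities, and the values of β and ε, are assumptions for illustration:

```python
import numpy as np

def sigma(z):                          # logistic squashing function
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
eps, beta = 0.1, 5.0                   # learning rate and inverse temperature (assumed)
mL = mR = 0.0                          # action values
p = {'L': 0.05, 'R': 0.25}             # reward probabilities from the slide

for trial in range(400):
    if trial % 100 == 0 and trial > 0:           # contingencies switch every 100 trials
        p['L'], p['R'] = p['R'], p['L']
    pL = sigma(beta * (mL - mR))                 # stochastic policy P[L; m]
    a = 'L' if rng.random() < pL else 'R'
    r = float(rng.random() < p[a])               # Bernoulli reward
    if a == 'L':                                 # indirect actor: RW rule on chosen value
        mL += eps * (r - mL)
    else:
        mR += eps * (r - mR)
```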

Direct Actor

$E(\mathbf{m}) = P[L]\, r_L + P[R]\, r_R$

$\frac{\partial P[L]}{\partial m_L} = \beta P[L] P[R] = -\frac{\partial P[R]}{\partial m_L}$

$\frac{\partial E(\mathbf{m})}{\partial m_L} = \beta P[L]\big(r_L - (P[L] r_L + P[R] r_R)\big)$

$\frac{\partial E(\mathbf{m})}{\partial m_L} = \beta P[L]\big(r_L - E(\mathbf{m})\big)$

$\frac{\partial E(\mathbf{m})}{\partial m_L} \approx \beta \big(r_L - E(\mathbf{m})\big)$  if L is chosen

$(m_L - m_R) \to (1-\varepsilon)(m_L - m_R) + \varepsilon\big(r_a - E(\mathbf{m})\big)\big([a{=}L] - [a{=}R]\big)$
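A matching direct-actor sketch in Python under the same bandit assumptions as before; a running average $\bar r$ stands in for $E(\mathbf{m})$, and the values of β, ε, and the averaging rate are illustrative:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
eps, beta, eta = 0.1, 5.0, 0.05        # step size, inverse temperature, averaging rate (assumed)
m = 0.0                                # single parameter: m = m_L - m_R
rbar = 0.0                             # running estimate of E(m)
p = {'L': 0.05, 'R': 0.25}

for trial in range(400):
    if trial % 100 == 0 and trial > 0:
        p['L'], p['R'] = p['R'], p['L']
    pL = sigma(beta * m)                         # P[L; m]
    a = 'L' if rng.random() < pL else 'R'
    r = float(rng.random() < p[a])
    sign = 1.0 if a == 'L' else -1.0             # [a=L] - [a=R]
    m = (1 - eps) * m + eps * (r - rbar) * sign  # direct-actor update (reconstructed rule above)
    rbar += eta * (r - rbar)                     # track average reward as proxy for E(m)
```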

Could we Tell?

• correlate past rewards, actions with present choice

• indirect actor (separate clocks):

• direct actor (single clock):

Matching: Concurrent VI-VI

Lau, Glimcher, Corrado, Sugrue, Newsome

Matching

• income not return
• approximately exponential in r
• alternation choice kernel

Action at a (Temporal) Distance

[figure: states x=1, x=2, x=3]

• learning an appropriate action at x=1:
  – depends on the actions at x=2 and x=3
  – gains no immediate feedback

• idea: use prediction as surrogate feedback

Action Selection

start with policy: $P[L;x] = \sigma\big(m_L(x) - m_R(x)\big)$

evaluate it: $V(1), V(2), V(3)$

improve it: $\Delta m_C \propto \delta$; thus choose R more frequently than L

[figure: states x=1, x=2, x=3 with prediction errors δ = 0.025, -0.175, -0.125, 0.125]

Policy

if $\delta > 0$:
• value is too pessimistic ⇒ $\Delta v$
• action is better than average ⇒ $\Delta P$

actor/critic

[figure: actor/critic architecture; policy parameters $m_1, m_2, m_3, \ldots, m_n$]

dopamine signals to both motivational & motor striatum appear, surprisingly, to be the same

suggestion: training both values & policies
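A schematic actor/critic sketch in Python for the three-state problem above; the terminal rewards and the learning rates are invented for illustration and are not taken from the slides:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
alpha, eps = 0.2, 0.2                  # critic and actor learning rates (assumed)
V = np.zeros(4)                        # critic: V[1], V[2], V[3]
m = np.zeros((4, 2))                   # actor: propensities m_L(x), m_R(x)
# Illustrative terminal rewards: (x=2, L)=0, (x=2, R)=1, (x=3, L)=2, (x=3, R)=0
R = {(2, 0): 0.0, (2, 1): 1.0, (3, 0): 2.0, (3, 1): 0.0}

for episode in range(2000):
    # step 1: at x=1, L leads to x=2 and R to x=3; no immediate reward
    x = 1
    a = 0 if rng.random() < sigma(m[x, 0] - m[x, 1]) else 1
    x_next = 2 if a == 0 else 3
    delta = 0.0 + V[x_next] - V[x]     # TD error: prediction as surrogate feedback
    V[x] += alpha * delta              # critic update
    m[x, a] += eps * delta             # actor update: delta > 0 makes a more likely

    # step 2: terminal choice at x=2 or x=3
    x = x_next
    a = 0 if rng.random() < sigma(m[x, 0] - m[x, 1]) else 1
    delta = R[(x, a)] - V[x]
    V[x] += alpha * delta
    m[x, a] += eps * delta
```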

Formally: Dynamic Programming

Variants: SARSA

$Q^*(1,C) = E\big[r_t + V^*(x_{t+1}) \mid x_t = 1, u_t = C\big]$

$Q(1,C) \to Q(1,C) + \varepsilon\big(r_t + Q(2, u_{\text{actual}}) - Q(1,C)\big)$

Morris et al., 2006

Variants: Q learning

$Q^*(1,C) = E\big[r_t + V^*(x_{t+1}) \mid x_t = 1, u_t = C\big]$

$Q(1,C) \to Q(1,C) + \varepsilon\big(r_t + \max_u Q(2,u) - Q(1,C)\big)$

Roesch et al., 2007
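A side-by-side sketch of the two updates in Python (tabular values; state and action labels follow the slides, and the surrounding task loop is omitted):

```python
# Tabular action values for the two-step problem on the slides:
# state 1 (choose C), then state 2 (choose L or R).
Q = {(1, 'C'): 0.0, (2, 'L'): 0.0, (2, 'R'): 0.0}
eps = 0.1                              # learning rate

def sarsa_update(r, u_actual):
    # On-policy: bootstrap from the action actually taken at state 2
    Q[(1, 'C')] += eps * (r + Q[(2, u_actual)] - Q[(1, 'C')])

def q_learning_update(r):
    # Off-policy: bootstrap from the best available action at state 2
    Q[(1, 'C')] += eps * (r + max(Q[(2, 'L')], Q[(2, 'R')]) - Q[(1, 'C')])
```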

Summary

• prediction learning
  – Bellman evaluation
• actor-critic
  – asynchronous policy iteration
• indirect method (Q learning)
  – asynchronous value iteration

$V^*(1) = E\big[r_t + V^*(x_{t+1}) \mid x_t = 1\big]$

$Q^*(1,C) = E\big[r_t + V^*(x_{t+1}) \mid x_t = 1, u_t = C\big]$
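A minimal value-iteration sketch in Python for a made-up two-state MDP; a discount factor γ is added for convergence and is not part of the slides' notation:

```python
import numpy as np

# Made-up MDP: P[a, s, s'] transition probabilities, R[s] rewards, discount gamma
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # action 0
              [[0.1, 0.9], [0.7, 0.3]]])    # action 1
R = np.array([0.0, 1.0])
gamma = 0.9

V = np.zeros(2)
for _ in range(200):
    # Bellman optimality backup: Q(s,a) = R(s) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R[:, None] + gamma * np.einsum('ast,t->sa', P, V)
    V = Q.max(axis=1)                        # V*(s) = max_a Q*(s,a)
```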

Impulsivity & Hyperbolic Discounting

• humans (and animals) show impulsivity in:
  – diets
  – addiction
  – spending, …

• intertemporal conflict between short and long term choices
• often explained via hyperbolic discount functions (see the sketch after this list)

• alternative is Pavlovian imperative to an immediate reinforcer

• framing, trolley dilemmas, etc
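A small numerical illustration of how hyperbolic discounting produces the intertemporal conflict mentioned above (the discount parameter k and the two reward options are invented): the larger-later reward is preferred when both options are distant, but preference reverses to the smaller-sooner reward as they come close.

```python
def hyperbolic_value(r, delay, k=1.0):
    # Hyperbolic discounting: value falls off as 1 / (1 + k * delay)
    return r / (1.0 + k * delay)

for lead_time in (10, 0):                              # evaluate far in advance, then imminently
    ss = hyperbolic_value(2.0, lead_time + 1)          # smaller-sooner: reward 2 at delay 1
    ll = hyperbolic_value(5.0, lead_time + 5)          # larger-later: reward 5 at delay 5
    print(lead_time, 'prefer', 'smaller-sooner' if ss > ll else 'larger-later')
# Prints: larger-later from afar, smaller-sooner when imminent (preference reversal).
```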

Direct/Indirect Pathways

• direct: D1: GO; learn from DA increase
• indirect: D2: noGO; learn from DA decrease
• hyperdirect (STN): delay actions given strongly attractive choices

Frank

• DARPP-32: D1 effect
• DRD2: D2 effect

Three Decision Makers

• tree search
• position evaluation
• situation memory

Multiple Systems in RL

• model-based RL
  – build a forward model of the task, outcomes
  – search in the forward model (online DP)
    • optimal use of information
    • computationally ruinous

• cache-based RL
  – learn Q values, which summarize future worth
    • computationally trivial
    • bootstrap-based; so statistically inefficient

• learn both
  – select according to uncertainty

Animal Canary

• OFC; dlPFC; dorsomedial striatum; BLA?
• dorsolateral striatum, amygdala

Two Systems: Behavioural Effects

Effects of Learning

• distributional value iteration
• (Bayesian Q learning)

• fixed additional uncertainty per step

One Outcome

shallow tree implies goal-directed control wins

Human Canary...

[figure: states a, b, c]

• if a → c and c → £££, then do more of a or b?
  – MB: b
  – MF: a (or even no effect)

Behaviour

• action values depend on both systems:

• expect that β will vary by subject (but be fixed)

$Q_{tot}(x,u) = Q_{MF}(x,u) + \beta\, Q_{MB}(x,u)$
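A minimal sketch of the combination in Python; only the weighted sum $Q_{tot}$ comes from the slide, while the softmax choice rule and the temperature are assumptions for illustration:

```python
import numpy as np

def combined_choice(q_mf, q_mb, beta_mb=0.5, temperature=1.0, rng=None):
    """Choose an action from Q_tot = Q_MF + beta * Q_MB via softmax.

    q_mf, q_mb: action values from the model-free and model-based systems.
    beta_mb: subject-specific weight on the model-based values (assumed fixed).
    """
    rng = rng or np.random.default_rng()
    q_tot = np.asarray(q_mf) + beta_mb * np.asarray(q_mb)
    logits = q_tot / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

# Example: MF prefers action 0, MB prefers action 1; beta_mb arbitrates between them
action = combined_choice(q_mf=[1.0, 0.2], q_mb=[0.0, 2.0], beta_mb=0.8)
```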

Neural Prediction Errors (1→2)

R ventral striatum

• note that MB RL does not use this prediction error – training signal?

R ventral striatum (anatomical definition)

Neural Prediction Errors (1)

• right nucleus accumbens

behaviour

1-2, not 1

Vigour

• Two components to choice:
  – what:
    • lever pressing
    • direction to run
    • meal to choose
  – when/how fast/how vigorous:
    • free operant tasks
    • real-valued DP

The model

[figure: free-operant task. At each state (S0, S1, S2) the agent chooses an (action, τ) pair, e.g. (LP, τ1) then (LP, τ2), or NP/Other; each choice incurs a unit cost (UC) and a vigour cost (VC) that depends on how fast (τ) it is performed, with reward terms UR, PR; costs and rewards accrue over time towards the goal]

The model

Goal: Choose actions and latencies to maximize the average rate of return (rewards minus costs per time)


Average Reward RL

Compute differential values of actions: the differential value of taking action L with latency τ when in state x

$\rho$ = average rewards minus costs, per unit time

• steady state behavior (not learning dynamics)

(Extension of Schwartz 1993)

$Q_{L,\tau}(x) = \text{Rewards} - \text{Costs} + \text{Future Returns}$, where Costs include $C_u + \frac{C_v}{\tau}$ and Future Returns $= V(x') - \rho\tau$

Average Reward Cost/benefit Tradeoffs

1. Which action to take?
⇒ Choose action with largest expected reward minus cost

2. How fast to perform it?
• slow → less costly (vigour cost)
• slow → delays (all) rewards
• net rate of rewards = cost of delay (opportunity cost of time)
⇒ Choose rate that balances vigour and opportunity costs

explains faster (irrelevant) actions under hunger, etc.

masochism

Optimal response rates

[figure: experimental data (Niv, Dayan, Joel, unpublished) and model simulation. Probability of the 1st nose poke after a lever press, and response rate per minute, as a function of seconds since reinforcement]

[figure: matching. Model: % responses on lever A vs. % reinforcements on lever A (perfect matching). Experimental data (Herrnstein 1961): % responses on key A vs. % reinforcements on key A for Pigeon A and Pigeon B, alongside the perfect-matching line]

More:
• # responses
• interval length
• amount of reward
• ratio vs. interval
• breaking point
• temporal structure
• etc.

Effects of motivation (in the model): RR25

$Q(x,u,\tau) = p_r U_R - C_u - \frac{C_v}{\tau} + V(x') - \rho\tau$

$\frac{\partial Q(x,u,\tau)}{\partial \tau} = \frac{C_v}{\tau^2} - \rho = 0 \;\Rightarrow\; \tau_{opt} = \sqrt{\frac{C_v}{\rho}}$

[figure: mean latency of LP vs. Other responses under low and high utility; energizing effect]
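A small numerical check of the optimality condition above (the vigour-cost constant and the reward rates are arbitrary): a higher net reward rate ρ, for instance from higher-utility rewards, shortens the optimal latency, which is the energizing effect of motivation.

```python
import math

def optimal_latency(vigour_cost, rho):
    # From d/dtau [ -C_v / tau - rho * tau ] = C_v / tau**2 - rho = 0
    return math.sqrt(vigour_cost / rho)

C_v = 1.0                      # vigour cost constant (arbitrary)
for rho in (0.5, 2.0):         # low vs. high average reward rate (e.g. low vs. high utility)
    print(f"rho = {rho}: tau_opt = {optimal_latency(C_v, rho):.2f}")
# Higher rho (more valuable rewards) -> shorter latencies (faster responding).
```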

Effects of motivation (in the model)

[figure: response rate per minute vs. seconds from reinforcement for RR25 and for UR 50% (1: directing effect), and mean latency of LP vs. Other responses under low and high utility (2: energizing effect)]

Phasic dopamine firing = reward prediction error

Relation to Dopamine

What about tonic dopamine?

Tonic dopamine = Average reward rate

[figure: lever presses per 30 minutes as a function of ratio requirement (1, 4, 16, 64), control vs. DA depleted: experimental data (Aberman and Salamone 1999) and model simulation]

1. explains pharmacological manipulations
2. dopamine control of vigour through BG pathways

NB. phasic signal RPE for choice/value learning

• eating time confound
• context/state dependence (motivation & drugs?)
• less switching = perseveration

Tonic dopamine hypothesis

…also explains effects of phasic dopamine on response times

[figure: firing rate and reaction time data; Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992]

Sensory Decisions as Optimal Stopping

• consider listening to:

• decision: choose, or sample

Optimal Stopping

• equivalent of state u=1 is $u_1 = n_1$

• and states u=2,3 is $u_2 = \frac{1}{2}(n_1 + n_2)$

• $\sigma = 2.5$; $C_r = -0.1$

Transition Probabilities

Computational Neuromodulation

• dopamine
  – phasic: prediction error for reward
  – tonic: average reward (vigour)

• serotonin
  – phasic: prediction error for punishment?

• acetylcholine
  – expected uncertainty?

• norepinephrine
  – unexpected uncertainty; neural interrupt?

Conditioning

• Ethology
  – optimality
  – appropriateness

• Psychology
  – classical/operant conditioning

• Computation
  – dynamic progr.
  – Kalman filtering

• Algorithm
  – TD/delta rules
  – simple weights

• Neurobiology
  – neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

prediction: of important events
control: in the light of those predictions

class of stylized tasks with states, actions & rewards

– at each timestep t the world takes on state $s_t$ and delivers reward $r_t$, and the agent chooses an action $a_t$

Markov Decision Process

World: You are in state 34. Your immediate reward is 3. You have 3 actions.

Robot: I'll take action 2.

World: You are in state 77. Your immediate reward is -7. You have 2 actions.

Robot: I'll take action 1.

World: You're in state 34 (again). Your immediate reward is 3. You have 3 actions.

Markov Decision Process

Stochastic process defined by:
– reward function: $r_t \sim P(r_t \mid s_t)$
– transition function: $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$

Markov property
– future conditionally independent of past, given $s_t$
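A toy sketch of this interface in Python; the dictionary-based representation and the state/action labels are illustrative, echoing the dialogue above:

```python
import random
from dataclasses import dataclass

@dataclass
class MDP:
    """A tiny tabular MDP: P[s][a] is a list of (probability, next_state) pairs,
    and R[s] samples the reward delivered in state s."""
    P: dict
    R: dict

    def step(self, s, a, rng=random):
        # transition function: s' ~ P(s' | s, a)
        probs, states = zip(*[(prob, s2) for prob, s2 in self.P[s][a]])
        s_next = rng.choices(states, weights=probs)[0]
        # reward function: r ~ P(r | s)
        r = self.R[s_next]()
        return s_next, r

# A toy instance echoing the dialogue above:
mdp = MDP(
    P={34: {2: [(1.0, 77)]}, 77: {1: [(1.0, 34)]}},
    R={34: lambda: 3, 77: lambda: -7},
)
s, r = mdp.step(34, 2)    # -> (77, -7)
```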

The optimal policy

Definition: a policy such that at every state, its expected value is better than (or equal to) that of all other policies

Theorem: For every MDP there exists (at least) one deterministic optimal policy.

• by the way, why is the optimal policy just a mapping from states to actions? couldn't you earn more reward by choosing a different action depending on last 2 states?

Pavlovian & Instrumental Conditioning

• Pavlovian
  – learning values and predictions
  – using TD error

• Instrumental
  – learning actions:
    • by reinforcement (leg flexion)
    • by (TD) critic

– (actually different forms: goal directed & habitual)

Pavlovian-Instrumental Interactions

• synergistic
  – conditioned reinforcement
  – Pavlovian-instrumental transfer
    • Pavlovian cue predicts the instrumental outcome
    • behavioural inhibition to avoid aversive outcomes

• neutral
  – Pavlovian-instrumental transfer
    • Pavlovian cue predicts outcome with same motivational valence

• opponent
  – Pavlovian-instrumental transfer

• Pavlovian cue predicts opposite motivational valence

– negative automaintenance

-ve Automaintenance in Autoshaping

• simple choice task
  – N: nogo gives reward r=1
  – G: go gives reward r=0

• learn three quantities
  – average value
  – Q value for N
  – Q value for G

• instrumental propensity is

-ve Automaintenance in Autoshaping

• Pavlovian action
  – assert: Pavlovian impetus towards G is v(t)
  – weight Pavlovian and instrumental advantages by ω – competitive reliability of Pavlov

• new propensities

• new action choice
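The transcript omits the exact form of the new propensities; the following Python sketch is one plausible ω-weighted combination under a softmax with μ (μ = 5 appears on the next slide), and the specific mixing rule is an assumption rather than the slides' formula:

```python
import numpy as np

def action_probabilities(q_nogo, q_go, v, omega, mu=5.0):
    """Softmax over propensities that mix instrumental and Pavlovian terms.

    q_nogo, q_go: learned instrumental Q values for N and G.
    v: Pavlovian value prediction, asserted to push towards G.
    omega: weight on the Pavlovian impetus (this mixing form is assumed).
    mu: softmax inverse temperature.
    """
    prop_nogo = (1 - omega) * q_nogo
    prop_go = (1 - omega) * q_go + omega * v    # Pavlovian impetus added to 'go'
    logits = mu * np.array([prop_nogo, prop_go])
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Example: N earns r=1 and G earns r=0, yet a large omega still drives going
print(action_probabilities(q_nogo=1.0, q_go=0.0, v=0.8, omega=0.6))
```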

-ve Automaintenance in Autoshaping

• basic –ve automaintenance effect (µ=5)

• lines are theoretical asymptotes

• equilibrium probabilities of action