scaling up RL with function approximation · compositional features, compositional problem solving...


Transcript of scaling up RL with function approximation · compositional features, compositional problem solving...

Page 1:

scaling up RL with function approximation

Page 2:

Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015. http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

• pixel input

• 18 joystick/button positions output

• change in game score as feedback

• convolutional net representing Q

• backpropagation for training!

human-level game control
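The control loop the bullets describe can be sketched as ε-greedy action selection over the network's Q-values. A minimal sketch: the 18-way action space is from the slide, but the helper name and toy Q-values are ours, standing in for the convolutional net's output.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.zeros(18)        # one Q-value per joystick/button position
q[3] = 1.0              # pretend the network prefers action 3
action = epsilon_greedy(q, epsilon=0.0, rng=rng)   # epsilon=0: purely greedy
```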

Page 3:

neural network

Page 4:

convolution, weight sharing, and pooling

[Figure: a pixel input convolved with a shared feature detector/kernel/filter w yields a feature map, which is then pooled with max(window)]

fewer parameters due to sharing and pooling!
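A minimal sketch of why sharing and pooling shrink the parameter count (toy sizes; the function names are ours): one shared 3x3 kernel carries 9 weights, while a dense layer mapping the same input to the same feature map would need input_size x output_size weights.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (valid cross-correlation)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Keep the max of each non-overlapping size x size window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(64.0).reshape(8, 8)
kernel = np.ones((3, 3)) / 9.0           # the shared feature detector w: 9 weights
fmap = conv2d_valid(image, kernel)       # 6x6 feature map from those 9 weights
pooled = max_pool(fmap)                  # 3x3 after 2x2 max pooling
dense_params = image.size * fmap.size    # a dense layer would need 64 * 36 weights
```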

Page 5:

reverse projections of neuron outputs in pixel space

what does a deep neural network do?

Page 6:
Page 7:
Page 8:

compositional features, compositional problem solving: multiplication (circuit design)

<— composed of adding numbers

<— composed of adding bits

output: x·y <— multiply <— adding numbers <— adding bits <— input: x and y
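The composition on this slide can be written out directly: a one-bit full adder, ripple-carry addition built only from that adder, and shift-and-add multiplication built only from that addition. A sketch, not a circuit netlist; function names are ours.

```python
def add_bits(a, b, carry):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    total = a + b + carry
    return total % 2, total // 2

def add_numbers(x, y):
    """Ripple-carry addition of non-negative ints, built from the bit adder."""
    result, carry, bit = 0, 0, 0
    while x or y or carry:
        s, carry = add_bits(x & 1, y & 1, carry)
        result |= s << bit
        x, y, bit = x >> 1, y >> 1, bit + 1
    return result

def multiply(x, y):
    """Shift-and-add multiplication, built from addition."""
    result, bit = 0, 0
    while y:
        if y & 1:                             # this bit of y contributes x << bit
            result = add_numbers(result, x << bit)
        y, bit = y >> 1, bit + 1
    return result
```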

human knowledge organisation

find roots of a linear expression

<— composed of setting the expression to zero and solving linear equations

<— composed of rearranging terms

output: x = -2 <— (find roots of x + 2) <— (set x + 2 = 0), (solve) <— (rearrange) <— input: x, +, 2, =, 0

deep layers make representing knowledge and processes possible with fewer neurons!

Page 9: scaling up RL with func1on approxima1on · composi1onal features composi1onal problem solving mulplicaon (circuit design)

backpropagation? What is the target against which to minimise error?

Page 10: scaling up RL with func1on approxima1on · composi1onal features composi1onal problem solving mulplicaon (circuit design)

practically speaking… minimise MSE by SGD
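A minimal sketch of what "minimise MSE by SGD" means here, using linear function approximation in place of the deep net: the TD target r + γ max_a′ Q(s′, a′) is treated as a constant, and one gradient step shrinks the squared TD error. All names and sizes are illustrative.

```python
import numpy as np

def sgd_q_step(w, phi_s, a, r, phi_s_next, gamma=0.99, lr=0.1):
    """One SGD step on the squared TD error for linear Q(s, a) = w[a] @ phi(s)."""
    target = r + gamma * np.max(w @ phi_s_next)   # held fixed during the step
    td_error = target - w[a] @ phi_s
    w[a] += lr * td_error * phi_s                 # gradient step on 0.5 * td_error**2
    return td_error

w = np.zeros((2, 4))                       # 2 actions, 4 features
phi = np.array([0.5, 0.25, 0.1, 0.05])     # features of some state s
for _ in range(500):                       # terminal next state: phi(s') = 0
    sgd_q_step(w, phi, a=0, r=1.0, phi_s_next=np.zeros(4))
```

After enough steps, Q(s, 0) = w[0] @ phi converges to the reward 1.0.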

Page 11:

experience replay

save the current transition (s, a, r, s′) in memory every time step

randomly sample a set of (s, a, r, s′) from memory for training the Q network

(instead of learning from the current state transition)

every step ⇒ i.i.d. + learning from the past

[Diagram: agent in state sₜ takes action aₜ, receives reward rₜ₊₁ and next state sₜ₊₁]
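A sketch of the replay memory described above (class and method names are ours): a fixed-size buffer that stores every transition and hands back uniform random minibatches.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (s, a, r, s_next) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # old transitions fall off the front

    def save(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Uniform random minibatch, breaking temporal correlation."""
        return random.sample(list(self.buffer), batch_size)

memory = ReplayMemory(capacity=100)
for t in range(150):                 # 150 saves overflow capacity; oldest dropped
    memory.save(t, t % 4, float(t), t + 1)
batch = memory.sample(32)
```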

Page 12:

freezing target Q

a moving target ⇒ oscillations

stabilise learning by fixing ("freezing") the target, moving it every now and then
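A toy sketch of the freezing schedule: the target parameters are a copy of the online ones, refreshed only every few steps (the numbers and names are illustrative, not from the slide).

```python
import numpy as np

def train_with_frozen_target(steps, sync_every):
    """Online params w drift every step; target params w_t are a frozen copy,
    refreshed only every `sync_every` steps."""
    w = np.zeros(3)
    w_t = w.copy()                     # frozen target parameters
    syncs = 0
    for step in range(1, steps + 1):
        w += 0.1                       # stand-in for a gradient update
        if step % sync_every == 0:     # "move the target every now and then"
            w_t = w.copy()
            syncs += 1
    return w, w_t, syncs

w, w_t, syncs = train_with_frozen_target(steps=90, sync_every=25)
```

Between syncs the target stays put: here the online params reach 9.0 while the target still holds the values copied at step 75.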

Page 13:

double DQN

decouple selection of the target action from evaluation of the target action, instead of using a single max_a′ Q for both

Deep Reinforcement Learning with Double Q-learning, van Hasselt et al., AAAI 2016. https://arxiv.org/pdf/1509.06461v3.pdf
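The decoupling can be shown side by side: vanilla DQN lets the target network both select and evaluate the bootstrap action, while double DQN selects with the online network and evaluates with the target network. Toy Q-values; function names are ours.

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates."""
    return r + gamma * np.max(q_next_target)

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: online net selects the action, target net evaluates it."""
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]

q_online = np.array([1.0, 5.0, 2.0])   # online net prefers action 1
q_target = np.array([4.0, 2.0, 3.0])   # target net (over)estimates action 0
vanilla = dqn_target(1.0, q_target)                   # 1 + 0.99 * 4.0
double = double_dqn_target(1.0, q_online, q_target)   # 1 + 0.99 * 2.0
```

When the target net overestimates some action, the double target is smaller, which is the overestimation-reduction effect the paper reports.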

Page 14:

prioritised experience replay

sample (s, a, r, s′) from memory based on surprise

Prioritised Experience Replay, Schaul et al., ICLR 2016. https://arxiv.org/pdf/1511.05952v4.pdf
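A sketch of proportional prioritisation, with sampling probability ∝ |TD error|^α as in the paper ("surprise" = magnitude of TD error; the function name and toy values are ours):

```python
import numpy as np

def prioritised_sample(td_errors, batch_size, alpha=0.6, rng=None):
    """Sample transition indices with probability proportional to |TD error|^alpha.
    alpha=0 recovers uniform replay."""
    rng = rng or np.random.default_rng()
    priorities = (np.abs(td_errors) + 1e-6) ** alpha   # small constant: nothing starves
    probs = priorities / priorities.sum()
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    return idx, probs

td = np.array([0.01, 0.01, 5.0, 0.01])   # transition 2 is very "surprising"
idx, probs = prioritised_sample(td, batch_size=32, rng=np.random.default_rng(0))
```

The surprising transition dominates the minibatch, so the network revisits it far more often than uniform replay would.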

Page 15:

Combining decoupling (double), prioritised replay, and duelling helps!

duelling architecture

Q(s, a) = V(s; u) + A(s, a; v)

[Figure: a single-stream Q network vs. the duelling network, which splits into a value stream (params u) and an advantage stream (params v) before recombining into Q]

Dueling Network Architectures for Deep RL, Wang et al., ICML 2016. https://arxiv.org/pdf/1511.06581v3.pdf
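The aggregation can be sketched directly. Note that the paper subtracts the mean advantage so that V and A are identifiable, a detail the slide's Q = V + A elides; the toy values below are ours.

```python
import numpy as np

def duelling_q(value, advantages):
    """Combine value stream V(s; u) and advantage stream A(s, a; v) into Q.
    Subtracting the mean advantage (as in the paper) makes V and A identifiable."""
    return value + (advantages - advantages.mean())

v = 2.0                           # scalar state value from stream u
a = np.array([1.0, -1.0, 0.0])    # per-action advantages from stream v
q = duelling_q(v, a)              # one Q-value per action
```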

Page 16:

however, training is SLOW

Page 17:

making deep RL faster and wilder (more applicable in the real world)!

Page 18:

data-efficient exploration?

parallelism?

transfer learning?

making use of a model?

Page 19:

[Figure: grids of parallel Q networks alongside their frozen target networks Qt]

shared params for Q and target Q

parallel learners getting individual experiences

lock-free param updates

Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., ICML 2016. http://jmlr.org/proceedings/papers/v48/mniha16.pdf
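A toy, Hogwild-style sketch of lock-free parameter updates from parallel learners: every thread writes into the same shared array without taking a lock. The "gradients" are random stand-ins, and CPython's GIL makes this safe enough for a demo; real implementations share the actual network weights this way.

```python
import threading
import numpy as np

# shared parameters updated by several learner threads without any locks
shared_w = np.zeros(4)

def learner(n_steps, rng):
    global shared_w
    for _ in range(n_steps):
        grad = rng.standard_normal(4) * 0.01   # stand-in for a real gradient
        shared_w += grad                       # in-place update, no lock taken

threads = [threading.Thread(target=learner, args=(100, np.random.default_rng(i)))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```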

Page 20:

code for you to play with...

Telenor's own implementation of asynchronous deep RL: https://github.com/traai/async-deep-rl

Let's keep the conversation going: https://openrl.slack.com