
scaling up RL with function approximation

arjun.chandra@telenor.com

Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015. http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

• pixel input

• 18 joystick/button positions output

• change in game score as feedback

• convolutional net representing Q

• backpropagation for training!

human level game control
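A minimal sketch of that Q-network, assuming PyTorch (the framework choice is mine; the layer shapes are the Nature paper's: three conv layers, a 512-unit fully connected layer, one output per action):

import torch.nn as nn

class QNetwork(nn.Module):
    """Pixels in, one Q-value per joystick/button position out."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 4 stacked 84x84 frames
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # Q(s, a) for all actions at once
        )

    def forward(self, pixels):
        return self.net(pixels)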

neuralnetwork

convolution, weight sharing, and pooling

[figure: pixel input convolved with a shared feature detector (kernel/filter w) produces a feature map, which is then pooled by taking max(window)]

fewer parameters due to sharing and pooling!
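A toy numpy sketch of both operations (array shapes and the 2x2 pooling window are illustrative choices, not from the slides):

import numpy as np

def convolve2d(pixels, w):
    """Slide one shared kernel w over the image: every output position
    reuses the same few weights instead of having its own."""
    H, W = pixels.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(pixels[i:i + k, j:j + k] * w)
    return out

def max_pool(feature_map, window=2):
    """Keep only the max of each non-overlapping window: fewer values,
    and no parameters at all."""
    H, W = feature_map.shape
    trimmed = feature_map[:H - H % window, :W - W % window]
    return trimmed.reshape(H // window, window, W // window, window).max(axis=(1, 3))

pixels = np.random.rand(84, 84)
w = np.random.rand(8, 8)  # one shared 8x8 filter: just 64 weights for the whole image
pooled = max_pool(convolve2d(pixels, w))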

reverse projections of neuron outputs in pixel space

what does a deep neural network do?

backpropagation? What is the target against which to minimise error?

practically speaking… minimise MSE by SGD
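The target here is the one from the Nature paper: a bootstrapped one-step return computed with frozen parameters θ⁻ (freezing is covered a few slides down). In LaTeX, the objective minimised by SGD is

L(\theta) = \mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right]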

experience replay

save current transition (s, a, r, s') in memory every time step

randomly sample a set of (s, a, r, s') from memory for training the Q network

(instead of learning from the current state transition)

every step = i.i.d. + learn from the past
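A minimal replay memory sketch in Python (capacity and batch size are illustrative):

import random
from collections import deque

class ReplayMemory:
    """Stores transitions and hands back i.i.d.-ish minibatches."""
    def __init__(self, capacity=1_000_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions fall off the end

    def save(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))  # called every time step

    def sample(self, batch_size=32):
        # uniform random draw breaks the correlation between consecutive steps
        return random.sample(self.memory, batch_size)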

[figure: agent-environment loop; at time t the agent in state s_t takes action a_t, receiving reward r_{t+1} and next state s_{t+1}]

freezing target Q

a moving target => oscillations

stabilise learning by fixing (freezing) the target network, moving it every now and then
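A sketch of the freeze-and-move schedule, reusing the PyTorch QNetwork above (the 10,000-step period is the Nature paper's choice; helper names are mine):

import copy

q_net = QNetwork()
target_net = copy.deepcopy(q_net)   # frozen copy that targets are computed from
for p in target_net.parameters():
    p.requires_grad_(False)         # never trained directly

SYNC_EVERY = 10_000

def maybe_move_target(step):
    if step % SYNC_EVERY == 0:      # every now and then...
        target_net.load_state_dict(q_net.state_dict())  # ...move the frozen target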

double DQN

y = r + γ Q_target(s', argmax_{a'} Q(s', a'))

selection of target action: the argmax, by the online Q network

evaluation of target action: by the target network, instead of a single max_{a'} Q_target(s', a')

Deep Reinforcement Learning with Double Q-learning, van Hasselt et al., AAAI 2016. https://arxiv.org/pdf/1509.06461v3.pdf
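The decoupling in code (a numpy sketch; q_online and q_frozen are assumed functions returning a vector of Q-values for s'):

import numpy as np

def double_dqn_target(r, s_next, q_online, q_frozen, gamma=0.99):
    """Online net *selects* the action, frozen net *evaluates* it."""
    a_star = np.argmax(q_online(s_next))         # selection of target action
    return r + gamma * q_frozen(s_next)[a_star]  # evaluation of target action

def dqn_target(r, s_next, q_frozen, gamma=0.99):
    """Vanilla DQN: the same net both selects and evaluates,
    which systematically over-estimates Q-values."""
    return r + gamma * np.max(q_frozen(s_next))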

prioritised experience replay

sample (s, a, r, s') from memory based on surprise

Prioritised Experience Replay, Schaul et al., ICLR 2016. https://arxiv.org/pdf/1511.05952v4.pdf
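"Surprise" is the magnitude of the TD error. A proportional-prioritisation sketch in numpy (α = 0.6 follows the paper; its importance-sampling correction is omitted here):

import numpy as np

def sample_prioritised(td_errors, batch_size=32, alpha=0.6, eps=1e-6):
    """Pick transition indices with probability ~ |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha  # eps: nothing gets zero chance
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

td_errors = np.random.randn(1000)          # one per stored transition
batch_idx = sample_prioritised(td_errors)  # surprising transitions replay more often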

combining decoupling (double), prioritised replay, and duelling helps!

duelling architecture

Q(s, a) = V(s; u) + A(s, a; v)

[figure: standard single-stream Q network vs. duelling network that splits into a value stream (params u) and an advantage stream (params v), recombined into Q]

Dueling Network Architectures for Deep RL, Wang et al., ICML 2016. https://arxiv.org/pdf/1511.06581v3.pdf

ac1ona’sadvantage

V~avgQQ1

Q2

Q3

Q4

V~avgQQ1

Q2

Q3

Q4

arrowsrepresentadvantageofanac1on

advantage?e.g.4ac1ons,

V=E[Q]~Qaveragedacrossac1ons
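A sketch of the duelling head in PyTorch (the mean-subtraction is the aggregation the paper actually uses, so that V and A stay identifiable and V ≈ avg Q):

import torch.nn as nn

class DuellingHead(nn.Module):
    """Splits shared features into V(s; u) and A(s, a; v), then recombines."""
    def __init__(self, n_features=512, n_actions=4):
        super().__init__()
        self.value = nn.Linear(n_features, 1)               # params u
        self.advantage = nn.Linear(n_features, n_actions)   # params v

    def forward(self, features):
        v = self.value(features)      # V(s)
        a = self.advantage(features)  # A(s, a)
        # subtract the mean advantage so the value stream really tracks avg Q
        return v + a - a.mean(dim=1, keepdim=True)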

however training is SLOW

making deep RL faster and wilder (more applicable in the real world)!

data efficient exploration?

parallelism?

transfer learning?

making use of a model?

[figure: grid of parallel learners, each with its own Q and target Qt, all drawing on one shared parameter set]

shared params for Q and target Q

parallel learners getting individual experiences

lock-free param updates

Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., ICML 2016. http://jmlr.org/proceedings/papers/v48/mniha16.pdf
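A toy sketch of the lock-free (Hogwild-style) scheme with Python threads and one shared numpy parameter vector (real speedups need processes or a GIL-free runtime; this only shows the structure, and the gradient is a stand-in):

import threading
import numpy as np

shared_theta = np.zeros(100)  # one parameter set shared by every learner

def learner(rank, steps=1000):
    rng = np.random.default_rng(rank)  # each learner collects its own experience
    for _ in range(steps):
        grad = rng.standard_normal(shared_theta.shape)  # stand-in for a Q-learning gradient
        shared_theta[:] -= 1e-3 * grad  # applied with no lock: updates may collide, and that's OK

threads = [threading.Thread(target=learner, args=(rank,)) for rank in range(8)]
for t in threads: t.start()
for t in threads: t.join()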

code for you to play with...

Telenor's own implementation of asynchronous deep RL: https://github.com/traai/async-deep-rl

Let's keep the conversation going: https://openrl.slack.com

Join using your student.matnat.uio.no email address!

Intern with us in summer 2017! Send your CV and motivation to:

arjun.chandra@telenor.com | tina@telenordigital.com (2 positions)

The topic is still to be finalised, but it will be at the intersection of software engineering and machine learning.