scaling up RL with func1on approxima1on€¦ · •backpropaga-on for training! human level game...
Transcript of scaling up RL with func1on approxima1on€¦ · •backpropaga-on for training! human level game...
scalingupRLwithfunc1onapproxima1on
Human-levelcontrolthroughdeepreinforcementlearning,Mnihet.al.,Nature518,Feb2015hHp://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
•pixelinput
•18joys-ck/bu3onposi-onsoutput
•changeingamescoreasfeedback
• convolu-onalnetrepresen-ngQ
•backpropaga-onfortraining!
humanlevelgamecontrol
neuralnetwork
convolu1on,weightsharing,andpooling
pixel
sharedfeaturedetector/kernel/filterw
pooled
featuremap
max(window)
fewerparametersduetosharingandpooling!
reverseprojec-onsofneuronoutputsinpixelspace
whatdoesadeepneuralnetworkdo?
backpropaga1on?Whatisthetargetagainstwhichtominimiseerror?
prac1callyspeaking…minimiseMSEbySGD
experiencereplay
savecurrenttransi-on(s,a,r,s’)inmemory
every1mestep
randomlysampleasetof(s,a,r,s’)frommemoryfortrainingQnetwork
(insteadoflearningfromcurrentstatetransi-on)
everystep=i.i.d+learnfromthepast
at
st
st+1
rt+1
freezingtargetQmovingtarget=>oscilla-ons
stabiliselearningbyfixingtarget,movingiteverynowandthen
freeze
doubleDQN
evalua1onoftargetac1on
selec1onoftargetac1on
maxQa’
DeepReinforcementLearningwithDoubleQ-learningvanHasseltet.al.,AAAI2016hHps://arxiv.org/pdf/1509.06461v3.pdf
priori1sedexperiencereplay
sample(s,a,r,s’)frommemory
basedonsurprise
Priori1sedExperienceReplaySchaulet.al.,ICLR2016hHps://arxiv.org/pdf/1511.05952v4.pdf
Combiningdecoupling(double),priori1sedreplay,andduellinghelps!
duellingarchitecture
Q(s,a)=V(s,u)+A(s,a,v)
u
v
Q
Q
DuelingNetworkArchitecturesforDeepRLWanget.al.,ICML2016hHps://arxiv.org/pdf/1511.06581v3.pdf
ac1ona’sadvantage
V~avgQQ1
Q2
Q3
Q4
V~avgQQ1
Q2
Q3
Q4
arrowsrepresentadvantageofanac1on
advantage?e.g.4ac1ons,
V=E[Q]~Qaveragedacrossac1ons
howevertrainingis
SLOW
makingdeepRLfasterandwilder(more
applicableintherealworld)!
dataefficientexplora1on?
parallelism?
transferlearning?
makinguseofamodel?
Q Q Q Q Q
Q Q Q Q Q
Q Qt
Qt Qt Qt Qt Qt
Qt Qt Qt Qt Qt
sharedparamsforQandtargetQ
parallellearnersgemngindividual
experiences
lock-freeparamupdates
AsynchronousMethodsforDeepReinforcementLearning,Mnihet.al.,ICML2016hHp://jmlr.org/proceedings/papers/v48/mniha16.pdf
codeforyoutoplaywith...
Telenor’sownimplementa1onofasynchronousdeepRL:hHps://github.com/traai/async-deep-rl
hHps://openrl.slack.comLet’skeeptheconversa1ongoing:
Joinusingyourstudent.matnat.uio.noemailaddress!
intern with us in summer 2017 CV and motivation to:
[email protected] | [email protected] (2 positions)
topic still to be finalised, but it will be at the intersection of software
engineering and machine learning