Asynchronous Methods for Deep Reinforcement Learning
Paper by Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu
Presented by: Pihel Saatmann
Reinforcement learning
• State – "snapshot" of the environment
• Action – leads to a new state, sometimes a reward
• Reward – time-delayed, sparse
• Policy – rules for choosing an action
So far
• It was thought that online RL algorithms with deep neural networks are unstable.
• Problems: correlated and non-stationary input data.
• To counter these problems, data can be stored in an experience replay memory.
• This uses more memory and computational power.
• Deep RL methods require specialized hardware (GPUs) or massive distributed architectures.
Q-learning
• At each time step t, the agent receives a state s_t and selects an action a_t according to its policy π. The agent then receives the next state s_{t+1} and a scalar reward r_t.
• The goal is to maximize the expected return from each state s_t.
• The Q function estimates the action's value.
• Each time the agent takes an action, the Q value is updated.
• Off-policy method – updating the Q function does not depend on the policy.
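The tabular form of this update can be sketched as follows; the step size `alpha` and discount `gamma` values are illustrative, not from the slides:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One off-policy Q-learning update on a tabular Q array.

    The target bootstraps with the greedy action in s_next,
    regardless of which action the behavior policy will take.
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

In deep RL the table is replaced by a neural network Q(s, a; Θ), but the target has the same shape.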
Asynchronous RL framework
• Instead of experience replay, they asynchronously execute multiple agents in parallel on multiple instances of the environment.
• Parallel actor-learners have a stabilizing effect on training.
• Runs on a single machine with a standard multi-core CPU.
Asynchronous RL framework II
• Asynchronous variants of four standard RL algorithms:
• 1-step Q-learning
• n-step Q-learning
• 1-step Sarsa
• Advantage actor-critic (A3C)
1-step Q-learning
• A neural network is used to approximate the Q(s, a; Θ) function.
• The parameters (weights) Θ are learned by iteratively minimizing a sequence of loss functions, where the i-th loss function is defined as:

L_i(Θ_i) = E[(r + γ max_{a'} Q(s', a'; Θ_{i−1}) − Q(s, a; Θ_i))²]
Asynchronous 1-step Q-learning
• Each thread has its own copy of the environment.
• At each step it computes a gradient of the Q-learning loss.
• Gradients are accumulated over multiple time steps before being applied.
• A shared and slowly changing target network is used.
Asynchronous 1-step Sarsa
• Same as 1-step Q-learning, but uses a different target value: instead of the greedy target r + γ max_{a'} Q(s', a'; Θ⁻), Sarsa bootstraps with the action a' actually taken in the next state, r + γ Q(s', a'; Θ⁻).
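The difference between the two targets can be made concrete with a small sketch, assuming a tabular row of Q-values for the next state:

```python
import numpy as np

def q_learning_target(r, q_next_row, gamma=0.99):
    # Off-policy: bootstrap with the greedy action's value in the next state.
    return r + gamma * np.max(q_next_row)

def sarsa_target(r, q_next_row, a_next, gamma=0.99):
    # On-policy: bootstrap with the value of the action actually taken next.
    return r + gamma * q_next_row[a_next]
```

When the behavior policy happens to pick the greedy action, the two targets coincide; otherwise Sarsa evaluates the policy being followed.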
Asynchronous n-step Q-learning
• Potentially faster way to propagate rewards.
• Uses a 'forward view' – selects actions using its policy for up to n steps into the future.
• Receives up to t_max rewards since the last update.
• Total accumulated return: R_t = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k max_a Q(s_{t+k}, a; Θ⁻).
• The value function is updated after every t_max actions or after a terminal state.
• Each update uses the longest possible n-step return.
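The "longest possible n-step return" per step can be computed with a single backward sweep over the rollout; this is a generic sketch, with `bootstrap_value` standing in for max_a Q(s_{t+k}, a; Θ⁻) (or 0 at a terminal state):

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Backward sweep: the last step gets a 1-step return, the step
    before it a 2-step return, and so on, so earlier steps see more
    real rewards before bootstrapping."""
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))
```

Sweeping backwards lets a single terminal reward influence every state in the rollout in one update, which is why rewards propagate faster than with 1-step targets.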
Asynchronous advantage actor-critic
• On-policy method – has a policy and an estimated value function.
• Uses the 'forward view'.
• Receives up to t_max rewards since the last update.
• The policy and value functions are updated after every t_max actions or after a terminal state.
• Each update uses the longest possible n-step return.
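The loss terms of one A3C rollout can be sketched numerically as below. This treats log-probabilities and values as plain floats (in a real implementation they are autodiff tensors and the advantage is held constant in the policy term); the entropy bonus weight `beta` is an illustrative hyperparameter:

```python
def a3c_losses(log_probs, values, rewards, bootstrap_value,
               gamma=0.99, beta=0.01, entropies=None):
    """Loss terms from one t_max-step rollout (scalar sketch)."""
    R = bootstrap_value
    policy_loss, value_loss = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R                 # n-step return
        advantage = R - values[t]                  # how much better than expected
        policy_loss += -log_probs[t] * advantage   # actor: policy-gradient term
        value_loss += advantage ** 2               # critic: squared error to R
    if entropies is not None:
        policy_loss -= beta * sum(entropies)       # entropy bonus for exploration
    return policy_loss, value_loss
```

The actor is pushed toward actions whose n-step return exceeded the critic's estimate, while the critic is regressed toward the return itself.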
Performance evaluation
• Four different platforms:
• Atari 2600 – various games
• TORCS – 3D car racing simulator
• MuJoCo – physics simulator for continuous motor control (A3C only)
• Labyrinth – finding rewards in randomly generated 3D mazes (A3C only)
Atari 2600 games
• All four methods can successfully train neural network controllers.
• Asynchronous methods are mostly faster than DQN (Deep Q-Network).
• Advantage actor-critic was the best.
A3C on 57 Atari games
TORCS Car Racing Simulator
• Evaluated only the A3C algorithm.
• The agent had to drive a race car using only raw pixels as input.
• During training, the agent was rewarded for maintaining high velocity along the center of the racetrack.
https://youtu.be/0xo1Ldx3L5Q
MuJoCo Physics Simulator
• Evaluated only the A3C algorithm.
• Rigid-body physics with contact dynamics.
• Continuous actions.
• In all problems A3C found good solutions in less than 24 hours of training (typically a few hours).
https://youtu.be/0xo1Ldx3L5Q
Labyrinth
• The agent was placed in a random maze and had 60 s to collect points.
• Apples – 1 point
• Portals – 10 points; respawned the apples and the agent in random locations
• Visual input only.
• The agent learned a reasonably good general strategy for exploring random mazes.
https://youtu.be/nMR5mjCFZCw
Scalability
• The framework scales well with the number of parallel workers.
• It even shows superlinear speedups for some methods.
Robustness and stability
• Models were trained on five games using 50 different learning rates and random initializations.
• Each game and algorithm combination had a range of learning rates for which all random initializations achieved good scores.
• Stability is indicated by virtually no scores of 0 in regions with good learning rates.
To summarize
• Asynchronous methods for four standard reinforcement learning algorithms (1-step Q, n-step Q, 1-step SARSA, A3C).
• Able to train neural network controllers on a variety of domains in a stable manner.
• Using parallel actor-learners to update a shared model stabilized the learning process (an alternative to experience replay).
• In Atari games the advantage actor-critic (A3C) surpassed the then state of the art in half the training time.
• Superlinear speedup when increasing the thread count for 1-step methods.