Posted on 25-May-2020
RL 17a: Deep RL
(certainly interesting, but not examinable)
Michael Herrmann
University of Edinburgh, School of Informatics
(bonus lecture)
RL
“The use of punishments and rewards can at best be a part of the teaching process” (A. Turing)
Russell and Norvig: AI, Ch. 21
“A microcosm for the entire AI problem”

Russell and Norvig: AI, Ch. 20

“We seek a single agent which can solve any human-level task”
“The essence of an intelligent agent”
“Reinforcement learning + deep learning = AI”
David Silver, Google DeepMind
bonus lecture in 2016, Michael Herrmann, RL 17a
Deep RL
- State classification by deep learning
- Representation of sample trajectories
- Deep network representation of the value function
- Deep network representation of the policy
- Deep network model of state transitions
Deep Auto-Encoder Neural Networks in RL
Sascha Lange and Martin Riedmiller, IJCNN 2010
Deep and Recurrent Architectures for Optimal Control
- 9 degrees of freedom
- 10 training terrains
- 10 test terrains
- No x-y input to controller
- Policy gradient
- Soft or hard rectified units
- Equal number of parameters in all models
Sergey Levine, 2013
Deep Q-learning is still Q-learning
Function approximation of the Q-function:

Q(s, a; w) ≈ Q(s, a)
δ = r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w)

Update by stochastic gradient descent (treating the target as constant with respect to w; the factor 2 from differentiating δ² is absorbed into η):

∆w = −η ∂δ²/∂w = η (r + γ max_{a′} Q(s′, a′; w) − Q(s, a; w)) ∂Q(s, a; w)/∂w
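The update above can be sketched concretely with a linear approximator Q(s, a; w) = w[a] · φ(s). The feature map, sizes and learning rate below are illustrative assumptions, not from the lecture.

```python
import numpy as np

n_features, n_actions = 4, 2
gamma, eta = 0.99, 0.1
w = np.zeros((n_actions, n_features))      # one weight vector per action

def phi(s):
    # here the state is used directly as its own feature vector
    return np.asarray(s, dtype=float)

def q(s, a, w):
    return w[a] @ phi(s)

def td_update(s, a, r, s_next, w):
    # delta = r + gamma * max_a' Q(s', a'; w) - Q(s, a; w)
    delta = r + gamma * max(q(s_next, b, w) for b in range(n_actions)) - q(s, a, w)
    # dQ(s, a; w)/dw[a] = phi(s), so the SGD step is w[a] += eta * delta * phi(s)
    w[a] += eta * delta * phi(s)
    return delta

rng = np.random.default_rng(0)
s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
delta = td_update(s, 0, 1.0, s_next, w)
print(delta)    # with w = 0 everywhere, the first TD error equals the reward: 1.0
```

Only the row of w belonging to the chosen action changes, mirroring the ∂Q(s, a; w)/∂w term in the update rule.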
DQL: Lessons from RL with function approximation
Countermeasures against divergence of Q(s, a; w):

- Diversify data to reduce episodic correlation
- Use lots of data, i.e. many quadruples (s, a, r, s′)
- Use one network for Q(s, a; w0) and another one for Q(s′, a′; w1)
- Use information about the reward distribution
- Normalisation to get robust gradients
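Two of these measures can be sketched in a few lines: an experience replay buffer (diversifies data and breaks episodic correlation) and a separate target network Q(s′, a′; w1) that is only synchronised with the online network Q(s, a; w0) every C steps. The networks are stubbed as dicts here, and all sizes are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform sampling decorrelates consecutive transitions
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer()
for t in range(100):              # a fake trajectory of quadruples (s, a, r, s')
    buf.push(t, t % 4, 1.0, t + 1)
batch = buf.sample(8)
print(len(batch))                 # 8 decorrelated quadruples

# target-network bookkeeping: copy online weights w0 -> w1 every C steps
w0, w1, C = {"theta": 0.0}, {"theta": 0.0}, 10
for step in range(1, 31):
    w0["theta"] += 0.1            # stand-in for a gradient step on the online net
    if step % C == 0:
        w1 = dict(w0)             # freeze a snapshot as the new target net
print(w1["theta"] == w0["theta"]) # last snapshot was taken at step 30
```

Keeping w1 frozen between snapshots makes the bootstrap target change slowly, which is one of the stabilising effects listed above.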
Adapted from: Deep Reinforcement Learning by David Silver, Google DeepMind
Demystifying Deep Reinforcement Learning (T. Matiisen)
Naive formulation of DQN vs. DQN in DeepMind
Deep Q-Network (Volodymyr Mnih et al., 2015)
- Represent the artificial agent by a deep neural network: DQN
- Learn successful policies directly from high-dimensional sensory inputs
- End-to-end reinforcement learning
- Tested on classic Atari 2600 games
- Reward: game score
- Achieves a level comparable to a professional human games tester across a set of 49 games (better than human on about half of the games)
- Unchanged metaparameters for all games
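The high-dimensional sensory input is typically preprocessed before it reaches the network: frames are converted to grayscale, downsampled to 84×84, and the last 4 frames are stacked so the state carries motion information (as in Mnih et al. 2015). A rough sketch of this pipeline, with made-up toy frames:

```python
import numpy as np

def preprocess(frame_rgb):
    gray = frame_rgb.mean(axis=2)                 # crude RGB -> grayscale
    h, w = gray.shape
    return gray[::h // 84, ::w // 84][:84, :84]   # crude downsample to 84x84

rng = np.random.default_rng(1)
# four fake Atari-sized RGB frames (210 x 160 x 3)
history = [preprocess(rng.integers(0, 256, (210, 160, 3))) for _ in range(4)]
state = np.stack(history, axis=0)                 # the network's input tensor
print(state.shape)                                # (4, 84, 84)
```

The actual published pipeline also crops and rescales more carefully; this sketch only illustrates the shape of the input the convolutional network sees.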
Algorithm for Deep Reinforcement Learning (T. Matiisen)
Convolutional network for DQN (Volodymyr Mnih et al., 2015)
Space Invaders: visualising the last hidden layer by t-distributed stochastic neighbour embedding (Mnih et al., 2015)
Hyperparameter values (Volodymyr Mnih et al., 2015)
Deep RL: Parallelisation
Gorila (GOogle ReInforcement Learning Architecture) (Arun Nair et al., 2015)

Many learners send gradient information to a parameter server: shards update disjoint subsets of the parameters of the deep network.
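The sharding idea can be sketched as follows; the parameter vector is split into disjoint index sets, and each shard applies incoming gradients only to its own slice. Sizes, the learning rate and the toy gradients are assumptions for illustration, not Gorila's actual configuration.

```python
import numpy as np

n_params, n_shards, eta = 12, 3, 0.5
params = np.zeros(n_params)
shards = np.array_split(np.arange(n_params), n_shards)  # disjoint index sets

def shard_update(shard_id, gradient):
    """A shard server applies SGD only to the parameters it owns."""
    idx = shards[shard_id]
    params[idx] -= eta * gradient[idx]

# two learners compute (toy) gradients on different experience
grad_a = np.ones(n_params)
grad_b = 2.0 * np.ones(n_params)

for grad in (grad_a, grad_b):        # gradients arrive at the server
    for sid in range(n_shards):      # each shard handles only its slice
        shard_update(sid, grad)

print(params)                        # every entry updated: -0.5 * (1 + 2) = -1.5
```

Because the shards own disjoint parameter subsets, they can run in parallel without locking each other, which is the point of the architecture.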
Further work
- Policy gradient for continuous actions
- Model-based RL: deep transition model of the environment
- Deep actor-critic (AC)
- POMDPs
- Inverse reinforcement learning (IRL)
- Biological modelling
- Robot control
- Text-based games (incl. language understanding)
- Dropout for robust representation
Model-based Deep RL: Challenges (David Silver)
- Errors in the transition model compound over the trajectory
- By the end of a long trajectory, rewards can be totally wrong
- Model-based RL has failed (so far) in Atari
- Deep value/policy networks can “plan” implicitly
- Each layer of the network performs an arbitrary computational step
- An n-layer network can “look ahead” n steps
- Are transition models required at all?
AlphaGo
Chess: ≈ 30 possible moves, ≈ 80 moves per game
Go: ≈ 250 possible moves, ≈ 150 moves per game

Approach:
- Reduce depth by truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s)
- Reduce the width of the search tree by a policy p(a, s)
- Train using supervised learning to predict human player moves
- Train the policy network to optimise the outcome by self-play
- Train a value network to predict the winner
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
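The two reductions above can be illustrated on a toy game tree: depth is cut off by backing up an approximate value v(s) instead of searching further, and width is cut by keeping only the moves a policy ranks highest. The game, v and the policy below are toy stand-ins, not AlphaGo's networks.

```python
def moves(s):
    return [s * 10 + d for d in range(1, 4)]       # toy game: 3 moves per state

def v(s):
    return (s % 7) / 3.5 - 1.0                     # toy value estimate in [-1, 1]

def policy_prune(s, width=2):
    # width reduction: the "policy" keeps only the 2 best-looking moves
    return sorted(moves(s), key=v, reverse=True)[:width]

def search(s, depth):
    if depth == 0:
        return v(s)      # depth reduction: back up v(s) instead of searching deeper
    # negamax over the policy-pruned moves: the opponent's value is negated
    return max(-search(s2, depth - 1) for s2 in policy_prune(s))

value = search(1, 3)
print(value)             # backed-up value stays in the range of v, i.e. [-1, 1]
```

AlphaGo combines these reductions with Monte Carlo tree search rather than the fixed-depth negamax used here; the sketch only shows why a value network and a policy network shrink the search problem.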
AlphaGo
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
Monte Carlo tree search in AlphaGo
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
Remarks
- Non-zero rewards only at win/loss states
- During learning, a fast policy (based on human data) is used that plays to the end; later, the learned policy becomes more general
- Combination of SL and RL (policy + value) and MCTS
- The neural networks of AlphaGo are trained directly from gameplay, purely through general-purpose methods (without a handcrafted evaluation function as in Deep Blue)
- AlphaGo achieved a 99.8% winning rate against other Go programs, defeated the human European Go champion by 5 games to 0, and a human world champion 4-1 or 3-2 (to be decided today!), a feat previously thought to be at least a decade away
- Achieving one of AI's “grand challenges”
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.