430.729 (003) –Spring 2020 Deep Reinforcement...

430.729 (003) – Spring 2020

Deep Reinforcement Learning

Deep Q‐Networks

Prof. Songhwai OhECE, SNU

Prof. Songhwai Oh 1Deep Reinforcement Learning ‐ Spring 2020

DEEP Q‐LEARNING

Prof. Songhwai Oh Deep Reinforcement Learning ‐ Spring 2020 2

Review: Action‐Utility Function• Q(s,a): the value of doing action a in state s.

• Q‐learning is a model‐free method: TD agent that learns a Q‐function does not need a model of the form P(s’|s,a) for learning or for action selection.

• At equilibrium when Q‐values are correct:

• ADP Q‐learning: (1) estimate the transition model; (2) update Q‐values.• TD Q‐learning (does not require the transition model)

cf. Bellman equation:

Review: Q‐Learning

• Update rule for Q‐learning:

Deep Q‐Learning: Q function is approximated by a deep neural network.

DEEP Q‐NETWORK (DQN)

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." NIPS Deep Learning Workshop 2013Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human‐level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529.

Q‐Learning with DNNs

Bellman equation (optimality condition):

‐> value iteration

Q‐network, , ; , approximates ∗ , and is learned by minimizing:

, : probability distribution over and .

Target value is not fixed (cf. supervised learning)

Stochastic gradient descent to learn :

Issues: correlated data and non‐stationary distribution ‐> make Q‐learning unstableSolution: experience replay

Notation

• : Atari emulator• ∈ 1,… , : actions• ∈ : image (screen shot from )• : reward (changes in game score)• , , , … , , : state (cf. POMDP)

• ∑ : future discounted return ( : terminal time)

• ∗ , max | , ,

Deep Q‐Learning with Experience Replay

‐greedy

Fixed length of sequence(last 4 frames)

minibatch = 32

Input: 84 x 84 x 4 image, 3 convolutional layers, 2 fully connected layers

Target Network (Final Version)

• Issues with Q‐learning with function approximation: update that increases , can also increase , for all

and increase target , leading to oscillation or divergence.

• A separate network to generate the target values, .

• A target network makes an algorithm becomes more stable.

Target network update

Target network

*clipping the error term between [‐1,1]

Training

• Different networks for different games but the same architecture and hyperparameters

• RMSProp: divides the learning rate by a running average of recent gradient magnitudes

• Trained for 50 million frames (38 days of gaming)

Results

Breakout

Space Invaders

Key ingredients in DQN

• Experience replay• Target network• Error clipping

• and “a lot of engineering”

DOUBLE Q‐LEARNING

H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double q‐learning," in Proc. of the AAAI Conference on Artificial Intelligence (AAAI), Feb, 2016.

Double Q‐Learning

• DQN is likely to select overestimated values ‐> overoptimistic value estimates

• Decouple the selection from the evaluation

cf. DQN:

selection

evaluation

, ∗ , ~ 0,1

(uses the target network)

Comparison

PRIORITIZED EXPERIENCE REPLAY

T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized Experience Replay," ICLR, 2016

Prioritized Experience Replay

Uniform if 0; ↑ 1

Comparison with Double DQN

DUELING DQN

Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. "Dueling network architectures for deep reinforcement learning." in Proc. of the International Conference on Machine Learning (ICML), Jun, 2016.

Dueling Architecture

Q values

State value

Advantages

Dueling

Advantage:

Dueling: Updates state values more often ‐> Better estimates

Forward Mapping for Q

• Makes a greedy action have zero advantage

• Addresses the identifiability issue

• Increases the stability of the algorithm

• Also addresses the identifiability issue

Dueling vs. Double DQN Dueling vs. Prioritized Double DQN

Results

430.729 (003) –Spring 2020 Deep Reinforcement...

Documents

Transcript of 430.729 (003) –Spring 2020 Deep Reinforcement...

Deep Reinforcement Learning for Robotics

Shear Reinforcement in Deep Slabs

ENGINEERING REINFORCEMENT LEARNING€¦ · Source: Wang, Ziyu, et al. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015). ...

Agente Sonic. Deep reinforcement learning

Deep reinforcement learningtranslectures.videolectures.net/site/normal_dl/tag=1137918/deep... · Deep reinforcement learning — Hado van Hasselt Industrial revolution (1750 - 1850)

Towards Deep Symbolic Reinforcement Learning

Advanced Q-Function Learning Methodsrll.berkeley.edu/deeprlcoursesp17/docs/lec4.pdfZ. Wang, N. de Freitas, and M. Lanctot.\Dueling network architectures for deep reinforcement learning".

Deep Reinforcement Learning - University of Waterloo · PDF file–CS885: Deep Reinforcement Learning •Instructor: Pascal Poupart •Spring 2018 •Deep Q Networks, Recurrent deep

Cooperative Multi-agent Control Using Deep Reinforcement ...rejuvyesh.com/publications/2017-coop-mas-drl.pdfCooperative Multi-agent Control Using Deep Reinforcement Learning 69 reinforcement

Deep Reinforcement Learning

Deep Reinforcement Learning from Human Preferencespapers.nips.cc/...deep-reinforcement-learning-from-human-preferences.pdf · Deep Reinforcement Learning from Human Preferences Paul

DEEP FEATURE EXTRACTION FOR SAMPLE-EFFICIENT REINFORCEMENT … · Deep Reinforcement Learning Il deep reinforcement learning (DRL) `e una branca dell’apprendimento per rinforzo

10703 Deep Reinforcement Learning

Deep Reinforcement Learning for Crazyhouse · Deep Reinforcement Learning for Crazyhouse MasterthesisbyJohannesCzech Dateofsubmission:December30,2019 1.Review:Prof.Dr.KristianKersting

Introduction to Deep Reinforcement Learningwnzhang.net/teaching/cs420/slides/13-deep-rl.pdf · 2020-06-08 · Deep Reinforcement Learning •Deep Reinforcement Learning •leverages

Human-level control through deep reinforcement … control through deep reinforcement learning ... We demon- stratethatthedeepQ ... Human-level control through deep reinforcement learning

A Deep Reinforcement Learning Chatbot

Tutorial: Deep Reinforcement Learning - Machine Learning ...hunch.net/~beygel/deep_rl_tutorial.pdfTutorial: Deep Reinforcement Learning - Machine Learning ...

Deep Learning and Reinforcement Learning

Reinforcement and deep reinforcement learning for wireless ...