430.729 (003) – Spring 2020
Deep Reinforcement Learning
Deep Q‐Networks
Prof. Songhwai Oh, ECE, SNU
Deep Reinforcement Learning, Spring 2020
DEEP Q‐LEARNING
Review: Action-Utility Function
• Q(s, a): the value of doing action a in state s.
• Q-learning is a model-free method: a TD agent that learns a Q-function does not need a model of the form P(s'|s, a), either for learning or for action selection.
• At equilibrium, when the Q-values are correct: $Q(s,a) = R(s) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s',a')$
• ADP Q-learning: (1) estimate the transition model; (2) update the Q-values.
• TD Q-learning (does not require the transition model): $Q(s,a) \leftarrow Q(s,a) + \alpha \big( R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a) \big)$
cf. Bellman equation: $U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s'|s,a)\, U(s')$
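The TD Q-learning update can be sketched in a few lines of Python (an illustrative tabular implementation; the two-state example and the `td_q_update` helper are not from the lecture):

```python
from collections import defaultdict

def td_q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One TD Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Tiny example: one transition on a two-state, two-action problem.
Q = defaultdict(float)          # tabular Q-values, defaulting to 0
actions = [0, 1]
td_q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=actions)
# Q[(0, 1)] has moved from 0 toward the target by a fraction alpha.
```

No transition model P(s'|s, a) appears anywhere: the update uses only the sampled transition, which is exactly the model-free property noted above.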
Review: Q‐Learning
• Update rule for Q-learning: $Q(s,a) \leftarrow Q(s,a) + \alpha \big( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big)$
• Deep Q-learning: the Q-function is approximated by a deep neural network.
DEEP Q‐NETWORK (DQN)
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." NIPS Deep Learning Workshop, 2013.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529.
Q‐Learning with DNNs
Bellman equation (optimality condition): $Q^*(s,a) = \mathbb{E}_{s'} \big[ r + \gamma \max_{a'} Q^*(s',a') \,\big|\, s, a \big]$ → value iteration.
The Q-network $Q(s,a;\theta) \approx Q^*(s,a)$ is learned by minimizing
$L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)} \big[ (y_i - Q(s,a;\theta_i))^2 \big]$, with target $y_i = \mathbb{E}_{s'} \big[ r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \,\big|\, s, a \big]$.
$\rho(s,a)$: probability distribution over states $s$ and actions $a$ (the behavior distribution).
The target value $y_i$ depends on the previous weights $\theta_{i-1}$ and is not fixed (cf. supervised learning, where targets are fixed before learning begins).
Stochastic gradient descent to learn $\theta$:
$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s'} \big[ \big( r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i) \big) \nabla_{\theta_i} Q(s,a;\theta_i) \big]$
Issues: correlated samples and a non-stationary data distribution make Q-learning with neural networks unstable.
Solution: experience replay.
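As a toy illustration of one such SGD step (a hypothetical linear Q-function with a hand-rolled feature vector, not the paper's network), holding the target fixed as in the semi-gradient update:

```python
def sgd_step(theta, phi, a, y, lr=0.1):
    """One SGD step on (y - Q(s,a;theta))^2 for a linear Q(s,a) = theta[a].phi(s).

    theta: dict mapping action -> weight vector (illustrative parameters);
    phi: feature vector of the current state; y: fixed target value.
    """
    q = sum(w * f for w, f in zip(theta[a], phi))   # Q(s, a; theta)
    td_error = y - q                                 # y - Q(s, a; theta)
    # Gradient of the squared loss wrt theta[a] is -(y - Q) * phi (up to a
    # constant), so stepping against it moves Q(s, a) toward the target y.
    theta[a] = [w + lr * td_error * f for w, f in zip(theta[a], phi)]
    return td_error

theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}
err = sgd_step(theta, phi=[1.0, 0.5], a=0, y=1.0)
# After the step, theta[0] has moved in the direction of phi.
```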
Notation
• $\mathcal{E}$: Atari emulator
• $a_t \in \{1, \dots, K\}$: actions
• $x_t \in \mathbb{R}^d$: image (screenshot from $\mathcal{E}$)
• $r_t$: reward (change in game score)
• $s_t = x_1, a_1, x_2, \dots, a_{t-1}, x_t$: state (cf. POMDP)
• $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$: future discounted return ($T$: terminal time)
• $Q^*(s,a) = \max_{\pi} \mathbb{E} \big[ R_t \,\big|\, s_t = s, a_t = a, \pi \big]$
Deep Q‐Learning with Experience Replay
Algorithm annotations:
• ε-greedy exploration
• fixed-length state sequences (the last 4 frames)
• minibatch size = 32
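The replay memory used by this algorithm can be sketched minimally (the minibatch size of 32 follows the lecture; the `ReplayBuffer` class and its capacity default are an illustrative sketch, not the original implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions (s, a, r, s', done) and sample uniform minibatches."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive frames, addressing the instability noted earlier.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):
    buf.push(s=t, a=t % 4, r=0.0, s_next=t + 1, done=False)
batch = buf.sample(32)
```

Because each transition can be re-sampled many times, experience replay also makes more efficient use of the data than learning from each frame once.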
DQN
Input: 84 x 84 x 4 image, 3 convolutional layers, 2 fully connected layers
Target Network (Final Version)
• Issue with Q-learning under function approximation: an update that increases $Q(s_t, a_t)$ can also increase $Q(s_{t+1}, a)$ for all $a$, and hence increase the target $y$, leading to oscillation or divergence.
• Use a separate network to generate the target values $y_i$.
• The target network makes the algorithm more stable.
• Target network update: every $C$ steps, copy the Q-network weights into the target network ($\theta^- \leftarrow \theta$), then hold them fixed.
* The error term is clipped to $[-1, 1]$.
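The periodic copy can be sketched as follows (the dict-of-lists "network" and the `maybe_update_target` helper are hypothetical stand-ins for real network parameters; the every-$C$-steps schedule follows the slide):

```python
import copy

def maybe_update_target(online, target, step, C=10_000):
    """Every C gradient steps, freeze a copy of the online weights as targets."""
    if step % C == 0:
        target.clear()
        target.update(copy.deepcopy(online))  # deep copy: no shared state

online = {"w": [0.5, -0.2]}
target = {"w": [0.0, 0.0]}
maybe_update_target(online, target, step=0)  # step 0: copy happens
online["w"][0] = 9.9                          # training continues...
# target is unchanged between updates, so the targets y_i stay stable
# for C steps instead of chasing every Q-network update.
```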
Training
• A separate network is trained for each game, but with the same architecture and hyperparameters.
• RMSProp: divides the learning rate by a running average of recent gradient magnitudes.
• Trained for 50 million frames (about 38 days of game experience).
Results
[Figures: results on Breakout and Space Invaders]
Key ingredients in DQN
• Experience replay
• Target network
• Error clipping
• ... and "a lot of engineering"
DOUBLE Q‐LEARNING
H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. of the AAAI Conference on Artificial Intelligence (AAAI), Feb. 2016.
Double Q‐Learning
• DQN tends to select overestimated values, leading to overoptimistic value estimates.
• Idea: decouple action selection from action evaluation.
cf. DQN target: $y_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta^-)$ — the max uses the same values for both selection and evaluation, so with estimation noise $Q(s,a) = V^*(s) + \epsilon_{s,a}$, $\epsilon_{s,a} \sim \mathcal{U}(0,1)$, the target is biased upward.
Double DQN target: $y_t = r_{t+1} + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_t);\, \theta^- \big)$
selection: the online network $\theta_t$
evaluation: the target network $\theta^-$ (uses the target network)
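The two targets differ only in where the action comes from, which a sketch over plain lists of Q-values makes concrete (the numeric Q-values below are made up to exhibit overestimation; nothing here is from the papers):

```python
def dqn_target(r, q_target_next, gamma=0.99):
    # DQN: selection AND evaluation both use the target network's max.
    return r + gamma * max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99):
    # Double DQN: select with the online network, evaluate with the target.
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]

q_online_next = [1.0, 2.0, 0.5]   # online net prefers action 1
q_target_next = [3.0, 1.0, 0.2]   # target net overestimates action 0
y_dqn = dqn_target(0.0, q_target_next)
y_ddqn = double_dqn_target(0.0, q_online_next, q_target_next)
# y_dqn picks up the overestimated 3.0; y_ddqn evaluates the online
# network's choice (action 1) under the target network instead.
```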
Comparison
PRIORITIZED EXPERIENCE REPLAY
T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized Experience Replay," ICLR, 2016
Prioritized Experience Replay
$P(i) = p_i^\alpha \big/ \sum_k p_k^\alpha$: sampling is uniform if $\alpha = 0$; prioritization increases as $\alpha \uparrow 1$.
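The sampling distribution can be computed directly (proportional priorities $p_i = |\delta_i| + \epsilon$ are one of the variants in the cited paper; the helper function and the example TD errors are illustrative):

```python
def sampling_probs(td_errors, alpha=0.6, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha with p_i = |td_error_i| + eps."""
    priorities = [abs(d) + eps for d in td_errors]   # eps keeps p_i > 0
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [p / total for p in scaled]

probs = sampling_probs([0.1, 1.0, 2.0])
# alpha = 0 recovers uniform replay; larger alpha concentrates sampling
# on transitions with large TD error.
uniform = sampling_probs([0.1, 1.0, 2.0], alpha=0.0)
```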
Comparison with Double DQN
DUELING DQN
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. "Dueling network architectures for deep reinforcement learning." in Proc. of the International Conference on Machine Learning (ICML), Jun, 2016.
Dueling Architecture
[Figure: the standard DQN head outputs Q values directly; the dueling head splits into a state-value stream and an advantage stream, which are combined into the Q values.]
Advantage: $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$
Dueling: the shared state-value stream is updated by every action's update, so state values are updated more often, giving better estimates.
Forward Mapping for Q
• $Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \big( A(s,a;\theta,\alpha) - \max_{a'} A(s,a';\theta,\alpha) \big)$
  – forces the greedy action to have zero advantage
  – addresses the identifiability issue ($V$ and $A$ cannot otherwise be recovered uniquely from $Q$)
• $Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \big( A(s,a;\theta,\alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a';\theta,\alpha) \big)$
  – increases the stability of the optimization
  – also addresses the identifiability issue
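The mean-subtracted mapping is a one-liner over per-action advantages, which also makes the identifiability fix visible: shifting all advantages by a constant leaves the Q values unchanged (plain scalar `V` and list `A` here are illustrative, not network outputs):

```python
def dueling_q(V, A):
    """Q(s,a) = V(s) + (A(s,a) - mean over a' of A(s,a'))."""
    mean_A = sum(A) / len(A)
    return [V + (a - mean_A) for a in A]

Q = dueling_q(V=1.0, A=[0.0, 2.0, 4.0])        # mean advantage = 2.0
# Adding a constant (here +10) to every advantage gives identical Q values,
# so the decomposition into V and A is pinned down by the mean constraint.
Q_shift = dueling_q(V=1.0, A=[10.0, 12.0, 14.0])
```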
Results
[Figures: Dueling vs. Double DQN; Dueling vs. Prioritized Double DQN]