CS 3630 · 2020-05-04 · Lecture 24: A brief introduction to Deep Reinforcement Learning
CS 3630
Lecture 24: A brief introduction to Deep Reinforcement Learning
This lecture borrows heavily from the Deep RL Boot Camp slides, in particular the slide decks by Peter Abbeel, Rocky Duan, Vlad Mnih, Andrej Karpathy, and John Schulman
Topics
1. Recap: MDP and RL Methods
2. Deep Q-Learning
3. Policy Optimization
Motivation
• Deep learning success in perception
• MDP and RL frameworks:
  • Well understood
  • Early successes (backgammon)
  • Not great on more complex problems
• Can deep learning make RL really work? Evidence points to yes!
Image from DeepMimic paper, Peng et al 2018
Some RL Success Stories
Overview
This lecture will not go into depth
It provides a rough overview and pointers to excellent starting points
Lecture materials are all cribbed from the Deep RL Boot Camp, which took place in August 2017 in Berkeley
All slides/lectures are online: https://sites.google.com/view/deep-rl-bootcamp/home
Recap: Markov Decision Processes (MDP)
* P(s’|s,a) == T(s,a,s’) from Seth’s lectures
A note on Rewards
• Three different ways to define the reward:
  • R(s): reward is a function of the state only (Seth’s lectures)
  • R(s, a): reward in a given state depends on the action you take
  • R(s, a, s’): reward also depends on the state in which you land
    • Most general
    • Matches the Deep RL Boot Camp slides
Image copyright Stevens 2003, used here for academic purposes
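The three reward signatures above can be sketched for a hypothetical gridworld-style task (state names, action names, and the 0.04 step cost are illustrative, not from the slides):

```python
# Three reward signatures for the same hypothetical MDP.
def reward_s(s):
    """R(s): reward depends only on the state (as in Seth's lectures)."""
    return 1.0 if s == "goal" else 0.0

def reward_sa(s, a):
    """R(s, a): e.g. a small cost for moving, regardless of outcome."""
    return reward_s(s) - (0.04 if a != "stay" else 0.0)

def reward_sas(s, a, s_next):
    """R(s, a, s'): most general -- pays for *landing* in the goal."""
    return 1.0 if s_next == "goal" else -0.04

print(reward_sas("start", "right", "goal"))
```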
Recap: Policy and Value Function
• Policy: 𝜋: S → A; the optimal policy is 𝜋∗
• Value function: satisfies the Bellman equations
• Optimal value 𝑉∗(𝑠): obtained if actions 𝑎(𝑠) = 𝜋∗(𝑠), i.e., actions are chosen according to the optimal policy 𝜋∗:

V∗(s) = max_a Σ_s′ P(s′|s, a) [R(s, a, s′) + γ V∗(s′)]

• Exercise: check the Bellman equations with the given noise 0.2 (actions only work 80% of the time) and discount factor 0.9
Recap: Value Iteration
Image borrowed from the (useful!) Nvidia blog post on RL: https://devblogs.nvidia.com/deep-learning-nutshell-reinforcement-learning/
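Value iteration repeatedly applies the Bellman backup V ← max_a Σ_s′ P(s′|s,a)[R + γV(s′)] until the values stop changing. A minimal sketch on a made-up 2-state, 2-action MDP (the transition probabilities and rewards are invented for illustration):

```python
import numpy as np

gamma = 0.9
# P[a][s][s']: transition probabilities for each action (made-up numbers)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # action 1
R = np.array([0.0, 1.0])                   # reward for landing in s'

V = np.zeros(2)
for _ in range(200):
    # Q[a, s] = sum_s' P(s'|s,a) * (R(s') + gamma * V(s'))
    Q = np.einsum("ast,t->as", P, R + gamma * V)
    V_new = Q.max(axis=0)                  # Bellman backup
    if np.abs(V_new - V).max() < 1e-8:     # converged to the fixed point
        break
    V = V_new
print(V)
```

Because the backup is a γ-contraction, the loop converges geometrically to the unique fixed point 𝑉∗.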
Recap: Policy Iteration
Image borrowed from (useful!) blog post at https://mpatacchiola.github.io/blog/2016/12/09/dissecting-reinforcement-learning.html
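Policy iteration alternates exact policy evaluation (solving a linear system) with greedy policy improvement, and terminates when the policy stops changing. A sketch on the same style of toy MDP (numbers invented for illustration):

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[a][s][s'] (made-up numbers)
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([0.0, 1.0])                   # reward for landing in s'

policy = np.zeros(2, dtype=int)            # start with action 0 everywhere
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    P_pi = P[policy, np.arange(2)]         # P_pi[s, s'] under the current policy
    R_pi = P_pi @ R                        # expected landing reward per state
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to V.
    Q = np.einsum("ast,t->as", P, R + gamma * V)
    new_policy = Q.argmax(axis=0)
    if np.array_equal(new_policy, policy): # policy stable -> optimal
        break
    policy = new_policy
print(policy, V)
```

Unlike value iteration, each sweep here evaluates the policy exactly, so the loop finishes after only a handful of improvement steps on a finite MDP.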
Recap: Reinforcement Learning
• Passive:
  • Direct utility estimation
  • Adaptive Dynamic Programming
  • Temporal Difference Learning (model-free!)
• Active:
  • Model-based: exploitation vs. exploration
  • Q-learning (model-free!)
Image by Praphul Singh, https://blogs.oracle.com/datascience/reinforcement-learning-deep-q-networks
2. Deep Q-learning
A simple (tabular) Q-learning algorithm:
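The tabular algorithm can be sketched as follows, on a hypothetical 1-D corridor task (states 0–4, reward 1 for reaching state 4; the task and all hyperparameters are invented for illustration, not from the slides):

```python
import random
from collections import defaultdict

def step(s, a):
    """Toy corridor: action 1 moves right, action 0 moves left."""
    s_next = max(0, min(4, s + (1 if a == 1 else -1)))
    reward = 1.0 if s_next == 4 else 0.0
    return s_next, reward, s_next == 4

random.seed(0)
Q = defaultdict(float)              # Q[(s, a)], implicitly 0 for unseen pairs
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):                # episodes
    s = 0
    for _ in range(100):            # cap episode length
        # epsilon-greedy action selection (ties broken randomly)
        if random.random() < eps or Q[(s, 0)] == Q[(s, 1)]:
            a = random.choice([0, 1])
        else:
            a = max([0, 1], key=lambda a: Q[(s, a)])
        s_next, r, done = step(s, a)
        # Q-learning update: Q <- Q + alpha * (r + gamma * max_a' Q(s',a') - Q)
        target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        if done:
            break
        s = s_next

print([max(Q[(s, 0)], Q[(s, 1)]) for s in range(5)])
```

The update is off-policy: the max over next actions bootstraps toward the greedy value regardless of how the behavior action was chosen.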
Q-learning on gridworld
Q-learning properties
Can tabular Q-learning scale?
Can tabular Q-learning scale?
Enter DQN (2015 Nature paper, Mnih et al.)
DQN overview
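Two of DQN's stabilizing ingredients are experience replay and a periodically-updated target network. A minimal replay-buffer sketch (a hypothetical illustration, not DeepMind's code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store transitions, sample decorrelated minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive frames that destabilizes naive online Q-learning.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.push(t, t % 4, 0.0, t + 1, False)      # dummy transitions
batch = buf.sample(8)
print(len(buf), len(batch))
```

In training, the Q-network is fit on minibatches drawn from this buffer, with the bootstrap target computed by a frozen copy of the network that is only synced every few thousand steps.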
Neural Network to approximate Q-values
Learning Space Invaders
Value function, visualized
• 512-dimensional state space (final hidden layer)
• “t-SNE” embedding visualizes this in 2D
• Color is the value of the state
Learning all the games!
• Human = 100%
3. Policy Gradient Method
• Optimize over stochastic policies
Why Policy Optimization?
Policy leads to trajectories 𝜏
Vanilla Policy Gradient
• The REINFORCE algorithm dates back to 1992 (!)
• Roll out many trajectories
• Compare the current policy against a baseline b
• Adapt the network to improve reward on trajectories that compare advantageously (At) to the baseline
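The steps above can be sketched with REINFORCE plus a running-mean baseline on a 2-armed bandit, a one-step stand-in for full trajectory rollouts (the bandit, learning rates, and seed are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # policy logits over the two arms
baseline, lr = 0.0, 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                     # sample from the policy
    r = rng.normal(1.0 if a == 1 else 0.0, 0.1)    # arm 1 pays more on average
    advantage = r - baseline                       # A_t = R - b
    # grad of log pi(a) for a softmax policy: one_hot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    theta += lr * advantage * grad_logp            # policy gradient ascent step
    baseline += 0.05 * (r - baseline)              # running-mean baseline b

print(softmax(theta))
```

Subtracting the baseline leaves the gradient estimate unbiased but reduces its variance, which is why even this simple trick matters in practice.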
Improvements led to SOTA methods
• Rather complicated and beyond our scope:
  • DDPG: Deep Deterministic Policy Gradient
  • TRPO: Trust Region Policy Optimization
  • PPO: Proximal Policy Optimization
• See blog post: https://openai.com/blog/openai-baselines-ppo/
• Some successes:
  • OpenAI Gym:
• Atari:
Summary
1. Recap: MDP and RL Methods
2. Deep Q-Learning (DQN) beats humans on many Atari games; its Nature paper now has over 10K citations
3. Policy Optimization learns a network that takes a state and outputs a stochastic action. Tricky to get to work, with lots of heavy math, but great success in robotics-like domains.