Value Iteration Network - Runzhe Yang · 2020. 3. 11. · Value Iteration Networks NIPS 2016 BEST...
Value Iteration Networks
NIPS 2016 BEST PAPER
7-Minute Tour
Runzhe Yang @ SJTU ACM CLASS
Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel
@ Berkeley Artificial Intelligence Research Lab (BAIR)
Introduction
• Deep RL learns policies from complicated visual input
• Learns to act, but does it understand?
• A simple test: generalization on grid worlds
Introduction
• A simple test: generalization on grid worlds
[Figure: grid-world maps with panels Train and Test, annotated FAIL]
Introduction
Why doesn’t it understand?
Introduction
Why doesn’t it understand?
- A neural network (NN) is trained to represent a policy
- New task → need to re-plan
[Diagram labels: Task, Observation, Deep Q-Net, Policy, Action Probability]
Introduction
Why doesn’t it understand?
- A sequential problem requires a planning computation
- RL gets around that (learns a mapping, State → Q-value)
- Lack of planning computation → bad understanding
[Diagram labels: Task, Sequential Nature, Observation, Deep Q-Net, Reactive Policy]
Introduction
In this work:
- Learn to plan
- Policies that generalize to unseen tasks
[Diagram labels: Task, Sequential Nature, Observation, Deep Q-Net, Reactive Policy]
A Planning-based Policy Model
- Start from reactive policy
[Diagram labels: Observation, Reactive Policy]
A Planning-based Policy Model
- Assumption: observation can be mapped to a useful (but unknown) planning computation
- Add an explicit planning computation
[Diagram labels: Observation, Reactive Policy, Planning Module (plan on a new MDP)]
A Planning-based Policy Model
- NNs map observation to reward and transitions
- Later, learn on new MDP
- How to use the planning computation?
[Diagram labels: Observation, Reactive Policy, Planning Module (plan on a new MDP)]
A Planning-based Policy Model
- Fact 1: value function = sufficient information about plan
- Fact 2: action prediction can require only a subset of the value function
[Diagram labels: Observation, Reactive Policy, Planning Module (plan on a new MDP), Attention]
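The two facts and the attention step can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a 4-connected grid world and a hand-made value map, and the names `ACTIONS`, `attend`, and `greedy_action` are hypothetical. In the slides the attended values are passed on to the reactive policy; the greedy rule here is only a stand-in for that policy.

```python
import numpy as np

# Hypothetical 4-connected action set: name -> (row offset, col offset).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def attend(values, pos):
    """Attention: keep only the values of the cells reachable from `pos`.

    Neighbors that fall off the grid get -inf so they are never preferred.
    """
    rows, cols = values.shape
    r, c = pos
    local = {}
    for name, (dr, dc) in ACTIONS.items():
        nr, nc = r + dr, c + dc
        inside = 0 <= nr < rows and 0 <= nc < cols
        local[name] = values[nr, nc] if inside else -np.inf
    return local


def greedy_action(values, pos):
    """Stand-in policy: move to the attended neighbor with the highest value."""
    local = attend(values, pos)
    return max(local, key=local.get)


# Toy value map (an assumption): goal in the bottom-right corner of a 4 x 4
# grid, value = negative Manhattan distance to the goal.
rr, cc = np.indices((4, 4))
values = -(np.abs(rr - 3) + np.abs(cc - 3)).astype(float)

print(greedy_action(values, (3, 0)))  # right
```

Only four entries of the value map are read per decision, which is the point of Fact 2: the full value function is computed by the planner, but the action at a given position needs just a local slice of it.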
A Planning-based Policy Model
- Policy is still a mapping from observation to action probability
- Parameters for the mappings to reward and transitions, and for attention
- How to back-prop through planning computation?
[Diagram labels: Observation, Reactive Policy, Planning Module (plan on a new MDP), Attention]
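Value Iteration Network
- Differentiable planner (Value Iteration ≈ CNN)
Value Iteration Module
- Conv: reward and previous value → Q-values, one channel per action
- Pool: max over the action channels → new value
- k recurrence: the new value is fed back as the previous value, k times
[Diagram labels: Reward, Prev. V, New Value]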
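The "Value Iteration ≈ CNN" idea can be sketched in plain NumPy. This is a hedged illustration, not the paper's code: the kernels are set by hand for deterministic moves on a grid (in the network they would be learned), the reward is collected on entering a cell, the goal is not made absorbing, and the names `conv2d_same`, `move_kernels`, and `vi_module` are hypothetical.

```python
import numpy as np


def conv2d_same(x, w):
    """Cross-correlation with zero padding.

    x: (C, H, W) input channels, w: (A, C, 3, 3) kernels -> (A, H, W).
    """
    _, height, width = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero-pad the spatial dims
    out = np.zeros((w.shape[0], height, width))
    for di in range(3):
        for dj in range(3):
            patch = xp[:, di:di + height, dj:dj + width]  # shifted input
            out += np.einsum("ac,chw->ahw", w[:, :, di, dj], patch)
    return out


def move_kernels(gamma):
    """One 3 x 3 kernel per action over the channels [reward, value]."""
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    w = np.zeros((len(offsets), 2, 3, 3))
    for a, (dr, dc) in enumerate(offsets):
        w[a, 0, 1 + dr, 1 + dc] = 1.0    # reward of the cell moved into
        w[a, 1, 1 + dr, 1 + dc] = gamma  # discounted value of that cell
    return w


def vi_module(reward, w, k):
    """k rounds of: Conv (Q per action), then Pool (max over actions)."""
    value = np.zeros_like(reward)
    for _ in range(k):
        q = conv2d_same(np.stack([reward, value]), w)  # (A, H, W)
        value = q.max(axis=0)                          # new value map
    return value


# Toy 5 x 5 map (an assumption): reward 1 at the goal in the bottom-right.
reward = np.zeros((5, 5))
reward[4, 4] = 1.0
w = move_kernels(gamma=0.9)

v_short = vi_module(reward, w, k=3)   # too few steps to reach the far corner
v_long = vi_module(reward, w, k=20)   # value has spread over the whole map
print(v_short[0, 0], v_long[4, 3] > v_long[0, 0])
```

Each recurrence spreads goal information by one cell, so the far corner stays at zero for small `k` and becomes positive once `k` covers the distance to the goal. Because every step is a convolution or a max, gradients can flow through all `k` iterations.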
Experiments
1. Grid-World Domain
2. Mars Rover Navigation
3. Continuous Control
4. WebNav Challenge
Network    8 × 8    16 × 16
VIN        90.9%    82.5%
CNN        86.9%    33.1%
Table: RL Results – performance on test maps.
Q&A
Thank you!