Deep Learning for Robotics
Sergey Levine
Deep Learning: End-to-end vision
standard computer vision: features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM)
deep learning: trained end-to-end
Felzenszwalb ‘08
Krizhevsky ‘12
the sensorimotor loop: perception → action (e.g. run away)
“When a man throws a ball high in the air and catches it again, he behaves as if he had solved a set of differential equations in predicting the trajectory of the ball … at some subconscious level, something functionally equivalent to the mathematical calculations is going on.”
-- Richard Dawkins
McLeod & Dienes. Do fielders know where to go to catch the ball or only how to get there? Journal of Experimental Psychology 1996, Vol. 22, No. 3, 531-543
Philipp Krahenbuhl, Stanford University
End-to-end vision
End-to-end robotic control
standard robotic control: observations → state estimation (e.g. vision) → modeling & prediction → motion planning → low-level controller (e.g. PD) → motor torques
deep sensorimotor learning: observations → motor torques
Why should we care?
• indirect supervision
• actions have consequences
Goals of this lecture
• Introduce formalisms of decision making
– states, actions, time, cost
– Markov decision processes
• Survey recent research
– imitation learning
– reinforcement learning
• Outstanding research challenges
– what is easy
– what is hard
– where could we go next?
Contents
Imitation learning
Reinforcement learning
Imitation without a human
Research frontiers
Terminology & notation
observation o_t → policy π_θ(a_t | o_t) → action a_t (e.g. 1. run away, 2. ignore, 3. pet)
Contents
Imitation learning
Reinforcement learning
Imitation without a human
Research frontiers
Imitation Learning
Images: Bojarski et al. ‘16, NVIDIA
training data → supervised learning
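To make the supervised-learning box concrete, here is a minimal behavior-cloning sketch; the network, dataset format, and names (ImitationPolicy, demos) are illustrative assumptions, not the NVIDIA system from Bojarski et al.

# Minimal behavior-cloning sketch (illustrative; not the NVIDIA pipeline).
# Assumes `demos` is a list of (observation, expert_action) pairs stored as
# float tensors: flat observation vectors and continuous action vectors.
import torch
import torch.nn as nn

class ImitationPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, obs):
        return self.net(obs)

def behavior_cloning(policy, demos, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    obs = torch.stack([o for o, _ in demos])
    act = torch.stack([a for _, a in demos])
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(policy(obs), act)  # regress onto expert actions
        loss.backward()
        opt.step()
    return policy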
Does it work? No!
Does it work? Yes!
Video: Bojarski et al. ‘16, NVIDIA
Why did that work?
Bojarski et al. ‘16, NVIDIA
Can we make it work more often?
cost, stability
Learning from a stabilizing controller
DAgger: Dataset Aggregation
Ross et al. ‘11
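In outline, the DAgger loop looks like the sketch below; policy, expert, env, and train are assumed interfaces (a Gym-style environment, an expert that can relabel visited observations, and a supervised training routine), and the beta-mixing of expert and learner actions used in early iterations is omitted.

def dagger(policy, expert, env, train, iterations=10, horizon=200):
    # Dataset aggregation (Ross et al. '11): run the current policy, ask the
    # expert to label the observations it actually visits, retrain, repeat.
    dataset = []
    for _ in range(iterations):
        obs = env.reset()
        for _ in range(horizon):
            dataset.append((obs, expert.act(obs)))       # expert labels visited states
            obs, _, done, _ = env.step(policy.act(obs))  # learner picks the action taken
            if done:
                break
        policy = train(policy, dataset)                  # supervised learning on aggregate
    return policy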
DAgger Example
Ross et al. ‘11
What’s the problem?
Ross et al. ‘11
Imitation learning: recap
training data → supervised learning
• Usually (but not always) insufficient by itself
– Distribution mismatch problem
• Sometimes works well
– Hacks (e.g. left/right images)
– Samples from a stable trajectory distribution
– Add more on-policy data, e.g. using DAgger
Imitation learning: questions
training data → supervised learning
• Distribution mismatch does not seem to be the whole story
– Imitation often works without DAgger
– Can we think about how stability factors in?
• Do we need to add data for imitation to work?
– Can we just use existing data more cleverly?
Contents
Imitation learning
Reinforcement learning
Imitation without a human
Research frontiers
Terminology & notation
Trajectory optimization
Probabilistic version
Probabilistic version (in pictures)
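For reference, the deterministic trajectory optimization problem these slides refer to is conventionally written as follows (standard formulation; the probabilistic version replaces the dynamics constraint with a distribution over next states and minimizes expected cost):

\min_{u_1, \ldots, u_T} \sum_{t=1}^{T} c(x_t, u_t)
\quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t),
\qquad \text{probabilistic: } \min \; \mathbb{E}_{x_{t+1} \sim p(\cdot \mid x_t, u_t)}\Big[\sum_t c(x_t, u_t)\Big]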
DAgger without Humans
Ross et al. ‘11
Another problem
PLATO: Policy Learning with Adaptive Trajectory Optimization
Kahn, Zhang, Levine, Abbeel ‘16
path replanned!
avoids high cost!
input substitution trick
need state at training time, but not at test time!
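One way to read the input substitution trick: the trajectory-optimization teacher acts on the full state x_t, while the learned policy sees only the observation o_t, so the policy can be trained purely by supervised learning on the teacher's actions, roughly

\theta \leftarrow \arg\min_\theta \sum_t \mathbb{E}\big[ D_{KL}\big( \pi_\theta(u \mid o_t) \,\|\, \pi_{\text{teacher}}(u \mid x_t) \big) \big]

(the exact objective in PLATO differs in details); the state is therefore needed only while collecting training labels, not when the policy is deployed.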
Beyond driving & flying
Trajectory Optimization with Unknown Dynamics
[L. et al. NIPS ‘14]
Learning on PR2
[L. et al. ICRA ‘15]
Combining with Policy Learning
expectation under current policy
trajectory distribution(s)
Lagrange multiplier
L. et al. ICML ’14 (dual descent)
can also use BADMM (L. et al. '15)
Guided Policy Search
supervised learning ↔ trajectory-centric RL
[see L. et al. NIPS ‘14 for details]
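The coupling sketched above can be summarized (roughly the form used in the dual-descent variant) as a constrained problem:

\min_{p, \theta} \; \mathbb{E}_{p(\tau)}\big[ c(\tau) \big]
\quad \text{s.t.} \quad p(u_t \mid x_t) = \pi_\theta(u_t \mid x_t) \;\; \forall t

solved by alternating between trajectory-centric RL on the trajectory distribution(s) p, supervised learning of π_θ on samples from p, and updates to the Lagrange multipliers enforcing the constraint.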
training time vs. test time
L.*, Finn*, Darrell, Abbeel ‘16
~ 92,000 parameters
Experimental Tasks
Imitating optimal control: questions
• Any difference from standard imitation learning?
– Can change behavior of the “teacher” programmatically
• Can the policy help optimal control (rather than just the other way around)?
Contents
Imitation learning
Reinforcement learning
Imitation without a human
Research frontiers
Can we avoid dynamics completely?
A simple reinforcement learning algorithm
REINFORCE: likelihood ratio policy gradient
Williams ‘92
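For reference, the likelihood-ratio (REINFORCE) estimator of Williams '92 is

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \Big( \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) \Big( \sum_{t} r(s_t, a_t) \Big) \right]

estimated with Monte Carlo rollouts of the current policy, followed by a gradient ascent step on θ.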
What the heck did we just do?
one more piece…
Policy gradient challenges
• High variance in gradient estimate
– Smarter baselines
• Poor conditioning
– Use higher-order methods (see: natural gradient)
• Very hard to choose step size
– Trust region policy optimization (TRPO), sketched below
Schulman, L., Moritz, Jordan, Abbeel ‘15
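In equations (standard forms; see Schulman et al. '15 for the precise TRPO algorithm): subtracting a baseline b(s) reduces variance without biasing the estimate, and TRPO replaces the raw gradient step with a KL-constrained surrogate update:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i,t} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \big( \hat{Q}(s_t^i, a_t^i) - b(s_t^i) \big)

\max_\theta \; \mathbb{E}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}\, \hat{A}(s,a) \right]
\quad \text{s.t.} \quad \mathbb{E}\big[ D_{KL}\big( \pi_{\theta_\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \big] \le \delta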
Value functions
Value function challenges
• The usual problems with policy gradient
– Poor conditioning (use natural gradient)
– Hard to choose step size (use TRPO)
• Bias/variance tradeoff
– Monte Carlo returns: high variance; bootstrapped estimates: high bias
– Combine Monte Carlo and function approximation: Generalized Advantage Estimation (GAE)
• Instability/overfitting
– Limit how much the value function estimate changes per iteration
Schulman, Moritz, L., Jordan, Abbeel ‘16
Generalized advantage estimation
Schulman, Moritz, L., Jordan, Abbeel ‘16
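The GAE estimator trades off the two failure modes with an exponentially weighted sum of TD residuals:

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}

λ = 0 gives the low-variance but biased one-step estimate; λ = 1 recovers the high-variance Monte Carlo advantage.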
Online actor-critic methods
Continuous Control with Deep Reinforcement Learning (Lillicrap et al. ‘15)
Deep deterministic policy gradient (DDPG)
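In outline, DDPG trains a deterministic actor μ_θ and a critic Q_w off-policy from a replay buffer, with slowly updated target networks Q_{w'} and μ_{θ'}:

y = r + \gamma\, Q_{w'}\big(s', \mu_{\theta'}(s')\big), \qquad
\mathcal{L}(w) = \mathbb{E}\big[ (Q_w(s,a) - y)^2 \big], \qquad
\nabla_\theta J \approx \mathbb{E}\big[ \nabla_a Q_w(s,a)\big|_{a = \mu_\theta(s)}\, \nabla_\theta \mu_\theta(s) \big]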
Just the Q function?
Discrete Q-learning
Human-Level Control Through Deep Reinforcement Learning (Mnih et al. ‘15)
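The discrete Q-learning update used in DQN regresses Q_w onto bootstrapped targets computed with a separate target network Q_{w'}, over transitions sampled from a replay buffer D:

y = r + \gamma \max_{a'} Q_{w'}(s', a'), \qquad
\mathcal{L}(w) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\big[ (Q_w(s,a) - y)^2 \big]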
Continuous Q-learning
*Random, DDPG, NAF policies: final rewards and episodes to converge
Normalized Advantage Functions
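NAF makes the maximization over continuous actions closed-form by restricting the advantage to be quadratic in the action:

Q(s,a) = V(s) + A(s,a), \qquad
A(s,a) = -\tfrac{1}{2}\, (a - \mu(s))^\top P(s)\, (a - \mu(s))

so the greedy action is simply a = μ(s), with V(s), μ(s), and the positive-definite matrix P(s) all produced by a single network.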
Policy Learning with Multiple Robots: Deep RL with NAF
Gu*, Holly*, Lillicrap, L., ‘16
Shane Gu, Ethan Holly, Tim Lillicrap
Deep RL with Policy Gradients
• Unbiased but high-variance gradient
• Stable
• Requires many samples
• Example: TRPO [Schulman et al. ‘15]
Deep RL with Off-Policy Q-Function Critic
• Low-variance but biased gradient
• Much more efficient (because off-policy)
• Much less stable (because biased)
• Example: DDPG [Lillicrap et al. ‘16]
Improving Efficiency & Stability with Q-Prop
Shane Gu
Q-Prop combines the on-policy policy gradient with an off-policy Q-function critic (see the sketch below):
• Unbiased gradient, stable
• Efficient (uses off-policy samples)
• Critic comes from off-policy data
• Gradient comes from on-policy data
• Automatic variance-based adjustment
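Roughly, Q-Prop uses a Taylor expansion of the off-policy critic around the policy mean as a control variate for the Monte Carlo policy gradient; one way to write the resulting estimator (details and the variance-based coefficient η are in the Q-Prop paper):

\nabla_\theta J \approx \mathbb{E}_{\pi}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, \big( \hat{A}(s,a) - \eta\, \bar{A}_w(s,a) \big) \big]
+ \eta\, \mathbb{E}_{\rho}\big[ \nabla_a Q_w(s,a)\big|_{a = \mu_\theta(s)}\, \nabla_\theta \mu_\theta(s) \big]

where \bar{A}_w is the critic's first-order (linearized) advantage and ρ is the off-policy state distribution.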
Comparisons
• Works with smaller batches than TRPO
• More efficient than TRPO
• More stable than DDPG with respect to hyperparameters
• Likely responsible for the better performance on the harder task
Sample complexity
• Deep reinforcement learning is very data-hungry
– DQN: about 100 hours to learn Breakout
– GAE: about 50 hours to learn to walk
– DDPG/NAF: 4-5 hours to learn basic manipulation, walking
• Model-based methods are more efficient
– Time-varying linear models: 3 minutes for real-world manipulation
– GPS with vision: 30-40 minutes for real-world visuomotor policies
Reinforcement learning tradeoffs
• Reinforcement learning (for the purpose of this slide) = model-free RL
• Fewer assumptions
– Don’t need to model dynamics
– Don’t (in general) need state definition, only observations
– Fully general stochastic environments
• Much slower (model-based acceleration?)
• Hard to stabilize
– Few convergence results with neural networks
Contents
Imitation learning
Reinforcement learning
Imitation without a human
Research frontiers
ingredients for success in learning:
supervised learning: computation, algorithms, data
learning sensorimotor skills: computation, algorithms, ~data?
L., Pastor, Krizhevsky, Quillen ‘16
Grasping with Learned Hand-Eye Coordination
• 800,000 grasp attempts for training (3,000 robot-hours)
• monocular camera (no depth)
• 2-5 Hz update
• no prior knowledge
monocular RGB camera
7 DoF arm
2-finger gripper
object bin
L., Pastor, Krizhevsky, Quillen ‘16
Using Grasp Success Prediction
training / testing
L., Pastor, Krizhevsky, Quillen ‘16
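Concretely, the learned success predictor g(I_t, v) can drive a closed-loop controller that repeatedly re-scores candidate gripper motions v against the current image I_t and executes the most promising one. A minimal sketch follows, assuming a hypothetical grasp_net scoring function and a CEM-style sampler over 3-D displacements (both illustrative; the real system optimizes full motor commands).

import numpy as np

def servo_step(grasp_net, image, n_samples=64, n_elite=6, iters=3):
    # Re-plan one closed-loop step: sample candidate gripper displacements,
    # score them with the learned success predictor, and refine (CEM-style).
    mean, std = np.zeros(3), np.ones(3)
    for _ in range(iters):
        candidates = mean + std * np.random.randn(n_samples, 3)
        scores = np.array([grasp_net(image, v) for v in candidates])  # predicted grasp success
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean  # displacement command to execute before re-planning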
Open-Loop vs. Closed-Loop Grasping
Pinto & Gupta, 2015
open-loop grasping: failure rate 33.7%
depth + segmentation (hand-engineered): failure rate 35%
closed-loop grasping: failure rate 17.5%
L., Pastor, Krizhevsky, Quillen ‘16
Grasping Experiments
L., Pastor, Krizhevsky, Quillen ‘16
Learning what Success Means
can we learn the cost with visual features?
with C. Finn, P. Abbeel
Challenges & Frontiers
• Algorithms
– Sample complexity
– Safety
– Scalability
• Supervision
– Automatically evaluate success
– Learn cost functions
Acknowledgements
Pieter Abbeel, Chelsea Finn, Peter Pastor, Alex Krizhevsky, Trevor Darrell, Deirdre Quillen, John Schulman, Philipp Moritz, Michael Jordan, Greg Kahn, Tianhao Zhang, Shane Gu