Apprenticeship Learning via Inverse Reinforcement Learning

Page 1:

Inverse Reinforcement Learning

Krishnamurthy Dvijotham

UW

(Most slides borrowed/adapted from Pieter Abbeel)

Page 2:

Reinforcement Learning

Almost the same as Optimal Control

“Reinforcement” term coined by psychologists studying animal learning

Focus on discrete state spaces, highly stochastic environments, learning to control without knowing system model in general

Work with rewards instead of costs, usually discounted

Page 3:

Inverse Reinforcement Learning

Dynamics Model + Expert Trajectories → IOC Algorithm → Reward Function R

R explains the expert trajectories

Page 4:

Scientific inquiry

Model animal and human behavior

E.g., bee foraging, songbird vocalization. [See intro of Ng and Russell, 2000 for a brief overview.]

Apprenticeship learning/Imitation learning through inverse RL

Presupposition: reward function provides the most succinct and transferable definition of the task

Has enabled advancing the state of the art in various robotic domains

Modeling of other agents, both adversarial and cooperative

Motivation for inverse RL

Page 5:

Examples of apprenticeship learning via inverse RL

Inverse RL vs. behavioral cloning

Historical sketch of inverse RL

Mathematical formulations for inverse RL

Examples

Lecture outline

Page 6:

Simulated highway driving

Abbeel and Ng, ICML 2004,

Syed and Schapire, NIPS 2007

Aerial imagery based navigation

Ratliff, Bagnell and Zinkevich, ICML 2006

Parking lot navigation

Abbeel, Dolgov, Ng and Thrun, IROS 2008

Urban navigation

Ziebart, Maas, Bagnell and Dey, AAAI 2008

Quadruped locomotion

Ratliff, Bradley, Bagnell and Chestnutt, NIPS 2007

Kolter, Abbeel and Ng, NIPS 2008

Examples

Page 7:

Examples of apprenticeship learning via inverse RL

Inverse RL vs. behavioral cloning

Historical sketch of inverse RL

Mathematical formulations for inverse RL

Examples

Lecture outline

Page 8:

Input:

State space, action space

Transition model $P_{sa}(s_{t+1} \mid s_t, a_t)$

No reward function

Teacher’s demonstration: s0, a0, s1, a1, s2, a2, …

(= trace of the teacher’s policy π*)

Inverse RL:

Can we recover R ?

Apprenticeship learning via inverse RL

Can we then use this R to find a good policy ?

Behavioral cloning

Can we directly learn the teacher’s policy using supervised learning?

Problem setup
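To make the problem setup above concrete, here is a minimal Python sketch of the inputs an inverse RL algorithm receives: an MDP without a reward function plus expert demonstration traces. The class and field names are illustrative assumptions, not anything defined in the lecture.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MDPWithoutReward:
    """Everything in the MDP except the reward function."""
    n_states: int
    n_actions: int
    P: np.ndarray          # P[a, s, s'] = P_sa(s' | s, a); each P[a, s] sums to 1
    gamma: float = 0.95    # discount factor

@dataclass
class Demonstration:
    """One expert trace s_0, a_0, s_1, a_1, ... (a sample of pi*)."""
    states: List[int]
    actions: List[int]

# Inverse RL asks: given an MDPWithoutReward and a set of Demonstrations,
# recover a reward R(s) under which the expert behaviour is (near-)optimal.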

Page 9:

Formulate as standard machine learning problem

Fix a policy class

E.g., support vector machine, neural network, decision tree, deep belief net, …

Estimate a policy (=mapping from states to actions) from the training examples (s0, a0), (s1, a1), (s2, a2), …

Two of the most notable success stories:

Pomerleau, NIPS 1989: ALVINN

Sammut et al., ICML 1992: Learning to fly (flight sim)

Behavioral cloning
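As a concrete illustration of the supervised-learning framing above, a minimal behavioral-cloning sketch might look like the following; the toy state features and the choice of logistic regression are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Flatten demonstration traces (s_0, a_0), (s_1, a_1), ... into a supervised
# dataset: state features X and the expert's chosen actions y.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.9, 0.1]])  # toy state features
y = np.array([0, 1, 0, 1])                                      # expert actions

policy = LogisticRegression().fit(X, y)        # pi_hat: state features -> action
print(policy.predict(np.array([[0.2, 0.8]])))  # act in a previously unseen state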

Page 10:

Which has the most succinct description: π* vs. R*?

Especially in planning oriented tasks, the reward function is often much more succinct than the optimal policy.

Inverse RL vs. behavioral cloning

Page 11:

Examples of apprenticeship learning via inverse RL

Inverse RL vs. behavioral cloning

Historical sketch of inverse RL

Mathematical formulations for inverse RL

Examples

Lecture outline

Page 12:

1964, Kalman posed the inverse optimal control problem and solved it in the 1D input case

1994, Boyd+al.: a linear matrix inequality (LMI) characterization for the general linear quadratic setting

2000, Ng and Russell: first MDP formulation, reward function ambiguity pointed out and a few solutions suggested

2004, Abbeel and Ng: inverse RL for apprenticeship learning---reward feature matching

2006, Ratliff+al: max margin formulation

Inverse RL history

Page 13:

2007, Ratliff+al: max margin with boosting---enables large vocabulary of reward features

2007, Ramachandran and Amir, and Neu and Szepesvari: reward function as characterization of policy class

2008, Kolter, Abbeel and Ng: hierarchical max-margin

2008, Syed and Schapire: feature matching + game theoretic formulation

2008, Ziebart+al: feature matching + max entropy

2010, Dvijotham, Todorov: Inverse RL with LMDPs

Active inverse RL? Inverse RL w.r.t. minmax control, partial observability, learning stage (rather than observing optimal policy), … ?

Inverse RL history

Page 14:

Examples of apprenticeship learning via inverse RL

Inverse RL vs. behavioral cloning

Historical sketch of inverse RL

Mathematical formulations for inverse RL

Examples

Lecture outline

Page 15:

IRL with LMDPs: OptV

Algorithm (OptV):

1. Input: transitions {(x_n, y_n)} sampled from the optimal control u*, and the passive dynamics P(x' | x)

2. Estimate v by maximizing the log-likelihood of the observed transitions

3. Compute q from v using the LMDP Bellman equation (sketched below)

No need to solve the optimal control problem repeatedly
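To make steps 2 and 3 concrete, here is a sketch of the quantities involved, assuming the standard LMDP model (passive dynamics p, value function v, desirability z = e^{-v}, optimal controlled dynamics u*(x' | x) ∝ p(x' | x) e^{-v(x')}); this is a reconstruction under those assumptions, not the slide's own formulas.

\begin{align*}
  L(v) &= \sum_n \Big[ -v(y_n) \;-\; \log \sum_{x'} p(x' \mid x_n)\, e^{-v(x')} \Big] + \text{const}
    && \text{(step 2: log-likelihood of the transitions } (x_n, y_n)\text{)} \\
  q(x) &= v(x) \;+\; \log \sum_{x'} p(x' \mid x)\, e^{-v(x')}
    && \text{(step 3: state cost recovered via the LMDP Bellman equation)}
\end{align*}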

Page 16:

Continuous State Spaces

How to represent v in a continuous state space?

Discretize: blows up exponentially with dimension

Intuition: [Figure panels: “Typical Sample (2d)” and “Local Gaussian Bases”]

Page 17:

Representation of v in continuous state spaces? RBFs (radial basis functions):

1. Lots of approximating power in highly visited regions

2. Low approximating power (smoother approximation) in low-density regions
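A small Python sketch of this idea (toy data; the centers, widths, and weights are illustrative assumptions): Gaussian bases are placed on states the expert actually visits, so v = θᵀφ has fine resolution where the data is dense and stays smooth elsewhere.

import numpy as np

def make_rbf_features(centers, widths):
    """Return phi(x): a vector of Gaussian bumps at the given centers."""
    def phi(x):
        d2 = np.sum((x[None, :] - centers) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * widths ** 2))
    return phi

rng = np.random.default_rng(0)
expert_states = rng.normal(size=(200, 2))           # toy 2-D visited states
centers = expert_states[rng.choice(200, size=20)]   # put bases where the data lives
widths = np.full(20, 0.5)                           # wider bases = smoother v
phi = make_rbf_features(centers, widths)

theta = rng.normal(size=20)                         # basis weights (to be fitted)
v = lambda x: theta @ phi(x)                        # v(x) = sum_i theta_i * phi_i(x)
print(v(np.zeros(2)))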

Page 18:

Basis Adaptation

Complete log-likelihood:

Algorithm (OptVA):

1. Initialize θ “cleverly”

2. Until convergence:

   1. L-BFGS to compute L*(θ)

   2. Conjugate gradient ascent on L*(θ)
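The two-level loop can be sketched with SciPy as below. The objective here is a hypothetical stand-in for the (negative) OptVA log-likelihood: the inner L-BFGS solve over the weights w gives the profiled value L*(θ) for fixed basis parameters θ, and the outer conjugate-gradient step moves θ (minimizing the negative profiled likelihood corresponds to the ascent on L*(θ) in the slide).

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))            # toy samples standing in for expert states

def neg_log_likelihood(w, theta):
    # Hypothetical smooth stand-in for -L(w, theta); theta encodes basis centers.
    centers = theta.reshape(-1, 2)
    feats = np.exp(-np.sum((data[:, None, :] - centers) ** 2, axis=2))
    return np.mean((feats @ w - 1.0) ** 2) + 1e-3 * (w @ w)

def profiled(theta, w0):
    # L*(theta): optimize out the basis weights w with L-BFGS for this theta.
    inner = minimize(neg_log_likelihood, w0, args=(theta,), method="L-BFGS-B")
    return inner.fun, inner.x

theta = rng.normal(size=6)                  # three 2-D basis centers, initialized at random here
w = np.zeros(3)
for _ in range(5):                          # "until convergence" in the slide
    _, w = profiled(theta, w)                              # inner L-BFGS
    theta = minimize(lambda t: profiled(t, w)[0], theta,   # outer CG step on theta
                     method="CG", options={"maxiter": 5}).x
print(theta.reshape(-1, 2))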

Page 19:

OptQ

Observe a set of expert trajectories

Log-likelihood of the set of trajectories

Express z in terms of q

Resulting expression is convex in q: optimize!
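A sketch of how the trajectory log-likelihood can be written, assuming the standard LMDP relations u*(x' | x) = p(x' | x) z(x') / G[z](x), with G[z](x) = Σ_{x'} p(x' | x) z(x') and the linear Bellman equation z(x) = e^{-q(x)} G[z](x); this is a reconstruction under those assumptions rather than the slide's exact derivation.

\begin{align*}
  \log \mathcal{L}
    &= \sum_t \log u^*(x_{t+1} \mid x_t) \\
    &= \sum_t \Big[ \log p(x_{t+1} \mid x_t) + \log z(x_{t+1}) - \log G[z](x_t) \Big] \\
    &= \sum_t \Big[ \log z(x_{t+1}) - \log z(x_t) - q(x_t) \Big] + \text{const}.
\end{align*}

Substituting z expressed in terms of q via the linear Bellman equation then leaves an objective in q alone, which the slide notes is convex and can be optimized directly.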

Page 20:

Examples of apprenticeship learning via inverse RL

Inverse RL vs. behavioral cloning

Historical sketch of inverse RL

Mathematical formulations for inverse RL

IRL with LMDPs

IRL with Traditional MDPs

Examples

Lecture outline

Page 21:

IRL in Traditional MDPs

GridWorld:

• Controls:

• First-Exit: minimize cost until you hit the goal

IRL is ill-posed (Ng ’00)

Page 22:

IRL in Traditional MDPs

Let $P_\pi$ denote the transition dynamics under policy π. Then (Ng ’00) any reward function R satisfying

$(P_\pi - P_a)(I - \gamma P_\pi)^{-1} R \;\succeq\; 0 \quad \forall a$

makes π optimal.

Page 23:

Find a reward function R* which explains the expert behaviour.

Find R* such that

$E[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*] \;\geq\; E[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi] \quad \forall \pi$

In fact a convex feasibility problem (a toy numerical sketch follows below), but many challenges:

R = 0 is a solution; more generally: reward function ambiguity

We typically only observe expert traces rather than the entire expert policy π* --- how to compute the LHS?

Assumes the expert is indeed optimal --- otherwise infeasible

Computationally: assumes we can enumerate all policies

IRL in MDPs: Basic principle
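As a toy numerical illustration of the feasibility view (not the lecture's algorithm), suppose a small state space and a handful of explicitly enumerated candidate policies; each policy contributes one linear constraint on the reward vector R via its discounted state-visitation vector. The program below looks for a bounded R giving the expert the largest total advantage; without that tie-breaking objective, R = 0 trivially satisfies every constraint, which is exactly the ambiguity noted above.

import numpy as np
from scipy.optimize import linprog

d_expert = np.array([0.5, 0.3, 0.2])       # toy discounted visitation of pi*
d_others = [np.array([0.6, 0.3, 0.1]),     # visitation vectors of two other policies
            np.array([0.4, 0.4, 0.2])]

# Constraints: (d_other - d_expert)^T R <= 0 for every other policy.
A_ub = np.stack([d - d_expert for d in d_others])
# Objective: minimize the summed (d_other - d_expert)^T R, i.e. maximize the
# expert's total margin, with R kept in a box to rule out unbounded solutions.
c = A_ub.sum(axis=0)
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(d_others)), bounds=[(-1, 1)] * 3)
print(res.x)                               # one reward vector consistent with the expert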

Page 24:

Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$.

$E[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi] = E[\sum_{t=0}^{\infty} \gamma^t w^\top \phi(s_t) \mid \pi] = w^\top E[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid \pi] = w^\top \mu(\pi)$

$\mu(\pi)$ is the expected cumulative discounted sum of feature values, or “feature expectations” (a sketch of estimating them from sample trajectories follows below).

Subbing into

$E[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*] \;\geq\; E[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi] \quad \forall \pi$

gives us:

Find $w^*$ such that $w^{*\top} \mu(\pi^*) \;\geq\; w^{*\top} \mu(\pi) \quad \forall \pi$

Feature based reward function
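Since everything that follows is phrased in terms of μ(π), here is a small sketch of estimating feature expectations by averaging discounted feature sums over sampled trajectories; the indicator features and the two toy trajectories are made up for illustration.

import numpy as np

def feature_expectations(trajectories, phi, gamma):
    """Average the discounted feature sums of the given state sequences."""
    mus = []
    for states in trajectories:
        mus.append(sum(gamma ** t * phi(s) for t, s in enumerate(states)))
    return np.mean(mus, axis=0)

phi = lambda s: np.array([s == 0, s == 1, s == 2], dtype=float)  # toy indicator features
expert_trajs = [[0, 1, 2, 2], [0, 0, 1, 2]]                      # two sampled expert traces
mu_expert = feature_expectations(expert_trajs, phi, gamma=0.9)   # estimate of mu(pi*)
print(mu_expert)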

Page 25:

Feature expectations can be readily estimated from sample trajectories.

The number of expert demonstrations required scales with the number of features in the reward function.

The number of expert demonstrations required does not depend on

Complexity of the expert’s optimal policy π*

Size of the state space directly

Feature based reward function

Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$. Find $w^*$ such that $w^{*\top}\mu(\pi^*) \geq w^{*\top}\mu(\pi) \;\; \forall \pi$

$E[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*] \;\geq\; E[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi] \quad \forall \pi$

Page 26:

Challenges:

Assumes we know the entire expert policy π*

assumes we can estimate expert feature expectations

R = 0 is a solution (now: w = 0); more generally: reward function ambiguity

Assumes the expert is indeed optimal---became even more of an issue with the more limited reward function expressiveness!

Computationally: assumes we can enumerate all policies

Recap of challenges

Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$. Find $w^*$ such that $w^{*\top}\mu(\pi^*) \geq w^{*\top}\mu(\pi) \;\; \forall \pi$

Page 27:

Standard max margin:

$\min_w \|w\|_2^2 \quad \text{s.t.} \quad w^\top\mu(\pi^*) \geq w^\top\mu(\pi) + 1 \quad \forall \pi$

“Structured prediction” max margin:

$\min_w \|w\|_2^2 \quad \text{s.t.} \quad w^\top\mu(\pi^*) \geq w^\top\mu(\pi) + m(\pi^*, \pi) \quad \forall \pi$

Justification: the margin should be larger for policies that are very different from π*.

Example: m(π, π*) = number of states in which π* was observed and in which π and π* disagree

(A small numerical sketch of the structured max-margin program follows below.)

Ambiguity
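A small numerical sketch of the structured max-margin program, with just two alternative policies enumerated explicitly; the feature expectations and margin values are made up, and in the full problem there is one constraint per policy, which constraint generation (later slides) handles.

import numpy as np
import cvxpy as cp

mu_expert = np.array([1.0, 0.2])              # mu(pi*), made up
candidates = [(np.array([0.8, 0.5]), 2.0),    # (mu(pi), m(pi*, pi)) pairs, made up
              (np.array([0.6, 0.1]), 1.0)]

w = cp.Variable(2)
constraints = [w @ mu_expert >= w @ mu + m for mu, m in candidates]
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()
print(w.value)   # reward weights separating the expert from the alternatives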

Page 28:

Structured prediction max margin with slack variables:

$\min_w \|w\|_2^2 + C\xi \quad \text{s.t.} \quad w^\top\mu(\pi^*) \geq w^\top\mu(\pi) + m(\pi^*, \pi) - \xi \quad \forall \pi$

Can be generalized to multiple MDPs (could also be the same MDP with different initial states):

$\min_w \|w\|_2^2 + C\sum_i \xi^{(i)} \quad \text{s.t.} \quad w^\top\mu(\pi^{(i)*}) \geq w^\top\mu(\pi^{(i)}) + m(\pi^{(i)*}, \pi^{(i)}) - \xi^{(i)} \quad \forall i, \pi^{(i)}$

Expert suboptimality

Page 29:

Resolved: access to π*, ambiguity, expert suboptimality

One challenge remains: very large number of constraints

Ratliff+al use subgradient methods.

In this lecture: constraint generation

Complete max-margin formulation

[Ratliff, Zinkevich and Bagnell, 2006]

$\min_w \|w\|_2^2 + C\sum_i \xi^{(i)} \quad \text{s.t.} \quad w^\top\mu(\pi^{(i)*}) \geq w^\top\mu(\pi^{(i)}) + m(\pi^{(i)*}, \pi^{(i)}) - \xi^{(i)} \quad \forall i, \pi^{(i)}$

Page 30:

Initialize Π^(i) = {} for all i and then iterate:

Solve

$\min_w \|w\|_2^2 + C\sum_i \xi^{(i)} \quad \text{s.t.} \quad w^\top\mu(\pi^{(i)*}) \geq w^\top\mu(\pi^{(i)}) + m(\pi^{(i)*}, \pi^{(i)}) - \xi^{(i)} \quad \forall i,\; \forall \pi^{(i)} \in \Pi^{(i)}$

For the current value of w, find the most violated constraint for all i by solving:

$\max_{\pi^{(i)}} \; w^\top\mu(\pi^{(i)}) + m(\pi^{(i)*}, \pi^{(i)})$

= find the optimal policy for the current estimate of the reward function (+ loss augmentation m)

For all i, add π^(i) to Π^(i)

If no constraint violations were found, we are done. (A code skeleton of this loop follows below.)

Constraint generation
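The loop above written as a Python skeleton; solve_restricted_qp and best_response are hypothetical callables standing in for the QP over the current constraint set Π^(i) (for instance the cvxpy program sketched earlier) and for the loss-augmented planner, so this shows the control flow rather than a full solver.

def constraint_generation(mu_expert, solve_restricted_qp, best_response,
                          tol=1e-6, max_iters=50):
    Pi = []                                     # active constraint set, starts empty
    w = solve_restricted_qp(mu_expert, Pi)      # unconstrained solve gives w = 0
    for _ in range(max_iters):
        mu, m = best_response(w)                # most violated constraint for this w
        if w @ mu_expert >= w @ mu + m - tol:   # nothing violated: we are done
            return w
        Pi.append((mu, m))                      # add the violated constraint
        w = solve_restricted_qp(mu_expert, Pi)  # re-solve the restricted QP
    return w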

Page 31:

Every policy π has a corresponding feature expectation vector μ(π), which for visualization purposes we assume to be 2D

$E[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi] = w^\top \mu(\pi)$

[Figure: policies plotted in the (μ_1, μ_2) feature expectation plane, with the expert point μ(π*), the max margin direction w_mm, and the structured max margin direction w_smm]

Visualization in feature expectation space

Page 32:

Every policy π has a corresponding feature expectation vector μ(π), which for visualization purposes we assume to be 2D

$\max_\pi E[\sum_{t=0}^{\infty} \gamma^t R_w(s_t) \mid \pi] = \max_\pi w^\top \mu(\pi)$

[Figure: constraint generation in the (μ_1, μ_2) plane, showing the expert point μ(π*), the generated policy points μ(π^(0)), μ(π^(1)), μ(π^(2)), and the successive weight vectors w^(2) and w^(3) = w_mm]

Constraint generation

Page 33:

Inverse RL starting point: find a reward function such that the expert outperforms other policies:

Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$. Find $w^*$ such that $w^{*\top}\mu(\pi^*) \geq w^{*\top}\mu(\pi) \;\; \forall \pi$

Observation in Abbeel and Ng, 2004: for a policy π to be guaranteed to perform as well as the expert policy π*, it suffices that the feature expectations match:

$\|\mu(\pi) - \mu(\pi^*)\|_1 \leq \epsilon$

implies that for all w with $\|w\|_\infty \leq 1$:

$E[\sum_{t=0}^{\infty} \gamma^t R_{w^*}(s_t) \mid \pi] = w^{*\top}\mu(\pi) \;\geq\; w^{*\top}\mu(\pi^*) - \epsilon = E[\sum_{t=0}^{\infty} \gamma^t R_{w^*}(s_t) \mid \pi^*] - \epsilon$

(Hölder’s inequality: $\forall p, q \geq 1$, if $\frac{1}{p} + \frac{1}{q} = 1$, then $x^\top y \leq \|x\|_p \|y\|_q$)

Feature matching

Page 34:

Examples of apprenticeship learning via inverse RL

Inverse RL vs. behavioral cloning

History of inverse RL

Mathematical formulations for inverse RL

Examples

Lecture outline

Page 35:

GridWorld CPU times

Page 36:

Results

• Colored plots represent functions f(x, y)

• Red = high, blue = low

• Continuous state problems, 2-D state space

• x-axis: position, y-axis: velocity

Page 37:

Results on Continuous Problems

True Cost

Page 38:

Results on Continuous Problems

True Cost

Page 39:

Comparison to Other Algorithms

Optimal Policy

OptV Estimate

2 min

Ng ’00 Estimate

Abbeel ’04 Estimate

Discretized controls for the other algorithms to make them run tractably

Page 40:

Example of apprenticeship learning via inverse RL

Inverse RL vs. behavioral cloning

Sketch of history of inverse RL

Mathematical formulations for inverse RL

Case studies

Inverse Optimal Control with LMDPs

Lecture outline

Page 41:

Inverse RL vs. behavioral cloning

Sketch of history of inverse RL

Mathematical formulations for inverse RL

Inverse RL with LMDPs vs Inverse RL with Traditional MDPs

Summary

Page 42:

Abbeel and Ng. Apprenticeship learning via inverse reinforcement learning. (ICML 2004)

Boyd, El Ghaoui, Feron and Balakrishnan. Linear matrix inequalities in system and control theory. (SIAM 1994)

Kolter, Abbeel and Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. (NIPS 2008)

Neu and Szepesvari. Apprenticeship learning using inverse reinforcement learning and gradient methods. (UAI 2007)

Ng and Russell. Algorithms for inverse reinforcement learning. (ICML 2000)

Ratliff, Bagnell and Zinkevich. Maximum margin planning. (ICML 2006)

Ratliff, Bradley, Bagnell and Chestnutt. Boosting structured prediction for imitation learning. (NIPS 2007)

Ramachandran and Amir. Bayesian inverse reinforcement learning. (IJCAI 2007)

Syed and Schapire. A game-theoretic approach to apprenticeship learning. (NIPS 2008)

Ziebart, Maas, Bagnell and Dey. Maximum entropy inverse reinforcement learning. (AAAI 2008)

Dvijotham and Todorov. Inverse optimal control with linearly-solvable MDPs. (ICML 2010)

More materials / References