Apprenticeship Learning via Inverse Reinforcement Learning
Inverse Reinforcement Learning
Krishnamurthy Dvijotham
UW
(Most slides borrowed/adapted from Pieter Abbeel)
Reinforcement Learning
Almost the same as Optimal Control
“Reinforcement” term coined by psychologists studying animal learning
Focus on discrete state spaces, highly stochastic environments, and learning to control without (in general) knowing the system model
Work with rewards instead of costs, usually discounted
Inverse Reinforcement Learning
Dynamics Model
Expert Trajectories
IOC Algorithm
Reward Function R
R explains expert trajectories
Scientific inquiry
Model animal and human behavior
E.g., bee foraging, songbird vocalization. [See intro of Ng and Russell, 2000 for a brief overview.]
Apprenticeship learning/Imitation learning through inverse RL
Presupposition: reward function provides the most succinct and transferable definition of the task
Has enabled advancing the state of the art in various robotic domains
Modeling of other agents, both adversarial and cooperative
Motivation for inverse RL
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
Simulated highway driving
Abbeel and Ng, ICML 2004,
Syed and Schapire, NIPS 2007
Aerial imagery based navigation
Ratliff, Bagnell and Zinkevich, ICML 2006
Parking lot navigation
Abbeel, Dolgov, Ng and Thrun, IROS 2008
Urban navigation
Ziebart, Maas, Bagnell and Dey, AAAI 2008
Quadruped locomotion
Ratliff, Bradley, Bagnell and Chestnutt, NIPS 2007
Kolter, Abbeel and Ng, NIPS 2008
Examples
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
Input:
State space, action space
Transition model P_sa(s_{t+1} | s_t, a_t)
No reward function
Teacher’s demonstration: s_0, a_0, s_1, a_1, s_2, a_2, …
(= trace of the teacher’s policy π*)
Inverse RL:
Can we recover R ?
Apprenticeship learning via inverse RL
Can we then use this R to find a good policy ?
Behavioral cloning
Can we directly learn the teacher’s policy using supervised learning?
Problem setup
Formulate as standard machine learning problem
Fix a policy class
E.g., support vector machine, neural network, decision tree, deep belief net, …
Estimate a policy (= mapping from states to actions) from the training examples (s_0, a_0), (s_1, a_1), (s_2, a_2), …, as in the sketch below
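A minimal sketch of that supervised formulation (the policy class, data shapes, and the sklearn choice are illustrative assumptions, not from the lecture):

import numpy as np
from sklearn.neural_network import MLPClassifier

# Teacher demonstration: states paired with the actions the teacher took.
states = np.random.randn(1000, 4)                # stand-in for s_0, s_1, s_2, ...
actions = np.random.randint(0, 3, size=1000)     # stand-in for a_0, a_1, a_2, ...

# Behavioral cloning: fit a policy class (here a small neural network) to the (s, a) pairs.
policy = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(states, actions)

def act(state):
    # Cloned policy: map a state to the predicted teacher action.
    return policy.predict(state.reshape(1, -1))[0]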
Two of the most notable success stories:
Pomerleau, NIPS 1989: ALVINN
Sammut et al., ICML 1992: Learning to fly (flight sim)
Behavioral cloning
Which has the most succinct description: π* vs. R*?
Especially in planning oriented tasks, the reward function is often much more succinct than the optimal policy.
Inverse RL vs. behavioral cloning
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
1964, Kalman posed the inverse optimal control problem and solved it in the 1D input case
1994, Boyd et al.: a linear matrix inequality (LMI) characterization for the general linear quadratic setting
2000, Ng and Russell: first MDP formulation, reward function ambiguity pointed out and a few solutions suggested
2004, Abbeel and Ng: inverse RL for apprenticeship learning---reward feature matching
2006, Ratliff et al.: max margin formulation
Inverse RL history
2007, Ratliff et al.: max margin with boosting---enables large vocabulary of reward features
2007, Ramachandran and Amir, and Neu and Szepesvari: reward function as characterization of policy class
2008, Kolter, Abbeel and Ng: hierarchical max-margin
2008, Syed and Schapire: feature matching + game theoretic formulation
2008, Ziebart et al.: feature matching + max entropy
2010, Dvijotham, Todorov: Inverse RL with LMDPs
Open questions: active inverse RL? Inverse RL w.r.t. minimax control, partial observability, observing the expert during its learning stage (rather than observing its optimal policy), …?
Inverse RL history
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
IRL with LMDP : OptV
Algorithm (OptV):
1. Input: transitions {(x_n, y_n)} sampled from u*, dynamics P(x' | x)
2. Estimate v by maximizing the log-likelihood (sketched below)
3. Compute q from v (sketched below)
No need to solve the optimal control problem repeatedly
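The two missing formulas can be reconstructed from the standard LMDP relations (a sketch, not the slide's exact notation): under the optimal control law $u^*(y \mid x) \propto P(y \mid x)\, e^{-v(y)}$, step 2 maximizes (dropping the $v$-independent terms $\log P(y_n \mid x_n)$)
$L(v) = \sum_n \big[ -v(y_n) - \log \sum_{y} P(y \mid x_n)\, e^{-v(y)} \big]$,
and step 3 uses the LMDP Bellman relation $e^{-v(x)} = e^{-q(x)} \sum_{y} P(y \mid x)\, e^{-v(y)}$, i.e.
$q(x) = v(x) + \log \sum_{y} P(y \mid x)\, e^{-v(y)}$.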
Continuous State Spaces
How to represent v in a continuous state space?
[Figure: a typical 2-D state sample with local Gaussian bases placed over it]
Discretize: blows up exponentially with dimension
Intuition: RBFs (radial basis functions)
Representation of v in continuous state spaces?
1. Lots of approximation power in highly visited regions
2. Low approximation power (smoother approximation) in low-density regions
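A minimal sketch of the local-Gaussian-basis idea (the bandwidth, the number of bases, and the subsampling rule are illustrative assumptions):

import numpy as np

def rbf_features(X, centers, sigma):
    # Gaussian RBF features: one basis per center, evaluated at each state in X.
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Place basis centers at a subsample of the visited states, so the representation
# has many bases where the expert data is dense and few (a smoother fit) elsewhere.
visited = np.random.randn(500, 2)                                    # stand-in for expert-visited states
centers = visited[np.random.choice(len(visited), 50, replace=False)]
weights = np.random.randn(50)                                        # stand-in for learned weights
v_hat = rbf_features(visited, centers, sigma=0.3) @ weights          # v(x) = Phi(x) w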
Basis Adaptation
Complete log-likelihood:
Algorithm (OptVA):
1. Initialize θ “cleverly”
2. Until convergence:
   a. L-BFGS to compute L*(θ)
   b. Conjugate gradient ascent on L*(θ)
OptQ
Observe a set of expert trajectories
Write the log-likelihood of the set of trajectories
Express z in terms of q
The resulting expression is convex in q: optimize!
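As a sketch of the "express z in terms of q" step, using the standard LMDP identities (a reconstruction, not the slide's exact derivation): the desirability $z = e^{-v}$ satisfies $z(x) = e^{-q(x)} \sum_{x'} P(x' \mid x)\, z(x')$ at non-terminal states (with $z$ fixed by the terminal cost at goal states), and the optimal transition probabilities are $u^*(x' \mid x) = P(x' \mid x)\, z(x') \big/ \sum_{y} P(y \mid x)\, z(y)$. Substituting these into the trajectory log-likelihood gives the expression that the slide states is convex in $q$.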
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Historical sketch of inverse RL
Mathematical formulations for inverse RL
IRL with LMDPs
IRL with Traditional MDPs
Examples
Lecture outline
IRL in Traditional MDPs
GridWorld:
• Controls:
• First-exit: minimize cost until you hit the goal
IRL is ill-posed (Ng ’00)
IRL in Traditional MDPs
Let P_Π denote the transition dynamics under policy Π. Then any reward function R satisfying
$(P_\Pi - P_a)(I - \gamma P_\Pi)^{-1} R \succeq 0$ for all actions a
makes Π optimal (the characterization from Ng and Russell, 2000).
Find a reward function R* which explains the expert behaviour.
Find R* such that
In fact a convex feasibility problem, but many challenges:
R=0 is a solution, more generally: reward function ambiguity
We typically only observe expert traces rather than the entire expert policy π* --- how to compute the LHS?
Assumes the expert is indeed optimal --- otherwise infeasible
Computationally: assumes we can enumerate all policies
IRL in MDPs : Basic principle
$E\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*\right] \ge E\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi\right] \quad \forall \pi$
Substituting the feature-based reward into this inequality gives us:
Feature based reward function
Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$.
$E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi\right] = E\left[\sum_{t=0}^{\infty} \gamma^t w^\top \phi(s_t) \mid \pi\right] = w^\top E\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid \pi\right] = w^\top \mu(\pi)$
μ(π) is the expected cumulative discounted sum of feature values, or “feature expectations”
$E\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*\right] \ge E\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi\right] \quad \forall \pi$
Find $w^*$ such that $w^{*\top} \mu(\pi^*) \ge w^{*\top} \mu(\pi) \quad \forall \pi$
Feature expectations can be readily estimated from sample trajectories.
The number of expert demonstrations required scales with the number of features in the reward function.
The number of expert demonstrations required does not depend on:
Complexity of the expert’s optimal policy π*
Size of the state space directly
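Since feature expectations can be readily estimated from sample trajectories, here is a minimal Monte-Carlo sketch (the feature map, discount, and data are illustrative assumptions):

import numpy as np

def feature_expectations(trajectories, phi, gamma=0.95):
    # Monte-Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t)] from sampled trajectories.
    mu = None
    for traj in trajectories:
        disc_sum = sum(gamma ** t * phi(s) for t, s in enumerate(traj))
        mu = disc_sum if mu is None else mu + disc_sum
    return mu / len(trajectories)

# Illustrative 2-D states with a hand-picked feature map.
phi = lambda s: np.array([s[0], s[1], float(s[0] > 0.0)])
expert_trajs = [np.random.randn(50, 2) for _ in range(10)]   # stand-in for expert traces
mu_expert = feature_expectations(expert_trajs, phi)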
Feature based reward function
Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$. Find $w^*$ such that $w^{*\top}\mu(\pi^*) \ge w^{*\top}\mu(\pi) \;\; \forall \pi$
$E\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*\right] \ge E\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi\right] \quad \forall \pi$
Challenges:
Assumes we know the entire expert policy π*
Assumes we can estimate expert feature expectations
R=0 is a solution (now: w=0), more generally: reward function ambiguity
Assumes the expert is indeed optimal---this became even more of an issue with the more limited reward-function expressiveness!
Computationally: assumes we can enumerate all policies
Recap of challenges
Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$. Find $w^*$ such that $w^{*\top}\mu(\pi^*) \ge w^{*\top}\mu(\pi) \;\; \forall \pi$
Standard max margin:
“Structured prediction” max margin:
Justification: margin should be larger for policies that are very different from π*.
Example: m(π, π*) = number of states in which π* was observed and in which π and π* disagree
Ambiguity
$\min_w \|w\|_2^2$
s.t. $w^\top \mu(\pi^*) \ge w^\top \mu(\pi) + 1 \quad \forall \pi$
$\min_w \|w\|_2^2$
s.t. $w^\top \mu(\pi^*) \ge w^\top \mu(\pi) + m(\pi^*, \pi) \quad \forall \pi$
Structured prediction max margin with slack variables:
Can be generalized to multiple MDPs (could also be same MDP with different initial state)
Expert suboptimality
$\min_w \|w\|_2^2 + C\xi$
s.t. $w^\top \mu(\pi^*) \ge w^\top \mu(\pi) + m(\pi^*, \pi) - \xi \quad \forall \pi$
$\min_w \|w\|_2^2 + C \sum_i \xi^{(i)}$
s.t. $w^\top \mu(\pi^{(i)*}) \ge w^\top \mu(\pi^{(i)}) + m(\pi^{(i)*}, \pi^{(i)}) - \xi^{(i)} \quad \forall i, \pi^{(i)}$
Resolved: access to π*, ambiguity, expert suboptimality
One challenge remains: very large number of constraints
Ratliff et al. use subgradient methods.
In this lecture: constraint generation
Complete max-margin formulation
[Ratliff, Zinkevich and Bagnell, 2006]
$\min_w \|w\|_2^2 + C \sum_i \xi^{(i)}$
s.t. $w^\top \mu(\pi^{(i)*}) \ge w^\top \mu(\pi^{(i)}) + m(\pi^{(i)*}, \pi^{(i)}) - \xi^{(i)} \quad \forall i, \pi^{(i)}$
Initialize $\Pi^{(i)} = \{\}$ for all i and then iterate
Solve
For current value of w, find the most violated constraint for all i by solving:
= find the optimal policy for the current estimate of the reward function (+ loss augmentation m)
For all i, add $\pi^{(i)}$ to $\Pi^{(i)}$
If no constraint violations were found, we are done.
Constraint generation
$\max_{\pi^{(i)}} \; w^\top \mu(\pi^{(i)}) + m(\pi^{(i)*}, \pi^{(i)})$
$\min_w \|w\|_2^2 + C \sum_i \xi^{(i)}$
s.t. $w^\top \mu(\pi^{(i)*}) \ge w^\top \mu(\pi^{(i)}) + m(\pi^{(i)*}, \pi^{(i)}) - \xi^{(i)} \quad \forall i, \; \forall \pi^{(i)} \in \Pi^{(i)}$
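A sketch of this constraint-generation loop in code; solve_qp and loss_augmented_planner are hypothetical helpers standing in for the QP solver and for planning with the loss-augmented reward (neither is defined in the slides):

def constraint_generation(mdps, mu_expert, margin, solve_qp, loss_augmented_planner,
                          max_iter=50, tol=1e-8):
    # Pi^(i) = {} for all i: the constraint sets start empty.
    constraint_sets = [[] for _ in mdps]
    w = None
    for _ in range(max_iter):
        # Solve the max-margin QP over the constraints generated so far.
        w, xi = solve_qp(mu_expert, constraint_sets)
        violated = False
        for i, mdp in enumerate(mdps):
            # Most violated constraint: argmax_pi  w' mu(pi) + m(pi^(i)*, pi),
            # i.e. the optimal policy for the current reward estimate plus the loss augmentation.
            pi, mu_pi = loss_augmented_planner(mdp, w, margin)
            if w @ mu_expert[i] < w @ mu_pi + margin(mdp, pi) - xi[i] - tol:
                constraint_sets[i].append((pi, mu_pi))   # add pi^(i) to Pi^(i)
                violated = True
        if not violated:
            # No constraint violations found: done.
            break
    return w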
Every policy π has a corresponding feature expectation vector μ(π),
which for visualization purposes we assume to be 2D
Visualization in feature expectation space
$E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi\right] = w^\top \mu(\pi)$
[Figure: policies plotted as points μ(π) in a 2-D feature expectation space (axes μ₁, μ₂), with the expert point μ(π*) and the max-margin (w_mm) and structured max-margin (w_smm) weight directions]
Every policy π has a corresponding feature expectation vector μ(π),
which for visualization purposes we assume to be 2D
Constraint generation
[Figure: constraint generation in feature expectation space, showing the points μ(π⁽⁰⁾), μ(π⁽¹⁾), μ(π⁽²⁾) and μ(π*), with successive weight vectors w⁽²⁾, w⁽³⁾ = w_mm]
Constraint generation step: $\max_\pi E\left[\sum_{t=0}^{\infty} \gamma^t R_w(s_t) \mid \pi\right] = \max_\pi w^\top \mu(\pi)$
Inverse RL starting point: find a reward function such that the expert outperforms other policies
Observation in Abbeel and Ng, 2004: for a policy π to be guaranteed to perform as well as the expert policy π*, it suffices that the feature expectations match.
Feature matching
Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$.
Find $w^*$ such that $w^{*\top}\mu(\pi^*) \ge w^{*\top}\mu(\pi) \quad \forall \pi$
$\|\mu(\pi) - \mu(\pi^*)\|_1 \le \epsilon$ implies that for all $w^*$ with $\|w^*\|_\infty \le 1$:
$E\left[\sum_{t=0}^{\infty} \gamma^t R_{w^*}(s_t) \mid \pi\right] = w^{*\top}\mu(\pi) \ge w^{*\top}\mu(\pi^*) - \epsilon = E\left[\sum_{t=0}^{\infty} \gamma^t R_{w^*}(s_t) \mid \pi^*\right] - \epsilon$
Hölder's inequality: $\forall p, q \ge 1$, if $\frac{1}{p} + \frac{1}{q} = 1$, then $x^\top y \le \|x\|_p \|y\|_q$
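Spelling out the Hölder step behind this guarantee (with the reading $p = \infty$, $q = 1$, which matches the norms above):
$\big| w^{*\top}\mu(\pi) - w^{*\top}\mu(\pi^*) \big| \le \|w^*\|_\infty \, \|\mu(\pi) - \mu(\pi^*)\|_1 \le 1 \cdot \epsilon$,
so any policy that matches the expert's feature expectations to within $\epsilon$ loses at most $\epsilon$ in value under any admissible reward weight.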
Examples of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
History of inverse RL
Mathematical formulations for inverse RL
Examples
Lecture outline
GridWorld CPU times
Results
• Colored plots represent functions f(x, y)
• Red = high, blue = low
• Continuous-state problems, 2-D state space
• x-axis: position, y-axis: velocity
Results on Continuous Problems
[Figures: true cost compared with the learned cost estimates]
Comparison to Other Algorithms
[Figure: optimal policy, OptV estimate (2 min), Ng ’00 estimate, Abbeel ’04 estimate]
Discretized controls for the other algorithms to make them run tractably
Example of apprenticeship learning via inverse RL
Inverse RL vs. behavioral cloning
Sketch of history of inverse RL
Mathematical formulations for inverse RL
Case studies
Inverse Optimal Control with LMDPs
Lecture outline
Inverse RL vs. behavioral cloning
Sketch of history of inverse RL
Mathematical formulations for inverse RL
Inverse RL with LMDPs vs Inverse RL with Traditional MDPs
Summary
Abbeel and Ng. Apprenticeship learning via inverse reinforcement learning. (ICML 2004)
Boyd, El Ghaoui, Feron and Balakrishnan. Linear matrix inequalities in system and control theory. (SIAM 1994)
Kolter, Abbeel and Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. (NIPS 2008)
Neu and Szepesvari. Apprenticeship learning using inverse reinforcement learning and gradient methods. (UAI 2007)
Ng and Russell. Algorithms for inverse reinforcement learning. (ICML 2000)
Ratliff, Bagnell and Zinkevich. Maximum margin planning. (ICML 2006)
Ratliff, Bradley, Bagnell and Chestnutt. Boosting structured prediction for imitation learning. (NIPS 2007)
Ramachandran and Amir. Bayesian inverse reinforcement learning. (IJCAI 2007)
Syed and Schapire. A game-theoretic approach to apprenticeship learning. (NIPS 2008)
Ziebart, Maas, Bagnell and Dey. Maximum entropy inverse reinforcement learning. (AAAI 2008)
Dvijotham and Todorov. Inverse Optimal Control with Linearly-Solvable MDPs. (ICML 2010)
More materials / References