
Continuous Inverse Optimal Control with Locally Optimal Examples

Sergey Levine [email protected]    Vladlen Koltun [email protected]

Computer Science Department, Stanford University, Stanford, CA 94305 USA

Abstract

Inverse optimal control, also known as inverse reinforcement learning, is the problem of recovering an unknown reward function in a Markov decision process from expert demonstrations of the optimal policy. We introduce a probabilistic inverse optimal control algorithm that scales gracefully with task dimensionality, and is suitable for large, continuous domains where even computing a full policy is impractical. By using a local approximation of the reward function, our method can also drop the assumption that the demonstrations are globally optimal, requiring only local optimality. This allows it to learn from examples that are unsuitable for prior methods.

1. Introduction

Algorithms for inverse optimal control (IOC), also known as inverse reinforcement learning (IRL), recover an unknown reward function in a Markov decision process (MDP) from expert demonstrations of the corresponding policy. This reward function can be used to perform apprenticeship learning, generalize the expert's behavior to new situations, or infer the expert's goals (Ng & Russell, 2000). Performing IOC in continuous, high-dimensional domains is challenging, because IOC algorithms are usually much more computationally demanding than the corresponding "forward" control methods. In this paper, we present an IOC algorithm that efficiently handles deterministic MDPs with large, continuous state and action spaces by considering only the shape of the learned reward function in the neighborhood of the expert's demonstrations.

Since our method only considers the shape of the reward function around the expert's examples, it does not integrate global information about the reward along alternative paths. This is analogous to trajectory optimization methods, which solve the forward control problem by finding a local optimum. However, while the lack of global optimality is a disadvantage for solving the forward problem, it can actually be advantageous in IOC. This is because it removes the assumption that the expert demonstrations are globally optimal, thus allowing our algorithm to use examples that only exhibit local optimality. For complex tasks, human experts might find it easier to provide such locally optimal examples. For instance, a skilled driver might execute every turn perfectly, but still take a globally suboptimal route to the destination.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Our algorithm optimizes the approximate likelihood of the expert trajectories under a parameterized reward. The approximation assumes that the expert's trajectory lies near a peak of this likelihood, and the resulting optimization finds a reward function under which this peak is most prominent. Since this approach only considers the shape of the reward around the examples, it does not require the examples to be globally optimal, and remains efficient even in high dimensions. We present two variants of our algorithm that learn the reward either as a linear combination of the provided features, as is common in prior work, or as a nonlinear function of the features, as in a number of recent methods (Ratliff et al., 2009; Levine et al., 2010; 2011).

2. Related Work

Most prior IOC methods solve the entire forward control problem in the inner loop of an iterative procedure (Abbeel & Ng, 2004; Ratliff et al., 2006; Ziebart, 2010). Such methods often use an arbitrary, possibly approximate forward solver, but this solver must be used numerous times during the learning process, making reward learning significantly more costly than the forward problem. Dvijotham and Todorov avoid repeated calls to a forward solver by directly learning a value function (Dvijotham & Todorov, 2010). However, this requires value function bases to impose structure on the solution, instead of the more common reward bases. Good value function bases are difficult to construct and are not portable across domains. By only considering the reward around the expert's trajectories, our method removes the need to repeatedly solve a difficult forward problem, without losing the ability to utilize informative reward features.

More efficient IOC algorithms have been proposed for the special case of linear dynamics and quadratic rewards (LQR) (Boyd et al., 1994; Ziebart, 2010). However, unlike in the forward case, LQR approaches are difficult to generalize to arbitrary inverse problems, because learning a quadratic reward matrix around an example path does not readily generalize to other states in a non-LQR task. Because of this, such methods have only been applied to tasks that conform to the LQR assumptions (Ziebart, 2010). Our method also uses a quadratic expansion of the reward function, but instead of learning the values of a quadratic reward matrix directly, it learns a general parameterized reward using its Hessian and gradient. As we show in Section 5, a particularly efficient variant of our algorithm can be derived when the dynamics are linearized, and this derivation can in fact follow from standard LQR assumptions. However, this approximation is not required, and the general form of our algorithm does not assume linearized dynamics.

Most previous methods also assume that the expert demonstrations are globally optimal or near-optimal. Although this makes the examples more informative insofar as the learning algorithm can extract relevant global information, it also makes such methods unsuitable for learning from examples that are only locally optimal. As shown in the evaluation, our method can learn rewards even from locally optimal examples.

3. Background

We address deterministic, fixed-horizon control tasks with continuous states x = (x_1, ..., x_T)^T, continuous actions u = (u_1, ..., u_T)^T, and discrete time. Such tasks are characterized by a dynamics function F, which we define as

F(x_{t-1}, u_t) = x_t,

as well as a reward function r(x_t, u_t). Given the initial state x_0, the optimal actions are given by

u = \arg\max_u \sum_t r(x_t, u_t).

Figure 1. A trajectory that is locally optimal but globally suboptimal (black) and a globally optimal trajectory (grey). While prior methods usually require globally optimal examples, our approach can use examples that are only locally optimal. Warmer colors indicate higher reward.

IOC aims to find a reward function r under which the optimal actions match the expert's demonstrations, given by D = {(x_0^{(1)}, u^{(1)}), ..., (x_0^{(n)}, u^{(n)})}. The algorithm might also be presented with reward features f : (x_t, u_t) → R that can be used to represent the unknown reward r. Unfortunately, real demonstrations are rarely perfectly optimal, so we require a model for the expert's behavior that can explain suboptimality or "noise." We employ the maximum entropy IRL (MaxEnt) model (Ziebart et al., 2008), which is closely related to linearly-solvable MDPs (Dvijotham & Todorov, 2010). Under this model, the probability of the actions u is proportional to the exponential of the rewards encountered along the trajectory:¹

P(u | x_0) = \frac{1}{Z} \exp\left( \sum_t r(x_t, u_t) \right),    (1)

where Z is the partition function. Under this model, the expert follows a stochastic policy that becomes more deterministic when the stakes are high, and more random when all choices have similar value. In prior work, the log likelihood derived from Equation 1 was maximized directly. However, computing the partition function Z requires finding the complete policy under the current reward, using a variant of value iteration (Ziebart, 2010). In high dimensional spaces, this becomes intractable, since this computation scales exponentially with the dimensionality of the state space.

In the following sections, we present an approximation to Equation 1 that admits efficient learning in high dimensional, continuous domains. In addition to breaking the exponential dependence on dimensionality, this approximation removes the requirement that the example trajectories be globally optimal, and only requires approximate local optimality. An example of a locally optimal but globally suboptimal trajectory is shown in Figure 1: although another path has a higher total reward, any local perturbation of the trajectory decreases the reward total.

¹ In the case of multiple example trajectories, their probabilities are simply multiplied together to obtain P(D).
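
To make the model concrete, here is a minimal numpy sketch (our own illustration, not code from the paper) that rolls a trajectory through a deterministic dynamics function and evaluates the unnormalized log-probability appearing in Equation 1. The partition function Z is deliberately omitted, since computing it is precisely the expensive step that the rest of the paper avoids; the `dynamics` and `reward` callables are hypothetical stand-ins for the task-specific F and r.

```python
import numpy as np

def trajectory_reward(x0, actions, dynamics, reward):
    """Sum of rewards along the deterministic rollout from x0 under actions u_1..u_T."""
    total, x = 0.0, x0
    for u in actions:
        x = dynamics(x, u)       # x_t = F(x_{t-1}, u_t)
        total += reward(x, u)    # accumulate r(x_t, u_t)
    return total

def unnormalized_log_prob(x0, actions, dynamics, reward):
    """Log of the numerator of Equation 1; the partition function Z is omitted."""
    return trajectory_reward(x0, actions, dynamics, reward)

# Toy check: a 2D point mass with a quadratic reward around the origin.
dyn = lambda x, u: x + u
rew = lambda x, u: -np.dot(x, x) - 0.1 * np.dot(u, u)
print(unnormalized_log_prob(np.array([2.0, 0.0]), [np.array([-0.5, 0.0])] * 4, dyn, rew))
```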


4. IOC with Locally Optimal Examples

To evaluate Equation 1 without computing the partition function Z, we apply the Laplace approximation, which locally models the distribution as a Gaussian (Tierney & Kadane, 1986). Note that this is not equivalent to modeling the reward function itself as a Gaussian, since Equation 1 uses the sum of the rewards along a path. In the context of IOC, this corresponds to assuming that the expert performs a local optimization when choosing the actions u, rather than global planning. This assumption is strictly less restrictive than the assumption of global optimality.

Using r(u) to denote the sum of rewards along the path (x_0, u), we can write Equation 1 as

P(u | x_0) = e^{r(u)} \left[ \int e^{r(\tilde{u})} d\tilde{u} \right]^{-1}.

We approximate this probability using a second order Taylor expansion of r around u:

r(\tilde{u}) \approx r(u) + (\tilde{u} - u)^T \frac{\partial r}{\partial u} + \frac{1}{2} (\tilde{u} - u)^T \frac{\partial^2 r}{\partial u^2} (\tilde{u} - u).

Denoting the gradient \frac{\partial r}{\partial u} as g and the Hessian \frac{\partial^2 r}{\partial u^2} as H, the approximation to Equation 1 is given by

P(u | x_0) \approx e^{r(u)} \left[ \int e^{r(u) + (\tilde{u} - u)^T g + \frac{1}{2} (\tilde{u} - u)^T H (\tilde{u} - u)} d\tilde{u} \right]^{-1}
            = \left[ \int e^{-\frac{1}{2} g^T H^{-1} g + \frac{1}{2} (H(\tilde{u} - u) + g)^T H^{-1} (H(\tilde{u} - u) + g)} d\tilde{u} \right]^{-1}
            = e^{\frac{1}{2} g^T H^{-1} g} \, |{-H}|^{\frac{1}{2}} \, (2\pi)^{-\frac{d_u}{2}},

from which we obtain the approximate log likelihood

L = \frac{1}{2} g^T H^{-1} g + \frac{1}{2} \log |{-H}| - \frac{d_u}{2} \log 2\pi.    (2)

Intuitively, this likelihood indicates that reward functions under which the example paths have small gradients and large negative Hessians are more likely. The magnitude of the gradient corresponds to how close the example is to a local peak in the (total) reward landscape, while the Hessian describes how steep this peak is. For a given parameterization of the reward, we can learn the most likely parameters by maximizing Equation 2. In the next section, we discuss how this objective and its gradients can be computed efficiently.
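
As a concrete illustration, the sketch below (ours, not the authors' implementation) evaluates Equation 2 directly from a precomputed gradient g and Hessian H of the total reward with respect to the actions; obtaining g and H efficiently is the subject of Section 5. It returns negative infinity when -H is not positive definite, the undefined-log-determinant case revisited in Section 6.1.

```python
import numpy as np

def approx_log_likelihood(g, H):
    """Laplace-approximation log likelihood of Equation 2.

    g: gradient of the total reward w.r.t. the stacked actions, shape (d_u,).
    H: Hessian of the total reward w.r.t. the stacked actions, shape (d_u, d_u).
    """
    d_u = g.shape[0]
    sign, logdet = np.linalg.slogdet(-H)
    if sign <= 0:
        # -H is not positive definite: the example lies in a valley, not on a peak.
        return -np.inf
    h = np.linalg.solve(H, g)        # h = H^{-1} g
    return 0.5 * g @ h + 0.5 * logdet - 0.5 * d_u * np.log(2.0 * np.pi)
```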

5. Efficient Likelihood Optimization

We can optimize Equation 2 directly with any optimization method. The computation is dominated by the linear system H^{-1} g, so the cost is cubic in the path length T and the action dimensionality. We will describe two approximate algorithms that evaluate the likelihood in time linear in T by linearizing the dynamics. This greatly speeds up the method on longer examples, though it should be noted that modern linear solvers are well optimized for symmetric matrices such as H, making it quite feasible to evaluate the likelihood without linearization for moderate length paths.

To derive the approximate linear-time solution to H^{-1} g, we first express g and H in terms of the derivatives of r with respect to x and u individually:

g = \hat{g} + J \tilde{g},  where  \hat{g} = \frac{\partial r}{\partial u},  \tilde{g} = \frac{\partial r}{\partial x},  J = \frac{\partial x}{\partial u}^T,

H = \hat{H} + J \tilde{H} J^T + \breve{H} \tilde{g},  where  \hat{H} = \frac{\partial^2 r}{\partial u^2},  \tilde{H} = \frac{\partial^2 r}{\partial x^2},  \breve{H} = \frac{\partial^2 x}{\partial u^2}.

Since r(x_t, u_t) depends only on the state and action at time t, \hat{H} and \tilde{H} are block diagonal, with T blocks. To build the Jacobian J, we differentiate the dynamics:

\frac{\partial F}{\partial u_t}(x_{t-1}, u_t) = \frac{\partial x_t}{\partial u_t} = B_t,    \frac{\partial F}{\partial x_{t-1}}(x_{t-1}, u_t) = \frac{\partial x_t}{\partial x_{t-1}} = A_t.

Future actions do not influence past states, so J is block upper triangular. Using the Markov property, we can express the nonzero blocks recursively:

J_{t_1, t_2} = \frac{\partial x_{t_2}}{\partial u_{t_1}}^T = \begin{cases} B_{t_1}^T, & t_1 = t_2 \\ J_{t_1, t_2 - 1} A_{t_2}^T, & t_2 > t_1 \end{cases}

We can now write g and H almost entirely in terms of matrices that are block diagonal or block triangular. Unfortunately, the final second order term \breve{H} does not exhibit such convenient structure. In particular, the Hessian of the last state x_T with respect to the actions u can be arbitrarily dense. We will therefore disregard this term. Since \breve{H} is zero only when the dynamics are linear, this corresponds to linearizing the dynamics.
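
For moderate path lengths, the decomposition above can be checked by materializing J explicitly. The following numpy sketch is our own illustration under the block convention written above (lists indexed so that A[t-1] = A_t and B[t-1] = B_t); it builds the dense J from the linearized dynamics and assembles g and the linearized H, dropping the second-order dynamics term as in the text.

```python
import numpy as np

def block_diag(blocks):
    """Stack equally sized square blocks into a block-diagonal matrix."""
    n, d = len(blocks), blocks[0].shape[0]
    out = np.zeros((n * d, n * d))
    for i, b in enumerate(blocks):
        out[i * d:(i + 1) * d, i * d:(i + 1) * d] = b
    return out

def assemble_g_H(A, B, g_hat, g_tilde, H_hat, H_tilde):
    """Dense assembly of g and H with linearized dynamics (the d^2x/du^2 term is dropped).

    A, B: length-T lists with A[t-1] = A_t (d_x x d_x) and B[t-1] = B_t (d_x x d_u);
          A[0] (= A_1) is never needed to build J.
    g_hat, g_tilde: per-step reward gradients w.r.t. u_t and x_t.
    H_hat, H_tilde: per-step reward Hessians w.r.t. u_t and x_t.
    """
    T, d_x, d_u = len(B), B[0].shape[0], B[0].shape[1]
    # J has blocks J_{t1,t2} = (dx_{t2}/du_{t1})^T, nonzero only for t2 >= t1.
    J = np.zeros((T * d_u, T * d_x))
    for t1 in range(T):
        blk = B[t1].T                                 # J_{t1,t1} = B_{t1}^T
        J[t1 * d_u:(t1 + 1) * d_u, t1 * d_x:(t1 + 1) * d_x] = blk
        for t2 in range(t1 + 1, T):
            blk = blk @ A[t2].T                       # J_{t1,t2} = J_{t1,t2-1} A_{t2}^T
            J[t1 * d_u:(t1 + 1) * d_u, t2 * d_x:(t2 + 1) * d_x] = blk
    g = np.concatenate(g_hat) + J @ np.concatenate(g_tilde)
    H = block_diag(H_hat) + J @ block_diag(H_tilde) @ J.T
    return g, H, J
```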

5.1. Direct Likelihood Evaluation

We first describe an approach for directly evaluating the likelihood under the assumption that \breve{H} is zero. We first exploit the structure of J to evaluate J\tilde{g} in time linear in T, which is essential for computing g. This requires a simple recursion from t = T to 1:

[J\tilde{g}]_t \leftarrow B_t^T (\tilde{g}_t + z_g^{(t)}),    z_g^{(t-1)} \leftarrow A_t^T (\tilde{g}_t + z_g^{(t)}),    (3)

where z_g (initialized as z_g^{(T)} = 0) accumulates the product of \tilde{g} with the off-diagonal elements of J. The linear system h = H^{-1} g can be solved with a stylistically similar recursion. We first use the assumption that \breve{H} is zero to factor H:

H = (\hat{H} P + J \tilde{H}) J^T,

where P J^T = I. The nonzero blocks of P are

P_{t,t} = B_t^{\dagger},    P_{t,t-1} = -B_t^{\dagger} A_t,

and B_t^{\dagger} is a pseudoinverse of the potentially nonsquare matrix B_t. This linear system is solved in two passes: an upward pass to solve (\hat{H} P + J \tilde{H}) \bar{h} = g, and a downward pass to solve J^T h = \bar{h}. Each pass is a block generalization of forward or back substitution and, like the recursion in Equation 3, can exploit the structure of J to run in time linear in T. However, the upward pass must also handle the off-diagonal entries of P and the potentially nonsquare blocks of J, which are not invertible. We therefore only construct a partial solution on the upward pass, with each \bar{h}_t expressed in terms of \bar{h}_{t-1}. The final values are reconstructed on the downward pass, together with the solution h. The complete algorithm computes h = H^{-1} g and the determinant |{-H}| in time linear in T, and is included in Appendix A of the supplement.
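
A minimal sketch of the recursion in Equation 3, under the same indexing convention as the previous sketch; it computes J\tilde{g} with O(T) block operations and can be checked against the dense construction above. The two-pass solver for h = H^{-1} g and the determinant is more involved and is given in Appendix A of the supplement, so it is not reproduced here.

```python
import numpy as np

def Jg_tilde_linear_time(A, B, g_tilde):
    """Backward recursion of Equation 3: returns the blocks [J g_tilde]_t, t = 1..T.

    A, B: length-T lists with A[t-1] = A_t and B[t-1] = B_t.
    g_tilde: length-T list of per-step reward gradients w.r.t. x_t.
    """
    T = len(B)
    z = np.zeros(g_tilde[0].shape[0])          # z_g^{(T)} = 0
    Jg = [None] * T
    for t in range(T - 1, -1, -1):             # t = T, ..., 1
        s = g_tilde[t] + z
        Jg[t] = B[t].T @ s                     # [J g_tilde]_t = B_t^T (g_tilde_t + z^{(t)})
        z = A[t].T @ s                         # z^{(t-1)} = A_t^T (g_tilde_t + z^{(t)})
    return Jg
```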

For a given parameterization of the reward, we determine the most likely parameters by maximizing the likelihood with gradient-based optimization (LBFGS in our implementation). This requires the gradient of Equation 2. For a reward parameter θ, the gradient is

\frac{\partial L}{\partial \theta} = h^T \frac{\partial g}{\partial \theta} - \frac{1}{2} h^T \frac{\partial H}{\partial \theta} h + \frac{1}{2} \mathrm{tr}\left( H^{-1} \frac{\partial H}{\partial \theta} \right).

As before, the gradients of g and H can be expressed in terms of the derivatives of r at each time step, which allows us to rewrite the gradient as

\frac{\partial L}{\partial \theta} = \sum_{ti} \frac{\partial \hat{g}_{ti}}{\partial \theta} h_{ti} + \sum_{ti} \frac{\partial \tilde{g}_{ti}}{\partial \theta} [J^T h]_{ti}
    + \frac{1}{2} \sum_{tij} \frac{\partial \hat{H}_{tij}}{\partial \theta} \left( [H^{-1}]_{ttij} - h_{ti} h_{tj} \right)
    + \frac{1}{2} \sum_{tij} \frac{\partial \tilde{H}_{tij}}{\partial \theta} \left( [J^T H^{-1} J]_{ttij} - [J^T h]_{ti} [J^T h]_{tj} \right)
    + \frac{1}{2} \sum_{ti} \frac{\partial \tilde{g}_{ti}}{\partial \theta} \sum_{t_1 t_2 jk} \left( [H^{-1}]_{t_1 t_2 jk} - h_{t_1 j} h_{t_2 k} \right) \breve{H}_{t_1 t_2 jkti},    (4)

where [H^{-1}]_{ttij} denotes the ij-th entry in the block tt.

The last sum vanishes if \breve{H} is zero, so all quantities can be computed in time linear in T. The diagonal blocks of H^{-1} and J^T H^{-1} J can be computed while solving for h = H^{-1} g, as shown in Appendix A of the supplement. To find the gradients for any reward parameterization, we can compute the gradients of \hat{g}, \tilde{g}, \hat{H}, and \tilde{H}, and then apply the above equation.
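
For reference, the direct cubic-time form of the gradient (the expression preceding Equation 4) is straightforward to implement and is a useful correctness check for a linear-time implementation. A numpy sketch of our own, assuming the per-parameter derivatives of g and H are available as dense arrays:

```python
import numpy as np

def log_likelihood_grad(g, H, dg_dtheta, dH_dtheta):
    """Gradient of Equation 2 w.r.t. the reward parameters (direct, cubic-time form).

    g: (d,) gradient of the total reward w.r.t. the actions.
    H: (d, d) Hessian of the total reward w.r.t. the actions (negative definite).
    dg_dtheta: (p, d) array; row k holds dg/dtheta_k.
    dH_dtheta: (p, d, d) array; slice k holds dH/dtheta_k.
    """
    h = np.linalg.solve(H, g)                    # h = H^{-1} g
    Hinv = np.linalg.inv(H)
    grads = np.empty(len(dg_dtheta))
    for k, (dg, dH) in enumerate(zip(dg_dtheta, dH_dtheta)):
        grads[k] = h @ dg - 0.5 * h @ dH @ h + 0.5 * np.trace(Hinv @ dH)
    return grads
```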

5.2. LQR-Based Likelihood Evaluation

While the approximate likelihood in Equation 2 makes no assumptions about the dynamics of the MDP, the algorithm in Section 5.1 linearizes the dynamics around the examples. This matches the assumptions of the commonly studied linear-quadratic regulator (LQR) setting, and suggests an alternative derivation of the algorithm as IOC in a linear-quadratic system, with linearized dynamics given by A_t and B_t, quadratic reward matrices given by the diagonal blocks of \hat{H} and \tilde{H}, and linear reward vectors given by \hat{g} and \tilde{g}. A complete derivation of the resulting algorithm is presented in Appendix B of the supplement, and is similar to the MaxEnt LQR IOC algorithm described by Ziebart (Ziebart, 2010), with an additional recursion to compute the derivatives of the parameterized reward Hessians. Since the gradients are computed recursively, this method lacks the convenient form provided by Equation 4, but may be easier to implement.

6. Algorithms for IOC with Locally Optimal Examples

We can use the objective in Equation 2 to learn reward functions with a variety of representations. We present one variant that learns the reward as a linear combination of features, and a second variant that uses a Gaussian process to learn nonlinear reward functions.

6.1. Learning Linear Reward Functions

In the linear variant, the algorithm is provided with features f that depend on the state x_t and action u_t. The reward is given by r(x_t, u_t) = θ^T f(x_t, u_t), and the weights θ are learned. Letting \hat{g}^{(k)}, \tilde{g}^{(k)}, \hat{H}^{(k)}, and \tilde{H}^{(k)} denote the gradients and Hessians of each feature with respect to actions and states, the full gradients and Hessians are sums of these quantities, weighted by θ_k, e.g., \hat{g} = \sum_k \hat{g}^{(k)} θ_k. The gradient of \hat{g} with respect to θ_k is simply \hat{g}^{(k)}, and the gradients of the other matrices are given analogously. The likelihood gradient is then obtained from Equation 4.

When evaluating Equation 2, the log determinant of the negative Hessian is undefined when the determinant is not positive. This corresponds to the example path lying in a valley rather than on a peak of the energy landscape. A high-probability reward function will avoid such cases, but it is nontrivial to find an initial point for which the objective can be evaluated. We therefore add a dummy regularizer feature that ensures that the negative Hessian has a positive determinant. This feature has a gradient that is uniformly zero, and a Hessian equal to the negative identity.


The initial weight θ_r on this feature must be set such that the negative Hessians of all example paths are positive definite. We can find a suitable weight simply by doubling θ_r until this requirement is met. During the optimization, we must drive θ_r to zero in order to solve the original problem. In this way, θ_r has the role of a relaxation, allowing the algorithm to explore the parameter space without requiring the Hessian to always be negative definite. Unfortunately, driving θ_r to zero too quickly can create numerical instability, as the Hessians become ill-conditioned or singular. Rather than simply penalizing the regularizing weight, we found that we can maintain numerical stability and still obtain a solution with θ_r = 0 by using the Augmented Lagrangian method (Birgin & Martínez, 2009). This method solves a sequence of maximization problems that are augmented by a penalty term of the form

\phi^{(j)}(\theta_r) = -\frac{1}{2} \mu^{(j)} \theta_r^2 + \lambda^{(j)} \theta_r,

where µ^{(j)} is a penalty weight, and λ^{(j)} is an estimate of the Lagrange multiplier for the constraint θ_r = 0. After each optimization, µ^{(j+1)} is increased by a factor of 10 if θ_r has not decreased, and λ^{(j+1)} is set to

\lambda^{(j+1)} \leftarrow \lambda^{(j)} - \mu^{(j)} \theta_r.

This approach allows θ_r to decrease gradually with each optimization without using large penalty terms.
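
Schematically, the outer loop looks as follows. This is our own sketch, not the released implementation: `maximize_likelihood` is a placeholder for the inner LBFGS optimization of Equation 2 plus the penalty φ, `hessians_negative_definite` is a hypothetical check over all example paths, and the initial values of µ and λ are illustrative.

```python
def fit_with_regularizer(theta, theta_r, maximize_likelihood,
                         hessians_negative_definite, outer_iters=20, tol=1e-6):
    """Augmented Lagrangian handling of the dummy regularizer weight theta_r.

    maximize_likelihood(theta, theta_r, mu, lam) -> (theta, theta_r): placeholder for
        the inner optimization of Eq. 2 plus the penalty -0.5*mu*theta_r**2 + lam*theta_r.
    hessians_negative_definite(theta, theta_r) -> bool: checks every example path.
    """
    # Double theta_r until -H is positive definite for all example paths.
    while not hessians_negative_definite(theta, theta_r):
        theta_r *= 2.0
    mu, lam = 1.0, 0.0                     # illustrative initial values
    for _ in range(outer_iters):
        prev_theta_r = theta_r
        theta, theta_r = maximize_likelihood(theta, theta_r, mu, lam)
        if abs(theta_r) < tol:
            break                          # the constraint theta_r = 0 is met
        lam = lam - mu * theta_r           # multiplier estimate update
        if abs(theta_r) >= abs(prev_theta_r):
            mu *= 10.0                     # tighten the penalty if theta_r did not decrease
    return theta, theta_r
```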

6.2. Learning Nonlinear Reward Functions

In the nonlinear variant of our algorithm, we represent the reward function as a Gaussian process (GP) that maps from feature values to rewards, as proposed by Levine et al. (Levine et al., 2011). The inputs of the Gaussian process are a set of inducing feature points F = [f^1 ... f^n]^T, and the noiseless outputs y at these points are learned. The location of the inducing points can be chosen in a variety of ways, but we follow Levine et al. and choose the points that lie on the example paths, which concentrates the learning on the regions where the examples are most informative. In addition to the outputs y, we also learn the hyperparameters λ and β that describe the GP kernel function, given by

k(f^i, f^j) = \beta \exp\left( -\frac{1}{2} \sum_k \lambda_k \left[ (f^i_k - f^j_k)^2 + 1_{i \neq j} \sigma^2 \right] \right).

This kernel is a variant of the radial basis function kernel, regularized by input noise σ² (since the outputs are noiseless). The GP covariance is then defined as K_{ij} = k(f^i, f^j), producing the following GP likelihood:

\log P(y, \lambda, \beta | F) = -\frac{1}{2} y^T K^{-1} y - \frac{1}{2} \log |K| + \log P(\lambda, \beta | F).

The last term in the likelihood is the prior on the hyperparameters. This prior encourages the feature weights λ to be sparse, and prevents degeneracies that occur as y → 0. The latter is accomplished with a prior that encodes the belief that no two inducing points are deterministically dependent, as captured by their partial correlation:

\log P(\lambda, \beta | F) = -\frac{1}{2} \mathrm{tr}\left( K^{-2} \right) - \sum_k \log(\lambda_k + 1).

This prior is discussed in more detail in previous work (Levine et al., 2011). The reward at a feature point f(x_t, u_t) is given by the GP posterior mean, and can be augmented with a set of linear features f_ℓ:

r(x_t, u_t) = k_t \alpha + \theta^T f_\ell(x_t, u_t),

where α = K^{-1} y, and k_t is a row vector corresponding to the covariance between f(x_t, u_t) and each inducing point f^i, given by k_{ti} = k(f(x_t, u_t), f^i). The exact log likelihood, before the proposed approximation, is obtained by using the GP likelihood as a prior on the IOC likelihood in Equation 1:

\log P(u | x_0) = \sum_t r(x_t, u_t) - \log Z + \log P(y, \lambda, \beta | F).

Only the IOC likelihood is altered by the proposed approximation. The gradient and Hessian of the reward with respect to the states are

\tilde{g}_t = \frac{\partial k_t}{\partial x_t} \alpha + \tilde{g}^{(\ell)}_t    and    \tilde{H}_t = \frac{\partial^2 k_t}{\partial x_t^2} \alpha + \tilde{H}^{(\ell)}_t,

and the kernel derivatives are given by

\frac{\partial k_t}{\partial x_t} = \sum_k \frac{\partial k_t}{\partial f_{tk}} \tilde{g}^{(k)}_t,    \frac{\partial^2 k_t}{\partial x_t^2} = \sum_k \frac{\partial k_t}{\partial f_{tk}} \tilde{H}^{(k)}_t + \sum_{k_1, k_2} \frac{\partial^2 k_t}{\partial f_{tk_1} \partial f_{tk_2}} \tilde{g}^{(k_1)}_t \tilde{g}^{(k_2) T}_t.

The feature derivatives \tilde{g}^{(k)} and \tilde{H}^{(k)} are defined in the previous section, and \hat{g} and \hat{H} are given analogously. Using these quantities, the likelihood can be computed as described in Section 5. The likelihood gradients are derived in Appendix C of the supplement.

This algorithm can learn more expressive rewards in domains where a linear reward basis is not known, but with the usual bias and variance tradeoff that comes with increased model complexity. As shown in our evaluation, the linear method requires fewer examples when a linear basis is available, while the nonlinear variant can work with much less expressive features.


Figure 2. Robot arm rewards learned by our algorithms for a 4-link arm (panels: true reward, linear variant, nonlinear variant). One of eight examples is shown.

7. Evaluation

We evaluate our method on simulated robot arm control, planar navigation, and simulated driving. In the robot arm task, the expert sets continuous torques on each joint of an n-link planar robot arm. The reward depends on the position of the end-effector. Each link has an angle and a velocity, producing a state space with 2n dimensions. By changing the number of links, we can vary the dimensionality of the task. An example of a 4-link arm is shown in Figure 2. The complexity of this task makes it difficult to compare with prior work, so we also include a simple planar navigation task, in which the expert takes continuous steps on a plane, as shown in Figure 3. Finally, we use human-created examples on a simulated driving task, which shows how our method can learn complex policies from human demonstrations on a more realistic domain.

The reward function in the robot arm and navigation tasks has a Gaussian peak in the center, surrounded by four pits. The reward also penalizes each action with the square of its magnitude. The IOC algorithms are provided with a grid of 25 evenly spaced Gaussian features and the squared action magnitude. In the nonlinear test in Section 7.2, the Cartesian coordinates of the arm end-effector are provided instead of the grid.

We compare the linear and nonlinear variants of our method with the MaxEnt IRL and OptV algorithms (Ziebart et al., 2008; Dvijotham & Todorov, 2010). We present results for the linear time algorithm in Section 5.1, though we found that both the LQR variant and the direct, non-linearized approach produced similar results. MaxEnt used a grid discretization for both states and actions, while OptV used discretized actions and adapted the value function features as described by Dvijotham and Todorov. Since OptV cannot learn action-dependent rewards, it was provided with the true weight for the action penalty term.

Figure 3. Planar navigation rewards learned from 16 locally optimal examples (panels: true reward, linear, nonlinear, MaxEnt, OptV). Black lines show optimal paths for each reward originating from example initial states. The rewards learned by our algorithms better resemble the true reward than those learned by prior methods.

To evaluate a learned reward, we first compute the optimal paths with respect to this reward from 32 random initial states that are not part of the training set. We also find the paths that begin in the same initial states but are optimal with respect to the true reward. In both cases, we compute evaluation paths that are globally optimal, by first solving a discretization of the task with value iteration, and then finetuning the paths with continuous optimization. Once the evaluation paths are computed for both reward functions, we obtain a reward loss by subtracting the true reward along the learned reward's path from the true reward along the true optimal path. This loss is low when the learned reward induces the same policy as the true one, and high when the learned reward causes costly mistakes. Since the reward loss is measured entirely on globally optimal paths, it captures how well each algorithm learns the true, global reward, regardless of whether the examples are locally or globally optimal.
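
A sketch of this reward-loss metric, averaged over the held-out initial states; `optimal_path` and `true_total_reward` are hypothetical helpers standing in for the discretize-then-finetune planner and the ground-truth reward described above.

```python
def reward_loss(learned_reward_fn, true_reward_fn, optimal_path, true_total_reward,
                initial_states):
    """Average reward loss over held-out initial states.

    optimal_path(reward_fn, x0): globally optimal path for reward_fn from x0
        (stand-in for the discretize-then-finetune planner described above).
    true_total_reward(path): total ground-truth reward accumulated along a path.
    """
    total = 0.0
    for x0 in initial_states:
        path_true = optimal_path(true_reward_fn, x0)        # optimal under the true reward
        path_learned = optimal_path(learned_reward_fn, x0)  # optimal under the learned reward
        total += true_total_reward(path_true) - true_total_reward(path_learned)
    return total / len(initial_states)
```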

7.1. Locally Optimal Examples

To test how well each method handles locally optimal examples, we ran the navigation task with increasing numbers of examples that were either globally or locally optimal. As discussed above, globally optimal examples were obtained with discretization, while locally optimal examples were computed by optimizing the actions from a random initialization. Each test was repeated eight times with random initial states for each example.

The results in Figure 4 show that both variants of our algorithm converge to the correct policy. The linear variant requires fewer examples, since the features form a good linear basis for the true reward. MaxEnt assumes global optimality and does not converge to the correct policy when the examples are only locally optimal. It also suffers from discretization error. OptV has difficulty generalizing the reward to unseen parts of the state space, because the value function features do not impose meaningful structure on the reward.

Figure 4. Reward loss for each algorithm with either globally or locally optimal planar navigation examples (two panels, globally optimal examples and locally optimal examples; x-axis: number of examples, 4 to 64; y-axis: reward loss; curves: Linear, Nonlinear, MaxEnt, OptV). Prior methods do not converge to the expert's policy when the examples are not globally optimal.

Figure 5. Reward loss for each algorithm with the Gaussian grid and end-effector position features on the 2-link robot arm task (two panels, grid features and position features; x-axis: number of examples, 4 to 64; y-axis: reward loss; curves: Linear, Nonlinear, MaxEnt, OptV, with some curves off scale). Only the nonlinear variant of our method could learn the reward using only the position features.

7.2. Linear and Nonlinear Rewards

On the robot arm task, we evaluated each method with both the Gaussian grid features, and simple features that only provide the position of the end effector, and therefore do not form a linear basis for the true reward. The examples were globally optimal. The number of links was set to 2, resulting in a 4-dimensional state space. Only the nonlinear variant of our algorithm could successfully learn the reward from the simple features, as shown in Figure 5. Even with the grid features, which do form a linear basis for the reward, MaxEnt suffered greater discretization error due to the complex dynamics of this task, while OptV could not meaningfully generalize the reward due to the increased dimensionality of the task.

7.3. High Dimensional Tasks

To evaluate the effect of dimensionality, we increased the number of robot arm links. As shown in Figure 6, the processing time of our methods scaled gracefully with the dimensionality of the task, while the quality of the reward did not deteriorate appreciably. The processing time of OptV increased exponentially due to the action space discretization. The MaxEnt discretization was intractable with more than two links, and is therefore not shown.

Figure 6. Reward loss and processing time with increasing numbers of robot arm links n, corresponding to state spaces with 2n dimensions (two panels, robot arm reward loss and robot arm processing time; x-axis: number of links, 2 to 8; y-axes: reward loss and seconds; curves: Linear, Nonlinear, MaxEnt, OptV, with some curves off scale). Our methods efficiently learn good rewards even as the dimensionality is increased.

style       path      average speed   time behind   time in front
aggressive  learned   158.1 kph       3.5 sec       12.5 sec
aggressive  human     158.2 kph       3.5 sec       16.7 sec
evasive     learned   150.1 kph       7.2 sec       3.7 sec
evasive     human     149.5 kph       4.5 sec       2.8 sec
tailgater   learned   97.5 kph        111.0 sec     0.0 sec
tailgater   human     115.3 kph       99.5 sec      7.0 sec

Table 1. Statistics for sample paths for the learned driving rewards and the corresponding human demonstrations starting in the same initial states. The statistics of the learned paths closely resemble the holdout demonstrations.

7.4. Human Demonstrations

We evaluate how our method handles human demonstrations on a simulated driving task. Although driving policies have been learned by prior IOC methods (Abbeel & Ng, 2004; Levine et al., 2011), their discrete formulation required a discrete simulator where the agent makes simple decisions, such as choosing which lane to switch to. In contrast, our driving simulator is a fully continuous second order dynamical system. The actions correspond directly to the gas, brakes, and steering of the simulated car, and the state space includes position, orientation, and linear and angular velocities. Because of this, prior methods that rely on discretization cannot tractably handle this domain.

We used our nonlinear method to learn from sixteen 13-second examples of an aggressive driver that cuts off other cars, an evasive driver that drives fast but keeps plenty of clearance, and a tailgater who follows closely behind the other cars. The features are speed, deviation from lane centers, and Gaussians covering the front, back, and sides of the other cars on the road. Since there is no ground truth reward for these tasks, we cannot use the reward loss metric. We follow prior work and quantify how much the learned policy resembles the demonstration by using task-relevant statistics (Abbeel & Ng, 2004). We measure the average speed of sample paths for the learned reward and the amount of time they spend within two car-lengths behind and in front of other cars, and compare these statistics with those from an unobserved holdout set of user demonstrations that start in the same initial states. The results in Table 1 show that the statistics of the learned policies are similar to the holdout demonstrations.

Figure 7. Learned driving style rewards (panels: aggressive, evasive, tailgater; annotations mark the other cars, the passing reward, the proximity penalty, and the tailgating reward). The aggressive reward is high in front of other cars, the evasive one avoids them, and the tailgater reward is high behind the cars.


Plots of the learned rewards are shown in Figure 7. Videos of the optimal paths for the learned rewards can be downloaded from the project website, along with the supplementary appendices and source code: http://graphics.stanford.edu/projects/cioc.

8. Discussion and Future Work

We presented an IOC algorithm designed for continuous, high dimensional domains. Our method remains efficient in high dimensional domains by using a local approximation to the reward function likelihood. This approximation also removes the global optimality requirement for the expert's demonstrations, allowing the method to learn the reward from examples that are only locally optimal. Local optimality can be easier to demonstrate than global optimality, particularly in high dimensional domains. As shown in our evaluation, prior methods do not converge to the underlying reward function when the examples are only locally optimal, regardless of how many examples are provided.

Since our algorithm relies on the derivatives of the reward features to learn the reward function, we require the features to be differentiable with respect to the states and actions. These derivatives must be precomputed only once, so it is quite practical to use finite differences when analytic derivatives are unavailable, but features that exhibit discontinuities are still poorly suited for our method. Our current formulation also only considers deterministic, fixed-horizon control problems, and an extension to stochastic or infinite-horizon cases is an interesting avenue for future work.

In addition, although our method handles examples that lack global optimality, it does not make use of global optimality when it is present: the examples are always assumed to be only locally optimal. Prior methods that exploit global optimality can infer more information about the reward function from each example when the examples are indeed globally optimal.

An exciting avenue for future work is to apply this approach to other high dimensional, continuous problems that have previously been inaccessible for inverse optimal control methods. One challenge with such applications is to generalize and impose meaningful structure in high dimensional tasks without requiring detailed features or numerous examples. While the nonlinear variant of our algorithm takes one step in this direction, it still relies on features to generalize the reward to unseen regions of the state space. A more sophisticated way to construct meaningful, generalizable features would allow IOC to be easily applied to complex, high dimensional tasks.

Acknowledgements

We thank the anonymous reviewers for their constructive comments. Sergey Levine was supported by NSF Graduate Research Fellowship DGE-0645962.

References

Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of ICML, 2004.

Birgin, Ernesto G. and Martínez, José Mario. Practical augmented Lagrangian methods. In Encyclopedia of Optimization, pp. 3013–3023. 2009.

Boyd, S., El Ghaoui, L., Feron, E., and Balakrishnan, V. Linear Matrix Inequalities in System and Control Theory. SIAM, Philadelphia, PA, June 1994.

Dvijotham, Krishnamurthy and Todorov, Emanuel. Inverse optimal control with linearly-solvable MDPs. In Proceedings of ICML, 2010.

Levine, Sergey, Popovic, Zoran, and Koltun, Vladlen. Feature construction for inverse reinforcement learning. In Advances in Neural Information Processing Systems, 2010.

Levine, Sergey, Popovic, Zoran, and Koltun, Vladlen. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, 2011.

Ng, Andrew Y. and Russell, Stuart J. Algorithms for inverse reinforcement learning. In Proceedings of ICML, 2000.

Ratliff, Nathan, Bagnell, J. Andrew, and Zinkevich, Martin A. Maximum margin planning. In Proceedings of ICML, 2006.

Ratliff, Nathan, Silver, David, and Bagnell, J. Andrew. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, 2009.

Tierney, Luke and Kadane, Joseph B. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, 1986.

Ziebart, Brian D. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.

Ziebart, Brian D., Maas, Andrew, Bagnell, J. Andrew, and Dey, Anind K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008.