Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization

Chelsea Finn CBFINN@EECS.BERKELEY.EDU
Sergey Levine SVLEVINE@EECS.BERKELEY.EDU
Pieter Abbeel PABBEEL@EECS.BERKELEY.EDU

University of California, Berkeley, Berkeley, CA 94709 USA

Abstract

Reinforcement learning can acquire complex behaviors from high-level specifications. However, defining a cost function that can be optimized effectively and encodes the correct task is challenging in practice. We explore how inverse optimal control (IOC) can be used to learn behaviors from demonstrations, with applications to torque control of high-dimensional robotic systems. Our method addresses two key challenges in inverse optimal control: first, the need for informative features and effective regularization to impose structure on the cost, and second, the difficulty of learning the cost function under unknown dynamics for high-dimensional continuous systems. To address the former challenge, we present an algorithm capable of learning arbitrary nonlinear cost functions, such as neural networks, without meticulous feature engineering. To address the latter challenge, we formulate an efficient sample-based approximation for MaxEnt IOC. We evaluate our method on a series of simulated tasks and real-world robotic manipulation problems, demonstrating substantial improvement over prior methods both in terms of task complexity and sample efficiency.

1. Introduction

Reinforcement learning can be used to acquire complex behaviors from high-level specifications. However, defining a cost function that can be optimized effectively and encodes the correct task can be challenging in practice, and techniques like cost shaping are often used to solve complex real-world problems (Ng et al., 1999). Inverse optimal control (IOC) or inverse reinforcement learning (IRL) provide an avenue for addressing this challenge by learning a cost function directly from expert demonstrations, e.g. Ng et al. (2000); Abbeel & Ng (2004); Ziebart et al. (2008).

Figure 1. Right: Guided cost learning uses policy optimization to adaptively sample trajectories for estimating the IOC partition function. Bottom left: PR2 learning to gently place a dish in a plate rack.

However, designing an effective IOC algorithm for learning from demonstration is difficult for two reasons. First, IOC is fundamentally underdefined in that many costs induce the same behavior. Most practical algorithms therefore require carefully designed features to impose structure on the learned cost. Second, many standard IRL and IOC methods require solving the forward problem (finding an optimal policy given the current cost) in the inner loop of an iterative cost optimization. This makes them difficult to apply to complex, high-dimensional systems, where the forward problem is itself exceedingly difficult, particularly real-world robotic systems with unknown dynamics.

To address the challenge of representation, we propose to use expressive, nonlinear function approximators, such as neural networks, to represent the cost. This reduces the engineering burden required to deploy IOC methods, and makes it possible to learn cost functions for which expert intuition is insufficient for designing good features. Such expressive function approximators, however, can learn complex cost functions that lack the structure typically imposed by hand-designed features. To mitigate this challenge, we propose two regularization techniques for IOC, one which is general and one which is specific to episodic domains, such as robotic manipulation skills.

In order to learn cost functions for real-world robotic tasks, our method must be able to handle unknown dynamics and high-dimensional systems.

To that end, we propose a cost learning algorithm based on policy optimization with local linear models, building on prior work in reinforcement learning (Levine & Abbeel, 2014). In this approach, as illustrated in Figure 1, the cost function is learned in the inner loop of a policy search procedure, using samples collected for policy improvement to also update the cost function. The cost learning method itself is a nonlinear generalization of maximum entropy IOC (Ziebart et al., 2008), with samples used to approximate the partition function. In contrast to previous work that optimizes the policy in the inner loop of cost learning, our approach instead updates the cost in the inner loop of policy search, making it practical and efficient. One of the benefits of this approach is that we can couple learning the cost with learning the policy for that cost. For tasks that are too complex to acquire a good global cost function from a small number of demonstrations, our method can still recover effective behaviors by running our policy learning method and retaining the learned policy. We elaborate on this further in Section 4.4.

The main contribution of our work is an algorithm that learns nonlinear cost functions from user demonstrations, at the same time as learning a policy to perform the task. Since the policy optimization “guides” the cost toward good regions of the space, we call this method guided cost learning. Unlike prior methods, our algorithm can handle complex, nonlinear cost function representations and high-dimensional unknown dynamics, and can be used on real physical systems with a modest number of samples. Our evaluation demonstrates the performance of our method on a set of simulated benchmark tasks, showing that it outperforms previous methods. We also evaluate our method on two real-world tasks learned directly from human demonstrations. These tasks require using torque control and vision to perform a variety of robotic manipulation behaviors, without any hand-specified cost features.

2. Related Work

One of the basic challenges in inverse optimal control (IOC), also known as inverse reinforcement learning (IRL), is that finding a cost or reward under which a set of demonstrations is near-optimal is underdefined. Many different costs can explain a given set of demonstrations. Prior work has tackled this issue using maximum margin formulations (Abbeel & Ng, 2004; Ratliff et al., 2006), as well as probabilistic models that explain suboptimal behavior as noise (Ramachandran & Amir, 2007; Ziebart et al., 2008). We take the latter approach in this work, building on the maximum entropy IOC model (Ziebart, 2010). Although the probabilistic model mitigates some of the difficulties with IOC, there is still a great deal of ambiguity.

An important component of many prior methods is the inclusion of detailed features created using domain knowledge, which can be linearly combined into a cost, including: indicators for successful completion of the task for a robotic ball-in-cup game (Boularias et al., 2011), learning table tennis with features that include the distance of the ball to the opponent's elbow (Muelling et al., 2014), providing the goal position as a known constraint for robotic grasping (Doerr et al., 2015), and learning highway driving with indicators for collision and driving on the grass (Abbeel & Ng, 2004). While these features allow the user to impose structure on the cost, they substantially increase the engineering burden. Several methods have proposed to learn nonlinear costs using Gaussian processes (Levine et al., 2011) and boosting (Ratliff et al., 2007; 2009), but even these methods generally operate on features rather than raw states. We instead use rich, expressive function approximators, in the form of neural networks, to learn cost functions directly on raw state representations. While neural network cost representations have previously been proposed in the literature (Wulfmeier et al., 2015), they have only been applied to small, synthetic domains. Previous work has also suggested simple regularization methods for cost functions, based on minimizing ℓ1 or ℓ2 norms of the parameter vector (Ziebart, 2010; Kalakrishnan et al., 2013) or by using unlabeled trajectories (Audiffren et al., 2015). When using expressive function approximators in complex real-world tasks, we must design substantially more powerful regularization techniques to mitigate the underspecified nature of the problem, which we introduce in Section 5.

Another challenge in IOC is that, in order to determine the quality of a given cost function, we must solve some variant of the forward control problem to obtain the corresponding policy, and then compare this policy to the demonstrated actions. Most early IRL algorithms required solving an MDP in the inner loop of an iterative optimization (Ng et al., 2000; Abbeel & Ng, 2004; Ziebart et al., 2008). This requires perfect knowledge of the system dynamics and access to an efficient offline solver, neither of which is available in, for instance, complex robotic control domains. Several works have proposed to relax this requirement, for example by learning a value function instead of a cost (Todorov, 2006), solving an approximate local control problem (Levine & Koltun, 2012; Dragan & Srinivasa, 2012), generating a discrete graph of states (Byravan et al., 2015; Monfort et al., 2015), or only obtaining an optimal trajectory rather than a policy (Ratliff et al., 2006; 2009). However, these methods still require knowledge of the system dynamics. Given the size and complexity of the problems addressed in this work, solving the optimal control problem even approximately in the inner loop of the cost optimization is impractical. We show that good cost functions can be learned by instead learning the cost in the inner loop of a policy optimization.

Our inverse optimal control algorithm is most closely related to other previous sample-based methods based on the principle of maximum entropy, including relative entropy IRL (Boularias et al., 2011) and path integral IRL (Kalakrishnan et al., 2013), which can also handle unknown dynamics. However, unlike these prior methods, we adapt the sampling distribution using policy optimization. We demonstrate in our experiments that this adaptation is crucial for obtaining good results on complex robotic platforms, particularly when using complex, nonlinear cost functions.

To summarize, our proposed method is the first to combine several desirable features into a single, effective algorithm: it can handle unknown dynamics, which is crucial for real-world robotic tasks; it can deal with high-dimensional, complex systems, as in the case of real torque-controlled robotic arms; and it can learn complex, expressive cost functions, such as multilayer neural networks, which removes the requirement for meticulous hand-engineering of cost features. While some prior methods have shown good results with unknown dynamics on real robots (Boularias et al., 2011; Kalakrishnan et al., 2013) and some have proposed using nonlinear cost functions (Ratliff et al., 2006; 2009; Levine et al., 2011), to our knowledge no prior method has been demonstrated that can provide all of these benefits in the context of complex real-world tasks.

3. Preliminaries and Overview

We build on the probabilistic maximum entropy inverse optimal control framework (Ziebart et al., 2008). The demonstrated behavior is assumed to be the result of an expert acting stochastically and near-optimally with respect to an unknown cost function. Specifically, the model assumes that the expert samples the demonstrated trajectories {τi} from the distribution

p(τ) = (1/Z) exp(−cθ(τ)),    (1)

where τ = {x1, u1, . . . , xT, uT} is a trajectory sample, cθ(τ) = ∑_t cθ(xt, ut) is an unknown cost function parameterized by θ, and xt and ut are the state and action at time step t. Under this model, the expert is most likely to act optimally, and can generate suboptimal trajectories with a probability that decreases exponentially as the trajectories become more costly. The partition function Z is difficult to compute for large or continuous domains, and presents the main computational challenge in maximum entropy IOC. The first applications of this model computed Z exactly with dynamic programming (Ziebart et al., 2008). However, this is only practical in small, discrete domains. More recent methods have proposed to estimate Z by using the Laplace approximation (Levine & Koltun, 2012), value function approximation (Huang & Kitani, 2014), and samples (Boularias et al., 2011). As discussed in Section 4.1, we take the sample-based approach in this work, because it allows us to perform inverse optimal control without a known model of the system dynamics. This is especially important in robotic manipulation domains, where the robot might interact with a variety of objects with unknown physical properties.
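As a quick illustration of the model in Equation (1), when the trajectory space is finite the distribution reduces to a softmax over negative trajectory costs. The snippet below is a minimal sketch of this special case with three arbitrary cost values; in continuous domains Z is intractable and must be approximated as discussed above.

import torch

# Eq. (1) on a finite set of candidate trajectories: p(tau) = exp(-c(tau)) / Z.
# With only three candidates, Z is a simple sum and the model is a softmax
# over negative costs.
costs = torch.tensor([3.0, 1.0, 2.5])   # c_theta(tau) for three trajectories
probs = torch.softmax(-costs, dim=0)    # exact MaxEnt trajectory distribution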

To represent the cost function cθ(xt, ut), IOC or IRL methods typically use a linear combination of hand-crafted features, given by cθ(xt, ut) = θ⊤f(xt, ut) (Abbeel & Ng, 2004). This representation is difficult to apply to more complex domains, especially when the cost must be computed from raw sensory input. In this work, we explore the use of high-dimensional, expressive function approximators for representing cθ(xt, ut). As we discuss in Section 6, we use neural networks that operate directly on the robot's state, though other parameterizations could also be used with our method. Complex representations are generally considered to be poorly suited for IOC, since learning costs that associate the right element of the state with the goal of the task is already quite difficult even with simple linear representations. However, as we discuss in our evaluation, we found that such representations could be learned effectively by adaptively generating samples as part of a policy optimization procedure, as discussed in Section 4.2.

4. Guided Cost Learning

In this section, we describe the guided cost learning algorithm, which combines sample-based maximum entropy IOC with forward reinforcement learning using time-varying linear models. The central idea behind this method is to adapt the sampling distribution to match the maximum entropy cost distribution p(τ) = (1/Z) exp(−cθ(τ)), by directly optimizing a trajectory distribution with respect to the current cost cθ(τ) using a sample-efficient reinforcement learning algorithm. Samples generated on the physical system are used both to improve the policy and more accurately estimate the partition function Z. In this way, the reinforcement learning step acts to “guide” the sampling distribution toward regions where the samples are more useful for estimating the partition function. We will first describe how the IOC objective in Equation (1) can be estimated with samples, and then describe how reinforcement learning can adapt the sampling distribution.

4.1. Sample-Based Inverse Optimal Control

In the sample-based approach to maximum entropy IOC, the partition function Z = ∫ exp(−cθ(τ)) dτ is estimated with samples from a background distribution q(τ). Prior sample-based IOC methods use a linear representation of the cost function, which simplifies the corresponding cost learning problem (Boularias et al., 2011; Kalakrishnan et al., 2013). In this section, we instead derive a sample-based approximation for the IOC objective for a general nonlinear parameterization of the cost function.

The negative log-likelihood corresponding to the IOC model in Equation (1) is given by:

LIOC(θ) = (1/N) ∑_{τi ∈ Ddemo} cθ(τi) + log Z
        ≈ (1/N) ∑_{τi ∈ Ddemo} cθ(τi) + log (1/M) ∑_{τj ∈ Dsamp} exp(−cθ(τj)) / q(τj),

where Ddemo denotes the set of N demonstrated trajectories, Dsamp the set of M background samples, and q denotes the background distribution from which the trajectories τj were sampled. Prior methods have chosen q to be uniform (Boularias et al., 2011) or to lie in the vicinity of the demonstrations (Kalakrishnan et al., 2013). To compute the gradients of this objective with respect to the cost parameters θ, let wj = exp(−cθ(τj)) / q(τj) and Z = ∑_j wj. The gradient is then given by:

dLIOC/dθ = (1/N) ∑_{τi ∈ Ddemo} dcθ/dθ (τi) − (1/Z) ∑_{τj ∈ Dsamp} wj dcθ/dθ (τj)

When the cost is represented by a neural network or some other function approximator, this gradient can be computed efficiently by backpropagating −wj/Z for each trajectory τj ∈ Dsamp and 1/N for each trajectory τi ∈ Ddemo.
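The objective and gradient above are straightforward to compute with automatic differentiation: the log of the importance-sampled partition-function estimate can be evaluated with a log-sum-exp, and backpropagation then produces exactly the −wj/Z and 1/N terms. The sketch below assumes per-trajectory costs and background log-densities have already been computed; it is an illustration of the estimator, not the authors' implementation.

import math
import torch

def sample_based_ioc_loss(cost_demo, cost_samp, log_q_samp):
    """Estimate L_IOC(theta) from Section 4.1.

    cost_demo:  tensor of shape (N,) with c_theta(tau_i) for demonstrations
    cost_samp:  tensor of shape (M,) with c_theta(tau_j) for background samples
    log_q_samp: tensor of shape (M,) with log q(tau_j) for the samples
    """
    # log Z ~= log (1/M) sum_j exp(-c_theta(tau_j)) / q(tau_j)
    log_weights = -cost_samp - log_q_samp
    log_Z = torch.logsumexp(log_weights, dim=0) - math.log(cost_samp.shape[0])
    # Differentiating this loss reproduces the gradient given above:
    # -w_j / Z flows to each sample cost and 1/N to each demo cost.
    return cost_demo.mean() + log_Z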

4.2. Adaptive Sampling via Policy Optimization

The choice of background sample distribution q(τ) for estimating the objective LIOC is critical for successfully applying the sample-based IOC algorithm. The optimal importance sampling distribution for estimating the partition function ∫ exp(−cθ(τ)) dτ is q(τ) ∝ |exp(−cθ(τ))| = exp(−cθ(τ)). Designing a single background distribution q(τ) is therefore quite difficult when the cost cθ is unknown. Instead, we can adaptively refine q(τ) to generate more samples in those regions of the trajectory space that are good according to the current cost function cθ(τ). To this end, we interleave the IOC optimization, which attempts to find the cost function that maximizes the likelihood of the demonstrations, with a policy optimization procedure, which improves the trajectory distribution q(τ) with respect to the current cost.

Since one of the main advantages of the sample-based IOC approach is the ability to handle unknown dynamics, we must also choose a policy optimization procedure that can handle unknown dynamics. To this end, we adapt the method presented by Levine & Abbeel (2014), which performs policy optimization under unknown dynamics by iteratively fitting time-varying linear dynamics to samples from the current trajectory distribution q(τ), updating the trajectory distribution using a modified LQR backward pass, and generating more samples for the next iteration.

Algorithm 1 Guided cost learning
1: Initialize qk(τ) as either a random initial controller or from demonstrations
2: for iteration i = 1 to I do
3:   Generate samples Dtraj from qk(τ)
4:   Append samples: Dsamp ← Dsamp ∪ Dtraj
5:   Use Dsamp to update cost cθ using Algorithm 2
6:   Update qk(τ) using Dtraj and the method from (Levine & Abbeel, 2014) to obtain qk+1(τ)
7: end for
8: return optimized cost parameters θ and trajectory distribution q(τ)

The trajectory distributions generated by this method are Gaussian, and each iteration of the policy optimization procedure satisfies a KL-divergence constraint of the form DKL(q(τ) ‖ q̄(τ)) ≤ ε, where q̄ is the previous trajectory distribution, which prevents the policy from changing too rapidly (Bagnell & Schneider, 2003; Peters et al., 2010; Rawlik & Vijayakumar, 2013). This has the additional benefit of not overfitting to poor initial estimates of the cost function. With a small modification, we can use this algorithm to optimize a maximum entropy version of the objective, given by min_q Eq[cθ(τ)] − H(q(τ)), as discussed in prior work (Levine & Abbeel, 2014). This variant of the algorithm allows us to recover the trajectory distribution q(τ) ∝ exp(−cθ(τ)) at convergence (Ziebart, 2010), a good distribution for sampling. For completeness, this policy optimization procedure is summarized in Appendix A.

Our sample-based IOC algorithm with adaptive sampling is summarized in Algorithm 1. We call this method guided cost learning because policy optimization is used to guide sampling toward regions with lower cost. The algorithm consists of taking successive policy optimization steps, each of which generates samples Dtraj from the latest trajectory distribution qk(τ). After sampling, the cost function is updated using all samples collected thus far for the purpose of policy optimization. No additional background samples are required for this method. This procedure returns both a learned cost function cθ(xt, ut) and a trajectory distribution q(τ), which corresponds to a time-varying linear-Gaussian controller q(ut|xt). This controller can be used to execute the learned behavior.
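The overall structure of Algorithm 1 fits in a few lines of code. The sketch below is only a schematic of the outer loop; the callables rollout, update_cost, and policy_update are hypothetical placeholders standing in for sampling from the current controller, the cost update of Algorithm 2, and the KL-constrained policy optimization step of Levine & Abbeel (2014).

def guided_cost_learning(demos, cost_net, controller, rollout, update_cost,
                         policy_update, n_iterations=10, n_rollouts=5):
    """Schematic outer loop of guided cost learning (Algorithm 1)."""
    all_samples = []
    for _ in range(n_iterations):
        # Generate trajectory samples D_traj from the current controller q_k
        traj_samples = [rollout(controller) for _ in range(n_rollouts)]
        all_samples.extend(traj_samples)          # D_samp <- D_samp U D_traj
        # Update the cost using all samples collected so far (Algorithm 2)
        update_cost(cost_net, demos, all_samples)
        # Improve the controller under the current cost to obtain q_{k+1}
        controller = policy_update(controller, traj_samples, cost_net)
    return cost_net, controller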

4.3. Cost Optimization and Importance Weights

The IOC objective can be optimized using standard nonlinear optimization methods and the gradient dLIOC/dθ. Stochastic gradient methods are often preferred for high-dimensional function approximators, such as neural networks. Such methods are straightforward to apply to objectives that factorize over the training samples, but the partition function does not factorize trivially in this way. Nonetheless, we found that our objective could still be optimized with stochastic gradient methods by sampling a subset of the demonstrations and background samples at each iteration.

When the number of samples in the batch is small, we found it necessary to add the sampled demonstrations to the background sample set as well; without adding the demonstrations to the sample set, the objective can become unbounded, and frequently does in practice. The stochastic optimization procedure is presented in Algorithm 2, and is straightforward to implement with most neural network libraries based on backpropagation.

Estimating the partition function requires us to use importance sampling. Although prior work has suggested dropping the importance weights (Kalakrishnan et al., 2013; Aghasadeghi & Bretl, 2011), we show in Appendix B that this produces an inconsistent likelihood estimate and fails to recover good cost functions. Since our samples are drawn from multiple distributions, we compute a fusion distribution to evaluate the importance weights. Specifically, if we have samples from k distributions q1(τ), . . . , qk(τ), we can construct a consistent estimator of the expectation of a function f(τ) under a uniform distribution as

E[f(τ)] ≈ (1/M) ∑_{τj} [ (1/k) ∑_κ qκ(τj) ]⁻¹ f(τj).

Accordingly, the importance weights are zj = [ (1/k) ∑_κ qκ(τj) ]⁻¹, and the objective is now:

LIOC(θ) = (1/N) ∑_{τi ∈ Ddemo} cθ(τi) + log (1/M) ∑_{τj ∈ Dsamp} zj exp(−cθ(τj))

The distributions qκ underlying the background samples are obtained from the controller at each iteration k. We must also append the demonstrations to the samples in Algorithm 2, yet the distribution that generated the demonstrations is unknown. To estimate it, we assume the demonstrations come from a single Gaussian trajectory distribution and compute its empirical mean and variance. We found this approximation sufficiently accurate for estimating the importance weights of the demonstrations, as shown in Appendix B.
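In log space, the fusion-distribution weight zj above is the negative log of a uniform mixture of the per-iteration controller densities (plus the Gaussian fit to the demonstrations). The sketch below assumes the log-density of each trajectory under each of the k distributions has already been evaluated; it returns log zj so that zj exp(−cθ(τj)) can be combined inside a log-sum-exp without overflow.

import numpy as np

def fusion_log_importance_weights(log_qs):
    """Compute log z_j = -log[(1/k) sum_kappa q_kappa(tau_j)].

    log_qs: array of shape (k, M) holding log q_kappa(tau_j) for each of the
    k sampling distributions and each of the M trajectories.
    """
    k = log_qs.shape[0]
    # log of the mixture density (1/k) sum_kappa q_kappa(tau_j), done stably
    log_fusion = np.logaddexp.reduce(log_qs, axis=0) - np.log(k)
    return -log_fusion   # add to -c_theta(tau_j) before a log-sum-exp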

4.4. Learning Costs and Controllers

In contrast to many previous IOC and IRL methods, our approach can be used to learn a cost while simultaneously optimizing the policy for a new instance of the task not in the demos, such as a new position of a target cup for a pouring task, as shown in our experiments.

Algorithm 2 Nonlinear IOC with stochastic gradients
1: for iteration k = 1 to K do
2:   Sample demonstration batch D̂demo ⊂ Ddemo
3:   Sample background batch D̂samp ⊂ Dsamp
4:   Append demonstration batch to background batch: D̂samp ← D̂demo ∪ D̂samp
5:   Estimate dLIOC/dθ (θ) using D̂demo and D̂samp
6:   Update parameters θ using gradient dLIOC/dθ (θ)
7: end for
8: return optimized cost parameters θ
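Combining the pieces above, a single stochastic gradient step of Algorithm 2 evaluates the fusion-weighted objective on a minibatch with the demonstrations appended to the background samples. The sketch below assumes a cost_net that maps a batch of trajectory states of shape (batch, T, state_dim) to per-step costs of shape (batch, T); the interface and names are illustrative rather than the authors' implementation.

import math
import torch

def nonlinear_ioc_step(cost_net, demo_states, samp_states,
                       log_z_demo, log_z_samp, optimizer):
    """One stochastic gradient step of the nonlinear IOC objective.

    demo_states, samp_states: (batch, T, state_dim) trajectory minibatches
    log_z_demo, log_z_samp:   (batch,) log importance weights log z_j under
                              the fusion distribution
    """
    cost_demo = cost_net(demo_states).sum(dim=1)      # c_theta(tau_i)

    # Append the demonstration batch to the background batch (Algorithm 2)
    samp_states = torch.cat([samp_states, demo_states], dim=0)
    log_z = torch.cat([log_z_samp, log_z_demo], dim=0)
    cost_samp = cost_net(samp_states).sum(dim=1)      # c_theta(tau_j)

    # L_IOC = (1/N) sum_i c(tau_i) + log (1/M) sum_j z_j exp(-c(tau_j))
    log_Z = torch.logsumexp(log_z - cost_samp, dim=0) - math.log(len(cost_samp))
    loss = cost_demo.mean() + log_Z

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()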

Since the algorithm produces both a cost function cθ(xt, ut) and a controller q(ut|xt) that optimizes this cost on the new task instance, we can directly use this controller to execute the desired behavior. In this way, the method actually learns a policy from demonstration, using the additional knowledge that the demonstrations are near-optimal under some unknown cost function, similar to recent work on IOC by direct loss minimization (Doerr et al., 2015). The learned cost function cθ(xt, ut) can often also be used to optimize new policies for new instances of the task without additional cost learning. However, we found that on the most challenging tasks we tested, running policy learning with IOC in the loop for each new task instance typically succeeded more frequently than running IOC once and reusing the learned cost. We hypothesize that this is because training the policy on a new instance of the task provides the algorithm with additional information about task variation, thus producing a better cost function and reducing overfitting. The intuition behind this hypothesis is that the demonstrations only cover a small portion of the degrees of variation in the task. Observing samples from a new task instance provides the algorithm with a better idea of the particular factors that distinguish successful task executions from failures.

5. Representation and Regularization

We parametrize our cost functions as neural networks, expanding their expressive power and enabling IOC to be applied to the state of a robotic system directly, without hand-designed features. Our experiments in Section 6.2 confirm that an affine cost function is not expressive enough to learn some behaviors. Neural network parametrizations are particularly useful for learning visual representations on raw image pixels. In our experiments, we make use of the unsupervised visual feature learning method developed by Finn et al. (2016) to learn cost functions that depend on visual input. Learning cost functions on raw pixels is an interesting direction for future work, which we discuss in Section 7.

While the expressive power of nonlinear cost functions provides a range of benefits, it introduces significant model complexity to an already underspecified IOC objective. To mitigate this challenge, we propose two regularization methods for IOC. Prior methods regularize the IOC objective by penalizing the ℓ1 or ℓ2 norm of the cost parameters θ (Ziebart, 2010; Kalakrishnan et al., 2013). For high-dimensional nonlinear cost functions, this regularizer is often insufficient, since different entries in the parameter vector can have drastically different effects on the cost. We use two regularization terms. The first term encourages the cost of demo and sample trajectories to change locally at a constant rate (lcr), by penalizing the second time derivative:

glcr(τ) = ∑_{xt ∈ τ} [(cθ(xt+1) − cθ(xt)) − (cθ(xt) − cθ(xt−1))]²

This term reduces high-frequency variation that is often symptomatic of overfitting, making the cost easier to reoptimize. Although sharp changes in the cost slope are sometimes preferred, we found that temporally slow-changing costs were able to adequately capture all of the behaviors in our experiments.

The second regularizer is more specific to one-shot episodic tasks, and it encourages the cost of the states of a demo trajectory to decrease strictly monotonically in time using a squared hinge loss:

gmono(τ) = ∑_{xt ∈ τ} [max(0, cθ(xt) − cθ(xt−1) − 1)]²

The rationale behind this regularizer is that, for tasks that essentially consist of reaching a target condition or state, the demonstrations typically make monotonic progress toward the goal on some (potentially nonlinear) manifold. While this assumption does not always hold perfectly, we again found that this type of regularizer improved performance on the tasks in our evaluation. We show a detailed comparison with regard to both regularizers in Appendix E.
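Both regularizers act on the sequence of per-step costs along a trajectory and are easy to add to the IOC loss. The sketch below implements the two penalties exactly as written above for a 1-D tensor of per-step costs; how these terms are weighted against LIOC is a tuning choice not specified here.

import torch

def lcr_regularizer(costs):
    """Local constant rate penalty: squared second time-difference of the
    per-step costs c_theta(x_t) along one trajectory (shape (T,))."""
    second_diff = costs[2:] - 2.0 * costs[1:-1] + costs[:-2]
    return (second_diff ** 2).sum()

def mono_regularizer(costs):
    """Squared hinge penalty encouraging demo costs to decrease over time,
    matching g_mono(tau) = sum_t [max(0, c(x_t) - c(x_{t-1}) - 1)]^2."""
    hinge = torch.clamp(costs[1:] - costs[:-1] - 1.0, min=0.0)
    return (hinge ** 2).sum()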

6. Experimental Evaluation

We evaluated our sampling-based IOC algorithm on a set of robotic control tasks, both in simulation and on a real robotic platform. Each of the experiments involves complex second-order dynamics with force or torque control and no manually designed cost function features, with the raw state provided as input to the learned cost function.

We also tested the consistency of our algorithm on a toy point mass example for which the ground truth distribution is known. These experiments, discussed fully in Appendix B, show that using a maximum entropy version of the policy optimization objective (see Section 4.2) and using importance weights are both necessary for recovering the true distribution.

6.1. Simulated Comparisons

In this section, we provide simulated comparisons between guided cost learning and prior sample-based methods. We focus on task performance and sample complexity, and also perform comparisons across two different sampling distribution initializations and regularizations (in Appendix E).

To compare guided cost learning to prior methods, we ran experiments on three simulated tasks of varying difficulty, all using the MuJoCo physics simulator (Todorov et al., 2012). The first task is 2D navigation around obstacles, modeled on the task by Levine & Koltun (2012). This task has simple, linear dynamics and a low-dimensional state space, but a complex cost function, which we visualize in Figure 2.

[Figure 2 plots: 2D Navigation (true cost vs. samples, with the goal in green and obstacles in red), Reaching (distance vs. samples), and Peg Insertion (distance vs. samples), comparing PIIRL and RelEnt IRL with our method, each with demo and random initialization, plus a uniform-sampling baseline.]

Figure 2. Comparison to prior work on simulated 2D navigation, reaching, and peg insertion tasks. Reported performance is averaged over 4 runs of IOC on 4 different initial conditions. For peg insertion, the depth of the hole is 0.1 m, marked as a dashed line. Distances larger than this amount failed to insert the peg.

The second task involves a 3-link arm reaching towards a goal location in 2D, in the presence of physical obstacles. The third and most challenging task is 3D peg insertion with a 7 DOF arm. This task is significantly more difficult than tasks evaluated in prior IOC work, as it involves complex contact dynamics between the peg and the table and high-dimensional, continuous state and action spaces. The arm is controlled by selecting torques at the joint motors at 20 Hz. More details on the experimental setup are provided in Appendix D.

In addition to the expert demonstrations, prior methods require a set of “suboptimal” samples for estimating the partition function. We obtain these samples in one of two ways: by using a baseline random controller that randomly explores around the initial state (random), and by fitting a linear-Gaussian controller to the demonstrations (demo). The latter initialization typically produces a motion that tracks the average demonstration with variance proportional to the variation between demonstrated motions.

Between 20 and 32 demonstrations were generated from a policy learned using the method of Levine & Abbeel (2014), with a ground truth cost function determined by the agent's pose relative to the goal. We found that for the more precise peg insertion task, a relatively complex ground truth cost function was needed to afford the necessary degree of precision.

We used a cost function of the form wd² + v log(d² + α), where d is the distance between the two tips of the peg and their target positions, and v and α are constants. Note that the affine cost is incapable of exactly representing this function. We generated demonstration trajectories under several different starting conditions. For 2D navigation, we varied the initial position of the agent, and for peg insertion, we varied the position of the peg hole. We then evaluated the performance of our method and prior sample-based methods (Kalakrishnan et al., 2013; Boularias et al., 2011) on each task from four arbitrarily-chosen test states. We chose these prior methods because, to our knowledge, they are the only methods which can handle unknown dynamics.
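For concreteness, the ground truth cost used to generate the peg insertion demonstrations has the simple form shown below; the constant values in the sketch are placeholders, since the paper does not report the exact w, v, and α that were used.

import numpy as np

def ground_truth_peg_cost(d, w=1.0, v=1.0, alpha=1e-5):
    """Ground-truth cost w*d^2 + v*log(d^2 + alpha), where d is the distance
    between the peg tips and their targets. The log term sharpens the cost
    near d = 0, which an affine cost cannot represent exactly."""
    return w * d ** 2 + v * np.log(d ** 2 + alpha)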

We used a neural network cost function with two hidden layers of 24–52 units and rectifying nonlinearities of the form max(z, 0), followed by linear connections to a set of features yt, which had a size of 20 for the 2D navigation task and 100 for the other two tasks. The cost is then given by

cθ(xt, ut) = ‖Ayt + b‖² + wu‖ut‖²    (2)

with a fixed torque weight wu and the parameters consisting of A, b, and the network weights. These cost functions range from about 3,000 parameters for the 2D navigation task to 16,000 parameters for peg insertion. For further details, see Appendix C. Although the prior methods learn only linear cost functions, we can extend them to the nonlinear setting following the derivation in Section 4.1.
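A cost of the form in Equation (2) is straightforward to express as a small network module: a ReLU feature encoder fθ, a learned affine map (A, b) whose output is squared and summed, and a fixed torque penalty. The sketch below is an illustrative PyTorch rendering with assumed layer sizes and torque weight; it is not the authors' implementation.

import torch
import torch.nn as nn

class QuadraticFeatureCost(nn.Module):
    """c_theta(x, u) = ||A y + b||^2 + w_u ||u||^2 with y = f_theta(x)."""

    def __init__(self, state_dim, hidden_dim=52, feat_dim=100, wu=1e-3):
        super().__init__()
        self.features = nn.Sequential(          # f_theta: two ReLU hidden layers
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )
        self.affine = nn.Linear(feat_dim, feat_dim)   # learned A and b
        self.wu = wu                                  # fixed torque weight

    def forward(self, x, u):
        y = self.features(x)
        return (self.affine(y) ** 2).sum(dim=-1) + self.wu * (u ** 2).sum(dim=-1)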

Figure 2 illustrates the tasks and shows results for each method after different numbers of samples from the test condition. In our method, five samples were used at each iteration of policy optimization, while for the prior methods, the number of samples corresponds to the number of “suboptimal” samples provided for cost learning. For the prior methods, additional samples were used to optimize the learned cost. The results indicate that our method is generally capable of learning tasks that are more complex than the prior methods can handle, and is able to effectively handle complex, high-dimensional neural network cost functions. In particular, adding more samples for the prior methods generally does not improve their performance, because all of the samples are drawn from the same distribution.

6.2. Real-World Robotic Control

We also evaluated our method on a set of real robotic manipulation tasks using the PR2 robot, with comparisons to relative entropy IRL, which we found to be the better of the two prior methods in our simulated experiments. We chose two robotic manipulation tasks which involve complex dynamics and interactions with delicate objects, for which it is challenging to write down a cost function by hand. For all methods, we used a two-layer neural network cost parametrization and the regularization objective described in Section 5, and compared to an affine cost function on one task to evaluate the importance of nonlinear cost representations.

Figure 3. Dish placement and pouring tasks. The robot learned to place the plate gently into the correct slot, and to pour almonds, localizing the target cup using unsupervised visual features. A video of the learned controllers can be found at http://rll.berkeley.edu/gcl

The affine cost followed the form of Equation (2), but with yt equal to the input xt.¹ For both tasks, between 25 and 30 human demonstrations were provided via kinesthetic teaching, and each IOC algorithm was initialized by automatically fitting a controller to the demonstrations that roughly tracked the average trajectory. Full details on both tasks are in Appendix D, and summaries are below.

In the first task, illustrated in Figure 3, the robot must gently place a grasped plate into a specific slot of a dish rack. The state space consists of the joint angles, the pose of the gripper relative to the target pose, and the time derivatives of each; the actions correspond to torques on the robot's motors; and the input to the cost function is the pose and velocity of the gripper relative to the target position. Note that we do not provide the robot with an existing trajectory tracking controller or any manually-designed policy representation beyond linear-Gaussian controllers, in contrast to prior methods that use trajectory following (Kalakrishnan et al., 2013) or dynamic movement primitives with features (Boularias et al., 2011). Our attempt to design a hand-crafted cost function for inserting the plate into the dish rack produced a fast but overly aggressive behavior that cracked one of the plates during learning.

The second task, also shown in Figure 3, consisted of pouring almonds from one cup to another. In order to succeed, the robot must keep the cup upright until reaching the target cup, then rotate the cup so that the almonds are poured. Instead of including the position of the target cup in the state space, we train autoencoder features from camera images captured from the demonstrations and add a pruned feature point representation and its time derivative to the state, as proposed by Finn et al. (2016). The input to the cost function includes these visual features, as well as the pose and velocity of the gripper.

¹Note that a cost function that is quadratic in the state is linear in the coefficients of the monomials, and therefore corresponds to a linear parameterization.

                    RelEnt IRL   GCL q(ut|xt)   GCL reopt.
dish (NN)
  success rate      0%           100%           100%
  # samples         100          90             90
pouring (NN)
  success rate      10%          84.7%          34%
  # samples         150,150      75,130         75,130
pouring (affine)
  success rate      0%           0%             –
  # samples         150          120            –

Table 1. Performance of guided cost learning (GCL) and relative entropy (RelEnt) IRL on placing a dish into a rack and pouring almonds into a cup. Sample counts are for IOC, omitting those for optimizing the learned cost. An affine cost is insufficient for representing the pouring task, thus motivating the use of a neural network cost (NN). The pouring task with a neural network cost is evaluated for two positions of the target cup; average performance is reported.

Note that the position of the target cup can only be obtained from the visual features, so the algorithm must learn to use them in the cost function in order to succeed at the task.

The results, presented in Table 1, show that our algorithm successfully learned both tasks. The prior relative entropy IRL algorithm could not acquire a suitable cost function, due to the complexity of this domain. On the pouring task, where we also evaluated a simpler affine cost function, we found that only the neural network representation could recover a successful behavior, illustrating the need for rich and expressive function approximators when learning cost functions directly on raw state representations.²

The results in Table 1 also evaluate the generalizability of the cost function learned by our method and prior work. On the dish rack task, we can use the learned cost to optimize new policies for different target dish positions successfully, while the prior method does not produce a generalizable cost function. On the harder pouring task, we found that the learned cost succeeded less often on new positions. However, as discussed in Section 4.4, our method produces both a policy q(ut|xt) and a cost function cθ when trained on a novel instance of the task, and although the learned cost functions for this task were worse, the learned policy succeeded on the test positions when optimized with IOC in the inner loop using our algorithm. This indicates an interesting property of our approach: although the learned cost function is local in nature due to the choice of sampling distribution, the learned policy tends to succeed even when the cost function is too local to produce good results in very different situations. An interesting avenue for future work is to further explore the implications of this property, and to improve the generalizability of the learned cost by successively training policies on different novel instances of the task until enough global training data is available to produce a cost function that is a good fit to the demonstrations in previously unseen parts of the state space.

²We did attempt to learn costs directly on image pixels, but found that the problem was too underdetermined to succeed. Better image-specific regularization is likely required for this.

7. Discussion and Future Work

We presented an inverse optimal control algorithm that can learn complex, nonlinear cost representations, such as neural networks, and can be applied to high-dimensional systems with unknown dynamics. Our algorithm uses a sample-based approximation of the maximum entropy IOC objective, with samples generated from a policy learning algorithm based on local linear models (Levine & Abbeel, 2014). To our knowledge, this approach is the first to combine the benefits of sample-based IOC under unknown dynamics with nonlinear cost representations that directly use the raw state of the system, without the need for manual feature engineering. This allows us to apply our method to a variety of real-world robotic manipulation tasks. Our evaluation demonstrates that our method outperforms prior IOC algorithms on a set of simulated benchmarks, and achieves good results on several real-world tasks.

Our evaluation shows that our approach can learn good cost functions for a variety of simulated tasks. For complex robotic motion skills, the learned cost functions tend to explain the demonstrations only locally. This makes them difficult to reoptimize from scratch for new conditions. It should be noted that this challenge is not unique to our method: in our comparisons, no prior sample-based method was able to learn good global costs for these tasks. However, since our method interleaves cost optimization with policy learning, it still recovers successful policies for these tasks. For this reason, we can still learn from demonstration simply by retaining the learned policy and discarding the cost function. This allows us to tackle substantially more challenging tasks that involve direct torque control of real robotic systems with feedback from vision.

To incorporate vision into our experiments, we used unsupervised learning to acquire a vision-based state representation, following prior work (Finn et al., 2016). An exciting avenue for future work is to extend our approach to learn cost functions directly from natural images. The principal challenge for this extension is to avoid overfitting when using substantially larger and more expressive networks. Our current regularization techniques mitigate overfitting to a high degree, but visual inputs tend to vary dramatically between demonstrations and on-policy samples, particularly when the demonstrations are provided by a human via kinesthetic teaching. One promising avenue for mitigating these challenges is to introduce regularization methods developed for domain adaptation in computer vision (Tzeng et al., 2015), to encode the prior knowledge that demonstrations have similar visual features to samples.

Acknowledgements

This research was funded in part by ONR through a Young Investigator Program award, the Army Research Office through the MAST program, and an NSF fellowship. We thank Anca Dragan for thoughtful discussions.

References

Abbeel, P. and Ng, A. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004.

Aghasadeghi, N. and Bretl, T. Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals. In International Conference on Intelligent Robots and Systems (IROS), 2011.

Audiffren, J., Valko, M., Lazaric, A., and Ghavamzadeh, M. Maximum entropy semi-supervised inverse reinforcement learning. In International Joint Conference on Artificial Intelligence (IJCAI), July 2015.

Bagnell, J. A. and Schneider, J. Covariant policy search. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.

Boularias, A., Kober, J., and Peters, J. Relative entropy inverse reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Byravan, A., Monfort, M., Ziebart, B., Boots, B., and Fox, D. Graph-based inverse optimal control for robot manipulation. In International Joint Conference on Artificial Intelligence (IJCAI), 2015.

Doerr, A., Ratliff, N., Bohg, J., Toussaint, M., and Schaal, S. Direct loss minimization inverse optimal control. In Proceedings of Robotics: Science and Systems (R:SS), Rome, Italy, July 2015.

Dragan, A. and Srinivasa, S. Formalizing assistive teleoperation. In Proceedings of Robotics: Science and Systems (R:SS), Sydney, Australia, July 2012.

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In International Conference on Robotics and Automation (ICRA), 2016.

Huang, D. and Kitani, K. Action-reaction: Forecasting the dynamics of human interaction. In European Conference on Computer Vision (ECCV), 2014.

Kalakrishnan, M., Pastor, P., Righetti, L., and Schaal, S. Learning objective functions for manipulation. In International Conference on Robotics and Automation (ICRA), 2013.

Levine, S. and Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), 2014.

Levine, S. and Koltun, V. Continuous inverse optimal control with locally optimal examples. In International Conference on Machine Learning (ICML), 2012.

Levine, S., Popovic, Z., and Koltun, V. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems (NIPS), 2011.

Levine, S., Wagener, N., and Abbeel, P. Learning contact-rich manipulation skills with guided policy search. In International Conference on Robotics and Automation (ICRA), 2015.

Monfort, M., Lake, B. M., Ziebart, B., Lucey, P., and Tenenbaum, J. Softstar: Heuristic-guided probabilistic inference. In Advances in Neural Information Processing Systems (NIPS), pp. 2746–2754, 2015.

Muelling, K., Boularias, A., Mohler, B., Scholkopf, B., and Peters, J. Learning strategies in table tennis using inverse reinforcement learning. Biological Cybernetics, 108(5), 2014.

Ng, A., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), 1999.

Ng, A., Russell, S., et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.

Peters, J., Mulling, K., and Altun, Y. Relative entropy policy search. In AAAI Conference on Artificial Intelligence, 2010.

Ramachandran, D. and Amir, E. Bayesian inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, volume 51, 2007.

Ratliff, N., Bagnell, J. A., and Zinkevich, M. A. Maximum margin planning. In International Conference on Machine Learning (ICML), 2006.

Ratliff, N., Bradley, D., Bagnell, J. A., and Chestnutt, J. Boosting structured prediction for imitation learning. 2007.

Ratliff, N., Silver, D., and Bagnell, J. A. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1), 2009.

Rawlik, K. and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. Robotics, 2013.

Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems (NIPS), 2006.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.

Tzeng, E., Hoffman, J., Darrell, T., and Saenko, K. Simultaneous deep transfer across domains and tasks. In International Conference on Computer Vision (ICCV), 2015.

Wulfmeier, M., Ondruska, P., and Posner, I. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.

Ziebart, B. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.

Ziebart, B., Maas, A., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008.

A. Policy Optimization under Unknown Dynamics

The policy optimization procedure employed in this work follows the method described by Levine and Abbeel (Levine & Abbeel, 2014), which we summarize in this appendix. The aim is to optimize Gaussian trajectory distributions q(τ) = q(x1) ∏_t q(xt+1|xt, ut) q(ut|xt) with respect to their expected cost Eq(τ)[cθ(τ)]. This optimization can be performed by iteratively optimizing Eq(τ)[cθ(τ)] with respect to the linear-Gaussian conditionals q(ut|xt) under a linear-Gaussian estimate of the dynamics q(xt+1|xt, ut). This optimization can be performed using the standard linear-quadratic regulator (LQR). However, when the dynamics of the system are not known, the linearization q(xt+1|xt, ut) cannot be obtained directly. Instead, Levine and Abbeel (Levine & Abbeel, 2014) propose to estimate the local linear-Gaussian dynamics q(xt+1|xt, ut) using samples from q(τ), which can be obtained by running the linear-Gaussian controller q(ut|xt) on the physical system. The policy optimization procedure then consists of iteratively generating samples from q(ut|xt), fitting q(xt+1|xt, ut) to these samples, and updating q(ut|xt) under these fitted dynamics by using LQR.

This policy optimization procedure has several important nuances. First, the LQR update can move the controller q(ut|xt) arbitrarily far from the previous controller. However, because the real dynamics are not linear, this new controller might experience very different dynamics on the physical system than the linear-Gaussian dynamics q(xt+1|xt, ut) used for the update. To limit the change in the dynamics under the current controller, Levine and Abbeel (Levine & Abbeel, 2014) propose solving a modified, constrained problem for updating q(ut|xt), given as follows:

min_{q ∈ N} Eq[cθ(τ)]  s.t.  DKL(q ‖ q̄) ≤ ε,

where q̄ is the previous controller. This constrained problem finds a new trajectory distribution q(τ) that is close to the previous distribution q̄(τ), so that the dynamics violation is not too severe. The step size ε can be chosen adaptively based on the degree to which the linear-Gaussian dynamics are successful in predicting the current cost (Levine et al., 2015). Note that when policy optimization is interleaved with IOC, special care must be taken when adapting this step size. We found that an effective strategy was to use the step size rule described in prior work (Levine et al., 2015). This update involves comparing the predicted and actual improvement in the cost. We used the preceding cost function to measure both improvements, since this cost was used to make the update.

The second nuance in this procedure is in the scheme used to estimate the dynamics q(xt+1|xt, ut).

[Figure 4 values: KL divergence of 0 for the true distribution, 230.66 with ground truth demo importance weights, 272.71 with empirically estimated demo importance weights, 726.28 without MaxEnt trajectory optimization, and 9145.35 without importance weights.]

Figure 4. KL divergence between trajectories produced by our method, and various ablations, to the true distribution. Guided cost learning recovers trajectories that come close to both the mean and variance of the true distribution using 40 demonstrated trajectories, whereas the algorithm without MaxEnt policy optimization or without importance weights recovers the mean but not the variance.

Since these dynamics are linear-Gaussian, they can be estimated by solving a separate linear regression problem at each time step, using the samples gathered at this iteration. The sample complexity of this procedure scales linearly with the dimensionality of the system. However, this sample complexity can be reduced dramatically if we consider the fact that the dynamics at nearby time steps are strongly correlated, even across iterations (due to the previously mentioned KL-divergence constraint). This property can be exploited by fitting a crude global model to all of the samples gathered during the policy optimization procedure, and then using this global model as a prior for the linear regression. A good choice for this global model is a Gaussian mixture model (GMM) over tuples of the form (xt, ut, xt+1), as discussed in prior work (Levine & Abbeel, 2014). This GMM is refitted at each iteration using all available interaction data, and acts as a prior when fitting the time-varying linear-Gaussian dynamics q(xt+1|xt, ut).
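At its core, the per-time-step dynamics fit is an ordinary linear regression from (xt, ut) to xt+1 with a Gaussian residual. The sketch below shows this basic step with a small ridge term standing in for the GMM prior described above, which is omitted for brevity; the array shapes and regularization constant are illustrative assumptions.

import numpy as np

def fit_linear_gaussian_dynamics(X, U, X_next, reg=1e-6):
    """Fit x_{t+1} ~= F [x; u] + f + noise at a single time step.

    X, U, X_next: arrays of shape (N, dx), (N, du), (N, dx) holding the
    state, action, and next state of N rollouts at this time step.
    Returns (F, f, cov) of the time-varying linear-Gaussian dynamics.
    """
    XU = np.concatenate([X, U, np.ones((X.shape[0], 1))], axis=1)
    # Ridge-regularized least squares; a GMM prior would modify these moments
    A = XU.T @ XU + reg * np.eye(XU.shape[1])
    W = np.linalg.solve(A, XU.T @ X_next)        # shape (dx+du+1, dx)
    F, f = W[:-1].T, W[-1]
    residual = X_next - XU @ W
    cov = residual.T @ residual / X.shape[0]     # Gaussian noise covariance
    return F, f, cov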

B. Consistency Evaluation

We evaluated the consistency of our algorithm by generating 40 demonstrations from 4 known linear-Gaussian trajectory distributions of a second-order point mass system, each traveling to the origin from different starting positions. The purpose of this experiment is to verify that, in simple domains where the exact cost function can be learned, our method is able to recover the true cost function successfully. To do so, we measured the KL divergence between the trajectories produced by our method and the true distribution underlying the set of demonstrations. As shown in Figure 4, the trajectory distribution produced by guided cost learning with ground truth demo importance weights (weights based on the true distribution from which the demonstrations were sampled, which is generally unknown) comes very close to the true distribution, with a KL divergence of 230.66 summed over 100 timesteps.

Empirically estimating the importance weights of the demos produces trajectories with a slightly higher KL divergence of 272.71, costing us very little in this domain. Dropping the demo and sample importance weights entirely recovers a similar mean, but significantly overestimates the variance. Finally, running the algorithm without a maximum entropy term in the policy optimization objective (see Section 4.2) produces trajectories with a similar mean, but zero variance. These results indicate that correctly incorporating importance weights into sample-based maximum entropy IOC is crucial for recovering the right cost function. This contrasts with prior work, which suggests dropping the importance weights (Kalakrishnan et al., 2013).

C. Neural Network Parametrization and Initialization

We use expressive neural network function approximators to represent the cost, using the form:

cθ(xt, ut) = ‖Ayt + b‖² + wu‖ut‖²

This parametrization can be viewed as a cost that is quadratic in a set of learned nonlinear features yt = fθ(xt), where fθ is a multilayer neural network with rectifying nonlinearities of the form max(z, 0). Since simpler cost functions are generally preferred, we initialize fθ to be the identity function by setting the parameters of the first fully-connected layer to contain the identity matrix and the negative identity matrix (producing hidden units which are double the dimension of the input), and all subsequent layers to the identity matrix. We found that this initialization improved generalization of the learned cost.
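Concretely, stacking I and −I in the first layer means the post-ReLU units [max(x, 0); max(−x, 0)] still carry x exactly, so identity-initialized subsequent layers leave the features equal to a simple recoding of the state at the start of training. The sketch below builds such a feature network with assumed layer counts and sizes; it illustrates the initialization scheme rather than reproducing the authors' code.

import torch
import torch.nn as nn

def identity_initialized_features(state_dim, n_hidden_layers=2):
    """Build f_theta with the identity-style initialization described above."""
    dim = 2 * state_dim
    layers = [nn.Linear(state_dim, dim), nn.ReLU()]
    for _ in range(n_hidden_layers - 1):
        layers += [nn.Linear(dim, dim), nn.ReLU()]
    net = nn.Sequential(*layers)
    with torch.no_grad():
        eye = torch.eye(state_dim)
        net[0].weight.copy_(torch.cat([eye, -eye], dim=0))  # first layer: [I; -I]
        net[0].bias.zero_()
        for module in list(net)[2::2]:                      # later square layers
            module.weight.copy_(torch.eye(dim))
            module.bias.zero_()
    return net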

D. Detailed Description of Task Setup

All of the simulated experiments used the MuJoCo simulation package (Todorov et al., 2012), with simulated frictional contacts and torque motors at the joints used for actuation. All of the real-world experiments were on a PR2 robot, using its 7 DOF arm controlled via direct effort control. Both the simulated and real-world controllers were run for 5 seconds at 20 Hz, resulting in 100 time steps per rollout. We describe the details of each system below.

In all tasks except for 2D navigation (which has a small state space and complex cost), we chose the dimension of the hidden layers to be approximately double the size of the state, making the network capable of representing the identity function.

2D Navigation: The 2D navigation task has 4 state dimensions (2D position and velocity) and 2 action dimensions. Forty demonstrations were generated by optimizing trajectories for 32 randomly selected positions, with at least 1 demonstration from each starting position. The neural network cost was parametrized with 2 hidden layers of dimension 40 and a final feature dimension of 20.

Reaching: The 2D reaching task has 10 dimensions (3 joint angles and velocities, 2-dimensional end effector position and velocity). Twenty demonstrations were generated by optimizing trajectories from 12 different initial states with arbitrarily chosen joint angles. The neural network was parametrized with 2 hidden layers of dimension 24 and a final feature dimension of 100.

Peg insertion: The 3D peg insertion task has 26 dimensions (7 joint angles, the pose of 2 points on the peg in 3D, and the velocities of both). Demonstrations were generated by shifting the hole within a 0.1 m × 0.1 m region on the table. Twenty demonstrations were generated from sixteen demonstration conditions. The neural network was parametrized with 2 hidden layers of dimension 52 and a final feature dimension of 100.

Dish: The dish placing task has 32 dimensions (7 joint angles, the 3D position of 3 points on the end effector, and the velocities of both). Twenty demonstrations were collected via kinesthetic teaching on nine positions along a 43 cm dish rack. A tenth position, spatially located within the demonstrated positions, was used during IOC. The input to the cost consisted of the 3 end effector points in 3D relative to the target pose (which fully define the pose of the gripper) and their velocities. The neural network was parametrized with 1 hidden layer of dimension 64 and a final feature dimension of 100. Success was based on whether or not the plate was in the correct slot and not broken.

Pouring: The pouring task has 40 dimensions (7 joint angles and velocities, the 3D position of 3 points on the end effector and their velocities, and 2 learned visual feature points in 2D and their velocities). Thirty demonstrations were collected via kinesthetic teaching. For each demonstration, the target cup was placed at a different position on the table within a 28 cm × 13 cm rectangle. The autoencoder was trained on images from the 30 demonstrations (consisting of 3000 images total). The input to the cost was the same as the state but omitting the joint angles and velocities. The neural network was parametrized with 1 hidden layer of dimension 80 and a final feature dimension of 100. To measure success, we placed 15 almonds in the grasped cup and measured the percentage of the almonds that were in the target cup after the executed motion.

[Figure 5 plots: distance vs. samples for the reaching and peg insertion tasks, comparing our full method against ablations without the lcr regularizer and without the mono regularizer, each with demo and random initialization.]

Figure 5. Comparison showing ablations of our method, leaving out one of the two regularization terms. The monotonic regularization improves performance in three of the four task settings, and the local constant rate regularization significantly improves performance in all settings. Reported distance is averaged over four runs of IOC on four different initial conditions.

E. Regularization Evaluation

We evaluated the performance with and without each of the two regularization terms proposed in Section 5 on the simulated reaching and peg insertion tasks. As shown in Figure 5, both regularization terms help performance. Notably, the learned trajectories fail to insert the peg into the hole when the cost is learned with no local constant rate regularization.