
One-Shot Learning of Manipulation Skills with Online Dynamics Adaptation and Neural Network Priors

Justin Fu, Sergey Levine, Pieter Abbeel

Abstract— One of the key challenges in applying reinforcement learning to complex robotic control tasks is the need to gather large amounts of experience in order to find an effective policy for the task at hand. Model-based reinforcement learning can achieve good sample efficiency, but requires the ability to learn a model of the dynamics that is good enough to learn an effective policy. In this work, we develop a model-based reinforcement learning algorithm that combines prior knowledge from previous tasks with online adaptation of the dynamics model. These two ingredients enable highly sample-efficient learning even in regimes where estimating the true dynamics is very difficult, since the online model adaptation allows the method to locally compensate for unmodeled variation in the dynamics. We encode the prior experience into a neural network dynamics model, adapt it online by progressively refitting a local linear model of the dynamics, and use model predictive control to plan under these dynamics. Our experimental results show that this approach can be used to solve a variety of complex robotic manipulation tasks in just a single attempt, using prior data from other manipulation behaviors.

I. INTRODUCTION

One of the remarkable features of human and animal motor control is the ability to quickly adapt to new situations. When a child is asked, for example, to stack two unfamiliar Lego blocks, he or she might play with the objects and take a small amount of time to learn about their dynamics, but will typically succeed at the task very quickly. This is sometimes referred to as one-shot learning, where an agent must successfully perform a task given one, or very few, attempts. Reinforcement learning (RL) provides a computational framework for robots to learn new motor skills, but typical applications of RL focus more on mastery than one-shot learning, and require a substantial number of training episodes [1]. Model-based RL methods reduce the required interaction time by acquiring a model of the system dynamics, and using this model to discover an effective policy. However, model-based RL algorithms that operate in this way must be equipped with a model that can represent a good approximation to the dynamics, and they must be provided with sufficient experience to optimize this model to produce accurate dynamics predictions [2]. A single attempt at the task is often insufficient to obtain such an accurate model, and while these challenges can be mitigated by incorporating domain knowledge about the system [3], [4] or demonstrations [5], they make the development of a general-purpose one-shot learning method difficult.

We propose to address the first challenge by developing a method that can use a coarse model of the system dynamics

Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA 94709

Fig. 1: Diagram of our method: the robot uses prior experience from other tasks (a) to fit a neural network model of the dynamics of object interaction tasks (b). When faced with a new task (c), our algorithm learns a new model online during task execution, using the neural network as a prior. This new model is used to plan actions, allowing for one-shot learning of new skills.

that is adapted online to better match the most recent experience. This frees the method from needing a representation that can capture global dynamics with high accuracy, and requires only the ability to accurately represent the dynamics locally, which we show can be done with a simple linear model. The accuracy of the global model still affects the proficiency with which a new task can be performed, since more accurate models require less adaptation, but even an inaccurate initial model is sufficient for our method to perform a variety of manipulation tasks on the first attempt. We show that even a simple linear global model can allow this framework to perform a variety of manipulation tasks with nonlinear dynamics, while a more sophisticated nonlinear model based on neural networks greatly improves the success rate on more complex tasks. Online adaptation also offers us a way to address the second challenge: since the global model does not need to be very accurate, it can be estimated using data from tasks that are different from the one being attempted. This enables our approach to perform one-shot learning of new tasks by using the robot’s prior experience on other tasks, without requiring explicit domain knowledge or demonstrations to be provided by the designer. A diagram of our method is shown in Figure 1. We demonstrate our approach by learning a variety of challenging, contact-rich manipulation behaviors on a PR2 robot. All of our motion skills involve low-level torque control of the robot’s motors, and high-dimensional state, corresponding to joint angles, joint velocities, and end-effector pose.

II. RELATED WORK

Reinforcement learning techniques have been applied to a range of robotic control problems [2]. Such methods can be broadly categorized as model-free methods, which directly learn a control policy from system interaction [6], and model-based methods, which first learn a model of the system dynamics, and then optimize the policy under this model [1]. Model-based methods can be substantially more sample-efficient and typically achieve the fastest learning times, but they require a model representation that can be used to learn an accurate estimate of the true dynamics. Typically, in order to learn the model from a small amount of interaction, they make various assumptions, such as smoothness [7], [8] or access to prior knowledge about the system [3], [4], since the general problem of learning a complex, nonlinear function from a small amount of data is difficult. Contact-rich robotic manipulation skills present an especially challenging model learning problem, and simple assumptions, such as smoothness and prior knowledge, are often insufficient to acquire a model that is accurate enough for manipulation.

In this paper, we instead specifically seek to learn new behaviors with a coarse model of the system dynamics that does not adequately capture all of the intricacies of the real system, and locally adapt this model online based on the most recent experience. In this way, our approach resembles adaptive control [9]. However, unlike standard adaptive control, we incorporate an expressive (but potentially inaccurate) prior model of the dynamics, constructed from previous experience on other manipulation tasks, and use only high-level objectives, such as the desired position of a target object, rather than simple trajectory tracking. This makes our method suitable for complex robotic manipulation skills with only high-level specification in the form of a cost function. Experience-based priors have previously been suggested in reinforcement learning [10], though typically in the context of accelerating iterative model-free learning. In contrast, our work demonstrates one-shot learning, where the robot can immediately perform new tasks by leveraging prior experience.

In order to choose the actions under our locally adapted model, we use model predictive control (MPC) based on differential dynamic programming (DDP) [11]. Neural network models have recently been combined with MPC for planar task-space control of cutting tasks [12]. While this approach is similar in spirit to the one presented in this work, we tackle a wider range of robotic manipulation tasks using direct, low-level torque control of the entire 7-degree-of-freedom arm. Since the tasks that we tackle are quite different from the prior experience used to train the neural network, local adaptation is required for success, as shown in our experimental results. Outside of robotic manipulation, MPC has been combined with online and offline adaptation [13], [14], [15], but typically in the context of trajectory tracking, rather than learning new skills with high-level specification.

III. BACKGROUND

In reinforcement learning, as well as in optimal control, the aim is to control a dynamical system given by $x_{t+1} = f(x_t, u_t)$ by choosing the actions $u_t$ to minimize the total cost $\sum_{t=1}^{T} \ell(x_t, u_t)$, where $T$ is the time horizon. We consider fixed-horizon episodic tasks in this paper, though our method can also be extended to an infinite-horizon formulation. When the system dynamics $f(x_t, u_t)$ are unknown, we can construct an estimate of the dynamics $\hat{f}(x_t, u_t)$, for example by fitting a parameterized function approximator to data from prior system interactions, and then optimize a sequence of actions under the estimated dynamics $\hat{f}(x_t, u_t)$.

We will make use of the iterative linear quadratic regulator (iLQR) algorithm [16] to optimize the actions with respect to the cost under an estimated dynamics model. This algorithm can be viewed as an application of the Gauss-Newton method for trajectory optimization, and requires iteratively linearizing the dynamics around the current nominal trajectory, denoted $\tau = \{\hat{x}_1, \hat{u}_1, \dots, \hat{x}_T, \hat{u}_T\}$, constructing a quadratic approximation to the cost, computing the optimal actions with respect to this approximation of the dynamics and cost, and running the resulting actions forward to obtain a new nominal trajectory. In the derivation below, we will assume that the linearized stochastic dynamics are given by $p(x_{t+1}|x_t, u_t) = \mathcal{N}(f_{xt} x_t + f_{ut} u_t + f_{ct}, F_t)$, where $F_t$ is the covariance of the Gaussian dynamics noise, and the quadratic approximation to the cost consists of a linear term $\ell_{xut}$ and a quadratic term $\ell_{xu,xut}$, where the subscripts denote differentiation with respect to the vector $[x_t; u_t]$. When the dynamics are linear and the cost is quadratic, the Q-function and the value function are both quadratic, and given by

$$\begin{aligned}
V(x_t) &= \frac{1}{2} x_t^T V_{x,xt} x_t + x_t^T V_{xt} + \text{const} \\
Q(x_t, u_t) &= \frac{1}{2} [x_t; u_t]^T Q_{xu,xut} [x_t; u_t] + [x_t; u_t]^T Q_{xut} + \text{const}
\end{aligned}$$

We can express them with the following recurrence:

$$\begin{aligned}
Q_{xu,xut} &= \ell_{xu,xut} + \gamma f_{xut}^T V_{x,xt+1} f_{xut} \\
Q_{xut} &= \ell_{xut} + \gamma f_{xut}^T V_{xt+1} \\
V_{x,xt} &= Q_{x,xt} - Q_{u,xt}^T Q_{u,ut}^{-1} Q_{u,xt} \\
V_{xt} &= Q_{xt} - Q_{u,xt}^T Q_{u,ut}^{-1} Q_{ut},
\end{aligned}$$

which allows us to compute the optimal control law as $g(x_t) = \hat{u}_t + k_t + K_t(x_t - \hat{x}_t)$, where $\hat{x}_t$ and $\hat{u}_t$ are the nominal state and action, $K_t = -Q_{u,ut}^{-1} Q_{u,xt}$, and $k_t = -Q_{u,ut}^{-1} Q_{ut}$. Performing a forward rollout using this control law allows us to find a new nominal trajectory, and the backward dynamic programming pass is repeated around this trajectory to take the next Gauss-Newton step. The discount factor $\gamma$ allows us to reduce the weight on later states. While DDP is typically used with $\gamma = 1$, we found that we could obtain better results under uncertain, estimated dynamics with $\gamma = 0.95$. Intuitively, this reflects the fact that predictions further into the future are less likely to correspond to reality under uncertain dynamics, so it makes sense not to weight them as highly.
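The backward pass described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lqr_backward`, the array layout (state entries first, then action entries), and the zero value function beyond the horizon are hypothetical choices, and the constant dynamics term $f_{ct}$ is ignored, following the recurrence as written.

```python
import numpy as np

def lqr_backward(f_xu, l_xu, l_xuxu, dim_x, gamma=0.95):
    """Discounted LQR backward pass for time-varying linear dynamics.

    f_xu:   (T, dim_x, dim_x + dim_u) stacked [f_x, f_u] for each time step
    l_xu:   (T, dim_x + dim_u) linear cost terms
    l_xuxu: (T, dim_x + dim_u, dim_x + dim_u) quadratic cost terms
    Returns gains K (T, dim_u, dim_x), offsets k (T, dim_u), Q_uu (T, dim_u, dim_u).
    """
    T = f_xu.shape[0]
    dim_u = f_xu.shape[2] - dim_x
    ix, iu = slice(0, dim_x), slice(dim_x, dim_x + dim_u)
    # Assumption: the value function is zero after the last time step.
    V_xx = np.zeros((dim_x, dim_x))
    V_x = np.zeros(dim_x)
    K = np.zeros((T, dim_u, dim_x))
    k = np.zeros((T, dim_u))
    Q_uu_all = np.zeros((T, dim_u, dim_u))
    for t in reversed(range(T)):
        # Q-function terms from the recurrence, with discount gamma.
        Q_xuxu = l_xuxu[t] + gamma * f_xu[t].T @ V_xx @ f_xu[t]
        Q_xu = l_xu[t] + gamma * f_xu[t].T @ V_x
        Q_uu, Q_ux = Q_xuxu[iu, iu], Q_xuxu[iu, ix]
        # Feedback gain and offset: K = -Q_uu^{-1} Q_ux, k = -Q_uu^{-1} Q_u.
        K[t] = -np.linalg.solve(Q_uu, Q_ux)
        k[t] = -np.linalg.solve(Q_uu, Q_xu[iu])
        # Value function terms, using Q_ux^T Q_uu^{-1} Q_ux = -Q_ux^T K.
        V_xx = Q_xuxu[ix, ix] + Q_ux.T @ K[t]
        V_x = Q_xu[ix] + Q_ux.T @ k[t]
        V_xx = 0.5 * (V_xx + V_xx.T)  # keep symmetric against round-off
        Q_uu_all[t] = Q_uu
    return K, k, Q_uu_all
```

In an MPC setting, a function like this would be called at every time step over a short horizon, and only the first gain and offset would be used to choose the next action.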

Iterative LQR can be used to optimize a trajectory offline under known system dynamics, or even with dynamics estimated from previous executions on a physical system [17]. However, in this work we instead use iterative LQR to perform model-predictive control, where the policy is recomputed in real time at each time step to update the next action in response to the current state. Iterative LQR is well suited for MPC because it is fast, can be effectively used with short horizons, and readily allows warm-starting with the previous solution [18].

However, using iterative LQR for MPC still requires us to obtain an estimate of the linearized system dynamics, given by $f_{xt}$ and $f_{ut}$. The dynamics can be estimated from a simulator of the physical system, but this requires considerable knowledge about the system being controlled. In the case of robotic manipulation, it is reasonable to expect to obtain a good model of the robot, but we often lack a model of the objects that the robot is interacting with, so being able to handle unknown dynamics is highly desirable. In Section II, we discussed how previous methods have approached this problem by using a variety of physical and statistical estimates. In this work, we take a different approach, and do not attempt to accurately learn a global model of the dynamics. Instead, we assume that we have access to only a weak prior model of the form $p(x_t, u_t, x_{t+1})$, and estimate a local linear model of the dynamics under this prior. The local linear model is updated at each time step, which allows our method to gradually adapt to changing dynamics conditions and compensate for inaccuracies in the prior model.

IV. MODEL-BASED REINFORCEMENT LEARNING WITH ONLINE DYNAMICS ADAPTATION

Our model-based reinforcement learning approach with online dynamics adaptation consists of using MPC (with iterative LQR) to repeatedly update the robot’s policy under an evolving dynamics model. The dynamics model is linear time-varying, but is recomputed at each time step based on the recently observed states and actions, as well as the dynamics prior, which can take on one of a number of different representations and is trained on previous robot experience, which might even be from a different task. In this section, we describe how linear dynamics can be fitted under a dynamics prior, as well as our scheme for updating the dynamics online based on the robot’s recent experience. We then summarize our model-based reinforcement learning algorithm.

A. Fitting Dynamics with Priors

In order to fit a linear model of the dynamics to a set of $N$ samples $\{x_i, u_i, x'_i\}$, we can simply use standard linear regression to determine $f_x$, $f_u$, and $f_c$. To make it more convenient to incorporate prior information, we will first reformulate this linear regression fit and view it as fitting a Gaussian model to the dataset $\{x, u, x'\}$, and then conditioning this Gaussian to obtain $p(x_{t+1}|x_t, u_t)$. While this is equivalent to linear regression, it allows us to easily incorporate a normal-inverse-Wishart prior on this Gaussian in order to bring in prior information. Let $\hat{\Sigma}$ be the empirical covariance of our dataset, and let $\hat{\mu}$ be the empirical mean. The normal-inverse-Wishart prior is defined by prior parameters $\Phi$, $\mu_0$, $m$, and $n_0$. Under this prior, the maximum a posteriori estimates for the covariance $\Sigma$ and mean $\mu$ are given by

$$\Sigma = \frac{\Phi + N\hat{\Sigma} + \frac{Nm}{N+m}(\hat{\mu} - \mu_0)(\hat{\mu} - \mu_0)^T}{N + n_0}, \qquad \mu = \frac{m\mu_0 + n_0\hat{\mu}}{m + n_0}. \tag{1}$$

Having obtained $\Sigma$ and $\mu$, we can obtain an estimate of the dynamics $p(x_{t+1}|x_t, u_t)$ by conditioning the distribution $\mathcal{N}(\mu, \Sigma)$ on $[x_t; u_t]$, which produces linear-Gaussian dynamics $p(x_{t+1}|x_t, u_t) = \mathcal{N}(f_{xt} x_t + f_{ut} u_t + f_{ct}, F_t)$, given by

$$\begin{aligned}
f_{xu} &= \Sigma_{[xu,xu]}^{-1} \Sigma_{[xu,x']} \\
f_c &= \mu_{[x']} - f_{xu}\mu_{[xu]} \\
F &= \Sigma_{[x',x']} - f_{xu}\Sigma_{[xu,xu]} f_{xu}^T
\end{aligned} \tag{2}$$

where the bracketed subscripts denote the submatrix corresponding to the specified elements. The parameters of the normal-inverse-Wishart prior are obtained from some prior model of the dynamics. We discuss how a variety of global dynamics models, such as Gaussian mixture models and neural networks, can be used to produce the prior parameters $\Phi$, $\mu_0$, $m$, and $n_0$ in Section V.
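As a rough illustration of Equations (1) and (2), the sketch below computes the MAP Gaussian and then conditions it to obtain linear-Gaussian dynamics. The function names `map_gaussian` and `condition_dynamics` are hypothetical, and the joint vector is assumed to be ordered as $[x; u; x']$. One further assumption: the code returns $f_{xu}$ with shape (dim $x'$, dim $xu$), i.e. the transpose of $\Sigma_{[xu,xu]}^{-1}\Sigma_{[xu,x']}$, so that it can be applied directly as `f_xu @ xu`.

```python
import numpy as np

def map_gaussian(emp_mu, emp_sigma, N, Phi, mu0, m, n0):
    """MAP mean and covariance under a normal-inverse-Wishart prior, as in Eq. (1)."""
    d = emp_mu - mu0
    sigma = (Phi + N * emp_sigma + (N * m) / (N + m) * np.outer(d, d)) / (N + n0)
    mu = (m * mu0 + n0 * emp_mu) / (m + n0)
    return mu, sigma

def condition_dynamics(mu, sigma, dim_xu):
    """Condition N(mu, sigma) over [x; u; x'] on [x; u], as in Eq. (2).

    Returns f_xu (dim_x', dim_xu), f_c (dim_x',), and F (dim_x', dim_x').
    """
    s_xu = sigma[:dim_xu, :dim_xu]
    s_xu_xp = sigma[:dim_xu, dim_xu:]
    # Transposed so that the prediction is x' = f_xu @ [x; u] + f_c.
    f_xu = np.linalg.solve(s_xu, s_xu_xp).T
    f_c = mu[dim_xu:] - f_xu @ mu[:dim_xu]
    F = sigma[dim_xu:, dim_xu:] - f_xu @ s_xu @ f_xu.T
    return f_xu, f_c, 0.5 * (F + F.T)
```

With $\Phi = 0$ and $m = 0$, the prior has no effect on the conditional and the result reduces to ordinary least-squares regression on the samples.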

B. Online Estimation of Locally Linear Dynamics

In the batch setting, such as the one described in prior work [17], the empirical mean $\hat{\mu}$ and covariance $\hat{\Sigma}$ are estimated from a batch of previously collected data. In this work, we instead would like to update the model online during execution. To do this efficiently, we use an online estimate for $\hat{\mu}$ and $\hat{\Sigma}$. The mean can be estimated simply by using the following update:

$$\hat{\mu}_t \leftarrow \beta\hat{\mu}_{t-1} + (1 - \beta) p_t, \tag{3}$$

where $p_t = [x_{t-1}; u_{t-1}; x_t]$ is the $t$th observation and $\beta$ is a discounting factor that causes the model to forget old data. For the covariance, first observe that, in the batch case, $\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} p_i p_i^T - \hat{\mu}\hat{\mu}^T$. We can therefore estimate $\hat{\Sigma}_t$ as $\hat{\Sigma}_t = \Delta_t - \hat{\mu}_t\hat{\mu}_t^T$, where $\Delta_t$ is the current estimate of $\frac{1}{t}\sum_{t'=1}^{t} p_{t'} p_{t'}^T$, which we update according to:

$$\Delta_t \leftarrow \beta\Delta_{t-1} + (1 - \beta) p_t p_t^T. \tag{4}$$

These updates allow us to estimate the empirical mean $\hat{\mu}_t$ and covariance $\hat{\Sigma}_t$ at the $t$th step using recently observed state transitions. The posterior mean and covariance $\mu$ and $\Sigma$ can then be recovered using the prior and the method described in the previous section. We initialize the online estimates $\hat{\mu}_t$ and $\Delta_t$ at $t = 0$ by fitting a single Gaussian to the data used for training the prior, in order to start with a reasonable estimate of the dynamics.
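The running estimates of Equations (3) and (4) can be kept in a small helper such as the one below. This is only a sketch: the class name `OnlineMoments` and its interface are hypothetical, and the initial mean and covariance are assumed to come from a single Gaussian fitted to the prior training data, as described above.

```python
import numpy as np

class OnlineMoments:
    """Exponentially discounted estimates of the empirical mean and covariance."""

    def __init__(self, mu_init, sigma_init):
        self.mu = np.array(mu_init, dtype=float)
        # Delta tracks the discounted second moment E[p p^T].
        self.delta = np.array(sigma_init, dtype=float) + np.outer(self.mu, self.mu)

    def update(self, p, beta):
        """Apply Eqs. (3) and (4) for observation p = [x_{t-1}; u_{t-1}; x_t]."""
        self.mu = beta * self.mu + (1.0 - beta) * p
        self.delta = beta * self.delta + (1.0 - beta) * np.outer(p, p)
        sigma = self.delta - np.outer(self.mu, self.mu)
        return self.mu, sigma
```

A value of `beta` close to 1 averages over a long window of past transitions, while a small value tracks only the most recent ones.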

The value of $\beta$ determines how quickly the algorithm “forgets” past experiences. When the true dynamics are nonlinear, as in most robotic manipulation tasks, we should forget

Algorithm 1 Model-based reinforcement learning with online adaptation

1: for time step $t = 1$ to $T$ do
2:   Observe state $x_t$
3:   Update $\hat{\mu}_t$ and $\Delta_t$ via Equations (3) and (4)
4:   Compute $\hat{\Sigma}_t = \Delta_t - \hat{\mu}_t\hat{\mu}_t^T$
5:   Evaluate prior to obtain $\Phi$, $\mu_0$, $m$, and $n_0$ (see Section V)
6:   Update $\beta$ and $N$ as described in Equation (5)
7:   Compute $\mu$ and $\Sigma$ via Equation (1)
8:   Compute $f_{xt}$, $f_{ut}$, $f_{ct}$, and $F_t$ from $\mu$ and $\Sigma$ via Equation (2)
9:   Run LQR to compute $K_t$, $k_t$, and $Q_{u,ut}$
10:  Sample $u_t$ from $\mathcal{N}(\hat{u}_t + k_t + K_t(x_t - \hat{x}_t), Q_{u,ut}^{-1})$
11:  Take action $u_t$
12: end for

past experiences more quickly when the state is changing rapidly. However, when the robot becomes stuck in a difficult situation that requires a more accurate dynamics estimate, it must incorporate more data, which requires a larger value of $\beta$. The effective sample size $N$ is also important, since it controls the relative strength of the prior. The true effective sample size is in fact inversely proportional to $1 - \beta$. We adaptively adjust both $\beta$ and $N$ based on the relative accuracy of the empirical and prior dynamics estimates. The intuition is that, when the prior is less effective, we should weight the empirical estimate more highly. We should also raise $\beta$, since the empirical estimate factors more highly in the dynamics, and therefore must use more of the past data to improve its accuracy. This intuition is also supported by the inverse relationship between $1 - \beta$ and $N$.

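Specifically, we can form the prediction for the most recently observed transition from $(x_{t-1}, u_{t-1})$ to $x_t$ using both the empirically estimated parameters $\hat{\Sigma}_t$ and $\hat{\mu}_t$ and the prior parameters $\frac{1}{n_0}\Phi_t$ and $\mu_{0t}$. We condition both $\mathcal{N}(\hat{\mu}_t, \hat{\Sigma}_t)$ and $\mathcal{N}(\mu_{0t}, \frac{1}{n_0}\Phi_t)$ on $(x_{t-1}, u_{t-1})$ and predict the most probable value for the current state, which we denote $x_t^{\mathrm{emp}}$ for the prediction from the empirical parameters and $x_t^{\mathrm{prior}}$ for the prediction from the prior parameters. We then compute the ratio of the errors in these predictions, to compare whether the prior or empirical estimate is more accurate, and update $\beta$ and $N$:

$$\rho = \frac{\|x_t - x_t^{\mathrm{emp}}\|^2}{\|x_t - x_t^{\mathrm{prior}}\|^2}, \qquad \beta = 1 - \eta_0\rho, \qquad N = \nu_0/\rho, \tag{5}$$

where $\eta_0$ and $\nu_0$ are hyperparameters of the algorithm. With the empirical error in the numerator, a less accurate prior gives a smaller $\rho$ and therefore larger $\beta$ and $N$, in line with the intuition above. Note that, in these updates, $N$ is inversely proportional to $1 - \beta$, as expected, and the proportionality constant is controlled by $\eta_0$ and $\nu_0$, which we set as $\nu_0 = 1$ and $\eta_0 = 8$ in all of our experiments.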
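The update in Equation (5) can be sketched as below. The helper names `predict_next_state` and `adapt_beta_and_N` are hypothetical. Two details are assumptions that the text does not state: a small `eps` guards against division by zero, and $\beta$ is clipped to $[0, 1]$ so that it remains a valid discount factor.

```python
import numpy as np

def predict_next_state(mu, sigma, xu, dim_xu):
    """Most probable x' under N(mu, sigma) over [x; u; x'], given [x; u] = xu."""
    gain = np.linalg.solve(sigma[:dim_xu, :dim_xu], sigma[:dim_xu, dim_xu:]).T
    return mu[dim_xu:] + gain @ (xu - mu[:dim_xu])

def adapt_beta_and_N(x_t, x_emp, x_prior, eta0, nu0, eps=1e-12):
    """Compute rho, beta, and N from the two one-step predictions, as in Eq. (5)."""
    err_emp = np.sum((x_t - x_emp) ** 2)
    err_prior = np.sum((x_t - x_prior) ** 2)
    rho = (err_emp + eps) / (err_prior + eps)  # eps is an added safeguard
    beta = float(np.clip(1.0 - eta0 * rho, 0.0, 1.0))  # clipping is an assumption
    N = nu0 / rho
    return rho, beta, N
```

For example, if the empirical prediction error is one hundredth of the prior error, then $\rho = 0.01$, which with $\eta_0 = 8$ and $\nu_0 = 1$ gives $\beta = 0.92$ and $N = 100$.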

C. Algorithm Summary

The structure of our method is summarized in Algorithm 1. At each time step, the method observes the current state $x_t$ and uses $[x_{t-1}; u_{t-1}; x_t]$ to update the current estimate of the empirical mean and covariance $\hat{\mu}$ and $\hat{\Sigma}$ as described in the previous section. The method then evaluates the prior to obtain $\Phi$, $\mu_0$, $m$, and $n_0$, and updates $\beta$ and $N$ based on the accuracy of the empirical and prior state prediction. The empirical and prior estimates are then combined according to Equation (1) to construct the posterior mean and covariance, from which we can obtain an estimate for the current dynamics. These estimated dynamics are then used, together with a local second-order expansion of the cost function, to optimize a new linear feedback policy using LQR. This feedback policy, given by $g(x_t) = \hat{u}_t + k_t + K_t(x_t - \hat{x}_t)$, where $\hat{x}_t$ and $\hat{u}_t$ are the nominal state and action, can then be used to choose the next action $u_t$.

In practice we often want to perform a small amount of exploration. For example, if the robot is attempting a peg insertion task, and the peg becomes jammed in the hole, simply applying the estimated optimal action repeatedly may be insufficient. The robot must “wiggle” its arm to figure out the contact dynamics. To that end, we add a small amount of Gaussian noise to the action. The amount of noise to add is determined by the Q-function obtained from LQR, by setting the covariance to be proportional to $Q_{u,ut}^{-1}$. This choice of covariance is motivated by the observation that it produces a maximum entropy policy that properly trades off randomness and cost minimization, as discussed in prior work [19].
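The exploration step can be sketched as drawing the action from a Gaussian centered on the LQR feedback action. The function name `sample_action` and the `noise_scale` parameter (the proportionality constant on the covariance) are hypothetical; the text does not give the constant that was used.

```python
import numpy as np

def sample_action(x, x_nom, u_nom, K, k, Q_uu, rng, noise_scale=1.0):
    """Sample u from N(u_nom + k + K (x - x_nom), noise_scale * Q_uu^{-1})."""
    mean = u_nom + k + K @ (x - x_nom)
    cov = noise_scale * np.linalg.inv(Q_uu)
    cov = 0.5 * (cov + cov.T)  # symmetrize against round-off
    return rng.multivariate_normal(mean, cov)

# Usage with made-up numbers: two state dimensions, one action dimension.
rng = np.random.default_rng(0)
u = sample_action(
    x=np.array([0.2, 0.0]), x_nom=np.zeros(2), u_nom=np.array([0.5]),
    K=np.array([[-0.5, 0.0]]), k=np.array([0.1]), Q_uu=np.array([[4.0]]), rng=rng,
)
```

Directions in action space where the Q-function is steep (large $Q_{u,ut}$) receive little noise, while directions where the cost is flat receive more.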

V. NEURAL NETWORK DYNAMICS PRIORS

In this section, we discuss several possible choices for the dynamics prior, describe how the normal-inverse-Wishart prior parameters can be obtained from these priors, and go into detail on the particular neural network prior that we use in this work. The various choices for the priors are compared in our experimental evaluation in Section VI.

A. Gaussian and Gaussian Mixture Priors

As discussed in the previous section, our online estimate of the dynamics linearization makes use of a dynamics prior. The simplest prior can be obtained by fitting a Gaussian distribution to vectors $[x; u; x']$ obtained from a large batch of prior data. If the mean and covariance of this prior data are given by $\bar{\mu}$ and $\bar{\Sigma}$, the prior is given by $\Phi = n_0\bar{\Sigma}$ and $\mu_0 = \bar{\mu}$, while $n_0$ and $m$ should be set to the number of data points in the prior dataset. In practice, setting $n_0$ and $m$ to 1 tends to produce substantially better results, since the online empirical mean and covariance are typically obtained from a much smaller number of samples.
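A minimal sketch of the Gaussian prior follows. The function name `gaussian_prior` is hypothetical, and the prior data is assumed to be stored as one $[x; u; x']$ vector per row; the defaults $n_0 = m = 1$ follow the practical recommendation above.

```python
import numpy as np

def gaussian_prior(data, n0=1.0, m=1.0):
    """Normal-inverse-Wishart parameters from a single Gaussian fit.

    data: (num_samples, dim) array whose rows are [x; u; x'] vectors.
    Returns Phi = n0 * Sigma_bar, mu0 = mu_bar, m, and n0.
    """
    mu_bar = data.mean(axis=0)
    sigma_bar = np.cov(data, rowvar=False, bias=True)
    return n0 * sigma_bar, mu_bar, m, n0
```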

A more sophisticated choice of prior explored in previous work is a Gaussian mixture model over vectors $[x; u; x']$, which allows modeling of nonlinear dynamics [17]. Under this model, the state transition tuple is assumed to come from a distribution that depends on some hidden state $h_i$, which corresponds to the mixture element identity. In practice, this hidden state might correspond to the type of contact profile experienced by a robotic arm at step $i$. The prior is obtained by inferring the hidden state $h_i$ for the latest tuple $[x_{i-1}; u_{i-1}; x_i]$, and using the mean and covariance of the corresponding mixture element to obtain $\bar{\mu}$ and $\bar{\Sigma}$. The prior parameters can then be obtained as described above.

B. Neural Network Priors

The Gaussian prior is limited to representing globally linear dynamics, while the Gaussian mixture model can only represent a small number of locally linear modes, and has limited representational capacity in complex state spaces. Recent work has shown that neural networks can learn very good dynamics models for tasks ranging from helicopter flight [20] to cutting vegetables [12]. We therefore also evaluated the use of neural networks for representing the dynamics prior, by training a neural network to map from state-action tuples $[x; u]$ to the next state $x'$.

In order to construct a dynamics prior from a neural network $f([x; u])$ at the current state and action $[x_i; u_i]$, we first linearize to obtain

$$f([x; u]) \approx f([x_i; u_i]) + \frac{df}{d[x; u]}^T \left([x; u] - [x_i; u_i]\right).$$

From this linearization, we can construct a local prior mean and covariance according to

$$\bar{\mu} = \begin{bmatrix} [x_i; u_i] \\ f([x_i; u_i]) \end{bmatrix}, \qquad
\bar{\Sigma} = \begin{bmatrix}
\Sigma_{xu,xu} & \frac{df}{d[x;u]}^T \Sigma_{xu,xu} \\
\Sigma_{xu,xu} \frac{df}{d[x;u]} & \frac{df}{d[x;u]}^T \Sigma_{xu,xu} \frac{df}{d[x;u]} + \Sigma_{x',x'}
\end{bmatrix}.$$

The prior state-action covariance $\Sigma_{xu,xu}$ determines the strength of the prior, and we set it to $\alpha I$, where $\alpha$ is a free parameter. The conditional covariance $\Sigma_{x',x'}$ can be obtained from the empirical covariance between the network predictions $f([x; u])$ and the target next states $x'$ in the training dataset. Using the above $\bar{\mu}$ and $\bar{\Sigma}$, the normal-inverse-Wishart prior parameters can be obtained in the same way as for the Gaussian and mixture of Gaussians priors.
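The construction above can be sketched for any differentiable dynamics function. Several things here are assumptions made for illustration: the names `numerical_jacobian` and `linearized_prior` are hypothetical, the Jacobian is approximated by central finite differences rather than by differentiating a network, and `A` denotes the (dim $x'$ by dim $xu$) Jacobian, i.e. the transposed derivative in the linearization. The off-diagonal blocks are placed so that the shapes agree under the ordering $[x; u; x']$.

```python
import numpy as np

def numerical_jacobian(f, xu, eps=1e-6):
    """Central finite-difference Jacobian of f at xu, shape (dim_x', dim_xu)."""
    xu = np.asarray(xu, dtype=float)
    y0 = np.asarray(f(xu))
    A = np.zeros((y0.size, xu.size))
    for j in range(xu.size):
        step = np.zeros_like(xu)
        step[j] = eps
        A[:, j] = (np.asarray(f(xu + step)) - np.asarray(f(xu - step))) / (2 * eps)
    return A

def linearized_prior(f, xu, alpha, sigma_res, n0=1.0):
    """Local prior mean and Phi = n0 * Sigma_bar from a linearization of f.

    alpha:     strength parameter, Sigma_{xu,xu} = alpha * I
    sigma_res: residual covariance of the model predictions, Sigma_{x',x'}
    """
    xu = np.asarray(xu, dtype=float)
    A = numerical_jacobian(f, xu)
    s_xu = alpha * np.eye(xu.size)
    mu_bar = np.concatenate([xu, np.asarray(f(xu))])
    sigma_bar = np.block([
        [s_xu, s_xu @ A.T],
        [A @ s_xu, A @ s_xu @ A.T + sigma_res],
    ])
    return n0 * sigma_bar, mu_bar
```

Conditioning a Gaussian with this covariance on $[x; u]$ recovers the Jacobian `A` as the linear dynamics, so the prior pulls the fitted local model toward the linearization of the global model.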

C. Neural Network Architectures

In order to represent the dynamics prior, we used a neural network with two hidden layers, each with rectified linear units given by $z = \max(a, 0)$. The first hidden layer consists of 60 hidden units, and the second consists of 40. We evaluated two network architectures, both of which are illustrated in Figure 2. The first takes the current state $x_t$ and current action $u_t$ as input, and predicts accelerations. These accelerations are then integrated via a semi-implicit integration rule to obtain a prediction for $x_{t+1}$, and the network is optimized to minimize the error in predicting the entire next state $x_{t+1}$.
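The integration step can be illustrated with semi-implicit (symplectic) Euler, which is one common semi-implicit rule; the text does not specify which rule was used, so this choice is an assumption. The function name, the split of the state into positions and velocities, and the time step of 0.05 s (matching a 20 Hz rate) are likewise illustrative.

```python
import numpy as np

def semi_implicit_step(pos, vel, acc, dt=0.05):
    """Integrate predicted accelerations into the next positions and velocities.

    The velocity is updated first, and the new velocity is then used to
    update the position (semi-implicit Euler).
    """
    new_vel = vel + dt * acc
    new_pos = pos + dt * new_vel
    return new_pos, new_vel

# Usage with made-up values for a single coordinate.
next_pos, next_vel = semi_implicit_step(np.array([0.0]), np.array([1.0]), np.array([2.0]))
```

Predicting accelerations and integrating them builds the known kinematic relationship between positions and velocities into the model, so the network only has to learn the part of the dynamics that is actually unknown.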

Our second architecture is identical to the first, but instead takes as input the current and previous states and actions $(x_{t-1}, u_{t-1}, x_t, u_t)$. This additional temporal context is very important when the dynamics prior is trained on other tasks, especially tasks that involve contacts with objects in the environment. Since the state $x_t$ only includes the configuration of the robot, it does not model the geometry of the scene, which might change across tasks. The state is therefore not Markovian across tasks, and a network without additional context cannot accurately predict the next state $x_{t+1}$ from only the current state $x_t$ and action $u_t$. We found

Fig. 2: Diagram of the neural network architectures used in our experiments. We found that using a short temporal context as input, as shown in network (2), improved the results for manipulation tasks that involved contact dynamics. Both networks produce accelerations which are used to predict the next state.

that the network without temporal context tried to explain contacts by exploiting certain regularities. For example, when the end-effector stopped after hitting an object, the network assumed that a hard contact had occurred and that there was an impassable obstruction in the way. With a temporal context that included the previous state and action, the network was able to determine whether or not a contact was happening by comparing the previously applied joint torques to the acceleration actually experienced by the robot.

D. Prior Training Data

The neural network prior must be trained on previous interaction data in order to provide a helpful prior model of the system dynamics. However, because the neural network only acts as a prior, it can be trained on previous interaction data from different tasks. This allows our method to perform one-shot learning of new manipulation tasks, using no prior data from that task itself. In practice, the amount of data required to learn an effective prior is considerable, so we used data collected from a variety of sources to provide a sufficiently diverse dataset. The total training set for the physical robot experiments had 6.6 hours of data, collected at 20 Hz. For each individual task, we excluded the data collected for that task when training the prior, which reduced the effective dataset by about 15%. This corresponds to a type of holdout cross-validation. Likewise, in the simulated experiments, we collected approximately 12 simulation hours of data at 20 Hz across 4 different tasks, and excluded the data from the task being executed from the training set.

The dataset consisted of trials from the various evaluation tasks (workbench, gears, airplane, car, and ring tasks for the physical system and peg insertion and stacking for the simulated system), which are described in the next section, as well as random motion of the arm in free space. For each task, data was collected using a previous reinforcement

Fig. 3: Simulated tasks used for evaluation: (a) Cylindrical peg insertion (b) Cross-shaped peg insertion (c) Stacking over a cylindrical peg (d) Stacking over a square-shaped peg

learning method [21], though in practice this data could have also been generated using prior experience from the method presented in this paper, human demonstrations, or any other method that provides good coverage of interesting states (e.g. states that involve contact with objects in the world). Although the size of this dataset is considerable, the resulting prior, when combined with our online adaptation method, can generalize effectively to other tasks, making it quite universal.

VI. EXPERIMENTAL EVALUATION

A. Experimental Setup

We evaluated our method on a range of manipulation tasks using a physical PR2 mobile manipulator and a simulated PR2 arm using the MuJoCo simulator [11]. The simulated tasks provide a large-scale comparison between methods, and the physical tasks demonstrate effectiveness on a real system.

On the physical system, our method was used to control the right arm of the robot, while the left arm was used to brace objects for two-object manipulation and assembly tasks. We evaluated the method by inserting a toy nail into a tool bench, including a high-friction version of the task designed to make it more difficult, placing wooden rings onto pegs, putting together parts of a gear assembly, assembling a toy car and airplane, and stacking blocks. These tasks are shown in Figure 4. Our simulated tasks consisted of a pair of peg insertion tasks with different shaped pegs, and a pair of peg stacking tasks, where the robot needs to fit a piece with a square-shaped opening over a cylindrical or square-shaped peg. These are shown in Figure 3. A trial was defined to be a failure if the distance to a target pose was not sufficiently small at the end of a time threshold of 10 seconds.

The state space consisted of several measurements from the active right arm of the robot: joint angles, joint angular velocities, the pose of the end effector encoded as 3 Cartesian points, and the velocities of those 3 points. This amounted to a 32-dimensional state space. The controller operated at 20 Hz with an MPC horizon of 15 timesteps (0.75 seconds).
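The stated dimensions can be checked with a short calculation. The breakdown below assumes 7 joints (the arm is described earlier as having 7 degrees of freedom) and that each of the 3 Cartesian points has 3 coordinates; the variable names are illustrative.

```python
NUM_JOINTS = 7   # 7-degree-of-freedom arm
NUM_POINTS = 3   # end-effector pose encoded as 3 Cartesian points
COORDS = 3       # assumption: each point is three-dimensional

state_dim = (
    NUM_JOINTS               # joint angles
    + NUM_JOINTS             # joint angular velocities
    + NUM_POINTS * COORDS    # end-effector point positions
    + NUM_POINTS * COORDS    # end-effector point velocities
)

CONTROL_RATE_HZ = 20.0
HORIZON_STEPS = 15
horizon_seconds = HORIZON_STEPS / CONTROL_RATE_HZ

print(state_dim, horizon_seconds)  # 32 0.75
```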

The cost function for each task depended on the distance $d$ between the current pose of the end-effector and a target pose required for successful completion. For example, in the case of the toy nail task, the target pose involved positioning the gripper such that the nail was successfully inserted into a hole in the toy tool bench. The shape of this cost follows prior work [21], and is given by $r_\ell(d) = w d^2 + v \log(d^2 + \alpha)$, where $\alpha = 10^{-5}$, $w = 1.0$, and $v = 0.01$. This type of shape encourages the robot to accurately place the object in precisely the desired configuration. In addition, we placed a quadratic penalty on the magnitude of the torques.
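The distance term of the cost can be written directly from the formula above. The function name `distance_cost` is hypothetical, and the torque penalty is omitted because its weight is not given in the text.

```python
import numpy as np

def distance_cost(d, w=1.0, v=0.01, alpha=1e-5):
    """Distance cost w * d^2 + v * log(d^2 + alpha) with the stated constants."""
    d2 = np.square(d)
    return w * d2 + v * np.log(d2 + alpha)

# Evaluate the cost at a few distances, from far away to exactly on target.
costs = distance_cost(np.array([1.0, 0.1, 0.01, 0.0]))
```

The quadratic term dominates far from the target, while the logarithmic term keeps decreasing sharply as $d$ approaches zero, which rewards closing the last small gap to the target pose.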

As described in the previous section, the neural network dynamics prior was trained using data collected from all of the other tasks, but no data from the task being tested. This corresponds to a hold-one-out cross-validation scheme.

B. Comparisons

We evaluated our adaptive method with two neural network priors: network #1 in Figure 2, which uses only the current state $x_t$ and action $u_t$ to predict the next state $x_{t+1}$, and network #2, which also uses the preceding state and action. In addition, we also compared to a number of baselines. The non-adaptive baseline used network #2, which was the better of the two neural network priors, to directly plan actions using MPC, which most closely resembles the structure of a previously proposed neural network-based MPC method [12]. The Gaussian process baselines used a Gaussian process either as a prior for the adaptive approach, or as the model (without adaptation) for MPC, which most closely reflects a range of recent Gaussian process model-based reinforcement learning algorithms [7], [22]. Since the Gaussian process is a nonparametric model, we were unable to use the same amount of data with this model and maintain real-time performance. We therefore subsampled the data to the largest size that still permitted online MPC (approximately 10000 training points, which was about 5% of the total training set for the physical robot). Finally, the regularized model substituted a global linear model for the neural network dynamics prior.

We show the success rates for each method on the simulated robot in Table I and on the physical robot in Tables II (for the standard and high-friction variant of the toolbench and toy nail task, as well as for the wooden ring on a peg task) and III (for the block, gears, airplane, and car assembly tasks). Note that each of the runs used to evaluate each method was performed separately, with no information retained between runs. The adaptive method outperformed the non-adaptive variants, indicating the importance of online adaptation for models trained on other tasks. The neural network prior also achieved the best overall performance, particularly when using the preceding state and action as context, although even the simple least-squares prior was able to accomplish some of the simpler tasks.

We also show the success rates for our adaptive method with network #2 on each of the other tasks. These results show that our method was able to succeed on a wide range of challenging manipulation tasks on the first try, without using any data from that task to train the dynamics prior.

Fig. 4: Tasks used in our evaluation: (a) inserting a toy nail into a toolbench, (b) inserting the nail with a high-friction surface to increase difficulty, (c) placing a wooden ring on a tight-fitting peg, (d) stacking toy blocks, (e) putting together part of a gear assembly, (f) assembling a toy airplane and (g) a toy car.

Method                                              Ins, Cylinder (a)  Ins, Cross (b)  Stack, Cylinder (c)  Stack, Square (d)  Total
Adaptation + regularization                         10/25              5/25            18/25                10/25              43/100
Adaptation + GMM prior                              10/25              6/25            25/25                18/25              59/100
Contextual network (#2)                             8/25               1/25            19/25                12/25              40/100
Adaptation + network #1                             14/25              8/25            20/25                14/25              56/100
Adaptation + contextual network (#2) (full method)  19/25              11/25           25/25                20/25              75/100

TABLE I: Success rate of our method on simulated tasks, as well as comparisons to a range of baselines representative of prior methods on three of the tasks.

Method                                              Toolbench (a)  W/ friction (b)  Ring (c)  Total
Adaptation + regularization                         2/5            0/5              4/5       6/15
Adaptation + GMM prior                              0/5            1/5              2/5       3/15
Gaussian process                                    0/5            0/5              0/5       0/15
Adaptation + Gaussian process                       1/5            0/5              0/5       1/15
Contextual network (#2)                             3/5            3/5              4/5       10/15
Adaptation + network #1                             4/5            0/5              3/5       7/15
Adaptation + contextual network (#2) (full method)  5/5            4/5              4/5       13/15

TABLE II: Success rate of our method on each physical task. Note that our approach regularly succeeds at the task despite only using training data from other tasks.

Task          Success rate
Block (d)     4/5
Gears (e)     4/5
Airplane (f)  3/5
Car (g)       3/5

TABLE III: Success rates on additional tasks using adaptation with network #2 only.

Task                                                Block                     Toolbench                 Ring
Target position error                               0.5 cm  1.0 cm  1.5 cm    0.5 cm  1.0 cm  1.5 cm    0.5 cm  1.0 cm  1.5 cm
Contextual network (#2)                             2/5     0/5     0/5       4/5     2/5     0/5       4/5     4/5     1/5
Adaptation + contextual network (#2) (full method)  4/5     1/5     0/5       3/5     3/5     1/5       4/5     3/5     0/5

TABLE IV: Robustness results in the presence of target position errors. The target used by the controller was intentionally offset from its true position by the indicated amount, in a random direction along the horizontal plane.

C. Robustness

We also evaluated the robustness of our method to observation errors. In these experiments, the target position in the cost function was intentionally corrupted by fixed-magnitude errors. In real-world scenarios, such errors might stem from imperfect observations produced, for example, by a vision system. The results of the robustness experiments on three of the tasks are presented in Table IV. For comparison, we also include robustness results for the neural network alone (network #2), without adaptation. The ring was the easiest of the three tasks, since the rounded top of the peg provides some tolerance, while the block was the hardest. Note that adaptation is particularly helpful on the harder tasks.
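One way to generate such a corrupted target is sketched below. The sketch assumes that positions are expressed in metres, that the third coordinate is the vertical axis, and that the direction is drawn uniformly over the horizontal plane; the function name `offset_target` is hypothetical, and the text above does not specify how the random direction was sampled.

```python
import numpy as np

def offset_target(target_xyz, magnitude, rng):
    """Shift a 3D target by a fixed distance in a random horizontal direction.

    Assumes z (index 2) is vertical, so only x and y are perturbed.
    """
    angle = rng.uniform(0.0, 2.0 * np.pi)  # uniform heading (assumption)
    offset = magnitude * np.array([np.cos(angle), np.sin(angle), 0.0])
    return np.asarray(target_xyz, dtype=float) + offset

# Example: a 1.0 cm error, as in the middle column of each task in Table IV.
rng = np.random.default_rng(0)
noisy_target = offset_target([0.40, 0.10, 0.25], magnitude=0.01, rng=rng)
```

The example target coordinates are placeholders; only the 1.0 cm magnitude corresponds to a value reported in the table.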

D. Qualitative Results and Conclusions

Our experimental results show that our model-based reinforcement learning algorithm with online adaptation can achieve a range of challenging manipulation tasks on the first attempt. Furthermore, our comparison of the various prior models shows that a neural network prior with a short temporal context achieved the best results, though no prior model by itself was as successful as the corresponding adaptive variant. Videos of our tasks, which can be viewed in the supplementary material and on the project website (http://rll.berkeley.edu/iros2016onlinecontrol/index.html), show that the resulting behaviors can take some time to succeed at the task (though never more than 20 seconds). Much of this time is spent exploring unfamiliar dynamical modes, for example after an unexpected contact. This kind of exploration reflects a very natural strategy for performing unfamiliar tasks: probing the object until progress is made in the desired direction, while gradually acquiring a more accurate model. It is precisely this probing and gradual self-improvement that is key to the robustness and effectiveness of this approach.

Acknowledgements

This research was funded in part by the Army Research Office through the MAST program, and by DARPA under Award #N66001-15-2-4047.

VII. DISCUSSION AND FUTURE WORK

In this paper, we presented a model-based reinforcement learning algorithm that uses online adaptation of its dynamics model combined with a coarse prior model obtained from experience on other, prior tasks. We use an MPC method based on iterative LQR to choose the actions under the current adapted model, which is a local linear approximation to the nonlinear dynamics of the system. Although this linear model is simple, our method is able to perform complex manipulation tasks in highly nonlinear systems by adapting it to the most recent experience and re-planning. The coarse prior model is therefore not required to accurately model the true dynamics of the system, and need only provide a guess that is refined through online adaptation. Our experiments show that even a fully linear prior model can be used for some behaviors, while a more sophisticated neural network prior allows our method to complete more complex tasks.
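The idea of refitting a local linear model to the most recent experience can be illustrated with a simplified sketch. It keeps an exponentially weighted mean and covariance of the stacked vector [x_t, u_t, x_{t+1}] and conditions the resulting Gaussian to obtain x_{t+1} ≈ F [x_t, u_t] + f. The class name `LocalLinearModel`, the `decay` and `reg` parameters, and the small ridge term are assumptions made for illustration; in particular, the ridge term stands in for the dynamics prior, which is omitted here, and the sketch should not be read as the implementation used in our experiments.

```python
import numpy as np

class LocalLinearModel:
    """Online linear dynamics fit: x_next ~= F @ [x, u] + f (illustrative)."""

    def __init__(self, dx, du, decay=0.1, reg=1e-6):
        self.k = dx + du      # size of the [x, u] block
        self.decay = decay    # weight on the newest sample (assumed value)
        self.reg = reg        # ridge term standing in for the prior
        self.mu = None        # running mean of [x, u, x_next]
        self.sigma = np.zeros((2 * dx + du, 2 * dx + du))

    def update(self, x, u, x_next):
        z = np.concatenate([x, u, x_next])
        if self.mu is None:
            self.mu = z.copy()  # first sample: covariance stays zero
            return
        d = z - self.mu
        # Exponentially weighted mean and covariance update.
        self.mu = self.mu + self.decay * d
        self.sigma = (1.0 - self.decay) * (
            self.sigma + self.decay * np.outer(d, d))

    def dynamics(self):
        k = self.k
        s_in = self.sigma[:k, :k] + self.reg * np.eye(k)
        s_cross = self.sigma[k:, :k]
        # Gaussian conditioning: F = S_{x',xu} S_{xu,xu}^{-1}.
        F = np.linalg.solve(s_in, s_cross.T).T
        f = self.mu[k:] - F @ self.mu[:k]
        return F, f
```

In a control loop, `update` would be called after every observed transition and `dynamics` before each re-planning step, so that the planner always works with a model fitted to recent data. When only a few samples have been seen, the covariance is rank deficient and the estimate is dominated by the ridge term, which is the regime in which an informative prior matters most.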

We believe that combining online model adaptation with prior experience by means of a dynamics prior is a powerful idea because it allows our method to succeed even with unexpected environmental variation and inaccurate prior models. Furthermore, by using rich function approximators such as large neural networks to distill prior experience into a concise parametric representation, the robot can progressively become more proficient at acquiring new skills as its experience grows, similar to how a person can quickly learn new skills by drawing on past experiences. However, in contrast to humans, robots can pool their collective experience and use the combined data to train shared dynamics priors. Exploring this type of multi-robot learning, with shared priors but individual adaptation, is an exciting direction for future work.

While we aggregated data from all other tasks to train the dynamics model in our evaluations, we may want to use only related tasks to avoid negative transfer. This may be accomplished by clustering or grouping the prior tasks.

Although we demonstrate that our approach can complete a variety of challenging manipulation tasks, a limitation of this method is that we require a full, Markovian state of the system, which is needed to perform MPC using iterative LQR. This assumption is not unusual for MPC-based methods, but can be limiting in the context of robotic manipulation. One approach for addressing this is to learn a latent state representation from raw sensory input, as proposed in recent work [23], [24], [25], and perform MPC on this learned representation with online dynamics adaptation.

Another avenue for future work is to combine other kinds of prior information with online updates. For example, a prior policy might be constructed from experience of related tasks, and refined in a similar fashion as the dynamics adaptation. This could allow our relatively short-horizon MPC procedure to acquire longer lookahead through the use of the prior policy, and allow the use of rich sensory data as demonstrated in recent work on policy search with neural networks [26].

REFERENCES

[1] M. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, no. 1-2, pp. 1–142, 2013.

[2] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.

[3] D. Nguyen-Tuong and J. Peters, “Using model knowledge for learning inverse dynamics,” in International Conference on Robotics and Automation (ICRA), 2010.

[4] M. Cutler and J. P. How, “Efficient reinforcement learning for robots using informative simulated priors,” in International Conference on Robotics and Automation (ICRA), 2015.

[5] Y. Wu and Y. Demiris, “Towards one shot learning by imitation for humanoid robots,” in International Conference on Robotics and Automation (ICRA), 2010.

[6] J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.

[7] M. Deisenroth and C. Rasmussen, “PILCO: a model-based and data-efficient approach to policy search,” in International Conference on Machine Learning (ICML), 2011.

[8] J. Boedecker, J. T. Springenberg, J. Wulfing, and M. Riedmiller, “Approximate real-time optimal control based on sparse Gaussian process models,” in Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014.

[9] K. J. Astrom and B. Wittenmark, Adaptive Control, 2nd ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1994.

[10] D. Wingate, N. D. Goodman, D. M. Roy, L. P. Kaelbling, and J. B. Tenenbaum, “Bayesian policy search with policy priors,” in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), 2011.

[11] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

[12] I. Lenz, R. Knepper, and A. Saxena, “DeepMPC: Learning deep latent features for model predictive control,” in Robotics: Science and Systems (RSS), 2015.

[13] H. Fukushima, T. Kim, and T. Sugie, “Adaptive model predictive control for a class of constrained linear systems based on the comparison model,” Automatica, vol. 43, no. 2, pp. 301–308, Feb. 2007.

[14] A. Aswani, P. Bouffard, and C. Tomlin, “Extensions of learning-based model predictive control for real-time application to a quadrotor helicopter,” in American Control Conference (ACC), 2012.

[15] G. Chowdhary, M. Muhlegg, J. How, and F. Holzapfel, “Concurrent learning adaptive model predictive control,” in Advances in Aerospace Guidance, Navigation and Control. Springer Berlin Heidelberg, 2013.

[16] W. Li and E. Todorov, “Iterative linear quadratic regulator design for nonlinear biological movement systems,” in ICINCO (1), 2004, pp. 222–229.

[17] S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in Advances in Neural Information Processing Systems (NIPS), 2014.

[18] Y. Tassa, T. Erez, and E. Todorov, “Synthesis and stabilization of complex behaviors through online trajectory optimization,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

[19] S. Levine and V. Koltun, “Guided policy search,” in International Conference on Machine Learning (ICML), 2013.

[20] A. Punjani and P. Abbeel, “Deep learning helicopter dynamics models,” in International Conference on Robotics and Automation (ICRA), 2015.

[21] S. Levine, N. Wagener, and P. Abbeel, “Learning contact-rich manipulation skills with guided policy search,” in International Conference on Robotics and Automation (ICRA), 2015.

[22] Y. Pan and E. Theodorou, “Probabilistic differential dynamic programming,” in Neural Information Processing Systems (NIPS), 2014.

[23] M. Riedmiller, S. Lange, and A. Voigtlaender, “Autonomous reinforcement learning on raw visual input data in a real world application,” in International Joint Conference on Neural Networks, 2012.

[24] M. Watter, J. T. Springenberg, J. Boedecker, and M. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images,” in Advances in Neural Information Processing Systems (NIPS), 2015.

[25] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, “Deep spatial autoencoders for visuomotor learning,” in International Conference on Robotics and Automation (ICRA), 2016.

[26] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research (JMLR), 2016.
