
Goal-Based Imitation Learning for Probabilistic Human-Robot Collaboration

Michael Jae-Yoon Chung, Jinna Lei, Ankit Gupta, Dieter Fox, and Rajesh P.N. Rao
Computer Science & Engineering

University of Washington
Seattle, Washington 98195-2350

Email: {mjyc,jinna,ankit,fox,rao}@cs.washington.edu

Abstract—Robots that can collaborate with humans and learn new skills on-the-fly could revolutionize robotics, significantly increasing the realm of applicability of robots in our day-to-day lives. Considerable progress has been made in the area of robotic imitation learning (learning from human demonstration), but the focus has been on imitating human-demonstrated action trajectories. Such a trajectory-centric approach ignores the fact that often, it is not the trajectory that is important but the goal of the action. In this paper, we present a goal-based approach to imitation learning that allows a robot to: (1) learn probabilistic models of actions through self-discovery and experience, (2) utilize these learned models for inferring the goals of human demonstrations, and (3) perform goal-based imitation for human-robot collaboration. Such an approach allows imitation learning even when the robot has very different actuators from humans. We illustrate our approach in the context of a robotic arm learning a tabletop organization task. We show that the robot can learn a probabilistic model of its actions on objects of different shapes, and use this model for both goal-inference and goal-based imitation of human actions. We additionally demonstrate how the use of probabilistic models allows the robot to seek human assistance whenever it recognizes that its inferred actions are too uncertain, risky, or impossible to perform.

I. INTRODUCTION

Recent advances in imitation learning (also referred to as learning by demonstration and apprenticeship learning) have led to a number of impressive demonstrations of robotic skill learning from humans [1]–[5]. Most of these results involve following action trajectories given by a human in the same space as the robot's actuator space. For example, demonstrations in [1], [3], [4] were collected by a human manually moving the robot's arm, while in [2], demonstrations were obtained by joystick control of a helicopter. Even in the case of imitation learning in humanoids (e.g., [5]) where manual or joystick control is not possible, a motion-capture system is typically used to provide a trajectory of tracked human poses for the humanoid robot to imitate.

In many instances of imitation learning, it is not the trajectory that is important but the goal of the action. For example, if the goal is to open a box, it does not matter which hand is used to hold the box and which fingers are used to lift the lid. Thus, if opening the box is part of a complex task being demonstrated by a human, the robot needs to recognize the goal of the human action and then employ its own actuators to achieve the same goal. Such an approach, known as goal-based imitation [6], acknowledges the fact that robots often

Fig. 1. Robotic tabletop task setup. (a): The robot is located on the left side of the work area and the Kinect® looks down from the left side from the robot's perspective. The three predefined areas that distinguish object states are annotated. (b): Toy tabletop objects.

have different actuators than humans but may still be able to achieve the same goals as a human demonstrator, albeit using different means. Goal-based imitation is thus a solution to the problem of heterogeneous imitation learning [7], where the human and robot have different actuators (in contrast, the traditional case of homogeneous imitation learning assumes the robot and human have the same actuator space).

In this paper, we present a goal-based framework for robotic imitation learning. The framework utilizes graphical models for probabilistic reasoning and learning under uncertainty. Our approach utilizes three components: (1) a self-discovery module that allows the robot to learn probabilistic state-transition models through self-exploration¹, (2) a goal inference module that allows the robot to infer the intention of a human demonstrator using its learned state-transition models, and (3) a module for goal-based imitation that uses the inferred goal and the learned state-transition model to choose the action most likely to achieve the goal.

In contrast to many previous approaches to human action understanding and activity recognition (e.g., [9]), our approach leverages robotic self-discovery for learning probabilistic models that can subsequently be used for interpreting human actions. Thus, rather than relying on pre-wired models of

¹For the preliminary studies in this paper, we used a simple exhaustive exploration strategy, but a more sophisticated approach based on reward functions (e.g., [8]) could also be used.


Fig. 2. Graphical models: (a) state-transition model, (b) action inference, (c) goal-based imitation model, (d) goal inference, (e) goal-based action inference, and (f) state prediction. Shaded nodes denote evidence (known values).

human actions or labeled human data, our method allows the robot to learn increasingly sophisticated models of action based on interactions with the world and use these models for interpreting the actions of others. Such a scalable approach to action understanding is consistent with recent cognitive theories of human intention recognition, such as the "Like me" hypothesis [10], which states that children utilize internal models learned through self-experience to interpret the acts and goals of others.

We demonstrate the feasibility of the proposed approach in a human-robot collaborative learning task in which the robot learns to organize objects of different shapes on a table. The robotic system consists of a robotic arm-and-gripper and an RGBD camera for visual feedback. Due to the physical limitations of the arm-and-gripper system, some objects may be hard or impossible to grasp but easier to push, while others may be reliably picked. In such a setting, attempting to imitate the actual trajectories followed by the human hand and fingers will likely result in failure, but a goal-based approach may succeed. We demonstrate that (a) the robot can learn probabilistic models of its actions on objects through self-exploration, and (b) use these models for both goal-inference and goal-based imitation of human actions on objects. Our results additionally demonstrate that the robot can leverage its probabilistic models to seek human assistance whenever it finds that its inferred actions are too uncertain, risky, or impossible to perform.

The paper is organized as follows. We explain the learning and inference algorithms in Section II. We then describe the hardware setup and the robot's task in Section III. We show the inference results and provide a discussion in Section IV, and finally conclude by discussing some limitations and future work in Section V.

II. LEARNING THROUGH GRAPHICAL MODELS

In this section, we describe how probabilistic graphical models can be used for learning state-transitions, inferring goals, and performing goal-based imitation. We first describe the task briefly to provide some context and then explain the algorithms.

A. Overview of the task

The environment consists of a set of objects on a tabletop which can be moved around by the human or the robotic arm. The position and in-plane rotation of an object define its state. We define goals based on whether the object has reached a particular state. The robotic arm can manipulate the state of any object using a set of actions. The robotic arm needs to learn probabilistic models over the states, actions, and goals.

B. Notations

Let ΩX be the set of states in the environment, and let ΩA be the set of all possible actions available to the robot (these can be different from the possible actions of the human demonstrator). Let ΩG be the set of possible goals. We assume all three sets are finite. Each goal G represents an abstract task, which can be achieved using one or more actions in ΩA. The basic template for the probabilistic graphical model we use is the one-step state-transition model in Fig. 2(a). Initially, the robot is in state Xi; when it executes action A, it stochastically enters a state Xf as determined by the transition probability P(Xf | Xi, A).

C. Learning through self-experience

We assume that the transition probability distribution is initially unknown to the robot and must be learned through exploration, similar to the manner in which infants explore the consequences of their actions through exploration and "body babbling" [11]. The robot collects training data tuples ⟨x, a, x′⟩ for the graphical model in Fig. 2(a), where x, x′ ∈ ΩX and a ∈ ΩA. Given the training data, simple maximum likelihood parameter estimation is used to learn the transition probability distribution.
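As a concrete illustration, the sketch below estimates a discrete transition table by normalized counting over ⟨x, a, x′⟩ tuples. The state and action labels mirror the tabletop task described later, but the exploration tuples shown here are hypothetical placeholders, not the data collected in our experiments.

```python
from collections import defaultdict

STATES = ["LEFT", "RIGHT", "OFFTABLE"]                     # Omega_X
ACTIONS = ["PlaceL", "PlaceR", "PlaceOT",                  # Omega_A
           "PushL", "PushR", "PushOT"]

# Hypothetical exploration tuples <x, a, x'> collected during "body babbling".
exploration_data = [
    ("LEFT", "PushR", "RIGHT"),
    ("LEFT", "PushR", "LEFT"),
    ("LEFT", "PlaceR", "RIGHT"),
    ("RIGHT", "PushOT", "OFFTABLE"),
]

def learn_transition_model(data):
    """Maximum-likelihood estimate of P(Xf | Xi, A) from <x, a, x'> tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, a, x_next in data:
        counts[(x, a)][x_next] += 1
    model = {}
    for (x, a), next_counts in counts.items():
        total = sum(next_counts.values())
        model[(x, a)] = {xf: next_counts.get(xf, 0) / total for xf in STATES}
    return model

transition = learn_transition_model(exploration_data)
print(transition[("LEFT", "PushR")])   # e.g., {'LEFT': 0.5, 'RIGHT': 0.5, 'OFFTABLE': 0.0}
```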

D. Goal-based graphical models and goal inference

After the transition parameters are learned, goal-based graphical models can be learned as follows. We consider the case where "goals" are labels for abstract tasks, such as moving an object from location A to location B. The object could be picked and placed at B, pushed to B, or transported using some other action, but the goal remains the same. We capture this notion of abstract goals as follows. For a goal (or task) g, the initial state and desired final state (Xi and Xf) are identified and the marginal distribution Pr(A|Xi, Xf) is computed using simple Bayesian inference in the graphical model shown in Fig. 2(b). A goal-based model is then created by augmenting the initial model in Fig. 2(a) with a new node G as shown in Fig. 2(c). This provides a compact way of representing and reasoning about abstract goals. For each specific goal or task g, we set the conditional Pr(A|Xi, G = g) = Pr(A|Xi, Xf), where Xi and Xf are the initial and desired states for goal g.
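A minimal sketch of this construction, assuming a uniform prior over actions: Pr(A | Xi = xi, Xf = xf) ∝ Pr(Xf = xf | Xi = xi, A) Pr(A), and the resulting distribution is copied into the conditional for the goal node. The transition values below are hypothetical placeholders.

```python
STATES = ["LEFT", "RIGHT", "OFFTABLE"]
ACTIONS = ["PlaceL", "PlaceR", "PlaceOT", "PushL", "PushR", "PushOT"]

# Hypothetical learned transition table P(Xf | Xi = LEFT, A) for one object type.
transition = {
    ("LEFT", a): ({"LEFT": 0.1, "RIGHT": 0.8, "OFFTABLE": 0.1}
                  if a in ("PlaceR", "PushR")
                  else {"LEFT": 0.9, "RIGHT": 0.05, "OFFTABLE": 0.05})
    for a in ACTIONS
}

def action_posterior(xi, xf, prior=None):
    """Pr(A | Xi=xi, Xf=xf) by Bayesian inversion of the transition model."""
    prior = prior or {a: 1.0 / len(ACTIONS) for a in ACTIONS}   # uniform prior over actions
    unnormalized = {a: transition[(xi, a)][xf] * prior[a] for a in ACTIONS}
    z = sum(unnormalized.values())
    return {a: p / z for a, p in unnormalized.items()}

# Define a goal g = "LEFT->RIGHT" by its initial and desired final state,
# then set Pr(A | Xi, G = g) = Pr(A | Xi, Xf) as in Fig. 2(c).
goal_conditional = {("LEFT", "LEFT->RIGHT"): action_posterior("LEFT", "RIGHT")}
print(goal_conditional[("LEFT", "LEFT->RIGHT")])
```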

For goal inference, we assume the robot observes xi and xf from a human demonstration (e.g., a change in the state of an object, such as from location A to B). The robot then computes the posterior distribution over goals G given xi and xf, as depicted in Fig. 2(d) (note that the variable A is marginalized out during inference).


Algorithm 1 Learning Through Self-Experience(ΩX, ΩA)
1: for all x ∈ ΩX and a ∈ ΩA pairs do
2:   for k = 1 → n do
3:     Execute action a and record observed state x′.
4:   end for
5:   Compute Pr(Xf | Xi = x, A = a) based on observed x′s.
6: end for
7: for all Xi and Xf pairs do
8:   Compute Pr(A | Xi, Xf).
9:   Determine g using Xi, Xf.
10:  Set Pr(A | Xi, G = g) = Pr(A | Xi, Xf).
11: end for
12: Construct graphical model G from Pr(Xf | Xi, A) and Pr(A | Xi, G = g).

Algorithm 2 Goal Inference and Action Selection(G, xi, xf, τ)
1: Compute Pr(G | Xi, Xf) using G and the junction tree algorithm.
2: gMAP ← arg max Pr(G | Xi = xi, Xf = xf)
3: aMAP ← arg max Pr(A | Xi = xi, Xf = xf, G = gMAP)
4: if Pr(Xf = xf | Xi = xi, A = aMAP, G = gMAP) < τ then
5:   af = ask human
6: else
7:   af = aMAP
8: end if
9: return af


E. Goal-based imitation and action selection

Goal-based imitation is implemented as a two-stage process: first, we infer the likely goal of the human demonstration using the graphical model described above, and second, we either execute the action most likely to achieve this goal or seek human assistance.

In more detail, the robot senses the current state x using its sensors and infers the human's goal g by taking the mode of the posterior distribution of G from the goal-inference step. It then computes the posterior over actions A as shown in Fig. 2(e) and selects the maximum a posteriori action aMAP. Since aMAP is not guaranteed to succeed, we predict the probability of reaching the desired final state x′ by computing the posterior probability of Xf as shown in Fig. 2(f). If this probability of reaching the desired state is above a prespecified threshold τ, the robot executes aMAP; otherwise it executes the "Ask human" action to request human assistance.

Algorithm 1 and Algorithm 2 summarize the complete sequence of steps in our proposed approach.
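The sketch below mirrors the structure of Algorithm 2 for a small discrete model. Because the model is tiny, the goal posterior is computed by direct enumeration rather than the junction tree algorithm used in our implementation; the transition and goal-action tables are hypothetical placeholders.

```python
STATES = ["LEFT", "RIGHT", "OFFTABLE"]
ACTIONS = ["PlaceL", "PlaceR", "PlaceOT", "PushL", "PushR", "PushOT"]
GOALS = ["LEFT->RIGHT", "LEFT->OFFTABLE", "RIGHT->LEFT", "RIGHT->OFFTABLE"]

# Hypothetical learned tables for one object type (only the Xi = LEFT column, for brevity).
transition = {("LEFT", a): {"LEFT": 0.2, "RIGHT": 0.7, "OFFTABLE": 0.1} for a in ACTIONS}
transition[("LEFT", "PushOT")] = {"LEFT": 0.3, "RIGHT": 0.1, "OFFTABLE": 0.6}
# P(A | Xi, G): which actions each abstract goal prefers from a given initial state.
goal_action = {
    ("LEFT", "LEFT->RIGHT"): {a: (0.4 if a in ("PlaceR", "PushR") else 0.05) for a in ACTIONS},
    ("LEFT", "LEFT->OFFTABLE"): {a: (0.4 if a in ("PlaceOT", "PushOT") else 0.05) for a in ACTIONS},
}

def goal_inference_and_action_selection(xi, xf, tau=0.5):
    # Goal posterior: Pr(G | xi, xf) ∝ Σ_A Pr(xf | xi, A) Pr(A | xi, G), uniform prior over G.
    scores = {}
    for g in GOALS:
        if (xi, g) in goal_action:
            scores[g] = sum(transition[(xi, a)][xf] * goal_action[(xi, g)][a] for a in ACTIONS)
    g_map = max(scores, key=scores.get)
    # MAP action under the inferred goal: Pr(A | xi, xf, g) ∝ Pr(xf | xi, A) Pr(A | xi, g).
    a_scores = {a: transition[(xi, a)][xf] * goal_action[(xi, g_map)][a] for a in ACTIONS}
    a_map = max(a_scores, key=a_scores.get)
    # Predicted probability of actually reaching xf; below tau, ask the human for help.
    p_success = transition[(xi, a_map)][xf]
    return ("ask human", g_map) if p_success < tau else (a_map, g_map)

print(goal_inference_and_action_selection("LEFT", "RIGHT"))
```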

Fig. 3. Object bounding boxes: We use a simple rotated bounding box detection algorithm provided by OpenCV. The detected bounding boxes are later used for grasping.

III. ROBOTIC LEARNING TASK

A. Hardware setup

For our experiments, we use the Gambit robot arm-and-gripper designed at Intel Labs Seattle. The Gambit is well-suited to tabletop manipulation tasks and has previously been shown to perform well in tasks with humans in the loop [12]. It has seven controllable degrees of freedom (DoFs): base rotation, shoulder joint, elbow joint, forearm rotation, two wrist joints, and a parallel jaw gripper. The three revolute DoFs (base, shoulder, and elbow) provide position control while the other DoFs (forearm and two wrist joints) provide orientation control via a roll-pitch-roll spherical wrist. For our task, the tabletop working area is a circle with a radius of ≈60 cm. The low-level Gambit driver software is built on top of ROS (Robot Operating System) [13], which runs on a dedicated Intel Atom net-top PC. Application programs running on a separate computer command the arm using high-level drivers.

For sensing the current state of objects on the table, we use the Kinect® camera, which provides a stream of registered color and depth images and has a working range of approximately 0.5 m to 5 m. The camera is mounted on the base frame of the Gambit and looks down on the table surface as shown in Figure 1. The location of the camera was chosen to maximize the view of the robot's work area while minimizing physical interference with the robot arm movements.

We defined three possible areas that are used for defining the state of the objects, as shown in Fig. 1. These areas were chosen so that the robot's gripper can reach them without inverse kinematics failures.

The robot takes as input the stream of color and depth images from the Kinect camera and first segments out the background and the human's hands. The remaining pixels correspond to the objects, which are then used to determine the state of objects on the table. Once the object state is identified, the robot can perform a fixed set of actions on the object. We now describe these components in detail.

B. Segmenting the Image Stream

For every color-depth image, we use a background subtraction technique on the depth channel to remove the background. We do this by noting the background depth at each pixel when the system starts. For each subsequent frame, we segment out the pixels for which the depth is greater than or equal to the background depth at that pixel. To make our system robust to noise in the Kinect's depth channel, any depths within 15 mm of the background depth are also segmented out.
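A minimal sketch of this step using NumPy on a registered depth frame; the 15 mm tolerance follows the description above, while the array names and the way frames are supplied are placeholders (real frames would come from the Kinect driver).

```python
import numpy as np

DEPTH_NOISE_MM = 15  # tolerance around the background depth, as described in the text

def capture_background(depth_frame_mm: np.ndarray) -> np.ndarray:
    """Record the per-pixel background depth when the system starts."""
    return depth_frame_mm.astype(np.float32)

def foreground_mask(depth_frame_mm: np.ndarray, background_mm: np.ndarray) -> np.ndarray:
    """Keep pixels that are clearly closer to the camera than the recorded background."""
    depth = depth_frame_mm.astype(np.float32)
    is_background = depth >= (background_mm - DEPTH_NOISE_MM)
    return ~is_background  # True where a hand or object is present

# Hypothetical usage with synthetic frames.
background = capture_background(np.full((480, 640), 900, dtype=np.uint16))
frame = background.copy()
frame[100:150, 200:260] = 850      # an object 50 mm above the table surface
mask = foreground_mask(frame, background)
print(mask.sum(), "foreground pixels")
```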


Fig. 4. The push to RIGHT action, before and after: The push to LEFT action works similarly. To push off the table, the arm goes down to the lower side of the table. For the place actions, we use actions almost identical to the chess-piece pick-up actions used in [12].


The remaining foreground pixels consist of the human's hands and the objects. We learn a color model for skin in the HSV color space and classify the foreground pixels as skin or not. We then do connected-components analysis on the skin pixels and remove the components which connect to the table edge. This is based on the heuristic that the human's arms will always be coming from outside the table region and hence the hand pixels will be connected to the table edges.

After segmenting out the hand pixels, the remaining pixels correspond to the objects. Doing a connected-component analysis on these pixels gives us the pixel components corresponding to each of the objects. For our experiments, we assume that: (1) the maximum number of objects on the table is known ahead of time, and (2) two objects cannot be created or removed in a single image frame. These reasonable assumptions allow object identification without requiring a complex object recognition engine.
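The following sketch illustrates the hand-removal heuristic and the object labeling step with OpenCV's connected-components routines; the HSV skin thresholds and mask shapes are hypothetical stand-ins for the learned skin color model.

```python
import cv2
import numpy as np

# Hypothetical HSV range standing in for the learned skin color model.
SKIN_LO = np.array([0, 40, 60], dtype=np.uint8)
SKIN_HI = np.array([25, 180, 255], dtype=np.uint8)

def remove_hands(foreground: np.ndarray, bgr: np.ndarray) -> np.ndarray:
    """Drop skin-colored components touching the image border (arms reach in from outside).

    `foreground` is a boolean mask from the depth background subtraction step.
    """
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, SKIN_LO, SKIN_HI) > 0
    num, labels = cv2.connectedComponents((skin & foreground).astype(np.uint8))
    border = np.zeros_like(foreground)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    objects = foreground.copy()
    for lbl in range(1, num):
        component = labels == lbl
        if (component & border).any():        # touches the table edge -> treat as hand/arm
            objects[component] = False
    return objects

def label_objects(objects: np.ndarray):
    """Connected-component analysis over the remaining pixels, one component per object."""
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(objects.astype(np.uint8))
    return [(centroids[i], stats[i, cv2.CC_STAT_AREA]) for i in range(1, num)]
```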

C. Object State Estimation

We use discrete states to characterize the current state of an object. The experiments in this paper assumed three discrete states for objects based on their location: (1) "LEFT", signifying that the object is on the left side of the blue line on the table as shown in Figure 1; (2) "RIGHT", denoting that the object is on the right side of the blue line; and (3) "OFFTABLE", signifying that the object is not in the tabletop work area. For each object identified on the table by the method in the previous section, we determine its state based on the location of its centroid position. We also fit a bounding box to each identified object. The centroid position and rotation angle of the bounding box are stored with the object state information for later robotic manipulation as needed.
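A small sketch of the state assignment and the rotated bounding-box fit (Fig. 3); the table-coordinate conversion, the position of the dividing blue line, and the off-table test used here are hypothetical, since they depend on the camera calibration and table geometry.

```python
import cv2
import numpy as np

BLUE_LINE_X_CM = 0.0    # hypothetical x-coordinate of the dividing line in table coordinates
TABLE_RADIUS_CM = 60.0  # tabletop working area radius from the hardware description

def object_state(centroid_table_cm) -> str:
    """Classify an object's discrete state from its centroid in table coordinates."""
    x, y = centroid_table_cm
    if np.hypot(x, y) > TABLE_RADIUS_CM:   # assumed off-table test: outside the work circle
        return "OFFTABLE"
    return "LEFT" if x < BLUE_LINE_X_CM else "RIGHT"

def rotated_bounding_box(component_mask: np.ndarray):
    """Fit a rotated bounding box (center, size, angle) to one object's pixel mask."""
    points = cv2.findNonZero(component_mask.astype(np.uint8))
    return cv2.minAreaRect(points)   # ((cx, cy), (w, h), angle), later used for grasping

mask = np.zeros((480, 640), dtype=np.uint8)
cv2.ellipse(mask, (300, 240), (40, 20), 30, 0, 360, 255, -1)  # synthetic object blob
print(object_state((-12.0, 5.0)), rotated_bounding_box(mask))
```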

D. Robot Actions

The Gambit arm was endowed with a fixed set of six high-level actions for manipulating objects: place LEFT (placeL), place RIGHT (placeR), place OFFTABLE (placeOT), push to LEFT (pushL), push to RIGHT (pushR), and push OFFTABLE (pushOT). For computing these actions, we use a custom inverse kinematics solver used successfully in a previous project involving the Gambit [12].

For the "place" actions, the robot first attempts to pick up the object by moving its end effector above the centroid of the object and rotating the gripper to align itself perpendicular to the major axis of the object (determined by the bounding box). If the robot successfully picks up the object, it places the object down at the location (LEFT, RIGHT, or OFFTABLE) indicated by the place command.

For the "push" actions, the robot first positions its end effector behind the object based on its centroid and the direction of the push. For pushL and pushR, the gripper yaw angle is rotated perpendicular to the major axis of the table, while for pushOT, it is rotated parallel to the major axis of the table. This ensures that the object contact area is maximized to reduce the chance of the object slipping while pushing. The robot pushes the object until it changes state (or the inverse kinematics solver fails to find a possible solution).

IV. EXPERIMENTS AND RESULTS

To verify the feasibility of our proposed approach, we explore a tabletop organization task involving plastic objects of three different shapes, resembling a pear, a lemon, and a miniature broiled chicken. These objects could be in one of three different states: LEFT, RIGHT, and OFFTABLE, as defined in the previous section. The robot's aim is to learn and imitate the actions of humans on these objects as they organize these objects on the table.

The objects were chosen to demonstrate the different types of challenges inherent in the two types of manipulation actions available to the robot: place and push. The pear-shaped objects are pointed and slippery at the top, which makes them almost impossible for the robot to successfully pick up for a place action. On the other hand, the wide bodies of these objects make them easy to push. The lemon-shaped objects are hard to manipulate using either place or push actions because their spherical shape makes them both hard to pick up (because they can slip through the gripper) and push (because they can roll away from the gripper). For humans, none of these objects poses a challenge for either picking or pushing.

Fig. 5 shows the learned transition probabilities based on the robot's self-discovery phase. The transition models are learned by performing 10 trials for each initial state and action pair (Xi, A) for each object type. We deliberately used a small number of trials to test whether the method could cope with less training data and more uncertainty in the transition model.

Since the state space is small, we are able to enumerate and test all of the interesting possible human demonstrations. By interesting, we mean that the state changes after action execution (for example, from RIGHT to OFFTABLE) and the initial state is not OFFTABLE.² There are a total of four interesting state changes for each object that can be demonstrated by a human: (1) LEFT to RIGHT, (2) LEFT to OFFTABLE, (3) RIGHT to LEFT, and (4) RIGHT to OFFTABLE.

Fig. 6(a) shows the inferred goals given all possible interesting initial and final state transitions, using the graphical model in Fig. 2(a).

²The current implementation does not allow the robot to pick up objects that are located OFFTABLE.


Fig. 5. Learned transition model. Each row of plots corresponds to a different type of object (Pear, Lemon, Chicken), and each column corresponds to a different initial state Xi (LEFT, RIGHT). The colors of the bars represent the different final states Xf. The y-axis represents the probability and the x-axis represents the manipulation actions available to the robot (PL = place LEFT, PR = place RIGHT, PO = place OFFTABLE, UL = pUsh LEFT, UR = pUsh RIGHT, UO = pUsh OFFTABLE). We do not show actions that cause self-transitions given an initial state.

TABLE I
COMPARISON OF APPROACHES: THE SUCCESS RATES OF THREE DIFFERENT APPROACHES TO IMITATION ARE SHOWN.

                              Trajectory   Goal-Based   Interactive Goal-Based
Pick & Place Demonstration
  Pear                          0.2250       0.6750       0.8250
  Lemon                         0.5250       0.6750       0.6750
  Chicken                       0.6500       0.8000       0.8000
Push Demonstration
  Pear                          0.6750       0.6750       0.7250
  Lemon                         0.5750       0.6750       0.6750
  Chicken                       0.7250       0.8000       0.8000

For all cases, our model correctly infers the intended goal state of the human. Fig. 6(b) shows the MAP action for a given initial state and goal. Our model correctly identifies whether a "place" or a "push" action is better, given the dynamics of the object as encoded in the learned graphical model for the object. Note that the two most preferred actions are always push or place actions in the correct direction. Finally, Fig. 6(c) shows the predicted state distribution given an action and an initial state. The robot calculates the posterior probability of getting to the desired state, and executes the action if this probability is above a predetermined threshold, or asks the human collaborator for help otherwise. In future implementations, we hope to select the threshold automatically based on a reward function and policy within a POMDP framework.

Table I compares trajectory-based imitation with our proposed goal-based approach. Using our computed transition probabilities, we calculate the hypothetical success rate of a purely trajectory-based approach, assuming that the trajectory-based approach tries to mimic the human action, i.e., execute a place action if the human executes a place, and a push action if the human executes a push. For our goal-based approach, we use the posterior distribution shown in Fig. 6(b). Finally, the "Interactive Goal-Based" mode assumes that the robot may ask a human for help, with a 100% success rate if the human executes the requested action. The third column in Table I shows what the performance would be if we require the robot to be at least 50% sure of reaching a desired state.³ These results demonstrate the advantage of a goal-based approach over purely trajectory-based imitation.
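One plausible way to derive such a comparison from a learned transition table is sketched below; the transition values and the exact evaluation protocol are illustrative assumptions, not the precise computation behind Table I.

```python
# Hedged sketch: success rates for the three imitation modes given P(Xf | Xi, A).
ACTIONS = ["PlaceL", "PlaceR", "PlaceOT", "PushL", "PushR", "PushOT"]

# Hypothetical P(Xf | Xi = LEFT, A) entries for one object, for two relevant actions.
transition_left = {
    "PlaceR": {"LEFT": 0.6, "RIGHT": 0.3, "OFFTABLE": 0.1},
    "PushR":  {"LEFT": 0.2, "RIGHT": 0.75, "OFFTABLE": 0.05},
}

def trajectory_based(demo_action, xf):
    """Mimic the human's action type directly (place if the human placed, push if pushed)."""
    return transition_left[demo_action][xf]

def goal_based(xf):
    """Pick the robot action with the highest probability of reaching the inferred goal state."""
    return max(p[xf] for p in transition_left.values())

def interactive_goal_based(xf, tau=0.5):
    """As goal-based, but count asking the human as a success when confidence is below tau."""
    best = goal_based(xf)
    return 1.0 if best < tau else best

print(trajectory_based("PlaceR", "RIGHT"), goal_based("RIGHT"), interactive_goal_based("RIGHT"))
```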

Fig. 6. (a) Goal inference: most likely goals. Initial and final states are at the top of each column. The height of the bar represents the posterior probability of each goal state, with the true goal state marked by an asterisk. (b) Action inference: inferring actions. For each initial and desired final state, the plots show the posterior probability of each of the six actions, with the MAP action indicated by an asterisk. (c) Final state prediction: predicting the final state. The plots show the posterior probability of reaching the desired final state, given the initial state. The red bar marks 0.5, the threshold below which the robot asks for human help in the Interactive Goal-Based mode.


³We do not see perfect imitation results in the third column because we do not ask the human for help in every case. In some cases, the probability of success surpasses the confidence threshold, but the action still fails to make the state transition successfully.


V. SUMMARY AND CONCLUSION

We have proposed a new approach to robotic imitation learning based on inference of goals in learned graphical models. We demonstrated, using a simple tabletop organization task, that a robot can: (a) learn probabilistic models of the consequences of its actions on objects through self-exploration, (b) use these learned models to infer the goal of a human action on these objects, and (c) perform goal-based imitation. Our results show how a goal-based approach can allow imitation learning even when the robot has different actuators and motor capabilities from a human. Our results also illustrate the usefulness of probabilistic models for human-robot interaction (HRI): the robot uses its estimates of uncertainty to decide when to execute a computed action and when to seek human assistance, avoiding accidents while maximizing human-robot collaborative throughput.

Our results point to a number of interesting open issues. First, learning through self-exploration is impractical for large state spaces; following the example of human infants, some form of directed self-exploration, based, for example, on observing human teachers, may constitute a viable alternative for learning probabilistic models of the robot's own actions. Alternately, the probabilistic models could be expressed using continuous state representations and nonparametric Bayesian models such as Gaussian processes (e.g., [14]).

Finally, the approach we have presented lends itself naturally to generalization based on relational probabilistic models [15], [16] and hierarchical Bayesian representations [17]. Such models have the potential to significantly increase the scalability and applicability of our suggested approach to large-scale scenarios, besides facilitating transfer of learned skills across tasks and domains. We intend to explore such relational models in future work.

ACKNOWLEDGMENTS

This research was supported by ONR Cognitive Science program grant no. N000140910097 and by the Intel Science and Technology Center (ISTC), Seattle. The authors would like to thank Cynthia Matuszek and Kendall Lowrey for their advice and help during the initial stages of this project.

REFERENCES

[1] P. Kormushev, S. Calinon, and D. Caldwell, "Robot motor skill coordination with EM-based reinforcement learning," in Proc. IEEE/RSJ Intl Conf. on Intelligent Robots and Systems (IROS), 2010, pp. 3232–3237.
[2] P. Abbeel, A. Coates, and A. Ng, "Autonomous helicopter aerobatics through apprenticeship learning," The International Journal of Robotics Research, 2010.
[3] J. Kober and J. Peters, "Policy search for motor primitives in robotics," Advances in Neural Information Processing Systems, vol. 21, pp. 849–856, 2009.
[4] S. Calinon, F. Guenter, and A. Billard, "On learning, representing, and generalizing a task in a humanoid robot," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 37, no. 2, pp. 286–298, 2007.
[5] D. Grimes, R. Chalodhorn, and R. Rao, "Dynamic imitation in a humanoid robot through nonparametric probabilistic inference," in Proceedings of Robotics: Science and Systems, 2006.
[6] D. Verma and R. Rao, "Goal-based imitation as probabilistic inference over graphical models," Advances in Neural Information Processing Systems, vol. 18, p. 1393, 2006.
[7] B. Price and C. Boutilier, "Imitation and reinforcement learning in agents with heterogeneous actions," in Seventeenth International Conference on Machine Learning (ICML 2000), 2000.
[8] M. Deisenroth and C. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," 2011.
[9] L. Liao, D. Patterson, D. Fox, and H. Kautz, "Learning and inferring transportation routines," Artificial Intelligence, vol. 171, no. 5, pp. 311–331, 2007.
[10] A. Meltzoff, "Like me: a foundation for social cognition," Developmental Science, vol. 10, no. 1, pp. 126–134, 2007.
[11] R. Rao, A. Shon, and A. Meltzoff, "A Bayesian model of imitation in infants and robots," in Imitation and Social Learning in Robots, Humans, and Animals, pp. 217–248, 2004.
[12] C. Matuszek, B. Mayton, R. Aimi, M. Deisenroth, L. Bo, R. Chu, M. Kung, L. LeGrand, J. Smith, and D. Fox, "Gambit: An autonomous chess-playing robotic system," in Robotics and Automation (ICRA), 2011 IEEE International Conference on, 2011, pp. 4291–4297.
[13] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Ng, "ROS: an open-source Robot Operating System," in ICRA Workshop on Open Source Software, vol. 3, no. 3.2, 2009.
[14] A. Friesen and R. Rao, "Gaze following as goal inference: A Bayesian model," 2011.
[15] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
[16] M. Richardson and P. Domingos, "Markov logic networks," Machine Learning, vol. 62, no. 1, pp. 107–136, 2006.
[17] J. Tenenbaum, C. Kemp, T. Griffiths, and N. Goodman, "How to grow a mind: Statistics, structure, and abstraction," Science, vol. 331, no. 6022, p. 1279, 2011.
