

Motion Reasoning for Goal-Based Imitation Learning

De-An Huang1,2, Yu-Wei Chao∗,2, Chris Paxton∗,2, Xinke Deng2,3, Li Fei-Fei1, Juan Carlos Niebles1, Animesh Garg2,4, Dieter Fox2,5

Abstract— We address goal-based imitation learning, where the aim is to output the symbolic goal from a third-person video demonstration. This enables the robot to plan for execution and reproduce the same goal in a completely different environment. The key challenge is that the goal of a video demonstration is often ambiguous at the level of semantic actions. The human demonstrators might unintentionally achieve certain subgoals in the demonstrations with their actions. Our main contribution is to propose a motion reasoning framework that combines task and motion planning to disambiguate the true intention of the demonstrator in the video demonstration. This allows us to robustly recognize goals that cannot be disambiguated by previous action-based approaches. We evaluate our approach by collecting a dataset of 96 video demonstrations in a mockup kitchen environment. We show that our motion reasoning plays an important role in recognizing the actual goal of the demonstrator and improves the success rate by over 20%. We further show that by using the automatically inferred goal from the video demonstration, our robot is able to reproduce the same task in a real kitchen environment.

I. INTRODUCTION

We are interested in allowing robots to learn new tasks from video demonstrations. Recently, there has been rapid progress in imitation learning [1–4], which even enables learning a new task from a single demonstration of the task [5–7]. By leveraging meta-learning [8], the robot learns to follow the actions in the demonstration. In many cases, however, the robot does not have to strictly follow the actions in the demonstration to complete the task. Instead, what matters more is the goal or the intention of the demonstrator [9–12]. For example, if the goal is to get a bowl, it does not matter which hand we use to pick up the bowl, or whether we get the bowl from the cabinet or the dishwasher. Understanding the intention of the human demonstrator is important for human-robot interaction [13] and can enable the robot to generalize to a wider range of scenarios [9].

There have been many works that aim to infer the intention of humans in both robotics [10, 14–17] and cognitive science [18–20]. However, these works are mostly limited to trajectory prediction in 2D environments, and goal recognition from real-world videos remains challenging. The main challenge of applying goal reasoning to real-world human demonstrations is that the goal or intention can be ambiguous: it cannot be fully determined by the final state, the sequence of high-level actions, or both together. For 2D trajectory prediction, it is reasonable to assume that

∗These authors contributed equally to the paper. 1Stanford University, 2NVIDIA, 3UIUC, 4University of Toronto, 5University of Washington.

Fig. 1 panels: Action 1 (in_storage?), Action 2 (on_stove), Action 1∗ (in_storage).

Fig. 1. The first action moved the cracker box into the storage and the second action moved the bowl onto the stove. However, in_storage is not necessarily intentional. One can also explain that Action 1 is moving the cracker box out of the way for Action 2. This is even more obvious if we compare Action 1 with Action 1∗ in the last row.

the final state defines the goal, because the goal is just the final location of the object. However, this is not the case in real-world demonstrations. Take the kitchen environment in Figure 1 as an example. There are many objects (sugar box, bowl, etc.) in the scene, and it is unclear what the goal of the demonstration is from just the final state or configuration.

An alternative is to recognize the sequence of actions performed in the demonstration [10]. In Figure 1, the person first moved the cracker box to the storage region with Action 1, and then moved the bowl onto the stove with Action 2. Based on these actions, one might infer that the goal is: in_storage(cracker_box), on_stove(bowl). However, the interpretation is not always unique. One can also interpret Action 1 as just moving the cracker box out of the way, so that it is not blocking Action 2 from moving the bowl onto the stove. In this case, although the person moved the cracker box, it is not part of the goal. This becomes clearer if we compare Action 1 with Action 1∗ in the last row. It is more likely in Action 1∗ that the person is intentionally moving the cracker box into the storage. We as humans can reliably interpret others' intentions despite the multiple explanations for their actions. This ability to reason about the actual intention in ambiguous scenarios allows us to interact smoothly and generalize to drastically different scenarios.

The next question is: How can our system differentiate Action 1 and Action 1∗? Or how can our system determine if the person is moving the cracker box intentionally to achieve in_storage(cracker_box) or just moving it out of the


way in Figure 1? Prior plan recognition approaches [16, 21, 22] are not directly applicable because the trajectory of the demonstration is already completed. On the other hand, action recognition based approaches [10, 14], which model the movements of the objects independently, are not capable of determining whether moving the cracker box into the storage is intentional or not.

Our key observation is that by moving the cracker box, the demonstrator actually achieves two things simultaneously. In addition to in_storage(cracker_box), the new location of the cracker box also puts it out of the way for moving the bowl onto the stove. Given this alternative explanation of the demonstrator's action, it becomes less likely that the demonstrator moved the cracker box to achieve in_storage(cracker_box). We refer to being out of the way as a motion predicate achieved by moving the cracker box, to differentiate it from the standard predicates. We refer to the standard predicates like in_storage(cracker_box) as task predicates. Consequently, the question is to decide whether an action aims to achieve the motion predicate, the task predicate, or both.

Despite the potential alternative interpretations of the demonstration, we assume the demonstrator aims to cooperatively and unambiguously communicate the task goal through the demonstration [23]. In other words, the demonstrations are still intent-expressive or legible [13]. However, as we have discussed, the intention cannot be fully captured by the high-level actions in cases like Figure 1.

Our main contribution is to observe that the legibility of such demonstrations thus resides in the low-level trajectories or motions instead of the high-level actions. This legible motion hypothesis allows us to formulate the decision between task and motion predicates as inverse planning [18].

We evaluate our hypothesis by collecting a dataset of real video demonstrations conditioned on a given set of goals within the kitchen domain. Our results show that our inverse planning formulation based on the task and motion predicates is able to reliably infer the intention or the goal of demonstrations. This provides the fundamental basis for goal-based imitation learning from real-world videos. To demonstrate its utility, we apply our goal recognition approach to address third-person imitation from observation [24, 25], where the robots need to execute the tasks based on demonstrations from different environments and demonstrators. Based on the demonstrations collected in a mockup tabletop kitchen, our robotic system is able to infer the demonstrated goal or high-level concepts [26] and reproduce the same goal with a robot in a real kitchen.

II. RELATED WORK

Goal and Intention Recognition. We aim to recognize the goal or intention of a video demonstration. Related problems have been explored in plan and goal recognition [3, 15, 16, 21, 22, 27], goal-based imitation [9, 10, 12], and Bayesian Theory of Mind [28, 29]. Understanding the intentions of the agents is important for multi-agent systems [30, 31] and human-robot interaction [32, 33]. In our work, we introduce motion reasoning to task-based goal recognition. This allows us to go beyond early recognition of 2D trajectory end points.

Imitation from Observation. We recognize the goal of a video demonstration and use it for the robot to imitate the demonstrated task. This imitation-from-observation [24, 34] setup does not require the demonstrations to be from the same agent, or even the same environment [25, 35]. Moreover, we address goal-based imitation learning from just a single demonstration of the task. This is related to recent progress on one-shot imitation learning [5–7, 36]. Instead of collecting large amounts of training data for end-to-end training, we explicitly reason about the object trajectories in the demonstration, which is more data-efficient.

Interpretable Robot Motion. We resolve the goal ambiguity of high-level actions by reasoning about the low-level object trajectories, assuming the trajectories are intent-expressive or legible [37]. Generating intent-expressive actions has been an important area of human-robot interaction [13, 38–41] because the generated actions need to unambiguously communicate intent. Instead of generating these motions, we aim to recognize intent-expressive trajectories.

Task and Motion Planning (TAMP). We introduce motion reasoning to task goal recognition. We treat the demonstrator as a task and motion planning (TAMP) [42–44] agent, instead of just a motion [18] or task planning [14] agent, when recognizing the goal. Our motion predicate indicates how the poses of the objects affect the trajectories of other actions, which is related to the semantic attachments [44] and the geometric constraints [45] in TAMP.

III. METHOD

We address goal-based imitation learning from real-world videos. Given a video demonstration of a task, we aim to output the symbolic goal of the task. This symbolic goal can then be used as input for robotic systems to reproduce the task in potentially different environments. The main challenge of goal recognition from real-world videos is that there exist multiple interpretations of the goal based on just the final state and the high-level actions in the demonstration. Our key observation is that the low-level motion trajectories are thus intent-expressive. This allows us to formulate the problem as recognizing whether the motion predicate, the task predicate, or both are generating the trajectories. By performing motion reasoning based on inverse planning, we are able to interpret the actual goal of the demonstration beyond just the final state and the sequence of high-level actions. Figure 2 shows an overview of our approach.

We will first discuss our goal-based imitation from observation setup in Section III-A. This goal-based formulation enables learning from video observations of different environments and contexts. Next, we will introduce the proposed motion predicate for goal-based imitation learning and how we can use motion reasoning to determine if the task or the motion predicate is the intended subgoal for an action in Section III-B. Finally, we will include details of our visual perception pipeline in Section III-C.


Fig. 2 pipeline labels: Input Video → Obj Poses → Video Segmentation and Task Predicate (Segment 1: in_storage(cracker_box); Segment 2: g2 = on_stove(bowl)) → Motion Reasoning (on_stove(bowl)) → Robot Execution.

Fig. 2. Overview of our approach. Given an input video demonstration, we first detect the object poses and temporally segment the video to recognize the task predicate g_i for each segment S_i. We then reason about whether the observed object trajectory ξ_1 can be better explained by the motion predicate m_1 (not blocking ξ_2) or the task predicate g_1. Once we have all the intentional task predicates, we can pool them together to obtain the goal G of the demonstration. This can be used for robot execution in a different environment.

A. Goal-Based Imitation from Observation

Given a demonstration D = [z_1, ..., z_T] of length T, the aim of goal-based imitation [12] is to output the intended goal G of the demonstration. We follow the imitation-from-observation setup [24], and thus the elements z_t in the demonstration do not necessarily correspond to the agent's state. In our setup, z_t is a video frame of a third-person human video demonstration D. We assume that both the human demonstration and the robot execution are based on the same task planning domain [46] (e.g., they share the same PDDL domain file). The domain contains a list of ground atoms or predicates (e.g., in_storage(cracker_box), on_stove(bowl)) and a set of grounded operators. Each grounded operator consists of its name, the list of arguments, the precondition (a list of predicates that need to be satisfied in order to apply the operator), and the effects (how the state changes by applying the operator). We also refer to the grounded operators as (high-level) actions.

Based on this definition, a goal G = [g_1, ..., g_N] consists of a list of N predicates g_i that need to be true to achieve the goal. We also refer to g_i as a subgoal or task predicate. We assume that g_i can only be selected from a known subset of all the predicates in the domain.
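To make this concrete, below is a minimal sketch (not the authors' implementation) of how the shared domain elements could be represented in Python; the class and field names are our own illustrative choices:

```python
from dataclasses import dataclass, field

# Predicates are ground atoms, e.g., ("in_storage", "cracker_box").
Predicate = tuple

@dataclass
class GroundedOperator:
    """A grounded operator: name, arguments, preconditions, and effects."""
    name: str                 # e.g., "move_cracker_box_to_storage"
    args: list                # e.g., ["cracker_box", "storage"]
    preconditions: list       # predicates that must hold to apply the operator
    effects_add: list = field(default_factory=list)  # predicates made true
    effects_del: list = field(default_factory=list)  # predicates made false

    def applicable(self, state) -> bool:
        return all(p in state for p in self.preconditions)

    def apply(self, state):
        assert self.applicable(state)
        return (set(state) - set(self.effects_del)) | set(self.effects_add)

# A goal G is a list of task predicates g_i that must all hold.
goal = [("in_storage", "cracker_box"), ("on_stove", "bowl")]

def goal_satisfied(goal, state) -> bool:
    return all(g in state for g in goal)
```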

This goal-based formulation naturally enables imitation from observation in different environments. Given a video demonstration D_e1 in environment e1, as long as we can extract the goal G, then G can be used to define the task planning problem in a new environment e2 ≠ e1. The robot can then execute the output plan in e2 based on the goal G extracted from D_e1. In addition, G is also independent of the agent's state and motion as long as the task domain definition is shared between e1 and e2. In our experiments, the human demonstration can be in a remote mockup kitchen e1, while the robot execution is done in a real kitchen e2. This generalizability across environments and agents is a key feature of our approach.
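As a sketch of this transfer, the extracted goal G can be handed to any off-the-shelf task planner for the new environment. The toy breadth-first planner below is our illustration, not the planner used in the paper; it reuses the GroundedOperator sketch above:

```python
from collections import deque

def plan(init_state: frozenset, goal: list, operators: list):
    """Find a sequence of grounded operators that achieves every g in goal,
    starting from the initial state of the new environment e2."""
    frontier = deque([(init_state, [])])
    visited = {init_state}
    while frontier:
        state, actions = frontier.popleft()
        if all(g in state for g in goal):
            return actions  # names of the operators to execute, in order
        for op in operators:
            if op.applicable(state):
                nxt = frozenset(op.apply(state))
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, actions + [op.name]))
    return None  # goal unreachable in e2
```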

B. Motion Reasoning for Video Goal Recognition

We have discussed our goal-based imitation from observation setup, and how it can be used for execution in different environments based on third-person human demonstrations.

Now we discuss how we go from the demonstration D = [z_1, ..., z_T] to the underlying goal G. Our approach consists of three main steps: (i) First, the demonstration D is temporally segmented into video segments {S_i}. Without loss of generality, we assume in this section that each segment S_i achieves one task predicate g_i by manipulating a single object. (ii) Second, for each video segment, the manipulation of the object also achieves a motion predicate in addition to the task predicate. We then extend inverse planning to decide if the video segment is intended to achieve the task predicate g_i, the motion predicate, or both. (iii) Finally, once we know all the intentional task predicates g_i achieved by all the segments S_i, we can pool them together based on the domain definition to get the final goal G = {g_i}. An overview of our approach is shown in Figure 2.

For video human demonstrations, step (i) is equivalent to action segmentation [47], which is a well-developed field in computer vision [48, 49]. We will go directly into our main contribution, step (ii), in this section; the details of step (i) will be discussed in Section III-C.

Assuming that we have temporally segmented the demonstration based on the changes of task predicates, we then have the segment $S_i = [z^i_1, \ldots, z^i_t, \ldots, z^i_{T_i}]$ consisting of the observations $z^i_t$ and the task predicate $g_i$ achieved in this segment. Based on the arguments of the task predicate, we also know the object $b_i$ being manipulated in this segment. Our next step is to decide if the task predicate $g_i$ achieved in the segment $S_i$ is actually part of the goal of the demonstration. Previous works for goal recognition from real-world demonstrations [10, 14] focus on the relationships between different predicates $g_i$. However, as we have discussed, for cases like Figure 1, we are unable to determine the intention simply based on $g_i$. Despite the ambiguity at the predicate level, we do know that the aim of a demonstration is to communicate the goal. In this case, there should exist other information to resolve this ambiguity at the predicate level.

We propose that the solution is to reason at the motion or object trajectory level between the segments. We observe that in addition to the task predicate g_i, the manipulation of object b_i also achieves something else. Now b_i is at a new location and has a new pose. Independent of the task


predicate g_i, this new pose of b_i can also enable object trajectories in later segments. Intuitively, we can think of high-level action based approaches [10, 14] as considering only the task planning aspect of goal recognition, and 2D trajectory approaches [16, 17] as considering only the motion planning aspect. Our approach instead considers both the task and motion planning aspects to infer the goal of the demonstration. Although moving the bowl onto the stove in the next segment S_{i+1} might symbolically include only in_hand(bowl) as its precondition, there is also implicitly a motion constraint [45] that a valid path must exist for moving the bowl onto the stove. We thus call satisfying this constraint, by moving b_i to create the valid path, the motion predicate m_i(b_i) achieved by the segment S_i.

There are easy cases. If moving the object b_i achieves a task predicate and does not create any valid paths for other objects, then the intention is just the task predicate. On the other hand, if no new task predicate is achieved and new valid paths are created by moving b_i, then the intention is the motion predicate. The challenging case is when both the task predicate g_i and the motion predicate m_i(b_i) become true in S_i: which one is the actual intended goal? Consider Figure 1 again: moving the cracker box both achieves in_storage(cracker_box) and creates a valid path for moving the bowl onto the stove. How do we know if in_storage(cracker_box) is part of the goal?
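The case analysis above can be summarized as a small decision rule; this is our sketch, where the two boolean inputs are assumed to come from the segment analysis:

```python
def classify_segment(achieves_task_predicate: bool, creates_valid_path: bool) -> str:
    """Easy cases are decided directly; the ambiguous case falls through to
    the inverse-planning comparison of Eqs. (1)-(2) below."""
    if achieves_task_predicate and not creates_valid_path:
        return "task"       # intention is the task predicate g_i
    if creates_valid_path and not achieves_task_predicate:
        return "motion"     # intention is the motion predicate m_i(b_i)
    return "ambiguous"      # both hold: resolve with motion reasoning
```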

After explicitly formulating the motion predicate, we can now apply the principle of rational action [50]: we assess whether the motion predicate or the task predicate would be more efficiently brought about by the observed object manipulation trajectory. This is in line with Bayesian inverse planning [18]. Let $\xi^i_{s \to q}$ be the trajectory of $b_i$ in $S_i$ starting from $s$ and ending at $q$. We can then decide the intention of $S_i$ by:

$$\arg\max_{g \in \{g_i,\, m_i(b_i)\}} P(g \mid \xi^i_{s \to q}), \quad \text{where } P(g \mid \xi^i_{s \to q}) \propto P(\xi^i_{s \to q} \mid g)\, P(g). \quad (1)$$

Following [13], we can derive

$$P(\xi^i_{s \to q} \mid g) \propto \frac{\exp(-C(\xi^i_{s \to q}) - C(\xi^{i*}_{q \to g}))}{\exp(-C(\xi^{i*}_{s \to g}))}, \quad (2)$$

where $\xi^{i*}_{s \to g}$ and $\xi^{i*}_{q \to g}$ are the optimal trajectories to achieve $g$ from $s$ and $q$, respectively, and $C(\cdot)$ computes the cost of a trajectory. We obtain the object trajectories by tracking the pose of each object. Each frame $z^i_t$ is represented by the poses of all the objects in the scene (details in Section III-C).

When $g$ is just a location in space, $\xi^{i*}_{s \to g}$ and $\xi^{i*}_{q \to g}$ are straightforward to compute. In our case, $g$ in Eq. (2) is either the task predicate or the motion predicate, which can be satisfied by a region in space instead of a single point. In this case, it is inefficient to directly discretize the space and run search algorithms. We thus use RRT* [51] from $s$ and $q$ to approximate the optimal trajectories to achieve $g_i$ and $m_i(b_i)$. In order to run RRT*, we first treat the other objects $b_j$, $j \neq i$, as obstacles and use the object poses at frame $t$ to compute the configuration space. When $g_i$ is a region in space, the success condition for RRT* is to reach the region. On the other hand,

Fig. 3. Example of our motion reasoning for $P(\xi^i_{s \to q} \mid m_i)$. The cost of the trajectory $C(\xi^i_{s \to q})$ can be estimated from the video (15 cm in the depicted example). We use RRT* to estimate the optimal costs $C(\xi^{i*}_{s \to m_i})$ (17 cm) and $C(\xi^{i*}_{q \to m_i})$ (4 cm). $P(\xi^i_{s \to q} \mid g_i)$ can be computed similarly.

$m_i(b_i)$ means that $b_i$ is not blocking the trajectory of $b_k$, $k \neq i$. We find the convex hull that covers the trajectory of $b_k$, and use RRT* to find the shortest path such that $b_i$ does not intersect this convex hull. Figure 3 shows an example of applying Eq. (2) to compute $P(\xi^i_{s \to q} \mid m_i(b_i))$.
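A minimal numeric sketch of this decision rule, assuming the observed and RRT*-approximated costs are already available (the function names and the example costs are illustrative, not from the paper):

```python
import math

def trajectory_likelihood(cost_obs_sq, cost_opt_qg, cost_opt_sg):
    """Eq. (2): P(xi_{s->q} | g) is proportional to
    exp(-C(xi_{s->q}) - C(xi*_{q->g})) / exp(-C(xi*_{s->g}))."""
    return math.exp(-(cost_obs_sq + cost_opt_qg)) / math.exp(-cost_opt_sg)

def infer_intention(cost_obs_sq, costs_task, costs_motion, prior_task=0.5):
    """Eq. (1): pick the predicate that better explains the observed trajectory.
    costs_task and costs_motion are (C(xi*_{q->g}), C(xi*_{s->g})) pairs,
    approximated with RRT* as described above."""
    p_task = trajectory_likelihood(cost_obs_sq, *costs_task) * prior_task
    p_motion = trajectory_likelihood(cost_obs_sq, *costs_motion) * (1.0 - prior_task)
    return "task" if p_task > p_motion else "motion"

# Illustrative costs in meters: the observed 0.15 m trajectory is nearly
# optimal for reaching the motion-predicate region, so "motion" wins.
print(infer_intention(0.15, costs_task=(0.10, 0.12), costs_motion=(0.04, 0.17)))
```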

Task Predicate Pooling. Using the proposed motion predicate reasoning, we are able to decide for each segment S_i whether the corresponding g_i is intentional. The final step is then to pool all the intentional g_i into the final goal G. We combine the intentional task predicates by removing the ones that are preconditions for later intentional task predicates. For example, although one moves the bowl onto the stove, the actual goal might just be to cook what is inside the bowl. The bowl is later moved back onto the table to serve, and being on the stove is thus not part of the final goal.
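One plausible reading of this pooling rule as code (our sketch; the precondition map is assumed to come from the domain definition):

```python
def pool_goal(intentional, preconditions_of):
    """Pool intentional task predicates into the final goal G, dropping any
    predicate that served as a precondition of a later intentional one."""
    goal = []
    for i, g in enumerate(intentional):
        later = intentional[i + 1:]
        if any(g in preconditions_of.get(h, ()) for h in later):
            continue  # g was only instrumental for a later subgoal
        goal.append(g)
    return goal

# Example: on_stove(bowl) is a precondition of cooked(spam) and is dropped.
intentional = [("on_stove", "bowl"), ("cooked", "spam"), ("in_workspace", "bowl")]
pre = {("cooked", "spam"): [("in", "spam", "bowl"), ("on_stove", "bowl")]}
print(pool_goal(intentional, pre))  # [("cooked", "spam"), ("in_workspace", "bowl")]
```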

C. Visual Perception Pipeline

We now discuss our visual perception pipeline and how we use its output to segment the video.

6D Pose Estimation. As our approach reasons at the level of object trajectories, we need to first detect and track the object poses in 3D space. We initialize the object poses with PoseCNN [52]. The detection output is then used as initialization for PoseRBPF [53] to track the 6D poses. The tracked poses are further optimized based on signed distance functions [54]. When multiple cameras are available, we use the maximum particle score [53] to select the best view for the object. In this case, we transform the video frame observation z_t to an object-centric representation x_t = φ(z_{1:t}) = [x^1_t, ..., x^k_t, ..., x^K_t]. Here φ is our object pose tracking pipeline, x_t is the object-centric representation of z_t, and x^k_t is the estimated 6D pose of the k-th object. Note that one can further augment x_t with detected hand trajectories [55] to improve the robustness of the downstream video segmentation task.

Temporal Segmentation. The first step of our approach is to temporally segment the demonstration so that we can reason about the trajectories between the segments in the following steps.


Fig. 4 panels: pour spam in bowl; move cracker box away; cook on stove; move bowl to workspace.

Fig. 4. Example of our demonstration in the mockup tabletop kitchen. The person first pours the spam into the bowl and moves the cracker box away so that the bowl can be moved onto the stove to cook the spam. Finally, the bowl is moved back to the workspace.

Fig. 5. (a) Kitchen; (b) Mockup Kitchen. Our cooking domain involves four regions: the workspace that is closest to the agent, the storage that is further away, and two stoves (stove1, stove2) on which ingredients can be cooked.

While one can treat this as an action segmentation problem [47] and collect annotated data for training neural networks, we choose to segment the demonstration based on the object pose trajectories we already extract for motion reasoning. For each time step t, we have the poses x^k_t for all the objects. By comparing x^k_t temporally, we can tell whether object k is moving and being manipulated. We then segment the video such that each segment contains the manipulation of a single object. As noted earlier, the segmentation can be further improved by refining the temporal boundaries based on the detected hand-object distances. Next, we compute the predicates for each frame based on the estimated poses. By comparing the predicates at the start and the end of a segment, we obtain the task predicate(s) for that segment.
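A simplified sketch of this pose-based segmentation (our illustration; the threshold value and the use of 3D positions instead of full 6D poses are assumptions):

```python
import numpy as np

def segment_by_motion(positions, move_thresh=0.01):
    """Segment a demonstration by which object is moving.
    positions: (T, K, 3) array of per-frame 3D positions of the K tracked
    objects. Returns (start_frame, end_frame, object_index) segments."""
    T, K, _ = positions.shape
    disp = np.linalg.norm(np.diff(positions, axis=0), axis=-1)  # (T-1, K)
    mover = disp.argmax(axis=1)              # dominant moving object per frame
    active = disp.max(axis=1) > move_thresh  # is anything moving at all?
    segments, start, obj = [], None, None
    for t in range(T - 1):
        if active[t] and start is None:
            start, obj = t, mover[t]
        elif start is not None and (not active[t] or mover[t] != obj):
            segments.append((start, t, int(obj)))  # close the current segment
            start, obj = (t, mover[t]) if active[t] else (None, None)
    if start is not None:
        segments.append((start, T - 1, int(obj)))
    return segments
```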

IV. EXPERIMENTS

The aim of this paper is to recognize the goal of video demonstrations even when it is not obvious from the final state and the high-level actions. We hypothesize that the object trajectories are thus intent-expressive and propose a motion reasoning framework to recognize the goals of the demonstrations. Our experiments aim to answer the following questions: (1) Are there task predicates that are not intentional in the demonstrations? (2) Can our motion reasoning framework determine if a task predicate is intentional? (3) Can we address third-person imitation from video observation? We answer the first two questions by collecting a new dataset consisting of demonstrations of a mockup cooking task. We then apply our motion reasoning framework to recognize the goals of the demonstrations and compare to existing approaches for goal recognition. We answer the last question by performing robot execution based on our extracted goal in a different environment.

A. Mockup Cooking Domain

We are interested in problems that involve both task and motion reasoning. Cooking tasks are ideal because they involve multiple steps, and we have to potentially rearrange the objects to create valid paths. In addition to the standard pick-and-place operators/predicates, we introduce two additional operators and their associated predicates:

1) pour(X,Y): in_hand(X), on(X,Y) ⇒ in(X,Y)

2) cook(X): in(X,b), on_stove(b) ⇒ cooked(X)

A further constraint for pour is that Y needs to be a container; we only use a bowl as the container. The kitchen environment is divided into four regions: workspace, storage, and two stoves. Figure 5 shows the regions in both the mockup and the real kitchen.
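A standalone sketch of the two operators as state-transition functions (our encoding; the grounding to spam and the bowl is illustrative):

```python
def pour(state, X, Y):
    """pour(X,Y): in_hand(X), on(X,Y) => in(X,Y); Y must be a container."""
    assert ("in_hand", X) in state and ("on", X, Y) in state
    return state | {("in", X, Y)}

def cook(state, X, b):
    """cook(X): in(X,b), on_stove(b) => cooked(X)."""
    assert ("in", X, b) in state and ("on_stove", b) in state
    return state | {("cooked", X)}

# cook's precondition in(X,b) is pour's effect, so pour must come first.
s = {("in_hand", "spam"), ("on", "spam", "bowl"), ("on_stove", "bowl")}
s = cook(pour(s, "spam", "bowl"), "spam", "bowl")
assert ("cooked", "spam") in s
```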

We design our tasks as follows: a task is to cook an ingredient F, which involves two key steps, pour and cook. One of the key steps is initially blocked by a blocking object B, and the goal might or might not involve moving B to a new target region. We consider two ingredients F = {tomato soup, spam} and three blocking objects B = {cracker box, mustard bottle, sugar box}. This gives a total of 2 (ingredients) × 2 (which key step is blocked) × 3 (blocking objects) × 2 (whether B is part of the goal) = 24 tasks. We collect 4 demonstrations for each task, resulting in a total of 96 demonstrations. We exclude videos with substantial missing poses from the evaluation for a meaningful comparison. An example demonstration is shown in Figure 4.

B. Evaluating Goal Recognition

Experimental Setup. We perform 10 random splits of the 24 tasks in our dataset: 12 for training and 12 for testing. For methods that do not require training (including ours), the training set is used to select the hyperparameters.

Baselines. We compare to the following methods:
- Final State. The Final State baseline just uses the true predicates in the final frame as the goal [56].
- Task Predicates. The Task Predicates baseline uses all the achieved task predicates in the demonstration without analyzing the segments at the motion level. We apply the same predicate pooling scheme as ours for a fair comparison.
- Recurrent Neural Network (RNN). We use an RNN as an end-to-end learning baseline. The RNN uses the same object-centric representation x_t as ours. The hidden states are averaged over time and classified with a two-layer MLP for the subgoals.
- Discrete Graphical Model (DGM) [10]. We compare to graphical model approaches for goal recognition. The main difference is that in our case the action of a segment is uniquely defined by the start and end states, but the goal is not uniquely defined by the final state of the segment. We thus collect statistics for P(G|X_i, X_f) instead of P(A|X_i, X_f), where X_i and X_f are the start and end states of an action A to achieve goal G.
- Ours w/o Motion (m_i). We analyze the importance of our motion reasoning by comparing to an ablation without the motion reasoning. Instead of comparing g = g_i or m_i(b_i) using Eq. (2), this baseline only looks at g = g_i and selects a threshold for Eq. (2) to decide if the task predicate g_i is actually part of the goal.


Fig. 6 rows: Video Demo; Task Pred. (in_bowl(spam), in_workspace(spam_can), in_storage(sugar_box), cooked(spam), in_workspace(bowl)); Output Goal (in_workspace(spam_can), cooked(spam), in_workspace(bowl)); Robot Exec.

Fig. 6. Qualitative results for third-person imitation from observation. Given the video demonstration at the top, our framework is able to successfully extract the intended goal. The goal is then used as input to a robot system for execution to reproduce the goal in a different environment.

TABLE I: Goal recognition results on our Mockup Cooking Dataset.

              Prec.   Recall  F1     F1_blk  Succ.
Final State   0.31    0.98    0.47   0.53    0.00
Task Pred.    0.74    0.96    0.81   0.57    0.24
RNN           0.72    0.56    0.62   0.47    0.17
DGM [10]      0.76    0.94    0.82   0.54    0.26
Ours w/o m_i  0.90    0.84    0.84   0.35    0.29
Ours          0.88    0.91    0.88   0.76    0.47

Metrics. We consider the following metrics:
- F1 Score. As the goal is a subset of all predicates, standard metrics are Precision, Recall, and F1 Score, which measure how well the predicted goal matches the ground truth.
- F1_blk. As there are still many predicates in the goal, we further focus the evaluation on predicates of the blocking object (blk), because the blocking object is the main source of ambiguous subgoals in the mockup cooking domain.
- Success Rate. We approximate the actual execution success rate by just looking at the output goal. If the output goal is exactly the same as the ground-truth goal, we treat it as a success, and a failure otherwise.
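These set-based metrics can be computed directly from the predicted and ground-truth predicate sets, e.g. (our sketch):

```python
def goal_metrics(pred: set, gt: set):
    """Precision/Recall/F1 over goal predicates, plus exact-match success."""
    tp = len(pred & gt)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    success = float(pred == gt)  # exact match approximates execution success
    return precision, recall, f1, success
```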

Results. The results are shown in Table I. The Final State baseline has the highest recall because the goal predicates should be true in the last frame. However, the high recall comes at the price of low precision. The Task Predicates baseline achieves much higher precision by recognizing all the high-level actions in the demonstration. We consider two learning-based baselines: RNN and DGM [10]. With fewer than 50 training videos, the learning-based methods do not generalize well. DGM improves slightly over the Task Predicates baseline by accumulating statistics of how often a predicate is part of the goal. Without motion reasoning, none of the above baselines can handle the moving of the blocking objects, and thus none achieves a reasonable F1_blk. Our full model explicitly performs motion reasoning about the objects in the demonstration, and thus does not blindly take all the object movements as intentional. This gives much higher precision compared to the Task Predicates baseline. Our approach thus best balances precision and recall, and significantly outperforms the baselines (+21% for Succ. and +19% for F1_blk). We further analyze the importance of our motion reasoning. Ours w/o m_i does not consider the motion predicate, but aims to recognize the goal by comparing how optimal the trajectory is for achieving the task predicate. This conservative baseline gives the highest precision, but at the cost of lower recall. In addition, without motion reasoning, this baseline is unable to handle the blocking object to complete the task.

C. Third-person Imitation from Observation

One additional advantage of our goal-based framework is that it enables imitation learning across different environments. To demonstrate this, we executed tasks on a Franka Panda robot in a real kitchen environment. We used PoseCNN [52] to provide initial estimates of object positions, and used DART [54] to track objects as they moved around and to perform hand-eye calibration between the Franka robot and the kitchen. Primitive motion policies are executed via RMPflow [57], which allows for reactive, closed-loop control. We use the output goal from the video demonstration for task planning. The task plan is then represented as a Robust Logical-Dynamical System [58] for reactive recovery and robustness to sensor noise during execution.

Figure 6 shows qualitative results. Based on the video demonstration collected in the mockup kitchen, we are able to successfully extract the correct goal despite the manipulation of the sugar box. This extracted goal is input to the task planner from [58], and the resulting plan is successfully executed in the real kitchen. Although the demonstrator moves the sugar box in the video demonstration, and the sugar box also appears in the kitchen, the robot recognizes that it does not need to move the sugar box because it is already out of the way and the goal is just to cook the spam.

V. CONCLUSION

We present a new motion reasoning framework to recognize goals from real-world video demonstrations. We show that despite being ambiguous at the symbolic action level and in terms of final states, the demonstrations are still intent-expressive at the trajectory level. By explicitly reasoning about object trajectories for task goals, we combine task and motion reasoning to infer the goal of the demonstration. Our results show that this allows us to significantly outperform previous approaches that infer the goal based on either just motion planning or just task planning. In addition, we show that our goal-based formulation enables the robot to reproduce the same goal in a real kitchen by just watching a video demonstration from a mockup kitchen.


Acknowledgements. We thank Ankur Handa, Yu Xiang, and Clemens Eppner for their help and discussions on the project.

REFERENCES

[1] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in ICML, 2004.

[2] J. Ho and S. Ermon, "Generative adversarial imitation learning," in NeurIPS, 2016.

[3] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, "Maximum entropy inverse reinforcement learning," in AAAI, 2008.

[4] C. Finn, S. Levine, and P. Abbeel, "Guided cost learning: Deep inverse optimal control via policy optimization," in ICML, 2016.

[5] Y. Duan, M. Andrychowicz, B. C. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba, "One-shot imitation learning," in NeurIPS, 2017.

[6] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine, "One-shot visual imitation learning via meta-learning," in CoRL, 2017.

[7] D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei, and S. Savarese, "Neural task programming: Learning to generalize across hierarchical tasks," in ICRA, 2018.

[8] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in ICML, 2017.

[9] G. Cheng, K. Ramirez-Amaro, M. Beetz, and Y. Kuniyoshi, "Purposive learning: Robot reasoning about the meanings of human activities," Science Robotics, vol. 4, no. 26, 2019.

[10] M. J.-Y. Chung, A. L. Friesen, D. Fox, A. N. Meltzoff, and R. P. Rao, "A Bayesian developmental approach to robotic goal-based imitation learning," PLoS One, vol. 10, no. 11, 2015.

[11] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell, "Zero-shot visual imitation," in ICLR, 2018.

[12] D. Verma and R. P. Rao, "Goal-based imitation as probabilistic inference over graphical models," in NeurIPS, 2006.

[13] A. D. Dragan, K. C. Lee, and S. S. Srinivasa, "Legibility and predictability of robot motion," in HRI, 2013.

[14] G. Katz, D.-W. Huang, T. Hauge, R. Gentili, and J. Reggia, "A novel parsimonious cause-effect reasoning algorithm for robot imitation and plan recognition," IEEE Transactions on Cognitive and Developmental Systems, vol. 10, no. 2, pp. 177–193, 2017.

[15] H. S. Koppula and A. Saxena, "Anticipating human activities using object affordances for reactive robotic response," in RSS, 2013.

[16] M. Ramírez and H. Geffner, "Plan recognition as planning," in AAAI, 2009.

[17] B. D. Ziebart, J. A. Bagnell, and A. K. Dey, "Modeling interaction via the principle of maximum causal entropy," in IROS, 2010.

[18] C. L. Baker, R. Saxe, and J. B. Tenenbaum, "Action understanding as inverse planning," Cognition, vol. 113, no. 3, pp. 329–349, 2009.

[19] N. D. Goodman and M. C. Frank, "Pragmatic language interpretation as probabilistic inference," Trends in Cognitive Sciences, vol. 20, no. 11, pp. 818–829, 2016.

[20] A. Solway and M. M. Botvinick, "Goal-directed decision making as probabilistic inference: A computational framework and potential neural correlates," Psychological Review, vol. 119, no. 1, p. 120, 2012.

[21] S. Sohrabi, A. V. Riabov, and O. Udrea, "Plan recognition as planning revisited," in IJCAI, 2016.

[22] Y. E-Martín, M. D. R-Moreno, D. E. Smith, et al., "A fast goal recognition technique based on interaction estimates," in IJCAI, 2015.

[23] H. P. Grice, "Logic and conversation," 1975.

[24] Y. Liu, A. Gupta, P. Abbeel, and S. Levine, "Imitation from observation: Learning to imitate behaviors from raw video via context translation," in ICRA, 2018.

[25] B. C. Stadie, P. Abbeel, and I. Sutskever, "Third-person imitation learning," in ICLR, 2017.

[26] M. Lázaro-Gredilla, D. Lin, J. S. Guntupalli, and D. George, "Beyond imitation: Zero-shot task transfer on robots by learning concepts as cognitive programs," Science Robotics, 2019.

[27] G. Sukthankar, C. Geib, H. H. Bui, D. Pynadath, and R. P. Goldman, Plan, Activity, and Intent Recognition: Theory and Practice. Newnes, 2014.

[28] C. Baker, R. Saxe, and J. Tenenbaum, "Bayesian theory of mind: Modeling joint belief-desire attribution," in CogSci, 2011.

[29] M. D. Lee, "Bayesian methods in cognitive modeling," Stevens' Handbook of Experimental Psychology and Cognitive Neuroscience, vol. 5, pp. 1–48, 2018.

[30] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Perolat, D. Silver, and T. Graepel, "A unified game-theoretic approach to multiagent reinforcement learning," in NeurIPS, 2017.

[31] N. C. Rabinowitz, F. Perbet, H. F. Song, C. Zhang, S. Eslami, and M. Botvinick, "Machine theory of mind," in ICML, 2018.

[32] R. Kelley, A. Tavakkoli, C. King, M. Nicolescu, M. Nicolescu, and G. Bebis, "Understanding human intentions via hidden Markov models in autonomous mobile robots," in HRI, 2008.

[33] A. Ramachandran, A. Litoiu, and B. Scassellati, "Shaping productive help-seeking behavior during robot-child tutoring interactions," in HRI, 2016.

[34] F. Torabi, G. Warnell, and P. Stone, "Adversarial imitation learning from state-only demonstrations," in AAMAS, 2019.

[35] P. Sermanet, K. Xu, and S. Levine, "Unsupervised perceptual rewards for imitation learning," in RSS, 2017.

[36] D.-A. Huang, S. Nair, D. Xu, Y. Zhu, A. Garg, L. Fei-Fei, S. Savarese, and J. C. Niebles, "Neural task graphs: Generalizing to unseen tasks from a single video demonstration," in CVPR, 2019.

[37] T. Chakraborti, A. Kulkarni, S. Sreedharan, D. E. Smith, and S. Kambhampati, "Explicability? Legibility? Predictability? Transparency? Privacy? Security? The emerging landscape of interpretable agent behavior," in ICAPS, 2019.

[38] D. Szafir, B. Mutlu, and T. Fong, "Communication of intent in assistive free flyers," in HRI, 2014.

[39] S. Tellex, R. Knepper, A. Li, D. Rus, and N. Roy, "Asking for help using inverse semantics," 2014.

[40] Y. Zhang, S. Sreedharan, A. Kulkarni, T. Chakraborti, H. H. Zhuo, and S. Kambhampati, "Plan explicability and predictability for robot task planning," in ICRA, 2017.

[41] A. M. MacNally, N. Lipovetzky, M. Ramirez, and A. R. Pearce, "Action selection for transparent planning," in AAMAS, 2018.

[42] C. R. Garrett, T. Lozano-Pérez, and L. P. Kaelbling, "Sampling-based methods for factored task and motion planning," IJRR, 2018.

[43] M. Toussaint, K. Allen, K. A. Smith, and J. B. Tenenbaum, "Differentiable physics and stable modes for tool-use and manipulation planning," in RSS, 2018.

[44] C. Dornhege, A. Hertle, and B. Nebel, "Lazy evaluation and subsumption caching for search-based integrated task and motion planning," in IROS Workshop on AI-based Robotics, 2013.

[45] T. Lozano-Pérez and L. P. Kaelbling, "A constraint-based method for solving sequential manipulation planning problems," in IROS, 2014.

[46] M. Ghallab, D. Nau, and P. Traverso, Automated Planning: Theory and Practice. Elsevier, 2004.

[47] H. Kuehne, A. Arslan, and T. Serre, "The language of actions: Recovering the syntax and semantics of goal-directed human activities," in CVPR, 2014.

[48] C.-Y. Chang, D.-A. Huang, Y. Sui, L. Fei-Fei, and J. C. Niebles, "D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation," in CVPR, 2019.

[49] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray, "Scaling egocentric vision: The EPIC-KITCHENS dataset," in ECCV, 2018.

[50] G. Csibra and G. Gergely, "'Obsessed with goals': Functions and mechanisms of teleological interpretation of actions in humans," Acta Psychologica, vol. 124, no. 1, pp. 60–78, 2007.

[51] S. Karaman and E. Frazzoli, "Sampling-based algorithms for optimal motion planning," IJRR, vol. 30, no. 7, pp. 846–894, 2011.

[52] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, "PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes," in RSS, 2018.

[53] X. Deng, A. Mousavian, Y. Xiang, F. Xia, T. Bretl, and D. Fox, "PoseRBPF: A Rao-Blackwellized particle filter for 6D object pose tracking," in RSS, 2019.

[54] T. Schmidt, R. A. Newcombe, and D. Fox, "DART: Dense articulated real-time tracking," in RSS, 2014.

[55] A. Handa, K. Van Wyk, W. Yang, J. Liang, Y.-W. Chao, Q. Wan, S. Birchfield, N. Ratliff, and D. Fox, "DexPilot: Vision-based teleoperation of a dexterous robotic hand-arm system," arXiv, 2019.


[56] D.-A. Huang, D. Xu, Y. Zhu, A. Garg, S. Savarese, L. Fei-Fei, and J. C. Niebles, "Continuous relaxation of symbolic planner for one-shot imitation learning," in IROS, 2019.

[57] C.-A. Cheng, M. Mukadam, J. Issac, S. Birchfield, D. Fox, B. Boots, and N. Ratliff, "RMPflow: A computational graph for automatic motion policy generation," in WAFR, 2018.

[58] C. Paxton, N. Ratliff, C. Eppner, and D. Fox, "Representing robot task plans as robust logical-dynamical systems," in IROS, 2019.