
Object Affordance Learning from Human Descriptions

    Human Robot Interaction Final Project Report

    Yuchen Cui
    Computer Science
    University of Texas at Austin
    [email protected]

    Wenbo Xu
    Electrical and Computer Engineering
    University of Texas at Austin
    [email protected]

    ABSTRACT

    It is very important for intelligent agents that serve humans to understand a human's request expressed in natural language. How a robot represents its knowledge about the world largely affects its ability to communicate effectively. An object-centric representation of knowledge is believed to be an efficient way for robots to reason about tasks. Object affordances are often used to describe the functionality of an object or the actions that can be performed on it. Prior work has focused on how a robot can learn object affordances by manipulating objects or by observing how humans interact with them. While affordances learned in this way are concrete and grounded in the robot's actions, they are not directly mapped to a human's abstractions or to what the objects mean to humans. In order for the robot to truly understand a human's request given in natural language, we propose to let the robot learn the mapping between its perception and the human's through conversation. More specifically, the robot asks the human to describe objects in terms of their properties and functionalities and builds object models from this feedback. We implemented a prototype of the proposed system and present the results of a user study in this report.

    Keywords

    Human-centered computing; Human Robot Interaction; Object Affordance Learning

    1. INTRODUCTION

    With the fast advancement of modern technology, robots will in the near future enter small businesses and average households to help automate our daily lives. These robots will interact with humans and process humans' requests every day. They need to be able to communicate with humans naturally and effectively. However, robots and humans have inherent perceptual differences. Where a robot sees a cluster of 184 pixels with an average RGB value of (245, 5, 2), a human sees a red ball.


    Figure 1: Robot Gemini in a simulated world

    Table 1: Object Model

        Field              Example
        Label (L)          "apple"
        Raw Features (F)   [<size>, <color>, <shape>]
        Descriptions (D)   red, round
        Actions (A)        make juice, make pie

    These kinds of differences make communication hard between humans and robots. Humans intuitively learn to associate abstract descriptions with objects as they interact with them. A human child can learn quite a lot about an object by asking questions like "What is this?" and "What do you use it for?". We believe the answers to these questions can also be leveraged by a learning agent interacting with humans in an effort to understand how humans perceive the world.

    We propose to build a system that queries a human for descriptions and builds object affordance models from these descriptions. It asks questions and learns the meanings of the answers in a structured way. An object is associated with the features the robot detects, the descriptions the human gives, and the actions the human says the object affords, as shown in Table 1. Our object model not only captures how the robot perceives an object but also maps that perception to the human's descriptions of the object, so that the two can understand each other in the human's language. In this report, we discuss how we envision our object affordance model being used and what kinds of design effort are needed to build a comprehensive system. A prototype of the proposed system is implemented on the Stanley Vector robot Gemini, and a user study is performed to test the performance of this prototype.
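    To make Table 1 concrete, the following is a minimal sketch of how one entry of the object model could be represented in code; the class name, field names and the example feature values are illustrative assumptions on our part, not the data structures of the implemented system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectModel:
    """One entry in the knowledge database (cf. Table 1)."""
    label: str                                              # L: human-provided label, e.g. "apple"
    raw_features: List[float]                               # F: feature vector from perception (size, color, shape)
    descriptions: List[str] = field(default_factory=list)   # D: free-form human descriptions
    actions: List[str] = field(default_factory=list)        # A: actions the human says the object affords

# Example entry corresponding to the row in Table 1 (feature numbers are placeholders).
apple = ObjectModel(
    label="apple",
    raw_features=[0.08, 245.0, 5.0, 2.0, 0.91],
    descriptions=["red", "round"],
    actions=["make juice", "make pie"],
)
```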

    2. RELATED WORK

    Object affordance learning has been studied in various fields, including Human Robot Interaction (HRI) and Learning from Demonstration (LfD). On the other hand, Natural Language Processing (NLP) and Cognitive Robotics (CR) have focused on how to make communication between humans and robots more efficient and natural. Our proposed method aims to bridge the gap between simply imitating humans and understanding humans for autonomous learning agents.

    Object affordance does not have an exact dictionary definition, but it is often used in robotics as a term relating actions to objects, so that a robot can ground object properties to its skills for manipulation purposes [5][6][7].

    Figure 2: Object Affordance Model from Lopes et al. [3]: affordances are considered to be relations between Objects, Actions and Effects, and can be used for different purposes, such as predicting the Effects given Objects and Actions.

    Lopes et al. [3] presented a general model for learning object affordances using Bayesian networks, integrated in a general developmental architecture for social robots [8][9]. Their affordance model is presented in Figure 2, where affordances are considered relations between actions, objects and effects. Given any two of the elements, the model is able to predict the third. They implemented the system on a humanoid robot, tested it with imitation learning tasks, and showed that the robot was able to learn affordances associated with its manipulation actions such as grasp, touch, and tap. However, their object affordance learning only captures the robot's perception and restricts the affordance model to be grounded in the robot's primitive actions. It would be hard for a robot that learned object affordances with such a model to communicate its knowledge to a human.

    Thomaz & Cakmak's work [2] approaches the problem of object affordance learning by exploring the role of a human teacher physically interacting with a learning agent and the environment. They present the method of leveraging a human partner's assistance in robot learning as Socially Guided Machine Learning (SG-ML). The idea is that human social learning is the most natural interface for humans, while social interaction also provides biases and constraints that can simplify problems for autonomous learning agents. They show that SG-ML can improve the performance and efficiency of robot learning. Their study also revealed that the robot's gazing behavior helps communicate the robot's state to humans naturally, which indicates that humans expect to interact with robots in the ways they naturally interact with each other.

    Thomason et al. [1] look at how to ground natural language to multi-modal sensory data by having a human and a learning agent play the "I Spy" game. They collected multi-modal (including haptic, auditory and proprioceptive) sensory data by letting the robot manipulate a set of selected objects on its own, and then trained classifiers using the human's labels from the "I Spy" game as ground truth. They demonstrated that the multi-modal system performs better than vision-only systems. However, their model can only capture the relationship between linguistic semantics and object features observed by the robot. In their user study, they had to instruct participants not to talk about the functionality of the selected objects so that the robot could directly relate what it hears to what it senses.

    Our proposed system leverages human descriptions of objects to build object affordance models, enabling the robot to communicate with the human about objects using the human's own descriptions. Our object affordance model captures not only what the robot perceives but also the human's perception of objects. Our model can be combined with affordance-based imitation learning methods such as the work by Thomaz & Cakmak [2], so that the robot can communicate about objects with the human teacher during learning.

    3. SYSTEM DESIGN

    In this section we present our technical approach toward modeling object affordances with human descriptions and how to retrieve knowledge using our proposed model.

    3.1 Object Affordance Model

    Figure 3: Modified Object Affordance Model: the component Human Descriptions is added; the affordances modeled here are relationships between natural language descriptions and observed raw features.

    Our object affordance model aims to address the fact that robots will need to communicate with humans about objects. Therefore, as Figure 3 shows, we added the Human Descriptions component to the model of Montesano & Lopes [3] and focus on the relationships between Human Descriptions and Object. At the same time, we modified the developmental architecture for social robots that Montesano & Lopes [4] adopted from Weng [8] and Lungarella et al. [9] to include learning a human model and communication skills as the robot interacts with the world. The hypothesis is that, by having a human model and basic communication skills, the robot will be able to perform imitation games better, since it can communicate about objects with humans using natural language.

    Figure 4: Modified Developmental Architecture: Learn Human Model and Communication Skills are added to the World Interaction phase.

    3.2 Knowledge Retrieval

    Given our object affordance model, we can now discuss how to make use of it efficiently. Since the purpose of the model is to capture the human's understanding of objects, it is desirable not to restrict the ways a human describes objects. Therefore, we do not restrict the human user to any vocabulary or require grammatically correct sentences. Given a corpus of unconstrained content, even if we can extract correct attributes, inference can easily become intractable. A simple but efficient way of retrieving knowledge from unconstrained text data is keyword-based search.

    We adopt the popular Term Frequency-Inverse Document Frequency (TF-IDF) index to determine relevance [10]. Given a set of documents $D = \{d_0, d_1, \dots, d_n\}$, the TF-IDF index of a keyword $w$ for a document $d \in D$ is calculated as:

    $$T(w, d) = f_{w,d} \log\left(\frac{|D|}{f_{w,D}}\right) \quad (1)$$

    where $f_{w,d}$ is the frequency, or number of appearances, of $w$ in $d$, $|D|$ is the total size of the corpus, and $f_{w,D}$ is the total number of appearances of $w$ in $D$.

    Given a sentence as a list of words $s = [w_0, w_1, \dots, w_n]$, the relevance score of a particular document is calculated as:

    $$R(s, d) = \sum_{w \in s} \frac{T(w, d)}{\sum_{d' \in D} T(w, d')} \quad (2)$$

    A list of stopwords can also be used to reduce the possible keyword candidates. In our system, the human descriptions associated with each object model become a document associated with its corresponding label, and the most relevant object to a user's query is:

    $$d_{best} \in \operatorname*{arg\,max}_{d \in D} R(s, d) \quad (3)$$

    We can then retrieve the sensory data associated with $d_{best}$ to ground the query to the features of the object that the robot can perceive. If the selected object is not what the user requires, the robot can propose another object using a list of objects ranked by their $R(s, d)$ scores. The user's query can be added to the object's model of human descriptions if the robot receives confirmation from the human that it found the correct object.
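    The retrieval step can be summarized with a short sketch. The code below is a minimal, illustrative implementation of Equations (1)-(3) in Python; the function names, the stopword list and the toy knowledge base are our own assumptions and do not correspond to the implemented ROS nodes.

```python
import math

STOPWORDS = {"it", "is", "a", "the", "and", "something"}  # illustrative stopword list

def tfidf(word, doc, docs):
    """Equation (1): T(w, d) = f_{w,d} * log(|D| / f_{w,D})."""
    f_wd = doc.count(word)
    f_wD = sum(d.count(word) for d in docs)
    if f_wd == 0 or f_wD == 0:
        return 0.0
    return f_wd * math.log(len(docs) / f_wD)

def relevance(query, doc, docs):
    """Equation (2): sum over query words of TF-IDF normalized across documents."""
    score = 0.0
    for w in query:
        if w in STOPWORDS:
            continue
        denom = sum(tfidf(w, d, docs) for d in docs)
        if denom > 0:
            score += tfidf(w, doc, docs) / denom
    return score

def best_object(query, knowledge_base):
    """Equation (3): return the label whose description document maximizes R(s, d)."""
    docs = list(knowledge_base.values())
    return max(knowledge_base, key=lambda label: relevance(query, knowledge_base[label], docs))

# Each object's accumulated human descriptions form its document (tokenized).
kb = {
    "apple":  "red round make juice make pie".split(),
    "banana": "yellow long curved eat peel".split(),
}
print(best_object("something red and round".split(), kb))  # -> "apple"
```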

    4. EXPERIMENT ENVIRONMENT

    4.1 Hardware Environment

    We implemented a prototype of the proposed system on the Stanley Vector robot Gemini, as shown in Figure 1, made available by the Personal Autonomous Robots Lab (PeARL) in the UT Computer Science department. Gemini is equipped with a Microsoft Kinect as its visual sensor and a Kinova Jaco arm, a robotic arm that allows Gemini to perform human-like manipulation movements, such as pointing at an object.

    4.2 Implementation Details

    Figure 5: System Diagram

    The implemented design is a distributed system consisting of functional units. The system diagram is shown in Figure 5. There are four functional units: the Speech Recognition unit, the Manipulation unit, the Feature Extraction unit, and the Knowledge Database, coordinated by the Description Learning state machine. Each of these units is implemented as a node in ROS, and the nodes communicate with each other through message-passing queues using the publish/subscribe schema in ROS.
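    As an illustration of how such units exchange messages, the sketch below shows a minimal rospy publisher/subscriber pair; the topic name and message contents are hypothetical stand-ins, not the topics used in the actual system, and each function would run as its own node/process in practice.

```python
#!/usr/bin/env python
# Minimal sketch of two ROS nodes communicating over a topic,
# assuming a hypothetical topic name "/speech/transcript".
import rospy
from std_msgs.msg import String

def on_transcript(msg):
    # The knowledge-database node would parse the text and update object models here.
    rospy.loginfo("Received description: %s", msg.data)

def knowledge_node():
    rospy.init_node("knowledge_database")
    rospy.Subscriber("/speech/transcript", String, on_transcript)
    rospy.spin()

def speech_node():
    rospy.init_node("speech_recognition")
    pub = rospy.Publisher("/speech/transcript", String, queue_size=10)
    rate = rospy.Rate(1)
    while not rospy.is_shutdown():
        pub.publish(String(data="it is red and round"))  # placeholder transcript
        rate.sleep()
```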

    4.2.1 Functional units

    • The Speech Recognition unit utilizes the Google Speech API [13] to perform speech-to-text conversion. Whenever speech recognition is needed, it uploads the human speech data to the Google Cloud Platform for online recognition and retrieves the result by polling.

    • The Feature Extraction unit is based on PCL (Point Cloud Library) and the HLP-R Perception [15] package. It samples the RGB-D data from the Kinect, performs point-cloud segmentation, extracts cluster features, and returns a list of candidate feature vectors (Figure 6).

    • The Manipulation unit directly uses actions defined in the HLP-R Manipulation [14] package. It uses MoveIt [17] to calculate a feasible path to a desired pose and executes the returned trajectory of movements.

    • The Knowledge Database stores the object model as described in Table 1. It links what Gemini sees (object feature vectors perceived by the Kinect) with what the human says (human descriptions). Text processing is done using the TextBlob library [16]; a small text-processing sketch follows this list.
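    The sketch below shows the kind of text processing TextBlob enables, assuming descriptions arrive as short free-form sentences; the exact processing pipeline used in our nodes may differ.

```python
from textblob import TextBlob

description = "It is somewhat round with a pointy end"

blob = TextBlob(description)
print(blob.words)   # tokenized words: ['It', 'is', 'somewhat', 'round', ...]
print(blob.tags)    # part-of-speech tags, e.g. ('round', 'JJ')

# Keep adjectives and nouns as candidate keywords for the object's description document.
keywords = [w.lower() for w, tag in blob.tags if tag.startswith(("JJ", "NN"))]
print(keywords)     # e.g. ['round', 'pointy', 'end'] (exact output depends on the tagger)
```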

    Figure 6: Top: RGB-D Kinect data. Middle: segmented cluster. Bottom: extracted features.

    4.2.2 Robot State Machine

    The Robot State Machine, shown in Figure 7, controls the operation flow of the entire system. It initially stays in the IDLE state and enters either the training loop or the testing loop depending on the user's task. For a training task, the state machine first searches the current environment using the perception unit to find an unknown object. Then, the manipulation unit controls the arm to point at that unknown object. Finally, the speech unit converts the human's speech into text and stores it in the knowledge database. For a testing task, the state machine first retrieves the human's spoken request using the speech unit. Then, it searches the knowledge database for a match using the algorithm described in the previous section. Next, the feature vector of the matched object is used by the perception unit to find the most likely object in the current environment. Finally, the manipulation unit moves Gemini's arm to point at that item.

    Figure 7: Robot State Machine: Training Loop (left) and Testing Loop (right)
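    The control flow described above can be sketched as a simple state machine. The code below is illustrative only; the unit interfaces (perception, manipulation, speech, kb) are hypothetical placeholders for the actual ROS nodes.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    TRAINING = auto()
    TESTING = auto()

QUESTIONS = ["What is this object?", "What is this color?",
             "What is this shape?", "What do you use it for?"]

def step(state, perception, manipulation, speech, kb):
    """One pass through the training or testing loop, returning to IDLE afterwards."""
    if state is State.TRAINING:
        features = perception.find_unknown_object()   # search scene for an unknown object
        manipulation.point_at(features)               # point the arm at it
        answers = [speech.ask(q) for q in QUESTIONS]  # collect the human's answers
        kb.add(features, answers)                     # store label/descriptions/actions
    elif state is State.TESTING:
        query = speech.listen()                       # human's spoken request
        features = kb.best_match(query)               # TF-IDF retrieval (Section 3.2)
        target = perception.find_closest(features)    # locate matching object in the scene
        manipulation.point_at(target)
    return State.IDLE
```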

    5. USER STUDY

    We conducted a within-subject user study with 8 participants on our integrated system to see how it performs and, more importantly, to get feedback from the humans who interacted with it on how we should improve it.

    5.1 Hypothesis

    Our hypothesis is that, using the proposed object affordance model, a robot will be able to communicate about objects with the human who taught it more accurately and naturally, because we believe the proposed system is able to learn properties of objects that can only be obtained from human descriptions, while descriptions of objects may vary from human to human.

    5.2 Description Learning Task

    The task we use to test the system is called Gemini Wants to Learn, where a human describes objects that Gemini selects by answering the questions Gemini asks, and then tests Gemini by asking for one of the objects they described to it.

    The objects used in this user study are selected from the YCB data set [12]. Since we only have a very basic visual perception system, the objects were selected to be solid in color and visually distinguishable from each other. The selected objects are presented in Figure 8. Only 3 of these objects are randomly selected for use in a single experiment.

    Figure 8: Selected objects from the YCB data set [12]: (top left to bottom right) toy drill, orange cup, orange, soccer ball, mug, apple, softball, yellow cup and banana.

    As discussed previously, we do not want to restrict the way the participant answers Gemini's questions. However, we asked participants to be as consistent as possible in their descriptions of an object, since we did not implement a correction mechanism to adapt to changing affordance models. The specific instructions we gave our participants were the following:

    • Gemini is a curious robot. He wants to learn about the objects on the table and your task today is to help him learn! The experiment consists of three parts: Test Gemini, Teach Gemini and Quiz Gemini.

    • During Test Gemini, you'll pick an arbitrary object on the table and describe it to Gemini and see if he can find the object you just described.

    • During Teach Gemini, Gemini will first find an object he wants to learn about and point at that object; then he will ask four questions: "What is this object?", "What is this color?", "What is this shape?", and "What do you use it for?". Gemini has a limited ability to focus on hearing, so you are expected to answer the first question with a single word like "apple", and answer the other three questions with short sentences like "it is green", "it is round" or "it is somewhat round with a pointy end"; just make sure you finish a sentence without breaks.

    • During Quiz Gemini, your task is to see how well Gemini has learned! This time, you pick an object and describe it to Gemini and see if he can find the object you just described! (Hint: A good teacher makes the problem neither too easy nor too hard. For example, if you taught Gemini that the "apple" is "red", you don't want to ask for the "apple" directly, or ask for a "fruit" and expect Gemini to know that an "apple" is a "fruit" by himself; you probably want to ask for a "red object".)

    The initial experiment, Test Gemini, is designed to serve as the control condition, where we defined labels for each object and used Wikipedia's definitions of these labels as the associated human descriptions for these objects. The Teach Gemini experiment is where we collect human descriptions from our participants, and Quiz Gemini is where we actually measure the performance of our system using the descriptions collected in the previous step.

    5.3 Participants

    We recruited 8 participants to interact with Gemini, of whom 7 are male and 1 is female, and all of them are UT graduate students majoring in Computer Science or Electrical and Computer Engineering. In addition, 6 of the 8 participants are non-native English speakers.

    (a) Real Scene (b) Simulated Scene

    Figure 9: User Study Scene Setup

    We gave them all the same instructions as presented previously. Due to an unsolved firmware issue with the Kinova arm, we were only able to let two of the participants interact with Gemini physically, in the scene setup shown in Figure 9a. The other six participants interacted with Gemini in a simulated world in Gazebo, as shown in Figure 9b. The system we implemented has no inherent difference between simulation and real-world experiments, but there may be psychological differences between the mental models of a human interacting with a physically embodied robot and a virtual one [11], which may limit the credibility and generalizability of the analysis drawn from the obtained results.

    5.4 Performance Evaluation

    The number of successes out of the total number of queries for the control condition is 2/8, which indicates a 25% success rate using pre-defined object affordance models. The number of successes out of the total number of queries for the experimental condition is 13/16, indicating an 81.25% success rate for learned object affordance models.

    We intended to count the average number of trials Gemini needs to find the correct object as an index of knowledge retrieval accuracy. However, the results from our experiments show that most of the time Gemini either gets the correct object on the first trial or fails completely when it thinks no object in the knowledge base matches the user's query. We think the reason may be that our knowledge retrieval algorithm is very basic and the descriptions collected during a single interaction are not rich enough for the human

    to be creative in their queries.

    As mentioned previously, a majority of our participants are non-native English speakers, which led to many errors in speech-to-text conversion. However, even when the labels and descriptions they gave were misinterpreted by Gemini, Gemini was still able to find the correct object most of the time, since the misinterpretations were mostly due to accents and were somewhat consistent within-subject. As long as the human's descriptions are grounded to the correct features, misinterpretations can be bypassed in our system.

    5.5 User Feedback

    We asked each of the participants to fill out a post-experiment survey in order to get feedback about the implemented prototype. The questions in the survey and the corresponding results are listed in this section.

    • Q: Have you interacted with Gemini-like robots before? Half of the participants answered yes, including one who mentioned Siri in particular.

    • Q: Do you think Gemini is able to understand you? Why or why not? 5 out of the 8 participants think Gemini is able to understand them, mostly because Gemini was able to find the object they described. One participant thinks Gemini understood them to some extent. One thinks Gemini has limited understanding and can only do simple associations. One answered "not sure". Most participants who answered "yes" to the previous question also answered "yes" to this one.

    Figure 10: Participant Responses

    • Q: How smart do you think Gemini is compared to humans? (multiple choice) Results are shown in Figure 10.

    • Q: How smart do you think Gemini is compared to existing AI technologies? (multiple choice) Results are shown in Figure 11.

    • Q: Do you think the way Gemini learns about objects is efficient? If not, what do you think he could do instead? Half of the participants do not think this is an efficient way of learning, mostly because Gemini is just enumerating attributes and asking very basic questions about objects. They suggest that Gemini could learn the basic properties of these objects on its own and ask more interesting questions. One participant said this is a good way of learning, but that the specific question "What do you use it for?" may not apply to some categories of objects.

    Figure 11: Participant Responses

    • Q: Do you enjoy teaching robots, or would you prefer they learn things on their own? 5 out of the 8 participants prefer the robot to learn on its own. 2 other participants think the robot should at least learn the basics about objects on its own. The only female participant said she enjoyed teaching the robot.

    As shown in Figure 10, most participants associated Gemini's intelligence level with that of a 3-year-old child. Most participants do not think Gemini is similar to Google or Siri (Figure 11), including both those who think Gemini is able to understand them and those who don't, even though Gemini can essentially be considered a search engine with customized data from the user. Most participants found it tiring to answer the same basic questions over and over again for different objects and wished for more "interesting" questions. They tended to answer faster when teaching about the second and third objects than about the first. We designed the four different questions in an effort not only to get the human to talk about these properties, so that we can later leverage this information for generalization, but also to communicate the perceptive capabilities of Gemini (effectively, what Gemini sees) to the human. This did not seem to work as well as we intended. The repeated pattern of questions was considered the less "smart" aspect of the system.

    We also observed that during the two studies in which we were able to use the physical robot, the users focused on interacting with Gemini, while the participants who interacted with the virtual Gemini sometimes turned to look at us and sought an explanation of what Gemini was doing.

    6. FUTURE WORK

    From both the performance evaluation and the user feedback, we learned that there is a lot that can be improved across different aspects of the system.

    Our knowledge retrieval algorithm can be improved by leveraging definitions of keywords in WordNet [18], which is already integrated in TextBlob [16], so that the robot may relate different words with the same semantic meaning to the same attribute. We could also adopt a belief system as Montesano & Lopes [3] did.
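    As a rough illustration of this direction, the sketch below uses TextBlob's WordNet integration to check whether two description words share a synset, which could let the retrieval step treat them as the same attribute; this is an assumption about how the improvement might look, not part of the implemented system.

```python
from textblob import Word

def related(word_a, word_b):
    """Rough check: do two words share at least one WordNet synset (i.e. are listed as synonyms)?"""
    return bool(set(Word(word_a).synsets) & set(Word(word_b).synsets))

# Whether two words match this way depends on WordNet's sense inventory;
# a fuller solution would also follow hypernym links between synsets.
print(related("round", "circular"))
```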

    We also need to design the human-robot interaction more carefully so that it feels natural for the human. Instead of asking the same questions over and over again about each object, the robot could discuss one attribute of the objects at a time. It also needs to be more interactive and responsive to the human's answers. Some of our participants suggested that Gemini should greet them by their names instead of calling them all "Human". In addition, we could leverage multi-modal sensory data as Thomason et al. did [1], to increase the robot's perceptual capability.

    7. CONCLUSION

    In this report, we proposed an object affordance model to be used by social robots that need to communicate with humans about objects. We presented a modified developmental model for intelligent robots that includes human models. We implemented a prototype of a system using our proposed object affordance model and conducted a pilot user study to evaluate it and gain insights on how the system should be improved. We drew many valuable lessons from the process of designing an HRI system as well as conducting a controlled user study. It is very important for engineers who design robots to consider how humans would interact with their robots and how different social behaviors of the robot could affect the usefulness of their system.

    8. REFERENCES

    [1] Thomason, J., Sinapov, J., Svetlik, M., Stone, P., & Mooney, R. J. (2016). Learning Multi-Modal Grounded Linguistic Semantics by Playing "I Spy". In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI).

    [2] Thomaz, A. L., & Cakmak, M. (2009, March). Learning about objects with human teachers. In Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction (pp. 15-22). ACM.

    [3] Montesano, L., Lopes, M., Bernardino, A., & Santos-Victor, J. (2008). Learning object affordances: From sensory-motor coordination to imitation. IEEE Transactions on Robotics, 24(1), 15-26.

    [4] Montesano, L., Lopes, M., & Santos-Victor, J. (2007). A developmental roadmap for learning by imitation in robots. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(2), 308-321.

    [5] Katz, D., Venkatraman, A., Kazemi, M., Bagnell, J. A., & Stentz, A. (2014). Perceiving, learning, and exploiting object affordances for autonomous pile manipulation. Autonomous Robots, 37(4), 369-382.

    [6] Ugur, E., Szedmak, S., & Piater, J. (2014, October). Bootstrapping paired-object affordance learning with learned single-affordance features. In 4th International Conference on Development and Learning and on Epigenetic Robotics (pp. 476-481). IEEE.

    [7] Gonçalves, A., Abrantes, J., Saponaro, G., Jamone, L., & Bernardino, A. (2014, October). Learning intermediate object affordances: Towards the development of a tool concept. In 4th International Conference on Development and Learning and on Epigenetic Robotics (pp. 482-488). IEEE.

    [8] Weng, J. J. (1998). The developmental approach to intelligent robots. In AAAI Spring Symposium Series, Integrating Robotic Research: Taking the Next Leap.

    [9] Lungarella, M., Metta, G., Pfeifer, R., & Sandini, G. (2003). Developmental robotics: A survey. Connection Science, 15(4), 151-190.

    [10] Ramos, J. (2003, December). Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning.

    [11] Wainer, J., Feil-Seifer, D. J., Shell, D. A., & Mataric, M. J. (2006, September). The role of physical embodiment in human-robot interaction. In ROMAN 2006: The 15th IEEE International Symposium on Robot and Human Interactive Communication (pp. 117-122). IEEE.

    [12] Calli, B., Singh, A., Walsman, A., Srinivasa, S., Abbeel, P., & Dollar, A. M. (2015, July). The YCB object and model set: Towards common benchmarks for manipulation research. In Advanced Robotics (ICAR), 2015 International Conference on (pp. 510-517). IEEE.

    [13] Google Cloud Platform, Python-doc-examples [Source Code]

    [14] HLP-R, hlpr manipulation [Source Code]

    [15] HLP-R, hlpr perception [Source Code]

    [16] Steven Loria et al. TextBlob [Website]

    [17] Dave Coleman, Ioan A. Sucan et al. MoveIt [Source Code]

    [18] Princeton University. "About WordNet." WordNet. Princeton University. 2010. [Website]