
Report on Deliverable D4.3 [Human-Robot Interaction]

Grant agreement number: 609465

Project acronym: EARS

Project title: Embodied Audition for RobotS

Funding scheme: FP7

Date of latest version of Annex I against which the assessment will be made: 24.10.2013

Project’s coordinator: Prof. Dr.-Ing. Walter Kellermann, FAU

E-mail: [email protected]

Project website address: http://robot-ears.eu/

EC distribution: Confidential


Contents

1 Introduction

2 Interaction scenario

3 Attention system

4 Behaviours for event recognition and interaction

5 Interaction initiated by the robot

6 Synchronisation and communication protocol

7 Conclusions


List of Acronyms and Abbreviations

Partners’ Acronyms

FAU        Friedrich-Alexander University Erlangen-Nürnberg
IMPERIAL   Imperial College of Science, Technology and Medicine
BGU        Ben-Gurion University of the Negev
UBER       Humboldt-Universität zu Berlin
INRIA      Institut National de Recherche en Informatique et en Automatique
ALD        Aldebaran Robotics SA


1 Introduction

This deliverable on Human-Robot Interaction covers the efforts of the tasks accomplished within WP4, whose leading partner is UBER. The individual tasks have received contributions from other partners of the EARS project. In particular, the Work Package consists of the following five tasks:

• T4.1 Learning internal models for robot interaction and prediction of sensory and motor states. M1-M36 (Partners involved: UBER (28PM))

• T4.2 Creation of an attention system. M7-M24 (Partners involved: UBER (6PM), INRIA (2PM))

• T4.3 Development of optimal behaviours for event recognition and localisation. M1-M30 (Partners involved: UBER (8PM), INRIA (3PM))

• T4.4 Integrating knowledge from voice dialogue. M1-M12 (Partners involved: ALD (4PM))

• T4.5 Interaction initiated by the robot. M7-M30 (Partners involved: UBER (6PM), ALD (6PM), INRIA (4PM))

The scope of this deliverable is to present a demonstration of Human-Robot Interaction capabilities. Deliverable D4.3 consists of this report, the source code implementing the algorithms described in this document, and a video showing the HRI demonstration. The demonstration of Human-Robot Interaction, as described in detail in the following sections, includes the following robot capabilities: an attention system based on a robot ego-sphere, behaviours for event recognition and localisation, and interaction initiated by the robot using a dialogue system.

2 Interaction scenario

The interaction scenario presented in this deliverable consists of a Nao robot placed on a desk and interacting with one or more participants. The goal is to have intuitive human-robot interaction by equipping the robot with an attention mechanism for detecting salient events for the interaction, such as the face of the interaction participant and the voice or other sounds, and for deciding on which salient event the robot should focus its attention. Figure 1 shows a picture of a typical interaction session.

The target scenario of the EARS project, which will be presented in the final robot demonstrator, extends the current scenario: whenever the robot finds a potential participant by detecting his or her face, it initiates the interaction by starting a dialogue with him or her. The robot can also get distracted by other events, if they are salient enough.

The software delivered together with this document provides the functionalities needed by the attention system (section 3) for communicating with the dialogue manager (implemented by ALD and reported in Deliverable D4.1) and, thus, for initiating or interrupting a dialogue (section 5).

As reported in this document, we carried out a human-robot interaction experiment where the robot interacted with pairs of participants. The behaviours exhibited by the robot emerge from different parametrisations of the processes for habituation and inhibition of salient events (see section 4). These processes, implemented within the attention system, characterise the way the robot reacts to external events.

3 Attention system

Attention is a cognitive skill which lets an individual concentrate on a particular aspect of the environment without interference from its surroundings [Sch14]. Empirical evidence from developmental psychology suggests that the development of skills to understand, manipulate and coordinate attentional behaviour lays the foundation of social interaction and learning [Tom95].

Figure 1: A picture from a typical human-robot interaction experiment.

An important attentional mechanism is saliency detection, a process by which particular items, such as people and objects, are discovered, enabling humans to shift their limited attentional resources to those objects that stand out the most [BSH11]. Being able to orient rapidly towards salient visual events has evolutionary significance, because it allows the individual to detect prey, mates or predators in the external environment as quickly as possible [IK01], besides playing an important role in social interaction and learning [Tom95]. Typical models of saliency detection rely on bottom-up processes based on visual saliency maps, where the visual input is analysed using pre-attentive computations of visual features [IK01]. The visual input is decomposed into a set of topographic feature maps, or saliency maps. Different spatial locations then compete for saliency within each map, and only the locations that stand out persist [IKN+98].

The core study behind this deliverable is an attention system for the humanoid robot Nao based on similar saliency detection processes. In a previous study [BSH11][SBH13], UBER investigated the integration of visual and auditory events into an attention system based on saliency detection and on a robot egosphere. The egosphere is a multi-modal egocentric map which represents salient areas in the robot's surroundings (see Fig. 2). It enables the robot to shift its attention from one salient area to another in a natural fashion. The sphere is centered at the robot's neck coordinate system, while the saliency map of its surroundings is projected onto the sphere's surface. The egosphere [BSH11] also acted as a short-term memory system for visual and auditory events, and recalculated the positions of salient locations in the robot's coordinate system during ego-motion.

As described in this deliverable, the egosphere and attention system implemented by UBER [BSH11] have been improved and extended. Firstly, UBER extended the system so that salient events detected by external algorithms can be projected onto the egosphere. The original system [BSH11] adopted native functions from the Aldebaran NaoQI SDK for detecting faces and for localising sound sources. As reported in this deliverable, substantial integration work has been carried out by UBER with the aim of improving the system's performance in detecting salient events, by adopting algorithms developed by the EARS partners. In particular, the current system interacts with the face detection and visual tracking algorithms implemented by INRIA (for more details on these algorithms, please refer to Deliverable D3.1 "Methodology for the extraction of 3D descriptors based on visual cues") and the sound source localisation algorithms implemented by BGU and IMPERIAL (for more details on these algorithms, please refer to Deliverable D2.1 "Microphone array signal processing for humanoid robots"). Finally, integration between the dialogue system produced by ALD and the egosphere has been accomplished (for more details on the dialogue system, please refer to Deliverable D4.1 "Voice dialogue system").

Figure 2: Illustration of the egosphere for the robot attention system. Note how the projection of the salient event (red point) is transferred onto the closest edge of the tessellated sphere (blue point).

The original egosphere attentional framework [BSH11] was implemented using the Nao Team Humboldt C++ software (NaoTH, http://www.naoteamhumboldt.de/en/). UBER's first task towards the goals of this deliverable consisted in porting the original egosphere/attention code onto Aldebaran's native NaoQI C++ SDK. The main motivation behind this task was to allow for compatibility and easier integration with the work of the other EARS partners.

The approach adopted for implementing the robot egosphere in [BSH11] and in this deliverable is similar to that of Fleming and colleagues [FPB06]: the projection of events onto the surface of the egosphere, and the search space, are reduced by tessellating the sphere and by storing information about salient areas in the edges of the tessellated sphere [Sch14] (see how the projection of the salient event (red point) is transferred onto the closest edge of the tessellated sphere (blue point) in Fig. 2). Such an approach introduces errors in the projection of a salient event, because the projection space is reduced to just a set of nodes on the sphere surface. Alternative approaches have been suggested, for example by Ruesch and colleagues [RLB+08], who adopted a matrix projection of the egosphere, leading to a higher precision of the perception of the world. However, this approach increases the computational complexity, due to the required image transformations and the higher number of arithmetic operations required per iteration [BSH11][Sch14], and it is not well suited for auditory events. While the approach by Ruesch has higher precision, the one adopted here and in [BSH11][SBH13][Sch14] is faster, due to the lower number of arithmetic operations that need to be performed during the projection and the search of salient events.

As explained in [BSH11][SBH13][Sch14], the sphere tessellation is performed by recursive division of the triangular faces of an icosahedron. The same algorithm has been adopted here for creating the tessellated egosphere. By increasing the recursion depth, representing the number of recursive calls to the function implementing the tessellation, each initial face is divided into a larger number of smaller faces, resulting in a higher number of nodes and a smaller projection error for salient events. See [BSH11][SBH13][Sch14] for more details. The implementation used here adopted a recursion depth of 4, resulting in 2562 nodes, with a mean theoretical projection error of 1.15° and a maximum of 2.65°.
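To make the procedure concrete, the following Python sketch builds an icosphere by recursive subdivision (using a midpoint cache so that nodes on shared edges are not duplicated) and projects an event direction onto its closest node; it reproduces the 2562 nodes reported above for a recursion depth of 4. It is only an illustration, not the delivered C++ implementation; the function names and the use of NumPy are our assumptions.

```python
# Minimal sketch (not the delivered implementation): tessellating an icosahedron by
# recursive subdivision and projecting a salient event onto the nearest node.
import numpy as np

def icosahedron():
    """Return the 12 unit vertices and 20 triangular faces of an icosahedron."""
    p = (1.0 + np.sqrt(5.0)) / 2.0  # golden ratio
    verts = [(-1, p, 0), (1, p, 0), (-1, -p, 0), (1, -p, 0),
             (0, -1, p), (0, 1, p), (0, -1, -p), (0, 1, -p),
             (p, 0, -1), (p, 0, 1), (-p, 0, -1), (-p, 0, 1)]
    verts = [np.array(v) / np.linalg.norm(v) for v in verts]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return verts, faces

def tessellate(depth=4):
    """Recursively split each triangle into 4; depth 4 yields 2562 unique nodes."""
    verts, faces = icosahedron()
    cache = {}  # midpoint cache so shared edges reuse the same node

    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            m = verts[i] + verts[j]
            verts.append(m / np.linalg.norm(m))  # push the midpoint back onto the sphere
            cache[key] = len(verts) - 1
        return cache[key]

    for _ in range(depth):
        new_faces = []
        for a, b, c in faces:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
        faces = new_faces
    return np.array(verts), faces

def project_event(nodes, direction):
    """Map a salient event (unit direction in the neck frame) to its closest node."""
    direction = direction / np.linalg.norm(direction)
    idx = int(np.argmax(nodes @ direction))  # largest dot product = smallest angle
    error_deg = np.degrees(np.arccos(np.clip(nodes[idx] @ direction, -1.0, 1.0)))
    return idx, error_deg

nodes, faces = tessellate(depth=4)
print(len(nodes))                                      # 2562 nodes, as reported above
print(project_event(nodes, np.array([0.3, 0.2, 0.93])))
```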

In the original egosphere system, through mechanisms of habituation, inhibition and forgetting of salient areas, the robot was capable of exploring its surroundings, and by finding areas of maximum saliency, it located the next area to be attended [Sch14]. This resulted in the robot exhibiting an exploratory behaviour [SBH13]. As reported in section 4, UBER extended the original behavioural diversity of the robot by manipulating the parameters driving the habituation, inhibition and forgetting processes.

Finally, a tool for runtime visualisation of the egosphere and the salient events projected onto it has also been implemented in Python. Fig. 3 shows a screenshot of the plot during a run of the egosphere system. Fig. 4 shows the locations of sound sources in the egosphere during a simple experiment where two loudspeakers placed at the left and at the right of the robot were simultaneously emitting sounds.

Figure 3: Screenshot of the egosphere visualisation tool written in Python by Paul Schutte for his student project at UBER. The green line points from the origin of the sphere to the front of the robot, indicating where the chest of the robot is facing. The red circle shows the frontal direction of the robot head. Blue circles represent sound source locations, whereas blue triangles represent face locations. The size of the blue circles and triangles represents the saliency of the corresponding event.

The egosphere software has also been improved to be independent of the algorithms adopted for detecting salient events, such as face detectors or sound source localisers. In particular, the egosphere system can handle salient events detected by the native Aldebaran NaoQI SDK as well as by external algorithms, through Python or C/C++ function calls.

As mentioned above, the egosphere can be fed with salient events detected and tracked by external algorithms developed by the EARS partners, such as the face tracker implemented by INRIA. The original egosphere implementation did not include any tracking mechanism for salient events. In other words, a moving person or sound source generated new salient points. However, the egosphere system was still capable of following the person, since a newly generated salient point corresponding to the current position of the subject always had a higher saliency than those corresponding to previous positions. The egosphere system has been improved to store the ID of the tracked event (face or sound source). Current efforts are focusing on implementing an update mechanism that does not create new salient points for a tracked ID that is already present in the egosphere. The tracked ID is produced by the external algorithm that sends the salient event to the egosphere (for example, the face tracking system implemented by INRIA). The mechanism will be adopted in the final demonstrator to be presented in December 2016.
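A minimal sketch of this intended update-by-tracked-ID mechanism is given below; the data structures and method names are illustrative assumptions, since the delivered egosphere module is written in C++.

```python
# Sketch of the intended update mechanism (work in progress at the time of writing):
# an event whose tracked ID is already known updates its node and saliency instead
# of spawning a new salient point. Data structures are illustrative only.

class TrackedSalientPoints:
    def __init__(self):
        self.by_id = {}   # tracked ID -> {"node": node index, "saliency": float}

    def add_or_update(self, tracked_id, node_idx, saliency):
        point = self.by_id.get(tracked_id)
        if point is None:
            self.by_id[tracked_id] = {"node": node_idx, "saliency": saliency}
        else:
            point["node"] = node_idx        # the point follows the tracked face/sound
            point["saliency"] = saliency    # no duplicate salient point is created
```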

In addition, an improved algorithm that defines the initial saliency of an externally generated event has been implemented. For sound events, the initial saliency of a point added to the egosphere is calculated as the weighted sum of the energy and the confidence of the sound, both calculated by the external algorithm for DoA estimation. For face events, the initial saliency is calculated as the weighted sum of the distance of the face to the robot and its confidence, both calculated by the external algorithm implemented by INRIA.
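The following sketch illustrates such a weighted-sum initialisation. The weight values, the normalisation of the face distance (assuming that closer faces are more salient) and the clipping to [0, 1] are assumptions made for illustration; the report does not specify the values used in the delivered system.

```python
# Illustrative sketch of the weighted-sum initial saliency described above.
# Weights, normalisation and clipping are placeholder assumptions.

def initial_saliency_sound(energy, confidence, w_energy=0.5, w_conf=0.5):
    """Initial saliency of a sound event from DoA energy and confidence (both in [0, 1])."""
    return min(1.0, max(0.0, w_energy * energy + w_conf * confidence))

def initial_saliency_face(distance_m, confidence, max_dist=3.0, w_dist=0.5, w_conf=0.5):
    """Initial saliency of a face event: here, closer faces score higher (assumption)."""
    closeness = max(0.0, 1.0 - distance_m / max_dist)   # normalise distance to [0, 1]
    return min(1.0, max(0.0, w_dist * closeness + w_conf * confidence))
```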


Figure 4: Screenshot of the egosphere visualisation plot and the corresponding setup, where two loudspeakers were emitting sounds towards the Nao robot.

4 Behaviours for event recognition and interaction

As anticipated in the previous section, the tessellated egosphere is mainly characterised by the following set of parameters: a recursion depth, which determines the number of nodes in the egosphere during its initialisation, and three parameters that drive the way the saliency of a particular event decays over time: the habituation and inhibition weights and a decay step for the forgetting process.

In previous work [BSH11][SBH13][Sch14], we described how habituation, inhibition and forgetting mechanisms are employed in order to favour shifts of attention to information in new locations. These mechanisms are inspired by the inhibition-of-return mechanism of spatial attention in humans, for which several pieces of evidence can be found in the literature. For example, Posner and colleagues [PRCV85] demonstrated that humans, when generating saccades, tend to inhibit orienting toward visual locations which have been previously attended (inhibition of return).

Habituation, instead, is the process during which the individual gets used to the attended point, which results in a loss of interest in that point [Sch14]. We adopted the original implementation of the egosphere system for this, modelling habituation with the following function:

h(t) = h(t-1) + w_h (1 - h(t-1))    (1)

where h(t) determines how much the system is habituated to that particular salient point at time t (each salient point has its own habituation level), and w_h ∈ [0, 1] is the habituation weight, which determines how fast the habituation to a certain salient location increases. However, the original inhibition and habituation processes [BSH11][SBH13][Sch14] have been improved. Each salient point is characterised by a saliency value s(t) ∈ [0, 1] which, independently of the habituation and inhibition processes, decreases over time by a small saliency decay factor w_s ∈ [0, 1]:

s(t) = s(t-1) - w_s    (2)

When s(t) falls below 0, it is set to 0. When the habituation level for a certain salient point exceeds a predefined habituation threshold, the salient point starts getting inhibited. Inhibition is implemented as a factor i(t) that multiplies the saliency level:

s_tot(t) = s(t) · i(t)    (3)

where s_tot(t) is the total saliency of a particular event. When the habituation threshold is reached, the inhibition factor i(t) decreases over time as follows:


i(t) = i(t-1) - w_id    (4)

where w_id ∈ [0, 1] is the inhibition decrease weight. Put simply, the more the system is habituated to a salient event, the smaller the inhibition factor becomes, making the total saliency of the node decrease faster than it would otherwise. When s_tot(t) reaches 0, the current salient point is removed, thus producing a shift of attention to the next most salient point.

Moreover, unlike in the original implementation [Sch14], habituation to a salient point increases only when that salient point is within the field of view of the robot camera. In the egosphere system, a salient point is within the field of view of the robot camera whenever its distance to the focussed sphere node (the node closest to the centre of the robot's field of view) is below a threshold. If a salient point is outside of the field of view of the robot camera, then its habituation level decreases and its inhibition factor increases according to the following rules:

h(t) = h(t-1) - w_h (1 - h(t-1))    (5)

and

i(t) = i(t-1) + w_ii (1 - i(t-1))    (6)

where w_ii ∈ [0, 1] is the inhibition increase weight. Note that i(t) ∈ [0, 1] and h(t) ∈ [0, 1].
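The Python sketch below shows how these update rules (Eqs. (1)-(6)) might be combined for a single salient point at each time step. It is an illustration rather than the delivered C++ code; the habituation threshold and w_ii are placeholder values, and the remaining default weights loosely follow the Exploration row of Table 1.

```python
# Illustrative per-step update of one salient point, following Eqs. (1)-(6).
# Not the delivered C++ egosphere code; threshold and default weights are placeholders.
from dataclasses import dataclass

@dataclass
class SalientPoint:
    s: float = 1.0      # saliency s(t), in [0, 1]
    h: float = 0.0      # habituation level h(t), in [0, 1]
    i: float = 1.0      # inhibition factor i(t), in [0, 1]

def update(p, in_fov, w_s=0.05, w_h=0.4, w_id=0.05, w_ii=0.05, h_threshold=0.8):
    """Advance one salient point by one time step and return its total saliency."""
    p.s = max(0.0, p.s - w_s)                       # Eq. (2): forgetting (saliency decay)
    if in_fov:
        p.h = p.h + w_h * (1.0 - p.h)               # Eq. (1): habituation while attended
        if p.h > h_threshold:
            p.i = max(0.0, p.i - w_id)              # Eq. (4): inhibition once habituated
    else:
        p.h = max(0.0, p.h - w_h * (1.0 - p.h))     # Eq. (5): habituation decays out of view
        p.i = min(1.0, p.i + w_ii * (1.0 - p.i))    # Eq. (6): inhibition recovers towards 1
    return p.s * p.i                                # Eq. (3): total saliency s_tot(t)

# A point is removed, and attention shifts to the next most salient point,
# once update(...) returns 0.
```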

We investigated the implementation of different interaction behaviours using different parametrisations of the aforementioned processes, as reported in a student project carried out by Mr. Paul Schutte at UBER [Sch16]. In particular, we observed how different interaction behaviours could be generated by the robot when setting different habituation weights and inhibition decrease weights. We implemented a set of behaviours, as shown in Table 1.

Table 1: Behaviours and the corresponding saliency decrease weight w_s, habituation weight w_h and inhibition decrease weight w_id. Different values are set depending on the type of salient event: face or sound. Some behaviours require that the system does not create salient points for a specific event type; this is implemented by initialising the total saliency s_tot of that event to 0.

Behaviour                              |        faces        |        sounds
                                       | w_s    w_h    w_id  | w_s    w_h    w_id
Exploration                            | 0.05   0.4    0.05  | 0.05   0.4    0.05
Focus mostly on faces                  | 0.05   0.04   0.05  | 0.25   0.4    0.05
Focus mostly on sounds                 | 0.25   0.4    0.05  | 0.05   0.04   0.05
Exploration with slow reaction times   | 0.05   0.04   0.05  | 0.05   0.04   0.05
Look only at faces                     | 0.05   0.04   0.05  | s_tot = 0.0
Look only at sounds                    | s_tot = 0.0         | 0.05   0.04   0.05

An interaction experiment has been carried out, as reported in [Sch16], where participants were asked to judge the robot behaviour after having interacted with it. Fig. 1 shows a picture from the experiment. The experiment consisted of the Nao robot placed on a desk in front of two participants. Participants were asked to interact with the robot and to try to catch its attention. Four sessions were carried out for each pair of participants, in which four different robot behaviours were executed: (A) Exploration, (B) Look only at faces, (C) Look only at sounds, (D) Exploration with expressive feedback. In behaviours (A) and (D), the robot's attention is shifted towards both faces and sound source locations. In behaviour (D), however, the robot also exhibits expressive feedback, consisting of short random utterances in reply to strong sounds, and of changing eye-LED colours according to the type of the current most salient event (green for faces, dark blue for sounds, light blue if there are no salient events).

Note: Nao v5 was used for this experiment. Current efforts are focused on reproducing the experiment with the 12-microphone Nao head developed within the EARS project and on comparing the results with the previous experiment. The motivation for carrying out the experiment with the standard Nao v5 head is that, at the time of the experiment, the system integration (egosphere using INRIA's face detector and BGU's sound source localiser, and the synchronisation system running onboard with the Lab Streaming Layer communication protocol) was not yet completed.

Each session lasted 3 minutes. The order of the four behaviours, and thus of the four sessions, was randomised for each participant pair. After each session, both participants were asked to fill in a questionnaire about their perception of the robot behaviour. Questionnaires are often used to measure the user's attitude in human-robot interaction experiments. We adopted part of the Godspeed questionnaire [BKCZ09], which uses semantic differential scales for evaluating the attitude towards the robot. The questionnaire contains questions (variables) about five concepts (latent variables): Anthropomorphism, Animacy, Likeability, Perceived Intelligence and Perceived Safety (for a detailed description and the set of questions, please refer to [BKCZ09]).

Participants were asked the following questions after each session:

Please assess your impression of the robot on these scales:

Q1. Inert 1 - 2 - 3 - 4 - 5 Interactive

Q2. Apathetic 1 - 2 - 3 - 4 - 5 Responsive

Q3. Unintelligent 1 - 2 - 3 - 4 - 5 Intelligent

Q4. Unpleasant 1 - 2 - 3 - 4 - 5 Pleasant

A statistical analysis has been performed on the responses of the participants to the questionnaires [Sch16]. 26 participants performed the experiment (in pairs, resulting in 13 experimental runs in total). Each participant filled in four questionnaires, each related to one of the four robot behaviours mentioned above.

First, we checked whether the distributions of the collected data were normal, in order to select the proper statistical tests. As in [SBH13], for each variable (that is, for each question) we looked at the superimposition of the histogram of the data with a normal curve characterised by the mean and the variance of the data. The histograms did not fit the corresponding normal curves well; therefore, although we did not check the kurtosis and skewness of the data, which would have given a more precise measurement of the normality of the distributions, we assumed that the distributions are not normal. As argued in [SBH13], some researchers claim that only non-parametric statistics should be used on Likert-scale data and when the normality assumption is violated. Nonetheless, we ran a repeated measures ANOVA (which compares the average score of a single group of subjects at multiple time periods, or observations), with post-hoc tests using Bonferroni correction, for the analysis of variances of the data, since, as discussed in [SBH13], other authors [VFTR10] claim that repeated measures ANOVA is robust towards violation of the normality assumption. Schmider and colleagues found in their Monte Carlo study that the empirical Type I and Type II errors in ANOVA were not affected by the violation of assumptions [SZD+10].

We performed the tests on the four variables (four questions). As described in [SBH13], Mauchly's test has been used as the statistical test for validating the repeated measures ANOVA. It tests sphericity, which is related to the equality of the variances of the differences between levels of the repeated measures factor. Sphericity, an assumption of repeated measures ANOVA, requires that the variances of each set of difference scores are equal. Sphericity cannot be assumed when the significance level of Mauchly's test is less than 0.05. Violations of the sphericity assumption can invalidate the analysis conclusions, but corrections can be applied to alter the degrees of freedom in order to produce a more accurate significance value, such as the Greenhouse-Geisser correction. When the significance level of the Greenhouse-Geisser estimate is less than 0.05, statistically significant differences revealed by the post-hoc tests can be elicited from the pairwise comparisons between the observations. Repeated measures ANOVA does not tell where the differences between groups lie. As described in [SBH13] and [Sch14], when the repeated measures ANOVA is statistically significant (either with the sphericity assumption not violated or with the Greenhouse-Geisser correction), post-hoc tests with multiple comparisons can highlight exactly where these differences occur.
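As an illustration of this analysis pipeline, the sketch below runs Mauchly's test, a repeated measures ANOVA with Greenhouse-Geisser correction and Bonferroni-corrected post-hoc comparisons. It assumes the questionnaire responses are available in a long-format table and that the pandas and pingouin packages (a recent pingouin release) are installed; the file name and column names are hypothetical, and this is not the original analysis script from [Sch16].

```python
# Illustrative analysis pipeline, not the original analysis from [Sch16]:
# Mauchly's sphericity test, repeated measures ANOVA with Greenhouse-Geisser
# correction, and Bonferroni-corrected post-hoc pairwise comparisons.
import pandas as pd
import pingouin as pg

# Hypothetical long format: one row per participant x behaviour x question.
df = pd.read_csv("questionnaire_responses.csv")   # columns: participant, behaviour, question, score

for question in ["Q1", "Q2", "Q3", "Q4"]:
    data = df[df["question"] == question]

    # Mauchly's test of sphericity.
    spher = pg.sphericity(data, dv="score", within="behaviour", subject="participant")
    print(question, "sphericity:", spher)

    # Repeated measures ANOVA; correction=True also reports the
    # Greenhouse-Geisser corrected p-value when sphericity is violated.
    aov = pg.rm_anova(data=data, dv="score", within="behaviour",
                      subject="participant", correction=True, detailed=True)
    print(aov)

    # Post-hoc pairwise comparisons with Bonferroni correction.
    posthoc = pg.pairwise_tests(data=data, dv="score", within="behaviour",
                                subject="participant", padjust="bonf")
    print(posthoc)
```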


Table 2: Average scores of the questions.

Behaviour                                    Q1     Q2     Q3     Q4
(A) Exploration                              3.23   3.46   3.15   3.42
(B) Look only at sounds                      2.88   3.04   2.42   2.96
(C) Look only at faces                       2.92   3.08   2.62   3.15
(D) Exploration with expressive feedback     3.88   3.69   3.38   3.65

Table 2 shows the means of the variables corresponding to the four questions. In general, the average score given by participants to all questions was higher during the (A) Exploration behaviour than during the (B) Look only at sounds or (C) Look only at faces behaviours. This suggests that multimodal interaction, as in the (A) Exploration behaviour, where both face and sound events are fed into the egosphere, is perceived better by participants. (D) Exploration with expressive feedback was scored on average higher than (A) Exploration for all questions, and was thus the best perceived behaviour. However, statistically significant differences have been observed only for Q1, Q3 and Q4, and only between behaviours (B) and (D) and between (C) and (D), with the exception of Q3, for which a statistically significant difference has also been observed between behaviours (A) and (B). For more details, see [Sch16].

5 Interaction initiated by the robot

This section briefly introduces the work that has been carried out by UBER on the interaction between the egosphere system and the dialogue system implemented by ALD and reported in Deliverable D4.1 "Voice dialogue system".

The egosphere system has been adapted so that it can communicate with the dialogue system. In particular, the egosphere can export the list of salient points (together with their positions, their saliency and their tracked IDs, if available) when the dialogue system requests it. Salient points and their IDs are given to the egosphere by the external algorithms implemented by the EARS partners for face and sound source detection and tracking. The egosphere system makes sure that, if a salient point with a specific tracked ID moves from one sphere node to another, the old salient points are removed. In this way, the egosphere keeps track of salient events (faces or sound sources) over time.

Once it has received the list of salient events with tracked IDs (the egosphere can also store salient events that do not have any tracked ID), the dialogue system applies a threshold to their saliency to determine which ones are interesting for interaction. If the resulting list is not empty, the tracked ID corresponding to the most salient point within this list is selected as the subject to interact with, and the interaction through dialogue is initiated. To maintain the focus of attention on this location, the dialogue system sends the egosphere a request to reinforce (increase) the saliency of the salient point that has that specific tracked ID.

During the dialogue with the interaction partner, the dialogue system maintains an updated list of the most salient points by requesting it from the egosphere system. If another source appears, its saliency has to exceed a second, higher saliency threshold in order to interrupt the dialogue and be selected as the salient point to attend.

Other, non-tracked events can also trigger the egosphere system to move the robot's focus of attention somewhere else, if their saliency is higher than that of the points that have been reinforced by the dialogue system.
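The sketch below illustrates the dialogue-side selection and interruption logic described in this section. The function, the threshold values and the final reinforcement call are hypothetical; the actual interfaces belong to the ALD dialogue system (D4.1) and the delivered C++ egosphere module.

```python
# Hypothetical sketch of the dialogue-side selection logic described above.
# Function names and thresholds are illustrative placeholders.

START_THRESHOLD = 0.6       # saliency needed to start a dialogue (placeholder value)
INTERRUPT_THRESHOLD = 0.9   # higher saliency needed to interrupt an ongoing dialogue

def select_interaction_target(salient_points, current_target_id=None):
    """salient_points: list of dicts {'id': tracked ID or None, 'saliency': float, 'node': int}."""
    threshold = INTERRUPT_THRESHOLD if current_target_id is not None else START_THRESHOLD
    tracked = [p for p in salient_points
               if p["id"] is not None and p["saliency"] >= threshold]
    if not tracked:
        return current_target_id                      # keep talking to the current partner
    best = max(tracked, key=lambda p: p["saliency"])  # most salient candidate wins
    return best["id"]

# The dialogue system would then ask the egosphere to reinforce the saliency of the
# point carrying the selected tracked ID, e.g. egosphere.reinforce(selected_id)
# (hypothetical call).
```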

This system is at the basis of the human-robot interaction session that will be presented in the final demonstrator (reported in Deliverable D5.4 "Evaluation of humanoid robot demonstrators") at the EARS meeting in Erlangen in December 2016.


The source code of the software implementing these functionalities is provided with this deliverable. Note that the delivered software does not include all the components necessary to run this interaction experiment, as it also requires the software delivered by ALD in D4.1 and the latest implementations by the EARS partners of the interfaces between their algorithms (face tracking by INRIA and sound source localisation by BGU and IMPERIAL) and the egosphere system. In this deliverable, only the following software, with the corresponding documentation, is released:

- The main egosphere system (see sections 3 and 4);

- The egosphere visualisation utility (see section 3);

- The latest version of the audio-video-motor synchronisation software and the software for data exchange between the egosphere module and the algorithms implemented by the EARS partners (see section 6).

6 Synchronisation and communication protocol

A considerable amount of work has been carried out by UBER for synchronising the different data streams before they are processed by the algorithms implemented by the EARS partners. As reported in Deliverable D4.2 "Methodology and software for a computational internal model", UBER implemented an audio-video-motor synchronisation mechanism on top of the NAOqi-Modularity frameworks provided by Aldebaran Robotics (see Deliverable D5.1 "Architecture for Audio Integration"). Modularity allows for the combination of asynchronous data collection and data processing using filter chains. In particular, it allows algorithms to be developed as sequences of filters, each implementing a specific functionality, that exchange input and output signals with each other. UBER extended the functionalities provided by Modularity and implemented a set of filters for gathering audio, visual and motor data from the robot, for aligning asynchronous data into fixed buffers, and for the training of internal models and ego-noise classification. UBER proposed the usage of the USB Linux-compatible sound card Cymatic Live Recorder LR-16, which can provide up to 16 input channels to a PC. The card is currently adopted by all the EARS partners as the interface for gathering the input signals from the 12 microphones of the robot head developed within EARS.

First efforts consisted in recompiling the Nao's operating system in order to detect and support such a USB sound card. UBER carried out this task in collaboration with ALD. The synchronisation process presented in Deliverable D4.2 has then been extended to collect audio, visual and motor data onboard the robot, including data from the Cymatic LR-16 card, into a buffer that is post-processed and segmented into smaller chunks.

This synchronised data is then broadcast to the network the robot is connected to, so that it can be sent to external algorithms running on different computers, for example the face detection and tracking algorithms (INRIA) and the DoA estimation algorithms (BGU and IMPERIAL). The communication protocol (delivered in the software accompanying this report) has been implemented on top of the Lab Streaming Layer (LSL). LSL is an open-source system for the unified collection of measurement time series in research experiments that handles the networking, time synchronisation and (near-) real-time access, as well as, optionally, the centralised collection, viewing and disk recording of the data (https://github.com/sccn/labstreaminglayer). C++ interfaces to read and write data into the LSL streams, and C++/Matlab interfaces to read such data from external computers, have been developed by UBER and the other EARS partners.
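The snippet below illustrates the LSL publish/subscribe pattern used here, written with the pylsl Python bindings for brevity (the delivered interfaces are C++ and Matlab). The stream name, channel count, sampling rate and source ID are illustrative assumptions.

```python
# Illustration of the LSL streaming pattern described above, using the pylsl bindings.
# Stream name, channel count, sampling rate and source ID are assumptions.
from pylsl import StreamInfo, StreamOutlet, StreamInlet, resolve_byprop

# --- Robot side: publish synchronised multichannel audio chunks -----------------
info = StreamInfo(name="EARS_audio", type="Audio", channel_count=12,
                  nominal_srate=48000, channel_format="float32",
                  source_id="nao-head-mics")             # hypothetical identifiers
outlet = StreamOutlet(info)
# outlet.push_chunk(chunk)  # chunk: list of 12-channel samples from the sync buffer

# --- External computer: subscribe and feed the DoA / face-tracking algorithms ---
streams = resolve_byprop("name", "EARS_audio", timeout=5.0)
if streams:
    inlet = StreamInlet(streams[0])
    samples, timestamps = inlet.pull_chunk(timeout=1.0)
    # 'samples' would then be handed to the sound source localisation algorithm.
```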

The computers running the C++ face detection and tracking algorithms (INRIA) and the Matlab scripts for DoA estimation (BGU, IMPERIAL and FAU) can thus read from the LSL streams the audio-visual-motor data that has previously been synchronised onboard the robot. These algorithms can process the data and send salient events, such as tracked faces or detected sound source locations, to the egosphere running on the robot in two ways:


- using C++ NaoQI interfaces: specific callback functions have been implemented in the egosphere system, so that external algorithms can send salient events or read the current state of the egosphere system at runtime (functionality needed, for example, by the ALD dialogue system);

- using a Python NaoQI interface that reads the result of the DoA estimation from the computer running the Matlab script and sends it to the egosphere running onboard the robot.

Regarding the Python interface, UBER implemented a script that reads the output of the algorithm for sound source localisation implemented by BGU and IMPERIAL in T2.1 on the Benchmark I Nao head. Sound source locations are stored by the Matlab script into the Map Object representations created by IMPERIAL. The Python script reads the data from the Map Objects stored in files on the PC running the Matlab algorithm for sound source localisation, and sends the detected sound source locations to the egosphere NAOqi module (written in C++) running on the robot.
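A sketch of such a bridge is shown below. The file format, the polling loop, the module name "EgoSphere" and the method addSoundEvent are assumptions made purely for illustration; the actual Map Object layout is defined by IMPERIAL and the actual egosphere interface by the delivered C++ module.

```python
# Hypothetical sketch of the bridge described above: read sound source locations
# produced by the Matlab DoA script and forward them to the egosphere NAOqi module.
# File format, module name ("EgoSphere") and method name ("addSoundEvent") are
# assumptions; the real Map Object layout is defined by IMPERIAL.
import json
import time
from naoqi import ALProxy   # Aldebaran's Python NAOqi SDK

ROBOT_IP, ROBOT_PORT = "192.168.1.10", 9559              # placeholder address
egosphere = ALProxy("EgoSphere", ROBOT_IP, ROBOT_PORT)   # hypothetical custom module

while True:
    # Assume the Matlab script dumps the latest Map Object content as JSON.
    with open("map_object_latest.json") as f:
        sources = json.load(f)  # e.g. [{"az": 0.4, "el": 0.1, "energy": 0.7, "conf": 0.9}, ...]
    for src in sources:
        # Hypothetical call: azimuth/elevation plus energy and confidence.
        egosphere.addSoundEvent(src["az"], src["el"], src["energy"], src["conf"])
    time.sleep(0.1)             # poll the file periodically
```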

Fig. 5 illustrates the interaction between the Map objects, developed in Matlab by BGU and IMPERIAL, and the egosphere, developed by UBER.

Figure 5: Integration of EARO, Map object and egosphere.

7 Conclusions

This deliverable presented functionalities for a human-robot interaction scenario, as described in section 2. Several challenges have been addressed and solved. Firstly, synchronisation of audio, video and motor data has been implemented as a real-time process running onboard, using a 16-channel sound card plugged into the robot. A data communication system has been implemented for delivering the synchronised data over the network the robot is connected to. Synchronisation and data communication are fundamental features for the external algorithms implementing face and sound source detection and tracking, as well as for other algorithms, such as those for learning and predicting robot ego-noise, as described in Deliverable D4.2.

We also showed how an attention system that stores salient auditory and visual events in a short-term memory system, represented as an egosphere, can drive natural behaviours in a human-robot interaction scenario. Different behaviours can be generated simply by manipulating the processes of habituation and inhibition of salient events. Finally, we presented a human-robot interaction experiment in which participants were asked to interact with the robot and to score the quality of the interaction, using questionnaires, under different conditions in which the robot exhibited different behaviours emerging from the aforementioned processes.


References

[BKCZ09] C. Bartneck, D. Kulic, E. Croft, and S. Zoghbi, “Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots,” International Journal of Social Robotics, vol. 1, no. 1, pp. 71–81, 2009.

[BSH11] S. Bodiroza, G. Schillaci, and V. Hafner, “Robot ego-sphere: An approach for saliency detection and attention manipulation in humanoid robots for intuitive interaction,” in 11th IEEE-RAS International Conference on Humanoid Robots (Humanoids), Oct 2011, pp. 689–694.

[FPB06] K. A. Fleming, R. A. Peters, and R. E. Bodenheimer, “Image mapping and visual attention on a sensory ego-sphere,” in 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2006, pp. 241–246.

[IK01] L. Itti and C. Koch, “Computational modelling of visual attention,” Nature Reviews Neuroscience, vol. 2, no. 3, pp. 194–203, 2001.

[IKN+98] L. Itti, C. Koch, E. Niebur et al., “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.

[PRCV85] M. I. Posner, R. D. Rafal, L. S. Choate, and J. Vaughan, “Inhibition of return: Neural basis and function,” Cognitive Neuropsychology, vol. 2, no. 3, pp. 211–228, 1985.

[RLB+08] J. Ruesch, M. Lopes, A. Bernardino, J. Hornstein, J. Santos-Victor, and R. Pfeifer, “Multimodal saliency-based bottom-up attention: a framework for the humanoid robot iCub,” in 2008 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 962–967.

[SBH13] G. Schillaci, S. Bodiroza, and V. V. Hafner, “Evaluating the effect of saliency detection and attention manipulation in human-robot interaction,” International Journal of Social Robotics, vol. 5, no. 1, pp. 139–152, 2013.

[Sch14] G. Schillaci, “Sensorimotor learning and simulation of experience as a basis for the development of cognition in robotics,” Ph.D. dissertation, Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät II, 2014.

[Sch16] P. Schutte, “Audiovisuelle, nonverbale Kommunikation zwischen Menschen und Maschine,” Studienarbeit, Mathematisch-Naturwissenschaftliche Fakultät, Institut für Informatik, 2016.

[SZD+10] E. Schmider, M. Ziegler, E. Danay, L. Beyer, and M. Bühner, “Is it really robust? Reinvestigating the robustness of ANOVA against violations of the normal distribution assumption,” Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, vol. 6, p. 147, 2010.

[Tom95] M. Tomasello, “Joint attention as social cognition,” Joint attention: Its origins and role in development, pp. 103–130, 1995.

[VFTR10] G. Vallejo, M. P. Fernandez, E. Tuero, and P. E. L. Rojas, “Analyzing repeated measures using resampling methods,” Anales de Psicología/Annals of Psychology, vol. 26, no. 2, pp. 400–409, 2010.