Combining first-person and third-person gaze for attention recognition

Francis Martinez, Andrea Carbone and Edwige Pissaloux
ISIR, Université Pierre et Marie Curie - CNRS UMR 7222

{martinez,carbone,pissaloux}@isir.upmc.fr

Abstract— This paper presents a method to recognize attentional behaviors from a head-mounted binocular eye tracker in triadic interactions. By taking advantage of the first-person view, we simultaneously estimate the first-person and third-person gaze. The first-person gaze is computed using an appearance-based method relying on local features. In parallel, head pose tracking allows determining the coarse gaze of the people in the scene camera. Finally, knowing the first- and third-person gaze directions, scores are computed that allow assigning attention patterns to each frame. Our contributions are the following: (i) head pose estimation based on localized regression, (ii) attention analysis, in particular of mutual and shared gaze, including the first-person gaze, and (iii) experiments conducted using a head-mounted appearance-based gaze tracker. Experiments on recorded data show encouraging results.

I. INTRODUCTION

Visual analysis of social interaction from non-verbal cues is still an important and challenging topic. As powerful non-verbal signals, head pose and eye gaze have been at the center of various research studies to understand human behaviors [1]. Indeed, gaze conveys rich information about what people attend to, the task being performed and their intentions. Among these studies, researchers have shown that gaze serves different functions [2] and, for example, helps people regulate conversational sequencing in group meetings by relying on, amongst others, eye contact (i.e. mutual gaze) mechanisms.

In order to provide means for attention analysis, visual gaze estimation [3] has been a pioneering research topic over the past decades, and diverse solutions have been proposed to ease the gathering of gaze data from either the head or the eyes, depending on the distance to the camera. Furthermore, different configurations are possible and depend on the application (human-computer interaction, psychological research, market analysis) and its constraints, e.g. gaze accuracy, mobility, intrusiveness and so on. Specifically, we distinguish between two categories: (1) external or (2) egocentric viewpoint.

External viewpoint. Head pose estimation has been investigated in several applications, including visual surveillance [4], meetings [5] and retail stores [6]. Marin-Jimenez et al. [7] proposed a pipeline to detect people looking at each other in TV series. Their approach relies on a tracking-by-detection scheme where upper bodies and heads are detected and grouped into tracks. Head orientations are then estimated using Gaussian Process Regression (GPR) and a score is computed to determine whether people are looking at each other.

Egocentric viewpoint. More recently, first-person vision has increasingly attracted researchers due to the subjective nature of recorded data, which helps in understanding different aspects of human behavior, namely through first-person activity recognition using scene context [8], gaze [9], or both scene context and gaze [10]. Among them, Fathi et al. [11] proposed a method to study day-long social interactions from an egocentric viewpoint. By combining patterns of human attention with first-person head motion, a Hidden Conditional Random Field (HCRF) is trained to recognize the type of social interaction (dialogue, discussion, monologue) while the person is standing or walking. In a different way, Ye et al. [12] presented a method to detect moments of eye contact from a commercial wearable gaze tracker.

In parallel, some other works have addressed the issue of estimating the first-person gaze from wearable systems. They can be divided into two main approaches: feature-based [13] and appearance-based [14] methods. In the former, geometric eye features (e.g. the iris) serve to estimate the gaze direction, while the latter directly computes a mapping from a high-dimensional feature vector (e.g. an eye image) to a low-dimensional gaze space.

In this paper, we propose to estimate the gaze direction of the person wearing the system and of the persons appearing in the scene camera in order to temporally classify individual frames into specific attention patterns. Moreover, by doing so, we also show the feasibility of using an appearance-based gaze estimator for attention analysis. To achieve this goal, the workflow of our approach proceeds as follows (Fig. 1):

• Eye cameras - Appearance-based gaze estimation based on multi-level Histograms of Oriented Gradients (HOG) and Relevance Vector Regression (RVR).

• Scene camera - Head pose tracking using a probabilistic tracker and localized regression.

• Attention analysis by leveraging both the first-person and the third-person gaze.

The remainder of the paper is organized as follows. In Section II, we provide details on our methodology to estimate the first- and third-person gaze and to recognize the type of gaze patterns. Section III presents some experimental results on head pose estimation and attention analysis, while Section IV concludes the paper.

II. PROPOSED APPROACH

We proceed in two steps by exploiting the images provided by the eye and the scene cameras.


Fig. 1. Overview of the proposed approach: (a) Extraction of local appearance features from eye images, (b) First-person calibration to build an offline appearance gaze model, (c) Inferring first-person gaze, (d) Probabilistic head tracking, (e) Head pose estimation using localized regression, (f) Recognition of attention patterns from people's gaze.

First, we extract appearance features from the eye images and we estimate the first-person gaze by learning sparse kernel regressors. In order to estimate the third-person gaze direction, we employ a tracking-based approach combined with continuous head pose estimation. Finally, attention scores are computed based on the first- and third-person gaze and allow recognizing the attention pattern of each individual and, more generally, of the triad.

A. First-person gaze estimation

To infer the first-person gaze, we follow the same approach as in [15], relying on appearance-based gaze estimation. A custom-built head-mounted eye tracker allows collecting left/right eye and scene images from the subject. Local appearance features, namely multi-level HOG, are then extracted in order to estimate the gaze; this avoids the difficult task of accurately and robustly extracting specific eye features, such as the iris, which is usually affected by eyelid occlusion. Moreover, it avoids the use of active illumination, namely infrared light.

Eye appearance features $x^e$. Given the location of the eye corners¹, eye images are extracted and aligned. The images are then resized to 80 × 120 and HOG features [16] are computed over a grid of 1×2, 3×1, 3×2 and 6×4 blocks. For each block, $N_b = 9$ signed-orientation bin histograms are built within 2×2 cells. Features are extracted for both eyes and concatenated into a 2520-dimensional feature vector.

Appearance-based gaze estimation. Given a subject dataset $\mathcal{D}_e = \{(x^e_i, g^e_i),\ i = 1 \ldots N_g\}$, where $x^e_i$ are the eye appearance features and $g^e_i$ the corresponding coordinates in the gaze space², we learn the RVR functional [17]:

$$g^e_i = \langle \mathbf{w}, \phi(x^e_i) \rangle + \varepsilon_i \qquad (1)$$

where $\varepsilon_i \sim \mathcal{N}(0, \sigma_e^2)$ is the data-driven noise term and the basis matrix $\phi$ is built upon a Radial Basis Function (RBF) kernel:

$$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\|\mathbf{x} - \mathbf{x}'\|_2^2 / r^2\right) \qquad (2)$$

with $r$ being the basis width.

¹The positions are manually determined in one calibration frame.
²A separate function is learnt for each output dimension, $g^e_x$ and $g^e_y$.
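As a rough illustration of this step, the sketch below extracts HOG features from a resized eye crop and fits a kernel regressor on calibration data. It assumes scikit-image and scikit-learn are available; KernelRidge with an RBF kernel (cf. Eq. (2)) stands in for the RVR of [17], and the paper's multi-level HOG grid is only approximated by concatenating descriptors computed at several cell sizes.

```python
# Hedged sketch of the appearance-based gaze mapping (Sec. II-A).
# KernelRidge replaces RVR; cell sizes and regularization are illustrative.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.kernel_ridge import KernelRidge

def eye_features(eye_img, levels=((40, 60), (26, 40), (13, 20))):
    """Concatenate HOG descriptors at several cell sizes (~multi-level HOG)."""
    img = resize(eye_img, (80, 120))              # eye crops resized to 80x120 as in the paper
    feats = [hog(img, orientations=9, pixels_per_cell=c,
                 cells_per_block=(2, 2), feature_vector=True) for c in levels]
    return np.concatenate(feats)

def fit_gaze_model(calib_eye_imgs, calib_gaze_xy, r=10.0):
    """Map eye appearance to 2D gaze coordinates from calibration samples."""
    X = np.stack([eye_features(im) for im in calib_eye_imgs])
    model = KernelRidge(kernel="rbf", gamma=1.0 / r**2, alpha=1e-3)  # gamma = 1/r^2, cf. Eq. (2)
    model.fit(X, np.asarray(calib_gaze_xy))       # calib_gaze_xy: (N_g, 2) targets
    return model

# usage: g_hat = fit_gaze_model(eyes, gaze).predict(eye_features(new_eye)[None])
```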

B. Third-person gaze estimation

Head pose tracking is still a challenging task due to the sensitivity of both tracking and pose estimation. We propose to use an online tracking-based approach to tackle this problem in real time.

Probabilistic head tracking. For each person, an online individual tracker is initialized. We employ a state-of-the-art tracker, namely the incremental visual tracking (IVT) algorithm proposed by Ross et al. [18]. Furthermore, in order to reduce drift, a near-frontal face detector combined with skin detection is integrated into the tracking process and is called every k frames (k = 5 in our implementation).

State model. The state³ is modeled by $s = \{x, y, s, a\}$, where $(x, y)$ stands for the center coordinates of the head bounding box, $s$ for the scale and $a$ for the aspect ratio. The last two terms are updated at the same time as the face detector.

Dynamical model. The proposal distribution is modeled by the following mixture of Gaussian distributions:

$$q(s_t \mid s_{t-1}, z_t) = (1 - \gamma)\, p(s_t \mid s_{t-1}) + \gamma\, q_d(s_t \mid z_t) \qquad (3)$$

with:

$$q_d(s_t \mid z_t) = \mathcal{N}(s_t;\, s_d, \Sigma_d) \qquad (4)$$

where the mixture coefficient $\gamma$ is set to 0.3, $s_d$ denotes the closest face bounding box and $\mathcal{N}(x; \mu, \Sigma)$ is the normal probability density function.

Observation model. The model is identical to [18], except that it is also updated when a face is detected.

Localized continuous head pose estimation. To estimate the head pose, we use Localized Multiple Kernel Regression (LMKR), proposed by Gonen & Alpaydin [19]. In contrast to single-kernel regression methods, LMKR can capture the local structure of the data by combining kernel functions with data-dependent kernel weights and thus acts similarly to a mixture of experts [20].

³We do not rely on the full state model as implemented in [18].
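As an illustration of the detector-guided proposal of Eqs. (3)-(4), the sketch below samples particle states $(x, y, \text{scale}, \text{aspect})$: with probability $1-\gamma$ a particle follows a Gaussian random walk $p(s_t \mid s_{t-1})$, and with probability $\gamma$ it is drawn around the closest detected face box. The standard deviations and function names are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the mixture proposal of Eqs. (3)-(4); parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def propose(particles, face_box, gamma=0.3,
            walk_std=(4.0, 4.0, 0.02, 0.01), det_std=(2.0, 2.0, 0.01, 0.01)):
    """Sample new particle states from q(s_t | s_{t-1}, z_t)."""
    n = len(particles)
    walk = particles + rng.normal(0.0, walk_std, size=(n, 4))   # random walk p(s_t | s_{t-1})
    if face_box is None:                                        # detector not run this frame
        return walk
    det = face_box + rng.normal(0.0, det_std, size=(n, 4))      # detection component q_d(s_t | z_t)
    use_det = rng.random(n) < gamma                             # pick the component per particle
    return np.where(use_det[:, None], det, walk)
```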


Given a training set $\{x^h_i, g^h_i\}_{i=1}^{N}$, where $x^h_i \in \mathbb{R}^d$ is the head pose feature vector (see Section III-A) and $g^h_i \in \mathbb{R}^2$ the head pose output given by the yaw $\alpha_i$ and the pitch $\beta_i$, we learn a discriminant function (one for each output dimension):

$$g^h_i = \sum_{m=1}^{P} \eta_m(x^h_i \mid V)\, \langle \mathbf{w}_m, \Phi_m(x^h_i) \rangle + b \qquad (5)$$

where $P$ defines the number of feature spaces and $\eta_m(\cdot \mid V)$ is a gating function corresponding to the feature space $m$, with parameters $V$. Due to the non-convex nature of the joint optimization, a two-step alternating procedure is employed to learn both the support vector coefficients and the gating model parameters. The support vector coefficients are learned by fixing the gating model, while the gating model is updated by computing $\partial J(V)/\partial V$ followed by a gradient-descent step (with $J(V)$ being the objective function of the dual problem). As a gating model, we use a linear softmax function:

$$\eta_m(x \mid V) = \frac{\exp(\langle \nu_m, \psi(x) \rangle + \nu_{m0})}{\sum_{h=1}^{P} \exp(\langle \nu_h, \psi(x) \rangle + \nu_{h0})} \qquad (6)$$

where $V = \{\nu_m, \nu_{m0}\}_{m=1}^{P}$ and $\psi(x) = x$ is a linear mapping function that divides the input space and serves to select the associated local regressor.

Eye-in-head gaze model. Head pose alone is not a sufficient gaze cue because of eye-in-head motion, so knowledge from the eyes should be considered. However, in our case, the low resolution of the images prevents us from using eye information. To remedy this, we instead include a cognitive gaze model to account for eye-in-head motion, similarly to [5]:

$$g^h - g_{ref} = \kappa \cdot (g^c - g_{ref}) \qquad (7)$$

with person-specific coefficients $\kappa = [\kappa_\alpha, \kappa_\beta]$. $g_{ref} = [\alpha_{ref}, \beta_{ref}]^T$ and $g^c = [\alpha_c, \beta_c]^T$ are, respectively, the reference direction (also known as the midline) and the compensated gaze accounting for the eye-in-head shift.
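To make the gated prediction of Eqs. (5)-(6) concrete, the sketch below blends $P$ local RBF regressors with a softmax gate on the raw input ($\psi(x) = x$). It is a simplification under stated assumptions: the dual coefficients, gate parameters and kernel widths are assumed to have been learned beforehand by the alternating procedure described above, and the gate's contribution at the training points is folded into the dual coefficients, which the full LMKR formulation does not do.

```python
# Hedged, simplified sketch of an LMKR-style prediction (Eqs. (5)-(6)).
import numpy as np

def rbf(X, Z, r):
    """RBF kernel matrix between rows of X and Z, width r."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / r**2)

def lmkr_predict(x, X_train, alphas, b, V, v0, widths):
    """x: (d,) query; alphas: P dual-coefficient vectors; V: (P, d) gate weights; widths: P kernel widths."""
    x = np.asarray(x)[None, :]
    scores = V @ x.ravel() + v0             # <nu_m, psi(x)> + nu_m0, with psi(x) = x
    eta = np.exp(scores - scores.max())
    eta /= eta.sum()                        # softmax gate, Eq. (6)
    local = [rbf(x, X_train, r)[0] @ a for a, r in zip(alphas, widths)]
    return float(np.dot(eta, local) + b)    # gated sum of local regressors, Eq. (5)
```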

C. Attention recognition

To analyse gaze patterns, attention scores are computed based on the gaze estimates described in Sections II-A and II-B. Then, each person is assigned one of the following attention states: $S$ = {"G" = looking at first-person, "L" = looking at left third-person, "R" = looking at right third-person, "U" = unfocused}.

First-person attention. Given the location $L_{F_i}$ of a face and the estimated gaze coordinates, the first-person attention (FPA) score is defined as:

$$\mathrm{FPA}(F_i) = \begin{cases} 1 & \text{if } g^e \in O_{F_i} \\[4pt] G_0 \exp\!\left(-\dfrac{\|g^e - L_{F_i}\|_2^2}{2\sigma_g^2}\right) & \text{otherwise} \end{cases} \qquad (8)$$

where $O_{F_i}$ is the set of pixel locations belonging to the face $F_i$, $G_0$ is a normalization coefficient and $\sigma_g$ is set with respect to the accuracy of the gaze tracker.
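A minimal sketch of the FPA score of Eq. (8), assuming the estimated gaze point and the face are given in scene-image pixel coordinates; the face-box representation and the default $\sigma_g$ are illustrative choices, not the paper's values.

```python
# Hedged sketch of Eq. (8); G0 and sigma_g are illustrative defaults.
import math

def fpa_score(g_e, face_box, face_center, G0=1.0, sigma_g=40.0):
    """1 if the gaze point lies on the face region O_Fi, Gaussian fall-off otherwise."""
    x0, y0, x1, y1 = face_box
    gx, gy = g_e
    if x0 <= gx <= x1 and y0 <= gy <= y1:          # g_e inside the face pixel region
        return 1.0
    d2 = (gx - face_center[0]) ** 2 + (gy - face_center[1]) ** 2
    return G0 * math.exp(-d2 / (2.0 * sigma_g ** 2))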

Third-person attention. Third-person attention (TPA) scores are computed using the knowledge of the face locations and the head pose-based gaze estimates described in Section II-B. To infer the direction of gaze from face locations, we first transform the 3D gaze vector $g_{F_i}$, where the depth coordinate is determined based on the head height $h_i$, into the head pose representation $(\alpha_{F_i}, \beta_{F_i})$. Then, we model the gaze of a person as a 3D cone with an elliptical cross section in order to account for the uncertainty in the yaw and pitch angles. Hence, we define the TPA score based on the distance between the estimated pose and the gaze direction obtained from the face locations:

$$\mathrm{TPA}(F_i, F_j) = \sqrt{\Delta\Theta_{i,j}^{T}\, Q\, \Delta\Theta_{i,j}} \qquad \forall i \neq j$$

where $\Delta\Theta_{i,j} = [\Delta\alpha_{i,j}, \Delta\beta_{i,j}]^T$ and $Q$ is defined by:

$$Q = \begin{bmatrix} \dfrac{1}{2\varphi_\alpha^2} & 0 \\[6pt] 0 & \dfrac{1}{2\varphi_\beta^2} \end{bmatrix} \qquad (9)$$

Hence, person $P_i$ is looking at person $P_j$ if $\mathrm{TPA}_{i,j} \leq 1$.
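The sketch below illustrates the TPA test (Eq. (9) together with the $\mathrm{TPA} \leq 1$ rule), including an inversion of the eye-in-head model of Eq. (7) to compensate the estimated head pose. The angular apertures and $\kappa$ follow the values reported in Section III, but the data layout and the choice of applying the compensation to the estimated pose are illustrative assumptions.

```python
# Hedged sketch of the TPA decision; angles in degrees, layout illustrative.
import numpy as np

def compensate(g_head, kappa=(0.6, 0.5), g_ref=(0.0, 0.0)):
    """Invert Eq. (7): recover the compensated gaze g_c from the head pose g_h."""
    g_head, kappa, g_ref = map(np.asarray, (g_head, kappa, g_ref))
    return g_ref + (g_head - g_ref) / kappa

def tpa(pose_i, target_dir_j, phi=(30.0, 30.0)):
    """TPA(F_i, F_j): person i is considered to look at person j when the value is <= 1."""
    d = compensate(pose_i) - np.asarray(target_dir_j)       # (delta_alpha, delta_beta)
    Q = np.diag(1.0 / (2.0 * np.asarray(phi) ** 2))         # Eq. (9)
    return float(np.sqrt(d @ Q @ d))

# e.g. tpa((25.0, 5.0), (20.0, 0.0)) < 1  ->  person i attends to person j
```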

Temporal filtering. For each person appearing in the scene camera, we use a simple HMM in order to smooth the TPA observations $o_{1:t}$ and estimate the attention state $l_t$ by computing the posterior probability:

$$p(l_t \mid o_{1:t}) = \frac{p(o_t \mid l_t)\, p(l_t \mid l_{t-1})\, p(l_{t-1} \mid o_{1:t-1})}{\sum_{l'_t} p(o_t \mid l'_t)\, p(l'_t \mid l_{t-1})\, p(l_{t-1} \mid o_{1:t-1})} \qquad (10)$$

where the prior $p(l_0) = \pi_{l_0}$ is a uniform distribution and the transition probability $p(l_t = m \mid l_{t-1} = n) = A_{mn}$ is set high if $m = n$ and uniformly for $m \neq n$. The observation probability is estimated from the TPA score by using the following model:

$$p(o_t \mid l_t) \propto \exp(-\lambda \cdot \mathrm{TPA}^2) \qquad (11)$$

where λ is set experimentally.
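For the temporal filtering of Eqs. (10)-(11), the sketch below runs a standard HMM forward recursion over per-state TPA-style scores. The self-transition weight and $\lambda$ are illustrative, since the paper only states that they are set high and experimentally, respectively; the recursion sums over the previous state as in the usual forward algorithm.

```python
# Hedged sketch of the HMM smoothing (Eqs. (10)-(11)); parameters illustrative.
import numpy as np

STATES = ["G", "L", "R", "U"]

def smooth_attention(tpa_per_state, a_stay=0.85, lam=1.0):
    """tpa_per_state: (T, 4) scores per frame and state. Returns one label per frame."""
    K = len(STATES)
    A = np.full((K, K), (1.0 - a_stay) / (K - 1))
    np.fill_diagonal(A, a_stay)                      # high self-transition, uniform otherwise
    belief = np.full(K, 1.0 / K)                     # uniform prior p(l_0)
    labels = []
    for obs in tpa_per_state:
        lik = np.exp(-lam * np.asarray(obs) ** 2)    # Eq. (11): p(o_t | l_t)
        belief = lik * (A.T @ belief)                # predict then weight, cf. Eq. (10)
        belief /= belief.sum()                       # normalization (denominator of Eq. (10))
        labels.append(STATES[int(np.argmax(belief))])
    return labels
```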

III. EXPERIMENTAL RESULTS

In this section, we first evaluate the performance of the head pose estimation method and then present the results obtained for attention recognition in triadic interactions.

A. Head pose evaluation

Datasets and evaluation procedure. We evaluated the performance of the head pose estimation using two publicly available databases:

• the CUbiC FacePix database [21] includes 30 subjects with yaw angles of ±90° in 1° pose increments. We used the subset with varying pose and constant ambient illumination and, because the face images are aligned, we cropped the images by keeping a 98×98 patch at their center.

• the PRIMA Pointing'04 database [22] consists of two series of 15 subjects, for each of which 91 images⁴ are captured covering yaw angles of ±90° and pitch angles of ±60° with 15° intervals. Face regions were manually cropped around the skin.

⁴±90° pitch angles are ignored because they are undersampled.


TABLE I
HEAD POSE ERROR ANGLE (MAAE ± SD) FOR YAW α AND PITCH β

Database   Angle        LRR               KRR               GPR               KPLS              LMKR (P = 1)      LMKR (P = 2)
FacePix    α            9.57° ± 7.55°     8.50° ± 6.80°     8.12° ± 6.68°     7.99° ± 6.78°     8.73° ± 7.59°     6.79° ± 5.86°
           |α| ≤ 45°    9.29° ± 7.12°     8.20° ± 6.49°     7.75° ± 6.31°     7.57° ± 6.35°     8.03° ± 6.71°     6.81° ± 5.48°
           |α| > 45°    10.40° ± 8.38°    9.40° ± 7.28°     9.23° ± 7.20°     9.26° ± 7.36°     10.85° ± 8.87°    6.72° ± 6.39°
Pointing   α            11.87° ± 10.47°   9.31° ± 8.94°     8.59° ± 8.67°     8.53° ± 8.88°     8.66° ± 9.17°     7.34° ± 6.66°
           |α| ≤ 45°    11.21° ± 9.89°    8.59° ± 8.41°     7.81° ± 8.07°     7.72° ± 8.09°     7.77° ± 8.32°     6.93° ± 6.54°
           |α| > 45°    14.10° ± 11.95°   11.73° ± 10.14°   11.18° ± 9.98°    11.40° ± 10.66°   11.62° ± 11.06°   8.72° ± 6.87°
           β            10.97° ± 8.84°    8.24° ± 7.40°     7.48° ± 6.99°     7.39° ± 7.00°     7.57° ± 7.27°     7.62° ± 7.26°

Our head pose estimator was evaluated against state-of-the-art regression methods: Linear Ridge Regression (LRR), Kernel Ridge Regression (KRR), Gaussian Process Regression (GPR), Support Vector Regression (SVR) and Kernel Partial Least Squares (KPLS), recently proposed in [23]. SVR was computed using LMKL with P = 1, i.e. a single kernel is used. As performance metrics, we use the Mean Absolute Angular Error (MAAE) and the Standard Deviation (SD) between the continuous estimated pose and the discrete ground-truth pose.

Feature extraction. Cropped head images were resized to 40×40 and HOG features [16] were extracted using $N_b = 8$ bins and an 8×8 grid laid over the image. No dimensionality reduction (PCA, ICA) was performed on the feature vector.

Training the regressors. For each method, the free parameters were chosen using 5-fold cross-validation on the training samples. We used an RBF kernel for the kernel-based regressors. For FacePix, a subset with 10° pose angle intervals was chosen from the initial database and the experiment was repeated for 10 random trials where two thirds of the subjects were used for training and the remainder for testing. Test subjects were unseen by all algorithms during training. For Pointing'04, the first series of the database was used as the training set and the regressors were tested on the second series. For KPLS, we used the kernel NIPALS algorithm proposed by Rosipal & Trejo [24], for which the optimal number of latent factors was found to be $N_{fac} = 40$.

Results. Table I presents the head pose estimation results for the two databases. For the yaw angle, LMKR exhibits better performance than the state-of-the-art methods. A notable improvement is observed for near-profile views (|α| > 45°), which in turn yields less underestimated pose estimates. However, similar results are obtained for KPLS, GPR and LMKR on the pitch angle. We assume this is in part due to the lower number of discrete angles (7 for β against 13 for α), and it indicates that a single kernel (P = 1) is sufficient for pitch estimation on this database.
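For reference, the MAAE ± SD metric reported in Table I amounts to the mean and standard deviation of the absolute angular error; a trivial illustrative sketch:

```python
# Illustrative computation of MAAE and SD over predicted vs. ground-truth angles (degrees).
import numpy as np

def maae_sd(pred_deg, gt_deg):
    err = np.abs(np.asarray(pred_deg) - np.asarray(gt_deg))
    return err.mean(), err.std()
```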

B. Attention analysis evaluation

First-person calibration. Before any experiment, an individual calibration is performed in order to build an appearance-based gaze model (Section II-A). To collect eye images with corresponding gaze coordinates, the subject is asked to follow a target while keeping his head fixed. Then, a person-dependent model is computed using person-independent parameters that were optimized on a separate dataset. In contrast to [15], no knowledge from other subjects is included in our model; hence the mean absolute error is higher (∼2° of visual angle using 100 calibration samples) but sufficient for the task at hand.

Evaluation setup. To evaluate our attention detector, 4 × 3 subjects were involved in triadic-like interactions where one subject wears the head-mounted gaze tracker, while two other participants sit in front of him. Each recorded video sequence contains between around 1500 and 6500 frames at 15 fps, resulting in a total of ∼15 minutes of video. In each frame, each third person was manually given an attention label, and transitions (e.g. during head turns) were labeled as unfocused. Hence, by knowing where each person is looking, we are also able to determine attention patterns such as:

1) who is looking at whom (LAP),
2) mutual gaze (MG) between all participants,
3) person-based shared gaze (SG), i.e. whether two persons are looking at the same person.

Implementation details. Parameters for the head tracking were unchanged for all videos. For the head pose estimation, we set the yaw and pitch angular apertures to $\varphi_\alpha = \varphi_\beta = 30°$. Moreover, in order to partially deal with in-plane rotation, the Pointing'04 dataset is complemented with faces (only with yaw 0°) rotated in roll by ±15°. In total, the dataset comprises 3380 faces and we trained LMKR using P = 2 for the yaw angle and P = 1 for the pitch angle. For all experiments, the eye-in-head coefficient $\kappa = [\kappa_\alpha, \kappa_\beta]$ was fixed to [0.6, 0.5]. Given the geometrical experiment setup, we set $g_{ref} = [0, 0]^T$, i.e. the first person is taken as the reference and is assumed to remain the same over time. Similarly, to define a TPA score with respect to the first person, we set $[\alpha_F, \beta_F] = [0, 0]$.

Results and discussion. As a performance metric, we used a frame-by-frame comparison with the annotated ground truth. Table II presents the accuracy of third-person attention recognition for each experiment. For MG and SG, we also indicate the corresponding data percentage. Figure 2 shows the precision-recall results for the recognition of MG and SG for each video sequence, where:

$$\text{precision} = \frac{|R \cap GT|}{|R|} \qquad \text{recall} = \frac{|R \cap GT|}{|GT|} \qquad (12)$$

with $R$ and $GT$ being the sets of, respectively, the recognized and the ground-truth labels. One can see that the proposed method achieves satisfactory results. The lower accuracy obtained in experiment B can be partly explained by the presence of large in-plane head rotations.
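Eq. (12) can be computed directly on the sets of frames carrying a given label; a tiny illustrative sketch:

```python
# Illustrative precision/recall on sets of frame indices (assumes non-empty sets).
def precision_recall(recognized: set, ground_truth: set):
    hits = len(recognized & ground_truth)
    return hits / len(recognized), hits / len(ground_truth)
```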


TABLE II
ACCURACY RESULTS FOR THIRD-PERSON ATTENTION RECOGNITION, MUTUAL GAZE (MG) AND SHARED GAZE (SG)

           LAP                              MG                  SG
Exp.       Left person     Right person     Acc.      %         Acc.      %
A          0.89            0.75             0.82      59        0.87      71
B          0.75            0.76             0.75      58        0.84      63
C          0.82            0.75             0.87      57        0.77      44
D          0.91            0.89             0.93      67        0.91      68
Avg.              0.82                      0.84      60        0.85      61

In some experiments, subjects, e.g. the right person in experiment A, also tended to slightly turn their head, which introduced strong eye-in-head motion that was not fully handled by the cognitive gaze model.

We also evaluated 3-way attention patterns that occur due to the triad and help to understand the nature of the social interaction at the frame level. These patterns correspond to 6 categories⁵: exclusive mutual gaze (xMG), exclusive shared gaze (xSG), mutual-shared gaze (M-SG), directed gaze (DG), cyclic gaze (CG) and exclusive averted gaze (xAG). Figure 3(a) shows data statistics related to the triadic attention patterns and Figure 3(b) shows the corresponding overall confusion matrix for all experiments.

Figure 4 shows some snapshots and recognized patterns of: (a) experiment D for 2000 < t < 3000 and (b) experiment B for 1000 < t < 2000. In Fig. 4(a), attention patterns are reliably recognized, even if head turns sometimes pose problems; these cases are also not trivial to label by hand. Fig. 4(b) looks more fragmented at some points, which is mainly due to noisy tracking outputs and large variations in β.

In our experiments, we also noticed that the first-person gaze was sometimes affected by eye expressions (e.g. smiling), mostly influencing eyelid closure. This is mainly explained by the fact that gaze calibration is usually done with a neutral expression. While this is usually not a problem when one wants to evaluate the gaze tracker accuracy, it should still be considered in practice, especially when interactions occur. However, considering the expression space would lead to a higher-dimensional problem and would require gathering additional calibration samples.

IV. CONCLUSION

In this paper, we proposed a pipeline framework to detect social gaze patterns, such as mutual and shared gaze, from first-person vision. The first-person gaze was computed using local appearance features based on multi-level HOG, and a personal calibration was performed in order to build the gaze model. On the other hand, the third-person gaze was estimated by combining probabilistic head tracking and pose estimation via localized regression. Evaluation of our head pose estimator on publicly available databases showed some improvements over state-of-the-art methods, and experiments involving different participants showed promising results for attention recognition in triadic interactions.

⁵xMG and xSG correspond to one person being unfocused while the two others are, respectively, in MG or SG mode. For xAG, each person is unfocused.

Fig. 2. Precision-recall for each experiment and each gaze pattern: mutual gaze (red) and shared gaze (blue). Stars indicate the average results over all experiments.


Fig. 3. 3-way attention patterns: (a) Data statistics, (b) Overall confusion matrix for the 4 experiments.

Fig. 4. Example of attention recognition timelines for two experiments. For each third-person gaze, the first and second rows correspond, respectively, to the ground truth and the predicted gaze label. Recognized attention patterns are displayed as graphs laid over each snapshot.

Through this study, we also showed that, although the first-person gaze estimate is less accurate than with infrared-based systems, head-mounted appearance-based gaze trackers are still suitable for attention analysis and that high accuracy is not necessarily needed.

Future work could involve the study of social interactions from multimodal signals (e.g. combining gaze, facial expressions and speech) and with a variable number of persons.

V. ACKNOWLEDGMENTS

This research was partly funded by the European Commission's 7th Framework Programme under G.A. No. 247730 (AsTeRICS) and by the CNRS under the DEFISENS mission.

REFERENCES

[1] M. Argyle, “Social interaction,” Travistock Publications, 1969.
[2] M. Argyle, R. Ingham, F. Alkema, and M. McCallin, “The different functions of gaze,” in Semiotica, 1973.
[3] D. W. Hansen and Q. Ji, “In the eye of the beholder: A survey of models for eyes and gaze,” IEEE Trans. on PAMI, 2010.



[4] B. Benfold and I. Reid, “Guiding visual surveillance by tracking human attention,” in BMVC, 2009.
[5] S. O. Ba and J.-M. Odobez, “Recognizing human visual focus of attention from head pose in meetings,” IEEE Trans. on Systems, Man, and Cybernetics, 2008.
[6] X. Liu, N. Krahnstoever, T. Yu, and P. H. Tu, “What are customers looking at?,” in AVSS, 2007.
[7] M. Marin-Jimenez, A. Zisserman, and V. Ferrari, ““Here’s looking at you, kid.” Detecting people looking at each other in videos,” in BMVC, 2011.
[8] H. Pirsiavash and D. Ramanan, “Recognizing activities of daily living in first-person camera views,” in CVPR, 2012.
[9] K. Ogaki, K. M. Kitani, Y. Sugano, and Y. Sato, “Coupling eye-motion and ego-motion features for first-person activity recognition,” in CVPR Workshop on Ego-Centric Vision, 2012.
[10] A. Fathi and J. M. Rehg, “Learning to recognize daily actions using gaze,” in ECCV, 2012.
[11] A. Fathi, J. K. Hodgins, and J. M. Rehg, “Social interactions: A first-person perspective,” in CVPR, 2012.
[12] Z. Ye, Y. Li, A. Fathi, Y. Han, A. Rozga, G. D. Abowd, and J. M. Rehg, “Detecting eye contact using wearable eye-tracking glasses,” in PETMEI (in conjunction with UbiComp), 2012.
[13] A. Tsukada, M. Shino, M. S. Devyver, and T. Kanade, “Illumination-free gaze estimation method for first-person vision wearable device,” in ICCV Workshop on Computer Vision in Vehicle Technology: From Earth to Mars, 2011.
[14] B. Noris, J.-B. Keller, and A. Billard, “A wearable gaze tracking system for children in unconstrained environments,” CVIU, pp. 1–27, 2011.
[15] F. Martinez, A. Carbone, and E. Pissaloux, “Gaze estimation using local features and non-linear regression,” in ICIP, 2012.
[16] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
[17] M. E. Tipping and A. C. Faul, “Fast marginal likelihood maximisation for sparse Bayesian models,” in Workshop on AI & Statistics, 2003.
[18] D. A. Ross, J. Lim, R. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” IJCV, vol. 77, no. 1-3, pp. 125–141, 2008.
[19] M. Gonen and E. Alpaydin, “Localized multiple kernel regression,” in ICPR, 2010.
[20] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
[21] G. Little, S. Krishna, J. Black, and S. Panchanathan, “A methodology for evaluating robustness of face recognition algorithms with respect to changes in pose and illumination angle,” in ICASSP, 2005.
[22] N. Gourier, D. Hall, and J. L. Crowley, “Estimating face orientation from robust detection of salient facial features,” in Proc. of Pointing 2004, ICPR, International Workshop on Visual Observation of Deictic Gestures, 2004.
[23] M. Al Haj, J. Gonzalez, and L. S. Davis, “On partial least squares in head pose estimation: How to simultaneously deal with misalignment,” in CVPR, 2012.
[24] R. Rosipal and L. J. Trejo, “Kernel partial least squares regression in reproducing kernel Hilbert space,” JMLR, vol. 2, pp. 97–123, 2001.