Gaze Tracking by Using Factorized Likelihoods Particle Filtering and Stereo Vision

Erik Pogalin
Information and Communication Theory Group

Delft University of Technology
P.O. Box 5031, 2600 GA Delft, The Netherlands
[email protected], [email protected]

Abstract— In the area of visual perception research, information about a person's attention on visual stimuli that are shown on a screen can be used for various purposes, such as studying the phenomenon of human vision itself or investigating eye movements while that person is looking at images and video sequences. This paper describes a non-intrusive method to estimate the gaze direction of a person by using stereo cameras. First, facial features are tracked with a particle filtering algorithm to estimate the 3D head pose. The 3D gaze vector can be calculated by finding the eyeball center and the cornea center of both eyes. For the purpose mentioned above, we also propose a screen registration scheme to accurately locate a planar screen in world coordinates within 2 mm error. With this information, the gaze projection on the screen can be calculated. The experimental results indicate that an average error of the gaze direction of about 7◦ could be achieved.

Keywords: Gaze tracking, facial features tracking, particle filtering, stereo vision.

1 INTRODUCTION

An eye gaze tracker is a device that estimates the direction of the gaze of human eyes. Gaze tracking can be used for numerous applications, ranging from diagnostic applications such as psychological and marketing research to interactive systems in the Human-Computer Interaction (HCI) domain ([4], [17]). For example, studying eye movements during reading can be used to diagnose reading disorders. Investigating the user's attention on advertisements can help to improve their effectiveness. In the HCI domain, gaze tracking can be used as a way to interact with machines, e.g. as a pointing device for disabled people when operating a computer or as a support system in cars to alert users when they fall asleep.

Several commercial gaze tracking products exist that are highly accurate and reliable. They are mostly based on a so-called infrared technique. Tobii [18] and ERT [6] developed a system that uses a motorized camera and infrared lighting to track the eye gaze. Their products are mainly used for visual perception research. Other companies such as Fourward [8] and ASL [1] use head-mounted cameras to track the user's eyes from a close distance. These kinds of products are suitable for user interaction as well as visual perception research.

There are two disadvantages which make those infrared-based gaze tracking products less attractive for wide use. Most of these products require special hardware such as motorized cameras, helmets or goggles, making the product really expensive (between US$15,000 and US$150,000 as reported in [19]). Furthermore, this special hardware can cause discomfort and will restrict the user's movements.

In this paper, we designed a gaze tracking scheme in the framework of visual perception research. In a typical experiment, users are told to watch visual stimuli that are displayed on a screen [4]. Their gaze projection on the text, image or video sequence shown on the screen can be used for various purposes, such as diagnosing reading disorders, analyzing the effectiveness of advertisements and investigating differences in their attention while evaluating the image quality of a video sequence. Considering these applications and the two disadvantages mentioned above, we summarized the following requirements as guidelines during the system design:

• The system should detect and track the user's gaze on a 2D screen by estimating the intersection point between the gaze ray and the screen.

• The system must use a non-intrusive technique.
• The system should track a single user at a time.
• The system does not have to work in real-time.
• The system should be made as cheap as possible and it should be possible for the system to be used for user-interaction purposes.
• The average angular gaze error should not exceed 5◦.


Inspired by the works of Matsumoto et al. [15] and Ishikawa et al. [12], which used a completely non-intrusive method to estimate the gaze directions in 3D, we make another contribution to this type of solutions by introducing some modifications to their method. Our tracking scheme combines the auxiliary particle filtering algorithm ([16]) and stereo information to detect and track facial features, such as eye and mouth corners. The 3D locations of these features determine the pose of the head. Furthermore, we use a 3D eye model which assumes that the eyeball is a sphere. Unlike Ishikawa et al., we choose to use the corners of the eye socket instead of the corners located on the eyeball surface. This would make the tracking more robust to occlusions and eye blinks.

Finally, we devised a screen registration scheme to locate a 2D surface that is not visible in the camera view (such as a monitor positioned behind the camera) by using a special mirror. In this way, the screen location in the world coordinate system is known accurately, so that we can directly calculate the intersection of the gaze ray with the screen. Besides the screen, we could also register other objects in the world coordinate system. With minor modifications the system could easily be applied for user-interaction purposes.

This paper is organized as follows. In section 2 we present a short summary of the work that has been done previously in eye gaze tracking. The outline of our gaze tracking system is presented in section 3. In section 4 we discuss the calibration of the cameras and the registration of the 2D screen. Next, the two most important modules of the system, the head pose tracking and the gaze direction estimation, are described in sections 5 and 6, respectively. The system performance is evaluated and the results are given in section 7 and finally, section 8 concludes this paper with a discussion and recommendations for future work.

2 PREVIOUS WORK

In the last few years, gaze tracking research has concentrated on intrusive as well as non-intrusive video-based techniques. Using image processing and computer vision techniques, it is possible to compute the gaze direction without the need for any kind of physical contact with the user.

The most popular technique is the use of infrared lighting to capture several reflections from parts of the eye (pupil, cornea, and lens reflections) [4]. The relative position of these reflections changes with pure eye rotation, but remains relatively constant with minor head movements. With appropriate calibration procedures, this method estimates the user's point of regard on a planar surface (e.g. PC monitor) on which calibration points are displayed. Several variations to interpolate the gaze from known calibration points have been reported in the literature, including the use of artificial neural networks ([2], [5], [13]).

This infrared technique is widely applied in current commercial gaze trackers. However, it needs a high resolution image of the eye, which explains the use of expensive hardware, such as a zoom-capable camera mounted below the screen or attached to a helmet.

Another approach that has been developed recently detects the head pose separately and uses this information to estimate the gaze direction in 3D. This method has several advantages compared to the infrared technique. Aside from the cheap hardware requirements (a pair of normal cameras and a PC), tracking is not limited to the point of regard on a planar object. Since the gaze is tracked in a 3D world, we can also intersect the gaze with other objects of interest, provided that those objects are properly registered in the 3D world (i.e. their locations are accurately known). Because of this, the system can be easily modified for interaction purposes.

Matsumoto et al. [15] used stereo cameras to detect and track the head pose in 3D. A 3D model for each user is built by selecting several facial features in the initialization phase. This 3D pose will be rigidly tracked over time. To measure the gaze direction, the location of the eyeball center is calculated from the head pose and the cornea center is extracted from the stereo images. The vector that connects the eyeball center and the cornea center is the estimate of the gaze direction.

The use of Active Appearance Models (AAM) has been proposed by Ishikawa et al. [12]. A 3D AAM is fitted to the user's face and tracked over time by using only a single camera. Similar steps as in [15] are done to measure the 3D gaze vector. Another camera is used to view the scene and, by asking the user to look at several points in the world, the relative gaze orientation with respect to the projection of these points in the view-camera image can be interpolated.

This paper makes another contribution to the 3D gaze tracking method. The pose of the head will be tracked by using the particle filtering algorithm proposed in [16]. Combined with stereo vision, the 3D head pose can be recovered. We use a slightly different eyeball model than the model used in [12] and [15]. Since visual perception research is our main concern, we also devise a screen registration scheme to locate a planar screen with respect to the cameras. With this information, the gaze projection on the screen can be calculated.

3 SYSTEM OUTLINE

Our gaze tracking system consists of three main modules: head pose tracking, gaze direction estimation and intersection calculation (figure 1). We use a 3D facial feature model to determine the 3D pose of the head. Together with a 3D eye model the 3D gaze vector can be determined. Figure 2 shows the hardware setup of the system. A pair of USB cameras placed below the monitor is used to capture the user in the scene.

Figure 1. Block diagram of the gaze tracking system. The left part shows the off-line steps that have to be done before the actual tracking is performed. (Blocks: stereo camera calibration, screen registration, user training, initialization, head pose tracking, gaze direction estimation, intersection calculation.)

Several pre-processing steps must be done before performing the actual tracking. First of all, the stereo cameras must be calibrated. In the calibration process, the left camera reference frame is used as the world reference frame. Secondly, we need to register the screen position in world coordinates. In this way, after calibrating the cameras and the screen, we can directly compute the intersection of the gaze ray with the screen plane. The calibration procedure will be discussed in detail in section 4. The third and last step is to estimate the user-dependent parameters for the 3D facial feature model and 3D eye model. The facial feature model is built by taking several shots of the head under different poses. The eye model is created by acquiring a training sequence, where the user looks at several calibration points on the screen. The estimated parameters will be used for the actual tracking. We refer to section 6 for more details on the eyeball model used.

The head pose tracking (section 5) is initialized manually in the first frame received by the cameras. In this initialization phase, we choose the facial features that we want to track and use the image coordinates of these features (in the left and right frame) as start positions for the head pose tracking. In our system, the corners of the eyes and mouth are selected. A rectangular color window defined around each chosen feature will be used as reference template. These facial features will be tracked throughout the whole video stream by using the particle filtering algorithm proposed in [16]. The system then performs stereo triangulation on each facial feature. The output of this module is the 3D locations of all features, which determine the pose of the head in the current frame.

Once we know the 3D location of the eye corners, the location of the eyeball center can be determined (see section 6).

Figure 2. Hardware setup of the gaze tracking system. A pair of USB cameras placed below the monitor are used to capture the user in the scene.

A small search window is defined around the eye corners to search for the cornea center in the left and right frame. The 3D locations of the cornea centers are found by triangulation. The gaze is then defined by a 3D vector connecting the eyeball center and the cornea center. Two gaze vectors are acquired from the gaze direction estimation module, one from the left and one from the right eye.

The last step is to intersect the gaze vectors from the left and right eye with the object of interest (e.g. the monitor screen). The intersection is done by extending this vector from the eyeball center until it reaches the screen. To compensate for the effect of noise, we take the average of the left and right projected gaze points and feed the single 2D screen coordinate to the output. In the following sections, each module of the system will be discussed in more detail.

4 CAMERA CALIBRATION

This section discusses the calibration of the cameras and the registration of the 2D screen. The results are the intrinsic parameters of the cameras and the extrinsic parameters of the cameras and the screen (i.e. the relative position of the cameras and the screen with respect to the world reference frame). In section 4.1 we deal with the calibration of the stereo cameras, followed by the screen registration in section 4.2.

4.1 Calibrating Stereo Cameras

Camera calibration is done by using the method proposed by Zhang [20]. This method only requires the cameras to observe a planar checkerboard grid shown at different orientations (figure 3). In the following we describe the calibration notation that will be used in the remaining sections.

Figure 3. The setup used for stereo camera calibration. The origin of the camera frame is located on the pinhole of the camera. The left camera frame is also used as the world frame (w: world, l: left camera, r: right camera and g: calibration grid).

A 2D point is denoted by x = [u v]^T and a 3D point by X = [x y z]^T. We use x̃ and X̃ to denote the homogeneous coordinates of a 2D and a 3D vector, respectively. A pinhole camera model is used with the following notation:

$$\lambda \mathbf{x}_{im} = \mathbf{K}\mathbf{X}_c, \qquad \mathbf{X}_c = \mathbf{R}\mathbf{X}_w + \mathbf{T} \quad (1)$$

which relates a 3D point X_w = [x_w y_w z_w]^T in the world reference frame with its image projection x_im = [u v 1]^T in pixels, up to a scale factor λ. The matrix K, called the camera or calibration matrix, is given by

$$\mathbf{K} = \begin{bmatrix} f_x & \alpha f_x & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

and contains the intrinsic parameters: the focal lengths f_x and f_y, the coordinates of the principal point (u_0, v_0) and the skewness of the image axes α.

The same 3D point X_w can be represented in the camera reference frame by X_c = [x_c y_c z_c]^T, which is related by a 3×3 rotation matrix R and a 3×1 translation vector T. This frame transformation can also be written as a single matrix:

$$\mathbf{M} = \begin{bmatrix} \mathbf{R} & \mathbf{T} \\ \mathbf{0} & 1 \end{bmatrix}$$

In this paper we use the left camera frame as the world frame, so for the left and right camera we would have:

$$\begin{aligned}
\mathbf{X}_l &= \mathbf{M}_{wl}\mathbf{X}_w, & \mathbf{M}_{wl} &= \mathbf{I}_{4\times 4}\\
\mathbf{X}_r &= \mathbf{M}_{wr}\mathbf{X}_w, & \mathbf{M}_{wr} &= \begin{bmatrix} \mathbf{R}_{wr} & \mathbf{T}_{wr} \\ \mathbf{0} & 1 \end{bmatrix} = \mathbf{M}_{lr}
\end{aligned} \quad (2)$$

where I_{N×N} is an identity matrix of size N×N and M_lr denotes the extrinsic parameters of the stereo cameras, the transformation between the left and right camera frame.
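As a minimal illustration of equations (1) and (2), the following NumPy sketch projects a world-frame 3D point into an image plane (lens distortion is ignored here); the calibration matrix and the extrinsics are assumed to come from the procedure described below, and the numbers shown are purely hypothetical.

```python
import numpy as np

def project_point(X_w, K, R=np.eye(3), T=np.zeros(3)):
    """Project a 3D world point into pixel coordinates (equations (1)-(2)).

    X_w : 3-vector in the world (= left camera) frame.
    K   : 3x3 calibration matrix of the camera.
    R,T : rotation and translation from world to camera frame
          (identity/zero for the left camera, M_lr for the right one).
    """
    X_c = R @ X_w + T          # world -> camera frame
    x_h = K @ X_c              # homogeneous pixel coordinates (lambda * x_im)
    return x_h[:2] / x_h[2]    # divide out the scale factor lambda

# Hypothetical calibration matrix, for illustration only.
K_left = np.array([[520.0, 0.0, 190.0],
                   [0.0, 517.0, 111.0],
                   [0.0, 0.0, 1.0]])
print(project_point(np.array([0.0, 0.0, 500.0]), K_left))
```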

We used a lens distortion model that incorporates the radial and tangential distortion coefficients. Let x_d be the normalized and distorted image projection in the camera reference frame:

$$\mathbf{x}_d = \begin{bmatrix} x_c/z_c \\ y_c/z_c \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix}$$

and r^2 = x^2 + y^2. The undistorted coordinate x_ud is defined as follows [3]:

$$\mathbf{x}_{ud} = D_r \mathbf{x}_d + \mathbf{D}_t \quad (3)$$

where

$$D_r = 1 + k_1 r^2 + k_2 r^4 + k_5 r^6, \qquad \mathbf{D}_t = \begin{bmatrix} 2k_3 xy + k_4(r^2 + 2x^2) \\ k_3(r^2 + 2y^2) + 2k_4 xy \end{bmatrix}$$

are the radial and tangential distortion terms, respectively. These coefficients can be represented by a single vector k = [k_1 k_2 k_3 k_4 k_5]^T.

Finally, equation (1) can be modified to include the distortion model:

$$\mathbf{x}_{im} = \mathbf{K}\mathbf{x}_{ud} \quad (4)$$
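A sketch of equations (3) and (4), assuming a normalized distorted projection x_d and the coefficient vector k = [k1 k2 k3 k4 k5]; the parameter values a caller would pass are placeholders.

```python
import numpy as np

def apply_distortion_model(x_d, k, K):
    """Equations (3)-(4): correct a normalized projection and map it to pixels.

    x_d : normalized (distorted) projection [x, y] = [x_c/z_c, y_c/z_c].
    k   : distortion coefficients [k1, k2, k3, k4, k5].
    K   : 3x3 calibration matrix.
    """
    x, y = x_d
    r2 = x * x + y * y                                    # r^2 = x^2 + y^2
    Dr = 1 + k[0] * r2 + k[1] * r2**2 + k[4] * r2**3      # radial term
    Dt = np.array([2 * k[2] * x * y + k[3] * (r2 + 2 * x * x),
                   k[2] * (r2 + 2 * y * y) + 2 * k[3] * x * y])  # tangential term
    x_ud = Dr * np.asarray(x_d) + Dt                      # equation (3)
    x_pix = K @ np.array([x_ud[0], x_ud[1], 1.0])         # equation (4)
    return x_pix[:2]
```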

To estimate the intrinsic and extrinsic camera parameters, the following steps are taken:

• Acquiring stereo images. Position the two cameras so that an overlapping view of the user's head is achieved. Take a series of images of the calibration grid (figure 4) under different plane orientations.

• Extracting the grid reference frame. For each plane orientation, the four intersection corners of the pattern are chosen manually (the white diamonds in figure 4). The inner intersections will be detected automatically by estimating the planar homography between the grid plane and its image projection [20]. All detected intersection points are then refined by the Harris corner detector [3] to achieve sub-pixel accuracy. From each image, we will get the image coordinate of each intersection point x_im and its coordinate in the grid reference frame X_g = [x_g y_g 0]^T.

Figure 4. The extracted intersection points from a calibration grid. The four intersection corner points are chosen manually (white diamonds), while the inner points are automatically extracted by using plane homography.


• Estimating individual camera parameters. The intrinsic parameters and the distortion coefficients of each camera are estimated by minimizing the pixel reprojection error of all intersection points on all images, in the least-squares sense. The initial guess for the parameters is made by setting the distortion coefficients to zero and choosing the centers of the images as the principal points. The initial focal lengths are calculated from the orthogonal vanishing points constraint [3].

• Estimating the parameters of both cameras. The individually optimized parameters for each camera from the previous step are now used as the initial guess for the total optimization (considering both cameras). At the end we get the optimized distortion coefficients of both cameras, the calibration matrix for each camera and the external parameters relating the two cameras (a rough sketch of this pipeline is given below).
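The authors implemented Zhang's method themselves; a roughly equivalent two-stage pipeline can be sketched with OpenCV (the function names below are OpenCV's, not the paper's), assuming the grid corners have already been extracted into object-point and image-point lists.

```python
import cv2
import numpy as np

def stereo_calibrate(obj_pts, img_pts_l, img_pts_r, image_size=(320, 240)):
    """Rough OpenCV equivalent of section 4.1: intrinsics per camera,
    then a joint optimization that also yields the extrinsics M_lr."""
    # Individual calibration gives the initial guess for each camera.
    _, K_l, dist_l, _, _ = cv2.calibrateCamera(obj_pts, img_pts_l, image_size, None, None)
    _, K_r, dist_r, _, _ = cv2.calibrateCamera(obj_pts, img_pts_r, image_size, None, None)
    # Joint optimization over both cameras; R, T relate the two camera frames.
    err, K_l, dist_l, K_r, dist_r, R, T, _, _ = cv2.stereoCalibrate(
        obj_pts, img_pts_l, img_pts_r, K_l, dist_l, K_r, dist_r, image_size,
        flags=cv2.CALIB_USE_INTRINSIC_GUESS)
    return K_l, dist_l, K_r, dist_r, R, T, err
```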

4.2 Registering the Screen to the World Frame

In order to intersect the gaze vector with the screen, the screen location with respect to the world frame must be determined. In other words, we need to determine the transformation M_ws from the world frame to the screen frame. We use the following method to estimate this transformation.

A mirror is placed in front of the camera to capture the reflection of the screen. The camera will perceive this reflection as if another screen is located at the same distance from the mirror but in the opposite direction (see figure 5). We attached a reference frame to each of the objects: O_w, O_m, O_v and O_s for the world, mirror, virtual screen and the real screen frame, respectively.

If we know the location of the mirror and this 'virtual' screen, then we can also calculate the location of the real screen. By taking three co-planar points on the screen in world coordinates (e.g. points that lie on the XY-plane of the screen), we get the first two orthogonal vectors that define the screen reference frame. The third one can be computed by taking the cross product of these two vectors.

Figure 5. The hardware setup used for the screen registration. The stereo cameras are represented by two ellipses in front of the screen. Each object is shown with its own reference frame (w: world, m: mirror, v: virtual screen and s: screen).

Figure 6. The mirror used for the registration of the screen. A part of the reflection layer is removed, so that the camera can see the calibration pattern put behind the mirror (image points (+) and reprojected grid points (o)). Compare the extracted reference frame with figure 5.

By displaying a calibration pattern on the screen, the virtual screen-to-world frame transformation M_vw can be computed from the reflection of that pattern. With this information, we can choose three co-planar points and calculate their 3D world coordinates v^w_orig, v^w_long and v^w_short (figure 5). Then, applying the following transformation to each of these points will result in the corresponding 3D screen points s^w_orig, s^w_long and s^w_short in world coordinates:

$$\mathbf{s}^w_i = \mathbf{M}_{mw}\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & -1 & 0\\ 0 & 0 & 0 & 1 \end{bmatrix}\mathbf{M}_{wm}\mathbf{v}^w_i \quad (5)$$

In equation (5) the virtual points are first transformed to mirror coordinates via M_wm. The second matrix mirrors the points to the opposite side of the mirror's XY-plane. After that, by multiplying again with the inverse transformation M_mw, we get the screen points s^w_i in world coordinates.

To determine the location of the mirror, a part of the mirror's reflection layer is removed, making that part transparent. A calibration pattern is placed behind the glass. For the calculation of the world-to-mirror frame transformation M_wm, the grid frame extraction from section 4.1 must be slightly modified. Instead of extracting intersection points from the whole grid, only the points on the grid border need to be detected (figure 6).

The last step is to determine M_ws from the calculated screen points. The rotation and translation components of the transformation can be determined as follows:

$$\begin{aligned}
\mathbf{s}_{xaxis} &= \mathbf{s}^w_{long} - \mathbf{s}^w_{orig}\\
\mathbf{s}_{yaxis} &= \mathbf{s}^w_{short} - \mathbf{s}^w_{orig}\\
\mathbf{s}_{zaxis} &= \hat{\mathbf{s}}_{xaxis} \times \hat{\mathbf{s}}_{yaxis}\\
\mathbf{R}_{ws} &= [\hat{\mathbf{s}}_{xaxis}\ \hat{\mathbf{s}}_{yaxis}\ \hat{\mathbf{s}}_{zaxis}]^T\\
\mathbf{T}_{ws} &= -\mathbf{R}_{ws}\,\mathbf{s}^w_{orig}
\end{aligned} \quad (6)$$

with ŝ_i as the normalized version of s_i.
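A NumPy sketch of equations (5) and (6): the three virtual-screen points are reflected through the mirror's XY-plane and the world-to-screen transformation is assembled from the resulting screen points. M_wm and the three points are assumed to be available from the grid extraction described above.

```python
import numpy as np

def register_screen(M_wm, v_orig, v_long, v_short):
    """Equations (5)-(6): recover the screen frame from mirrored points.

    M_wm : 4x4 world-to-mirror transformation.
    v_*  : 3D virtual-screen points in world coordinates.
    Returns R_ws, T_ws of the world-to-screen transformation.
    """
    M_mw = np.linalg.inv(M_wm)
    flip_z = np.diag([1.0, 1.0, -1.0, 1.0])   # mirror about the mirror's XY-plane

    def reflect(v_w):                          # equation (5)
        v_h = np.append(v_w, 1.0)
        return (M_mw @ flip_z @ M_wm @ v_h)[:3]

    s_orig, s_long, s_short = (reflect(v) for v in (v_orig, v_long, v_short))
    x_axis = s_long - s_orig
    y_axis = s_short - s_orig
    x_axis /= np.linalg.norm(x_axis)
    y_axis /= np.linalg.norm(y_axis)
    z_axis = np.cross(x_axis, y_axis)          # third axis via the cross product
    R_ws = np.vstack([x_axis, y_axis, z_axis]) # equation (6)
    T_ws = -R_ws @ s_orig
    return R_ws, T_ws
```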

Since the camera calibration is only accurate for the space where the calibration grid is positioned, we need to acquire two sets of images. In the first set, we take into account the space where the user's head and also the mirror are supposed to be located (about 40-60 cm in front of the camera). For the second set, we place the calibration grid on the estimated location of the screen reflection (the 'virtual' screen), about 110-130 cm away from the camera. The calibration is then performed over the joint set of images. After that, the screen registration described above can be carried out.

5 HEAD POSE TRACKING

In this section the head pose tracking module will be discussed in detail. First, a short summary of the particle filtering algorithm is provided in section 5.1, followed by the description of the factorized likelihoods particle filtering scheme proposed in [16] (section 5.2). The 3D facial feature model that is used in our scheme is described in section 5.3. Finally, in section 5.4 we discuss the role of particle filtering in the head tracking module and propose the use of stereo information as prior knowledge for the tracking. The choice of particle filtering parameters will also be discussed here.

5.1 Particle Filtering

Recently, particle filtering has become a popular algorithm for visual object tracking. In this algorithm, a probabilistic model of the state of an object (e.g. location, shape or appearance) and its motion is applied to analyze a video sequence. A posterior density p(x|Z) can be defined over the object's state, parameterized by a vector x, given the measurements Z from the images up to time t. This density is approximated by a discrete set of weighted samples, called the particles (figure 7). At time t, this set is represented by {s_k, π_k}, which contains K particles s_1, s_2, ..., s_K and their weights π_1, π_2, ..., π_K (for easier notation, we remove the time index).

The main idea of particle filtering is to update this particle-based representation of the posterior density p(x|Z) recursively from previous time frames:

$$\begin{aligned} p(\mathbf{x}|\mathbf{Z}) &\propto p(\mathbf{z}|\mathbf{x})\,p(\mathbf{x}|\mathbf{Z}^-)\\ p(\mathbf{x}|\mathbf{Z}^-) &= \sum_{\mathbf{x}^-} p(\mathbf{x}|\mathbf{x}^-)\,p(\mathbf{x}^-|\mathbf{Z}^-) \end{aligned} \quad (7)$$

where the superscript − denotes the previous time instant. See [10] for the complete derivation of this equation.

Figure 7. An illustration of the particle-based representation of a 1-dimensional posterior distribution. The continuous density is approximated by a finite number of samples or particles s_k (depicted by the circles). Each particle is assigned a weight π_k (represented by the circle radius) in proportion with the value of the observation density p(z|x = s_k), which is an estimation of the posterior density at s_k.

Beginning from the posterior of the previous time instant p(x^-|Z^-), a number of new particles are randomly sampled from the set {s_k^-, π_k^-}, which is approximately equal to sampling from p(x^-|Z^-). Particles with higher weights have a higher probability to be picked for the new set, while particles with lower weights can be discarded.

Next, each of the chosen particles is propagated via the transition probability p(x|x^-), resulting in a new set of particles. This is approximately equivalent to sampling from the density p(x|Z^-) (equation (7), second line).

In the last step, new weights are assigned to the new particles, measured from the observation density, that is, let π_k = p(z|x = s_k). The new set of pairs {s_k, π_k} represents the posterior probability p(x|Z) at the current time t.

Once the new set is constructed, the moments of the state at the current time t can be estimated. We can take for instance the weighted average of the particles, obtaining the mean position:

$$E[\mathbf{x}] = \sum_{k=1}^{K} \pi_k \mathbf{s}_k \quad (8)$$

In our case, we consider a facial feature such as an eye or mouth corner as a single object, with the image location as the state. In every time frame, the facial feature location is tracked by evaluating the appearance of the feature. Several problems occur when this algorithm is used to track multiple objects [16]. One of the problems is that propagating each object independently would deteriorate the tracking robustness when there are interdependencies between the objects. By incorporating this information in the tracking scheme, the propagation becomes more efficient, i.e. fewer particles are wasted on areas with low likelihood. For example, if we want to track multiple facial features individually without any information about the relative distance between the features, the rigidness of the face is lost. By introducing some constraints in the propagation of each facial feature, the rigidness of the face is preserved.
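A compact NumPy sketch of one recursion of equations (7) and (8) for a single object: resample according to the previous weights, propagate, re-weight with the observation likelihood, and report the weighted mean. The transition and likelihood functions are placeholders to be supplied by the caller.

```python
import numpy as np

def particle_filter_step(particles, weights, propagate, likelihood,
                         rng=np.random.default_rng()):
    """One update of the particle-based posterior p(x|Z) (equations (7)-(8)).

    particles : (K, D) array of particles s_k from the previous frame.
    weights   : (K,) normalized weights pi_k.
    propagate : function drawing a new state from p(x | x_prev).
    likelihood: function returning p(z | x) for a state.
    """
    K = len(particles)
    idx = rng.choice(K, size=K, p=weights)                            # resample ~ p(x-|Z-)
    new_particles = np.array([propagate(particles[i]) for i in idx])  # ~ p(x|Z-)
    new_weights = np.array([likelihood(s) for s in new_particles])    # pi_k = p(z|x=s_k)
    new_weights /= new_weights.sum()
    mean_state = new_weights @ new_particles                          # E[x], equation (8)
    return new_particles, new_weights, mean_state
```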

5.2 Auxiliary Particle Filtering with Factorized Likelihoods

The method summarized below, proposed in [16], is one of the improvements to particle filtering in the case of tracking multiple objects. The state is partitioned x = [x_1|x_2|...|x_M]^T such that x_i (i = 1, 2, ..., M) represents the state of each object and M is the number of objects. Each partition is propagated and evaluated independently:

$$p(\mathbf{x}_i|\mathbf{Z}) \propto p(\mathbf{z}|\mathbf{x}_i) \sum_{\mathbf{x}^-} p(\mathbf{x}_i|\mathbf{x}^-)\,p(\mathbf{x}^-|\mathbf{Z}^-) \quad (9)$$

Similar to the notation in section 5.1, each posterior p(x_i|Z) is represented by a set of sub-particles and their weights {s_ik, π_ik}, with k = 1, 2, ..., K and K the number of sub-particles. After separately propagating those sets, a proposal distribution is constructed from the individual posteriors: g(x) = ∏_i p(x_i|Z). By ignoring the interdependencies between different x_i, we can construct the sample s_k = [s_1k|s_2k|...|s_Mk]^T (concatenation of the sub-particles) by independently sampling from p(x_i|Z). The individual propagation steps are summarized below.

The density p(x|Z) now represents the posterior of all objects, instead of only one object. Starting from the set {s_k^-, π_k^-} from the previous time frame, the following steps are repeated for every partition i:

1) Propagate all K particles s_k^- via the transition probability p(x_i|x^-) in order to arrive at a collection of K sub-particles μ_ik. Note that while s_k^- has the dimensionality of the state space x, μ_ik has the dimensionality of the partitioned state x_i.

2) Evaluate the observation likelihood associated with each sub-particle μ_ik, that is, let λ_ik = p(z|x_i = μ_ik).

3) Sample K particles from the collection {s_k^-, λ_ik π_k^-}. In this way particles with high λ_ik are favored, i.e. particles which end up at areas with high likelihood when propagated with the transition probability.

4) Propagate each chosen particle s_k^- via the transition probability p(x_i|x^-) in order to arrive at a collection of K particles s_ik. Note that s_ik has the dimensionality of partition i.

5) Assign a weight π_ik to each sub-particle as follows:

$$w_{ik} = \frac{p(\mathbf{z}|\mathbf{s}_{ik})}{\lambda_{ik}}, \qquad \pi_{ik} = \frac{w_{ik}}{\sum_j w_{ij}}$$

After this procedure, we have M posteriors p(x_i|Z), each represented by {s_ik, π_ik}. Then, sampling K particles from the proposal function g(x) is approximately equivalent to constructing the particles s_k = [s_1k|s_2k|...|s_Mk]^T by sampling each s_ik independently from p(x_i|Z). Finally, in order for these particles to represent the total posterior p(x|Z), we need to assign a weight to each particle equal to [11]:

$$\pi_k = \frac{p(\mathbf{s}_k|\mathbf{Z}^-)}{\prod_i p(\mathbf{s}_{ik}|\mathbf{Z}^-)} \quad (10)$$

In other words, the re-weighting process favors particles for which the joint probability is higher than the product of the marginals. In the general case that the above equation cannot be evaluated by an appropriate model, the weights need to be estimated. Here, prior information such as the interdependencies between the objects is exploited. After normalizing the sum to 1 again, we end up with a collection {s_k, π_k} as the particle-based representation of p(x|Z).

Figure 8. The 3D facial feature model. On the left the facial feature templates are shown. On the right we see their locations in 3D, calculated from stereo images. The triangle represents the 2D face plane, formed by connecting the average locations of all three feature pairs.
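The per-partition steps above, together with the re-weighting of equation (10), can be sketched as follows; prior_weight stands in for the model-based approximation of equation (10) (in our case the shape prior of section 5.4.3), and the propagation and likelihood functions are placeholders.

```python
import numpy as np

def factorized_apf_step(particles, weights, M, propagate, likelihood, prior_weight,
                        rng=np.random.default_rng()):
    """One step of auxiliary particle filtering with factorized likelihoods [16].

    particles : (K, D) joint particles s_k from the previous frame.
    weights   : (K,) weights pi_k.
    M         : number of partitions (objects).
    propagate : propagate(i, x_prev) -> sample of partition i from p(x_i | x_prev).
    likelihood: likelihood(i, x_i) -> p(z | x_i).
    prior_weight: prior_weight(s) -> approximation of equation (10) for a joint particle.
    """
    K = len(particles)
    sub_particles, sub_weights = [], []
    for i in range(M):
        # Steps 1-2: propagate all particles, evaluate the likelihoods lambda_ik.
        mu = np.array([propagate(i, s) for s in particles])
        lam = np.array([likelihood(i, m) for m in mu])
        # Step 3: sample particles with probability proportional to lambda_ik * pi_k.
        p = lam * weights
        idx = rng.choice(K, size=K, p=p / p.sum())
        # Step 4: propagate the chosen particles again.
        s_i = np.array([propagate(i, particles[j]) for j in idx])
        # Step 5: weight each sub-particle by p(z|s_ik) / lambda_ik, then normalize.
        w_i = np.array([likelihood(i, s) for s in s_i]) / lam[idx]
        sub_particles.append(s_i)
        sub_weights.append(w_i / w_i.sum())
    # Sample each partition independently from its posterior and concatenate.
    joint = np.hstack([sp[rng.choice(K, size=K, p=sw)]
                       for sp, sw in zip(sub_particles, sub_weights)])
    # Re-weighting (equation (10)), approximated by the prior model.
    new_weights = np.array([prior_weight(s) for s in joint])
    return joint, new_weights / new_weights.sum()
```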

5.3 3D Facial Feature Model

The facial feature model in our scheme consists of two components:

• templates of the facial features' appearance,
• relative 3D coordinates of the facial features (reference face model).

The facial features shown in figure 8 are defined as the corners of the eyes and mouth. This facial feature model is user-dependent and must be built before tracking can be performed. First, a stereo snapshot of the head is taken. From this shot the relative 3D positions of the facial features are extracted by manually locating the features in the left and right images and triangulating those features. Together they form a reference shape model for the user's face. Next, in the beginning of each tracking process (initialization phase), the start positions of the facial features in the left and right frames are selected manually. Simultaneously, a rectangular image template around each feature is acquired. These templates will be used in the tracking process.

5.4 Multiple Facial Features Tracking

In this section we will use the auxiliary particle filtering scheme described in the previous section for the problem of multiple facial feature tracking. Figure 9 shows the overview of the head tracking module where the facial features will be tracked. The facial feature templates from the initialization phase are used to track the features in 2D. The output of each of the particle-filtering blocks is a set of particles that represents the distribution of the 2D facial feature locations, for the left and right image respectively: {s_k, π_k}_L and {s_k, π_k}_R.

In order to do the re-weighting process in equation (10) we use the reference face model (figure 8) as prior information on the relative 3D positions of the facial features. We combine the two particle sets from the left and right image into a set of 3D particles by triangulating each left and right particle (one-to-one correspondence), and compare each 3D particle with the reference face model to calculate the weights π_k,3D. These weights are then assigned to the left and right sets (π_k,L and π_k,R) and the individual propagation for the next frame can start again.

Figure 9. Block diagram of the head pose tracking module. Particle filtering is used to track the 2D locations of the facial features in the left and right frame.

From each frame we can roughly estimate the 3D locations of the facial features by calculating the weighted average of the 3D particles (equation (8)). The reference face model is then fitted to these 3D points to refine the estimation of the head pose in the current frame.

In the following subsections we will describe the choice of the state, the observation model and the transition model used for the 2D tracking. After that, we discuss how the priors are used to take into account the interdependencies between the facial features.

5.4.1 State and Transition Model

We consider each facial feature as an object. For every facial feature i, the object state is represented by x_i = [u_i v_i u_i^- v_i^-]^T, with [u_i v_i] and [u_i^- v_i^-] as the current and the previous 2D image coordinates of a particular feature, respectively. We choose to include the previous image coordinates in order to take into account the object's motion velocity and trajectory.

To simplify the evaluation of the transition density, we assume that p(x_i|x^-) = p(x_i|x_i^-), which means that each feature can be propagated individually. A second-order process with Gaussian noise is used for the individual propagation of each feature:

$$p(\mathbf{x}_i|\mathbf{x}_i^-) \propto \begin{bmatrix} 1+\alpha & 0 & -\alpha & 0 \\ 0 & 1+\beta & 0 & -\beta \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \mathbf{x}_i^- + N(0, \sigma_n) \quad (11)$$

with α, β ∈ [0, 1] as the weight factors that determine the strength of the contribution of the horizontal and vertical motion velocity of a particle in the transition model.
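A sketch of the transition model of equation (11); the state is [u, v, u^-, v^-], the default parameter values are those reported in section 7.2, and adding the Gaussian noise only to the new position is our interpretation of the N(0, σ_n) term.

```python
import numpy as np

def propagate_feature(x_prev, alpha=0.7, beta=0.5, sigma_n=1.8,
                      rng=np.random.default_rng()):
    """Sample from p(x_i | x_i^-) of equation (11).

    x_prev : previous state [u, v, u_prev, v_prev] of one facial feature.
    alpha, beta : contribution of the horizontal/vertical velocity.
    sigma_n : standard deviation of the Gaussian noise (pixels).
    """
    A = np.array([[1 + alpha, 0.0, -alpha, 0.0],
                  [0.0, 1 + beta, 0.0, -beta],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
    noise = np.zeros(4)
    noise[:2] = rng.normal(0.0, sigma_n, size=2)  # noise on the new position only
    return A @ x_prev + noise
```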

5.4.2 Observation Model

After the 2D particles are propagated in steps 1 and 4 from section 5.2, the weight of each sub-particle needs to be determined. This is done by evaluating the observation likelihood p(z|x_i).

We use the same observation model as proposed in [16]. A template-based method is used as measurement z from the images. The color difference between a reference template and an equally sized window centered on each sub-particle is used as a measure of the weight of the particles, that is, the probability of a sub-particle being the location of a facial feature. Let the reference template be r_i, and let the window centered on a sub-particle be o_i. The color-based difference is then defined as [16]:

$$c(\mathbf{o}_i, \mathbf{r}_i) = \frac{\mathbf{o}_i}{E\{\mathbf{o}_{i,Y}\}} - \frac{\mathbf{r}_i}{E\{\mathbf{r}_{i,Y}\}} \quad (12)$$

where the subscript Y denotes the luminance component of the template and E{A} is the mean of all elements in A. The matrix c(o_i, r_i) contains the RGB color difference between o_i and r_i. Finally, the scalar color distance between those two matrices is defined by:

$$d(\mathbf{o}_i, \mathbf{r}_i) = E\{\rho(c(\mathbf{o}_i, \mathbf{r}_i))\}, \qquad \rho(c(\cdot)) = |c(\cdot)_R| + |c(\cdot)_G| + |c(\cdot)_B| \quad (13)$$

where ρ(·) is a robust function defined as the L1-norm of the color channels per pixel.

The observation likelihood is defined as:

$$p(\mathbf{z}|\mathbf{x}_i) \propto \varepsilon_o + \exp\!\left(-\frac{d(\mathbf{o}_i, \mathbf{r}_i)^2}{2\sigma_o^2}\right) \quad (14)$$

where σ_o and ε_o are the model parameters (see figure 10). The parameter σ_o determines the steepness of the curve, that is, how fast the curve drops in the case of bad particles (i.e. particles that have low similarity with the reference template). The parameter ε_o is used to prevent particles from getting stuck on local maxima when the object is lost. To improve the ability to recover from a lost object, ε_o should be small but non-zero [14].
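Equations (12)-(14) as a NumPy sketch; the candidate window and the reference template are equally sized RGB arrays, the luminance weights are the usual Rec. 601 values (an assumption, since the paper only says 'luminance component'), and σ_o, ε_o are placeholders.

```python
import numpy as np

def observation_likelihood(window, template, sigma_o=0.1, eps_o=0.01):
    """Color-based observation likelihood of equations (12)-(14).

    window, template : (h, w, 3) RGB arrays (candidate region o_i and reference r_i).
    """
    # Luminance (Y) of each patch, used to normalize out illumination changes.
    y_weights = np.array([0.299, 0.587, 0.114])
    mean_y_o = np.mean(window @ y_weights)
    mean_y_r = np.mean(template @ y_weights)
    c = window / mean_y_o - template / mean_y_r             # equation (12)
    d = np.mean(np.abs(c).sum(axis=2))                      # equation (13): mean per-pixel L1-norm
    return eps_o + np.exp(-d ** 2 / (2.0 * sigma_o ** 2))   # equation (14)
```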

5.4.3 Priors

To approximate the re-weighting process defined in equation (10), we use a similar approach as in the calculation of the observation likelihood. The prior information on the relative 3D positions of the facial features is now used.

Figure 10. The observation model.

After we get the new particle sets from the left and right images, {s_k, π_k}_L and {s_k, π_k}_R, we combine these sets (one-to-one correspondence) by triangulating each particle pair, resulting in K 3D particles. The weight of each 3D particle π_k is then approximated by:

$$\pi_k = \varepsilon_p + \exp\!\left(-\frac{d_k^2}{2\sigma_p^2}\right) \quad (15)$$

where σ_p and ε_p are the model parameters similar to those for the observation likelihood (equation (14)) and d_k is the difference between the reference face shape and the shape derived from each 3D particle. To calculate this difference, the reference face shape is first rotated such that its face plane (figure 8) coincides with the face plane of the measured shape. The scalar distance d_k is then defined as:

$$d_k = \frac{1}{M}\sum_{i=1}^{M} d_{ik}^2 \quad (16)$$

where d_ik is the 3D spatial distance between feature i of the reference and feature i of the k-th 3D particle.
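A sketch of the prior weight of equations (15) and (16), assuming the k-th 3D particle shape has already been triangulated and rotated so that its face plane coincides with that of the reference model; σ_p and ε_p are placeholders.

```python
import numpy as np

def shape_prior_weight(particle_shape, reference_shape, sigma_p=5.0, eps_p=0.01):
    """Weight of one 3D particle from equations (15)-(16).

    particle_shape, reference_shape : (M, 3) arrays of facial feature positions,
    assumed to be expressed in the same (aligned) coordinate frame.
    """
    # d_ik: Euclidean distance between corresponding features.
    dists = np.linalg.norm(particle_shape - reference_shape, axis=1)
    d_k = np.mean(dists ** 2)                               # equation (16)
    return eps_p + np.exp(-d_k ** 2 / (2.0 * sigma_p ** 2)) # equation (15)
```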

6 GAZE DIRECTION ESTIMATION

After acquiring the estimation of the head pose in the previous section, we will now discuss the gaze direction estimation module and the intersection calculation module in detail. We begin by presenting the geometrical eye model used in our system in section 6.1. In section 6.2 the calculation of the 3D gaze vector will be explained. Finally, the intersection between the gaze ray and the screen will be dealt with in section 6.3.

6.1 Geometrical Eye Model

We use a 3D eyeball model similar to the model used by Matsumoto et al. [15] and Ishikawa et al. [12]. The eyeball is regarded as a sphere with radius r and center O (figure 11). We assume that the eyeball is fixed inside the eye socket, except for rotation movements around its center. Therefore the relative position of the center O and the eye corners is constant regardless of the head movements. Unlike Ishikawa et al., we also assume that the inner and outer corners of the eye socket (E_1 and E_2) are not located on the eyeball surface. It is easier to locate and track the eye corners than points on the eyeball surface, because these corners are more distinctive (figure 11). This would also make the tracking more robust to eye blinks. Furthermore, we assume that the anatomical axis of the eye coincides with the visual axis¹. The gaze direction is defined by a 3D vector going from the eyeball center O through the cornea center C. Our 3D eyeball model consists of two parameters:

• the radius of the eyeball r,
• the relative position of the eyeball center with respect to the eye corners.

Figure 11. The eyeball model used in our system. The capital letters denote 3D points in world coordinates. The gaze direction v_g is defined as a 3D vector from the eyeball center O pointing to the cornea center C. The points E_1 and E_2 are the inner and outer corners of the eye socket.

The relative position of the eyeball center is defined as a 3D vector from the mid-point of the eye corners M to the eyeball center O, and is termed the 'offset vector' d. These parameters are determined for each person by taking a training sequence where the gaze points of that person are known.

The training sequence is acquired by recording the user's head pose and cornea center locations while he is looking at several calibration points on the screen. Since we know the locations of the calibration points, we can calculate the gaze vector to these points. If we consider only one calibration point P, the gaze vector is determined by

$$\mathbf{v}_g = \mathbf{P} - \mathbf{C}, \qquad \hat{\mathbf{v}}_g = \frac{\mathbf{v}_g}{\|\mathbf{v}_g\|}$$

with v̂_g as the normalized gaze vector when the eye gaze is fixed on point P (see figure 11).

¹The anatomical axis is defined as the vector from the eyeball center to the center of the lens, while the visual axis is defined as the vector connecting the fovea and the center of the lens. The visual axis represents the true gaze direction. On the retina, the image that we see is projected at the fovea, which is slightly above the projection of the optical axis.

The relation between the gaze vector and the unknown parameters r and d is reflected by the equation:

$$\mathbf{d} + r\hat{\mathbf{v}}_g = \mathbf{C} - \mathbf{M} \quad (17)$$

This equation cannot be solved because we have 4 unknowns (the radius r and the offset vector d = [d_x, d_y, d_z]) and 3 equations (one for each x, y and z component). If we combine the left and right eye together, assuming the same eyeball radius, we would still have 7 unknowns and 6 equations. Therefore, we need at least 2 calibration points to estimate the eyeball parameters for each user. The generalized matrix equation for N calibration points can be derived from equation (17), written in the form Ax = b:

$$\begin{bmatrix}
\hat{\mathbf{v}}_{gL,1} & \mathbf{I} & \mathbf{0}\\
\hat{\mathbf{v}}_{gL,2} & \mathbf{I} & \mathbf{0}\\
\vdots & & \\
\hat{\mathbf{v}}_{gL,N} & \mathbf{I} & \mathbf{0}\\
\hat{\mathbf{v}}_{gR,1} & \mathbf{0} & \mathbf{I}\\
\hat{\mathbf{v}}_{gR,2} & \mathbf{0} & \mathbf{I}\\
\vdots & & \\
\hat{\mathbf{v}}_{gR,N} & \mathbf{0} & \mathbf{I}
\end{bmatrix}
\begin{bmatrix}
r\\ \mathbf{d}_L\\ \mathbf{d}_R
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{C}_{L,1} - \mathbf{M}_{L,1}\\
\mathbf{C}_{L,2} - \mathbf{M}_{L,2}\\
\vdots\\
\mathbf{C}_{L,N} - \mathbf{M}_{L,N}\\
\mathbf{C}_{R,1} - \mathbf{M}_{R,1}\\
\mathbf{C}_{R,2} - \mathbf{M}_{R,2}\\
\vdots\\
\mathbf{C}_{R,N} - \mathbf{M}_{R,N}
\end{bmatrix} \quad (18)$$

Solving this matrix equation in the least-squares sense leads to the desired eyeball parameters. Note that the calculation is done in the face coordinate system (see figure 8), otherwise equation (18) would not be valid.
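Equation (18) solved in the least-squares sense with NumPy; the inputs are the per-calibration-point training gaze vectors and the corresponding C − M differences for both eyes, all expressed in the face coordinate system.

```python
import numpy as np

def estimate_eyeball_parameters(vg_left, vg_right, cm_left, cm_right):
    """Solve equation (18) for the eyeball radius r and offset vectors d_L, d_R.

    vg_left, vg_right : (N, 3) normalized training gaze vectors per calibration point.
    cm_left, cm_right : (N, 3) corresponding C - M vectors (face coordinate system).
    """
    I3, Z3 = np.eye(3), np.zeros((3, 3))
    rows = []
    for v in vg_left:
        rows.append(np.hstack([v.reshape(3, 1), I3, Z3]))   # [v_gL,n  I  0]
    for v in vg_right:
        rows.append(np.hstack([v.reshape(3, 1), Z3, I3]))   # [v_gR,n  0  I]
    A = np.vstack(rows)                                      # (6N, 7)
    b = np.concatenate([cm_left.reshape(-1), cm_right.reshape(-1)])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)                # least-squares solution
    r, d_left, d_right = x[0], x[1:4], x[4:7]
    return r, d_left, d_right
```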

6.2 Estimating the Gaze Vector

Once the eyeball parameters are estimated, we can estimate the gaze direction. The overview of the gaze direction estimation module is given in figure 12.

Figure 12. Detailed block diagram of the gaze direction estimation module.

Figure 13. The ROI defined between the inner and outer eye corners. The small dot in the middle of the circle represents the 2D cornea center.

From the head pose tracking module we get the 3D locations of all facial features. However, for gaze direction estimation we only need the 2D and 3D positions of the inner and outer eye corners (for the left and right eye). This information is used to estimate the cornea center and the eyeball center locations.

6.2.1 Finding the Eyeball Center

We calculate the location of the left and right eyeball center separately by using the following equation:

$$\mathbf{O} = \frac{1}{2}(\mathbf{E}_1 + \mathbf{E}_2) + \mathbf{d} = \mathbf{M} + \mathbf{d} \quad (19)$$

where d is the offset vector obtained from the training sequence.

6.2.2 Finding the Cornea Center

To find the cornea center we first project the 3D eye corners back to the left and right 2D image plane. A small ROI in the image is then defined between the inner and outer corner locations (figure 13). Then, template matching with a disk-shaped template on the intensity image is used to approximately locate the cornea. After that, we define an even smaller ROI around the initial cornea center location and apply the circular Hough transform on the edge image of the smaller ROI. The second ROI is used to filter out irrelevant edges. The pixel position with the best confidence (most votes) is the estimation of the cornea center.

The steps described above are done for the left and right image separately. The left and right 2D cornea center locations are then triangulated to find the 3D location.
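An illustrative OpenCV-based sketch of this two-stage search (the function names are OpenCV's, not the authors'; note that cv2.HoughCircles runs its own edge detection internally, whereas the paper applies the Hough transform to a precomputed edge image). The ROI size and Hough parameters are placeholders.

```python
import cv2

def find_cornea_center(eye_roi, disk_template, search=10):
    """Approximate cornea localisation: template matching, then a circular Hough
    transform in a smaller ROI around the initial estimate (section 6.2.2)."""
    gray = cv2.cvtColor(eye_roi, cv2.COLOR_BGR2GRAY)
    # Coarse localisation with a disk-shaped intensity template.
    res = cv2.matchTemplate(gray, disk_template, cv2.TM_CCOEFF_NORMED)
    _, _, _, (mx, my) = cv2.minMaxLoc(res)
    cx = mx + disk_template.shape[1] // 2
    cy = my + disk_template.shape[0] // 2
    # Refine inside a smaller ROI with the circular Hough transform.
    x0, y0 = max(cx - search, 0), max(cy - search, 0)
    roi = gray[y0:cy + search, x0:cx + search]
    circles = cv2.HoughCircles(roi, cv2.HOUGH_GRADIENT, dp=1, minDist=search,
                               param1=80, param2=10, minRadius=3, maxRadius=12)
    if circles is None:
        return cx, cy                  # fall back to the template-matching estimate
    x, y, _ = circles[0, 0]            # circle with the most votes
    return x0 + float(x), y0 + float(y)
```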

6.2.3 The 3D Gaze Vector

After finding the 3D cornea center location C and the 3D eyeball center O for the left and right eye, the gaze vector for the current frame is then calculated by

$$\mathbf{v}_g = \mathbf{C} - \mathbf{O}, \qquad \hat{\mathbf{v}}_g = \frac{\mathbf{v}_g}{\|\mathbf{v}_g\|} \quad (20)$$

The normalized left and right gaze vectors are finally forwarded to the intersection calculation module (see figure 12).

6.3 Intersecting Gaze Vector with the Screen

The overview of the intersection calculation module is shown in figure 14. To intersect the gaze ray with the screen we need information about the screen location. In figure 15, the gaze direction is projected on the screen in point P.

Figure 14. Detailed block diagram of the intersection calculation module.

Figure 15. Illustration of the ray-plane intersection.

The resulting gaze ray can be written in parametric representation as:

$$\mathbf{g}(t) = \mathbf{O} + \hat{\mathbf{v}}_g t \quad (21)$$

where O is the eyeball center and v̂_g is the unit gaze vector. For a certain scalar t, the gaze ray will intersect the screen at point P.

By using the knowledge that the dot product of every point in a plane with the plane's normal is a constant [9]:

$$\mathbf{N}\cdot\mathbf{P} = \mathbf{N}\cdot\mathbf{O}_s = c$$

and the parametric representation of the gaze ray in equation (21), we can obtain the value t_P at which the gaze ray intersects the screen plane:

$$\mathbf{N}\cdot(\mathbf{O} + \hat{\mathbf{v}}_g t_P) = \mathbf{N}\cdot\mathbf{O}_s, \qquad t_P = -\frac{-\mathbf{N}\cdot\mathbf{O}_s + \mathbf{N}\cdot\mathbf{O}}{\mathbf{N}\cdot\hat{\mathbf{v}}_g} \quad (22)$$

Equation (22) can be further simplified if we do the calculation in the screen coordinate system. We then have O_s = 0 and N = [0 0 1]^T, reducing the calculation to a division of two scalars:

$$t_P = -\frac{o_z}{\hat{v}_{g,z}} \quad (23)$$

where o_z is the z component of the eyeball center (in the screen coordinate system).

For the output of the whole system, the average of the projected gaze points from the left and right eye is taken to compensate for the effect of noise.
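The intersection of equations (21)-(23) expressed in screen coordinates: transform the eyeball center and gaze vector with the world-to-screen transformation from section 4.2, intersect with the z = 0 plane, and average the left and right projections.

```python
import numpy as np

def gaze_on_screen(O_world, v_world, R_ws, T_ws):
    """Project one gaze ray onto the screen plane (equations (21)-(23)).

    O_world : 3D eyeball center in world coordinates.
    v_world : normalized 3D gaze vector in world coordinates.
    R_ws, T_ws : world-to-screen rotation and translation (section 4.2).
    Returns the 2D intersection point in screen coordinates.
    """
    O_s = R_ws @ O_world + T_ws     # eyeball center in the screen frame
    v_s = R_ws @ v_world            # gaze direction in the screen frame
    t_p = -O_s[2] / v_s[2]          # equation (23): intersect the z = 0 plane
    p = O_s + t_p * v_s             # equation (21) evaluated at t_P
    return p[:2]

def average_gaze(O_l, v_l, O_r, v_r, R_ws, T_ws):
    """Average the left- and right-eye projections to reduce noise."""
    return 0.5 * (gaze_on_screen(O_l, v_l, R_ws, T_ws) +
                  gaze_on_screen(O_r, v_r, R_ws, T_ws))
```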

7 EXPERIMENTAL RESULTS

In this section we evaluate the performance of each module of the gaze tracking system. The calibration and screen registration results are presented in section 7.1. Section 7.2 discusses the tracking performance of the auxiliary particle filtering. The gaze training and estimation results are shown in section 7.3 and finally, we tested the whole system by applying it to some sequences in section 7.4.

7.1 Stereo Calibration and Screen Registration

To calibrate the web camera pair, we took 16 image pairs (320×240 pixels) of the checkerboard calibration grid under various positions. The first eight shots were taken when the grid was held about 50 cm away from the camera. The remaining shots were made while holding the grid about 120 cm away from the camera. Table I shows the estimated camera parameters. (Note that the rotation matrix is only represented by three rotation angles, one for the x, y and z axis respectively.)

The results shown in this table indicate that the average horizontal and vertical reprojection errors are very small (below 0.1 pixel). The reprojection error remains relatively constant if fewer than 16 images are taken, but this results in a larger error of the estimated parameters.

For the screen registration we took another 5 shots containing the mirror in various positions (figure 16). By using the method described in section 4.2, we could compute the position of the screen with respect to the world frame for each stereo image pair. The estimated world-to-screen transformation M_ws for each mirror position is listed in table II.

We can see from the standard deviations of the rotation angles and the translation vectors that the screen registration method is accurate, with up to about 2 mm translation error and less than 1◦ rotation error. The mean value of the transformation will be used to determine the screen location in the intersection calculation module.

TABLE I
STEREO CAMERA CALIBRATION RESULTS

Intrinsic parameters         Left camera              Right camera
                             optimized    std.        optimized    std.
Focal lengths       fx       519.57       0.39        509.91       0.37
                    fy       516.91       0.39        506.92       0.38
Principal point     u0       191.41       1.19        184.91       1.24
                    v0       111.25       0.87        121.51       0.81
Radial dist.        k1       -0.2730      0.0070      -0.2839      0.0063
                    k2       0.057        0.047       0.214        0.049
Tangential dist.    k3       -0.00169     0.00029     0.00064      0.00032
                    k4       -0.00074     0.00035     -0.00119     0.00026
Avg. reproj. error  x        0.077                    0.080
(pixel)             y        0.069                    0.079

Extrinsic parameters                      optimized    std.
Rotation (degree)            Rα           0.229        0.103
                             Rβ           9.147        0.157
                             Rγ           1.290        0.013
Translation (mm)             Tx           -94.910      0.085
                             Ty           -2.794       0.084
                             Tz           1.098        0.883

Figure 16. An example of the shots for the screen registration. The images shown here were taken from the left camera.

TABLE II
SCREEN REGISTRATION RESULTS: THE WORLD-TO-SCREEN TRANSFORMATION Mws

                         Rotation angle (deg.)      Translation vector (mm)
                         (Rα, Rβ, Rγ)               (Tx, Ty, Tz)
Mirror position #1       (25.46, 9.58, 1.74)        (65.42, 205.06, 142.88)
                #2       (24.96, 8.59, 1.71)        (67.72, 210.55, 145.49)
                #3       (25.40, 8.68, 1.80)        (64.73, 207.03, 145.15)
                #4       (24.44, 8.59, 1.64)        (69.45, 209.21, 144.52)
                #5       (24.90, 10.43, 1.45)       (70.04, 209.26, 145.84)

                              Mean       Standard deviation
Rotation angle (degree)       25.03      0.42
                              9.17       0.82
                              1.67       0.13
Translation vector (mm)       67.47      2.36
                              208.22     2.17
                              144.78     1.17

7.2 Head Pose Tracking

Figure 17 shows an example of the head pose tracking using K = 100 particles. The tracking was performed by choosing α = 0.7 and β = 0.5 for the horizontal and vertical speed components respectively, with a noise standard deviation of σ_n = 1.8 pixels (see equation (11)). The choice of these parameters depends strongly on the expected speed of the head movements. If only slow movements are present, then we can choose a smaller value for α, β and σ_n, thereby improving the tracking precision (smaller jitter). Using larger values will decrease the precision, but this makes the tracking more robust to faster movements.

Figure 17. Example of the head pose tracking with particle filtering. The results presented here were taken from the left camera for frames 1 (user initialization), 51, 86, 122 (from left to right and top to bottom).

Figure 18. The reference face shape model (above) and the estimated face shape in the face coordinate system for all frames (below).

When we compare the estimated face shapes of all frames in the same sequence, we see that the shape varies slightly over time (figure 18). This is caused by the stochastic nature of particle filtering. The statistics are shown in table III. This variation would render the user's eyeball model useless, because it assumes that the eye corners are fixed with respect to the whole face shape model. This is the reason that we fitted the reference shape to the estimated shape (section 5.4). In this way the rigidness of the face shape in each frame is preserved.


TABLE III
STATISTICS OF THE ESTIMATED FACE SHAPE

Facial feature           Standard deviations (mm)
                         σx      σy      σz
mouth corner 1           0.41    0.45    1.68
mouth corner 2           0.58    0.51    1.91
left eye corner 1        0.32    0.79    1.83
left eye corner 2        0.32    1.44    1.52
right eye corner 1       0.42    1.34    2.43
right eye corner 2       0.33    1.18    2.15

TABLE IV
ESTIMATED EYEBALL PARAMETERS

Eyeball param. (mm)               Left eye    Right eye
Radius             r              11.36
Offset vector      dx             -0.99       0.23
                   dy             7.38        6.35
                   dz             -9.65       -5.33

Error (mm)         Ground truth   Measurement    Difference
x1                 240.00         268.18         28.18
y1                 180.00         172.76         -7.24
x2                 -30.00         -50.38         -20.38
y2                 150.00         161.87         -18.13
x3                 -30.00         -50.91         -20.91
y3                 -30.00         -18.11         11.89
x4                 240.00         262.21         22.21
y4                 -30.00         -6.78          23.22

Average error: 28.14
Standard deviation: 3.38

7.3 Gaze Direction Estimation

Before we could estimate the gaze direction, we trained the system in order to estimate the user-dependent model parameters (see section 6.1). The training sequence was acquired by recording the user's eye corners and cornea positions while he was looking at 4 calibration points on the corners of the screen. After that we estimated the eyeball parameters by solving equation (18). The results are summarized in table IV.

We analyzed how errors in two quantities affect the overall gaze error: the cornea center and the eyeball center. Together they determine the gaze vector (section 6.2). A new sequence was acquired while the user was looking at one point with his head fixed. Since the head and cornea were fixed, the variations in the tracked eye corners (or, indirectly, the eyeball center) and the cornea center locations were only caused by the algorithm. The standard deviations of the eyeball center and cornea center are shown in table V. We can see here that the cornea fitting produced a deviation almost twice as large as that of the eyeball center.

For the following calculation, the mean of the cornea center over all frames, in which the head and eyes were held steady, was considered the 'true' location. The same consideration was also made for the mean of the eyeball centers over all frames.

TABLE V
TRACKING AND CORNEA FITTING ERROR

Standard deviations (mm)          Left eye    Right eye
Eyeball center       σx           0.34        0.28
                     σy           0.95        0.78
                     σz           1.53        1.20
Cornea center        σx           0.21        0.54
                     σy           0.33        0.44
                     σz           2.47        2.20

TABLE VI
EFFECTS OF INDIVIDUAL PARAMETER ERROR ON THE GAZE PROJECTION ERROR

Gaze projection              Exp. #1          Exp. #2          Exp. #3
std. dev. (mm)               x       y        x       y        x       y
None fixed                   10.53   61.90    16.09   52.65    17.59   88.58
Eyeball center fixed         16.88   58.79    15.89   55.14    23.16   52.96
Cornea center fixed          20.63   29.09    21.01   36.00    30.41   55.15

The gaze projections on the screen were calculated in three passes. First, the gaze vector in each frame was calculated as usual by equation (20). In the second pass, we held the cornea center constant over all frames by taking its mean. In the last pass, the eyeball center location was held constant over all frames, again by taking its mean. This experiment was repeated 3 times on the same sequence to make the results more reliable, since particle filtering is stochastic in nature (table VI).
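The three passes can be summarized as in the sketch below. It assumes that the gaze ray runs from the eyeball center through the cornea center and is intersected with the registered screen plane (point p0, unit normal n); this gaze model only stands in for equation (20), and the spread is reported in world coordinates rather than in screen coordinates (Python/NumPy):

import numpy as np

# eyeball, cornea: (F, 3) arrays of per-frame 3D positions.
# p0, n: a point on the registered screen plane and its unit normal.
def project_on_screen(eyeball, cornea, p0, n):
    g = cornea - eyeball
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    s = ((p0 - eyeball) @ n) / (g @ n)           # ray-plane intersection
    return eyeball + s[:, None] * g              # (F, 3) gaze points on screen

def three_pass_spread(eyeball, cornea, p0, n):
    mean_e = np.tile(eyeball.mean(axis=0), (len(eyeball), 1))
    mean_c = np.tile(cornea.mean(axis=0), (len(cornea), 1))
    passes = {
        "none fixed": (eyeball, cornea),
        "cornea center fixed": (eyeball, mean_c),
        "eyeball center fixed": (mean_e, cornea),
    }
    return {name: project_on_screen(e, c, p0, n).std(axis=0)
            for name, (e, c) in passes.items()}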

The results indicate that errors in the cornea center fitting have the largest influence on the vertical gaze projection error. If the noise in the cornea center is removed (by taking the average), the spread of the gaze projection error becomes smaller and rounder (see also figure 19).

Figure 19. The plot of the gaze projections of the first experiment of table VI. The symbols represent the mean gaze projection (+), the gaze points when no parameters are fixed (◦), with fixed eyeball center (∗), and with fixed cornea center (×).


7.4 Overall Performance

In this section we present the overall results of the gaze tracking system when applied to the training sequence and a test sequence. Both sequences were recorded while a person was looking at the same 4 calibration points (figure 20). As we can see, the average error of the gaze projection on the screen is about 6 cm, which corresponds to an angular error of about 7° at a distance of 50 cm. Figure 21 shows the 3D gaze vectors from the left and right eyes projected back onto the image plane.
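As a quick check, the reported projection error translates into an angular error at the stated 50 cm viewing distance as

\theta \approx \arctan\left(\frac{6\ \text{cm}}{50\ \text{cm}}\right) \approx 6.8^{\circ},

which is consistent with the quoted value of about 7°.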

When we compare the gaze direction estimates to all 4 calibration points, we see that the projected gaze points on the lower part of the screen have a much smaller and rounder spread. This is caused by the cornea fitting error. Since the cameras were located below the screen, we had an almost frontal view of the face when the user was looking at the lower part of the screen. Hence, the fitting produced a smaller error because the cornea image has a circular form. The further the user's gaze moved away from the camera, the more elliptic the cornea image projection became, making it more difficult to fit. As a result, a spread similar to that of section 7.3 was observed for the gaze to the upper part of the screen.
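This ellipticity follows directly from the viewing geometry. Under an orthographic approximation (used here only for illustration), a circular cornea boundary of radius r viewed at an angle θ away from the frontal direction projects to an ellipse with semi-axes

a = r, \qquad b = r\cos\theta,

so the image of the cornea becomes increasingly eccentric, and harder to fit, as the gaze moves away from the cameras.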

Average gaze projection error (training sequence, upper panel): 65.55 mm (std: 34.54 mm)
Average gaze projection error (test sequence, lower panel): 63.78 mm (std: 32.11 mm)

Figure 20. The plots of the projected gaze points from the training sequence (above) and the test sequence (below). The gaze points for each calibration point are represented by different symbols.

8 CONCLUSIONS AND RECOMMENDATIONS

In this paper a gaze tracking system based on particle filtering and stereo vision was presented. We propose to track facial features in 2D with particle filtering and to use the stereo information to estimate the 3D head pose of a user. Together with a 3D eyeball model, the 3D gaze direction can be estimated.

For gaze tracking applications in visual perception research, we need to know the projection of the user's gaze on the screen on which the visual stimuli are presented. We devised a screen registration scheme in order to accurately locate the screen with respect to the cameras. With this information, the gaze projection on the screen can be calculated.

The results achieved by our gaze tracking scheme are promising. The average gaze projection error is about 7°, only a few degrees off the specified requirements. At a user-monitor distance of 50 cm and with a 17-inch screen (about 30×24 cm), this means that we can distinguish gaze projections on the screen in about 5×3 distinct blocks.

There is still room for improvement in our gaze tracking system. Based on the results in the previous section, the cornea fitting should be the main concern, since errors in this part have the greatest influence on the overall gaze error. A higher resolution of the cornea fitting is needed, for example by using a larger image resolution or by fitting the cornea with sub-pixel accuracy. Together with a more sophisticated fitting algorithm such as ellipse fitting [7], better results should be achievable. Another possible source of error is the exclusion of the difference between the anatomical and visual axes of the eye from the 3D eye model. Compensating for this difference will also reduce the overall gaze tracking error.
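To make this recommendation concrete, the sketch below implements the direct least-squares ellipse fit of [7] in its numerically stable (Halir-Flusser) form. The cornea boundary points it expects would have to come from a sub-pixel edge detection step that is not part of the current system, so this is a sketch of a possible extension rather than of the method used above (Python/NumPy):

import numpy as np

# Sketch of direct least-squares ellipse fitting [7] (Halir-Flusser form).
# x, y: 1D arrays of cornea boundary points (assumed to come from a
# sub-pixel edge detector, which is not part of the current system).
def fit_ellipse(x, y):
    """Returns the conic coefficients (A, B, C, D, E, F) of
    A x^2 + B x y + C y^2 + D x + E y + F = 0 and the ellipse center."""
    D1 = np.column_stack([x * x, x * y, y * y])        # quadratic terms
    D2 = np.column_stack([x, y, np.ones_like(x)])      # linear terms
    S1, S2, S3 = D1.T @ D1, D1.T @ D2, D2.T @ D2
    T = -np.linalg.solve(S3, S2.T)                     # eliminates linear part
    M = S1 + S2 @ T
    M = np.array([M[2] / 2.0, -M[1], M[0] / 2.0])      # premultiply by inv(C1)
    w, v = np.linalg.eig(M)
    v = np.real(v)                                     # relevant vector is real
    cond = 4.0 * v[0] * v[2] - v[1] ** 2               # ellipse constraint 4AC - B^2 > 0
    a1 = v[:, cond > 0][:, 0]
    coeffs = np.concatenate([a1, T @ a1])              # (A, B, C, D, E, F)
    A, B, C, D, E, _ = coeffs
    center = np.linalg.solve([[2 * A, B], [B, 2 * C]], [-D, -E])
    return coeffs, center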

The head pose tracking module still has some difficulty tracking persons with glasses and fast head movements. The use of lighting- and rotation-invariant templates might help to reduce tracking loss. Furthermore, some smoothing in the temporal domain could help to reduce the jitter in the estimated facial feature locations. Finally, to eliminate the manual selection of the facial features in the initialization phase, the possibility of automatically locating these features should be explored.
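As a simple example of such temporal smoothing, an exponential moving average over the estimated feature locations already suppresses frame-to-frame jitter; the smoothing factor used below is an illustrative value, not one taken from the paper (Python/NumPy):

import numpy as np

# Exponential moving average over per-frame feature locations.
# features: (F, N, 2) array; smoothing factor 0.6 is only illustrative.
def smooth_features(features, factor=0.6):
    out = np.empty_like(features, dtype=float)
    out[0] = features[0]
    for t in range(1, len(features)):
        out[t] = factor * features[t] + (1.0 - factor) * out[t - 1]
    return out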

REFERENCES

[1] Applied Science Laboratories (ASL), USA. http://www.a-s-l.com/ (Last visited: 22 October 2004)

[2] Baluja, S. and Pomerleau, D., Non-intrusive Gaze Tracking using Artificial Neural Networks, Report no. CMU-CS-94-102, Carnegie Mellon University, 1994.


Figure 21. Results of the gaze direction detection. The vectors are drawn starting from the cornea centers of the left and right eye, respectively.

[3] Bouguet, J.Y., Camera Calibration Toolbox for MATLAB. http://www.vision.caltech.edu/bouguetj/calib_doc/ (Last visited: 28 September 2004)

[4] Duchowski, A.T., Eye Tracking Methodology: Theory and Practice, London: Springer, 2003.

[5] Ebisawa, Y., "Improved Video-based Eye-gaze Detection Method", IEEE Transactions on Instrumentation and Measurement, 47(4):948–955, 1998.

[6] Eye Response Technologies (ERT), USA. http://www.eyeresponse.com/ (Last visited: 22 October 2004)

[7] Fitzgibbon, A.W., Pilu, M. and Fisher, R.B., "Direct Least-squares Fitting of Ellipses", IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):476–480, 1999.

[8] Fourward Technologies, Inc., USA. http://www.fourward.com/ (Last visited: 22 October 2004)

[9] Glassner, A.S., Graphics Gems, pp. 390–391, Cambridge: Academic Press, 1990.

[10] Isard, M. and Blake, A., "Condensation – Conditional Density Propagation for Visual Tracking", International Journal of Computer Vision, 29(1):5–28, 1998.

[11] Isard, M. and Blake, A., "ICondensation: Unifying Low-level and High-level Tracking in a Stochastic Framework", Proceedings of the 5th European Conference on Computer Vision, vol. 1, pp. 893–908, 1998.

[12] Ishikawa, T., Baker, S., Matthews, I. and Kanade, T., "Passive Driver Gaze Tracking with Active Appearance Models", Proceedings of the 11th World Congress on Intelligent Transportation Systems, 2004.

[13] Ji, Q. and Zhu, Z., "Eye and Gaze Tracking for Interactive Graphic Display", International Symposium on Smart Graphics, 2002.

[14] Lichtenauer, J., Reinders, M. and Hendriks, E., "Influence of the Observation Likelihood Function on Particle Filtering Performance in Tracking Applications", Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 767–772, 2004.

[15] Matsumoto, Y. and Zelinsky, A., "An Algorithm for Real-time Stereo Vision Implementation", Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 499–504, 2000.

[16] Patras, I. and Pantic, M., "Particle Filtering with Factorized Likelihoods for Tracking Facial Features", Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 97–102, 2004.

[17] Reingold, E.M., McConkie, G.W. and Stampe, D.M., "Gaze-contingent Multiresolutional Displays: An Integrative Review", Human Factors, 45(2):307–328, 2003.

[18] Tobii Technology AB, Sweden. http://www.tobii.se/ (Last visited: 22 October 2004)

[19] Wooding, D., Eye Movement Equipment Database (EMED), UK, 2002. http://ibs.derby.ac.uk/emed/ (Last visited: 22 October 2004)

[20] Zhang, Z., "A Flexible New Technique for Camera Calibration", IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.
