
Eye Gaze Correction With Stereovision for Video-Teleconferencing

Ruigang Yang*1 and Zhengyou Zhang2

1 Dept. of Computer Science, University of North Carolina at Chapel Hill, North Carolina, USA. [email protected]

2 Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. [email protected]

Abstract. The lack of eye contact in desktop video teleconferencing substantially reduces the effectiveness of video contents. While expensive and bulky hardware is available on the market to correct eye gaze, researchers have been trying to provide a practical software-based solution to bring video-teleconferencing one step closer to the mass market. This paper presents a novel approach that is based on stereo analysis combined with rich domain knowledge (a personalized face model). This marriage is mutually beneficial. The personalized face model greatly improves the accuracy and robustness of the stereo analysis by substantially reducing the search range; the stereo techniques, using both feature matching and template matching, allow us to extract 3D information of objects other than the face and to determine the head pose in a much more reliable way than if only one camera is used. Thus we enjoy the versatility of stereo techniques without suffering from their vulnerability. By emphasizing a 3D description of the scene on the face part, we synthesize virtual views that maintain eye contact using graphics hardware. Our current system is able to generate an eye-gaze corrected video stream at about 5 frames per second on a commodity PC.

Keywords: Stereoscopic vision, Eye-gaze correction, Structure from motion.

1 Introduction

Video-teleconferencing, a technology enabling communication with people face-to-face over remote distances, does not seem to be as widespread as predicted. Among the many problems faced in video-teleconferencing, such as cost, network bandwidth, and resolution, the lack of eye contact seems to be the most difficult one to overcome [18]. The reason is that the camera and the display screen cannot be physically aligned in a typical desktop environment, which results in unnatural and even awkward interaction. Special hardware using half-silvered mirrors has been used to address this problem. However, this arrangement is bulky and its cost is substantially high. What is more, as a piece of dedicated equipment, it does not fit well into the familiar computing environment, so its usability is greatly reduced. We aim to address the eye-contact problem by synthesizing videos as if they were taken from a camera behind the display screen, thus establishing natural eye contact between video-teleconferencing participants without using any kind of special hardware.

* This work was mainly conducted while the first author was at Microsoft Research as a summer intern.



The approach we take involves three steps: pose tracking, view matching, and view synthesis. We use a pair of calibrated stereo cameras and a personalized face model to track the head pose in 3D. The use of strong domain knowledge (a personalized face model) and a stereo camera pair greatly increases the robustness and accuracy of the 3D head pose tracking. The stereo camera pair also allows us to match parts not modelled in the face model, such as the hands and shoulders, thus providing wider coverage of the subject. Finally, the results from the head tracking and stereo matching are combined to generate a virtual view. Unlike other methods that only "cut and paste" the face part of the image, our method generates natural-looking and seamless images, as shown in Figure 1.

Fig. 1. Eye-gaze correction: the two images shown on the left are taken from a pair of stereo cameras mounted on the top and bottom sides of a monitor; the image shown above is a synthesized virtual view that preserves eye contact.

There have been attempts to address the eye-contact problem using a single camera with a face model [12], [10], or using dense stereo matching techniques [19], [15]. We will discuss these approaches in more detail in Section 2. With a single camera, we found it difficult to maintain both the real-time requirement and the level of accuracy we want for head tracking. Existing model-based monocular head tracking methods [11], [4], [2], [7], [1] either use a simplistic model so they can operate in real time but produce less accurate results, or use sophisticated models and processing to yield highly accurate results but take at least several seconds to compute. A single-camera configuration also has difficulty dealing with occlusions. Considering these problems with a monocular system, we decided to adopt a stereo configuration. The important epipolar constraint in stereo allows us to reject most outliers without using expensive robust estimation techniques, thus keeping our tracking algorithm both robust and simple enough to operate in real time. Furthermore, two cameras usually provide more coverage of the scene.


Fig. 2. Camera-screen displacement causes the loss of eye contact.

One might ask why we do not use a dense stereo matching algorithm. We argue that, first, doing dense stereo matching on a commodity PC in real time is difficult, even with today's latest hardware. Secondly, and most importantly, dense stereo matching is unlikely to generate satisfactory results due to the limitation on camera placement. Aiming at desktop video conferencing applications, we can only put the cameras around the frame of a display monitor. If we put the cameras on opposite edges of the display, given the normal viewing distance, we have to converge the cameras towards the person sitting in front of the desktop, and such a stereo system will have a long baseline. That makes stereo matching very difficult; even if we were able to get a perfect matching, there would still be a significant portion of the subject that is occluded in one view or the other. Alternatively, if we put the cameras close to each other on the same edge of the monitor frame, the occlusion problem is less severe, but generalization to new distant views is poor because a significant portion of the face is not observed. After considering various aspects, we have decided to put one camera on the upper edge of the display and the other on the lower edge, and to follow a model-based stereo approach to eye-gaze correction.

2 Related Work

In a typical desktop video-teleconferencing setup, the camera and the display screen cannot be physically aligned, as depicted in Figure 2. A participant looks at the image on the monitor but not directly into the camera, thus she does not appear to make eye contact with the remote party. Research [23] has shown that if the divergence angle (α) between the camera and the display is greater than five degrees, the loss of eye contact is noticeable. If we mount a small camera on the side of a 21-inch monitor, and the normal viewing position is 20 inches away from the screen, the divergence angle will be 17 degrees, well above the threshold below which eye contact can be maintained. Under such a setup, the video loses much of its communication value and becomes ineffective compared to the telephone.
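As a rough check on this figure, the divergence angle can be approximated as α ≈ arctan(h/d), where h is the vertical offset between the camera and the point on the screen the user looks at, and d is the viewing distance. Assuming, for illustration, that the camera sits about 6 inches above the on-screen gaze point (roughly half the visible height of a 21-inch monitor), we get α ≈ arctan(6/20) ≈ 17°, consistent with the value quoted above.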

Several systems have been proposed to reduce or eliminate the angular deviation using special hardware. They make use of half-silvered mirrors or transparent screens with projectors to allow the camera to be placed on the optical path of the display. A brief review of these hardware-based techniques is given in [14]. The high cost and the bulky setup prevent them from being used in a ubiquitous way.

On the other track, researchers have attempted to create eye contact using computer vision and computer graphics algorithms. Ott et al. [19] proposed to create a virtual center view given two cameras mounted on either side of the display screen. Stereoscopic analysis of the two camera views provides a depth map of the scene. Thus it is possible to "rotate" one of the views to obtain a center virtual view that preserves eye contact.


Similarly, Liu et al. [15] used a trinocular stereo setup to establish eye contact. In both cases, they perform dense stereo matching without taking into account the domain knowledge. While they are generic enough to handle a variety of objects besides faces, they are likely to suffer from the vulnerability of brute-force stereo matching. Furthermore, as discussed in the previous section, we suspect that direct dense stereo matching is unlikely to generate satisfactory results due to the constraint on camera placement imposed by the size of the display monitor, a problem that may have been less severe back in the early 90's, when the above two algorithms were proposed.

Cham and Jones at Compaq Cambridge Research Laboratory [6] approached this problem from a machine-learning standpoint. They first register a 2D face model to the input image taken from a single camera, then morph the face model to the desired image. The key is to learn a function that maps the registered face model parameters to the desired morphed model parameters [12]. They achieve this by non-linear regression from sufficient instances of registered-morphed parameter pairs obtained from training data. As far as we know, their research is still at a very early stage, so it is not clear whether this approach is capable of handling dramatic facial expression changes. Furthermore, they only deal with the face part of the image: the morphed face image is superimposed on the original image frame, which sometimes leads to errors near the silhouettes due to visibility changes.

The GazeMaster project at Microsoft Research [10] uses a single camera to track the head orientation and eye positions. Their view synthesis is quite unique in that they first replace the human eyes in a video frame with synthetic eyes gazing in the desired direction, then texture-map the eye-gaze corrected video frame onto a generic rigid face model rotated to the desired orientation. The synthesized photos they published look more like avatars, probably due to the underlying generic face model. Another drawback is that, as noted in their report, using synthetic eyes sometimes inadvertently changes the facial expression as well.

At a much higher level, the GazeMaster work is similar to our proposed approach, in the sense that both use strong domain knowledge (a 3D face model) to facilitate the tracking and view synthesis. However, our underlying algorithms, from tracking to view synthesis, are very different from theirs. We incorporate a stereo camera pair, which provides the important epipolar constraint that we use throughout the entire process. Furthermore, the configuration of our stereo cameras provides much wider coverage of the face, allowing us to generate new distant views without having to worry about occlusions.

3 System Overview

Figure 3 illustrates the block diagram of our eye-gaze correction system. We use two digital video cameras mounted vertically, one on the top and the other on the bottom of the display screen. They are connected to a PC through 1394 links. The cameras are calibrated using the method in [24]. We choose the vertical setup because it provides wider coverage of the subject and higher disambiguation power in feature matching: matching ambiguity usually involves symmetric facial features, such as eyes and lip contours, aligned horizontally. The user's personalized face model is acquired using a rapid face modelling tool [16].


Fig. 3. The components of our eye-gaze correction system (block diagram: head pose tracking, template matching, contour matching, and view synthesis, together with the face model and the estimated 3D head pose).

Both the calibration and model acquisition require little human interaction, and a novice user can complete these tasks within 15 minutes.

With this prior knowledge, we are able to correct the eye gaze using the algorithm outlined as follows:

1. Background model acquisition
2. Face tracker initialization
3. For each image pair, perform (see the sketch after this list):
   – Background subtraction
   – Temporal feature tracking in both images
   – Updating the head pose
   – Correlation-based stereo feature matching
   – Stereo silhouette matching
   – Hardware-assisted view synthesis
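A minimal sketch of the per-frame loop in step 3 is given below; every type and function name is an illustrative placeholder for the corresponding stage described in Sections 4 to 6, not the actual implementation.

```cpp
// Sketch of the per-frame processing loop (step 3 above). All types and
// function names here are illustrative placeholders.
#include <vector>

struct Image {};                                  // a captured video frame (placeholder)
struct Match { float u0, v0, u1, v1; };           // a point correspondence between views
struct HeadPose { double R[3][3]; double t[3]; };

// Placeholder stages; each would wrap the algorithms of Sections 4-6.
void subtractBackground(Image&, Image&) {}
std::vector<Match> trackFeatures(const Image&, const Image&) { return {}; }
HeadPose updateHeadPose(const std::vector<Match>&) { return {}; }
std::vector<Match> matchFeaturesByCorrelation(const Image&, const Image&) { return {}; }
std::vector<Match> matchContours(const Image&, const Image&) { return {}; }
void synthesizeView(const std::vector<Match>&, const HeadPose&, float cm) {}

void processFramePair(Image upper, Image lower, float cm /* view morphing factor */) {
    subtractBackground(upper, lower);                        // isolate the foreground
    auto tracked = trackFeatures(upper, lower);              // temporal tracking (Sec. 4)
    HeadPose pose = updateHeadPose(tracked);                 // minimize re-projection error
    auto stereo  = matchFeaturesByCorrelation(upper, lower); // Sec. 5.2
    auto contour = matchContours(upper, lower);              // Sec. 5.3
    std::vector<Match> all = tracked;
    all.insert(all.end(), stereo.begin(), stereo.end());
    all.insert(all.end(), contour.begin(), contour.end());
    synthesizeView(all, pose, cm);                           // Sec. 6, hardware-assisted
}
```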

Currently, the only manual part of the system is the face tracker initialization, which requires the user to interactively select a few markers. We are working on automatic initialization.

The tracking subsystem includes a feedback loop that supplies fresh salient features at each frame to make the tracking more stable under adverse conditions, such as partial occlusions and facial expression changes. Furthermore, an automatic tracking recovery mechanism is also implemented to make the whole system even more robust over extended periods of time. Based on the tracking information, we are already able to manipulate the head pose by projecting the live images onto the face model. However, we also want to capture the subtleties of facial expressions and other foreground objects, such as hands and shoulders. So we further conduct correlation-based feature matching and silhouette matching between the stereo images. All the matching information, together with the tracked features, is used to synthesize a seamless virtual image that looks as if it were taken from a camera behind the display screen. We have implemented the entire system under the MS Windows environment. Without any effort spent on optimizing the code, our current implementation runs at about 4-5 frames per second on a single-CPU 1 GHz PC.

4 Stereo 3D Head Pose Tracking

Our stereo head pose tracking problem can be stated as follows:

Given (i) a set of triplets S = {(p, q, m)} at time t, where p and q are respectively points in the upper (first) and the lower (second) camera images, and m is their corresponding point in the face model, and (ii) a pair of images from the stereo cameras at time t + 1,


determine (i) S′ = {(p′, q′, m)} at time t + 1, where p′ and q′ are the new positions of p and q, and (ii) the head pose, such that the projections of m at time t + 1 are p′ and q′ in the stereo image pair, respectively.

We use the KLT tracker to track feature points p, q from time t to t + 1 [22]. Note that there is one independent feature tracker for each camera, so we need to apply the epipolar constraint to remove stray points. The epipolar constraint states that if a point p = [u, v, 1]^T (expressed in homogeneous coordinates) in the first image and a point q = [u′, v′, 1]^T in the second image correspond to the same 3D point m in the physical world, they must satisfy the following equation:

q^T F p = 0    (1)

where F is the fundamental matrix that encodes the epipolar geometry between the two images [9]. In fact, Fp defines the epipolar line in the second image, so Equation (1) means that the point q must lie on the epipolar line Fp, and vice versa.

In practice, due to camera noise and inaccuracies in camera calibration and feature localization, we define a band of uncertainty along the epipolar line. For every triplet (p′, q′, m), if the distance from q′ to the epipolar line of p′ is greater than a certain threshold, the triplet is discarded. We use a distance threshold of three pixels in our experiments.
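As an illustration of this outlier test, a minimal sketch (not the authors' exact code) of the point-to-epipolar-line distance and the three-pixel threshold:

```cpp
#include <cmath>

// Distance (in pixels) from point q' in the second image to the epipolar line
// F*p' of point p' in the first image. Points are in homogeneous pixel
// coordinates [u, v, 1]; F is the 3x3 fundamental matrix.
double epipolarDistance(const double F[3][3], const double p[3], const double q[3]) {
    double l[3];                                   // epipolar line l = F * p
    for (int i = 0; i < 3; ++i)
        l[i] = F[i][0]*p[0] + F[i][1]*p[1] + F[i][2]*p[2];
    double num = std::fabs(q[0]*l[0] + q[1]*l[1] + q[2]*l[2]);  // |q^T F p|
    double den = std::sqrt(l[0]*l[0] + l[1]*l[1]);              // normalize to pixel units
    return num / den;
}

// A triplet (p', q', m) is kept only if the distance is below the threshold.
bool satisfiesEpipolarConstraint(const double F[3][3], const double p[3],
                                 const double q[3], double threshold = 3.0) {
    return epipolarDistance(F, p, q) <= threshold;
}
```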

After we have removed all the stray points that violate the epipolar constraint, we update the head pose, represented by a 3 × 3 rotation matrix R and a 3D translation vector t, so that the sum of re-projection errors of m to p′ and q′ is minimized. The re-projection error e is defined as

e = Σ_i ( ‖p′_i − φ(A_0(R m_i + t))‖² + ‖q′_i − φ(A_1[R_10(R m_i + t) + t_10])‖² )    (2)

where φ(·) represents the standard pinhole projection, A_0 and A_1 are the cameras' intrinsic parameters, and (R_10, t_10) is the transformation from the first camera's coordinate system to the second camera's. Solving for (R, t) by minimizing (2) is a nonlinear optimization problem. We can use the head pose from time t as the initial guess and conduct the minimization by means of, for example, the Levenberg-Marquardt algorithm.
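For concreteness, the sketch below evaluates the re-projection error of Eq. (2) for a candidate pose (R, t); it could serve as the objective inside a Levenberg-Marquardt loop. The small matrix and vector types are placeholders, and (R_10, t_10) is applied exactly as written in Eq. (2).

```cpp
#include <vector>

// Minimal types for the sketch; not the authors' data structures.
struct Vec3 { double x, y, z; };
struct Mat3 { double m[3][3]; };
struct Vec2 { double u, v; };

static Vec3 mul(const Mat3& A, const Vec3& v) {
    return { A.m[0][0]*v.x + A.m[0][1]*v.y + A.m[0][2]*v.z,
             A.m[1][0]*v.x + A.m[1][1]*v.y + A.m[1][2]*v.z,
             A.m[2][0]*v.x + A.m[2][1]*v.y + A.m[2][2]*v.z };
}
static Vec3 add(const Vec3& a, const Vec3& b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }

// Pinhole projection phi(A * X): apply the intrinsic matrix, then divide by depth.
static Vec2 project(const Mat3& A, const Vec3& X) {
    Vec3 h = mul(A, X);
    return { h.x / h.z, h.y / h.z };
}

// Re-projection error of Eq. (2) for a candidate head pose (R, t).
// A0, A1: camera intrinsics; (R10, t10): stereo transform applied as in Eq. (2);
// m: face model points; p, q: tracked image points in the two views.
double reprojectionError(const Mat3& R, const Vec3& t,
                         const Mat3& A0, const Mat3& A1,
                         const Mat3& R10, const Vec3& t10,
                         const std::vector<Vec3>& m,
                         const std::vector<Vec2>& p,
                         const std::vector<Vec2>& q) {
    double e = 0.0;
    for (size_t i = 0; i < m.size(); ++i) {
        Vec3 Xc0 = add(mul(R, m[i]), t);             // model point in first-camera frame
        Vec3 Xc1 = add(mul(R10, Xc0), t10);          // expressed in the other camera's frame
        Vec2 p_hat = project(A0, Xc0);
        Vec2 q_hat = project(A1, Xc1);
        double du0 = p[i].u - p_hat.u, dv0 = p[i].v - p_hat.v;
        double du1 = q[i].u - q_hat.u, dv1 = q[i].v - q_hat.v;
        e += du0*du0 + dv0*dv0 + du1*du1 + dv1*dv1;  // squared pixel residuals
    }
    return e;    // minimized over (R, t), e.g. with Levenberg-Marquardt
}
```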

After the head pose is determined, we replenish the matched set S′ by adding more good feature points selected using the criteria in [22]. A good feature point is a point with salient texture in its neighborhood. We must be careful not to add feature points in the non-rigid parts of the face, such as the mouth region. To do so, we define a bounding box around the tip of the nose that covers the forehead, eyes, and nose region. Any good feature point outside this bounding box will not be added to the matched set. However, such points will be used in the next stereo matching stage, which we will discuss in Section 5.2.

This replenishment scheme greatly improves the robustness of the tracking algorithm. Our experiments have shown that the tracker keeps tracking under large head rotations and dramatic facial expression changes.

4.1 Tracker Initialization and Auto-Recovery

The tracker needs to know the head pose at time 0 to start tracking. We let the user interactively select seven landmark points in each image, from which the initial head pose can be determined.


The initial selection is also used for tracking recovery when the tracker loses track. This may happen when the user moves out of the camera's field of view or rotates her head away from the cameras. Fortunately, for our video-conferencing application, we can simply send one of the video streams unmodified in these cases. When she turns back to look at the screen, we do need to continue tracking without human intervention, which requires automatic recovery of the head pose. During the tracking recovery process, the initial set of landmark points is used as templates to find the best match in the current image. When a match with a high confidence value is found, the tracker resumes normal tracking. Our recovery scheme is effective because, unlike most other tracking applications, we only need to track the user when she is looking at the display window, which is exactly the scenario in which the initial landmark templates were recorded.

Furthermore, we also activate the auto-recovery process whenever the current head pose is close to the initial head pose. This prevents the tracker from drifting. A more elaborate scheme that uses multiple templates from different head poses could further improve the effectiveness of automatic tracker reset. This could be further extended to a non-parametric description of head poses that self-calibrates over time.
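A minimal sketch of such a landmark template search is shown below, assuming the same normalized correlation score used in Section 5.2 and an illustrative confidence threshold of 0.8 (the paper does not specify the threshold value).

```cpp
#include <cmath>
#include <vector>

// Grayscale image in row-major order; a placeholder type for this sketch.
struct Gray {
    int w, h;
    std::vector<float> px;
    float at(int x, int y) const { return px[y * w + x]; }
};

// Normalized cross-correlation between a template and an equally sized image
// window with top-left corner (x0, y0). Returns a score in [-1, 1].
static float ncc(const Gray& img, int x0, int y0, const Gray& tpl) {
    double sumI = 0, sumT = 0, sumII = 0, sumTT = 0, sumIT = 0;
    int n = tpl.w * tpl.h;
    for (int y = 0; y < tpl.h; ++y)
        for (int x = 0; x < tpl.w; ++x) {
            double a = img.at(x0 + x, y0 + y), b = tpl.at(x, y);
            sumI += a; sumT += b; sumII += a*a; sumTT += b*b; sumIT += a*b;
        }
    double cov  = sumIT - sumI * sumT / n;
    double varI = sumII - sumI * sumI / n;
    double varT = sumTT - sumT * sumT / n;
    double den  = std::sqrt(varI * varT);
    return den > 1e-9 ? static_cast<float>(cov / den) : 0.f;
}

// Exhaustive search for the best match of one landmark template; recovery is
// accepted only if the best score exceeds a confidence threshold (assumed 0.8).
bool findLandmark(const Gray& img, const Gray& tpl, int& bestX, int& bestY,
                  float minScore = 0.8f) {
    float best = -1.f;
    for (int y = 0; y + tpl.h <= img.h; ++y)
        for (int x = 0; x + tpl.w <= img.w; ++x) {
            float s = ncc(img, x, y, tpl);
            if (s > best) { best = s; bestX = x; bestY = y; }
        }
    return best >= minScore;
}
```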

5 Stereo View Matching

The 3D head pose tracking gives a set of good matches within the rigid part of the face between the stereo pair. To generate convincing and photo-realistic virtual views, we need to find more matching points over the entire foreground images, especially along the contour and the non-rigid parts of the face. We incorporate both feature matching and template matching to find as many matches as possible. During this matching process, we use the reliable information obtained from tracking to constrain the search range. In areas where such information is not available, such as hands and shoulders, we relax the search threshold, then apply the disparity gradient limit to remove false matches.

To facilitate the matching (and later the view synthesis in Section 6), we rectify the images using the technique described in [17], so that the epipolar lines are horizontal.

5.1 Disparity and Disparity Gradient Limit

Before we present the details of our matching algorithm, it is helpful to define disparity, the disparity gradient, and the important principle of the disparity gradient limit, which will be exploited throughout the matching process.

Disparity is well defined for parallel cameras (i.e., the two image planes are the same) [9], and this is the case here because we perform stereo rectification so that the horizontal axes are aligned in both images. Given a pixel (u, v) in the first image and its corresponding pixel (u′, v′) in the second image, the disparity is defined as d = u′ − u (v = v′ since the images have been rectified). Disparity is inversely proportional to the distance of the 3D point to the cameras; a disparity of 0 implies that the 3D point is at infinity.

Consider now two 3D points whose projections are m_1 = [u_1, v_1]^T and m_2 = [u_2, v_2]^T in the first image, and m′_1 = [u′_1, v′_1]^T and m′_2 = [u′_2, v′_2]^T in the second image. Their disparity gradient is defined as the ratio of their difference in disparity


to their distance in the cyclopean image, i.e.,

DG = | (d_2 − d_1) / (u_2 − u_1 + (d_2 − d_1)/2) |    (3)

Experiments in psychophysics have provided evidence that human perception imposes the constraint that the disparity gradient DG is upper-bounded by a limit K. The limit K = 1 was reported in [5]. The theoretical limit for opaque surfaces is 2, to ensure that the surfaces are visible to both eyes [20]. As also reported in [20], fewer than 10% of world surfaces viewed at more than 26 cm with 6.5 cm of eye separation will present a disparity gradient larger than 0.5. This justifies the use of a disparity gradient limit well below the theoretical value (of 2) without imposing strong restrictions on the world surfaces that can be fused by the stereo algorithm. In our experiments, we use a disparity gradient limit of 0.8 (K = 0.8).

5.2 Feature Matching Using Correlation

For unmatched good features in the first (upper) image, we try to find corresponding points, if any, in the second (lower) image by template matching. We use normalized correlation over a 9 × 9 window to compute the matching score. The disparity search range is confined by existing matched points from tracking, when available.

Combined with the matched points from tracking, we build a sparse disparity map for the first image and use the following procedure to identify potential outliers (false matches) that do not satisfy the disparity gradient limit principle. For a matched pixel m and a neighboring matched pixel n, we compute the disparity gradient between them using (3). If DG ≤ K, we register a vote of good match for m; otherwise, we register a vote of bad match for m. After every matched pixel in the neighborhood of m has voted, we tally the votes: if the "good" votes are fewer than the "bad" votes, m is removed from the disparity map. This is done for every matched pixel in the disparity map; the result is a disparity map that conforms to the principle of the disparity gradient limit.
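In code, the disparity gradient test of Eq. (3) and the voting procedure look roughly as follows; the neighborhood radius is an assumed parameter, since the paper does not specify the neighborhood size.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// A matched pixel in the rectified first image together with its disparity.
struct MatchedPixel { double u, v, d; };

// Disparity gradient of Eq. (3) for two matched pixels (rectified images).
double disparityGradient(const MatchedPixel& m1, const MatchedPixel& m2) {
    double dd = m2.d - m1.d;
    double den = m2.u - m1.u + dd / 2.0;          // distance in the cyclopean image
    if (den == 0.0) return std::numeric_limits<double>::infinity();
    return std::fabs(dd / den);
}

// Voting scheme described above: each matched pixel collects good/bad votes
// from matched pixels in its neighborhood and is removed when the bad votes
// win. K = 0.8 is the disparity gradient limit used in the paper; the
// neighborhood radius is an assumed parameter.
std::vector<MatchedPixel> removeOutliers(const std::vector<MatchedPixel>& matches,
                                         double K = 0.8, double radius = 15.0) {
    std::vector<MatchedPixel> kept;
    for (size_t i = 0; i < matches.size(); ++i) {
        int good = 0, bad = 0;
        for (size_t j = 0; j < matches.size(); ++j) {
            if (i == j) continue;
            double du = matches[j].u - matches[i].u;
            double dv = matches[j].v - matches[i].v;
            if (du * du + dv * dv > radius * radius) continue;   // not a neighbor
            if (disparityGradient(matches[i], matches[j]) <= K) ++good;
            else ++bad;
        }
        if (good >= bad) kept.push_back(matches[i]);             // tally the votes
    }
    return kept;
}
```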

5.3 Contour Matching

Template matching assumes that corresponding image patches present some similarity. This assumption may break down at occluding boundaries, or object contours. Yet object contours are very important cues for view synthesis; the lack of matching information along object contours results in excessive smearing or blurring in the synthesized views. It is therefore necessary to include a module that extracts and matches the contours across views in our system.

The contour of the foreground object can be extracted after background subtraction. It is approximated by polygonal lines using the Douglas-Peucker algorithm [8]. The control points on the contour are further refined to sub-pixel accuracy using the "snake" technique [13]. Once we have two polygonal contours, denoted by P = {v_i | i = 1..n} in the first image and P′ = {v′_i | i = 1..m} in the second image, we use dynamic programming (DP) to find the globally optimal match between them.

Since it is straightforward to formulate contour matching as a dynamic programming problem with states, stages, and decisions, we will only discuss in detail the design of the cost functions (the reader is referred to Bellman's book on dynamic programming [3]).


There are two cost functions, the matching cost and the transition cost. The matching cost function C(i, j) measures the "goodness" of the match between segment V_i = v_i v_{i+1} in P and segment V′_j = v′_j v′_{j+1} in P′; the lower the cost, the better the match. The transition cost function W(i, j | i_0, j_0) measures the smoothness from segment V_{i_0} to segment V_i, assuming that (V_i, V′_j) and (V_{i_0}, V′_{j_0}) are matched pairs of segments. Usually, V_i and V_{i_0} are consecutive segments, i.e., ‖i_0 − i‖ ≤ 1; the transition cost penalizes matches that are out of order. The scoring scheme of DP, formulated as a forward recursion, is then given by

M(i, j) = min( M(i−1, j−1) + C(i, j) + W(i, j | i−1, j−1),
               M(i, j−1)   + C(i, j) + W(i, j | i, j−1),
               M(i−1, j)   + C(i, j) + W(i, j | i−1, j) ).
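The recursion can be implemented directly once C and W are available. A sketch follows; the boundary initialization (only M(0,0) = 0, no skipping of leading segments, cost callbacks accepting index 0 as a boundary sentinel) is an assumption of this sketch rather than something specified in the paper.

```cpp
#include <algorithm>
#include <functional>
#include <limits>
#include <vector>

// Forward DP recursion for contour matching. n and m are the numbers of
// segments in P and P'. The callbacks C(i, j) and W(i, j, i0, j0) implement
// Eqs. (4)-(9); indices are 1-based, and index 0 acts as a boundary sentinel.
double matchContoursDP(int n, int m,
                       const std::function<double(int, int)>& C,
                       const std::function<double(int, int, int, int)>& W) {
    const double INF = std::numeric_limits<double>::infinity();
    // M[i][j]: best accumulated cost of matching the first i segments of P
    // with the first j segments of P'.
    std::vector<std::vector<double>> M(n + 1, std::vector<double>(m + 1, INF));
    M[0][0] = 0.0;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j) {
            double best = INF;
            if (M[i-1][j-1] < INF)
                best = std::min(best, M[i-1][j-1] + C(i, j) + W(i, j, i-1, j-1));
            if (M[i][j-1] < INF)
                best = std::min(best, M[i][j-1] + C(i, j) + W(i, j, i, j-1));
            if (M[i-1][j] < INF)
                best = std::min(best, M[i-1][j] + C(i, j) + W(i, j, i-1, j));
            M[i][j] = best;
        }
    // The matched segment pairs are recovered by backtracking the argmin choices.
    return M[n][m];
}
```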

The Matching Cost. It takes into account the epipolar constraint, the orientation difference, and the disparity gradient limit.

Fig. 4. Applying the epipolar constraint to contour matching: (a) two segments overlap; (b) two segments do not overlap; (c) segments almost parallel to the epipolar lines.

– The epipolar constraint: We distinguish three configurations, as shown in Figure 4, where the red line is the contour in the first (upper) image and the blue line is the contour in the second (lower) image; the dotted lines are the corresponding epipolar lines. In Figure 4(a), segment bc and segment qr are being matched, and Ce = 0; the epipolar constraint limits the correspondence of qr to the segment b′c′, instead of the whole of bc. In Figure 4(b), the epipolar constraint tells us that segment ab cannot match segment rs because there is no overlap; in that case, a sufficiently large cost (Thighcost) is assigned to the match. When the orientation of at least one line segment is very close to that of the epipolar lines, the intersection of the epipolar line with the line segment cannot be computed reliably; in that case, the cost is the average inter-epipolar distance (d_e = (e_1 + e_2)/2), as illustrated in the figure. In summary, the epipolar constraint cost for a pair of segments (V_i, V′_j) is

  Ce = d_e         if V_i or V′_j is nearly parallel to the (horizontal) epipolar lines;
       0           if V_i and V′_j overlap;
       Thighcost   otherwise.    (4)

– The orientation difference: It is defined as a power function of the orientation difference between the proposed matching segments (V_i, V′_j). Let a_i and a_j be the orientations of V_i and V′_j; the orientation cost is

  Ca = (|a_i − a_j| / T_a)^n    (5)

where T_a is the angular difference threshold and n is the power factor. We use T_a = 30° and n = 2.

– The disparity gradient limit: It is similar to that used in template matching. However, we do not want to consider feature points when matching contour segments, because the contour lies on the occluding boundary, where the disparity gradient with respect to the matched feature points is very likely to exceed the limit. On the other hand, it is reasonable to assume that the disparity gradient limit is upheld between the two endpoints of a segment. We adopt the disparity prediction model in [25]. That is, given a pair of matched points (m_i, m′_i), the disparity of a point m is modeled as

  d = d_i + D_i n_i    (6)

where d_i = ‖m′_i − m_i‖, D_i = ‖m − m_i‖, and n_i ∼ N(0, σ²) with σ = K/(2 − K). A pair of matched segments contains two pairs of matched endpoints (m_s, m′_s) and (m_e, m′_e). We use (m_s, m′_s) to predict the disparity of m_e and compute the variance of the "real" disparity from the predicted one; similarly, we also compute the variance of the predicted disparity of m_s using (m_e, m′_e). As suggested in [25], the predicted variance should be less restrictive when the point being considered is farther from the matched point, which leads to the following formula:

  σ_i = (σ_max − σ_min)(1 − exp(−D_i²/τ²)) + σ_min    (7)

where the range [σ_min, σ_max] and τ are parameters; we use σ_min = 0.5, σ_max = 1.0, and τ = 30. Now we can finally write out the disparity gradient cost. Let d_s = ‖m′_s − m_s‖, d_e = ‖m′_e − m_e‖, D_1 = ‖m_e − m_s‖, D_2 = ‖m′_e − m′_s‖, and ∆d = d_e − d_s; σ_e and σ_s are computed by plugging D_1 and D_2 into (7). The disparity gradient cost is then given by

  Cd = ∆d²/σ_e² + ∆d²/σ_s².    (8)

Combining the above three terms, we obtain the final matching cost:

  C = min(Thighcost, Ce + w_a Ca + w_d Cd)    (9)

where w_a and w_d are weighting constants. The matching cost is capped at Thighcost; this is necessary to prevent a corrupted segment in the contour from contaminating the entire matching. In our implementation, Thighcost = 20, w_a = 1.0, and w_d = 1.0.
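As a concrete reading of Eqs. (5)-(9), the sketch below computes the matching cost for one candidate segment pair. The epipolar cost Ce is assumed to be computed separately from the rectified geometry (Eq. (4)), and the segment/point types are placeholders, not the authors' data structures.

```cpp
#include <algorithm>
#include <cmath>

struct Point2 { double u, v; };
struct Segment { Point2 a, b; };   // endpoints of a polygonal contour segment

static double dist(const Point2& p, const Point2& q) {
    return std::hypot(p.u - q.u, p.v - q.v);
}

// Eq. (7): predicted disparity standard deviation as a function of the
// distance D to the matched endpoint (sigma_min = 0.5, sigma_max = 1.0, tau = 30).
static double predictedSigma(double D, double sigMin = 0.5, double sigMax = 1.0,
                             double tau = 30.0) {
    return (sigMax - sigMin) * (1.0 - std::exp(-D * D / (tau * tau))) + sigMin;
}

// Matching cost of Eq. (9) for a candidate segment pair (V in P, Vp in P'),
// combining the orientation cost Ca (Eq. 5) and the disparity gradient cost
// Cd (Eq. 8). The epipolar cost Ce (Eq. 4) is passed in precomputed. Constants
// follow the paper (Ta = 30 deg, n = 2, wa = wd = 1, Thighcost = 20).
double matchingCost(const Segment& V, const Segment& Vp, double Ce,
                    double Thigh = 20.0, double wa = 1.0, double wd = 1.0) {
    const double PI = 3.14159265358979323846;

    // Orientation difference cost, Eq. (5).
    double ai = std::atan2(V.b.v - V.a.v, V.b.u - V.a.u);
    double aj = std::atan2(Vp.b.v - Vp.a.v, Vp.b.u - Vp.a.u);
    double da = std::fabs(ai - aj);
    if (da > PI) da = 2.0 * PI - da;               // wrap the angular difference
    double Ta = 30.0 * PI / 180.0;
    double Ca = std::pow(da / Ta, 2.0);

    // Disparity gradient cost, Eq. (8), using the segment endpoints.
    double ds = dist(Vp.a, V.a);                   // d_s = ||m'_s - m_s||
    double de = dist(Vp.b, V.b);                   // d_e = ||m'_e - m_e||
    double D1 = dist(V.b, V.a);                    // D_1 = ||m_e - m_s||
    double D2 = dist(Vp.b, Vp.a);                  // D_2 = ||m'_e - m'_s||
    double dd = de - ds;
    double sigE = predictedSigma(D1), sigS = predictedSigma(D2);
    double Cd = dd * dd / (sigE * sigE) + dd * dd / (sigS * sigS);

    // Eq. (9): combine and cap at Thighcost.
    return std::min(Thigh, Ce + wa * Ca + wd * Cd);
}
```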

The Transition Cost. In contour matching, when two segments are continuous in one image, we would prefer that their matched segments in the other image be continuous too. This is not always possible due to changes in visibility: some parts of the contour can be seen in only one image. The transition cost W is designed to favor smooth matching from one segment to the next, while taking into account discontinuities due to occlusions. The principle we use is again the disparity gradient limit. For two consecutive segments V_i and V_{i+1} in P, the transition cost function is the same as the one used in the matching cost, equation (8), except that the two pairs of matched points involved are now the endpoint of V_i and the starting point of V_{i+1}, together with their corresponding points in P′.


6 View Synthesis

From the previous tracking and matching stages, we have obtained a set of point matches and line matches that can be used to synthesize new views. Note that these matches cover not only the modeled face part, but also other foreground parts such as the hands and shoulders. This is yet another advantage of our stereovision-based approach: we obtain a more complete description of the scene geometry beyond the limits of the face model. Treating these matches as a whole, our view synthesis methods can create seamless virtual imagery.

We implemented and tested two methods for view synthesis. One is based on view morphing [21] and the other uses hardware-assisted multi-texture blending. The view morphing technique allows us to synthesize virtual views along the path connecting the optical centers of the two cameras. A view morphing factor c_m controls the exact view position. It usually lies between 0 and 1, where a value of 0 corresponds exactly to the first camera view and a value of 1 corresponds exactly to the second camera view. Any value in between represents a virtual viewpoint somewhere along the path from the first camera to the second.

In our hardware-assisted rendering method, we first create a 2D triangular mesh using Delaunay triangulation in the first camera's image space. We then offset each vertex's coordinates by its disparity modulated by the view morphing factor c_m, i.e., [u′_i, v′_i] = [u_i + c_m d_i, v_i]. The offset mesh is fed to the hardware renderer with two sets of texture coordinates, one for each camera image. Note that all the images and the mesh are in the rectified coordinate space. We need to set the viewing matrix to the inverse of the rectification matrix to "un-rectify" the resulting image to its normal view position; this is equivalent to the post-warp in view morphing. Thus the hardware can generate the final synthesized view in a single pass. We also use a more elaborate blending scheme, thanks to the powerful graphics hardware. The weight W_i for vertex V_i is based on the product of the total area of its adjacent triangles and the view morphing factor:

  W_i = Σ S1_i (1 − c_m) / ( Σ S1_i (1 − c_m) + Σ S2_i c_m )    (10)

where S1_i are the areas of the triangles of which V_i is a vertex, and S2_i are the areas of the corresponding triangles in the other image. By changing the view morphing factor c_m, we can use the graphics hardware to synthesize correct views with the desired eye gaze in real time. Because the view synthesis is carried out by the hardware, we can spare the CPU for the more challenging tracking and matching tasks.
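A sketch of the per-vertex computation behind this scheme is shown below: the pre-warped vertex position and the blending weight of Eq. (10). The mesh types are placeholders, and the way the "corresponding triangles" in the second image are formed here (shifting each vertex by its full disparity) is our assumption; the actual blending in the system is done by the graphics hardware with multi-texturing.

```cpp
#include <cmath>
#include <vector>

struct Vertex { double u, v, d; };     // rectified image position and disparity
struct Triangle { int i0, i1, i2; };   // indices into the vertex array

// Area of a triangle in (rectified) image space.
static double area(double u0, double v0, double u1, double v1, double u2, double v2) {
    return 0.5 * std::fabs((u1 - u0) * (v2 - v0) - (u2 - u0) * (v1 - v0));
}

// Pre-warp of a vertex into the virtual view: u' = u + cm * d, v' = v.
// The "un-rectification" post-warp is applied afterwards by the viewing matrix
// on the graphics hardware, as described above.
void warpVertex(const Vertex& vtx, double cm, double& u, double& v) {
    u = vtx.u + cm * vtx.d;
    v = vtx.v;
}

// Blending weight of Eq. (10) for vertex i (weight given to the first image).
// S1 sums the areas of the triangles adjacent to vertex i in the first image;
// S2 sums the areas of the corresponding triangles in the second image, formed
// here by shifting every vertex by its full disparity (an assumption of this sketch).
double blendWeight(int i, const std::vector<Vertex>& verts,
                   const std::vector<Triangle>& tris, double cm) {
    double S1 = 0.0, S2 = 0.0;
    for (const Triangle& t : tris) {
        if (t.i0 != i && t.i1 != i && t.i2 != i) continue;   // not adjacent to vertex i
        const Vertex &a = verts[t.i0], &b = verts[t.i1], &c = verts[t.i2];
        S1 += area(a.u, a.v, b.u, b.v, c.u, c.v);
        S2 += area(a.u + a.d, a.v, b.u + b.d, b.v, c.u + c.d, c.v);
    }
    double den = S1 * (1.0 - cm) + S2 * cm;
    return den > 0.0 ? S1 * (1.0 - cm) / den : 0.5;
}
```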

Comparing these two methods, the hardware-assisted method, aside from its blazing speed, generates crisper results if there is no false match in the mesh. On the other hand, the original view morphing method is less susceptible to bad matches, because it essentially uses every matched point or line segment to compute the final coloring of a single pixel, while in the hardware-based method only the three closest neighbors are used.

Regarding the background, it is very difficult to obtain a reliable set of matches since the baseline between the two views is very large, as can be observed in the two original images shown in Fig. 1. In this work, we do not attempt to model the background at all, but we offer two solutions. The first is to treat the background as unstructured and add the image boundary as matches. The result will be ideal if the background has a uniform color;


otherwise, it will be fuzzy, as in the synthesized view shown in Fig. 1. The second solution is to replace the background with anything appropriate. In that case, view synthesis is only performed for the foreground objects. In our implementation, we overlay the synthesized foreground objects on the image from the first camera; the results shown in the following section were produced in this way.

7 Experimental Results

We have implemented our proposed approach in C++ and tested it with several sets of real data; very promising results have been obtained. We will first present a set of sample images to further illustrate our algorithm, then show some more results from different test users. For each user, we built a personalized face model using a face modelling tool [16]. This process, which takes only a few minutes and requires no additional hardware, needs to be done only once per user. All the parameters in our algorithm are set to the same values for all the tests.

Figure 5 shows the intermediate results at various stages of our algorithm. It starts with a pair of stereo images in Fig. 5(a); Fig. 5(b) shows the matched feature points, with the epipolar lines of the feature points in the first image drawn in the second image. Fig. 5(c) shows the extracted foreground contours: the red one (typically a few pixels away from the "true" contour) is the initial contour after background subtraction, while the blue one is the refined contour obtained with the "snake" technique. In Fig. 5(d) we show the rectified images used for template matching. All the matched points form a mesh computed by Delaunay triangulation, shown in Fig. 5(e). The last image, Fig. 5(f), shows the synthesized virtual view. We can observe that the person appears to look down and up in the two original images but looks forward in the synthesized view.

During our experiments, we captured all test sequences at a resolution of 320x240 and 30 frames per second. Our current implementation can only run at 4 to 5 frames per second, so the results shown here were computed with our system in a "step-through" mode. Except for the manual initialization performed once at the beginning of each test sequence, the results are computed automatically without any human interaction.

The first sequence (Figure 6) shows the viability of our system. Note the large disparity changes between the upper and lower camera images, which make direct template-based stereo matching very difficult. Nevertheless, our model-based system is able to accurately track and synthesize photo-realistic images under this difficult configuration, even with partial occlusions or oblique viewing angles.

Our second sequence is even more challenging, containing not only large head motions, but also dramatic facial expression changes and even hand waving. Results from this sequence, shown in Figure 7, demonstrate that our system is both effective and robust under these difficult conditions. Non-rigid facial deformations, as well as the subject's torso and hands, are not in the face model, yet we are still able to generate seamless and convincing views, thanks to our view matching algorithm that includes a multitude of stereo matching primitives (features, templates, and curves). Template matching finds as many matching points as possible in regions that the face model does not cover, while contour matching preserves the important visual cue of the silhouettes.


Fig. 5. Intermediate results of our eye-gaze correction algorithm: (a) the input image pair; (b) tracked feature points with epipolar lines superimposed; (c) extracted foreground contours; (d) rectified images for stereo matching; (e) Delaunay triangulation over the matched points; (f) final synthesized view (uncropped).

8 Conclusions

In this paper, we have presented a software scheme for maintaining eye contact during video-teleconferencing. We use model-based stereo tracking and stereo analysis to compute a partial 3D description of the scene. Virtual views that preserve eye contact are then synthesized using graphics hardware. In our system, model-based head tracking and stereo analysis work hand in hand to provide a new level of accuracy, robustness, and versatility that neither of them alone could provide. Experimental results have demonstrated the viability and effectiveness of our proposed approach.

While we believe that our proposed eye-gaze correction scheme represents a large step towards a viable video-teleconferencing system for the mass market, there is still plenty of room for improvement, especially in the stereo view matching stage. We have used several matching techniques and prior domain knowledge to find as many good matches as possible, but we have not exhausted all the possibilities. We believe that the silhouettes in the virtual view could be clearer and more consistent across frames if we incorporated temporal information into the contour matching.


Fig. 6. Sample results from the first test sequence. The top and bottom rows show the images from the top and bottom cameras; the middle row displays the synthesized images from a virtual camera located in the middle of the real cameras. The frame numbers, from left to right, are 108, 144, 167, and 191.

Fig. 7. Sample results from the second test sequence. The upper and lower rows are the original stereo images, while the middle rows are the synthesized ones. The triangular face model is overlaid on the bottom images. From left to right and from top to bottom, the frame numbers are 159, 200, 400, 577, 617, 720, 743, and 830.

Furthermore, there are still salient curve features, such as hairlines and necklines, that sometimes go unmatched. They are very difficult to match using a correlation-based scheme because of highlights and visibility changes. We are investigating a more advanced curve matching technique.

References

1. A. Azarbayejani, B. Horowitz, and A. Pentland. Recursive Estimation of Structure and Motion Using the Relative Orientation Constraint. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 70–75, 1993.
2. S. Basu, I. Essa, and A. Pentland. Motion Regularization for Model-based Head Tracking. In Proceedings of the International Conference on Pattern Recognition, pages 611–616, Vienna, Austria, 1996.
3. R. Bellman. Dynamic Programming. Princeton University Press, Princeton, New Jersey, 1957.
4. M. J. Black and Y. Yacoob. Tracking and Recognizing Rigid and Non-Rigid Facial Motions Using Local Parametric Models of Image Motion. In Proceedings of the International Conference on Computer Vision, pages 374–381, Cambridge, MA, 1995.
5. P. Burt and B. Julesz. A gradient limit for binocular fusion. Science, 208:615–617, 1980.
6. T.J. Cham and M. Jones. Gaze Correction for Video Conferencing. Compaq Cambridge Research Laboratory, http://www.crl.research.digital.com/vision/interfaces/corga.
7. D. DeCarlo and D. Metaxas. Optical Flow Constraints on Deformable Models with Applications to Face Tracking. International Journal of Computer Vision, 38(2):99–127, July 2001.




8. D.H. Douglas and T.K. Peucker. Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or Its Caricature. Canadian Cartographer, 10(2):112–122, 1973.
9. O. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, 1993.
10. J. Gemmell, C.L. Zitnick, T. Kang, K. Toyama, and S. Seitz. Gaze-awareness for Videoconferencing: A Software Approach. IEEE Multimedia, 7(4):26–35, October 2000.
11. T. Horprasert. Computing 3-D Head Orientation from a Monocular Image. In International Conference on Automatic Face and Gesture Recognition, pages 242–247, 1996.
12. Michael Jones. Multidimensional Morphable Models: A Framework for Representing and Matching Object Classes. International Journal of Computer Vision, 29(2):107–131, August 1998.
13. M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active Contour Models. International Journal of Computer Vision, 1(4):321–331, 1987.
14. R. Kollarits, C. Woodworth, J. Ribera, and R. Gitlin. An Eye-Contact Camera/Display System for Videophone Applications Using a Conventional Direct-View LCD. SID Digest, 1995.
15. J. Liu, I. Beldie, and M. Wopking. A Computational Approach to Establish Eye-contact in Videocommunication. In the International Workshop on Stereoscopic and Three Dimensional Imaging (IWS3DI), pages 229–234, Santorini, Greece, 1995.
16. Z. Liu, Z. Zhang, C. Jacobs, and M. Cohen. Rapid Modeling of Animated Faces From Video. Journal of Visualization and Computer Animation, 12(4):227–240, 2001.
17. C. Loop and Z. Zhang. Computing Rectifying Homographies for Stereo Vision. In IEEE Conf. Computer Vision and Pattern Recognition, volume I, pages 125–131, June 1999.
18. L. Mühlbach, B. Kellner, A. Prussog, and G. Romahn. The Importance of Eye Contact in a Videotelephone Service. In 11th International Symposium on Human Factors in Telecommunications, Cesson Sevigne, France, 1985.
19. M. Ott, J. Lewis, and I. Cox. Teleconferencing Eye Contact Using a Virtual Camera. In INTERCHI '93, pages 109–110, 1993.
20. S. Pollard, J. Porrill, J. Mayhew, and J. Frisby. Disparity Gradient, Lipschitz Continuity, and Computing Binocular Correspondence. In O.D. Faugeras and G. Giralt, editors, Robotics Research: The Third International Symposium, volume 30, pages 19–26. MIT Press, 1986.
21. S.M. Seitz and C.R. Dyer. View Morphing. In SIGGRAPH 96 Conference Proceedings, volume 30 of Annual Conference Series, pages 21–30, New Orleans, Louisiana, 1996. ACM SIGGRAPH, Addison Wesley.
22. J. Shi and C. Tomasi. Good Features to Track. In the IEEE Conference on Computer Vision and Pattern Recognition, pages 593–600, Washington, June 1994.
23. R.R. Stokes. Human Factors and Appearance Design Considerations of the Mod II PICTUREPHONE Station Set. IEEE Trans. on Communication Technology, COM-17(2), April 1969.
24. Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.
25. Z. Zhang and Y. Shan. A Progressive Scheme for Stereo Matching. In M. Pollefeys et al., editors, Springer LNCS 2018: 3D Structure from Images - SMILE 2000, pages 68–85. Springer-Verlag, 2001.