
Design and Application of a Head Detection and

Tracking System

by

Jared Smith-Mickelson

Submitted to the

Department of Electrical Engineering and Computer Science

in Partial Fulfillment of the Requirements for the Degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

Massachusetts Institute of Technology

June 2000

Copyright 2000 Jared Smith-Mickelson. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, May 22, 2000

Certified by: Trevor J. Darrell, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses


Design and Application of a Head Detection and Tracking System

by

Jared Smith-Mickelson

Submitted to the Department of Electrical Engineering and Computer Science

May 22, 2000

In Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

This vision system is designed to detect and track the head of a subject moving about within HAL, an intelligent environment. It monitors activity in the room through a stereo camera pair and detects heads using shape, motion, and size cues. Once a head is found, the system tracks its three dimensional position in real time. To test and demonstrate the system, an automated teleconferencing application was developed. The head coordinates from the tracking system are transformed into pan and tilt directives to drive two steerable teleconferencing cameras. As the subject moves about the room, these cameras keep the head within their field of view.

Thesis Supervisor: Trevor J. Darrell
Title: Assistant Professor


Acknowledgments

Foremost, I would like to thank the three advisors I have had over the course of this thesis's development. Professor Charles Sodini and Associate Director Howard Shrobe were my instructors for a class entitled Technology Demonstration Systems. They gave me the opportunity to explore various areas of research, led me through the process of refining and preparing a proposal, and helped me begin work at an early stage. Assistant Professor Trevor Darrell was my research advisor during the later stages of my work and influential on a technical level. Having in-depth knowledge of the field, he was able to discuss with me the finer details of my work, offering relevant suggestions and pointing me to key papers.

I would also like to give thanks to Michael Coen. As founder and head of the HAL project, he played a critical role in helping integrate my work into the room. He was always available to answer questions, share new ideas, and give feedback.

Lastly, I wish to thank two other members of the HAL project, Krzysztof Gajos and Stephen Peters, both of whom willingly provided their support and were pleasurable coworkers.


Contents

1 Introduction

2 Related Work

3 System Implementation
 3.1 Overview
 3.2 Room Layout
 3.3 Head Detection
  3.3.1 Motion Detection
  3.3.2 Determining Depth
  3.3.3 Finding a Coherent Line of Moving Pixels
  3.3.4 Ellipse Fitting
 3.4 Head Tracking
  3.4.1 Determining Accurate Depth
 3.5 Transforming Coordinates
 3.6 Equipment

4 Results
 4.1 Head Detection and Tracking Results
  4.1.1 Continuous Detection
  4.1.2 Occlusion
  4.1.3 Out-of-plane Rotation
  4.1.4 Head Decoys
  4.1.5 Accuracy of Depth Calculation
 4.2 Teleconferencing Results

5 Conclusion
 5.1 Future Work

List of Figures

1-1 Stereo Camera Pair's View of the Intelligent Room
3-1 System Flow Chart
3-2 HAL Layout
3-3 Motion Detection Using Frame Differencing and Thresholding
3-4 Calculation of Depth to High Motion in Each Column
3-5 Highest Line of Coherent Motion Wide Enough to be a Head Where the Resulting Candidate is a Correct Detection
3-6 Highest Line of Coherent Motion Wide Enough to be a Head Where the Resulting Candidate is a False Positive
3-7 Template of Head From Left and Right Image
3-8 Stereo Camera Pair
3-9 Steerable Teleconferencing Camera
4-1 Consecutive Frames of a Sequence Illustrating the Benefits of Continuous Detection
4-2 Consecutive Frames of an Initialization Sequence
4-3 Every Fourth Frame of an Occlusion Sequence
4-4 Consecutive Frames of an Occlusion Sequence Using Ellipse Tracking
4-5 Consecutive Frames of an Occlusion Sequence Using Template Tracking
4-6 Every Tenth Frame of a Rotation Sequence
4-7 Every Tenth Frame of a Decoy Sequence
4-8 Every Tenth Frame of a Depth Test Sequence
4-9 Plot of Depth to Head Over Time

Chapter 1

Introduction

The continued rise in processor power has recently dropped the computation time of many computer vision tasks to a point where they can be used in real time systems. One area of computer vision research that has benefited by this advance and received considerable attention in the last few years is person tracking. Person tracking is a broad field encompassing the detection, tracking and recognition of bodies, heads, faces, expressions, gestures, actions, and gaze directions. Applications include surveillance, human computer interaction, teleconferencing, computer animation, virtual holography, and intelligent environments. This thesis describes the implementation of a head detection and tracking system. To test and demonstrate the system, the output of the tracker is used to guide the movement of steerable cameras in an automated teleconferencing system.

The head detection and tracking system was built as an extension to the AI Lab's intelligent environment known as HAL. The development of HAL is a continuing effort to create a space in which humans can interact naturally, through voice and gesture, with computer agents that control the room's appliances and software applications. The tracking system monitors activity in the room through a monochrome stereo camera pair mounted high up on a wall opposite a couch. The stereo pair's view of HAL is shown in Figure 1-1.

Figure 1-1: Stereo Camera Pair's View of the Intelligent Room. The top image is from the left camera. The bottom is from the right.

The system takes a multi-modal approach to detection, using shape, motion, and size cues. At the core of the detector is an elliptical shape filter. Applying the filter at multiple scales to the entire image is costly. To achieve real time detection, the system uses a novel combination of depth and motion information to narrow the search. The use of these cues also helps reduce the detector's rate of false positives.

Once a head is detected, the system tracks it using the elliptical shape filter, constrained by depth and a simple velocity model. The same head is tracked until the continuously running detector presents a better candidate.

As a head is tracked, its depth is calculated using normalized correlation and refined by computing the parametric optical flow across the matching templates from the left and right images. This refinement is essential for accurate depth.

The three-dimensional coordinates of the head relative to the stereo pair are transformed into pan and tilt directives for HAL's two teleconferencing cameras. The result is an automated teleconferencing system whereby HAL's two steerable cameras keep the subject's head in their field of view.

Chapter 2

Related Work

The survey of existing systems presented in this section is by no means exhaustive. A comprehensive overview of the vast number of systems described in the literature is beyond the scope of this document. The systems mentioned here were chosen to illustrate the broad variety of approaches taken to the problem of head detection and tracking.

An earlier vision system for HAL is described by Coen and Wilson in [3]. By applying background differencing, skin color detection, and face detection algorithms on images from every camera in the room, the system builds hyper-dimensional feature vectors. These vectors are classified using previously trained statistical models of events. Events are qualitative in nature: someone is sitting on the couch, someone is standing in front of the couch, someone is lying on the couch.

Background differencing is carried out on images from HAL's steerable teleconferencing cameras by synthesizing a background from a panoramic image constructed offline. To detect skin color the algorithm described in [5] is used. This technique involves classifying each pixel as "skin" or "not skin" using empirically estimated Gaussian models in the log color-opponent space. The face detection algorithm used is the same as that used by Cog, the humanoid robot [12]. A face is detected when ratios between the average intensities of various spatial regions are satisfied.

All three of these modules are prone to error. Because the teleconferencing cameras do not rotate about their centers of projection, background differencing is unreliable, especially for regions physically close to the cameras. Backgrounds also tend to change with lighting and human use. The face detector works only for frontal views. And the skin color detector will find all skin colored objects in the room, regardless of whether they belong to a person. Fortunately, these nonidealities are permissible as the system has only to decide between a limited number of qualitative events. The system cannot, however, produce accurate coordinates of a subject's head.

Rowley et al., at CMU, have a neural network that detects frontal upright views of faces [10]. The network was trained using hand labeled images of over one thousand faces taken from the Web. To refine the network, negative examples were also presented during training. The negative examples were chosen in a bootstrapping fashion by applying early versions of the network to natural images containing no faces.

In a later paper [11], an addition to the system is described which allows it to detect faces with arbitrary in-plane rotation. The addition is a neural net router which determines and corrects the orientation of each candidate location before it is passed to the original system. This method, however, cannot be applied to out-of-plane rotation. CMU's system is limited to frontal views of faces. It is also computationally expensive and is generally applied only to static scenes. However, ten years from now, it may be feasible to run numerous neural nets in parallel, each trained to recognize a different orientation of the head, achieving a robust real-time continuous detection system for human heads.

McKenna et al. developed a neural network based on CMU's and applied it to both the detection and tracking of heads in dynamic scenes [8]. For detection, they use motion to constrain the search space of the expensive neural net computation. After grouping moving pixels into blobs, they estimate which part of each blob corresponds to a head and limit the search to these regions. The details of how they estimate the position of the head are left out, so the similarity of their approach to that described in section 3.3 is undetermined. For tracking, a Kalman filter is used to predict the position of the head. The neural network is applied to a small window around this predicted position. Its output is fed back into the Kalman filter as a measurement.

Another system which addresses both detection and tracking is described by Beymer et al. at SRI International [1]. Its continuous detector segments foreground objects by applying background differencing to depth map sequences. It then correlates person-shaped templates with detected foreground objects to present candidates to the tracker. The tracker uses a Kalman filter to predict position. Measurements are made by performing correlations with an intensity template and are fed back into the Kalman filter. To avoid drift, the position of the intensity template is re-centered using the person-shaped template. The shape of the person templates reflects the assumption, implied by Beymer et al., that people hold their arms at their sides and remain upright. The system is not designed to handle exceptions to this assumption.

Pfinder, developed at the Media Lab by Wren et al., tracks people by following the color and shape of blobs [14]. After extracting a user silhouette through background subtraction, the system segments it into blobs by clustering feature vectors made up of color and spatial information. The position of each blob is individually tracked and predicted using a Kalman filter. Pixels in a new frame of video are classified as background or as belonging to a particular blob using a maximum a posteriori probability approach. Connectivity constraints are added and smoothing operators are applied to improve the clarity of the blob segmentation. The system is designed to handle shadows and occlusion. Skin color is used to help label the blobs corresponding to the head and the hands.

Darrell et al. describe a system developed at Interval Research that integrates depth, color, and face detection modules to detect and track heads [4]. Using real-time depth sequences from a stereo system implemented on an FPGA, images are segmented into user silhouettes. For each silhouette, head candidates are collected from the three modules and integrated to produce a final head position. The depth module places candidate heads under the maxima of each silhouette. The color module searches for skin colored regions within each silhouette in a manner derived from [5]. Candidates are placed in skin colored regions that are high enough and of the right size to be heads. The face detection module uses CMU's neural network [10] and is initially run over the entire image. All resulting detections are presented as candidates. This process is slow. In order to present candidates successively, the positions of skin color and depth candidates are locally tracked and presented by the face detector module if they have overlapped with a face detection hit in the recent past. As in [2], the failure modes of the modules used in this system are claimed to be nearly independent. When this is the case, increasing the number of modules used greatly increases the reliability of the system.

A wholly different approach was taken by Morimoto et al. at IBM Almaden Research Center [9]. Noting that human eyes reflect near infrared light directly back towards the source, they built a camera which captures two images: one illuminated with on-axis infrared light and another illuminated with off-axis infrared light. Regions that are bright in the on-axis image and dark in the off-axis image generally correspond to eyes. The positions of heads are extrapolated from the detected positions of the eyes. One drawback to the system is that since the two images are taken asynchronously, only static eyes can be detected.

Stan Birchfield, at Stanford, has a head tracker that combines outputs from an elliptical shape module and a color histogram module to correct the predicted position of the subject's head [2]. The elliptical shape module computes the average of dot product magnitude between an ellipse normal and the image gradient at candidate positions in a window around the predicted location. The second module computes the histogram intersection between the newly calculated histogram at each candidate position and a previously learned static model of the subject's hair and skin color. The claim is that the two modules have roughly orthogonal failure modes and therefore complement each other to produce a robust tracking system.

Birchfield's system does not address the issue of initialization except to say that the image region used to define the subject's model histogram is set either manually or by the output of the elliptical shape module alone. In a cluttered environment, however, the ellipse module cannot be used to reliably find the head. In addition, the computation time needed to run the ellipse fitting across the entire image and at multiple scales is on the order of seconds. It is therefore unlikely that Birchfield's system can be used to reliably detect the heads of new subjects.

The systems summarized in this section were chosen to illustrate the wide variety of approaches taken to the problem of head detection and tracking. The use of color, shape, motion, depth, size, background differencing, depth background differencing, neural networks, ratio templates, and infrared reflectance have all been demonstrated. Yet more systems have been built which use invasive techniques such as fiducials and headgear. These have been intentionally left out. The focus of HAL is to enable humans to interact with computer systems as they do with each other. Systems which force the user to conform to unnatural, unfamiliar, or obtrusive methods of interaction are not condoned. In the development of HAL, the goal is instead to design computer systems capable of interacting on a human level.

This thesis documents an extension to Birchfield's elliptical tracker [2]. The new system is designed to handle initialization through continuous real-time detection in the spirit of [1]. Motion and size cues have been incorporated to compensate for the unavailability of color. And stereo is used to compute accurate depth to the subject's head.

Chapter 3

System Implementation

3.1 Overview

Tracking systems have been designed to handle occlusion, out-of-plane rotation, changes in lighting, and deformation. But they are never foolproof. A usable system must expect and be capable of recovering from tracking failure. One way to meet this requirement is through the use of continuous real-time detection as described in [1]. If detection occurs in every frame, tracking failure, if recognized, can be immediately corrected. The head detector described in this thesis has been designed for real-time use and is run in parallel with the tracker. When the detector finds a better-fit candidate than is currently being tracked, it triggers the tracker to switch attention to the new target. If the tracker is without a target, it will lock on to the first candidate found by the detector.

The head tracker is quite robust. Its low failure rate allows a high frequency of misses on the part of the detector. Misses cause no detrimental effects while the tracker is correctly following the head. When the tracker fails, it usually does so in conjunction with head motion. The detector, however, is most reliable under such circumstances and thereby works to complement the tracker's inefficiencies. The failure modes of the detector and tracker are nearly mutually exclusive. This claim is similar to that of Birchfield [2] mentioned in section 2.

To follow the head, the system employs an elliptical contour tracker. Head position is predicted using a simple velocity model. An ellipse is then best fit to the image gradient in a window around the predicted position. The fit is measured by averaging the magnitude of the dot product between the ellipse normal and the image gradient. This tracking technique is resistant to out-of-plane rotation, partial occlusion, and variation in hair and skin colors.

Depth to the head is calculated from disparity in position of projections onto the image planes of the stereo pair. To find this disparity, the image of the head as captured by the left camera is used as a template and matched to a position in the right image. The correspondence is found through normalized correlation. This process yields a discrete result. To achieve a sub-pixel measure of disparity, parametric optical flow is computed across the match.

The head detector uses the same elliptical shape filter as the tracker. When applied at multiple scales to the entire image, this shape fitting is a computationally expensive operation. To achieve real time detection rates, its search space must be drastically reduced. This reduction is accomplished through a novel use of motion and size cues. Motion is found by frame differencing. If a pixel's intensity changes significantly from one frame to the next, it is considered moving. Size is found from disparity in correspondence matches as described above.

Reducing the search space of the elliptical shape filter requires an assumption about where the head is most likely to be. The assumption made in this system is that the top of a head in motion produces a coherent line of pixels as wide as the head itself, and that no other such line appears above the head. The detector needs then to find the highest coherent line of moving pixels wide enough to be a human head and apply an appropriately sized elliptical shape filter to the region under this line. This assumption also helps reduce the detector's rate of false positives. An unconstrained elliptical shape filter may detect an object in the background or one whose size cannot possibly be that of a human head. Situations where this assumption fails include when a large object is moving above the head and when the head is not moving fast enough to register motion.

To find a coherent line of moving pixels wide enough to represent the top of a head and to choose an appropriately sized ellipse for which to search, the detector must first have knowledge of the depth to the motion in the scene. Ideally, it needs to know the distance to the highest moving pixel in each column of the image. Unfortunately, the depth to an arbitrary pixel cannot always be found with confidence. Ambiguity arises when attempts are made to find correspondences for image patches lacking strong features. To address this, the detector favors image patches around lower moving pixels if they offer more promising correspondence properties, namely strong edges in multiple directions.

Once the highest coherent line of moving pixels is found, the detector returns a new head candidate, one which best fits an appropriately sized ellipse in the region directly under the line. This candidate becomes the new tracking target if the tracker is without one or if the tracker's current target is less fit with regard to elliptical shape.

Finally, the three-dimensional coordinates of the head relative to the stereo pair are transformed into pan and tilt directives for the two teleconferencing cameras in HAL. The cameras steer to keep the subject's head within their field of view.

A flow chart of the system is shown in figure 3-1.

Figure 3-1: System Flow Chart

3.2 Room Layout

HAL is an intelligent office space. It monitors user activity via lapel microphones, eight video cameras, and the states of equipment in the room. It can present material to users through a sound system, two projector displays, and a television. A picture of HAL is shown in figure 3-2. There is a couch against the wall. In front of this couch is a coffee table upon which sits, at a height of 50cm, one of the two steerable teleconferencing cameras. This camera can capture close-up frontal shots of people sitting on the couch, but cannot tilt high enough to see the face of someone standing. A second teleconferencing camera is located on top of a television beyond the end of the couch at a height of 120cm. This camera obtains three-quarters views of people sitting on the couch. The stereo camera pair is mounted high up on the wall opposite the couch. It looks down towards the couch at an angle of 35° from horizontal.

Figure 3-2: HAL Layout

3.3 Head Detection

The head detector looks for an elliptical shape in a search space constrained by motion and depth cues. One drawback to this approach is that stationary heads are invisible to the detector. In a teleconferencing scenario, however, it is reasonable to assume that the head will move regularly. And if it does not, there is no need for the cameras to steer to a new position. It should also be noted that head motion generally accompanies tracking failure. If there is no motion, a head cannot be detected. But, without motion, it is unlikely the tracker will fail. Except for situations involving severe occlusion, the tracker robustly tracks stationary objects.

3.3.1 Motion Detection

Occasionally the most naive approach is found to yield adequate results. This was the case with motion detection. Simple frame differencing is used to find pixels corresponding to moving objects. If a pixel's intensity changes significantly from one frame to the next, it is considered moving.

$$E_t(x, y, t) = \frac{\delta}{\delta t} E(x, y, t) = E(x, y, t) - E(x, y, t - 1) \qquad (3.1)$$

$$\left( |E_t(x, y, t)| > th_m \right) \leftrightarrow \mathrm{MOTION}(x, y, t) \qquad (3.2)$$

The threshold value, $th_m$, is set well above the magnitude of noise in the image to minimize false positives. This approach requires little computation and has minimal latency. Other techniques, such as finding temporal zero crossings as in [8], require smoothing in time which introduces significant latencies. The results of (3.2) can be seen in figure 3-3.

seen in figure 3-3.

Figure 3-3: Motion Detection Using Frame Differencing and Thresholding. The top image is a raw frame of video. The bottom image shows the result of the motion detection algorithm (3.2).
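As a concrete illustration, the frame differencing of (3.1) and thresholding of (3.2) amount to only a few array operations. The following NumPy sketch assumes 8-bit grayscale frames; the default threshold of 30 is an illustrative stand-in, since the text specifies only that $th_m$ sit well above the image noise.

```python
import numpy as np

def motion_mask(frame, prev_frame, th_m=30):
    """Frame differencing (3.1) and thresholding (3.2).

    frame, prev_frame: 2-D uint8 grayscale images of equal shape.
    th_m: motion threshold, set well above the image noise magnitude
          (the value 30 here is an assumed placeholder).
    Returns a boolean mask that is True at pixels considered moving.
    """
    # Signed temporal derivative E_t = E(t) - E(t-1); int16 avoids
    # uint8 wraparound on negative differences.
    e_t = frame.astype(np.int16) - prev_frame.astype(np.int16)
    return np.abs(e_t) > th_m
```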

To constrain the search space for the shape filter, an assumption is made: The top of a head in motion will create a coherent line of moving pixels as wide as the head itself, and no other such line will appear above the head. There are, of course, situations where this is not the case. For example, someone may wave their arms above their head as in figure 3-6. Or, the head's velocity may not be great enough to create a coherent line of motion. But, as mentioned above, the complementary nature of tracking and continuous detection allows for a substantial degree of detection misses.

Given this assumption, the shape filter need only search the region under the highest coherent line of moving pixels wide enough to represent the top of a head. To further constrain the search space, the size of the ellipse can be set according to the depth to the line. To find such a line, the depth to the highest moving pixel in each column of the image must be calculated. This depth is used to adjust the width of a search window through which to look for the highest coherent line of moving pixels. Thus, moving objects too small to be heads are ignored. The line of motion they create is too narrow. The white line in figure 3-4 shows the highest moving pixel in each column of the image.

3.3.2 Determining Depth

Depth can be extracted from correspondences across the images of a stereo camera pair. An object's depth, z, relates inversely to the disparity, d, in the position of its projections onto the two image planes.

$$z = \frac{fb}{d} \qquad (3.3)$$

Here f is the focal length, and b is the baseline, the distance between the two cameras. The stereo pair used for this system has a horizontal baseline of 7cm. Disparity is found by normalized correlation as in [7]. Given an image patch from the left camera, this technique involves finding a corresponding image patch from the right camera which maximizes the normalized correlation of the two.

$$\underset{(\xi, \eta)}{\arg\max} \; \frac{\sum_{y=j}^{j+h} \sum_{x=i}^{i+w} I_l(x, y)\, I_r(x + \xi, y + \eta)}{\sqrt{\sum_{y=j}^{j+h} \sum_{x=i}^{i+w} I_l(x, y)^2 \; \sum_{y=j}^{j+h} \sum_{x=i}^{i+w} I_r(x + \xi, y + \eta)^2}} \qquad (3.4)$$

Here, (i, j) is the position of the bottom-left corner of the patch taken from the left image, w and h are the width and height of the patch, and (ξ, η) is the disparity. For a calibrated stereo pair with horizontal baseline, the vertical disparity, η, is zero. For a well aligned, but uncalibrated stereo pair, a two dimensional search for correspondence is workable. The match will appear somewhere along the nearly horizontal epipolar line, and the vertical component of disparity can simply be ignored. The only calibration absolutely necessary is correcting for any horizontal offset due to convergence.

When an image is thought of as a vector of intensity values, (3.4) is simply an inner product, the cosine of the angle between two vectors. The dimensionality of the space is equal to the number of pixels in the patch. In terms of stochastic detection theory, this technique is similar to that of using a matched filter to recognize a known signal in a noisy channel.
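A minimal sketch of the correspondence search of (3.4) follows, assuming rectified images with a horizontal baseline so the vertical disparity η can be taken as zero. The patch indexing convention and the search range are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def best_disparity(left, right, i, j, w=9, h=9, search=range(-40, 1)):
    """Horizontal disparity maximizing the normalized correlation (3.4).

    left, right: 2-D grayscale arrays from the stereo pair.
    (i, j): corner of the patch in the left image (array convention);
    w, h: patch size (nine by nine pixels, per section 3.3.2).
    search: assumed range of candidate disparities xi.
    """
    patch = left[j:j + h, i:i + w].astype(np.float64)
    p_energy = np.sqrt((patch ** 2).sum())
    best_xi, best_score = None, -np.inf
    for xi in search:
        if i + xi < 0 or i + xi + w > right.shape[1]:
            continue  # candidate patch would fall outside the image
        cand = right[j:j + h, i + xi:i + xi + w].astype(np.float64)
        c_energy = np.sqrt((cand ** 2).sum())
        if p_energy == 0 or c_energy == 0:
            continue
        # Normalized correlation: the cosine of the angle between the
        # two patches viewed as intensity vectors.
        score = (patch * cand).sum() / (p_energy * c_energy)
        if score > best_score:
            best_xi, best_score = xi, score
    return best_xi, best_score
```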

Of interest to the detector, as mentioned above, is the depth to the highest moving pixel in each column of the image. The detector uses this depth information to select a coherent line of moving pixels wide enough to be the top of a head. The elliptical shape filter is then applied to the region of the image under this line.

Unfortunately, depth cannot be accurately calculated at every point desired. It is difficult to find, with confidence, the correspondence of an image patch lacking strong features. Correspondences found for patches containing high contrast edges in multiple directions are generally more accurate. The detection system, when calculating the depth to the highest moving pixel in each column, will choose a patch centered about a lower pixel if that patch contains stronger features. The black squares in the top image of figure 3-4 represent the image patches chosen by the detector. The black squares in the bottom image show the corresponding patches found by normalized correlation. The black line in the top image represents the depth to each patch. The higher the line, the closer the patch is to the stereo pair.

To measure the strength of the features within an image patch, a principal component analysis is applied to the set of image gradients of the patch. In the case of an image patch containing strong edges in multiple directions, both components will have high energy. The energies of the principal components are the eigenvalues of the image gradient's covariance matrix

$$\frac{1}{(w + 1)(h + 1)} \sum_{y=j}^{j+h} \sum_{x=i}^{i+w} \begin{bmatrix} E_x(x, y)^2 & E_x(x, y)\, E_y(x, y) \\ E_x(x, y)\, E_y(x, y) & E_y(x, y)^2 \end{bmatrix} \qquad (3.5)$$

Figure 3-4: Calculation of Depth to High Motion in Each Column. The aspect ratios of the images as they are produced by the stereo pair have been preserved in this figure to illustrate the true size of the correspondence templates. The stereo camera produces two line interlaced frames, each with a resolution of 320 x 120. All other images in this document have been resized for clarity.

where

$$E_x(x, y) = \frac{\delta}{\delta x} E(x, y) = \frac{1}{2}\left[ E(x+1, y) - E(x, y) + E(x+1, y+1) - E(x, y+1) \right] \qquad (3.6)$$

and

$$E_y(x, y) = \frac{\delta}{\delta y} E(x, y) = \frac{1}{2}\left[ E(x, y+1) - E(x, y) + E(x+1, y+1) - E(x+1, y) \right] \qquad (3.7)$$

The measure of feature strength used by the detector is the energy of the smaller component. This criterion was independently derived by Shi and Tomasi [13].
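A sketch of this feature-strength measure, under the gradient definitions (3.6) and (3.7): build the covariance matrix of (3.5) over the patch and return its smaller eigenvalue.

```python
import numpy as np

def feature_strength(patch):
    """Smaller eigenvalue of the gradient covariance matrix (3.5),
    the Shi-Tomasi criterion [13] used to rate correspondence patches.

    patch: 2-D float array, e.g. the nine-by-nine candidate patch.
    """
    p = patch.astype(np.float64)
    # 2x2 symmetric differences, matching (3.6) and (3.7).
    ex = 0.5 * (p[:-1, 1:] - p[:-1, :-1] + p[1:, 1:] - p[1:, :-1])
    ey = 0.5 * (p[1:, :-1] - p[:-1, :-1] + p[1:, 1:] - p[:-1, 1:])
    cov = np.array([[(ex * ex).sum(), (ex * ey).sum()],
                    [(ex * ey).sum(), (ey * ey).sum()]]) / ex.size
    # eigvalsh returns eigenvalues in ascending order; the energy of
    # the weaker principal component is the first.
    return np.linalg.eigvalsh(cov)[0]
```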

There are tradeoffs involved in choosing the size of the image patch for which to find a correspondence. Small patches are computationally efficient and can be more accurate when there is considerable variation in depth. However, if the patches are made too small, ambiguity arises in the match. A patch size of nine by nine pixels was found to offer a good balance.

3.3.3 Finding a Coherent Line of Moving Pixels

Once depth has been determined to the motion in each column of the image, the detector must find the highest coherent line of moving pixels wide enough to be a human head. To do this, the detector uses the depth information to pick a 14cm wide window through which to look for a coherent line of moving pixels. The baseline of the stereo pair used in this system is 7cm, and since scale is proportional to disparity, the window need simply be twice as wide as the disparity. This proportionality can be derived from the projection equation which relates the width of an object, w, to the width of its projection, w'.

$$w' = \frac{f}{z} w \qquad (3.8)$$

Applying equation (3.3) yields

$$w' = \frac{w}{b} d \qquad (3.9)$$

A line of moving pixels is considered coherent if its vertical variance is below a threshold. Let h(x) be the highest moving pixel in each column, d(x) be the disparity of that high motion, and C be the set of columns whose windows satisfy the coherency constraint.

$$\mu_i = \frac{1}{2d(i) + 1} \sum_{x=i-d(i)}^{i+d(i)} h(x) \qquad (3.10)$$

$$\sigma_i^2 = \frac{1}{2d(i) + 1} \sum_{x=i-d(i)}^{i+d(i)} \left( h(x) - \mu_i \right)^2 \qquad (3.11)$$

$$\left( \sigma_i^2 < th_v\, d(i)^2 \right) \leftrightarrow (i \in C) \qquad (3.12)$$

The detector finds the column in C whose window has the largest average $\mu_i$.

$$\underset{i \in C}{\arg\max} \; \mu_i \qquad (3.13)$$

The elliptical shape filter is applied under this highest coherent line of moving pixels. The thick white line in figure 3-5 shows the highest 14cm wide coherent line of moving pixels. In thin white, the best fit ellipse under this line is shown.

Figure 3-5: Highest Line of Coherent Motion Wide Enough to be a Head Where the Resulting Candidate is a Correct Detection
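The window test of (3.10)-(3.13) reduces to a per-column mean and variance. The sketch below assumes h and d are already computed per column (with NaN marking columns without motion); the coherency threshold $th_v$ is an illustrative value, as the thesis does not quote one.

```python
import numpy as np

def highest_coherent_line(h, d, th_v=0.05):
    """Column whose 2*d(i)+1 wide window of top-of-motion heights is
    coherent per (3.12) and highest on average per (3.13).

    h: per-column height h(x) of the highest moving pixel, following
       the thesis's convention that a larger mean is a higher line
       (flip the comparison if rows grow downward).
    d: per-column disparity d(x) of that motion; the half-width, since
       a 14cm window is twice the 7cm-baseline disparity (3.9).
    """
    best_i, best_mu = None, -np.inf
    for i in range(len(h)):
        r = int(round(d[i]))
        lo, hi = i - r, i + r + 1
        if lo < 0 or hi > len(h):
            continue
        win = h[lo:hi]
        if np.any(np.isnan(win)):
            continue  # every column in the window must contain motion
        mu, var = win.mean(), win.var()              # (3.10), (3.11)
        if var < th_v * d[i] ** 2 and mu > best_mu:  # (3.12), (3.13)
            best_i, best_mu = i, mu
    return best_i
```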

3.3.4 Ellipse Fitting

The detector locates heads by searching for an elliptical shape 21cm wide and 25cm tall. The measure of elliptical fit used, $\phi$, is the average of dot product magnitude between the ellipse normal and the image gradient. To reduce the attraction towards isolated strong edges, the dot product magnitude is clipped above a threshold, $th_c$.

$$\phi(x, y, \sigma) = \frac{1}{N_\sigma} \sum_{i=1}^{N_\sigma} \min\!\left( th_c,\; \left| n_\sigma(i) \cdot \begin{bmatrix} E_x(x + s_{x_\sigma}(i),\, y + s_{y_\sigma}(i)) \\ E_y(x + s_{x_\sigma}(i),\, y + s_{y_\sigma}(i)) \end{bmatrix} \right| \right) \qquad (3.14)$$

For an ellipse of width $\sigma$, $N_\sigma$ is the number of pixels along the perimeter, $n_\sigma(i)$ is the normal at the ith perimeter pixel, and $(s_{x_\sigma}(i), s_{y_\sigma}(i))$ is the position of the ith perimeter pixel, relative to the center of the ellipse. The height of the ellipse is set at $1.2\sigma$. Except for the clipping, this is the same measure used by Birchfield in [2].
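A sketch of the clipped fit measure (3.14). For simplicity it samples the perimeter at a fixed number of angles rather than enumerating exact perimeter pixels, and the clipping threshold is an assumed placeholder; both are illustrative departures from the thesis's implementation.

```python
import numpy as np

def ellipse_fit(grad_x, grad_y, cx, cy, sigma, th_c=100.0, n_points=64):
    """Average clipped |normal . gradient| around an ellipse, per (3.14).

    grad_x, grad_y: image gradient arrays, as in (3.6) and (3.7).
    (cx, cy): candidate center; sigma: ellipse width, height 1.2*sigma.
    th_c, n_points: assumed values for the clip level and the perimeter
    sampling density.
    """
    a, b = sigma / 2.0, 1.2 * sigma / 2.0          # semi-axes
    t = np.linspace(0, 2 * np.pi, n_points, endpoint=False)
    sx, sy = a * np.cos(t), b * np.sin(t)          # perimeter offsets
    # Outward unit normal of an axis-aligned ellipse at parameter t.
    nx, ny = np.cos(t) / a, np.sin(t) / b
    norm = np.hypot(nx, ny)
    nx, ny = nx / norm, ny / norm
    xs = np.clip(np.round(cx + sx).astype(int), 0, grad_x.shape[1] - 1)
    ys = np.clip(np.round(cy + sy).astype(int), 0, grad_x.shape[0] - 1)
    dots = np.abs(nx * grad_x[ys, xs] + ny * grad_y[ys, xs])
    return np.minimum(th_c, dots).mean()           # clipped average
```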

The detector computes $\phi$ once for every column making up the highest coherent line of moving pixels. The size of the ellipse is set using the disparity measurements. The width of a head is assumed to be 21cm, or 3d(i), and the height, 25cm, or 3.6d(i). Since the line of motion is taken to be the top of a head, the ellipse is placed below and tangent to the line at a vertical position of h(i) - 1.8d(i). The final output candidate of the detector is the ellipse which had the best fit.

If the tracker is currently without a target, the candidate found by the detector is tracked. Otherwise the elliptical fit of the candidate is compared to the elliptical fit of the target being tracked. If it is greater, the tracker switches attention to the new candidate. Figure 3-6 shows, in white, the candidate found by the detector. However, the elliptical fit of this false positive is less than that of the actual head being tracked, the black ellipse. The false positive is ignored.

Figure 3-6: Highest Line of Coherent Motion Wide Enough to be a Head Where the Resulting Candidate is a False Positive. The tracking target is shown in black.

3.4 Head Tracking

The tracker uses a simple constant velocity model to predict the new position of the head. It then searches a region around the predicted position for a best fit ellipse. The size of the region is constrained linearly with disparity under the assumption that human head acceleration is limited. A range of ellipse sizes, 14-21cm, is used to allow for changes in depth from one frame to the next and slight variation in curvature due to head rotation. Size is determined from the depth calculation made in the previous frame.

The fact that the detector and tracker use the same measure of elliptical fit provides for an alternate and perhaps simpler model of the detector-tracker synergy. For every new frame, a space is constructed in which an elliptical shape is searched for. The space is the union of a window around the predicted location of the head and a region under the highest coherent line of moving pixels. The best fit ellipse in this space is taken to be the new position of the head.
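The prediction step is a one-liner. A minimal sketch follows, with the search-radius factor k an assumed constant, since the text says only that the region scales linearly with disparity.

```python
import numpy as np

def predict_position(p_prev, p_curr):
    """Constant velocity prediction: next = current + (current - previous)."""
    p_prev, p_curr = np.asarray(p_prev), np.asarray(p_curr)
    return p_curr + (p_curr - p_prev)

def search_window(p_pred, d, k=2.0):
    """Search region around the prediction, scaled linearly with the
    disparity d under the limited-acceleration assumption. The factor
    k is an illustrative choice, not the thesis's value.
    Returns (x0, y0, x1, y1)."""
    x, y = p_pred
    r = k * d
    return (x - r, y - r, x + r, y + r)
```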

3.4.1 Determining Accurate Depth

It is important that the depth of the tracker's final coordinate output be accurate. Small errors in depth may translate to significant pan and tilt errors in HAL's two teleconferencing cameras. As presented in section 3.3.2, normalized correlation alone is insufficient, for it gives discrete results. At a range of 300cm, a one pixel error in disparity translates to a depth error of 30cm. After a normalized correlation match is found, a sub-pixel disparity measurement is achieved by calculating a parametric optical flow across the template from the left image and its matching template from the right image. The parametric optical flow is purely translational. Figure 3-7 shows the template of the head as taken from the left image and its corresponding template from the right image. Pure translational flow constrains all flow vectors to be equal and provides an elegant least squares solution to the brightness constraint equation

$$E_x u + E_y v + E_{lr} = 0 \qquad (3.15)$$


Figure 3-7: Template of Head From Left and Right Image

Here, $E_{lr}$ is the change in intensity across the templates from the left and right images. u and v represent the sub-pixel flow from the left to the right templates. Given a disparity (ξ, η), $E_x$, $E_y$, and $E_{lr}$ are calculated as follows.

$$E_x(x, y) = \frac{1}{4}\big[ I_l(x+1, y) - I_l(x, y) + I_l(x+1, y+1) - I_l(x, y+1) + I_r(x+\xi+1, y+\eta) - I_r(x+\xi, y+\eta) + I_r(x+\xi+1, y+\eta+1) - I_r(x+\xi, y+\eta+1) \big]$$

$$E_y(x, y) = \frac{1}{4}\big[ I_l(x, y+1) - I_l(x, y) + I_l(x+1, y+1) - I_l(x+1, y) + I_r(x+\xi, y+\eta+1) - I_r(x+\xi, y+\eta) + I_r(x+\xi+1, y+\eta+1) - I_r(x+\xi+1, y+\eta) \big]$$

$$E_{lr}(x, y) = \frac{1}{4}\big[ I_r(x+\xi, y+\eta) - I_l(x, y) + I_r(x+\xi, y+\eta+1) - I_l(x, y+1) + I_r(x+\xi+1, y+\eta) - I_l(x+1, y) + I_r(x+\xi+1, y+\eta+1) - I_l(x+1, y+1) \big] \qquad (3.16)$$


Each 2 x 2 pixel group in the template image gives a brightness constraint. The result is the over-determined system

$$\begin{bmatrix} \mathbf{E}_x & \mathbf{E}_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\mathbf{E}_{lr} \qquad (3.17)$$

$\mathbf{E}_x$, $\mathbf{E}_y$, and $\mathbf{E}_{lr}$ are column vectors containing gradient calculations from each 2 x 2 pixel group used by (3.16). For example,

$$\mathbf{E}_x = \begin{bmatrix} E_x(x, y) \\ E_x(x, y+1) \\ \vdots \\ E_x(x, y+h-1) \\ E_x(x+1, y) \\ E_x(x+1, y+1) \\ \vdots \\ E_x(x+w-1, y+h-2) \\ E_x(x+w-1, y+h-1) \end{bmatrix} \qquad (3.18)$$

To solve for u and v, a pseudo inverse is used.

$$\begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} \mathbf{E}_x^T \mathbf{E}_x & \mathbf{E}_x^T \mathbf{E}_y \\ \mathbf{E}_y^T \mathbf{E}_x & \mathbf{E}_y^T \mathbf{E}_y \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{E}_x^T \mathbf{E}_{lr} \\ \mathbf{E}_y^T \mathbf{E}_{lr} \end{bmatrix} \qquad (3.19)$$

The final disparity is (ξ + u, η + v).
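A compact sketch of this refinement, given the matched templates: the gradients of (3.16) are averaged over both images, every 2 x 2 group contributes one constraint (3.15), and the stacked system (3.17) is solved by least squares (np.linalg.lstsq is used here in place of the explicit pseudo-inverse of (3.19); the two are equivalent).

```python
import numpy as np

def subpixel_shift(tpl_l, tpl_r):
    """Purely translational flow (u, v) between the matched left and
    right templates, per (3.15)-(3.19).

    tpl_l, tpl_r: equal-shape float arrays; tpl_r is the discrete
    normalized-correlation match extracted from the right image.
    Returns the sub-pixel (u, v) to add to the discrete disparity.
    """
    l = tpl_l.astype(np.float64)
    r = tpl_r.astype(np.float64)
    # Gradients averaged over both images and each 2x2 group (3.16).
    ex = 0.25 * (l[:-1, 1:] - l[:-1, :-1] + l[1:, 1:] - l[1:, :-1] +
                 r[:-1, 1:] - r[:-1, :-1] + r[1:, 1:] - r[1:, :-1])
    ey = 0.25 * (l[1:, :-1] - l[:-1, :-1] + l[1:, 1:] - l[:-1, 1:] +
                 r[1:, :-1] - r[:-1, :-1] + r[1:, 1:] - r[:-1, 1:])
    diff = r - l
    elr = 0.25 * (diff[:-1, :-1] + diff[:-1, 1:] +
                  diff[1:, :-1] + diff[1:, 1:])
    # Stack E_x u + E_y v = -E_lr over all groups and solve (3.17).
    a = np.column_stack([ex.ravel(), ey.ravel()])
    (u, v), *_ = np.linalg.lstsq(a, -elr.ravel(), rcond=None)
    return u, v
```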

3.5 Transforming Coordinates

As the head is tracked in the images from the stereo camera, and accurate measurements of disparity are made, pan and tilt directives must be calculated to drive the movement of the teleconferencing cameras. This is done in three steps. First, the real-world position of the head, in Cartesian coordinates relative to the stereo camera pair, must be found. This position is then transformed into Cartesian coordinates relative to each of the two teleconferencing cameras, using knowledge of the cameras' relative orientations. Finally, these transformed Cartesian coordinates are mapped to polar pan and tilt values to drive the movement of the teleconferencing cameras.

To transform the location and disparity, (x, y, d), of the head being tracked in the image to Cartesian coordinates relative to the stereo pair, $(x_{sp}, y_{sp}, z_{sp})$, the projection equations are used [6].

$$x_{sp} = \frac{b}{d} x \qquad (3.20)$$

$$y_{sp} = \frac{b}{d} y \qquad (3.21)$$

$$z_{sp} = \frac{b}{d} f \qquad (3.22)$$

The Cartesian coordinates are then multiplied by a rotation/translation matrix for each of the two teleconferencing cameras in HAL, resulting in coordinates $(x_{tc}, y_{tc}, z_{tc})$ relative to each teleconferencing camera.

$$\begin{bmatrix} x_{tc} \\ y_{tc} \\ z_{tc} \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \sin\phi & -\sin\theta \cos\phi & \cos\theta\, \Delta x - \sin\theta\, \Delta z \\ 0 & \cos\phi & -\sin\phi & \Delta y \\ \sin\theta & \cos\theta \sin\phi & \cos\theta \cos\phi & \sin\theta\, \Delta x + \cos\theta\, \Delta z \end{bmatrix} \begin{bmatrix} x_{sp} \\ y_{sp} \\ z_{sp} \\ 1 \end{bmatrix} \qquad (3.23)$$

In the above equation, $\phi$ is the downward tilt angle of the stereo pair. $\Delta x$, $\Delta y$, and $\Delta z$ represent the distance from the teleconferencing camera to the stereo pair. This translation is described in the tilt corrected coordinate frame of the stereo pair. $\theta$ represents how far the teleconferencing camera is rotated in the xz-plane, again relative to the tilt corrected stereo pair.

Using inverse tangent relationships, the coordinates of the head relative to the teleconferencing cameras are further transformed into pan and tilt directives. These pan and tilt values are sent over a serial line to drive the movement of the cameras.

$$\mathrm{pan} = \tan^{-1}\!\left( \frac{x_{tc}}{z_{tc}} \right) + \frac{\pi}{2}\left[ 1 - \mathrm{sign}(z_{tc}) \right] \mathrm{sign}(x_{tc}) \qquad (3.24)$$

$$\mathrm{tilt} = \tan^{-1}\!\left( \frac{y_{tc}}{\sqrt{x_{tc}^2 + z_{tc}^2}} \right) \qquad (3.25)$$

The teleconferencing cameras used in this system have a 180° pan and 90° tilt range.
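The full chain (3.20)-(3.25) composes into a short function. Below is a sketch under the reconstruction above, assuming the tilt of (3.25) is measured from the horizontal plane; the parameters (f, phi, theta, delta) are per-camera calibration inputs.

```python
import numpy as np

def head_to_pan_tilt(x, y, d, f, b, phi, theta, delta):
    """Image position and disparity (x, y, d) -> pan/tilt for one
    teleconferencing camera, per (3.20)-(3.25).

    f: focal length in pixels; b: baseline (7cm here); phi: downward
    tilt of the stereo pair; theta: camera rotation in the xz-plane;
    delta: (dx, dy, dz) translation in the tilt-corrected stereo frame.
    """
    # Projection equations (3.20)-(3.22).
    p_sp = (b / d) * np.array([x, y, f], dtype=np.float64)
    # Tilt-correct, translate, then rotate into the camera frame; the
    # product R_y(theta) @ (R_x(phi) @ p + delta) expands to the 3x4
    # matrix of (3.23).
    rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(phi), -np.sin(phi)],
                   [0.0, np.sin(phi), np.cos(phi)]])
    ry = np.array([[np.cos(theta), 0.0, -np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [np.sin(theta), 0.0, np.cos(theta)]])
    x_tc, y_tc, z_tc = ry @ (rx @ p_sp + np.asarray(delta, dtype=np.float64))
    # Pan per (3.24), including the branch correction for z_tc < 0.
    pan = np.arctan(x_tc / z_tc) + (np.pi / 2) * (1 - np.sign(z_tc)) * np.sign(x_tc)
    # Tilt per (3.25), measured from the horizontal plane (assumed form).
    tilt = np.arctan(y_tc / np.hypot(x_tc, z_tc))
    return pan, tilt
```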

3.6 Equipment

The stereo camera pair used is the STH-V2 made by Videre Design, figure 3-8. It can output left and right video signals at a resolution of 320 x 240 or one interlaced signal combining 320 x 120 subsampled images. The latter mode is used for this thesis. The lenses used give the camera a 39° field of view.

Figure 3-8: Stereo Camera Pair

The stereo images are captured using a Matrox Meteor frame grabber and processed on a 600MHz Pentium III. The steerable teleconferencing cameras used are Sony EVI-D30s.

Figure 3-9: Steerable Teleconferencing Camera


Chapter 4

Results

4.1 Head Detection and Tracking Results

It is difficult to quantitatively assess the behavior of the head detection and tracking system. Statistics such as deviation from ground truth, rate of false positives, and mean time to detection all depend greatly on the situation. How fast, how often, and in what direction does the subject's head move? Is the head ever severely occluded? What types of background clutter are present? How often does motion occur above the head? Statistics for a system such as this are only meaningful when collected over hundreds of real-world trials. And even then, subjects must be carefully instructed to act naturally, to neither intentionally try to fool the system nor be overly cautious.

In light of these difficulties, this section instead provides a qualitative assessment of the system. Situations are presented which illustrate both the system's strengths and weaknesses. And comparisons are drawn to other methods considered during the development of the system.

4.1.1 Continuous Detection

Figure 4-1 illustrates the success and importance of continuous detection. Here, as the head moves downward in the image, the elliptical tracker is held fast by the high contrast edge between the couch and the back wall and loses the curved bottom edge of the jaw. By the sixth frame, it is following the front of the hair line rather than the perimeter of the face. At this point, the tracker's hold on the head is in jeopardy. Further downward movement will likely cause the tracker to lose the head entirely. In the seventh frame, however, the candidate presented by the continuously running detector is better fit than the current target and the system is restabilized.

Another benefit of continuous detection is that it eliminates the need for initialization. Many head tracking systems ignore the issue of initialization. Others have manual procedures. The focus during the design of this system was always kept on usability. Figure 4-2 shows each frame of an initialization sequence. In the first frame of this sequence, the tracker is without a target. By the second frame, enough of the torso has come into the frame to register motion. The detector notices this and presents the torso, as its best fit candidate, to the tracker. During the next four frames, the tracker, having no better target, follows the torso. When enough of the head comes into the frame, the continuously running detector recognizes that the head is a better fit target than the torso, and switches the tracker's attention to the head.

4.1.2 Occlusion

The nature of the elliptical tracker makes the system resistant to partial occlusion. Figure 4-3 shows every fourth frame of a sequence in which the tracker is unaffected by arms occluding the stereo camera's view of the head. This resistance is a result of taking the average edge strength around the perimeter of the ellipse. Objects partially occluding the head generally leave enough of the head's edges visible to keep this average high. In fact, human heads are rarely the exact shape for which the ellipse filter is searching. Close examination of the position of the ellipse as the system is tracking reveals that it is usually following just one or two high contrast curves and not the entire perimeter of the head.


Figure 4-1: Consecutive Frames of a Sequence Illustrating the Benefits of Continuous Detection. The frame order in this sequence and all others in this document is left to right, top to bottom.

Template based trackers can easily fail under situations of occlusion. The occluding object often pulls the template off the target object. To illustrate this, figures 4-4 and 4-5 show consecutive frames of an occlusion sequence. In the first sequence, an elliptical tracker is used. The occluding object is properly ignored. In the second sequence, a template tracker is used and is pulled off the subject's head by the occluding hand. Template trackers determine the new position of an image patch by computing a normalized correlation (3.4).

4.1.3 Out-of-plane Rotation

The elliptical shape tracker was also chosen for its ability to handle out-of-plane head rotation. Techniques such as template tracking that follow patterns of brightness fail upon object rotation when the pattern becomes self occluded. In most scenarios, the assumption that the head will not rotate significantly is invalid. The elliptical tracker instead relies upon the shape of the perimeter of the head as projected onto the image plane. This shape is nearly elliptical, regardless of head rotation. Figure 4-6 shows a rotation sequence through which the tracker remains correctly locked on the head.

4.1.4 Head Decoys

Figure 4-7 shows the results of a situation contrived to demonstrate irrecoverable system failure. This can happen when the tracker locks on to an object whose elliptical shape is a far better fit than a human head. As long as the elliptical object is visible, the tracker will never switch its attention away from it. In this figure, the black ellipse shows the tracking target. The white ellipse shows the detector candidate. A high contrast drawing of a head sized ellipse was made as a decoy to steal the attention of the tracker. In the first six frames of the sequence, the tracker correctly follows the head. In the seventh frame, the drawing is moved slightly. The detector notices the decoy and determines that it is a better-fit ellipse than the head being tracked. At this point, the tracker switches its attention to the drawing. In the last three frames, although the detector is finding the true human head, the elliptical fit of the decoy is far stronger and holds the attention of the tracker.

Although this situation is contrived, it illustrates a major weakness. The system detects and tracks ellipses, not heads. Objects in the background that fit ellipses better than human heads do are detriments to the system. If these decoys are moved or if a tracked head passes directly in front of them, the system can fail.

4.1.5 Accuracy of Depth Calculation

Section 3.4.1 describes a technique for accurately calculating depth to the head. After a discrete disparity value is found via normalized correlation, a sub-pixel shift is calculated across the matching templates from the left and right images. This shift is found using parametric optical flow and is added to the preliminary discrete value to obtain a more accurate disparity. To test the validity of this approach, a sequence was taken of cyclical head movement towards and away from the stereo pair, figure 4-8. The depth determined by the tracker is plotted in figure 4-9. For comparison, the plot also shows with a dotted line the discrete output of the normalized correlation calculation alone.

Accurate disparity calculations are critical in a teleconferencing scenario. In HAL, the teleconferencing cameras' two views of the couch are nearly orthogonal to that of the stereo pair. A slight error in depth can translate to a significant error in the pan or tilt angle of a teleconferencing camera. At frame 110 of the depth test sequence, figure 4-9 shows a discrepancy of 15cm between the depth calculated with and without parametric optical flow. If a teleconferencing camera were aiming for a tight shot of the head, an error of this magnitude could result in undesirable cropping of the head. Measuring sub-pixel disparities increases the accuracy of depth calculations by roughly one order of magnitude and is a critical feature of this system.
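The sensitivity behind these numbers follows directly from (3.3). Differentiating with respect to disparity gives

$$z = \frac{fb}{d} \;\Rightarrow\; \left| \frac{\partial z}{\partial d} \right| = \frac{fb}{d^2} = \frac{z^2}{fb}$$

so depth error grows with the square of range. Taking the figures quoted above at face value (z = 300cm, b = 7cm, and 30cm of depth error per pixel of disparity error) implies an effective focal length of roughly $f \approx z^2 / (b\,\Delta z) \approx 90000 / 210 \approx 430$ pixels; at a tenth of a pixel of disparity error, the same relationship predicts a depth error of about 3cm, consistent with the order-of-magnitude improvement claimed. These numbers are a consistency check derived from the text, not additional measurements.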

4.2 Teleconferencing Results

The major drawback to the cameras used in this system is that their maximum drive speed is only 80° per second. When the subject is close and moving laterally, the cameras can take on the order of seconds to move to a new position. Hence, they cannot react quickly to a head moving out of their field of view, despite the fact that real-time tracking data is available. Aside from this caveat, the system works well. In most situations, the cameras provide well-centered close-up shots of the head.


Figure 4-2: Consecutive Frames of an Initialization Sequence


Figure 4-3: Every Fourth Frame of an Occlusion Sequence.


Figure 4-4: Consecutive Frames of an Occlusion Sequence Using Ellipse Tracking.


Figure 4-5: Consecutive Frames of an Occlusion Sequence Using Template Tracking.


Figure 4-6: Every Tenth Frame of a Rotation Sequence


Figure 4-7: Every Tenth Frame of a Decoy Sequence


Figure 4-8: Every Tenth Frame of a Depth Test Sequence. The first image in this sequence is frame 50. The last is frame 130. The image templates used in the correspondence calculation are shown in the bottom left corner of each frame. A plot of depth to the head in this sequence can be found in figure 4-9.


Figure 4-9: Plot of Depth to Head Over Time. The dotted line plots depth as derived from the discrete output of the normalized correlation calculation alone. The solid line plots depth after the sub-pixel optical flow results have been added to the disparity calculation. The video sequence corresponding to this plot can be seen in figure 4-8.


Chapter 5

Conclusion

A system has been developed which detects and tracks the head of a subject moving about within HAL, an intelligent environment. It monitors activity in the room through a stereo camera pair. The detector works by looking for an elliptical shape in a search space constrained by motion and size cues. The tracker follows the elliptical shape until the detector presents one which is better fit. Depth to the head is calculated using normalized correlation and refined in accuracy by determining the parametric optical flow across the matched image templates from the left and right cameras.

To test and demonstrate the system, an automated teleconferencing application was developed. The three-dimensional coordinates of the subject's head are transformed into polar coordinates to drive the pan and tilt of two steerable cameras.

The test was a success. Although the steerable cameras move slowly and cannot keep up with a quickly moving head, they eventually center on the head when it comes to rest. The system is robust against partial occlusion, rotation, changes in lighting, and variation in hair and skin color.

5.1 Future Work

One clear path for future work is to extend the system to support the detection and tracking of multiple heads. There is nothing in the design of the current system to prevent such an extension. Rather than tracking the best fit ellipse in the scene, the system could simply track all candidates whose elliptical fits exceeded a threshold. The detector, rather than switching the attention of the tracker, could instead spawn new trackers. Additional logic could be added to retire trackers whose targets do not move for long periods of time. As a side effect, extending the system to support multiple heads may alleviate the problem illustrated in figure 4-7 whereby a decoy can permanently steal the attention of the tracker.

As was mentioned in section 2, some of the more robust head detection and tracking systems have been the result of multi-modal approaches. And the authors of such systems generally claim that adding more modes increases performance. Two modes which could be added to future versions of this system are skin color detection and pattern recognition. Both of these were considered during the development of the existing system but never implemented. Color was not available from the monochrome stereo pair. And existing pattern recognition systems, such as CMU's [10], failed due to the downward viewing angle of the stereo pair. There are, however, other cameras in the room which could supply skin color and face detection information to the system, narrowing the search for heads in the image from the stereo pair to epipolar regions. Additionally, a pattern recognition module could be specially trained to detect heads from the downward viewing angle of the stereo pair.

One problem with the existing system is the time it takes for the teleconferencing cameras to drive to a new position. It might be advantageous for future versions of the system to account for this delay. The current system, when commanding the cameras to drive, relays the position of the head as of the time the command is issued. The system might instead predict where the head will be by the time the cameras finish the movement and relay this position.

Another solution to the slow camera problem is to digitally crop the head from wider angle shots. In each frame captured by the teleconferencing cameras, head position could be used to select a cropping region, as sketched below. The resolution of the resulting sequence would, of course, be reduced. But, this reduction is already common practice in teleconferencing scenarios due to limited bandwidth.
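A minimal sketch of such a cropping-region selection, with the zoom factor (how many head widths the crop spans) as an assumed parameter:

```python
def crop_region(head_x, head_y, head_w, frame_w, frame_h, zoom=3.0):
    """Cropping window centered on the head within a wide-angle frame.

    head_x, head_y: head center in pixels; head_w: head width in pixels
    (available from the tracked ellipse); zoom: assumed span of the
    crop in head widths. Returns (x0, y0, x1, y1) clamped to the frame.
    """
    half = zoom * head_w / 2.0
    x0 = max(0, int(head_x - half))
    y0 = max(0, int(head_y - half))
    x1 = min(frame_w, int(head_x + half))
    y1 = min(frame_h, int(head_y + half))
    return x0, y0, x1, y1
```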


Bibliography

[1] D. Beymer and K. Konolige. Real-time tracking of multiple people using continuous detection. IEEE Conference on Computer Vision and Pattern Recognition, 1999.

[2] S. Birchfield. Elliptical head tracking using intensity gradients and color histograms. IEEE Conference on Computer Vision and Pattern Recognition, pages 232-237, Santa Barbara, CA, June 1998.

[3] M. Coen and K. Wilson. Learning spatial event models from multiple-camera perspectives. Annual Conference of the IEEE Industrial Electronics Society, 1999.

[4] T. Darrell, G. Gordon, M. Harville, and J. Woodfill. Integrated person tracking using stereo, color, and pattern detection. IEEE Conference on Computer Vision and Pattern Recognition, pages 601-609, Santa Barbara, CA, June 1998.

[5] M. Fleck, D. Forsyth, and C. Bregler. Finding naked people. European Conference on Computer Vision, volume 2, pages 592-602, 1996.

[6] B. K. P. Horn. Robot Vision. The MIT Press, Cambridge, Massachusetts, 1986.

[7] K. Konolige. Small vision systems: Hardware and implementation. Eighth International Symposium on Robotics Research, Hayama, Japan, October 1997.

[8] S. McKenna and S. Gong. Tracking faces. International Conference on Automatic Face and Gesture Recognition, Killington, Vermont, October 1996.

[9] C. Morimoto, D. Koons, A. Amir, and M. Flickner. Real-time detection of eyes and faces. Workshop on Perceptual User Interfaces, San Francisco, CA, November 1998.

[10] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, January 1998.

[11] H. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based face detection. IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, June 1998.

[12] B. Scassellati. Eye finding via face detection for a foveated, active vision system. National Conference on Artificial Intelligence, Madison, WI, 1999.

[13] J. Shi and C. Tomasi. Good features to track. IEEE Conference on Computer Vision and Pattern Recognition, pages 593-600, 1994.

[14] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780-785, July 1997.