Computer Vision and Image Understanding 106 (2007) 270–287

Mutual information based registration of multimodal stereo videos for person tracking

Stephen J. Krotosky *, Mohan M. Trivedi

Computer Vision and Robotics Research Laboratory, University of California, San Diego, 9500 Gilman Dr. 0434, La Jolla, CA 92093-0434, USA

Received 15 September 2006; accepted 23 October 2006. Available online 20 December 2006.

Communicated by James Davis and Riad Hammoud

Abstract

Research presented in this paper deals with the systematic examination, development, and evaluation of a novel multimodal registration approach that can perform accurately and robustly for relatively close range surveillance applications. An analysis of multimodal image registration gives insight into the limitations of assumptions made in current approaches and motivates the methodology of the developed algorithm. Using calibrated stereo imagery, we employ maximization of mutual information in sliding correspondence windows that inform a disparity voting algorithm to demonstrate successful registration of objects in color and thermal imagery. Extensive evaluation of scenes with multiple objects at different depths and levels of occlusion shows high rates of successful registration. Ground truth experiments demonstrate the utility of the disparity voting techniques for multimodal registration by yielding qualitative and quantitative results that outperform approaches that do not consider occlusions. A basic framework for multimodal stereo tracking is investigated and promising experimental studies show the viability of using registration disparity estimates as a tracking feature.

© 2007 Elsevier Inc. All rights reserved.

Keywords: Thermal infrared sensing; Multisensor fusion; Person tracking; Visual surveillance; Situational awareness

1. Introduction

A fundamental issue associated with multisensory vision is that of accurately registering corresponding information and features from the different sensory systems. This issue is exacerbated when the sensors are capturing signals derived from totally different physical phenomena, such as color (reflected energy) and thermal signature (emitted energy). Multimodal imagery applications for human analysis span a variety of application domains, including medical [1], in-vehicle safety systems [2] and long-range surveillance [3]. The combination of both types of imagery yields information about the scene that is rich in color, depth, motion and thermal detail. Once registered, such information can then be used to successfully detect, track and analyze movement and activity patterns of persons and objects in the scene.

1077-3142/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.cviu.2006.10.008

This research is sponsored by the Technical Support Working Group (TSWG) for Combating Terrorism, DHS and the U.C. Discovery Grant.

* Corresponding author. E-mail addresses: [email protected] (S.J. Krotosky), [email protected] (M.M. Trivedi).

At the heart of any registration approach is the selection of the most relevant similarity metric, which can accurately match the disparate physical properties manifested in images recorded by multimodal cameras. Mutual Information (MI) provides an attractive metric for situations where there are complex mappings of the pixel intensities of corresponding objects in each modality, due to the disparate physical mechanisms that give rise to the multimodal imagery [4]. Egnal has shown that mutual information is a viable similarity metric for multimodal stereo registration when the mutual information window sizes are large enough to sufficiently populate the joint probability histogram of the mutual information computation [5]. Further investigations into the properties and applicability of mutual information as a windowed correspondence measure have been conducted by Thevenaz and Unser [6]. Challenges lie in obtaining these appropriately sized window regions for computing mutual information in scenes with multiple people and occlusions, where a balanced tradeoff between larger windows for matching evidence and smaller windows for registration detail is needed.

This paper presents the following contributions: we first give a detailed analysis of current methods for multimodal registration with a comparative analysis that motivates our approach. We then present our approach for mutual information based multimodal registration. A disparity voting technique that uses the accumulation of disparity values from sliding correspondence windows gives reliable and robust registration, and an analysis of several thousand frames demonstrates its success for complex scenes with high levels of occlusion and many objects occupying the imaged space. An accuracy evaluation against ground truth measurements is presented, and a comparative study using practical segmentation methods illustrates how the occlusion handling of the disparity voting algorithm improves over previous approaches. Additionally, a basic framework for person tracking in multimodal video is presented and a promising experimental study is given to illustrate the use of the disparities generated from multimodal registration as a feature for tracking. We then discuss current algorithmic issues and potential resolutions for future research.

2. Multimodal registration approaches: comparative analysis of algorithms

In a multimodal, multicamera setup, because each camera can be at a different position in the world and have different intrinsic parameters, objects in the scene cannot be assumed to be located at the same position in each image. Due to these camera effects, corresponding objects in each image may have different sizes, shapes, positions, and intensities. In order to combine the information in each image, it is required that the corresponding objects in the scene be aligned, or registered. Sensory measurements can then be fused or features combined in a variety of ways that can fuel algorithms that take advantage of the information provided from multiple and differing image sources [7]. Experiments in our previous work [2] have offered analysis and insight into the commonalities and uniqueness of the multimodal imagery. Multimodal image registration approaches vary based on factors such as camera placement, scene complexity and the desired range and density of registered objects in the scene. In order to better understand the algorithmic details of the various multimodal registration techniques, it is important to outline the underlying geometric framework for registration. Much of the multiple view geometry presented in this paper is adapted from Hartley and Zisserman [8].

Given a two camera setup with camera center locations C and C', a 3D point in space can be defined relative to each of the camera coordinate systems as P = (X, Y, Z)^T and P' = (X', Y', Z')^T, respectively. The coordinate system transformation between P and P' is

P' = R P + T    (1)

where R is the matrix that defines the rotation between the two camera centers and T is the translation vector that represents the distance between them. Additionally, the projection matrices for each camera are defined as K and K', where the projected points on the image plane are the homogeneous coordinates p = (x, y, 1) and p' = (x', y', 1).

Let π be a plane in the scene parameterized by N, the surface normal of the plane, and d_π, the distance from the camera center C. A point lies on that plane if N^T P = d_π. The homography induced by π is P' = H_π P, where

H_\pi = R - \frac{T N^T}{d_\pi}    (2)

Applying the projection matrices K and K', we have p' = H p, where H = K' H_π K^{-1}, giving

H = K' \left( R - \frac{T N^T}{d_\pi} \right) K^{-1}    (3)

This homographic transformation describes the transformation of points only when the points lie on the plane π (i.e., N^T P = d_π). When a point does not lie on this plane, an additional parallax component needs to be added to the transformation equation to accommodate the projective depth of other points in the scene relative to the plane π. It has been shown in [8] that the transformation that includes the additional parallax term is

p' = H p + d e'    (4)

where e' is the epipole in C' and d is the parallax relative to the plane π. The epipole is the intersecting point between the image plane and the line containing the optical centers of C and C'. Eq. (4) effectively decomposes the point correspondence equation into a term for the induced planar homography (H_π) and the parallax associated with points that do not satisfy the planar homography assumption (d e'). It is within this framework that we will describe the registration techniques used for multimodal imagery. Fig. 1 illustrates the main approaches to multimodal image registration that will be analyzed. Additionally, Table 1 provides a summary of references utilizing these approaches and indicates the assumptions, methods and limitations of each.

2.1. Infinite homographic registration

In Conaire et al. [9] and Davis and Sharma [3], it is assumed that the thermal infrared and color cameras are nearly colocated and the imaged scene is far from the camera, so that the deviation of pedestrians from the ground plane is negligible compared to the distance between the ground and the cameras. Under these assumptions, an infinite planar homography can be applied to the scene and all objects will be aligned in each image.


Fig. 1. Geometric illustration of the four main approaches to multimodal image registration. (a) Infinite homography. (b) Global. (c) Stereo geometric. (d) Partial image ROI.


The infinite planar homography, H_∞, is defined as the homography that occurs when the plane π is at infinity. An illustration of this type of registration geometry is shown in Fig. 1(a). Starting from (3), we define

H_\infty = \lim_{d_\pi \to \infty} H = K' R K^{-1}    (5)

When the plane is at infinity, the homography between points depends only on the rotation R between the cameras and the internal projection matrices for each camera, K and K'. Similarly, from (4), Hartley and Zisserman [8] showed that the correspondence equation for image points under an infinite homography is

p' = H_\infty p + \frac{K' t}{Z}    (6)

where Z = 1/d is the depth from C and K' t = e' is the epipole in C'.

Infinite homographic registration techniques are used when the scene is very far from the camera. When all observed objects are very far from C, then Z → ∞ and the parallax effects will be negligible. Alternatively, when the cameras are nearly colocated, i.e. t → 0, the parallax term also becomes negligible. In both cases the correspondence equation becomes

p' = H_\infty p    (7)


Table 1
Review of approaches to multimodal registration and body analysis

Trivedi et al. [2]
  Modalities: visual, IR, 3D
  Assumptions: approximate colocation
  Calibration: none
  Registration: none (comparative evaluation)
  Application: head detection for airbag deployment
  Algorithm: head detection and tracking using background subtraction and elliptical templates; segmentation using background subtraction in disparity for visual imagery and hot spot localization for thermal infrared imagery
  Comments: comparative analysis of head detection algorithms using both stereo and thermal infrared imagery

Davis and Sharma [18,19,3]
  Modalities: visual, IR
  Assumptions: colocation; observed scene far from camera
  Calibration: none
  Registration: infinite homographic
  Application: person detection and background modeling
  Algorithm: fused contour saliency maps (CSMs) are used to form silhouettes to enhance background modeling
  Comments: does not deal with occlusion and discriminating people merged into one silhouette; camera placement can be prohibitive

Conaire et al. [9]
  Modalities: visual, IR
  Assumptions: colocation; observed scene far from camera; majority of scene is background; hotspots valid for human segmentation
  Calibration: none
  Registration: infinite homographic
  Application: person detection and background modeling
  Algorithm: hysteresis threshold of initial foregrounds used to form background model update from foreground object velocity, size, edge magnitude, and thermal brightness
  Comments: hotspot segmentation is a limiting assumption; does not deal with occlusion and discriminating people merged into one silhouette; deviation from histogram assumption only valid when majority of scene is background

Irani and Anandan [10]
  Modalities: visual, IR
  Assumptions: parametric transformation model can globally match entire scene
  Calibration: none
  Registration: global (parametric correlation surface)
  Application: general multimodal registration
  Algorithm: directional-derivative-energy features obtained for Gaussian pyramid of input images; local correlation of features used to iteratively find best global parametric alignment using Newton's method
  Comments: experimental images only contain one dominant plane in scene and no foreground objects; global parametric model not likely to model large parallax effects well

Coiras et al. [11]
  Modalities: visual, IR
  Assumptions: most edges common across modalities
  Calibration: none
  Registration: global (affine)
  Application: general multimodal registration
  Algorithm: the global affine transformation that best maximizes the global edge-formed triangle matching is determined from transformations obtained by matching individual formed triangles in the image
  Comments: global affine model cannot account for large parallax effects; experiments are not performed for multiple objects in scene at different planes

Han and Bhanu [12]
  Modalities: visual, IR
  Assumptions: simplified projective transformation for planar scene objects; background objects unregistered; human must walk within same plane during sequence; hotspots valid for human segmentation
  Calibration: uncalibrated
  Registration: global (projective model for planar objects)
  Application: person detection
  Algorithm: top-of-head and centroid from two frames in sequence used as input to Hierarchical Genetic Algorithm (HGA) that searches for best registration
  Comments: walking along different planes results in different registration; multiple people at different depths will not be registered; unrealistic that humans walk in same line for registration; need entire sequence before first frame can be registered

Itoh et al. [13]
  Modalities: visual, IR, 3D
  Assumptions: colocation; predefined operating region
  Calibration: calibration board using 25 points
  Registration: global (quadratic model from calibration points)
  Application: hand and object detection
  Algorithm: features such as skin tone, hot spots, depth in operating region, and motion are fused to localize hands in operating region
  Comments: registration assumptions only valid for objects within a range of certain depths located inside the limited "workspace"; information from each modality is heuristically thresholded and not probabilistically generalized

Ye [14]
  Modalities: visual, IR
  Assumptions: colocation; registration is displacement and scaling
  Calibration: none
  Registration: global registration and tracking using Hausdorff distance edge matching
  Application: person tracking
  Algorithm: top points of segmented objects are tracked; registration is iteratively refined over time using motion information; registered images are used for face detection by hot spots
  Comments: global matching not valid when people are at large depth differences; experiments do not test large movements over sequences where registration parameters would be changing

Ju et al. [15]
  Modalities: visual, IR, 3D
  Assumptions: colocation; high-resolution stereo; only one face in scene, positioned carefully
  Calibration: stereo camera calibration
  Registration: stereo geometric
  Application: 3D thermography of face
  Algorithm: multiscale stereo depth information mapped onto 3D face model
  Comments: registration evaluation in a low-res stereo environment and in real-world conditions (e.g. multiple people, occlusions, lighting) remains an open question

Bertozzi et al. [16]
  Modalities: visual, IR, 3D
  Assumptions: colocation; stereo pairs for each modality
  Calibration: calibrated stereo rigs
  Registration: stereo geometric
  Application: pedestrian detection
  Algorithm: stereo estimates from each unimodal stereo pair combined in disparity space
  Comments: four camera system cumbersome in terms of setup and maintenance, as well as in terms of image processing and data management

Chen et al. [17]
  Modalities: visual, IR
  Assumptions: target tracking problem assumed solved; registration is only a displacement and known scale
  Calibration: scale factor known a priori
  Registration: partial image ROI
  Application: object detection; concealed weapon detection
  Algorithm: maximizing mutual information (MMI) of individual bounding box ROIs for each object in scene; simplex method used to search for MMI
  Comments: assumption of perfect target tracking gives ideal bounding boxes; with a real world tracker, how to handle occlusions, overlaps, and incompleteness?

This paper
  Modalities: visual, IR, 3D
  Assumptions: stereo configuration; reasonable foreground extraction; each object has a single disparity in the scene
  Calibration: calibrated multimodal stereo
  Registration: disparity voting
  Application: object detection and person tracking
  Algorithm: disparity voting for sliding mutual information correspondence windows yields registration disparities for objects in scene
  Comments: successful registration through occlusions and scenes with multiple people; disparity estimates can be used as feature in tracking algorithms


The use of an infinite planar homography is an effective way of registering the scene, but only when the scene being registered conforms to the homographic assumptions. This means that the scene must be very far from the camera so that an object's displacement from the ground plane is negligible compared to the observation distance. While this type of assumption is appropriate for long distance and overhead surveillance scenes, it is not valid in situations where objects and people can be at various depths whose difference is significant relative to their distance from the camera. In these cases, the infinite homography assumption will not align all objects in the scene. In addition, when the assumption of an infinite homography does hold, the lack of a parallax term precludes any estimate of depth that could be used as a differentiator for occluding objects.
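As a concrete illustration, the sketch below forms H_∞ = K' R K^{-1} from Eq. (5) and warps one modality onto the other with OpenCV. The intrinsics, rotation, and file name are assumptions for illustration only:

```python
import cv2
import numpy as np

# Illustrative evaluation of Eq. (5): the infinite homography H_inf = K' R K^-1.
# K, Kp, R are assumed to come from calibration; the values here are placeholders.
K  = np.array([[700., 0., 320.], [0., 700., 240.], [0., 0., 1.]])
Kp = np.array([[720., 0., 316.], [0., 720., 244.], [0., 0., 1.]])
theta = np.deg2rad(2.0)                        # small rotation about the y-axis
R = np.array([[np.cos(theta), 0., np.sin(theta)],
              [0., 1., 0.],
              [-np.sin(theta), 0., np.cos(theta)]])

H_inf = Kp @ R @ np.linalg.inv(K)              # Eq. (5)

# Warping one modality with H_inf aligns the scene only if the far-scene or
# colocation assumptions of Section 2.1 hold; otherwise parallax remains.
thermal = cv2.imread('thermal.png', cv2.IMREAD_GRAYSCALE)   # hypothetical file
aligned = cv2.warpPerspective(thermal, H_inf,
                              (thermal.shape[1], thermal.shape[0]))
```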

2.2. Global image registration

Global approaches to registration can be used when further assumptions about the movement and placement of objects and people in a scene are employed to make the registration fit a specific model. The registration will be accurate when the scene follows the specific model used, but can be grossly inaccurate when the imaged scene does not fit the assumptions of the model.

The usual assumption of these techniques is that all objects lie on the same plane in the scene. Often, to enforce this assumption, only foreground objects are considered. Global image registration techniques make the assumption that d, the measure of difference from the homographic plane in (4), will be small for all objects in the scene. However, in scenes where objects of interest are at different planes, only the objects lying on the plane π that induces the homography will be registered. All other objects that lie on different planes will be misaligned due to the second term d e' in (4).

If the distance of objects from the plane is small compared to the distance of the cameras from the plane, the parallax effects tend to zero and the homography accurately describes the registration of objects in the scene at any depth. Works that have applied this global registration technique operated either on the single plane or approximate colocation assumption to allow for accurate scene registration. An illustration of this type of registration is shown in Fig. 1(b).

Irani and Anandan [10] used directional-derivative-energy operators to generate features from a Gaussian pyramid of the visual and thermal infrared images and used local correlation values for these features to obtain a global alignment for the multimodal image pair. Alignment is done by estimating a parametric surface correspondence that can estimate the registration alignment of the two images. Newton's method is used to iteratively search for the parametric transformation that maximizes the global alignment.

Coiras et al. [11] match triangles formed from edge features in visual and thermal infrared images to learn an affine transformation model for static images. The global affine transformation that best maximizes the global edge-formed triangle matching is searched from transformations obtained by matching individual formed triangles in one image to individual formed triangles in the second image.

Han and Bhanu [12] used the features extracted when a human walked in a scene to learn a projective transformation model to register visual and IR images. It is assumed that the person walking in the scene walks in a straight line during the registration sequence. This enforces that the person is located within a single plane throughout the sequence and ensures that the global projective transformation model assumption holds. Feature points derived from foreground silhouettes in two pairs of images in the sequence are used as input to a Hierarchical Genetic Algorithm that searches for the best global transformation.

Itoh et al. [13] used a calibration board to register colocated color and thermal infrared cameras for use in a system that recognized hand movement for multimedia production. The calibration board points were used to establish a quadratic transformation model between the color and thermal infrared images. Registration is only required for a predefined workspace with a fixed range within the image scene, and the calibration board was placed to ensure registration in that region.

Similarly, Ye [14] used silhouette tracking and Hausdorff distance edge matching to register visual and thermal infrared images. In this case, it is assumed that the cameras are nearly colocated and that registration can be accomplished with a displacement and scaling. The detected top points of foreground silhouettes are tracked using the motion associations with previously tracked points. The Hausdorff distance measure is used to match edge features in each silhouette and estimate the scale and translation parameters. The registration and tracking are then used and updated to provide simultaneous tracking and iterative registration.

Global image registration methods place some limiting assumptions on the configuration of objects in the scene. Specifically, it is assumed that all registered objects will lie on a single plane in the image, and it is impossible to accurately register objects at different observation depths, as the registration transform for each object will depend on the varying perspective effects of the camera. This means that accurate registration can only occur when there is only one observed object in the scene [12], or when all the observed objects are restricted to lie at approximately the same distance from the camera [13,14]. The global alignment algorithms proposed by Irani and Anandan [10] and Coiras et al. [11] do not account for situations where there are objects at different depths or planes in the image. Both use the assumption that the colocation of the cameras and the observed distances are such that the parallax effects can be ignored.

The primary limitation of global registration methods is that it is impossible to register objects at different depths. Global methods effectively restrict the successfully registered area to a single plane in the image. When colocated cameras are used to relax the single plane restriction, parallax effects become negligible, and the problem becomes akin to infinite homographic methods.

2.3. Stereo geometric registration

When a stereo camera setup is used in combination with additional cameras from other modalities, the images from each modality can be combined using the stereo 3D point estimates and the geometric relation between the stereo and multimodal cameras. As demonstrated in Ju et al. [15], stereo cameras can give accurate 3D point coordinates for objects in the image. If the remaining cameras are then calibrated to the reference stereo pair, usually with a calibration board, then the pixels in those images (thermal infrared) can be reprojected onto the reference stereo image. The resulting reprojection will be registered to the stereo reference image.

In this case, for a point p in the reference stereo image, an estimate of its 3D location P̂ is given by the calibrated stereo geometry parameters. Additionally, the calibration between the left reference stereo image and the additional thermal infrared modality gives the rotation R and translation T between camera coordinates. This allows the change of coordinate system to the thermal infrared reference frame, P_TIR = R P̂ + T. The 3D point can then be reprojected onto the infrared image plane:

p_{TIR} = K_{TIR} P_{TIR}    (8)

The thermal image point is then put into homogeneous form, and the intensity value at this location in the thermal infrared image can be assigned to the point p in the stereo reference image. Such a registration technique is illustrated in Fig. 1(c).
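A minimal sketch of this reprojection chain, assuming the calibration parameters are known (the values below are placeholders, not measured parameters):

```python
import numpy as np

# Sketch of the stereo geometric reprojection of Section 2.3. All camera
# parameters and the 3D point are illustrative assumptions.
K_TIR = np.array([[650., 0., 160.], [0., 650., 120.], [0., 0., 1.]])
R = np.eye(3)                        # stereo-to-thermal rotation
T = np.array([0.05, 0., 0.])         # stereo-to-thermal translation (m)

def reproject_to_thermal(P_hat):
    """Map a 3D point estimated by the stereo pair into the thermal image.

    P_hat: 3D point in the reference stereo camera frame (from stereo depth).
    Returns pixel coordinates in the thermal infrared image, per Eq. (8).
    """
    P_tir = R @ P_hat + T            # change of coordinate system
    p_tir = K_TIR @ P_tir            # projection, Eq. (8)
    return p_tir[:2] / p_tir[2]      # homogeneous normalization

# e.g. a point 3 m in front of the stereo reference camera:
print(reproject_to_thermal(np.array([0.2, -0.1, 3.0])))
```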

For the case of stereo geometric registration techniques, objects in a scene at very different depths can be registered as long as the stereo disparity information is available for those objects. If the stereo algorithm can provide dense and accurate stereo for the objects in the scene, stereo geometric registration is a quick and effective way of registering the visual and infrared imagery. In the experiments of Ju et al. [15], the observed object (a head) was carefully placed into the scene and it was assumed to be the only object in the scene. Stereo data was captured using high resolution stereo cameras in a fairly stable and well-conditioned scene. The resulting 3D stereo image was dense and accurate in these conditions. However, experiments need to be conducted to see how these environmental conditions can be relaxed. Namely, it is important to examine how stereo geometric registration techniques perform in real world conditions, where using standard resolution cameras in environments of poor lighting, poor textures and occlusions can affect the quality and reliability of the 3D reprojection registration technique.

Multiple stereo camera approaches to stereo geometric registration have been investigated by Bertozzi et al. [16]. Using four cameras configured into two unimodal stereo pairs that yield two separate disparity estimates, registration can occur in the disparity domain. While this approach yields redundancy and registration success, the use of four cameras can be cumbersome both in physical construction, calibration and management, as well as in data storage and processing.

2.4. Partial image ROI registration

An approach to registering objects at multiple depths is to use partial image region-of-interest registration. The main assumption of this approach is that each individual object in the scene lies on a specific plane and that each plane can be individually registered with a separate homography. For each of the i regions-of-interest Ω in the image, if p ∈ Ω_i, then

p' = H_i p + d_i e'    (9)

Again, it is assumed that the parallax effects are negligible within each object, as each is approximated as a single planar object in the scene. As long as each Ω_i satisfies this assumption, the registration technique will be applicable. This is illustrated in Fig. 1(d).

Chen et al. [17] proposed that the visual and infrared imagery be registered using a maximization of mutual information technique on bounding boxes that correspond to detected objects in one of the modalities. It is assumed that the corresponding region is at a scale and displacement away. It is also assumed that the scale is fixed and known a priori. The matching bounding box is then searched for in the other modality using a simplex method. This allows bounding boxes that correspond to objects at different depths to be successfully registered.

Chen et al. assume that the bounding boxes representing a single object can always be properly segmented and tracked in one of the modalities. The assumption that bounding boxes will be properly segmented will often not hold, especially in uncontrolled scenes where the issues of lighting, texture and occlusions can produce segmentation results that contain two or more merged objects at different depths. Bounding boxes that contain multiple objects will not register properly, as the required assumption that an ROI contains objects within a single plane will not hold.

3. An approach to mutual information based multimodal registration

Our registration algorithm [20] addresses the registration of objects at different depths in relatively close range surveillance scenes and eliminates the need for perfectly segmented bounding boxes by relying on reasonable initial foreground segmentation and using a disparity voting algorithm to resolve the registration for occluded or malformed segmentation regions. This approach gives robust registration disparity estimation with statistical confidence values for each estimate. Fig. 2 shows a flowchart outlining our algorithmic framework. Individual modules are described in the subsequent sections.

3.1. Multimodal image calibration

A minimum camera solution for registering multimodal imagery in these short range surveillance situations is to use a single camera from each modality, arranged in a stereo pair. Unlike colocating the cameras, arranging the cameras into a stereo pair allows objects at different depths to be registered. To perform this type of registration, it is desirable to first calibrate the color and thermal infrared cameras. Knowing the intrinsic and extrinsic calibration parameters transforms the epipolar lines to lie along the image scanlines, enabling disparity correspondence matching to be a one-dimensional search. Calibration can be performed using standard techniques, such as those available in the Camera Calibration Toolbox for Matlab [21]. The toolbox assumes input images from each modality where a calibration board is visible in the scene. In typical visual setups, this is simply a matter of placing a checkerboard pattern in front of the camera. However, due to the large differences in visual and thermal imagery, some extra care needs to be taken to ensure the calibration board looks similar in each modality. A solution is to use a standard calibration board and illuminate the scene with high intensity halogen bulbs placed behind the cameras. This effectively warms the checkerboard pattern, making the visually dark checks appear brighter in the thermal imagery. Placing the board under constant illumination reduces the blurring associated with thermal diffusion and keeps the checkerboard edges sharp, allowing for calibration with subpixel accuracy. An example pair of images in the visual and thermal infrared domain and the subsequently calibrated and rectified image pair are shown in Fig. 3.

Fig. 2. Flowchart of disparity voting approach to multimodal image registration.

3.2. Image acquisition and foreground extraction

The acquired and rectified image pairs are denoted as I_L, the left color image, and I_R, the right thermal image. Due to the large differences in imaging characteristics, it is very difficult to find correspondences for the entire scene. Instead, registration is focused on the pixels that correspond to foreground objects of interest. Naturally, then, it is desirable to determine which pixels in the frame belong to the foreground. In this step, only a rough estimate of the foreground pixels is necessary, and a fair amount of false positives and negatives is acceptable. Any "good" segmentation algorithm could potentially be used with success. The corresponding foreground images are F_L and F_R, respectively. Additionally, the color image is converted to grayscale for mutual information based matching. Example input images and foreground maps are shown in Fig. 4; a sketch of one possible rough extraction follows the figure captions below.

Fig. 3. Multimodal stereo calibration using a heated calibration board to allow for a visible checkerboard pattern in thermal imagery. (a) Color image. (b) Thermal image. (c) Rectified color image. (d) Rectified thermal image.

Page 9: Mutual information based registration of …cvrr.ucsd.edu/publications/2007/SKrotosky_CVIU07_mutual...Mutual information based registration of multimodal stereo videos for person tracking

Fig. 4. Image acquisition and foreground extraction for color and thermal imagery. (a) Color. (b) Color segmentation. (c) Thermal. (d) Thermal segmentation. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
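The sketch below illustrates the kind of rough foreground extraction this step requires. The paper's experiments use a codebook background model [22] for the color stream and intensity thresholding for the thermal stream; here OpenCV's MOG2 subtractor stands in for the codebook model, and the threshold value is an arbitrary assumption:

```python
import cv2

# Illustrative foreground extraction; MOG2 is a substitute for the codebook
# model [22] used in the paper, and thermal_thresh is an assumed value.
bg_model = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def foreground_maps(color_frame, thermal_frame, thermal_thresh=180):
    # F_L: rough foreground from background subtraction on the color image
    F_L = bg_model.apply(color_frame) > 0
    # F_R: hot pixels assumed to correspond to people in the thermal image
    F_R = thermal_frame > thermal_thresh
    # grayscale conversion for mutual information matching (Section 3.2)
    I_L = cv2.cvtColor(color_frame, cv2.COLOR_BGR2GRAY)
    return I_L, F_L, F_R
```

Only a rough estimate is needed at this stage; the disparity voting of Section 3.4 is designed to tolerate false positives and negatives in these maps.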


Fig. 5. Mutual information for correspondence windows. (a) Color image. (b) Thermal image. (c) Mutual information. (d) Disparity voting matrix. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)


3.3. Correspondence matching using maximization of mutual information

Once the foreground regions are obtained, the correspondence matching can begin. Matching occurs by fixing a correspondence window in one reference image of the pair and sliding a window along the second image to find the best match. Let h and w be the height and width of the image, respectively. For each column i ∈ 0…w, let W_{L,i} be a correspondence window in the left image of height h and width M centered on column i. The width M that produces the best results can be experimentally determined for a given scene. Typically, the value for M is significantly less than the width of an object in the scene. Define a correspondence window W_{R,i,d} in the right image having height h*, the largest spanning foreground distance in the correspondence window, and centered at column i + d, where d is a disparity offset. For each column i, a correspondence value is found for all d ∈ d_min…d_max.

Given the two correspondence windows W_{L,i} and W_{R,i,d}, we first linearly quantize the image to N levels such that

N \approx \sqrt{M h^* / 8}    (10)

where M h* is the area of the correspondence window. The result in (10) follows Thevenaz and Unser's [6] suggestion that this is a reasonable way to determine the number of levels needed to give good results when maximizing the mutual information between image regions.

Now we can compute the quality of the match between the two correspondence windows by measuring the mutual information between them. The mutual information between two image patches is defined as

I(L, R) = \sum_{l,r} P_{L,R}(l, r) \log \frac{P_{L,R}(l, r)}{P_L(l) P_R(r)}    (11)

where P_{L,R}(l, r) is the joint probability mass function (pmf) and P_L(l) and P_R(r) are the marginal pmfs of the left and right image patches, respectively.

The two-dimensional histogram, g, of the correspondence window is utilized to evaluate the pmfs needed to determine the mutual information. The histogram g is an N by N matrix: for each pixel, the quantized intensity levels l and r from the left and right correspondence windows increment g(l, r) by one. Normalizing by the total sum of the histogram gives the probability mass function

P_{L,R}(l, r) = \frac{g(l, r)}{\sum_{l,r} g(l, r)}    (12)

The marginal probabilities can be easily determined by summing P_{L,R}(l, r) over the appropriate dimension:

P_L(l) = \sum_r P_{L,R}(l, r)    (13)

P_R(r) = \sum_l P_{L,R}(l, r)    (14)

Now that we are able to determine the mutual information for two generic image patches, we define the mutual information between two specific image patches as I_{i,d}, where again i is the center of the reference correspondence window and i + d is the center of the second correspondence window. For each column i, we have a mutual information value I_{i,d} for d ∈ d_min…d_max. The disparity d_i^* that best matches the two windows is the one that maximizes the mutual information:

d_i^* = \arg\max_d I_{i,d}    (15)

The process of computing the mutual information for a specific correspondence window is illustrated in Fig. 5. An example plot of the mutual information values over the range of disparities is also shown. The red box in the color image is a visualization of a potential reference correspondence window. Candidate sliding correspondence windows for the thermal image are visualized in green boxes.

Fig. 6. The resulting disparity image D* from combining the left and right disparity images D_L^* and D_S^* as defined in (20). (a) Disparity image. (b) Unregistered. (c) Registered. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

3.4. Disparity voting with sliding correspondence windows

We wish to assign a vote for d_i^*, the disparity that maximizes the mutual information, to all foreground pixels in the reference correspondence window. Define a disparity voting matrix D_L of size (h, w, d_max − d_min + 1), the last dimension spanning the range of disparities. Then, given a column i, for each image pixel that is in both the correspondence window and the foreground map, (u, v) ∈ (W_{L,i} ∩ F_L), we increment the disparity voting matrix at D_L(u, v, d_i^*).

Since the correspondence windows are M pixels wide, each pixel in the image receives M votes for a correspondence matching disparity value. For each pixel (u, v) in the image, D_L can be thought of as a distribution of matching disparities from the sliding correspondence windows. Since it is assumed that all the pixels attributed to a single person are at the same distance from the camera, a good match should have a large number of votes for a single disparity value. A poor match would be widely distributed across a number of different disparity values. Fig. 5(d) shows the disparity voting matrix for a sample row in the color image. The x-axis of the image is the columns i of the input image. The y-axis of the image is the range of disparities d = d_min…d_max, which can be experimentally determined based on scene structure and the areas in the scene where activity will occur. Entries in the matrix correspond to the number of votes given to a specific disparity at a specific column in the image. Brighter areas correspond to a higher vote tally.
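A sketch of the voting accumulation, reusing the best_disparity helper from the previous sketch (the full-height simplification carries over; columns whose windows would leave the image are skipped):

```python
import numpy as np

def disparity_voting(I_L, I_R, F_L, M, d_min, d_max):
    """Accumulate votes D_L(u, v, d) from sliding correspondence windows."""
    h, w = I_L.shape
    D_L = np.zeros((h, w, d_max - d_min + 1), dtype=np.int32)
    lo = M // 2 + max(0, -d_min)            # keep both windows in bounds
    hi = w - M // 2 - max(0, d_max)
    for i in range(lo, hi):
        d_star = best_disparity(I_L, I_R, i, M, h, d_min, d_max)
        # every foreground pixel inside the window votes for d_star
        in_window = F_L[:, i - M // 2: i + M // 2 + 1]
        cols = np.arange(i - M // 2, i + M // 2 + 1)
        for j, c in enumerate(cols):
            D_L[in_window[:, j], c, d_star - d_min] += 1
    return D_L
```

Because M adjacent windows cover each column, every foreground pixel ends up with M votes distributed over the disparity axis, as the text describes.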

The complementary process of correspondence window matching is also performed by keeping the right thermal infrared image fixed. The algorithm is identical to the one described above, switching the left and right designations. The corresponding disparity accumulation matrix is given as D_R.

Once the disparity voting matrices have been evaluated for the entire image, the final disparity registration values can be determined. For both the left and right images, we determine the best disparity value and its corresponding confidence measure as

D_L^*(u, v) = \arg\max_d D_L(u, v, d)    (16)

C_L^*(u, v) = \max_d D_L(u, v, d)    (17)

For a pixel (u, v), the value of C_L^*(u, v) represents the number of times the best disparity value D_L^*(u, v) was voted for. A higher confidence value indicates that the disparity maximized the mutual information for a large number of correspondence windows; in turn, the disparity value is more likely to be accurate than at a pixel with lower confidence. Values for D_R^* and C_R^* are similarly determined. The values of D_R^* and C_R^* are also shifted by their disparities so that they align to the left image:

D_S^*(u, v + D_R^*(u, v)) = D_R^*(u, v)    (18)

C_S^*(u, v + D_R^*(u, v)) = C_R^*(u, v)    (19)

Once the two disparity images are aligned, they can be combined. We have chosen to combine them using an AND operation, which tends to give the most robust results. For all pixels (u, v) such that C_L^*(u, v) > 0 and C_S^*(u, v) > 0,

D^*(u, v) = \begin{cases} D_L^*(u, v), & C_L^*(u, v) \ge C_S^*(u, v) \\ D_S^*(u, v), & C_L^*(u, v) < C_S^*(u, v) \end{cases}    (20)

The resulting image D*(u, v) is the disparity image for all the overlapping foreground object pixels in the image. It can be used to register multiple objects in the image, even at very different depths from the camera. Fig. 6 shows the result of registration for the example frame carried throughout the algorithmic derivation. Fig. 6(a) shows the computed disparity image D*, while Fig. 6(b) shows the initial alignment of the color and thermal images and Fig. 6(c) shows the alignment after shifting the foreground pixels by the resulting disparity image. The thermal foreground pixels are overlaid (in green) on the color foreground pixels (in purple).

The resulting registration in Fig. 6 is successful in aligning the foreground areas associated with each of the three people in the scene. Each person in the scene lies at a different distance from the camera and yields a different disparity value that aligns its corresponding image components.

4. Experimental validation and analysis

The disparity voting registration algorithm was tested using color and thermal data collected with the cameras oriented in the same direction with a baseline of 10 cm. The cameras were placed so that the optical axis was approximately parallel to the ground, imaging a scene approximately 6 m × 6 m. This placement was used to satisfy the assumption that there would be approximately constant disparity across all pixels associated with a specific person in the frame. Such a camera position is reasonable and appropriate for many applications. Video was captured as up to four people moved throughout an indoor environment. For these specific experiments, foreground segmentation in the visual imagery was done using the codebook model proposed by Kim et al. [22]. In the thermal imagery, the foreground is obtained using an intensity threshold under the assumption that the people in the foreground are hotter than the background. This approach provided reasonable segmentation in each image. In cases where segmentation can only be obtained for one modality, the disparities can be computed with only that modality as the reference, at the cost of less robustness. We will show successful registration for examples of varying segmentation quality. The goal was to obtain registration results for various configurations of people, including different positions, distances from the camera, and levels of occlusion.

Examples of successful registration for additional frames are shown in Fig. 7. Columns (a) and (b) show the input color and thermal images, while column (c) illustrates the initial registration of the objects in the scene and column (d) shows the resulting registration overlay after the disparity voting has been performed. These examples show the success of the disparity voting algorithm in handling occlusion and properly registering multiple objects at widely disparate depths from the camera.

4.1. Algorithmic evaluation

We have analyzed the registration results of our disparity voting algorithm for more than 2000 frames of captured video. To evaluate the registration, we define a frame as correct when the color and infrared data corresponding to each foreground object in the scene are visibly aligned. If one or more objects in the scene are not visibly aligned, then the registration is deemed incorrect for the entire frame. Table 2 shows the results of this evaluation. The data is broken down into groups based on the number of objects in the scene.

This analysis shows that when there was no visible occlusion in the scene, registration was correct 100% of the time. We further break down the analysis to consider only the frames where there are occluding objects in the scene. Under these conditions, the registration success of the disparity voting algorithm is shown in Table 3. The registration results for the occluded frames are still quite high, with most errors occurring during times of near total occlusion.

4.2. Accuracy evaluation using ground truth disparity values

In order to demonstrate the accuracy of our disparity voting algorithm (DV) in handling occlusions, we offer a quantitative comparison to ground truth. It is our contention that the disparity voting algorithm will provide good registration results during occlusions, when initial segmentation gives regions that contain merged objects. Our disparity voting algorithm makes no assumptions about the assignment of pixels to individual objects, only that a reasonable segmentation can be obtained. We demonstrate that the disparity voting registration can successfully register all objects in the scene even through occlusions. We also show the results for the bounding box approach (BB) [17] for completeness.

We generate the ground truth by manually segmenting the regions that correspond to foreground for each image. We then determine the ground truth disparity by individually matching each manually segmented object in the scene. This ground truth disparity image allows us to directly and quantitatively compare the registration success of the disparity voting algorithm and the bounding box approach. By comparing the registration results to the ground truth disparities, we are able to quantify the success of each algorithm and show that the disparity voting algorithm outperforms the bounding box approach for occluding object regions.

Fig. 8 illustrates the ground truth disparity comparison tests. Column (a) shows the ground truth disparity, column (b) shows the disparity generated using the bounding box (BB) algorithm, and column (c) shows the disparity generated using the disparity voting (DV) algorithm. Fig. 9 plots the absolute difference in disparity values from the ground truth for each corresponding row in Fig. 8. The BB results are plotted in dotted red, while the DV results are plotted in solid blue. Notice how the two algorithms perform identically to ground truth in the first row, as there are no occlusion regions. The subsequent examples all have occlusion regions, and the DV approach more closely follows ground truth than the BB approach. The BB registration results have multiple objects registered at the same depth even though the ground truth shows that they are at separate depths. Our disparity voting algorithm is able to determine the distinct ground truth disparities for different objects, and the |ΔD| plots show that the DV algorithm is quantitatively closer to the ground truth, with most registration errors within one pixel of ground truth and larger errors usually occurring only in small portions of the image. On the other hand, when errors occur in the bounding box approach, the resulting disparity offset error is large and extends across the entire erroneously registered object.

4.3. Comparative study of registration algorithms with non-ideal segmentation

We perform a qualitative evaluation using the real segmentations generated from codebook background subtraction in the color image and intensity thresholding in the thermal image. These common segmentation algorithms only give foreground pixels and make no attempt to discern the structure of objects in the scene. Fig. 10 illustrates several examples that compare the registration results of the disparity voting and bounding box algorithms. Notice how the disparities for the bounding box (BB) algorithm in row (5) are constant for the entire occlusion region even though the objects are clearly at very different disparities. The disparity results for our disparity voting algorithm in row (6) show distinct disparities in the occlusion regions that correspond to the appropriate objects in the scene. Visual inspection of rows (7) and (8) shows that the resulting registered alignment from the disparity values is more accurate for the DV approach.

Fig. 7. Registration results using disparity voting algorithm for example frames. (a) Color. (b) Infrared. (c) Unregistered. (d) Registered. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)


Table 2
Registration results for disparity voting algorithm with multiple people in a scene

No. objects in frame | No. frames correct | Total frames | % Correct
1                    | 55                 | 55           | 100.00
2                    | 171                | 172          | 99.42
3                    | 1087               | 1111         | 97.84
4                    | 690                | 720          | 95.83
Total                | 2003               | 2058         | 97.33

Table 3
Registration results for disparity voting algorithm with multiple people in a scene: frames with occlusion

No. objects in frame | No. frames correct | Total frames | % Correct
2                    | 51                 | 52           | 98.08
3                    | 653                | 677          | 96.45
4                    | 581                | 611          | 95.09
Total                | 1285               | 1340         | 95.90



Fig. 11 shows the registration alignment for each algorithm in closer detail for a selection of frames. Notice how the disparity voting approach is able to align each object in the frame, while the bounding box approach has alignment errors due to the fact that the segmentation of the image yielded bounding boxes that contained more than one object. Clearly, disparity voting is able to handle the registration in these occlusion situations, and the resulting alignment appears qualitatively better than the bounding box approach.

Fig. 8. Comparison of bounding box (BB) approach to the proposed disparity voting algorithm for ground truth segmentation. (a) Ground truth. (b) BB disparity. (c) DV disparity.

4.4. Robustness evaluation

We demonstrate the robustness of our algorithm by applying it to another set of data taken of a different scene with a different set of cameras. For these experiments, we have up to six people move through an approximately 6 m × 6 m environment. The cameras are arranged with a 10 cm baseline and are calibrated and rectified as described in Section 3.1. Again, segmentation is performed using the codebook background model for the color imagery and intensity thresholding for the thermal imagery. Correspondence window sizes and threshold values were kept constant from past experiments.

Fig. 12 shows successful registration for example frames containing an increasing number of people in the scene. Column (c) of the figure shows distinct levels of alignment disparity for each person in the scene and column (e) shows the resulting registered alignment. Notice how the disparity voting algorithm is able to properly determine the disparities necessary to align the color and thermal image in situations with multiple people and multiple levels of occlusion. Fig. 13 shows detailed examples of the registration alignment. Note how image features, especially the facial regions, appear well aligned in the images.

5. Multimodal video analysis for person tracking: basic framework and experimental study

We have shown that the disparity voting algorithm for multimodal registration is a robust approach to estimating the alignment disparities in scenes with multiple occluding




Fig. 9. Plots of |ΔD| from ground truth for each example in Fig. 8. Bounding box errors for an example row are plotted in dotted red, while errors in disparity voting registration are plotted in solid blue. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 10. Comparison of BB algorithm [17] to the proposed disparity voting (DV) algorithm for a variety of occlusion examples using non-ideal segmentation: (1) the color image, (2) the color segmentation, (3) the thermal image, (4) the thermal segmentation, (5) the BB disparity image, (6) the DV disparity image, (7) the BB registration, (8) the DV registration. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)


Fig. 11. Details of registration alignment errors in the bounding box registration approach and corresponding alignment success for the disparity voting (DV) algorithm for several occlusion examples using non-ideal segmentation. (a) BB registration. (b) DV registration.


people. The disparities generated from the registration process yield values that can be used to differentiate the people in the room. It is with this in mind that we investigate the use of multimodal disparity as a feature for tracking people in a scene.

Tracking human motion using computer vision approaches is a well-studied area of research and a good

Fig. 12. Examples illustrating the robustness of the disparity voting algorithm in registering multiple people in a scene. Each row contains an increasing number of people. Column (e) illustrates the registration using disparity voting. It is a marked improvement over the initial, unregistered image in column (d). (a) Color. (b) Thermal. (c) Disparity. (d) Unregistered. (e) Registered.

survey by Moeslund and Granum [23] gives lucid insight into the issues, assumptions and limitations of a large variety of tracking approaches. One approach, disparity based tracking, has been investigated for conventional color stereo cameras and has proven quite robust in localizing and maintaining tracks through occlusion, as the tracking is performed in 3D space by transforming the stereo image estimates into a plan-view occupancy map of the imaged space [24]. We wish to explore the feasibility of using such approaches to tracking with the disparities generated from disparity voting registration. An example sequence of frames in Fig. 14 illustrates the type of people movements we aim to track. The sequence has multiple people occupying the imaged scene. Over the sequence, there are multiple occlusions of people at different depths. The registration disparities that are used to align the color and thermal images can be used as a feature for tracking people through these occlusions and maneuvers.
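To make the plan-view idea concrete, the sketch below bins foreground pixels by image column and registration disparity into an occupancy histogram in the spirit of [24]; the bin counts and disparity range are illustrative assumptions, not values from our testbed.

import numpy as np

def plan_view_map(disparity, fg_mask, d_max=16, x_bins=64, d_bins=16):
    # Collect the (column, disparity) coordinates of foreground pixels.
    ys, xs = np.nonzero(fg_mask)
    ds = disparity[ys, xs].astype(float)
    keep = ds > 0  # ignore pixels without a voted disparity
    xs, ds = xs[keep], ds[keep]
    # Quantize into plan-view cells: lateral position vs. depth (disparity).
    x_idx = (xs * x_bins // disparity.shape[1]).astype(int)
    d_idx = np.clip(ds * d_bins / d_max, 0, d_bins - 1).astype(int)
    occupancy = np.zeros((d_bins, x_bins), dtype=np.int32)
    np.add.at(occupancy, (d_idx, x_idx), 1)
    return occupancy  # peaks correspond to people at distinct depths

Peaks in such a map separate people both laterally and in depth, which is what allows tracks to persist through image-plane occlusions.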

Fig. 15 shows an algorithmic framework for multimodal person tracking. In tracking approaches, representative features are typically extracted from all available images in the setup [25]. Features are used to associate tracks from frame to frame, and the output of the tracker is often used to guide subsequent feature extraction.



Fig. 13. Detailed examples of successful registration alignment using disparity voting.


All of these algorithmic modules are imperative for reliable and robust tracking. For our initial investigations, we will focus on the viability of registration disparity as a tracking feature.

In order to determine the accuracy of the disparity estimates for tracking, we first calibrate the scene. This is done by having a person walk around the testbed area, stopping at preset locations in the scene. At each location we measure the disparity generated from our algorithm and use that as ground truth for analyzing the disparities generated when there are more complex scenes with multiple people and occlusions. Fig. 16(a) shows the variable baseline multimodal stereo rig and Fig. 16(b) shows the ground truth disparity range for the testbed from the calibration experiments captured with this rig.
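One plausible way to turn such spot measurements into the dense disparity range map of Fig. 16(b) is interpolation between the preset locations; the sketch below uses SciPy's griddata, and the sample positions and disparity values are purely illustrative, not our calibration data.

import numpy as np
from scipy.interpolate import griddata

# (x, y) floor positions in meters where the person stood, and the
# disparity measured there by the algorithm (illustrative values only).
points = np.array([[0, 0], [0, 6], [6, 0], [6, 6], [3, 3]], dtype=float)
disparities = np.array([9.0, 2.0, 9.0, 2.0, 4.0])

# Dense ground truth disparity map over the 6 m x 6 m testbed.
gx, gy = np.mgrid[0:6:60j, 0:6:60j]
range_map = griddata(points, disparities, (gx, gy), method='linear')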

To show the viability of registration disparity as a tracking feature in a multimodal stereo context, we compare ground truth positional estimates to those generated from the disparity voting algorithm. Lateral position information for each track was hand segmented by clicking on the center point of the person's head in each image. This is a reasonable method, as robust head detection algorithms could be implemented for both

Fig. 14. Example input sequence for multiperson tracking experiments. Notice occlusions, scale, appearance and disparity variations. (a) Frame 0. (b) Frame 20. (c) Frame 40. (d) Frame 60. (e) Frame 80. (f) Frame 100. (g) Frame 120. (h) Frame 140.

color and thermal imagery (skin-tone, hot spots, head template matching). Approaches such as vertical projection or v-disparity could also be used to determine the locations of people in the scene. Ground truth disparity estimates were generated by visually determining the disparity based on the person's position relative to the ground truth disparity range map as shown in Fig. 16. Experimental disparities were generated using the disparity voting algorithm, with the disparity of each person determined from disparity values in the head region. A moving average of 150 ms was used to smooth instantaneous disparity estimates.
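A minimal sketch of this smoothing step, assuming a 30 fps frame rate (used only to convert 150 ms into a window length), a hypothetical head_box supplied by detection or hand labeling, and the median as the per-frame head-region statistic:

from collections import deque
import numpy as np

FPS = 30.0                        # assumed frame rate
WIN = max(1, round(0.150 * FPS))  # 150 ms moving-average window, in frames

class DisparityTrackFeature:
    def __init__(self):
        self.history = deque(maxlen=WIN)

    def update(self, disparity_img, head_box):
        x0, y0, x1, y1 = head_box   # head region for one person
        vals = disparity_img[y0:y1, x0:x1]
        vals = vals[vals > 0]       # keep only voted disparities
        if vals.size:
            self.history.append(float(np.median(vals)))
        # Moving average over roughly the last 150 ms of estimates.
        return sum(self.history) / len(self.history) if self.history else None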

Fig. 17 shows the track patterns and ground truth for the example sequence in Fig. 14. The ground truth is plotted in solid colors for each person in the sequence, while the disparity estimates from the disparity voting algorithm are shown in corresponding colored symbols with dotted lines connecting the estimates. Fig. 17(a) is a "plan-view"-like representation of the tracks, illustrating the movements and disparity changes of the people in the testbed. Fig. 17(b) shows a time-varying version of the same data, with the frame number plotted in the third dimension.

The plots in Fig. 17 show that the disparities generated from the disparity voting registration reasonably follow the ground truth tracks. As the green tracked person moves



[Fig. 17 plot data omitted; axes: lateral position (pixels × 10^-1), disparity (pixels), and, in (b), frame number.]

Fig. 17. Tracking results showing close correlation between ground truth (in solid colors) and disparity tracked estimates (in dotted colors). Each color shows the path of each person in the sequence. (a) Track patterns and ground truth for four person tracking experiment. (b) Time varying track patterns and ground truth for four person tracking experiment. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 15. Algorithmic flowchart for multiperson tracking.


behind and becomes occluded by the blue tracked person, we see that the disparities generated when he re-emerges from the occlusion are in line with the ground truth disparities and can be used to re-associate the track after the occlusion.
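As a toy illustration of such re-association, the snippet below assigns a re-emerging detection to the dormant track whose last smoothed disparity is nearest, subject to a gate; the gate value and track labels are invented for the example.

def reassociate(detection_disp, dormant_tracks, gate=1.5):
    # Match the detection to the dormant track with the closest
    # last-known disparity; gate is in disparity pixels.
    best_id, best_err = None, gate
    for track_id, last_disp in dormant_tracks.items():
        err = abs(detection_disp - last_disp)
        if err < best_err:
            best_id, best_err = track_id, err
    return best_id  # None if no dormant track lies within the gate

# e.g. reassociate(4.2, {'green': 4.0, 'blue': 7.5}) returns 'green'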

Errors from ground truth are particularly apparent when people are further from the camera. This is because of the non-linearity of the disparity distribution: there are more distinct disparities nearer to the camera, and as one moves deeper into the scene in Fig. 16, the change in disparity for the same change in distance is much smaller. At these distances, errors of even one disparity shift are very pronounced. Conventional stereo algorithms typically use approaches that give subpixel accuracy, but the current implementation of our disparity voting algorithm only gives pixel-level disparity shifts. While this may be acceptable for registration alignment, refinement steps are necessary to make disparity a more robust tracking feature.
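The standard rectified-stereo relation makes this non-linearity explicit. With f the focal length in pixels, B the baseline, d the disparity and Z the depth (standard background, not a derivation from this paper), the sensitivity of depth to a disparity error Δd is

Z = \frac{fB}{d},
\qquad
\left|\frac{\partial Z}{\partial d}\right| = \frac{fB}{d^{2}} = \frac{Z^{2}}{fB},
\qquad
\Delta Z \approx \frac{Z^{2}}{fB}\,\Delta d,

so a one-pixel disparity error corresponds to a depth error that grows quadratically with distance from the camera.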

Fig. 16. (a) Variable baseline multimodal stereo rig, (b) experimentally determined disparity range for testbed. The disparities were computed by determining the disparities for a single person standing at predetermined points in the imaged scene.


Approaches that use multiple primitives [26], such as edges, shapes, and silhouettes, could be used to augment the accuracy of the disparity voting algorithm. Additionally, using multiple tracking features could provide additional measurements that can be used to boost the association accuracy.

6. Discussion and concluding remarks

Multimodal imagery applications for human analysis span a variety of application domains, including medical [1], in-vehicle safety systems [2] and long-range surveillance [3]. Often, the registration algorithms these types of systems employ do not operate on data that has multiple objects and multiple depths that are significant relative to their distance from the camera. It is in this realm, including close-range surveillance [20] and pedestrian detection applications [27], that we believe disparity voting registration techniques and corresponding tracking algorithms will prove useful.

In this paper we have provided an analysis of the approaches to multimodal image registration and detailed the assumptions, applicability and limitations of each. We then introduced and analyzed a method for registering multimodal images with occluding objects in the scene. By using the disparity voting approach, an analysis of over 2000 frames yielded a registration success rate of over 97%, with a 96% success rate when considering only occlusion examples. Additionally, ground truth accuracy evaluations illustrate how the disparity voting algorithm provides accurate registration for multiple people in scenes with occlusion. Comparative studies show the improvements upon the accuracy and robustness of previous bounding box techniques in both a quantitative and qualitative manner. We have presented a framework for tracking and have shown promising experimental studies that suggest that disparity voting results can be used as a feature that will allow for the differentiation of people in a scene and give accurate tracking associations in complex scenes with multiple people and occlusions.

References

[1] P. Thevenaz, M. Bierlaire, M. Unser, Halton sampling for image registration based on mutual information, Sampling Theory in Signal and Image Processing (in press). Available from: <http://bigwww.epfl.ch/preprints/thevenaz0602p.html>.

[2] M.M. Trivedi, S.Y. Cheng, E.M.C. Childers, S.J. Krotosky, Occupant posture analysis with stereo and thermal infrared video: algorithms and experimental evaluation, IEEE Trans. Veh. Technol. 53 (6) (2004) 1698–1712.

[3] J. Davis, V. Sharma, Fusion-based background-subtraction using contour saliency, in: IEEE CVPR Workshop on Object Tracking and Classification beyond the Visible Spectrum, 2005.

[4] P. Viola, W.M. Wells, Alignment by maximization of mutual information, Int. J. Comput. Vis. 24 (2) (1997) 137–154.

[5] G. Egnal, Mutual information as a stereo correspondence measure, Tech. Rep. MS-CIS-00-20, University of Pennsylvania, 2000.

[6] P. Thevenaz, M. Unser, Optimization of mutual information for multiresolution image registration, IEEE Trans. Image Process. 9 (12) (2000) 2083–2099.

[7] G.L. Foresti, C.S. Regazzoni, P.K. Varshney, Multisensor Surveillance Systems: The Fusion Perspective, Springer Press, 2003.

[8] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2002.

[9] C.O. Conaire, E. Cooke, N. O'Connor, N. Murphy, A. Smeaton, Background modeling in infrared and visible spectrum video for people tracking, in: IEEE CVPR Workshop on Object Tracking and Classification beyond the Visible Spectrum, 2005.

[10] M. Irani, P. Anandan, Robust multi-sensor image alignment, in: Sixth International Conference on Computer Vision, 1998.

[11] E. Coiras, J. Santamaria, C. Miravet, Segment-based registration technique for visual-infrared images, Opt. Eng. 39 (1) (2000) 282–289.

[12] J. Han, B. Bhanu, Detecting moving humans using color and infrared video, in: IEEE Inter. Conf. on Multisensor Fusion and Integration for Intelligent Systems, 2003.

[13] M. Itoh, M. Ozeki, Y. Nakamura, Y. Ohta, Simple and robust tracking of hands and objects for video-based multimedia production, in: IEEE Conf. on Multisensor Fusion and Integration for Intelligent Systems, 2003.

[14] G. Ye, Image registration and super-resolution mosaicing, <http://www.library.unsw.edu.au/~thesis/adt-ADFA/uploads/approved/adt-ADFA20051007.144609/public/01front.pdf> (2005).

[15] X. Ju, J.-C. Nebel, J.P. Siebert, 3D thermography imaging standardization technique for inflammation diagnosis, in: Proc. SPIE, Photonics Asia, 2004.

[16] M. Bertozzi, A. Broggi, M. Felisa, G. Vezzoni, M.D. Rose, Low-level pedestrian detection by means of visible and far infra-red tetra-vision, in: IEEE Conf. on Intelligent Vehicles, 2006.

[17] H. Chen, P. Varshney, M. Slamani, On registration of regions of interest (ROI) in video sequences, in: IEEE Conf. on Advanced Video and Signal Based Surveillance (AVSS'03), 2003.

[18] J. Davis, V. Sharma, Robust detection of people in thermal imagery, in: IEEE 17th Inter. Conf. on Pattern Recognition, 2004.

[19] J. Davis, V. Sharma, Robust background-subtraction for person detection in thermal imagery, in: IEEE Conf. on Computer Vision and Pattern Recognition Workshops, 2004.

[20] S.J. Krotosky, M.M. Trivedi, Registration of multimodal stereo images using disparity voting from correspondence windows, in: IEEE Conf. on Advanced Video and Signal based Surveillance (AVSS'06), 2006.

[21] J.-Y. Bouguet, Camera Calibration Toolbox for Matlab, <http://www.vision.caltech.edu/bouguetj/calib_doc/>.

[22] K. Kim, T. Chalidabhongse, D. Harwood, L. Davis, Real-time foreground-background segmentation using codebook model, Real-Time Imaging 11 (3) (2005) 172–185.

[23] T.B. Moeslund, E. Granum, A survey of computer vision-based human motion capture, Comput. Vis. Image Und. 81 (3) (2001) 231–268.

[24] M. Harville, D. Li, Fast, integrated person tracking and activity recognition with plan-view templates from a single stereo camera, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2004.

[25] K. Huang, M.M. Trivedi, Video arrays for real-time tracking of person, head, and face in an intelligent room, Mach. Vis. Appl. 14 (2) (2003) 103–111.

[26] S. Marapane, M.M. Trivedi, Multi-primitive hierarchical (MPH) stereo analysis, IEEE Trans. Pattern Anal. Mach. Intell. 16 (3) (1994) 227–240.

[27] S.J. Krotosky, M.M. Trivedi, Multimodal stereo image registration for pedestrian detection, in: IEEE Conf. on Intelligent Transportation Systems, 2006.