
Real Time Person Detection and Tracking by Mobile Robots using RGB-D Images

Duc My Vo, Lixing Jiang and Andreas Zell

Abstract— Detecting and tracking humans are key problems for human-robot interaction. In this paper we present an algorithm for mobile robots to detect and track people reliably, even when humans pass through different illumination conditions, change poses frequently, and are often occluded. We have improved the performance of face and upper body detection to quickly find people in each frame. This combination enhances the efficiency of human detection in dealing with partial occlusions and changes in human poses. To cope with more complex changes of human poses and occlusions, we additionally combine a fast compressive tracker with a Kalman filter to track the detected humans. Experimental results on a challenging database show that our method achieves high performance and can run in real time on mobile robots.

I. INTRODUCTION

Detecting and tracking multiple humans on mobile robot platforms remains a challenging task. State-of-the-art algorithms have not yet solved the core problems of human detection and tracking. First, the mobile robot often has to track moving humans across a large variability of pose and appearance. Second, it frequently has to cope with full occlusions or self-occlusions. Moreover, due to frequent movement, the robot often changes its field of view, causing fast changes of the human appearance from frame to frame; thus it is not easy to reliably track people over long periods of time. Finally, the mobile robot has to interact with many people in real time, which limits the computational budget of the tracking system.

Taking inspiration from several state-of-the-art approaches [1], [2], [3], we propose a new algorithm for detecting and tracking multiple people on mobile robots. The first important component is a set of person detectors that helps the robot reliably find the location of humans in each frame and update the human trackers. This set consists of a face detector and an upper body detector. The face detector is highly reliable when the human face is visible, and the upper body detector has significant advantages when dealing with occlusion of the lower body or the face. Due to complicated changes in human pose and appearance, our detectors cannot find the position of a person in every frame. Hence, to keep tracking people efficiently, we use a tracking method based on the combination of a fast compressive tracker and a Kalman filter. This combination enhances the ability of our system to adapt to human changes of pose, scale and appearance as well as to partial or full occlusion. In addition, we utilize the depth information from RGB-D images to reduce computational costs and false positives, resulting in real-time performance of human detection and tracking on the mobile robot.

D. M. Vo and L. Jiang are with the Chair of Cognitive Systems, headed by Prof. A. Zell, Computer Science Department, University of Tübingen, Sand 1, D-72076 Tübingen, Germany {duc-my.vo, andreas.zell}@uni-tuebingen.de

Fig. 1. Example tracking result of our algorithm.

The remaining parts of this paper are organized as follows: in Section II we review the state-of-the-art algorithms of person detection and tracking which motivated our research. In Section III, our method is presented in detail. In Section IV the experimental results obtained from the databases are described. We conclude this paper in Section V, mentioning our intentions with regard to future work.

II. RELATED WORK

Although there are many approaches to tracking multiple humans by mobile robots, such as sample-based joint probabilistic data association filters [4] and Kalman filters [5], most of them do not adapt well to human changes of pose, scale and appearance, or to partial and full occlusions. State-of-the-art algorithms for human detection [6], [7] make a great contribution to tracking-by-detection approaches [8], [9], thus significantly improving tracking systems on mobile robots. Choi et al. [1] proposed a method for detecting and tracking people by mobile robots based on Reversible Jump Markov Chain Monte Carlo Particle Filtering (RJ-MCMC). Because it detects humans from relatively reliable observation cues in each frame, this method was shown to be robust to complicated changes of human poses and to partial occlusions. These observation cues include a human detector using Histograms of Oriented Gradients [7], a face detector using the Viola-Jones method of object detection [6], and detectors for skin, motion and depth-based shape. However, the computational costs of these detectors and of the RJ-MCMC tracking algorithm are very high; for human-robot interaction, the computational complexity of this algorithm does not meet the requirement of real-time performance.

Fig. 2. Flow chart of our approach.

Other promising approaches to human tracking are online learning methods that handle the complex appearance variation of human poses. Examples of these algorithms include incremental learning [10], online multiple instance learning [11] and visual tracking using L1 minimization [12]. To deal with appearance changes of the object and its partial occlusion, Zhang et al. [3] proposed compressive tracking. This method uses compressed features, extracted from the tracked object, to update a simple Bayes classifier online. As a result, this classifier is able to quickly adapt to the object's changes of pose, rotation, deformation, and self-occlusion. In addition, the method is suitable for real-time applications due to its low computational costs. Since both the mobile robot and humans often move and change their directions and orientations, an effective improvement of the compressive tracker is a good basis for adapting to all these changes and reliably tracking humans.

III. APPROACH

Figure 2 illustrates an overview of the proposed person detection and tracking framework for mobile robots. Human detection is applied in each new frame. The detection module comprises a face detector and an upper body detector. In order to meet the real-time requirements of the mobile robot, the two detectors alternate: one is used in the current frame and the other in the next frame, and so on.

In the stage of face detection, depth-based skin color segmentation is used to speed up the search for the face and to reduce the false positive rate. Since the search areas in each frame are reduced significantly, the Viola-Jones method of face detection [6] can detect humans quickly and reliably.

Fig. 3. Flowchart of segmentation steps for upper body and face detection.

In the stage of upper body detection, an upper body detector is trained on the CALVIN dataset [2]. Since searching for humans in the whole image is a time-consuming operation, we decrease the search areas in each image and estimate the human scales necessary to search in those areas. For this purpose, the depth information is used to segment potential areas where humans probably appear and to skip non-human areas in each frame. As a result, the trained upper body detector can quickly find the humans in images in real time.
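To make the scale estimation concrete, here is a minimal sketch that derives an expected upper-body height in pixels from the measured depth with a pinhole camera model; the focal length, the nominal 0.6 m upper-body height, the 20% tolerance and the function name estimateUpperBodyScale are illustrative assumptions, not values taken from the paper.

```cpp
#include <cstdio>

// Minimal sketch (assumptions: pinhole model, Kinect-like focal length,
// a nominal 0.6 m upper-body height). Not the authors' exact procedure.
struct ScaleRange {
    int minPixels;
    int maxPixels;
};

// Estimate how tall (in pixels) an upper body at the given depth should
// appear, and derive a tolerance band used to restrict detector scales.
ScaleRange estimateUpperBodyScale(float depthMeters,
                                  float focalLengthPx = 525.0f,   // assumed
                                  float bodyHeightMeters = 0.6f)  // assumed
{
    // Pinhole projection: pixel size = f * metric size / depth.
    float expected = focalLengthPx * bodyHeightMeters / depthMeters;
    ScaleRange r;
    r.minPixels = static_cast<int>(0.8f * expected);  // allow 20% slack
    r.maxPixels = static_cast<int>(1.2f * expected);
    return r;
}

int main() {
    // A person segmented at 2.5 m from the camera.
    ScaleRange r = estimateUpperBodyScale(2.5f);
    std::printf("search window height: %d..%d px\n", r.minPixels, r.maxPixels);
    return 0;
}
```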

Our tracking method is based on a fast compressive tracker and a Kalman filter. The new position of the tracked person is taken either from the output of the fast compressive tracker or from the predicted position of the Kalman filter, depending on whether large occlusion regions are found in the current frame. If a complete occlusion is found, the Kalman filter predicts the next position of the temporarily occluded human. If no significant occlusion is recognized, the fast compressive tracker provides a more accurate position than the Kalman filter. In our research, the depth information has proven useful for detecting occlusions.

A. Face detection

Fig. 4. Tracking steps of a fast compressive tracker.

As mentioned in our previous work [13], if the frontal face is visible we apply our face detector to quickly and reliably detect people. As shown in Figure 3, geometric constraints, navigation information and depth-based skin color segmentation are used to make our face detector much faster and more accurate. Our face detection involves three basic steps. First, in order to reduce computational costs we use a set of sampling points spanning the whole image to collect color, texture and depth information. Second, the constraints from geometry and navigation information are used to remove the background. Finally, skin detection and depth-based skin color segmentation are applied around the filtered sampling points to find the potential regions in which the face detector localizes the face position. In addition, we can speed up face detection by limiting the range of facial scales, as mentioned in [13], and thus estimate the sizes of the humans that are possibly present in these regions.
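As a rough illustration of this stage, the following sketch runs an OpenCV Viola-Jones cascade only inside a candidate region and restricts the face sizes it searches for; the cascade file, the region of interest and the size bounds are placeholders chosen for the example, not settings from the paper.

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Minimal sketch: Viola-Jones face detection restricted to a segmented
// region of interest and to a depth-derived range of face sizes.
// The cascade path, ROI and size limits are illustrative assumptions.
std::vector<cv::Rect> detectFacesInRegion(const cv::Mat& bgr,
                                          const cv::Rect& roi,
                                          int minFacePx, int maxFacePx)
{
    static cv::CascadeClassifier cascade("haarcascade_frontalface_alt.xml");

    cv::Mat gray;
    cv::cvtColor(bgr(roi), gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    std::vector<cv::Rect> faces;
    // Limiting minSize/maxSize from the estimated human scale keeps the
    // scale pyramid small and speeds up the search considerably.
    cascade.detectMultiScale(gray, faces, 1.1, 3, 0,
                             cv::Size(minFacePx, minFacePx),
                             cv::Size(maxFacePx, maxFacePx));

    // Map detections back to full-image coordinates.
    for (cv::Rect& f : faces) { f.x += roi.x; f.y += roi.y; }
    return faces;
}
```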

B. Upper body detection

Similar to the face detection step above, geometric constraints, navigation information and depth-based segmentation are helpful for removing the background and reducing the search areas, as shown in Figure 3. As a result, we obtain a small set of search areas in which the upper body detector, based on Histograms of Oriented Gradients [7], is applied to detect humans. As in face detection, we estimate the sizes of humans in these search areas in order to significantly reduce computational costs.

The upper body detector is trained using a linear SVM. Each search window is divided into cells which are used to compute a Histogram of Oriented Gradients. The detector classifies the search window at every position and scale within the reduced search areas to find the human location.
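A minimal OpenCV sketch of this kind of HOG plus linear-SVM sliding-window classification is given below; the window geometry and the pre-trained coefficient vector are assumptions made for illustration, since the actual detector is trained on the CALVIN upper-body data rather than on OpenCV's default pedestrian model.

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Sketch of HOG + linear SVM upper-body detection inside one search area.
// Window size and SVM coefficients are assumed; the real detector is
// trained on the CALVIN dataset [2].
std::vector<cv::Rect> detectUpperBodies(const cv::Mat& searchArea,
                                        const std::vector<float>& svmCoeffs)
{
    // 64x64 windows are a plausible aspect ratio for upper bodies (assumed).
    cv::HOGDescriptor hog(cv::Size(64, 64),   // window size
                          cv::Size(16, 16),   // block size
                          cv::Size(8, 8),     // block stride
                          cv::Size(8, 8),     // cell size
                          9);                 // orientation bins
    hog.setSVMDetector(svmCoeffs);            // linear SVM weights + bias

    std::vector<cv::Rect> found;
    // Slide the window over every position and scale of the reduced area.
    hog.detectMultiScale(searchArea, found);
    return found;
}
```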

C. Fast compressive tracking

If there are large changes in the appearance of humans caused by illumination, different poses, or partial occlusion, the data association temporarily fails. When such failures happen, the fast compressive tracker plays a very important role, following the human and adapting to those complex changes as well as to partial occlusion.

To keep tracking the human, the fast tracker uses a search window, which is updated by each corresponding detection, as shown in Figure 4. First, we collect a set of image samples near the current human location in the search window. Then we estimate the distance between the human appearing in the search window and the camera by sampling the depth information in this window. Similar to the segmentation step presented in our previous work [13], we use a set of sampling points spanning the whole search window to collect depth information. Depth-based segmentation is applied around these sampling points to find the human, which is the largest segmented region in the search window. The distance between the human and the camera is estimated as the average depth value of the sampling points belonging to that segmented region. In order to filter out samples, the same depth-based segmentation is applied to all samples to segment the objects in each of them. Because an estimate of the distance between the human and the camera exists, a sample can be filtered out if it contains no segmented object closer than 0.5 meters to the tracked human's location.
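The sample filtering step could look roughly like the following sketch, which keeps only candidate windows whose segmented foreground depth lies within 0.5 m of the tracked person's estimated depth; apart from the 0.5 m figure, the data structure and function names are assumptions.

```cpp
#include <cmath>
#include <vector>

// Sketch of depth-based sample filtering in the fast compressive tracker.
// CandidateSample is a hypothetical structure: it carries the average depth
// of the largest segmented object inside the candidate window (or a negative
// value when nothing was segmented).
struct CandidateSample {
    int x, y;               // window position
    float segmentedDepth;   // average depth of segmented object, meters
};

std::vector<CandidateSample>
filterSamplesByDepth(const std::vector<CandidateSample>& samples,
                     float trackedPersonDepth)  // estimated from the search window
{
    const float kMaxDepthGap = 0.5f;  // from the paper: 0.5 m tolerance
    std::vector<CandidateSample> kept;
    for (const CandidateSample& s : samples) {
        if (s.segmentedDepth < 0.0f)
            continue;  // nothing segmented in this window: discard
        if (std::fabs(s.segmentedDepth - trackedPersonDepth) <= kMaxDepthGap)
            kept.push_back(s);  // plausible candidate, keep for classification
    }
    return kept;
}
```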

After removing a large number of negative samples, the remaining samples are kept for classification. Each filtered sample $z \in \mathbb{R}^{w \times h}$ is convolved with multi-scale filters $h_{1,1}, \ldots, h_{w,h}$ defined as

$$h_{i,j}(x, y) = \begin{cases} 1, & 1 \le x \le i,\ 1 \le y \le j \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

where $i$ and $j$ are the sizes of a filtered image. These filtered images are concatenated into a feature vector $x = (x_1, \ldots, x_m)^T \in \mathbb{R}^m$, where $m = (wh)^2$.

Since this feature vector $x$ is very high dimensional, we use a random projection to transform $x \in \mathbb{R}^m$ into a lower dimensional space $v \in \mathbb{R}^n$,

$$v = Rx \qquad (2)$$

where $R \in \mathbb{R}^{n \times m}$ is a random projection matrix. The elements of this matrix are defined as

$$r_{ij} = \sqrt{s} \times \begin{cases} 1, & \text{with probability } \frac{1}{2s} \\ 0, & \text{with probability } 1 - \frac{1}{s} \\ -1, & \text{with probability } \frac{1}{2s} \end{cases} \qquad (3)$$

where $s = m/4$.

Using a naive Bayes classifier on each feature vector $v \in \mathbb{R}^n$, we find the new position of the tracked human in the current frame as the location with the maximal classifier response. All elements in $v$ are modeled by the classifier

$$H(v) = \log\!\left(\frac{\prod_{i=1}^{n} p(v_i \mid y = 1)\, p(y = 1)}{\prod_{i=1}^{n} p(v_i \mid y = 0)\, p(y = 0)}\right) = \sum_{i=1}^{n} \log\!\left(\frac{p(v_i \mid y = 1)}{p(v_i \mid y = 0)}\right) \qquad (4)$$

where $p(y = 1) = p(y = 0)$, and $y \in \{0, 1\}$ is a binary variable representing the sample label.

After using the classifier $H$ in (4) to find the tracking location, the classifier is updated so that it adapts to human changes of rotation, occlusion and scale, as well as to complex changes of background and illumination.
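As a sketch of equations (2)-(4), the code below builds the sparse random projection matrix, compresses a feature vector and evaluates the naive Bayes response under Gaussian class-conditional models; the dimensions and the way the parameters are stored are illustrative assumptions.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Sketch of the compressed-feature classifier of Eqs. (2)-(4).
// Dimensions and the parameter container are assumptions.
struct GaussianParams { double mu1, sigma1, mu0, sigma0; };

// Sparse random projection matrix R (Eq. 3): entries are +/-sqrt(s) with
// probability 1/(2s) each and 0 with probability 1 - 1/s, where s = m/4.
std::vector<std::vector<double>> makeProjection(int n, int m, std::mt19937& rng)
{
    const double s = m / 4.0;
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<std::vector<double>> R(n, std::vector<double>(m, 0.0));
    for (auto& row : R)
        for (double& r : row) {
            double p = u(rng);
            if (p < 1.0 / (2.0 * s))       r =  std::sqrt(s);
            else if (p < 1.0 / s)          r = -std::sqrt(s);
            // otherwise r stays 0
        }
    return R;
}

// v = R x (Eq. 2).
std::vector<double> project(const std::vector<std::vector<double>>& R,
                            const std::vector<double>& x)
{
    std::vector<double> v(R.size(), 0.0);
    for (size_t i = 0; i < R.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            v[i] += R[i][j] * x[j];
    return v;
}

// Log-likelihood ratio H(v) of Eq. (4) with Gaussian p(v_i | y).
double classify(const std::vector<double>& v,
                const std::vector<GaussianParams>& params)
{
    auto logGauss = [](double x, double mu, double sigma) {
        const double kTwoPi = 6.283185307179586;
        return -0.5 * std::log(kTwoPi * sigma * sigma)
               - (x - mu) * (x - mu) / (2.0 * sigma * sigma);
    };
    double h = 0.0;
    for (size_t i = 0; i < v.size(); ++i)
        h += logGauss(v[i], params[i].mu1, params[i].sigma1)
           - logGauss(v[i], params[i].mu0, params[i].sigma0);
    return h;  // the candidate with the maximal H(v) is the new position
}
```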


Fig. 5. Update steps of a fast compressive tracker.

For updating, we collect a set of positive samples near the current center of the search window and a set of negative samples far away from this position, as shown in Figure 5. Similar to the classification step, low dimensional features $v \in \mathbb{R}^n$ are extracted from these two sets of samples following the steps of depth-based filtering, multi-scale filter banks, and random projection. We use these features to update the classifier parameters. Since the conditional distributions $p(v_i \mid y = 1)$ and $p(v_i \mid y = 0)$ are Gaussian,

$$p(v_i \mid y = 1) \sim N(\mu_i^1, \sigma_i^1), \qquad p(v_i \mid y = 0) \sim N(\mu_i^0, \sigma_i^0) \qquad (5)$$

we have to update the parameters $\mu_i^1$, $\sigma_i^1$, $\mu_i^0$ and $\sigma_i^0$ of the classifier $H$. These parameters are updated online as follows:

$$\mu_i^1 \leftarrow \lambda \mu_i^1 + (1 - \lambda)\mu^1$$
$$\sigma_i^1 \leftarrow \sqrt{\lambda (\sigma_i^1)^2 + (1 - \lambda)(\sigma^1)^2 + \lambda(1 - \lambda)(\mu_i^1 - \mu^1)^2} \qquad (6)$$

where $\sigma^1 = \sqrt{\frac{1}{n}\sum_{k=0 \mid y=1}^{n-1} (v_i(k) - \mu^1)^2}$, $\mu^1 = \frac{1}{n}\sum_{k=0 \mid y=1}^{n-1} v_i(k)$, and $\lambda$ is a learning parameter; the negative-class parameters $\mu_i^0$ and $\sigma_i^0$ are updated in the same way.
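A small sketch of this online update rule (Eq. (6)) is given below for one class, assuming the compressed features of the new samples are already available; the value of λ and the container layout are illustrative.

```cpp
#include <cmath>
#include <vector>

// Sketch of the online parameter update of Eq. (6) for one class
// (the negative-class parameters are updated analogously).
// 'features' holds the i-th compressed feature of every new sample of
// that class; lambda is the learning parameter (default value assumed).
void updateGaussian(double& mu_i, double& sigma_i,
                    const std::vector<double>& features,
                    double lambda = 0.85)
{
    const double n = static_cast<double>(features.size());

    // Empirical mean mu and standard deviation sigma of the new samples.
    double mu = 0.0;
    for (double v : features) mu += v;
    mu /= n;

    double var = 0.0;
    for (double v : features) var += (v - mu) * (v - mu);
    double sigma = std::sqrt(var / n);

    // Running update, Eq. (6); the sigma update uses the old mu_i.
    double newSigma = std::sqrt(lambda * sigma_i * sigma_i
                                + (1.0 - lambda) * sigma * sigma
                                + lambda * (1.0 - lambda) * (mu_i - mu) * (mu_i - mu));
    mu_i    = lambda * mu_i + (1.0 - lambda) * mu;
    sigma_i = newSigma;
}
```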

D. Kalman filter for occlusion handling

When a new fast compressive tracker is initiated, a Kalman filter is also set up as an alternative tracker for the case that the human is completely occluded by another person or by large objects. That means that the output of the Kalman filter is used for tracking when the human is significantly occluded and the fast compressive tracker cannot provide a reliable prediction.

A Kalman filter consists of measurement update equations and time update equations. When the compressive tracker is still tracking the human efficiently and no occlusion is recognized, the measurement update equations correct the Kalman filter using the reliable output of the fast compressive tracker. The time update equations are used to predict the current position of the human, and this prediction replaces the one from the compressive tracker only when an occlusion is found. The Kalman filter state vector consists of five parameters: the x-y coordinates of the bounding box of the human region, the velocities in the x and y directions, and the scale of the human region. The state-space representation of the Kalman filter is given as

$$\begin{bmatrix} x_t \\ y_t \\ x'_t \\ y'_t \\ s_t \end{bmatrix} = \begin{bmatrix} 1 & 0 & \Delta t & 0 & 0 \\ 0 & 1 & 0 & \Delta t & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{t-1} \\ y_{t-1} \\ x'_{t-1} \\ y'_{t-1} \\ s_{t-1} \end{bmatrix} + W_t \qquad (7)$$

where $x_t, y_t$ are the coordinates, $x'_t, y'_t$ are the velocities, and $s_t$ is the scale of the human region. $\Delta t$ is the time interval and $W_t$ is the process noise. In order to update the Kalman filter, the output of the naive Bayes classifier in the fast compressive tracker is given as the measurement input. The Kalman filter uses this data to effectively correct the system. The measurement correction equation is represented as follows:

$$\begin{bmatrix} x_t \\ y_t \\ x'_t \\ y'_t \\ s_t \end{bmatrix} = \begin{bmatrix} x_t \\ y_t \\ x'_t \\ y'_t \\ s_t \end{bmatrix} + K_t \left( \begin{bmatrix} m_{x_t} \\ m_{y_t} \\ m_{s_t} \end{bmatrix} - \begin{bmatrix} 1 & 0 & \Delta t & 0 & 0 \\ 0 & 1 & 0 & \Delta t & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_t \\ y_t \\ x'_t \\ y'_t \\ s_t \end{bmatrix} \right) \qquad (8)$$

where $K_t$ is the Kalman gain, and $m_{x_t}$, $m_{y_t}$ and $m_{s_t}$ are the measurement variables computed from the position and size of the human as assessed by the compressive tracker.

The prediction of the Kalman filter is required whenever the human location cannot be found by the compressive tracker due to occlusion by other persons or objects. When the human is significantly occluded, the fast compressive tracker may filter out all samples, since no segmented object closer than 0.5 meters to the tracked human location can be found. In this situation, the Kalman filter serves for a short time interval as the alternative solution to continue tracking the occluded human. After $t_o$ frames, if a new detection is matched to the tracker, the fast compressive tracker is recovered at the newly detected position. Otherwise, if we cannot find any detection matching the Kalman filter during $t_k$ frames, the target is automatically terminated.
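The following sketch sets up a cv::KalmanFilter with the five-dimensional state of Eq. (7) and the three-dimensional measurement of Eq. (8); the noise covariances and the value of Δt are placeholder choices, not the settings used in the paper.

```cpp
#include <opencv2/video/tracking.hpp>

// Sketch of the occlusion-handling Kalman filter: 5-D state
// (x, y, x', y', s) as in Eq. (7), 3-D measurement (mx, my, ms) as in Eq. (8).
// Noise covariances and dt are illustrative values.
cv::KalmanFilter makeTrackKF(float x0, float y0, float s0, float dt = 1.0f)
{
    cv::KalmanFilter kf(5, 3, 0, CV_32F);

    // State transition (Eq. 7): constant-velocity position, constant scale.
    kf.transitionMatrix = (cv::Mat_<float>(5, 5) <<
        1, 0, dt, 0, 0,
        0, 1, 0, dt, 0,
        0, 0, 1,  0, 0,
        0, 0, 0,  1, 0,
        0, 0, 0,  0, 1);

    // Measurement model (Eq. 8): the compressive tracker measures position and scale.
    kf.measurementMatrix = (cv::Mat_<float>(3, 5) <<
        1, 0, dt, 0, 0,
        0, 1, 0, dt, 0,
        0, 0, 0,  0, 1);

    cv::setIdentity(kf.processNoiseCov,     cv::Scalar::all(1e-2));
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));
    cv::setIdentity(kf.errorCovPost,        cv::Scalar::all(1.0));

    kf.statePost = (cv::Mat_<float>(5, 1) << x0, y0, 0, 0, s0);
    return kf;
}

// Per frame: predict, then correct only while the compressive tracker
// still gives a reliable (non-occluded) position, e.g.:
//   cv::Mat pred = kf.predict();
//   if (!occluded) kf.correct((cv::Mat_<float>(3, 1) << mx, my, ms));
```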

E. Hungarian algorithm for matching

When we find a new detection of a human, we use the Hungarian algorithm to search for the tracker corresponding to this detection. If it does not match any available tracker, a new tracker is initiated at the newly detected position. If the Hungarian algorithm finds the corresponding tracker, this tracker is updated with the new detected position. On the other hand, if no detections are found for the same tracker during a termination period, this tracker is automatically terminated.

The Hungarian algorithm operates on cost values derived from the overlap ratio between valid targets and new detections in each frame. The cost for each pair of a target and a detection is based on the overlap ratio between them, defined as

$$R_i^k = \frac{2 \, s_{O_{ik}}}{s_{D_i} + s_{T_k}} \qquad (9)$$

where $s_{O_{ik}}$ is the overlap area, $s_{D_i}$ is the area of the $i$-th detection and $s_{T_k}$ is the area of the $k$-th target. The cost is computed as

$$C_i^k = \begin{cases} 0 & \text{if } R_i^k \le R_{\min} \\ -\log(R_i^k) & \text{otherwise} \end{cases} \qquad (10)$$

where $R_{\min}$ is a threshold that evaluates whether a detection and a target are too far apart. By minimizing the cost function as mentioned in [13], an optimal solution is found that correctly matches targets and detections.
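As an illustration of Eqs. (9) and (10), the sketch below fills the cost matrix that would be handed to a Hungarian solver; the bounding-box type, the value of R_min and the post-assignment handling of zero-cost pairs are assumptions made for the example.

```cpp
#include <cmath>
#include <vector>

// Sketch of the cost matrix of Eqs. (9)-(10) used by the Hungarian matching.
// Box is a hypothetical axis-aligned bounding box; Rmin is an assumed value.
struct Box { float x, y, w, h; };

static float overlapArea(const Box& a, const Box& b)
{
    float ox = std::fmin(a.x + a.w, b.x + b.w) - std::fmax(a.x, b.x);
    float oy = std::fmin(a.y + a.h, b.y + b.h) - std::fmax(a.y, b.y);
    return (ox > 0 && oy > 0) ? ox * oy : 0.0f;
}

// cost[k][i] for target k and detection i.
std::vector<std::vector<float>>
buildCostMatrix(const std::vector<Box>& targets,
                const std::vector<Box>& detections,
                float Rmin = 0.3f)  // threshold, assumed value
{
    std::vector<std::vector<float>> cost(targets.size(),
                                         std::vector<float>(detections.size(), 0.0f));
    for (size_t k = 0; k < targets.size(); ++k)
        for (size_t i = 0; i < detections.size(); ++i) {
            float sT = targets[k].w * targets[k].h;
            float sD = detections[i].w * detections[i].h;
            float R  = 2.0f * overlapArea(targets[k], detections[i]) / (sD + sT);  // Eq. (9)
            // Eq. (10): pairs whose overlap ratio is at most Rmin keep cost 0
            // and can be rejected after the assignment step.
            if (R > Rmin)
                cost[k][i] = -std::log(R);
        }
    return cost;
}
```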

IV. EXPERIMENTAL SETUP

A. Dataset

We used the first Michigan dataset (the static dataset) [1], collected in indoor environments with a fixed Microsoft Kinect camera mounted approximately 2 meters high, to test the accuracy and the processing time of our method and its competitors. This dataset consists of 17 log files, each spanning 2 to 3 minutes. For evaluating the accuracy of our method under the conditions of a moving mobile robot, the second dataset (the on-board dataset) was used, recorded with a Microsoft Kinect camera mounted on-board a robot (PR2). This dataset consists of 18 log files recorded in offices, corridors and a cafeteria. Our goal was to evaluate the performance of our method in indoor environments in which both the humans and the robot move under different illumination conditions and in which the humans either change poses in a variety of ways or are occluded. Figure 8 shows some sample images extracted from these datasets.

We evaluated the accuracy and the processing time of our method and its closest competitor, Reversible Jump Markov Chain Monte Carlo Particle Filtering (RJ-MCMC) [1]. In all our experiments, humans are hand-annotated with bounding boxes around the upper bodies. The experiments on both Michigan datasets were carried out using C++ on a PC with a 2.5 GHz Intel Core i5 CPU.

Fig. 6. Results of human tracking on the first Kinect dataset, plotted as miss rate versus false positives per image (log-average miss rate: OURS 52%, RJ-MCMC 60%). Our algorithm, with an improvement of 8.0%, significantly outperforms the RJ-MCMC.

B. Results

We use the log-average miss rate (LAMR), as described in [14], to compare the performances, shown by the curve of miss rate versus false positives per image (FPPI). The log-average miss rate is computed by averaging the miss rate at nine FPPI rates evenly spaced in log space in the range of $10^{-2}$ to $10^{0}$. If a curve ends before reaching a given FPPI rate, the minimum miss rate is used.
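Assuming the usual geometric-mean convention for this metric, the computation looks roughly like the sketch below, which samples the miss-rate/FPPI curve at nine log-spaced reference points; the curve representation and the guard against zero miss rates are illustrative choices.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Sketch of the log-average miss rate (LAMR). The curve is given as
// (FPPI, miss rate) points sorted by increasing FPPI; we sample it at nine
// reference FPPI values log-spaced between 1e-2 and 1e0 and take the
// geometric mean of the sampled miss rates (assumed convention).
double logAverageMissRate(const std::vector<std::pair<double, double>>& curve)
{
    // Minimum miss rate of the curve, used when the curve ends early.
    double minMr = curve.front().second;
    for (const auto& p : curve) minMr = std::min(minMr, p.second);

    double sumLog = 0.0;
    for (int k = 0; k < 9; ++k) {
        // Reference FPPI values 10^-2, 10^-1.75, ..., 10^0.
        double ref = std::pow(10.0, -2.0 + 0.25 * k);

        // Miss rate of the first curve point with FPPI >= ref.
        double mr = minMr;
        for (const auto& p : curve)
            if (p.first >= ref) { mr = p.second; break; }

        sumLog += std::log(std::max(mr, 1e-12));  // guard against log(0)
    }
    return std::exp(sumLog / 9.0);
}
```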

On the first dataset, the comparison of the two algorithms is shown in Figure 6. Our algorithm, with an improvement of 8.0%, significantly outperforms the RJ-MCMC. On the second dataset, the improvement of our algorithm is 8.55%, as indicated in Figure 7. These results show that the combination of the fast compressive tracker and the Kalman filter is more efficient than the RJ-MCMC, even though we do not use the more expensive detectors, such as the full-body human detector, the depth-based shape detector, the motion detector and the skin color detector. Although both the face detector and the upper body detector can detect humans reliably, they cannot detect humans in certain complicated poses in many frames. In these cases, the fast compressive tracker contributes strongly to the performance of our algorithm due to its robustness to different human poses and to partial occlusion. In addition, the Kalman filter plays a significant role as the alternative to the fast compressive tracker to deal with full occlusion.

Besides accuracy, the processing time is also a very important factor for mobile robot performance. Hence we also compare our algorithm with the RJ-MCMC to determine which one meets the requirement of real-time processing. The speeds of our algorithm and the RJ-MCMC on the Michigan datasets are shown in Table I. Although the RJ-MCMC uses a GPU implementation, it is still much slower than our algorithm. This is explained by our improvements in reducing the search space of the human detectors as well as by decreasing the number of search samples in the compressive trackers. In particular, in each frame the fast compressive tracker only has to classify 40 samples on average instead of more than 7000 samples as in the original version, which significantly improves its speed.


Fig. 8. Examples of tracking results (a)-(h). The mobile robot can detect humans in different poses and under severe occlusions.

Fig. 7. Results of human tracking on the second Kinect dataset, plotted as miss rate versus false positives per image (log-average miss rate: OURS 51.46%, RJ-MCMC 60.01%). Our algorithm, with an improvement of 8.5%, outperforms the RJ-MCMC.

TABLE I
COMPARISON OF SPEED ON THE MICHIGAN DATABASE

            First dataset   Second dataset   Platform
Ours        23.8 fps        22.2 fps         CPU
RJ-MCMC     4 fps           4 fps            GPU

V. CONCLUSION AND FUTURE WORK

In this paper, we have introduced a system for multiple person detection and tracking by a mobile robot. The results indicate that the fusion of detections from the face detector and the upper body detector provides reliable observation cues for tracking multiple humans. Furthermore, the combination of the fast compressive tracker and the Kalman filter is robust to motion, pose variation and occlusion. In the future, we plan to develop an algorithm for human re-identification based on color and depth information and to combine it with the current tracking system. This combination will enable the mobile robot to track people more reliably and to recover lost tracks caused by long-term full occlusions or by the temporary disappearance of humans from the robot's field of view.

REFERENCES

[1] W. Choi, C. Pantofaru, and S. Savarese, "Detecting and tracking people using an RGB-D camera via multiple detector fusion," in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Nov 2011, pp. 1076-1083.

[2] V. Ferrari, M. Marin-Jimenez, and A. Zisserman, "Progressive search space reduction for human pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), June 2008, pp. 1-8.

[3] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in Proceedings of the 12th European Conference on Computer Vision (ECCV 2012), Part III. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 864-877.

[4] D. Schulz, W. Burgard, D. Fox, and A. Cremers, "Tracking multiple moving targets with a mobile robot using particle filters and statistical data association," in Proceedings 2001 IEEE International Conference on Robotics and Automation (ICRA 2001), vol. 2, 2001, pp. 1665-1670.

[5] N. Bellotto and H. Hu, "Multisensor-based human detection and tracking for mobile service robots," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 1, pp. 167-181, Feb 2009.

[6] P. Viola and M. Jones, "Robust real-time object detection," International Journal of Computer Vision, 2001.

[7] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, June 2005, pp. 886-893.

[8] C. Wojek, S. Walk, and B. Schiele, "Multi-cue onboard pedestrian detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), June 2009, pp. 794-801.

[9] W. Choi and S. Savarese, "Multiple target tracking in world coordinate with single, minimally calibrated camera," in Proceedings of the 11th European Conference on Computer Vision (ECCV 2010), Part IV. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 553-567. [Online]. Available: http://dl.acm.org/citation.cfm?id=1888089.1888132

[10] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125-141, May 2008.


[11] B. Babenko, M.-H. Yang, and S. Belongie, "Robust object tracking with online multiple instance learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1619-1632, 2011.

[12] X. Mei and H. Ling, "Robust visual tracking using L1 minimization," in 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 1436-1443.

[13] M. Vo-Duc, A. Masselli, and A. Zell, "Real time face detection using geometric constraints, navigation and depth-based skin segmentation on mobile robots," in 2012 IEEE International Symposium on Robotic and Sensors Environments, 2012.

[14] C. Wojek, S. Walk, S. Roth, and B. Schiele, "Monocular 3D scene understanding with explicit occlusion reasoning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), June 2011, pp. 1993-2000.