Deformable Part Models
Ross Girshick, UC Berkeley
CS231B Stanford University Guest Lecture, April 16, 2013
Image understanding
photo by thomas pix http://www.flickr.com/photos/thomaspix/2591427106
Snack time in the lab
What objects are where?
I see twinkies!
robot: I see a table with twinkies, pretzels, fruit, and some mysterious chocolate things...
DPM lecture overview
(a) (b) (c) (d) (e) (f) (g)
Figure 6. Our HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). The most active blocks are centred on the image background just outside the contour. (a) The average gradient image over the training examples. (b) Each pixel shows the maximum positive SVM weight in the block centred on the pixel. (c) Likewise for the negative SVM weights. (d) A test image. (e) Its computed R-HOG descriptor. (f,g) The R-HOG descriptor weighted by respectively the positive and the negative SVM weights.
would help to improve the detection results in more general situations.
Acknowledgments. This work was supported by the European Union research projects ACEMEDIA and PASCAL. We thank Cordelia Schmid for many useful comments. SVM-Light [10] provided reliable training of large-scale SVMs.
References
[1] S. Belongie, J. Malik, and J. Puzicha. Matching shapes. The 8th ICCV, Vancouver, Canada, pages 454-461, 2001.
[2] V. de Poortere, J. Cant, B. Van den Bosch, J. de Prins, F. Fransens, and L. Van Gool. Efficient pedestrian detection: a test case for SVM based categorization. Workshop on Cognitive Vision, 2002. Available online: http://www.vision.ethz.ch/cogvis02/.
[3] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. CVPR, Hilton Head Island, South Carolina, USA, pages 66-75, 2000.
[4] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. Intl. Workshop on Automatic Face and Gesture Recognition, IEEE Computer Society, Zurich, Switzerland, pages 296-301, June 1995.
[5] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, pages 100-105, October 1996.
[6] D. M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82-98, 1999.
[7] D. M. Gavrila, J. Giebel, and S. Munder. Vision-based pedestrian detection: the PROTECTOR+ system. Proc. of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 2004.
[8] D. M. Gavrila and V. Philomin. Real-time object detection for smart vehicles. CVPR, Fort Collins, Colorado, USA, pages 87-93, 1999.
[9] S. Ioffe and D. A. Forsyth. Probabilistic methods for finding people. IJCV, 43(1):45-68, 2001.
[10] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. The MIT Press, Cambridge, MA, USA, 1999.
[11] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. CVPR, Washington, DC, USA, pages 66-75, 2004.
[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[13] R. K. McConnell. Method of and apparatus for pattern recognition, January 1986. U.S. Patent No. 4,567,610.
[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 2004. Accepted.
[15] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63-86, 2004.
[16] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. The 8th ECCV, Prague, Czech Republic, volume I, pages 69-81, 2004.
[17] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 23(4):349-361, April 2001.
[18] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 38(1):15-33, 2000.
[19] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. The 7th ECCV, Copenhagen, Denmark, volume IV, pages 700-714, 2002.
[20] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. IJCV, 56(3):151-177, 2004.
[21] E. L. Schwartz. Spatial mapping in the primate sensory projection: analytic structure and relevance to perception. Biological Cybernetics, 25(4):181-194, 1977.
[22] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. The 9th ICCV, Nice, France, volume 1, pages 734-741, 2003.
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb, University of Chicago, pff@cs.uchicago.edu
David McAllester, Toyota Technological Institute at Chicago, mcallester@tti-c.org
Deva Ramanan, UC Irvine, dramanan@ics.uci.edu
Abstract
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.
1. Introduction
We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.
Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty
This material is based upon work supported by the National Science Foundation under Grants No. 0534820 and 0535174.
Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
object categories. Figure 1 shows an example detection obtained with our person model.
The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1-3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by conceptually weaker models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.
Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.
Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining hard negative examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a
Progress in detection average precision (AP): 12% (2005), 27% (2008), 36% (2009), 45% (2010), 49% (2011)
Part 1: modeling
Part 2: learning
Formalizing the object detection task

Many possible ways; this one is popular:
- Input: an image and a set of categories (cat, dog, chair, cow, person, motorbike, car, ...)
- Desired output: bounding boxes labeled with categories (person, motorbike, ...)
- Performance summary: Average Precision (AP); 0 is worst, 1 is perfect
Benchmark datasets

PASCAL VOC 2005-2012
- 54k objects in 22k images
- 20 object classes
- annual competition
Reduction to binary classification
Figure 2. Some sample images from our new human detection database. The subjects are always upright, but with some partial occlusions and a wide range of variations in pose, appearance, clothing, illumination and background.
probabilities to be distinguished more easily. We will often use miss rate at 10⁻⁴ FPPW as a reference point for results. This is arbitrary but no more so than, e.g., Area Under ROC. In a multiscale detector it corresponds to a raw error rate of about 0.8 false positives per 640×480 image tested. (The full detector has an even lower false positive rate owing to non-maximum suppression.) Our DET curves are usually quite shallow so even very small improvements in miss rate are equivalent to large gains in FPPW at constant miss rate. For example, for our default detector at 10⁻⁴ FPPW, every 1% absolute (9% relative) reduction in miss rate is equivalent to reducing the FPPW at constant miss rate by a factor of 1.57.
5 Overview of Results
Before presenting our detailed implementation and performance analysis, we compare the overall performance of our final HOG detectors with that of some other existing methods. Detectors based on rectangular (R-HOG) or circular log-polar (C-HOG) blocks and linear or kernel SVM are compared with our implementations of the Haar wavelet, PCA-SIFT, and shape context approaches. Briefly, these approaches are as follows:
Generalized Haar Wavelets. This is an extended set of oriented Haar-like wavelets similar to (but better than) that used in [17]. The features are rectified responses from 9×9 and 12×12 oriented 1st and 2nd derivative box filters at 45° intervals and the corresponding 2nd derivative xy filter.
PCA-SIFT. These descriptors are based on projecting gradient images onto a basis learned from training images using PCA [11]. Ke & Sukthankar found that they outperformed SIFT for key point based matching, but this is controversial [14]. Our implementation uses 16×16 blocks with the same derivative scale, overlap, etc., settings as our HOG descriptors. The PCA basis is calculated using positive training images.
Shape Contexts. The original Shape Contexts [1] used binary edge-presence voting into log-polar spaced bins, irrespective of edge orientation. We simulate this using our C-HOG descriptor (see below) with just 1 orientation bin. 16 angular and 3 radial intervals with inner radius 2 pixels and outer radius 8 pixels gave the best results. Both gradient-strength and edge-presence based voting were tested, with the edge threshold chosen automatically to maximize detection performance (the values selected were somewhat variable, in the region of 20-50 graylevels).
Results. Fig. 3 shows the performance of the various detectors on the MIT and INRIA data sets. The HOG-based detectors greatly outperform the wavelet, PCA-SIFT and Shape Context ones, giving near-perfect separation on the MIT test set and at least an order of magnitude reduction in FPPW on the INRIA one. Our Haar-like wavelets outperform MIT wavelets because we also use 2nd order derivatives and contrast normalize the output vector. Fig. 3(a) also shows MIT's best parts based and monolithic detectors (the points are interpolated from [17]), however beware that an exact comparison is not possible as we do not know how the database in [17] was divided into training and test parts and the negative images used are not available. The performances of the final rectangular (R-HOG) and circular (C-HOG) detectors are very similar, with C-HOG having the slight edge. Augmenting R-HOG with primitive bar detectors (oriented 2nd derivatives, R2-HOG) doubles the feature dimension but further improves the performance (by 2% at 10⁻⁴ FPPW). Replacing the linear SVM with a Gaussian kernel one improves performance by about 3% at 10⁻⁴ FPPW, at the cost of much higher run times¹. Using binary edge voting (EC-HOG) instead of gradient magnitude weighted voting (C-HOG) decreases performance by 5% at 10⁻⁴ FPPW, while omitting orientation information decreases it by much more, even if additional spatial or radial bins are added (by 33% at 10⁻⁴ FPPW, for both edges (E-ShapeC) and gradients (G-ShapeC)). PCA-SIFT also performs poorly. One reason is that, in comparison to [11], many more (80 of 512) principal vectors have to be retained to capture the same proportion of the variance. This may be because the spatial registration is weaker when there is no keypoint detector.
6 Implementation and Performance Study
We now give details of our HOG implementations and systematically study the effects of the various choices on detector performance.
¹We use the hard examples generated by linear R-HOG to train the kernel R-HOG detector, as kernel R-HOG generates so few false positives that its hard example set is too sparse to improve the generalization significantly.
pos = { ... ... }
neg = { ... background patches ... }
Descriptor Cues

[Panels: input image, avg. grad, weighted pos wts, weighted neg wts, outside/in block]

- The most important cues are head, shoulder, and leg silhouettes
- Vertical gradients inside the person count as negative
- Overlapping blocks just outside the contour are the most important

(Histograms of Oriented Gradients for Human Detection, p. 11/13)
Dalal & Triggs (CVPR05): HOG features + SVM sliding window detector
Sliding window detection
- Compute HOG of the whole image at multiple resolutions
- Score every subwindow of the feature pyramid
- Apply non-maximum suppression
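The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the Dalal & Triggs implementation: `score_all_windows` and `detect` are hypothetical helpers, HOG computation is assumed to happen elsewhere, and the pyramid is passed in as precomputed per-level feature maps.

```python
import numpy as np

def score_all_windows(feat, w):
    """Slide a linear filter w (hw x ww x d) over a feature map
    feat (H x W x d); returns an (H-hw+1) x (W-ww+1) score map.
    Each score is the dot product of the filter with the features
    of one subwindow, i.e. a linear window classifier."""
    hw, ww, d = w.shape
    H, W, _ = feat.shape
    scores = np.empty((H - hw + 1, W - ww + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            scores[y, x] = np.sum(w * feat[y:y + hw, x:x + ww])
    return scores

def detect(feature_pyramid, w, thresh=0.0):
    """Score every subwindow at every pyramid level and keep those
    above threshold; non-maximum suppression would follow."""
    detections = []
    for level, feat in enumerate(feature_pyramid):
        s = score_all_windows(feat, w)
        for y, x in zip(*np.where(s > thresh)):
            detections.append((level, y, x, s[y, x]))
    return detections
```

In practice the inner dot products are computed for all windows at once as a cross-correlation of the feature map with the filter, which is what makes dense multiscale scanning affordable.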
Image pyramid HOG feature pyramid
score(I, p) = w · φ(I, p)
Detection

- p specifies a window location; ~250,000 locations per image
- the test set has ~5,000 images
- => more than 1.3×10⁹ windows to classify
- typically only ~1,000 true positive locations
- => extremely unbalanced binary classification
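The standard way to cope with this imbalance, used by Dalal & Triggs and, in a margin-sensitive form, by the DPM training described later, is to alternate between training and mining hard negatives: scan the negative images with the current model, keep the windows it scores near or above the margin, retrain, and repeat. A minimal sketch of the mining step, with `score_fn` and `window_sampler` as hypothetical placeholders for the real detector and window enumerator:

```python
def mine_hard_negatives(score_fn, neg_images, window_sampler, thresh=-1.0):
    """Scan negative images and keep every window the current model
    scores above `thresh` (near or past the SVM margin): the hard
    negatives. The model is then retrained on all positives plus the
    growing hard-negative cache, and mining is repeated."""
    hard = []
    for img in neg_images:
        for feat in window_sampler(img):
            if score_fn(feat) > thresh:
                hard.append(feat)
    return hard
```

Since almost all of the ~10⁹ negative windows are scored far below the margin, each mining pass adds only a small cache of informative examples, which is what makes SVM training on this scale tractable.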
Dalal & Triggs detector on INRIA

[Precision-recall plots: (a) different descriptors on the INRIA static person database (Ker. R-HOG, Lin. R-HOG, Lin. R2-HOG, Wavelet, PCA-SIFT, Lin. E-ShapeC); (b) descriptors on the INRIA static+moving person database (R-HOG + IMHmd, R-HOG, Wavelet).]
Fig. 3.6. The performance of selected detectors on the INRIA static (left) and static+moving (right) person data sets. For both of the data sets, the plots show the substantial overall gains obtained by using HOG features rather than other state-of-the-art descriptors. (a) Compares static HOG descriptors with other state-of-the-art descriptors on the INRIA static person data set. (b) Compares the combined static and motion HOG, the static HOG, and the wavelet detectors on the combined INRIA static and moving person data set.
[2001] but also includes both 1st and 2nd-order derivative filters at 45° intervals and the corresponding 2nd derivative xy filter. It yields an AP of 0.53. Shape contexts based on edges (E-ShapeC) perform considerably worse with an AP of 0.25. However, Chapter 4 will show that generalised shape contexts [Mori and Malik 2003], which like standard shape contexts compute circular blocks with cells shaped over a log-polar grid, but which use both image gradients and orientation histograms as in R-HOG, give similar performance. This highlights the fact that orientation histograms are very effective at capturing the information needed for object recognition.
For the video sequences we compare our combined static and motion HOG, static HOG, and Haar wavelet detectors. The detectors were trained and tested on training and test portions of the combined INRIA static and moving person data set. Details on how the descriptors and the data sets were combined are presented in Chapter 6. Figure 3.6(b) summarises the results. The HOG-based detectors again significantly outperform the wavelet based one, but surprisingly the combined static and motion HOG detector does not seem to offer a significant advantage over the static HOG one: the static detector gives an AP of 0.553 compared to 0.527 for the motion detector. These results are surprising and disappointing because Sect. 6.5.2, where we used DET curves (c.f. Sect. B.1) for evaluations, shows that for exactly the same data set, the individual window classifier for the motion detector gives significantly better performance than the static HOG window classifier, with false positive rates about one order of magnitude lower than those for the static HOG classifier. We are not sure what is causing this anomaly and are currently investigating it. It seems to be linked to the threshold used for truncating the scores in the mean shift fusion stage (during non-maximum suppression) of the combined detector.
AP = 75% (79% in my implementation)
Very good! Declare victory and go home?
Dalal & Triggs on PASCAL VOC 2007
AP = 12% (using my implementation)
How can we do better?
Revisit an old idea: part-based models (pictorial structures)
- Fischler & Elschlager '73; Felzenszwalb & Huttenlocher '00
Combine with modern features and machine learning
Part-based models
Parts: local appearance templates. Springs: spatial connections between parts (geometric prior).
Image: [Felzenszwalb and Huttenlocher 05]
Part-based models
Local appearance is easier to model than global appearance
- Training data is shared across deformations
- A part can be local or global depending on resolution
Generalizes to previously unseen configurations
General formulation
A model is an undirected graph G = (V, E) over parts v_1, ..., v_n.
A configuration z = (p_1, ..., p_n) gives the part locations in the image (or feature pyramid).

Part configuration score function

score(p_1, ..., p_n) = sum_i m_i(p_i) - sum_{(i,j) in E} d_ij(p_i, p_j)

- m_i(p_i): part match scores
- d_ij(p_i, p_j): spring costs

Detection returns the highest scoring configurations.
Part configuration score function

score(p_1, ..., p_n) = sum_i m_i(p_i) - sum_{(i,j) in E} d_ij(p_i, p_j)

Objective: maximize the score over p_1, ..., p_n. There are h^n configurations (h = |P|, about 250,000), so exhaustive search is infeasible. Dynamic programming:
- If G = (V, E) is a tree: O(nh²) general algorithm
- O(nh) with some restrictions on d_ij
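The tree/star case can be made concrete with a small sketch. Assume precomputed match scores m_i over h discrete locations and explicit h×h spring-cost tables d_i (a hypothetical setup for illustration; `star_model_scores` is not from the lecture code). For a star model, the maximization over each part is independent given the root, which is exactly why the DP costs O(nh²) instead of O(h^n):

```python
import numpy as np

def star_model_scores(root_scores, part_scores, def_costs):
    """DP for a star-structured model over h discrete locations.
    root_scores: (h,) root match scores m_0(p0).
    part_scores: list of (h,) arrays m_i(pi), one per part.
    def_costs:   list of (h, h) arrays d_i(p0, pi), spring costs.
    Returns an (h,) array of best total scores per root location:
        score(p0) = m_0(p0) + sum_i max_{pi} [m_i(pi) - d_i(p0, pi)]
    Each inner max is an O(h^2) brute force here; generalized
    distance transforms bring it to O(h) for quadratic d_i."""
    total = root_scores.copy()
    for m_i, d_i in zip(part_scores, def_costs):
        # for every root placement p0, add the best placement of part i
        total += np.max(m_i[None, :] - d_i, axis=1)
    return total
```

Taking the max (and argmax) of the returned array recovers the highest scoring configuration; general trees work the same way, passing such max-messages from leaves to root.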
Star-structured deformable part models
test image star model detection
root part
Recall the Dalal & Triggs detector
- HOG feature pyramid
- Linear filter / sliding-window detector
- SVM training to learn parameters w
Image pyramid HOG feature pyramid
score(I, p) = w · φ(I, p)
D&T + parts
Add parts to the Dalal & Triggs detector:
- HOG features
- Linear filters / sliding-window detector
- Discriminative training
[FMR CVPR08] [FGMR PAMI10]

[Slide figure: sliding window DPM score function. An image pyramid is converted to a HOG feature pyramid; the root filter is placed at location p0, and the latent part placements z are scored at twice the root's resolution.]
A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb, University of Chicago, pff@cs.uchicago.edu
David McAllester, Toyota Technological Institute at Chicago, mcallester@tti-c.org
Deva Ramanan, UC Irvine, dramanan@ics.uci.edu
Abstract
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.
1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.
Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

This material is based upon work supported by the National Science Foundation under Grants No. 0534820 and 0535174.

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1, 3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by conceptually weaker models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining hard negative examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a
[Slide figure: image pyramid and HOG feature pyramid; root filter at location p0, latent part placements z at twice the root's resolution.]

Spring costs and filter scores:

z = (p1, ..., pn)

score(x, p0) = max_{p1,...,pn} [ Σ_{i=0}^{n} m_i(x, p_i) − Σ_{i=1}^{n} d_i(p0, p_i) ]
Detection in a slide
[Pipeline figure: the test image yields a feature map and a feature map at 2x resolution. The root filter is applied to the coarse map and the 1st through n-th part filters to the fine map; each part's response is transformed (max over displacements, minus the spring cost) and the transformed responses are summed with the response of the root filter, giving detection scores for each root location. Color encoding of filter response values: low value to high value.]
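The "transformed responses" box in the pipeline can be sketched in a few lines: at every anchor the part may shift, paying a quadratic spring cost. This is a brute-force stand-in (the real system computes the same map in linear time with a generalized distance transform); `transform_response` and its parameters are illustrative names, not the released code's API.

```python
import numpy as np

def transform_response(part_resp, dx_cost, dy_cost, max_disp=4):
    """Spread a part filter's response map: at each anchor, take the best
    nearby placement, i.e. response minus a quadratic deformation cost."""
    H, W = part_resp.shape
    out = np.full((H, W), -np.inf)
    for y in range(H):
        for x in range(W):
            for dy in range(-max_disp, max_disp + 1):
                for dx in range(-max_disp, max_disp + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        s = part_resp[yy, xx] - dx_cost * dx * dx - dy_cost * dy * dy
                        out[y, x] = max(out[y, x], s)
    return out

# toy example: one strong part response, one cell away from the anchor
resp = np.zeros((5, 5))
resp[2, 3] = 3.0
t = transform_response(resp, dx_cost=1.0, dy_cost=1.0)
print(t[2, 2])   # 3.0 - 1*1^2 = 2.0: the part "pulls in" from the neighbor
```

Summing these transformed maps with the root response gives the detection score for every root location in one pass.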
What are the parts?
Aspect soup
General philosophy: enrich models to better represent the data
class:        aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbike person plant sheep sofa train tv
Our rank        3    1    2    1    1     2    2    4    1     1    1     4    2     2     1      1     2     1    4    1
Our score     .180 .411 .092 .098 .249  .349 .396 .110 .155  .165 .110  .062 .301  .337  .267   .140  .141  .156 .206 .336
INRIA Normal  .092 .246 .012 .002 .068  .197 .265 .018 .097  .039 .017  .016 .225  .153  .121   .093  .002  .102 .157 .242
INRIA Plus    .136 .287 .041 .025 .077  .279 .294 .132 .106  .127 .067  .071 .335  .249  .092   .072  .011  .092 .242 .275
MPI Center    .060 .110 .028 .031 .000  .164 .172 .208 .002  .044 .049  .141 .198  .170  .091   .004  .091  .034 .237 .051
MPI ESSOL     .152 .157 .098 .016 .001  .186 .120 .240 .007  .061 .098  .162 .034  .208  .117   .002  .046  .147 .110 .054
TKK           .186 .078 .043 .072 .002  .116 .184 .050 .028  .100 .086  .126 .186  .135  .061   .019  .036  .058 .067 .090
Partial entries (these teams competed in a subset of classes): Darmstadt .301; IRISA .281 .318 .026 .097 .119 .289 .227 .221 .175 .253; Oxford .262 .409 .393 .432 .375 .334

Table 1. PASCAL VOC 2007 results. Average precision scores of our system and other systems that entered the competition [7]. Empty boxes indicate that a method was not tested in the corresponding class. The best score in each class is shown in bold. Our current system ranks first in 10 out of 20 classes. A preliminary version of our system ranked first in 6 classes in the official competition.
Bottle, Car, Bicycle, Sofa

Figure 4. Some models learned from the PASCAL VOC 2007 dataset. We show the total energy in each orientation of the HOG cells in the root and part filters, with the part filters placed at the center of the allowable displacements. We also show the spatial model for each part, where bright values represent cheap placements, and dark values represent expensive placements.
in the PASCAL competition was .16, obtained using a rigid template model of HOG features [5]. The best previous result of .19 adds a segmentation-based verification step [20]. Figure 6 summarizes the performance of several models we trained. Our root-only model is equivalent to the model from [5] and it scores slightly higher at .18. Performance jumps to .24 when the model is trained with a LSVM that selects a latent position and scale for each positive example. This suggests LSVMs are useful even for rigid templates because they allow for self-adjustment of the detection window in the training examples. Adding deformable parts increases performance to .34 AP, a factor of two above the best previous score. Finally, we trained a model with parts but no root filter and obtained .29 AP. This illustrates the advantage of using a multiscale representation.

We also investigated the effect of the spatial model and allowable deformations on the 2006 person dataset. Recall that s_i is the allowable displacement of a part, measured in HOG cells. We trained a rigid model with high-resolution parts by setting s_i to 0. This model outperforms the root-only system by .27 to .24. If we increase the amount of allowable displacements without using a deformation cost, we start to approach a bag-of-features. Performance peaks at s_i = 1, suggesting it is useful to constrain the part displacements. The optimal strategy allows for larger displacements while using an explicit deformation cost. The follow-
Mixture models
Data driven: aspect, occlusion modes, subclasses
FMR CVPR 08: AP = 0.27 (person)
FGMR PAMI 10: AP = 0.36 (person)
(a) Car component 1 (initial parts)
(b) Car component 1 (trained parts)
(c) Car component 2 (initial parts)
(d) Car component 2 (trained parts)
(e) Car component 3 (initial parts)
(f) Car component 3 (trained parts)
Figure 4.3: Car components with parts initialized by interpolating the root filter to twice its resolution (a,c,e), and parts after training with LSVM or WL-SSVM (b,d,f).
Pushmi-pullyu?

Good generalization properties on Doctor Dolittle's farm

This was supposed to detect horses

( + ) / 2 =
Latent orientation
Unsupervised left/right orientation discovery
FGMR PAMI 10: AP = 0.36 (person)
voc-release5: AP = 0.45 (person)
Publicly available code for the whole system: current voc-release5
[Bar chart: horse AP 0.42, 0.47, 0.57]
Summary of results
[DT05] AP 0.12
[FMR08] AP 0.27
[FGMR10] AP 0.36
[GFM voc-release5] AP 0.45
[GFM11] AP 0.49
Part 2: DPM parameter learning

[Diagram: fixed model structure with two components; the filters, deformation costs, and biases are shown as unknowns ("?")]

fixed model structure; training images with labels y = +1 (positives) and y = -1 (negatives)

Parameters to learn: biases (per component), deformation costs (per part), filter weights
Linear parameterization

Filter scores: m_i(x, p_i) = w_i · φ(x, p_i)
Spring costs: d_i(p0, p_i) = d_i · (dx, dy, dx², dy²)

z = (p1, ..., pn)

score(x, p0) = max_{p1,...,pn} [ Σ_{i=0}^{n} m_i(x, p_i) − Σ_{i=1}^{n} d_i(p0, p_i) ]
             = max_z w · Φ(x, (p0, z))
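The linear parameterization can be checked numerically: concatenating the appearance features with the negated deformation features yields a Φ(x, (p0, z)) whose dot product with w = (w_0, ..., w_n, d_1, ..., d_n) reproduces the score. A minimal sketch with made-up dimensions and hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(0)
n_parts, feat_dim = 2, 4
w_parts = [rng.normal(size=feat_dim) for _ in range(n_parts + 1)]  # root + parts
d = [rng.uniform(0.1, 1.0, size=4) for _ in range(n_parts)]        # deformation params

def score(features, displacements):
    """score = sum_i w_i . phi_i  -  sum_i d_i . (dx, dy, dx^2, dy^2)"""
    s = sum(w.dot(f) for w, f in zip(w_parts, features))
    s -= sum(di.dot([dx, dy, dx * dx, dy * dy])
             for di, (dx, dy) in zip(d, displacements))
    return s

def Phi(features, displacements):
    """Concatenate appearance features with negated deformation features."""
    defs = [-np.array([dx, dy, dx * dx, dy * dy]) for dx, dy in displacements]
    return np.concatenate(list(features) + defs)

w = np.concatenate(w_parts + d)                 # all parameters in one vector
feats = [rng.normal(size=feat_dim) for _ in range(n_parts + 1)]
disps = [(1, 0), (-1, 2)]
assert np.isclose(w.dot(Phi(feats, disps)), score(feats, disps))
print("w . Phi(x, (p0, z)) == score(x, z)")
```

Because the score is linear in w for any fixed z, the sliding-window detector is exactly f_w(x) = max_z w · Φ(x, z), which is what the latent SVM below optimizes.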
Positive examples (y = +1)

x specifies an image and a bounding box (e.g., around a person)

We want f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z) to score ≥ +1

Z(x) includes all z with more than 70% overlap with ground truth
Negative examples (y = -1)

x specifies an image and a HOG pyramid location p0

We want f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z) to score ≤ -1
Typical dataset

300 to 8,000 positive examples

500 million to 1 billion negative examples (not including latent configurations!)

Large-scale* (*unless someone from Google is here)
How we learn parameters: latent SVM

L(w) = ½‖w‖² + C Σ_i max{0, 1 − y_i f_w(x_i)}

Splitting the sum over positives P and negatives N:

L(w) = ½‖w‖² + C Σ_{i∈P} max{0, 1 − max_{z∈Z(x_i)} w·Φ(x_i, z)} + C Σ_{i∈N} max{0, 1 + max_{z∈Z(x_i)} w·Φ(x_i, z)}
How we learn parameters: latent SVM

L(w) = ½‖w‖² + C Σ_{i∈P} max{0, 1 − max_{z∈Z(x_i)} w·Φ(x_i, z)} + C Σ_{i∈N} max{0, 1 + max_{z∈Z(x_i)} w·Φ(x_i, z)}

[Plots: as a function of w, max_z w·Φ(x, z) is a max over linear functions, one per latent value z1, z2, z3, z4. The negative term (+ max) is convex; the positive term (− max) is concave :(]
How we learn parameters: latent SVM

Observations

The latent SVM objective is convex in the negatives, but not in the positives

⇒ "semi-convex"
Convex upper bound on loss

At the current w, fix the latent value of each positive example:
ZP_i = argmax_{z∈Z(x_i)} w·Φ(x_i, z)   (e.g., ZP_i = z2 in the plot)

Since max_{z∈Z(x_i)} w·Φ(x_i, z) ≥ w·Φ(x_i, ZP_i):

max{0, 1 − max_{z∈Z(x_i)} w·Φ(x_i, z)} ≤ max{0, 1 − w·Φ(x_i, ZP_i)}   ← convex
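The bound can be verified numerically on a toy positive example: fixing z to the argmax at the current w gives a convex function that touches the true loss there and upper-bounds it everywhere else. A sketch under made-up data (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
# one positive example with 4 candidate latent placements, each a feature vector
Z = rng.normal(size=(4, 6))

def pos_loss(w):
    """True positive-example loss: max{0, 1 - max_z w.Phi(x, z)}."""
    return max(0.0, 1.0 - np.max(Z @ w))

w_cur = rng.normal(size=6)
zp = int(np.argmax(Z @ w_cur))   # latent value chosen at the current w

def bound(w):
    """Convex upper bound: latent value frozen to zp."""
    return max(0.0, 1.0 - Z[zp] @ w)

# the bound touches the loss at w_cur and upper-bounds it at every other w tried
assert np.isclose(pos_loss(w_cur), bound(w_cur))
for _ in range(100):
    w = rng.normal(size=6)
    assert pos_loss(w) <= bound(w) + 1e-12
print("bound is tight at the current w and valid elsewhere")
```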
Auxiliary objective

Let ZP = {ZP_1, ZP_2, ...}

L(w, ZP) = ½‖w‖² + C Σ_{i∈P} max{0, 1 − w·Φ(x_i, ZP_i)} + C Σ_{i∈N} max{0, 1 + max_{z∈Z(x_i)} w·Φ(x_i, z)}

Note that L(w, ZP) ≥ min_{ZP} L(w, ZP) = L(w), and min_{w,ZP} L(w, ZP) = min_w L(w)
Auxiliary objective

min_{w,ZP} L(w, ZP) = min_w L(w)

This isn't any easier to optimize

Find a stationary point by coordinate descent on L(w, ZP)

Initialization: pick a w^(0) (or a ZP)

Step 1: ZP_i := argmax_{z∈Z(x_i)} w^(t) · Φ(x_i, z)   ∀ i ∈ P

Step 2: w^(t+1) := argmin_w L(w, ZP)
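The two steps can be put together in a toy coordinate-descent loop. This is only a sketch: tiny random "bags" of feature vectors stand in for the latent configurations Z(x_i), and plain subgradient descent stands in for the real Step 2 solver.

```python
import numpy as np

rng = np.random.default_rng(2)
D, C = 5, 1.0

# toy data: each positive is a bag of candidate latent feature vectors;
# each negative is a single feature vector
pos = [rng.normal(loc=+1, size=(3, D)) for _ in range(20)]
neg = [rng.normal(loc=-1, size=D) for _ in range(40)]

def objective(w, zp):
    """L(w, ZP): regularizer + hinge on positives (latent fixed) + negatives."""
    obj = 0.5 * w @ w
    obj += C * sum(max(0.0, 1.0 - bag[z] @ w) for bag, z in zip(pos, zp))
    obj += C * sum(max(0.0, 1.0 + x @ w) for x in neg)
    return obj

w = np.zeros(D)
for it in range(10):
    # Step 1 (detection): pick the best latent value for each positive
    zp = [int(np.argmax(bag @ w)) for bag in pos]
    # Step 2: minimize the now-convex objective with zp fixed; here by
    # plain subgradient descent (the real system uses a tuned SGD solver)
    for step in range(200):
        g = w.copy()
        for bag, z in zip(pos, zp):
            if bag[z] @ w < 1:
                g -= C * bag[z]
        for x in neg:
            if x @ w > -1:
                g += C * x
        w -= 0.01 / (1 + step) * g
print("final auxiliary objective:", float(objective(w, zp)))
```

Each outer iteration can only lower (or keep) L(w, ZP), so the procedure reaches a stationary point of the latent SVM objective.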
Step 1

This is just detection: ZP_i := argmax_{z∈Z(x_i)} w^(t) · Φ(x_i, z)   ∀ i ∈ P

[Pipeline figure as before: root and part filter responses on the feature pyramid; transformed part responses summed into detection scores for each root location]
Step 2

w^(t+1) := argmin_w ½‖w‖² + C Σ_{i∈P} max{0, 1 − w·Φ(x_i, ZP_i)} + C Σ_{i∈N} max{0, 1 + max_{z∈Z(x_i)} w·Φ(x_i, z)}

Convex; similar to a structural SVM

But, recall: 500 million to 1 billion negative examples!

Can be solved by a working set method (bootstrapping, "data mining", constraint generation); requires a bit of engineering to make this fast
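The working-set idea can be sketched as alternating between training on a small cache of negatives and re-mining the full pool for hard ones (negatives scoring above the −1 margin). This is a toy stand-in with an illustrative subgradient solver, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
D, C = 5, 1.0
pos = rng.normal(loc=+1, size=(30, D))           # latent values already fixed (Step 1)
neg_pool = rng.normal(loc=-1, size=(100000, D))  # stand-in for ~10^9 negatives

def train(P, N):
    """Tiny hinge-loss SVM via subgradient descent (placeholder solver)."""
    w = np.zeros(D)
    for t in range(300):
        g = w.copy()
        g -= C * P[P @ w < 1].sum(axis=0)        # positives inside the margin
        g += C * N[N @ w > -1].sum(axis=0)       # negatives inside the margin
        w -= 0.02 / (1 + t) * g
    return w

cache = neg_pool[:200]                           # small initial working set
for round_ in range(5):
    w = train(pos, cache)
    cache = cache[cache @ w > -1]                # shrink: drop easy negatives
    hard = neg_pool[neg_pool @ w > -1]           # grow: data-mine hard negatives
    cache = np.unique(np.vstack([cache, hard[:500]]), axis=0)
print("working set size:", len(cache), "of", len(neg_pool))
```

The full pool is only ever touched by the cheap scoring pass; the expensive solver sees just the cache, which is what makes billions of negatives tractable.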
Comments

Latent SVM is mathematically equivalent to MI-SVM (Andrews et al. NIPS 2003)

[Figure: each example x_i is a bag of instances {x_i1, x_i2, x_i3} with latent labels {z1, z2, z3}]

Latent SVM can be written as a latent structural SVM (Yu and Joachims ICML 2009); its natural optimization algorithm is the concave-convex procedure (CCCP), similar to, but not exactly the same as, coordinate descent
What about the model structure?

[Diagram as before: fixed model structure with two components and unknown filters; training images with labels y = +1 / -1]

Model structure: # components; # parts per component; root and part filter shapes; part anchor locations
Learning model structure
Split positives by aspect ratio
Warp to common size
Train Dalal & Triggs model for each aspect ratio on its own
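The first structure-learning step, splitting the positives by aspect ratio, can be sketched as quantile bucketing of the positive bounding boxes. The exact split rule here is an assumption for illustration (the released code splits on aspect statistics in its own way), and `split_by_aspect` is a hypothetical helper:

```python
import numpy as np

def split_by_aspect(boxes, n_components=2):
    """Assign each positive box (x1, y1, x2, y2) to a mixture component by
    aspect ratio, using quantile edges so components get roughly equal data."""
    aspects = np.array([(x2 - x1) / (y2 - y1) for x1, y1, x2, y2 in boxes])
    edges = np.quantile(aspects, np.linspace(0, 1, n_components + 1))
    comp = np.searchsorted(edges, aspects, side="right") - 1
    return np.clip(comp, 0, n_components - 1)

# two tall boxes and two wide boxes -> two components
boxes = [(0, 0, 50, 100), (0, 0, 55, 100), (0, 0, 120, 60), (0, 0, 100, 50)]
print(split_by_aspect(boxes))
```

Each component's positives are then warped to a common size and used to train one Dalal & Triggs root filter.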
Learning model structure
Use D&T filters as initial w for LSVM training
Merge components
Root filter placement and component choice are latent
Learning model structure
Add parts to cover high-energy areas of root filters
Continue training model with LSVM
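The "add parts to cover high-energy areas" heuristic can be sketched as greedy placement of fixed-size anchors over the upsampled root-filter energy, zeroing out each covered region so parts spread out. A simplified version under assumed names (`init_parts` is illustrative, not the released code):

```python
import numpy as np

def init_parts(root_energy, n_parts=2, ph=2, pw=2):
    """Greedily place n_parts anchors of size ph x pw over the highest-energy
    regions of the 2x-interpolated root filter energy map."""
    E = np.kron(root_energy, np.ones((2, 2)))   # crude 2x upsampling
    H, W = E.shape
    anchors = []
    for _ in range(n_parts):
        best, best_yx = -1.0, (0, 0)
        for y in range(H - ph + 1):
            for x in range(W - pw + 1):
                s = E[y:y + ph, x:x + pw].sum()
                if s > best:
                    best, best_yx = s, (y, x)
        y, x = best_yx
        E[y:y + ph, x:x + pw] = 0.0             # don't reuse the same area
        anchors.append(best_yx)
    return anchors

energy = np.array([[1.0, 0.1],
                   [0.1, 1.0]])
print(init_parts(energy))   # parts land on the two high-energy corners
```

These anchors become the parts' default placements; LSVM training then refines the filters and deformation costs around them.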
Learning model structure
without orientation clustering
with orientation clustering
Learning model structure
In summary: repeated application of LSVM training to models of increasing complexity; structure learning involves many heuristics (and vision insight!)