Deformable Part Models
Ross Girshick, UC Berkeley
CS231B Stanford University Guest Lecture, April 16, 2013
Image understanding
photo by thomas pix http://www.flickr.com/photos/thomaspix/2591427106
Snack time in the lab
What objects are where?
I see twinkies!
robot: I see a table with twinkies, pretzels, fruit, and some mysterious chocolate things...
DPM lecture overview
(a) (b) (c) (d) (e) (f) (g)
Figure 6. Our HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). The most active blocks are centred on the image background just outside the contour. (a) The average gradient image over the training examples. (b) Each pixel shows the maximum positive SVM weight in the block centred on the pixel. (c) Likewise for the negative SVM weights. (d) A test image. (e) Its computed R-HOG descriptor. (f,g) The R-HOG descriptor weighted by respectively the positive and the negative SVM weights.
would help to improve the detection results in more general situations.
Acknowledgments. This work was supported by the European Union research projects ACEMEDIA and PASCAL. We thank Cordelia Schmid for many useful comments. SVM-Light [10] provided reliable training of large-scale SVMs.
References
[1] S. Belongie, J. Malik, and J. Puzicha. Matching shapes. The 8th ICCV, Vancouver, Canada, pages 454-461, 2001.
[2] V. de Poortere, J. Cant, B. Van den Bosch, J. de Prins, F. Fransens, and L. Van Gool. Efficient pedestrian detection: a test case for SVM based categorization. Workshop on Cognitive Vision, 2002. Available online: http://www.vision.ethz.ch/cogvis02/.
[3] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. CVPR, Hilton Head Island, South Carolina, USA, pages 66-75, 2000.
[4] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. Intl. Workshop on Automatic Face and Gesture Recognition, IEEE Computer Society, Zurich, Switzerland, pages 296-301, June 1995.
[5] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, pages 100-105, October 1996.
[6] D. M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82-98, 1999.
[7] D. M. Gavrila, J. Giebel, and S. Munder. Vision-based pedestrian detection: the PROTECTOR+ system. Proc. of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 2004.
[8] D. M. Gavrila and V. Philomin. Real-time object detection for smart vehicles. CVPR, Fort Collins, Colorado, USA, pages 87-93, 1999.
[9] S. Ioffe and D. A. Forsyth. Probabilistic methods for finding people. IJCV, 43(1):45-68, 2001.
[10] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. The MIT Press, Cambridge, MA, USA, 1999.
[11] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. CVPR, Washington, DC, USA, pages 66-75, 2004.
[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[13] R. K. McConnell. Method of and apparatus for pattern recognition, January 1986. U.S. Patent No. 4,567,610.
[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 2004. Accepted.
[15] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63-86, 2004.
[16] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. The 8th ECCV, Prague, Czech Republic, volume I, pages 69-81, 2004.
[17] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 23(4):349-361, April 2001.
[18] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 38(1):15-33, 2000.
[19] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. The 7th ECCV, Copenhagen, Denmark, volume IV, pages 700-714, 2002.
[20] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. IJCV, 56(3):151-177, 2004.
[21] E. L. Schwartz. Spatial mapping in the primate sensory projection: analytic structure and relevance to perception. Biological Cybernetics, 25(4):181-194, 1977.
[22] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. The 9th ICCV, Nice, France, volume 1, pages 734-741, 2003.
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb, University of Chicago, pff@cs.uchicago.edu
David McAllester, Toyota Technological Institute at Chicago, mcallester@tti-c.org
Deva Ramanan, UC Irvine, dramanan@ics.uci.edu
Abstract
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.
1. Introduction
We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.
Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty
This material is based upon work supported by the National Science Foundation under Grants No. 0534820 and 0535174.
Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
object categories. Figure 1 shows an example detection obtained with our person model.
The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1-3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by conceptually weaker models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.
Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.
Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining hard negative examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a
Progress in detection average precision (AP): 12% (2005), 27% (2008), 36% (2009), 45% (2010), 49% (2011)
Part 1: modeling
Part 2: learning
Formalizing the object detection task

Many possible ways; this one is popular:
- Input: an image and a set of categories (cat, dog, chair, cow, person, motorbike, car, ...)
- Desired output: bounding boxes labeled with categories (person, motorbike, ...)
- Performance summary: Average Precision (AP); 0 is worst, 1 is perfect
Benchmark datasets

PASCAL VOC 2005-2012
- 54k objects in 22k images
- 20 object classes
- annual competition
Reduction to binary classification
Figure 2. Some sample images from our new human detection database. The subjects are always upright, but with some partial occlusions and a wide range of variations in pose, appearance, clothing, illumination and background.
probabilities to be distinguished more easily. We will often use miss rate at 10⁻⁴ FPPW as a reference point for results. This is arbitrary but no more so than, e.g., Area Under ROC. In a multiscale detector it corresponds to a raw error rate of about 0.8 false positives per 640×480 image tested. (The full detector has an even lower false positive rate owing to non-maximum suppression.) Our DET curves are usually quite shallow so even very small improvements in miss rate are equivalent to large gains in FPPW at constant miss rate. For example, for our default detector at 10⁻⁴ FPPW, every 1% absolute (9% relative) reduction in miss rate is equivalent to reducing the FPPW at constant miss rate by a factor of 1.57.
5 Overview of Results
Before presenting our detailed implementation and performance analysis, we compare the overall performance of our final HOG detectors with that of some other existing methods. Detectors based on rectangular (R-HOG) or circular log-polar (C-HOG) blocks and linear or kernel SVM are compared with our implementations of the Haar wavelet, PCA-SIFT, and shape context approaches. Briefly, these approaches are as follows:
Generalized Haar Wavelets. This is an extended set of oriented Haar-like wavelets similar to (but better than) that used in [17]. The features are rectified responses from 9×9 and 12×12 oriented 1st and 2nd derivative box filters at 45° intervals and the corresponding 2nd derivative xy filter.
PCA-SIFT. These descriptors are based on projecting gradient images onto a basis learned from training images using PCA [11]. Ke & Sukthankar found that they outperformed SIFT for key point based matching, but this is controversial [14]. Our implementation uses 16×16 blocks with the same derivative scale, overlap, etc., settings as our HOG descriptors. The PCA basis is calculated using positive training images.
Shape Contexts. The original Shape Contexts [1] used binary edge-presence voting into log-polar spaced bins, irrespective of edge orientation. We simulate this using our C-HOG descriptor (see below) with just 1 orientation bin. 16 angular and 3 radial intervals with inner radius 2 pixels and outer radius 8 pixels gave the best results. Both gradient-strength and edge-presence based voting were tested, with the edge threshold chosen automatically to maximize detection performance (the values selected were somewhat variable, in the region of 20-50 graylevels).
Results. Fig. 3 shows the performance of the various detectors on the MIT and INRIA data sets. The HOG-based detectors greatly outperform the wavelet, PCA-SIFT and Shape Context ones, giving near-perfect separation on the MIT test set and at least an order of magnitude reduction in FPPW on the INRIA one. Our Haar-like wavelets outperform MIT wavelets because we also use 2nd order derivatives and contrast normalize the output vector. Fig. 3(a) also shows MIT's best parts based and monolithic detectors (the points are interpolated from [17]), however beware that an exact comparison is not possible as we do not know how the database in [17] was divided into training and test parts and the negative images used are not available. The performances of the final rectangular (R-HOG) and circular (C-HOG) detectors are very similar, with C-HOG having the slight edge. Augmenting R-HOG with primitive bar detectors (oriented 2nd derivatives, R2-HOG) doubles the feature dimension but further improves the performance (by 2% at 10⁻⁴ FPPW). Replacing the linear SVM with a Gaussian kernel one improves performance by about 3% at 10⁻⁴ FPPW, at the cost of much higher run times¹. Using binary edge voting (EC-HOG) instead of gradient magnitude weighted voting (C-HOG) decreases performance by 5% at 10⁻⁴ FPPW, while omitting orientation information decreases it by much more, even if additional spatial or radial bins are added (by 33% at 10⁻⁴ FPPW, for both edges (E-ShapeC) and gradients (G-ShapeC)). PCA-SIFT also performs poorly. One reason is that, in comparison to [11], many more (80 of 512) principal vectors have to be retained to capture the same proportion of the variance. This may be because the spatial registration is weaker when there is no keypoint detector.
6 Implementation and Performance Study
We now give details of our HOG implementations and systematically study the effects of the various choices on detector performance.
¹We use the hard examples generated by linear R-HOG to train the kernel R-HOG detector, as kernel R-HOG generates so few false positives that its hard example set is too sparse to improve the generalization significantly.
pos = { ... ... }
neg = { ... background patches ... }
Descriptor Cues

[Panels: input image, avg. grad, weighted pos wts, weighted neg wts, outside/in block]

- The most important cues are head, shoulder, and leg silhouettes
- Vertical gradients inside the person count as negative
- Overlapping blocks just outside the contour are the most important

(Histograms of Oriented Gradients for Human Detection, p. 11/13)
Dalal & Triggs (CVPR05): HOG features + SVM sliding window detector
Sliding window detection
- Compute HOG of the whole image at multiple resolutions
- Score every subwindow of the feature pyramid
- Apply non-maximum suppression
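The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the Dalal & Triggs implementation: `score_all_windows` and `detect` are hypothetical helpers, HOG computation is assumed to happen elsewhere, and the pyramid is passed in as precomputed per-level feature maps.

```python
import numpy as np

def score_all_windows(feat, w):
    """Slide a linear filter w (hw x ww x d) over a feature map
    feat (H x W x d); returns an (H-hw+1) x (W-ww+1) score map.
    Each score is the dot product of the filter with the features
    of one subwindow, i.e. a linear window classifier."""
    hw, ww, d = w.shape
    H, W, _ = feat.shape
    scores = np.empty((H - hw + 1, W - ww + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            scores[y, x] = np.sum(w * feat[y:y + hw, x:x + ww])
    return scores

def detect(feature_pyramid, w, thresh=0.0):
    """Score every subwindow at every pyramid level and keep those
    above threshold; non-maximum suppression would follow."""
    detections = []
    for level, feat in enumerate(feature_pyramid):
        s = score_all_windows(feat, w)
        for y, x in zip(*np.where(s > thresh)):
            detections.append((level, y, x, s[y, x]))
    return detections
```

In practice the inner dot products are computed for all windows at once as a cross-correlation of the feature map with the filter, which is what makes dense multiscale scanning affordable.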
Image pyramid HOG feature pyramid
score(I, p) = w · φ(I, p)
Detection

- p specifies a window location; ~250,000 locations per image
- the test set has ~5,000 images
- => more than 1.3×10⁹ windows to classify
- typically only ~1,000 true positive locations
- => extremely unbalanced binary classification
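The standard way to cope with this imbalance, used by Dalal & Triggs and, in a margin-sensitive form, by the DPM training described later, is to alternate between training and mining hard negatives: scan the negative images with the current model, keep the windows it scores near or above the margin, retrain, and repeat. A minimal sketch of the mining step, with `score_fn` and `window_sampler` as hypothetical placeholders for the real detector and window enumerator:

```python
def mine_hard_negatives(score_fn, neg_images, window_sampler, thresh=-1.0):
    """Scan negative images and keep every window the current model
    scores above `thresh` (near or past the SVM margin): the hard
    negatives. The model is then retrained on all positives plus the
    growing hard-negative cache, and mining is repeated."""
    hard = []
    for img in neg_images:
        for feat in window_sampler(img):
            if score_fn(feat) > thresh:
                hard.append(feat)
    return hard
```

Since almost all of the ~10⁹ negative windows are scored far below the margin, each mining pass adds only a small cache of informative examples, which is what makes SVM training on this scale tractable.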
Dalal & Triggs detector on INRIA

[Precision-recall plots: (a) different descriptors on the INRIA static person database (Ker. R-HOG, Lin. R-HOG, Lin. R2-HOG, Wavelet, PCA-SIFT, Lin. E-ShapeC); (b) descriptors on the INRIA static+moving person database (R-HOG + IMHmd, R-HOG, Wavelet).]
Fig. 3.6. The performance of selected detectors on the INRIA static (left) and static+moving (right) person data sets. For both of the data sets, the plots show the substantial overall gains obtained by using HOG features rather than other state-of-the-art descriptors. (a) Compares static HOG descriptors with other state-of-the-art descriptors on the INRIA static person data set. (b) Compares the combined static and motion HOG, the static HOG, and the wavelet detectors on the combined INRIA static and moving person data set.
[2001] but also includes both 1st and 2nd-order derivative filters at 45° intervals and the corresponding 2nd derivative xy filter. It yields an AP of 0.53. Shape contexts based on edges (E-ShapeC) perform considerably worse with an AP of 0.25. However, Chapter 4 will show that generalised shape contexts [Mori and Malik 2003], which like standard shape contexts compute circular blocks with cells shaped over a log-polar grid, but which use both image gradients and orientation histograms as in R-HOG, give similar performance. This highlights the fact that orientation histograms are very effective at capturing the information needed for object recognition.
For the video sequences we compare our combined static and motion HOG, static HOG, and Haar wavelet detectors. The detectors were trained and tested on training and test portions of the combined INRIA static and moving person data set. Details on how the descriptors and the data sets were combined are presented in Chapter 6. Figure 3.6(b) summarises the results. The HOG-based detectors again significantly outperform the wavelet based one, but surprisingly the combined static and motion HOG detector does not seem to offer a significant advantage over the static HOG one: the static detector gives an AP of 0.553 compared to 0.527 for the motion detector. These results are surprising and disappointing because Sect. 6.5.2, where we used DET curves (c.f. Sect. B.1) for evaluations, shows that for exactly the same data set, the individual window classifier for the motion detector gives significantly better performance than the static HOG window classifier, with false positive rates about one order of magnitude lower than those for the static HOG classifier. We are not sure what is causing this anomaly and are currently investigating it. It seems to be linked to the threshold used for truncating the scores in the mean shift fusion stage (during non-maximum suppression) of the combined detector.
AP = 75% (79% in my implementation)
Very good! Declare victory and go home?
Dalal & Triggs on PASCAL VOC 2007
AP = 12% (using my implementation)
How can we do better?
Revisit an old idea: part-based models (pictorial structures)
- Fischler & Elschlager '73; Felzenszwalb & Huttenlocher '00
Combine with modern features and machine learning
Part-based models
Parts: local appearance templates. Springs: spatial connections between parts (geometric prior).
Image: [Felzenszwalb and Huttenlocher 05]
Part-based models
Local appearance is easier to model than global appearance
- Training data is shared across deformations
- A part can be local or global depending on resolution
Generalizes to previously unseen configurations
General formulation
A model is an undirected graph G = (V, E) over parts v_1, ..., v_n.
A configuration z = (p_1, ..., p_n) gives the part locations in the image (or feature pyramid).

Part configuration score function

score(p_1, ..., p_n) = sum_i m_i(p_i) - sum_{(i,j) in E} d_ij(p_i, p_j)

- m_i(p_i): part match scores
- d_ij(p_i, p_j): spring costs

Detection returns the highest scoring configurations.
Part configuration score function

score(p_1, ..., p_n) = sum_i m_i(p_i) - sum_{(i,j) in E} d_ij(p_i, p_j)

Objective: maximize the score over p_1, ..., p_n. There are h^n configurations (h = |P|, about 250,000), so exhaustive search is infeasible. Dynamic programming:
- If G = (V, E) is a tree: O(nh²) general algorithm
- O(nh) with some restrictions on d_ij
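The tree/star case can be made concrete with a small sketch. Assume precomputed match scores m_i over h discrete locations and explicit h×h spring-cost tables d_i (a hypothetical setup for illustration; `star_model_scores` is not from the lecture code). For a star model, the maximization over each part is independent given the root, which is exactly why the DP costs O(nh²) instead of O(h^n):

```python
import numpy as np

def star_model_scores(root_scores, part_scores, def_costs):
    """DP for a star-structured model over h discrete locations.
    root_scores: (h,) root match scores m_0(p0).
    part_scores: list of (h,) arrays m_i(pi), one per part.
    def_costs:   list of (h, h) arrays d_i(p0, pi), spring costs.
    Returns an (h,) array of best total scores per root location:
        score(p0) = m_0(p0) + sum_i max_{pi} [m_i(pi) - d_i(p0, pi)]
    Each inner max is an O(h^2) brute force here; generalized
    distance transforms bring it to O(h) for quadratic d_i."""
    total = root_scores.copy()
    for m_i, d_i in zip(part_scores, def_costs):
        # for every root placement p0, add the best placement of part i
        total += np.max(m_i[None, :] - d_i, axis=1)
    return total
```

Taking the max (and argmax) of the returned array recovers the highest scoring configuration; general trees work the same way, passing such max-messages from leaves to root.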
Star-structured deformable part models
test image star model detection
root part
Recall the Dalal & Triggs detector
- HOG feature pyramid
- Linear filter / sliding-window detector
- SVM training to learn parameters w
Image pyramid HOG feature pyramid
score(I, p) = w · φ(I, p)
D&T + parts
Add parts to the Dalal & Triggs detector:
- HOG features
- Linear filters / sliding-window detector
- Discriminative training
[FMR CVPR08] [FGMR PAMI10]

[Slide figure: sliding window DPM score function. An image pyramid is converted to a HOG feature pyramid; the root filter is placed at location p0, and the latent part placements z are scored at twice the root's resolution.]
A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb, University of Chicago, pff@cs.uchicago.edu
David McAllester, Toyota Technological Institute at Chicago, mcallester@tti-c.org
Deva Ramanan, UC Irvine, dramanan@ics.uci.edu
Abstract
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.
1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.
Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

This material is based upon work supported by the National Science Foundation under Grants No. 0534820 and 0535174.

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1, 3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by conceptually weaker models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining hard negative examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a
[Slide figure: image pyramid and HOG feature pyramid; root filter at location p0, latent part placements z at twice the root's resolution.]

Spring costs and filter scores:

z = (p1, ..., pn)

score(x, p0) = max_{p1,...,pn} [ Σ_{i=0}^{n} m_i(x, p_i) − Σ_{i=1}^{n} d_i(p0, p_i) ]
Detection in a slide
[Pipeline figure: the test image yields a feature map and a feature map at 2x resolution. The root filter is applied to the coarse map and the 1st through n-th part filters to the fine map; each part's response is transformed (max over displacements, minus the spring cost) and the transformed responses are summed with the response of the root filter, giving detection scores for each root location. Color encoding of filter response values: low value to high value.]
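The "transformed responses" box in the pipeline can be sketched in a few lines: at every anchor the part may shift, paying a quadratic spring cost. This is a brute-force stand-in (the real system computes the same map in linear time with a generalized distance transform); `transform_response` and its parameters are illustrative names, not the released code's API.

```python
import numpy as np

def transform_response(part_resp, dx_cost, dy_cost, max_disp=4):
    """Spread a part filter's response map: at each anchor, take the best
    nearby placement, i.e. response minus a quadratic deformation cost."""
    H, W = part_resp.shape
    out = np.full((H, W), -np.inf)
    for y in range(H):
        for x in range(W):
            for dy in range(-max_disp, max_disp + 1):
                for dx in range(-max_disp, max_disp + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        s = part_resp[yy, xx] - dx_cost * dx * dx - dy_cost * dy * dy
                        out[y, x] = max(out[y, x], s)
    return out

# toy example: one strong part response, one cell away from the anchor
resp = np.zeros((5, 5))
resp[2, 3] = 3.0
t = transform_response(resp, dx_cost=1.0, dy_cost=1.0)
print(t[2, 2])   # 3.0 - 1*1^2 = 2.0: the part "pulls in" from the neighbor
```

Summing these transformed maps with the root response gives the detection score for every root location in one pass.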
What are the parts?
Aspect soup
General philosophy: enrich models to better represent the data
class:        aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbike person plant sheep sofa train tv
Our rank        3    1    2    1    1     2    2    4    1     1    1     4    2     2     1      1     2     1    4    1
Our score     .180 .411 .092 .098 .249  .349 .396 .110 .155  .165 .110  .062 .301  .337  .267   .140  .141  .156 .206 .336
INRIA Normal  .092 .246 .012 .002 .068  .197 .265 .018 .097  .039 .017  .016 .225  .153  .121   .093  .002  .102 .157 .242
INRIA Plus    .136 .287 .041 .025 .077  .279 .294 .132 .106  .127 .067  .071 .335  .249  .092   .072  .011  .092 .242 .275
MPI Center    .060 .110 .028 .031 .000  .164 .172 .208 .002  .044 .049  .141 .198  .170  .091   .004  .091  .034 .237 .051
MPI ESSOL     .152 .157 .098 .016 .001  .186 .120 .240 .007  .061 .098  .162 .034  .208  .117   .002  .046  .147 .110 .054
TKK           .186 .078 .043 .072 .002  .116 .184 .050 .028  .100 .086  .126 .186  .135  .061   .019  .036  .058 .067 .090
Partial entries (these teams competed in a subset of classes): Darmstadt .301; IRISA .281 .318 .026 .097 .119 .289 .227 .221 .175 .253; Oxford .262 .409 .393 .432 .375 .334

Table 1. PASCAL VOC 2007 results. Average precision scores of our system and other systems that entered the competition [7]. Empty boxes indicate that a method was not tested in the corresponding class. The best score in each class is shown in bold. Our current system ranks first in 10 out of 20 classes. A preliminary version of our system ranked first in 6 classes in the official competition.
Bottle, Car, Bicycle, Sofa

Figure 4. Some models learned from the PASCAL VOC 2007 dataset. We show the total energy in each orientation of the HOG cells in the root and part filters, with the part filters placed at the center of the allowable displacements. We also show the spatial model for each part, where bright values represent cheap placements, and dark values represent expensive placements.
in the PASCAL competition was .16, obtained using a rigid template model of HOG features [5]. The best previous result of .19 adds a segmentation-based verification step [20]. Figure 6 summarizes the performance of several models we trained. Our root-only model is equivalent to the model from [5] and it scores slightly higher at .18. Performance jumps to .24 when the model is trained with a LSVM that selects a latent position and scale for each positive example. This suggests LSVMs are useful even for rigid templates because they allow for self-adjustment of the detection window in the training examples. Adding deformable parts increases performance to .34 AP, a factor of two above the best previous score. Finally, we trained a model with parts but no root filter and obtained .29 AP. This illustrates the advantage of using a multiscale representation.

We also investigated the effect of the spatial model and allowable deformations on the 2006 person dataset. Recall that s_i is the allowable displacement of a part, measured in HOG cells. We trained a rigid model with high-resolution parts by setting s_i to 0. This model outperforms the root-only system by .27 to .24. If we increase the amount of allowable displacements without using a deformation cost, we start to approach a bag-of-features. Performance peaks at s_i = 1, suggesting it is useful to constrain the part displacements. The optimal strategy allows for larger displacements while using an explicit deformation cost. The follow-
Mixture models
Data driven: aspect, occlusion modes, subclasses
FMR CVPR 08: AP = 0.27 (person)
FGMR PAMI 10: AP = 0.36 (person)
(a) Car component 1 (initial parts)
(b) Car component 1 (trained parts)
(c) Car component 2 (initial parts)
(d) Car component 2 (trained parts)
(e) Car component 3 (initial parts)
(f) Car component 3 (trained parts)
Figure 4.3: Car components with parts initialized by interpolating the root filter to twice its resolution (a,c,e), and parts after training with LSVM or WL-SSVM (b,d,f).
Pushmi-pullyu?

Good generalization properties on Doctor Dolittle's farm

This was supposed to detect horses

( + ) / 2 =
Latent orientation
Unsupervised left/right orientation discovery
FGMR PAMI 10: AP = 0.36 (person)
voc-release5: AP = 0.45 (person)
Publicly available code for the whole system: current voc-release5
[Bar chart: horse AP 0.42, 0.47, 0.57]
Summary of results
[DT05] AP 0.12
[FMR08] AP 0.27
[FGMR10] AP 0.36
[GFM voc-release5] AP 0.45
[GFM11] AP 0.49
Part 2: DPM parameter learning

[Diagram: fixed model structure with two components; the filters, deformation costs, and biases are shown as unknowns ("?")]

fixed model structure; training images with labels y = +1 (positives) and y = -1 (negatives)

Parameters to learn: biases (per component), deformation costs (per part), filter weights
Linear parameterization

Filter scores: m_i(x, p_i) = w_i · φ(x, p_i)
Spring costs: d_i(p0, p_i) = d_i · (dx, dy, dx², dy²)

z = (p1, ..., pn)

score(x, p0) = max_{p1,...,pn} [ Σ_{i=0}^{n} m_i(x, p_i) − Σ_{i=1}^{n} d_i(p0, p_i) ]
             = max_z w · Φ(x, (p0, z))
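The linear parameterization can be checked numerically: concatenating the appearance features with the negated deformation features yields a Φ(x, (p0, z)) whose dot product with w = (w_0, ..., w_n, d_1, ..., d_n) reproduces the score. A minimal sketch with made-up dimensions and hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(0)
n_parts, feat_dim = 2, 4
w_parts = [rng.normal(size=feat_dim) for _ in range(n_parts + 1)]  # root + parts
d = [rng.uniform(0.1, 1.0, size=4) for _ in range(n_parts)]        # deformation params

def score(features, displacements):
    """score = sum_i w_i . phi_i  -  sum_i d_i . (dx, dy, dx^2, dy^2)"""
    s = sum(w.dot(f) for w, f in zip(w_parts, features))
    s -= sum(di.dot([dx, dy, dx * dx, dy * dy])
             for di, (dx, dy) in zip(d, displacements))
    return s

def Phi(features, displacements):
    """Concatenate appearance features with negated deformation features."""
    defs = [-np.array([dx, dy, dx * dx, dy * dy]) for dx, dy in displacements]
    return np.concatenate(list(features) + defs)

w = np.concatenate(w_parts + d)                 # all parameters in one vector
feats = [rng.normal(size=feat_dim) for _ in range(n_parts + 1)]
disps = [(1, 0), (-1, 2)]
assert np.isclose(w.dot(Phi(feats, disps)), score(feats, disps))
print("w . Phi(x, (p0, z)) == score(x, z)")
```

Because the score is linear in w for any fixed z, the sliding-window detector is exactly f_w(x) = max_z w · Φ(x, z), which is what the latent SVM below optimizes.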
Positive examples (y = +1)

x specifies an image and a bounding box (e.g., around a person)

We want f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z) to score ≥ +1

Z(x) includes all z with more than 70% overlap with ground truth
Negative examples (y = -1)

x specifies an image and a HOG pyramid location p0

We want f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z) to score ≤ -1
Typical dataset

300 to 8,000 positive examples

500 million to 1 billion negative examples (not including latent configurations!)

Large-scale* (*unless someone from Google is here)
How we learn parameters: latent SVM

L(w) = ½‖w‖² + C Σ_i max{0, 1 − y_i f_w(x_i)}

Splitting the sum over positives P and negatives N:

L(w) = ½‖w‖² + C Σ_{i∈P} max{0, 1 − max_{z∈Z(x_i)} w·Φ(x_i, z)} + C Σ_{i∈N} max{0, 1 + max_{z∈Z(x_i)} w·Φ(x_i, z)}
How we learn parameters: latent SVM

L(w) = ½‖w‖² + C Σ_{i∈P} max{0, 1 − max_{z∈Z(x_i)} w·Φ(x_i, z)} + C Σ_{i∈N} max{0, 1 + max_{z∈Z(x_i)} w·Φ(x_i, z)}

[Plots: as a function of w, max_z w·Φ(x, z) is a max over linear functions, one per latent value z1, z2, z3, z4. The negative term (+ max) is convex; the positive term (− max) is concave :(]
How we learn parameters: latent SVM

Observations

The latent SVM objective is convex in the negatives, but not in the positives

⇒ "semi-convex"
Convex upper bound on loss

At the current w, fix the latent value of each positive example:
ZP_i = argmax_{z∈Z(x_i)} w·Φ(x_i, z)   (e.g., ZP_i = z2 in the plot)

Since max_{z∈Z(x_i)} w·Φ(x_i, z) ≥ w·Φ(x_i, ZP_i):

max{0, 1 − max_{z∈Z(x_i)} w·Φ(x_i, z)} ≤ max{0, 1 − w·Φ(x_i, ZP_i)}   ← convex
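The bound can be verified numerically on a toy positive example: fixing z to the argmax at the current w gives a convex function that touches the true loss there and upper-bounds it everywhere else. A sketch under made-up data (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
# one positive example with 4 candidate latent placements, each a feature vector
Z = rng.normal(size=(4, 6))

def pos_loss(w):
    """True positive-example loss: max{0, 1 - max_z w.Phi(x, z)}."""
    return max(0.0, 1.0 - np.max(Z @ w))

w_cur = rng.normal(size=6)
zp = int(np.argmax(Z @ w_cur))   # latent value chosen at the current w

def bound(w):
    """Convex upper bound: latent value frozen to zp."""
    return max(0.0, 1.0 - Z[zp] @ w)

# the bound touches the loss at w_cur and upper-bounds it at every other w tried
assert np.isclose(pos_loss(w_cur), bound(w_cur))
for _ in range(100):
    w = rng.normal(size=6)
    assert pos_loss(w) <= bound(w) + 1e-12
print("bound is tight at the current w and valid elsewhere")
```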
Auxiliary objective

Let ZP = {ZP_1, ZP_2, ...}

L(w, ZP) = ½‖w‖² + C Σ_{i∈P} max{0, 1 − w·Φ(x_i, ZP_i)} + C Σ_{i∈N} max{0, 1 + max_{z∈Z(x_i)} w·Φ(x_i, z)}

Note that L(w, ZP) ≥ min_{ZP} L(w, ZP) = L(w), and min_{w,ZP} L(w, ZP) = min_w L(w)
Auxiliary objective

min_{w,ZP} L(w, ZP) = min_w L(w)

This isn't any easier to optimize

Find a stationary point by coordinate descent on L(w, ZP)

Initialization: pick a w^(0) (or a ZP)

Step 1: ZP_i := argmax_{z∈Z(x_i)} w^(t) · Φ(x_i, z)   ∀ i ∈ P

Step 2: w^(t+1) := argmin_w L(w, ZP)
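The two steps can be put together in a toy coordinate-descent loop. This is only a sketch: tiny random "bags" of feature vectors stand in for the latent configurations Z(x_i), and plain subgradient descent stands in for the real Step 2 solver.

```python
import numpy as np

rng = np.random.default_rng(2)
D, C = 5, 1.0

# toy data: each positive is a bag of candidate latent feature vectors;
# each negative is a single feature vector
pos = [rng.normal(loc=+1, size=(3, D)) for _ in range(20)]
neg = [rng.normal(loc=-1, size=D) for _ in range(40)]

def objective(w, zp):
    """L(w, ZP): regularizer + hinge on positives (latent fixed) + negatives."""
    obj = 0.5 * w @ w
    obj += C * sum(max(0.0, 1.0 - bag[z] @ w) for bag, z in zip(pos, zp))
    obj += C * sum(max(0.0, 1.0 + x @ w) for x in neg)
    return obj

w = np.zeros(D)
for it in range(10):
    # Step 1 (detection): pick the best latent value for each positive
    zp = [int(np.argmax(bag @ w)) for bag in pos]
    # Step 2: minimize the now-convex objective with zp fixed; here by
    # plain subgradient descent (the real system uses a tuned SGD solver)
    for step in range(200):
        g = w.copy()
        for bag, z in zip(pos, zp):
            if bag[z] @ w < 1:
                g -= C * bag[z]
        for x in neg:
            if x @ w > -1:
                g += C * x
        w -= 0.01 / (1 + step) * g
print("final auxiliary objective:", float(objective(w, zp)))
```

Each outer iteration can only lower (or keep) L(w, ZP), so the procedure reaches a stationary point of the latent SVM objective.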
Step 1

This is just detection: ZP_i := argmax_{z∈Z(x_i)} w^(t) · Φ(x_i, z)   ∀ i ∈ P

[Pipeline figure as before: root and part filter responses on the feature pyramid; transformed part responses summed into detection scores for each root location]
Step 2

w^(t+1) := argmin_w ½‖w‖² + C Σ_{i∈P} max{0, 1 − w·Φ(x_i, ZP_i)} + C Σ_{i∈N} max{0, 1 + max_{z∈Z(x_i)} w·Φ(x_i, z)}

Convex; similar to a structural SVM

But, recall: 500 million to 1 billion negative examples!

Can be solved by a working set method (bootstrapping, "data mining", constraint generation); requires a bit of engineering to make this fast
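The working-set idea can be sketched as alternating between training on a small cache of negatives and re-mining the full pool for hard ones (negatives scoring above the −1 margin). This is a toy stand-in with an illustrative subgradient solver, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
D, C = 5, 1.0
pos = rng.normal(loc=+1, size=(30, D))           # latent values already fixed (Step 1)
neg_pool = rng.normal(loc=-1, size=(100000, D))  # stand-in for ~10^9 negatives

def train(P, N):
    """Tiny hinge-loss SVM via subgradient descent (placeholder solver)."""
    w = np.zeros(D)
    for t in range(300):
        g = w.copy()
        g -= C * P[P @ w < 1].sum(axis=0)        # positives inside the margin
        g += C * N[N @ w > -1].sum(axis=0)       # negatives inside the margin
        w -= 0.02 / (1 + t) * g
    return w

cache = neg_pool[:200]                           # small initial working set
for round_ in range(5):
    w = train(pos, cache)
    cache = cache[cache @ w > -1]                # shrink: drop easy negatives
    hard = neg_pool[neg_pool @ w > -1]           # grow: data-mine hard negatives
    cache = np.unique(np.vstack([cache, hard[:500]]), axis=0)
print("working set size:", len(cache), "of", len(neg_pool))
```

The full pool is only ever touched by the cheap scoring pass; the expensive solver sees just the cache, which is what makes billions of negatives tractable.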
Comments

Latent SVM is mathematically equivalent to MI-SVM (Andrews et al. NIPS 2003)

[Figure: each example x_i is a bag of instances {x_i1, x_i2, x_i3} with latent labels {z1, z2, z3}]

Latent SVM can be written as a latent structural SVM (Yu and Joachims ICML 2009); its natural optimization algorithm is the concave-convex procedure (CCCP), similar to, but not exactly the same as, coordinate descent
What about the model structure?

[Diagram as before: fixed model structure with two components and unknown filters; training images with labels y = +1 / -1]

Model structure: # components; # parts per component; root and part filter shapes; part anchor locations
Learning model structure
Split positives by aspect ratio
Warp to common size
Train Dalal & Triggs model for each aspect ratio on its own
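The first structure-learning step, splitting the positives by aspect ratio, can be sketched as quantile bucketing of the positive bounding boxes. The exact split rule here is an assumption for illustration (the released code splits on aspect statistics in its own way), and `split_by_aspect` is a hypothetical helper:

```python
import numpy as np

def split_by_aspect(boxes, n_components=2):
    """Assign each positive box (x1, y1, x2, y2) to a mixture component by
    aspect ratio, using quantile edges so components get roughly equal data."""
    aspects = np.array([(x2 - x1) / (y2 - y1) for x1, y1, x2, y2 in boxes])
    edges = np.quantile(aspects, np.linspace(0, 1, n_components + 1))
    comp = np.searchsorted(edges, aspects, side="right") - 1
    return np.clip(comp, 0, n_components - 1)

# two tall boxes and two wide boxes -> two components
boxes = [(0, 0, 50, 100), (0, 0, 55, 100), (0, 0, 120, 60), (0, 0, 100, 50)]
print(split_by_aspect(boxes))
```

Each component's positives are then warped to a common size and used to train one Dalal & Triggs root filter.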
Learning model structure
Use D&T filters as initial w for LSVM training
Merge components
Root filter placement and component choice are latent
Learning model structure
Add parts to cover high-energy areas of root filters
Continue training model with LSVM
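The "add parts to cover high-energy areas" heuristic can be sketched as greedy placement of fixed-size anchors over the upsampled root-filter energy, zeroing out each covered region so parts spread out. A simplified version under assumed names (`init_parts` is illustrative, not the released code):

```python
import numpy as np

def init_parts(root_energy, n_parts=2, ph=2, pw=2):
    """Greedily place n_parts anchors of size ph x pw over the highest-energy
    regions of the 2x-interpolated root filter energy map."""
    E = np.kron(root_energy, np.ones((2, 2)))   # crude 2x upsampling
    H, W = E.shape
    anchors = []
    for _ in range(n_parts):
        best, best_yx = -1.0, (0, 0)
        for y in range(H - ph + 1):
            for x in range(W - pw + 1):
                s = E[y:y + ph, x:x + pw].sum()
                if s > best:
                    best, best_yx = s, (y, x)
        y, x = best_yx
        E[y:y + ph, x:x + pw] = 0.0             # don't reuse the same area
        anchors.append(best_yx)
    return anchors

energy = np.array([[1.0, 0.1],
                   [0.1, 1.0]])
print(init_parts(energy))   # parts land on the two high-energy corners
```

These anchors become the parts' default placements; LSVM training then refines the filters and deformation costs around them.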
Learning model structure
without orientation clustering
with orientation clustering
Learning model structure
In summary: repeated application of LSVM training to models of increasing complexity; structure learning involves many heuristics (and vision insight!)