MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Object Recognition and Scene Understanding

student presentationMIT

Template matching and histograms

Nicolas Pinto

Introduction

Hostsa guy...

(who has big arms)

HostsAntonio T...

(who knows a lot about vision)

a guy...

(who has big arms)

HostsAntonio T...

a frog...

(who has big eyes)

a guy...

(who has big arms)

HostsAntonio T...

a frog...

(who has big eyes)and thus should knowa lot about vision...

a guy...

(who has big arms)

3 papers

Object Recognition from Local Scale-Invariant Features

David G. Lowe

Computer Science Department

University of British Columbia

Vancouver, B.C., V6T 1Z4, Canada

lowe@cs.ubc.ca

Abstract

Proc. of the International Conference on

Computer Vision, Corfu (Sept. 1999)

An object recognition system has been developed that uses a

new class of local image features. The features are invariant

to image scaling, translation, and rotation, and partially in-

variant to illumination changes and affine or 3D projection.

These features share similar properties with neurons in in-

ferior temporal cortex that are used for object recognition

in primate vision. Features are efficiently detected through

a staged filtering approach that identifies stable points in

scale space. Image keys are created that allow for local ge-

ometric deformations by representing blurred image gradi-

ents in multiple orientation planes and at multiple scales.

The keys are used as input to a nearest-neighbor indexing

method that identifies candidate object matches. Final veri-

fication of each match is achieved by finding a low-residual

least-squares solution for the unknown model parameters.

Experimental results show that robust object recognition

can be achieved in cluttered partially-occluded images with

a computation time of under 2 seconds.

1. Introduction

Object recognition in cluttered real-world scenes requires

local image features that are unaffected by nearby clutter or

partial occlusion. The features must be at least partially in-

variant to illumination, 3D projective transforms, and com-

mon object variations. On the other hand, the features must

also be sufficiently distinctive to identify specific objects

among many alternatives. The difficulty of the object recog-

nition problem is due in large part to the lack of success in

finding such image features. However, recent research on

the use of dense local features (e.g., Schmid & Mohr [19])

has shown that efficient recognition can often be achieved

by using local image descriptors sampled at a large number

of repeatable locations.

This paper presents a new method for image feature gen-

eration called the Scale Invariant Feature Transform (SIFT).

This approach transforms an image into a large collection

of local feature vectors, each of which is invariant to image

translation, scaling, and rotation, and partially invariant to

illumination changes and affine or 3D projection. Previous

approaches to local feature generation lacked invariance to

scale and were more sensitive to projective distortion and

illumination change. The SIFT features share a number of

properties in common with the responses of neurons in infe-

rior temporal (IT) cortex in primate vision. This paper also

describes improved approaches to indexing and model ver-

ification.

The scale-invariant features are efficiently identified by

using a staged filtering approach. The first stage identifies

key locations in scale space by looking for locations that

are maxima orminima of a difference-of-Gaussian function.

Each point is used to generate a feature vector that describes

the local image region sampled relative to its scale-space co-

ordinate frame. The features achieve partial invariance to

local variations, such as affine or 3D projections, by blur-

ring image gradient locations. This approach is based on a

model of the behavior of complex cells in the cerebral cor-

tex of mammalian vision. The resulting feature vectors are

called SIFT keys. In the current implementation, each im-

age generates on the order of 1000 SIFT keys, a process that

requires less than 1 second of computation time.

The SIFT keys derived from an image are used in a

nearest-neighbour approach to indexing to identify candi-

date object models. Collections of keys that agree on a po-

tential model pose are first identified through a Hough trans-

formhash table, and then througha least-squares fit to a final

estimate of model parameters. When at least 3 keys agree

on the model parameters with low residual, there is strong

evidence for the presence of the object. Since there may be

dozens of SIFT keys in the image of a typical object, it is

possible to have substantial levels of occlusion in the image

and yet retain high levels of reliability.

The current object models are represented as 2D loca-

tions of SIFT keys that can undergo affine projection. Suf-

ficient variation in feature location is allowed to recognize

perspective projection of planar shapes at up to a 60 degree

rotation away from the camera or to allow up to a 20 degree

rotation of a 3D object.

Lowe(1999)

3 papers

David G. Lowe

lowe@cs.ubc.ca

Abstract

1. Introduction

ification.

Lowe(1999)

Histograms of Oriented Gradients for Human DetectionNavneet Dalal and Bill Triggs

INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

AbstractWe study the question of feature sets for robust visual ob-

ject recognition, adopting linear SVM based human detec-tion as a test case. After reviewing existing edge and gra-dient based descriptors, we show experimentally that gridsof Histograms of Oriented Gradient (HOG) descriptors sig-nificantly outperform existing feature sets for human detec-tion. We study the influence of each stage of the computationon performance, concluding that fine-scale gradients, fineorientation binning, relatively coarse spatial binning, andhigh-quality local contrast normalization in overlapping de-scriptor blocks are all important for good results. The newapproach gives near-perfect separation on the original MITpedestrian database, so we introduce a more challengingdataset containing over 1800 annotated human images witha large range of pose variations and backgrounds.

1 IntroductionDetecting humans in images is a challenging task owing

to their variable appearance and the wide range of poses thatthey can adopt. The first need is a robust feature set thatallows the human form to be discriminated cleanly, even incluttered backgrounds under difficult illumination. We studythe issue of feature sets for human detection, showing that lo-cally normalized Histogram of Oriented Gradient (HOG) de-scriptors provide excellent performance relative to other ex-isting feature sets including wavelets [17,22]. The proposeddescriptors are reminiscent of edge orientation histograms[4,5], SIFT descriptors [12] and shape contexts [1], but theyare computed on a dense grid of uniformly spaced cells andthey use overlapping local contrast normalizations for im-proved performance. We make a detailed study of the effectsof various implementation choices on detector performance,taking “pedestrian detection” (the detection of mostly visiblepeople in more or less upright poses) as a test case. For sim-plicity and speed, we use linear SVM as a baseline classifierthroughout the study. The new detectors give essentially per-fect results on theMIT pedestrian test set [18,17], so we havecreated a more challenging set containing over 1800 pedes-trian images with a large range of poses and backgrounds.Ongoing work suggests that our feature set performs equallywell for other shape-based object classes.

We briefly discuss previous work on human detection in§2, give an overview of our method §3, describe our datasets in §4 and give a detailed description and experimentalevaluation of each stage of the process in §5–6. The mainconclusions are summarized in §7.

2 Previous WorkThere is an extensive literature on object detection, but

here we mention just a few relevant papers on human detec-tion [18,17,22,16,20]. See [6] for a survey. Papageorgiou etal [18] describe a pedestrian detector based on a polynomialSVM using rectified Haar wavelets as input descriptors, witha parts (subwindow) based variant in [17]. Depoortere et algive an optimized version of this [2]. Gavrila & Philomen[8] take a more direct approach, extracting edge images andmatching them to a set of learned exemplars using chamferdistance. This has been used in a practical real-time pedes-trian detection system [7]. Viola et al [22] build an efficientmoving person detector, using AdaBoost to train a chain ofprogressively more complex region rejection rules based onHaar-like wavelets and space-time differences. Ronfard etal [19] build an articulated body detector by incorporatingSVM based limb classifiers over 1st and 2nd order Gaussianfilters in a dynamic programming framework similar to thoseof Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth[9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholdedgradient magni-tudes to build a parts based method containing detectors forfaces, heads, and front and side profiles of upper and lowerbody parts. In contrast, our detector uses a simpler archi-tecture with a single detection window, but appears to givesignificantly higher performance on pedestrian images.

3 Overview of the MethodThis section gives an overview of our feature extraction

chain, which is summarized in fig. 1. Implementation detailsare postponed until §6. The method is based on evaluatingwell-normalized local histograms of image gradient orienta-tions in a dense grid. Similar features have seen increasinguse over the past decade [4,5,12,15]. The basic idea is thatlocal object appearance and shape can often be characterizedrather well by the distribution of local intensity gradients or

Nalal and Triggs (2005)

3 papers

David G. Lowe

lowe@cs.ubc.ca

Abstract

1. Introduction

ification.

Lowe(1999)

3 papers

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro FelzenszwalbUniversity of Chicagopff@cs.uchicago.edu

David McAllesterToyota Technological Institute at Chicago

mcallester@tti-c.org

Deva RamananUC Irvine

dramanan@ics.uci.edu

Abstract

This paper describes a discriminatively trained, multi-scale, deformable part model for object detection. Our sys-tem achieves a two-fold improvement in average precisionover the best performance in the 2006 PASCAL person de-tection challenge. It also outperforms the best results in the2007 challenge in ten out of twenty categories. The systemrelies heavily on deformable parts. While deformable partmodels have become quite popular, their value had not beendemonstrated on difficult benchmarks such as the PASCALchallenge. Our system also relies heavily on new methodsfor discriminative training. We combine a margin-sensitiveapproach for data mining hard negative examples with aformalism we call latent SVM. A latent SVM, like a hid-den CRF, leads to a non-convex training problem. How-ever, a latent SVM is semi-convex and the training prob-lem becomes convex once latent information is specified forthe positive examples. We believe that our training meth-ods will eventually make possible the effective use of morelatent information such as hierarchical (grammar) modelsand models involving latent three dimensional pose.

1. IntroductionWe consider the problem of detecting and localizing ob-

jects of a generic category, such as people or cars, in staticimages. We have developed a new multiscale deformablepart model for solving this problem. The models are trainedusing a discriminative procedure that only requires bound-ing box labels for the positive examples. Using these mod-els we implemented a detection system that is both highlyefficient and accurate, processing an image in about 2 sec-onds and achieving recognition rates that are significantlybetter than previous systems.

Our system achieves a two-fold improvement in averageprecision over the winning system [5] in the 2006 PASCALperson detection challenge. The system also outperformsthe best results in the 2007 challenge in ten out of twenty

This material is based upon work supported by the National ScienceFoundation under Grant No. 0534820 and 0535174.

Figure 1. Example detection obtained with the person model. Themodel is defined by a coarse template, several higher resolutionpart templates and a spatial model for the location of each part.

object categories. Figure 1 shows an example detection ob-tained with our person model.

The notion that objects can be modeled by parts in a de-formable configuration provides an elegant framework forrepresenting object categories [1–3, 6, 10, 12, 13,15, 16, 22].While these models are appealing from a conceptual pointof view, it has been difficult to establish their value in prac-tice. On difficult datasets, deformable models are often out-performed by “conceptually weaker” models such as rigidtemplates [5] or bag-of-features [23]. One of our main goalsis to address this performance gap.

Our models include both a coarse global template cov-ering an entire object and higher resolution part templates.The templates represent histogram of gradient features [5].As in [14, 19, 21], we train models discriminatively. How-ever, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection.We also describe a simple and effective strategy for learn-ing parts from weakly-labeled data. In contrast to computa-tionally demanding approaches such as [4], we can learn amodel in 3 hours on a single CPU.

Another contribution of our work is a new methodologyfor discriminative training. We generalize SVMs for han-dling latent variables such as part positions, and introduce anew method for data mining “hard negative” examples dur-ing training. We believe that handling partially labeled datais a significant issue in machine learning for computer vi-sion. For example, the PASCAL dataset only specifies a

Felzenszwalb et al.(2008)

David G. Lowe

lowe@cs.ubc.ca

Abstract

1. Introduction

ification.

Lowe(1999)

ject recognition, adopting linear SVM based human detec-tion as a test case. After reviewing existing edge and gra-dient based descriptors, we show experimentally that gridsof Histograms of Oriented Gradient (HOG) descriptors sig-nificantly outperform existing feature sets for human detec-tion. We study the influence of each stage of the computation

Abstract

This paper describes a discriminatively trained, multi-scale, deformable part model for object detection. Our sys-tem achieves a two-fold improvement in average precisionover the best performance in the 2006 PASCAL person de-tection challenge. It also outperforms the best results in the2007 challenge in ten out of twenty categories. The systemrelies heavily on deformable parts. While deformable partmodels have become quite popular, their value had not beendemonstrated on difficult benchmarks such as the PASCAL

Figure 1. Example detection obtained with the person model. Themodel is defined by a coarse template, several higher resolution

Scale-Invariant Feature Transform (SIFT)

adapted from Kucuktunc

Scale-Invariant Feature Transform (SIFT)

adapted from Brown, ICCV 2003

SIFT local features are

invariant...

adapted from David Lee

like me they are robust...

... to changes in illumination, noise, viewpoint, occlusion, etc.

I am sure you want to know

how to build them

Text1. find interest points or “keypoints”

how to build them

2. find their dominant orientation

how to build them

3. compute their descriptor

how to build them

4. match them on other images

keypoints are taken as maxima/minima

of a DoG pyramid

in this settings, extremas are invariant to scale...

a DoG (Difference of Gaussians) pyramidis simple to compute...

even him can do it!

before after

adapted from Pallus and Fleishman

then we just have to find neighborhood extremasin this 3D DoG space

if a pixel is an extremain its neighboring regionhe becomes a candidate

keypoint

too manykeypoints?

adapted from wikipedia

too manykeypoints?

1. remove low contrast

too manykeypoints?

2. remove edges

too manykeypoints?

2. remove edges

each selected keypoint is assigned to one or more“dominant” orientations...

... this step is important to achieve rotation invariance

How?using the DoG pyramid to achieve scale invariance:

a. compute image gradient magnitude and orientation

b. build an orientation histogram

c. keypoint’s orientation(s) = peak(s)

adapted from Ofir Pele

c. keypoint’s orientation(s) = peak(s)

* the peak ;-)

SIFT descriptor= a set of orientation histograms

4x4 array x 8 bins= 128 dimensions (normalized)

16x16 neighborhood of pixel gradients

4. match them on other images

How to atch?

nearest neighbor

How to atch?

nearest neighborhough transform voting

How to atch?

nearest neighborhough transform votingleast-squares fit

How to atch?

nearest neighborhough transform votingleast-squares fitetc.

SIFT is great!

Text\\ invariant to affine transformations

SIFT is great!

\\ easy to understand

SIFT is great!

\\ easy to understand

\\ fast to compute

Extension example: Spatial Pyramid Matching using SIFT

Beyond Bags of Features: Spatial Pyramid Matchingfor Recognizing Natural Scene Categories

Svetlana Lazebnik1

slazebni@uiuc.edu1Beckman Institute

University of Illinois

Cordelia Schmid2

Cordelia.Schmid@inrialpes.fr2INRIA Rhone-AlpesMontbonnot, France

Jean Ponce1,3

ponce@cs.uiuc.edu3Ecole Normale Superieure

Paris, France

CVPR 2006

David G. Lowe

lowe@cs.ubc.ca

Abstract

Lowe(1999)

Abstract

This paper describes a discriminatively trained, multi-scale, deformable part model for object detection. Our sys-tem achieves a two-fold improvement in average precisionover the best performance in the 2006 PASCAL person de-tection challenge. It also outperforms the best results in the2007 challenge in ten out of twenty categories. The systemrelies heavily on deformable parts. While deformable partmodels have become quite popular, their value had not beendemonstrated on difficult benchmarks such as the PASCAL

Figure 1. Example detection obtained with the person model. Themodel is defined by a coarse template, several higher resolution

first of all, let me put this paper in context

histograms of local image measurement have been quite successful

Swain & Ballard 1991 - Color Histograms

Schiele & Crowley 1996 - Receptive Fields Histograms

Lowe 1999 - SIFT

Schneiderman & Kanade 2000 - Localized Histograms of Wavelets

Leung & Malik 2001 - Texton Histograms

Belongie et al. 2002 - Shape Context

Dalal & Triggs 2005 - Dense Orientation Histograms

λ λ λ

tons of “feature sets” have been proposed

features

Gravrila & Philomen 1999 - Edge Templates + Nearest Neighbor

Papageorgiou & Poggio 2000, Mohan et al. 2001, DePoortere et al. 2002 - Haar Wavelets + SVM

Viola & Jones 2001 - Rectangular Differential Features + AdaBoost

Mikolajczyk et al. 2004 - Parts Based Histograms + AdaBoost

Ke & Sukthankar 2004 - PCA-SIFT

localizing humans in images is achallenging task...

difficult!

Wide variety of articulated poses

Variable appearance/clothing

Complex backgrounds

Unconstrained illuminations

Occlusions

Different scales

Approach

•robust feature set (HOG)

Approach

•simple classifier (linear SVM)

Approach

•simple classifier (linear SVM)

•fast detection (sliding window)

adapted from Bill Triggs

• Gamma normalization

• Space: RGB, LAB or Gray

• Method: SQRT or LOG

• Filtering with simple masks

uncentered

centered

cubic-corrected

diagonal

uncentered

centered

cubic-corrected

diagonal

* centered performs the best

• Filtering with simple masks

uncentered

centered

cubic-corrected

diagonal

remember SIFT ?

...after filtering, each “pixel” represents an oriented gradient...

...pixels are regrouped in “cells”, they cast a weighted vote for an orientation histogram...

HOG (Histogram of Oriented Gradients)

a window can be represented like that

then, cells are locally normalized using overlapping “blocks”

they used two types of blocks

• rectangular

• similar to SIFT (but dense)

they used two types of blocks

• rectangular

• similar to SIFT (but dense)

• circular

• similar to Shape Context

and four different types of block normalization

like SIFT, they gain invariance...

...to illuminations, small deformations, etc.

finally, a sliding window is

classified by a simple linear SVM

during the learning phase, the algorithm “looked” for hard examples

Training

adapted from Martial Hebert

average gradients

positive weights negative weights

Example

adapted from Bill Triggs

Example

adapted from Martial Hebert

Further Development

• Detection on Pascal VOC (2006)

Further Development

• Human Detection in Movies (ECCV 2006)

Further Development

• US Patent by MERL (2006)

Further Development

• US Patent by MERL (2006)

• Stereo Vision HoG (ICVES 2008)

Extension example:

Pyramid HoG++

Extension example:

Pyramid HoG++

Extension example:

Pyramid HoG++

A simple demo...

VIDEO HERE

A simple demo...

VIDEO HERE

so, it doesn’t work ?!?

no no, it works...

...it just doesn’t work well...