
Page 1: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado

Stanford CS223B Computer Vision, Winter 2006

Lecture 14: Object Detection and Classification Using Machine Learning

Gary Bradski, Intel, Stanford. CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado

“Who will be strong and stand with me? Beyond the barricade, Is there a world you long to see?”

-- Enjolras, “Do You Hear the People Sing?”, Les Misérables

Page 2: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado

This guy is wearing a haircut called a “Mullet”

Fast, accurate and general object recognition …

Page 3: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Find the Mullets…

Rapid Learning and Generalization

Page 4: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado

Approaches to Recognition

The approaches can be organized along two axes: geometric vs. non-geometric (whether relations between features are modeled) and local vs. global features:

– Patches / Ullman
– Histograms / Schiele
– HMAX / Poggio
– Constellation / Perona
– Eigen objects / Turk
– Shape models
– MRF / Freeman, Murphy

We’ll see a few of these …

Page 5: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Eigenfaces: find a new coordinate system that best captures the scatter of the data. Eigenvectors point in the directions of scatter, ordered by the magnitude of the eigenvalues. We can typically prune the number of eigenvectors to a few dozen.

Global

Page 6: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Eigenfaces, the algorithm

The database: M training faces, each flattened into a column vector of length N², e.g.

\( \mathbf{a} = (a_1, a_2, \ldots, a_{N^2})^T, \; \mathbf{b} = (b_1, b_2, \ldots, b_{N^2})^T, \; \ldots, \; \mathbf{h} = (h_1, h_2, \ldots, h_{N^2})^T \)

[slide credit: Alexander Roth]

Assumptions: square images with W = H = N; M is the number of images in the database; P is the number of persons in the database.

Global

Page 7: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Eigenfaces, the algorithm

We compute the average face (here M = 8):

\( \mathbf{m} = \frac{1}{M} \sum_{i=1}^{M} \mathbf{x}_i = \frac{1}{8}(\mathbf{a} + \mathbf{b} + \cdots + \mathbf{h}) \)

Then subtract it from the training faces:

\( \mathbf{a} - \mathbf{m}, \; \mathbf{b} - \mathbf{m}, \; \ldots, \; \mathbf{h} - \mathbf{m} \)

[slide credit: Alexander Roth]

Global

Page 8: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado

Now we build the matrix A, which is N² by M:

\( A = [\, \mathbf{a}-\mathbf{m} \;\; \mathbf{b}-\mathbf{m} \;\; \cdots \;\; \mathbf{h}-\mathbf{m} \,] \)

The covariance matrix, which is N² by N²:

\( C = A A^T \)

[slide credit: Alexander Roth]

Find the eigenvalues of the covariance matrix:
– The matrix is very large
– The computational effort is very big

We are interested in at most M eigenvalues:
– We can reduce the dimension of the matrix

Global

Page 9: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Eigenvalue Theorem

Define \( C = A A^T \), of dimension N² by N², and \( L = A^T A \), of dimension M by M (e.g., 8 by 8).

Let \( \mathbf{v} \) be an eigenvector of \( L \): \( L\mathbf{v} = \lambda\mathbf{v} \). Then \( A\mathbf{v} \) is an eigenvector of \( C \): \( C(A\mathbf{v}) = \lambda(A\mathbf{v}) \).

Proof: \( C(A\mathbf{v}) = (A A^T)(A\mathbf{v}) = A(A^T A)\mathbf{v} = A(L\mathbf{v}) = A(\lambda\mathbf{v}) = \lambda(A\mathbf{v}) \).

[slide credit: Alexander Roth]

Global

This vast dimensionality reduction is what makes the whole thing work.

Page 10: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Eigenfaces, the algorithm

Compute another matrix, which is M by M:

\( L = A^T A \)

Find the M eigenvalues and eigenvectors:
– Eigenvectors of C and L are equivalent

Build matrix V from the eigenvectors of L. The eigenvectors of C are linear combinations of the image space with the eigenvectors of L:

\( U = A V \)

The eigenvectors represent the variation in the faces.

[slide credit: Alexander Roth]

Global
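As a concrete illustration of the last three pages, here is a minimal NumPy sketch of the eigenface computation via the small M-by-M matrix; the sizes and variable names are illustrative, not from the original slides.

```python
import numpy as np

# Toy sizes, for illustration only: M = 8 faces of N x N = 32 x 32 pixels.
N, M = 32, 8
rng = np.random.default_rng(0)
faces = rng.random((M, N * N))        # each row is one flattened face

m = faces.mean(axis=0)                # the average face
A = (faces - m).T                     # N^2 x M matrix of centered faces

L = A.T @ A                           # M x M -- small, instead of N^2 x N^2
eigvals, V = np.linalg.eigh(L)        # eigenpairs of L, ascending order
order = np.argsort(eigvals)[::-1]     # re-sort by decreasing eigenvalue
eigvals, V = eigvals[order], V[:, order]

k = M - 1                             # centered data has rank at most M - 1
U = A @ V[:, :k]                      # eigenfaces of C = A A^T, via U = A V
U /= np.linalg.norm(U, axis=0)        # normalize each eigenface

# Check the eigenvalue theorem from Page 9: C (A v) = lambda (A v).
C = A @ A.T
assert np.allclose(C @ U[:, 0], eigvals[0] * U[:, 0])
```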

Page 11: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado

Eigenfaces, the algorithm

Compute for each face its projection onto the face space:

\( \Omega_1 = U^T(\mathbf{a}-\mathbf{m}), \; \Omega_2 = U^T(\mathbf{b}-\mathbf{m}), \; \ldots, \; \Omega_8 = U^T(\mathbf{h}-\mathbf{m}) \)

Compute the between-class threshold:

\( \theta = \tfrac{1}{2} \max_{i,j} \lVert \Omega_i - \Omega_j \rVert \quad \text{for } i, j = 1 \ldots M \)

[slide credit: Alexander Roth]

Global

Page 12: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Example

Photobook, MIT

[Note: sharper]

[Image panels: example set; eigenfaces; normalized eigenfaces]

Global

Page 13: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado

Eigenfaces, the algorithm in use

To recognize a face \( \mathbf{r} \), subtract the average face from it:

\( \mathbf{r} - \mathbf{m} = (r_1 - m_1, \; r_2 - m_2, \; \ldots, \; r_{N^2} - m_{N^2})^T \)

Compute its projection onto the face space:

\( \Omega = U^T(\mathbf{r} - \mathbf{m}) \)

Compute the distance in the face space between the face and all known faces:

\( \varepsilon_i^2 = \lVert \Omega - \Omega_i \rVert^2 \quad \text{for } i = 1 \ldots M \)

Distinguish between:
– not a face: the distance to the face space is above threshold
– a new face: \( \varepsilon_i \ge \theta \) for all \( i = 1, \ldots, M \)
– a known face \( i \): \( \varepsilon_i = \min_j \varepsilon_j \) and \( \varepsilon_i < \theta \)

[slide credit: Alexander Roth]

Global

Beyond uses in recognition, Eigen “backgrounds” can be very effective for background subtraction.
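This decision rule is short enough to sketch directly; below is a hedged NumPy version with random stand-ins for the face space U, the mean m, and the gallery projections (all names and sizes are illustrative, not from the slides).

```python
import numpy as np

def recognize(r, U, m, Omegas, theta):
    """Eigenface decision rule: 'not a face', 'new face', or a gallery index."""
    phi = r - m                                   # subtract the average face
    Omega = U.T @ phi                             # project onto the face space
    if np.linalg.norm(phi - U @ Omega) > theta:   # too far from the subspace?
        return "not a face"
    eps = np.linalg.norm(Omegas - Omega, axis=1)  # distances to known faces
    i = int(np.argmin(eps))
    return i if eps[i] < theta else "new face"

# Toy usage: an orthonormal 7-dimensional face space in R^1024.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((1024, 7)))
m = rng.random(1024)
Omegas = rng.standard_normal((8, 7))              # gallery of 8 known faces
probe = m + U @ Omegas[2]                         # lies exactly on face 2
print(recognize(probe, U, m, Omegas, theta=1.0))  # -> 2
```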

Page 14: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Eigenfaces, the algorithm

Problems with eigenfaces: spurious “scatter” from
– Different illumination
– Different head pose
– Different alignment
– Different facial expression

[slide credit: Alexander Roth]

Fisherfaces may beat eigenfaces:
– Developed in 1997 by P. Belhumeur et al.
– Based on Fisher’s LDA
– Faster than eigenfaces, in some cases
– Has lower error rates
– Works well even under different illumination
– Works well even with different facial expressions

Global

Page 15: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Global/local feature mix

[image credit: Kevin Murphy]

Global-noGeo

Global works OK and is still used, but local now seems to outperform.

A recent mix of local and global:
– Use global features to bias local features with no internal geometric dependencies: Murphy, Torralba & Freeman (03)

Page 16: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Use local features to find objects

Global-noGeo

[Figure: the image is convolved (*) with a filter bank; patches are matched by normalized correlation; a Gaussian is fit within the object bounding box; training samples marked x (positive) and O (negative)]

[image credit: Kevin Murphy]

Page 17: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Global feature: back to neural nets: propagate Mixture Density Networks*

Uses “boosted random fields” to learn graph structure.

[Figure: final output over iterations]

[slide credit: Kevin Murphy]

Global-noGeo

* C. M. Bishop. Mixture density networks. Technical Report NCRG 4288, Neural Computing Research Group, Department of Computer Science, Aston University, 1994

Feature used: steerable pyramid transformation using 4 orientations and 2 scales; the image is divided into a 4x4 grid, and the average energy computed in each channel yields 128 features. PCA down to 80.

Page 18: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Example of context focus

The algorithm knows where to focus for objects

Global-noGeo

[image credit: Kevin Murphy]

Page 19: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Results: performance is boosted by knowing context.

Global-noGeo

[image credit: Kevin Murphy]

Page 20: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Completely Local: Color Histograms

Swain and Ballard ’91 took the normalized r,g,b color histogram of objects and noted its tolerance to 3D rotation, partial occlusion, etc.:

Local-noGeo

[image credit: Swain & Ballard]

Page 21: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Color Histogram Matching

Objects were recognized based on their histogram intersection:

Yielding excellent results over 30 objects.

The problem is, color varies markedly with lighting …

Local-noGeo

[image credit: Swain & Ballard]
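Histogram intersection itself is a one-liner; here is a minimal sketch (the bin count and the [0, 1] color range are illustrative choices, not from the paper):

```python
import numpy as np

def color_histogram(img, bins=8):
    """Normalized r,g,b histogram of an (H, W, 3) image with values in [0, 1]."""
    h, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins, bins, bins),
                          range=((0, 1), (0, 1), (0, 1)))
    return h / h.sum()

def intersection(h_image, h_model):
    """Swain & Ballard match score: sum of bin-wise minima (1.0 = perfect)."""
    return float(np.minimum(h_image, h_model).sum())

rng = np.random.default_rng(2)
a = rng.random((64, 64, 3))
print(intersection(color_histogram(a), color_histogram(a)))        # 1.0
print(intersection(color_histogram(a), color_histogram(1.0 - a)))  # lower
```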

Page 22: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Local-noGeo

Local Feature Histogram Matching

Schiele and Crowley used derivative-type features instead:

And a probabilistic matching rule:

• For multiple objects:

[image credit: Schiele & Crowley]

Page 23: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Local Feature Histogram Results

Again with impressive performance results, much more tolerant to lighting (30 of 100 objects shown):

The problem is that histograms suffer exponential blow-up with the number of features.

Local-noGeo

[image credit: Schiele & Crowley]

Page 24: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Local Features

Local features, for example:
– Lowe’s SIFT
– Malik’s Shape Context
– Poggio’s HMAX
– von der Malsburg’s Gabor Jets
– Yokono’s Gaussian Derivative Jets

Adding patches thereof seems to work great, but they are of high dimensionality.

Idea: encode in a hierarchy. We overview some techniques …

Page 25: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Convolutional Neural Networks (Yann LeCun)

Broke all the HIPs (Human Interaction Proofs) from Yahoo, MSN, eBay …

Local-Hierarchy

[image credit: LeCun]

Page 26: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Fragment-Based Hierarchy (Shimon Ullman)

Top-down and bottom-up hierarchy

http://www.wisdom.weizmann.ac.il/~vision/research.html
See also Perona’s group work on hierarchical feature models of objects: http://www.vision.caltech.edu/html-files/publications.html

Local-Hierarchy

[image credit: Ullman et al]

Page 27: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Perona’s Constellation Model

From: Rob Fergus, http://www.robots.ox.ac.uk/%7Efergus/

Feature detector results: Bayesian decision based.

The shape model: the mean location of each part is indicated by the cross, with the ellipse showing the uncertainty in location. The number by each part is the probability of that part being present.

The appearance model: closest to the mean of the appearance density of each part.

Recognition result:

See also Perona’s group work on hierarchical feature models of objects: http://www.vision.caltech.edu/html-files/publications.html

Local-Hierarchy

[image credit: Perona et al]

Page 28: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Jojic and Frey

Scene description as a hierarchy of sprites

Local-Hierarchy

[image credit: Jojic et al]

Page 29: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Jeff Hawkins, Dileep George: modular hierarchical spatial-temporal memory

[Figure: hierarchy and module diagrams; results showing templates, good classifications, and bad classifications; in (D), out (E)]

Local-Hierarchy

[image credit: George, Hawkins]

Page 30: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Peter Bock’s ALISA: an explicit cognitive model

Local-Hierarchy

Histogram based

[image credit: Bock et al]

Page 31: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


ALISA Labeling 2 Scenes

Local-Hierarchy

[image credit: Bock et al]

Page 32: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


HMAX from the “Standard Model” (Maximilian Riesenhuber and Tomaso Poggio)

Local-Hierarchy

Basic building blocks in the object recognition hierarchy, modulated by attention.

We’ll pick this up momentarily; first, a little on trees and boosting …

[image credit: Riesenhuber et al]

Page 33: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Machine Learning – Many Techniques: Libraries from Intel

– Bayesian Networks Library: PNL
– Statistical Learning Library: MLL

The slide arranges the techniques along two axes (modeless vs. model-based, unsupervised vs. supervised), with a key marking each as optimized, implemented, or not implemented:

K-means; decision trees (CART); boosted decision trees; MART; Random Forests; K-NN; agglomerative clustering; spectral clustering; dependency nets; multi-layer perceptron; SVM; BayesNets: classification; BayesNets: parameter fitting; Bayesnet structure learning; diagnostic Bayesnet; inference; influence diagrams; tree distributions; kernel density estimation; histogram density estimation; PCA; physical models; logistic regression; Kalman filter; HMM; adaptive filters; radial basis; naïve Bayes; ARTMAP; ART; Kohonen map; Gaussian fitting; associative net

We focus on decision trees and boosting next.

Page 34: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Machine Learning

INPUT → f → OUTPUT

Learn a model/function f that maps input to output: find a function that describes the given data and predicts unknown data.

Example uses of prediction:
– Insurance risk prediction
– Parameters that impact yields
– Gene classification by function
– Topics of a document
…

[Figure: fits of y = f(x) illustrating underfit, overfit, and just right]

Specific example: prediction using a decision tree.

Page 35: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Binary Recursive Decision Trees: Leo Breiman’s “CART”*

Data of different types, each containing a vector of “predictors”.

At each level: find the variable (predictor) and its threshold
– that splits the data into 2 groups
– with maximal purity within each group
All variables/predictors are considered at every level.

[Figure: maximal-purity splits of the data set; perfect purity, but the fit of y = f(x) is overfit]

*Classification And Regression Tree

Page 36: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Binary Recursive Decision Trees: Leo Breiman’s “CART”

At each level: find the variable (predictor) and its threshold
– that splits the data into 2 groups
– with maximal purity within each group
All variables/predictors are considered at every level.

Prune to avoid overfitting, using a complexity-cost measure.

[Figure: the pruned tree takes the fit of y = f(x) from overfit to just right]

Page 37: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Consider a Face Detector via Decision Stumps

Consider a tree “stump”: just one split. It selects the single most discriminative feature.

For each rectangle combination region, find the threshold
– that splits the data into 2 groups (face, non-face)
– with maximal purity within each group

Face and non-face data that the features can be tried on. A bar detector works well for the “nose”: a face-detecting stump. It doesn’t detect cars.

[Figure: data set; maximal purity splits: thresh = N]

See the Appendix for Viola and Jones’s feature generator: integral images.

Page 38: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


We use “Boosting” to Select a “Forest of Stumps”

Given example images \( (x_1, y_1), \ldots, (x_n, y_n) \) where \( y_i = 0, 1 \) for negative and positive examples respectively.

Initialize weights \( w_{1,i} = \frac{1}{2m}, \frac{1}{2l} \) for training example i, where m and l are the number of negatives and positives respectively.

For t = 1 … T:
1) Normalize the weights so that \( w_t \) is a distribution.
2) For each feature j, train a classifier \( h_j \) and evaluate its error \( \epsilon_j \) with respect to \( w_t \).
3) Choose the classifier \( h_t \) with the lowest error \( \epsilon_t \).
4) Update the weights: \( w_{t+1,i} = w_{t,i}\,\beta_t^{1-e_i} \), where \( e_i = 0 \) if \( x_i \) is classified correctly, 1 otherwise, and \( \beta_t = \epsilon_t / (1 - \epsilon_t) \).

The final strong classifier is

\( h(x) = 1 \) if \( \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \), and 0 otherwise, where \( \alpha_t = \log \frac{1}{\beta_t} \).

Each stump is a selected feature plus a split threshold; a code sketch follows. (A related variant: Gentle Boost.)
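For concreteness, here is a compact sketch of this discrete-AdaBoost loop over threshold stumps; the feature matrix is a generic stand-in for rectangle-feature responses, and none of this is the authors' code.

```python
import numpy as np

def train_adaboost_stumps(X, y, T):
    """Discrete AdaBoost over one-feature threshold stumps, as above.

    X: (n_samples, n_features) feature responses; y: 0/1 labels.
    Returns a list of weak classifiers (feature, threshold, polarity, alpha).
    """
    n, d = X.shape
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))    # initial weights
    stumps = []
    for _ in range(T):
        w = w / w.sum()                                   # 1) normalize
        best = None
        for j in range(d):                                # 2) every feature
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):                       #    both polarities
                    pred = (pol * X[:, j] < pol * thr).astype(int)
                    err = float(np.sum(w * (pred != y)))
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best                     # 3) lowest error
        beta = max(err, 1e-12) / (1 - err)                # guard against err == 0
        w = w * beta ** (pred == y)                       # 4) shrink the weights
        stumps.append((j, thr, pol, np.log(1 / beta)))    #    of correct examples
    return stumps

def strong_classify(x, stumps):
    """h(x) = 1 iff sum_t alpha_t h_t(x) >= (1/2) sum_t alpha_t."""
    score = sum(a * (p * x[j] < p * t) for j, t, p, a in stumps)
    return int(score >= 0.5 * sum(a for *_, a in stumps))
```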

Page 39: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


For efficient calculation, form a Detection Cascade

A boosted cascade is assembled such that at each node, non-object regions stop further processing.

If the detection rate of each node is high (~99.9%), at the cost of a high false-positive rate (say 50% of everything is detected as “object”), and if the nodes are independent, then the overall detection and false-positive rates are

\( d = \prod_{i=1}^{n} d_i \quad \text{and} \quad f = \prod_{i=1}^{n} f_i. \)

If so, then for a 20-node cascade we get \( d = 0.999^{20} \approx 0.98 \) and \( f = 0.5^{20} \approx 9.6 \times 10^{-7} \).

Rapid Object Detection using a Boosted Cascade of Simple Features - Viola, Jones (2001)  
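Both the rate arithmetic and the early-exit structure are tiny in code; a sketch (the stage functions are placeholders):

```python
# Per-stage rates from the slide: d_i = 0.999, f_i = 0.5, over 20 stages.
d = 0.999 ** 20   # overall detection rate, ~0.98
f = 0.5 ** 20     # overall false-positive rate, ~9.5e-7 (slide rounds to 9.6e-7)

def cascade_detect(window, stages):
    """A window must pass every stage to be accepted; most non-object
    windows are rejected early by the first, cheapest stages."""
    return all(stage(window) for stage in stages)
```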

Page 40: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Improvements to Cascade

J. Wu, J. M. Rehg, and M. D. Mullin just do one Boosting round, then select from the feature pool as needed:

Kobi Levi and Yair Weiss just used better features (gradient histograms) to cut training needs by an order of magnitude.

Let’s focus on better features and descriptors …

[Figure: cascades compared, Viola & Jones vs. Wu, Rehg & Mullin]

[image credit: Wu et al]

Page 41: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


The Standard Model of Visual Cortex: Biologically Motivated Features

– S1 layer: Gabor filters at 4 orientations
– C1 layer: local spatial max
– Intermediate layer: dictionary of patches of C1
– S2 layer: radial basis fit to its patch template over the whole image
– C2 layer: max S2 response (e.g., .8 .4 .9 .2 .6)
– Classifier: SVM, Boosting, …

Thomas Serre, Lior Wolf and Tomaso Poggio used the model of the human visual cortex developed in Riesenhuber’s lab:

First 5 chosen features from Boosting

[image credit: Serre et al]

Page 42: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


The Standard Model of Visual Cortex: Biologically Motivated Features

Results in state-of-the-art performance:

[image credit: Serre et al]

Seems to handily beat SIFT features:

Page 43: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Yokono’s Generalization of the Standard Model of Visual Cortex

Used Gaussian derivatives: 3 orders × 3 scales × 4 orientations = 36 base features:

Similar to Standard Model’s Gabor base filters.

[image credit: Yokono et al]

Page 44: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Yokono’s Generalization of the Standard Model of Visual Cortex

Created a local spatial jet, oriented to the gradient at the largest scale at the center pixel:

Since a Gabor filter has a ringing spatial extent, this is still approximately similar to the Standard Model.

[image credit: Yokono et al]

Page 45: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Yokono’s Generalization of the Standard Model of Visual Cortex

Full system:

[image credit: Yokono et al]

– ~S1, C1: features memorized from positive samples at Harris-corner interest points.
– ~S2: the dictionary of learned features is measured (normalized cross-correlation) against all interest points in the image (sketched below).
– ~C2: the maximum normalized cross-correlation scores are arranged in a feature vector.
– Classifier: again SVM, Boosting, …
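The ~S2/~C2 scoring reduces to "max normalized cross-correlation of each dictionary patch over the interest points"; here is a minimal sketch under that reading (patch size and points are illustrative):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equal-size patches, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def max_ncc_score(image, patch, points):
    """~C2 entry: max NCC of one dictionary patch over all interest points.
    Assumes an odd-size square patch; points are (row, col) coordinates."""
    r = patch.shape[0] // 2
    scores = [ncc(image[y - r:y + r + 1, x - r:x + r + 1], patch)
              for y, x in points
              if r <= y < image.shape[0] - r and r <= x < image.shape[1] - r]
    return max(scores, default=-1.0)

# Toy usage: the patch is cut from the image, so its own location scores ~1.
rng = np.random.default_rng(3)
img = rng.random((64, 64))
patch = img[20:29, 30:39]                        # 9x9 patch centered at (24, 34)
print(max_ncc_score(img, patch, [(24, 34), (10, 10), (50, 5)]))  # ~1.0
```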

Page 46: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Yokono’s Generalization of the Standard Model of Visual Cortex

Excellent Results:

[image credit: Yokono et al]

CBCL Database ROC curve for 1200 Stumps:

SVM with 1 to 5 training images beats other techniques:

Page 47: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Yokono’s Generalization of the Standard Model of Visual Cortex

Excellent Results:

[image credit: Yokono et al]

AIBO Dog in articulated poses:

Some features chosen:

ROC Curve:

Page 48: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Brash Claim

Performance is in the high 90% range under lighting, articulation, scale, and 3D rotation changes.
– The classifier inside humans is unlikely to be much more accurate.

We are not that far from raw human-level performance.
– By 2015, I predict.

The base classifier is embedded in a larger system that makes it more reliable:
– Attention
– Color constancy features
– Context
– Temporal filtering
– Sensor fusion

Page 49: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Back to Kevin Murphy: Context:

[slide credit: Kevin Murphy]

Missing

Page 50: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


We know there is a keyboard present in this scene even if we cannot see it clearly.

We know there is no keyboard present in this scene … even if there is one indeed.

[slide credit: Kevin Murphy]

Missing: Context

Page 51: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Attention

Change blindness

Missing

Farm Truck

Page 52: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Call for a Program: Generalize the Standard Model Even Further

Research Framework (spanning local to global):

– Detect: DoG; Harris corner
– Descriptors: SIFT; steerable; Gabor
– Dictionary: all descriptors; subset; clustered
– Image-level scoring: histogram; max correlation; max probability
– Classifier: SVM; Boosting; K-NN …
– Embedding: attention, active vision; context (scene, 3D inference); sensor fusion/association; motion

Page 53: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Call for a Program: Generalize the Standard Model Even Further

Ashutosh Saxena, Chung and Ng learned depth from local features in an MRF (similar to Kevin Murphy).

Ashutosh also has a robot picking up novel objects from local features. Together with active vision, active manipulation, and context, now is a good time for vision systems!

Apply to “Stanley II” and to STAIR.

[image credit: Saxena et al]

Page 54: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Summary: mix local with global; generalize the Standard Model even further.

Research Framework (spanning local to global):

– Detect: DoG; Harris corner
– Descriptors: SIFT; steerable; Gabor
– Dictionary: all descriptors; subset; clustered
– Image-level scoring: histogram; max correlation; max probability
– Classifier: SVM; Boosting; K-NN …
– Embedding: attention, active vision; context (scene, 3D inference); sensor fusion/association; motion

Page 55: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Bibliography for this lecture

Papers for this lecture:

1. R. Fergus, P. Perona and A. Zisserman, “Object Class Recognition by Unsupervised Scale-Invariant Learning”, CVPR 03.

2. M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, Vol. 3, No. 1, 1991.

3. Serre, T., L. Wolf and T. Poggio. Object Recognition with Features Inspired by Visual Cortex. In: Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society Press, San Diego, June 2005.

4. Jerry Jun Yokono & Tomaso Poggio, “Boosting a Biologically Inspired Local Descriptor for Geometry-free Face and Full Multi-view 3D Object Recognition”, AI Memo 2005-023, CBCL Memo 254, July 2005

5. J. Wu, J. M. Rehg, and M. D. Mullin, “Learning a Rare Event Detection Cascade by Direct Feature Selection” Proc. Advances in Neural Information Processing Systems 16 (NIPS*2003), MIT Press, 2004

6. J. Wu, M. D. Mullin, and J. M. Rehg, “Linear Asymmetric Classifier for Face Detection”, International Conference on Machine Learning (ICML 05), pages 993-1000, Bonn, Germany, August 2005

7. Kobi Levi and Yair Weiss, “Learning Object Detection from a Small Number of Examples: The Importance of Good Features” International Conference on Computer Vision and Pattern Recognition (CVPR) 2004.

8. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. CVPR, pages 511–518, 2001.

9. B. Schiele and JL Crowley. Probabilistic object recognition using multidimensional receptive field histograms. submitted to ICPR'96

10. M. J. Swain and D. H. Ballard, "Color Indexing," International Journal of Computer Vision, vol. 7, pp. 11-32, 1991.

11. Antonio Torralba, Kevin Murphy and William Freeman , “Contextual Models for Object Detection using Boosted Random Fields ”, NIPS 2004.

12. Kevin Murphy, Antonio Torralba, Daniel Eaton, William Freeman, “Object detection and localization using local and global features”, Sicily workshop on object recognition, 2005

13. M. Riesenhuber and T. Poggio. How visual cortex recognizes objects: The tale of the standard model. The Visual Neurosciences, 2:1640–1653, 2003.

14. A. Saxena, S.H. Chung, A.Y. Ng, “Learning depth from Single Monocular Images”, NIPS 2005

Page 56: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Backup Slides: Feature Set Generators

Page 57: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


3 rectangular feature types:

• two-rectangle feature type (horizontal/vertical)

• three-rectangle feature type

• four-rectangle feature type

Using a 24x24-pixel base detection window, with all possible combinations of horizontal and vertical location and scale of these feature types, the full set has 49,396 features.

The motivation for using rectangular features, as opposed to more expressive steerable filters, is their extreme computational efficiency.

Paul Viola and Michael Jones, www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt, ICCV 2001 Workshop on Statistical and Computation Theories of Vision

Integral Images -- a Feature Set Generator

Page 58: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Define an “integral image”. Def: the integral image at location (x,y) is the sum of the pixel values above and to the left of (x,y), inclusive.

Using the following two recurrences, where i(x,y) is the pixel value of the original image at the given location and s(x,y) is the cumulative column sum, we can calculate the integral image representation of the image in a single pass:

[Figure: the shaded region from (0,0) to (x,y) whose sum is ii(x,y)]

Paul Viola and Michael Jones, www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt, ICCV 2001 Workshop on Statistical and Computation Theories of Vision

s(x,y) = s(x,y-1) + i(x,y)

ii(x,y) = ii(x-1,y) + s(x,y)
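The two recurrences translate directly into code; a minimal sketch with plain loops for clarity (NumPy's cumsum would do the same in two lines):

```python
import numpy as np

def integral_image(i):
    """Single-pass integral image via the two recurrences above:
    s(x,y) = s(x,y-1) + i(x,y) is the cumulative column sum, and
    ii(x,y) = ii(x-1,y) + s(x,y). Arrays are indexed [row, col]."""
    h, w = i.shape
    s = np.zeros((h, w))
    ii = np.zeros((h, w))
    for row in range(h):
        for col in range(w):
            s[row, col] = (s[row - 1, col] if row > 0 else 0) + i[row, col]
            ii[row, col] = (ii[row, col - 1] if col > 0 else 0) + s[row, col]
    return ii
```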

Page 59: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Allows rapid evaluation of rectangular features

Using the integral image representation, one can compute the value of any rectangular sum in constant time.

For example, the integral sum inside rectangle D can be computed as:

ii(4) + ii(1) – ii(2) – ii(3)

As a result, two-, three-, and four-rectangle features can be computed with 6, 8, and 9 array references respectively.

Paul Viola and Michael Jones, www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt, ICCV 2001 Workshop on Statistical and Computation Theories of Vision

Page 60: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Integral Image Example

Image:
0 8 6 1
1 5 9 0
0 7 5 0
2 8 9 2

Integral image (shown filling in over the single pass):
0  8 14 15
1 14 29 30
1 21 41 42
4 32 61 64

Can calculate in one pass.

Page 61: Gary Bradski, Intel,  Stanford CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado


Integral Image Example

Image:
0 8 6 1
1 5 9 0
0 7 5 0
2 8 9 2

Integral image:
0  8 14 15
1 14 29 30
1 21 41 42
4 32 61 64

Find the sum of the 3x2 block containing 5, 9 / 7, 5 / 8, 9: directly, 5+9+7+5+8+9 = 43; from the integral image, 61 + 0 − (14 + 4) = 43.
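The check is easy to reproduce; a small NumPy sketch (indices below are 0-based):

```python
import numpy as np

# The 4x4 image from the slide.
img = np.array([[0, 8, 6, 1],
                [1, 5, 9, 0],
                [0, 7, 5, 0],
                [2, 8, 9, 2]])

# Integral image: one cumulative sum per axis; reproduces the table above.
ii = img.cumsum(axis=0).cumsum(axis=1)

# Constant-time sum of the 3x2 block (rows 1..3, cols 1..2):
print(ii[3, 2] + ii[0, 0] - ii[0, 2] - ii[3, 0])   # 43, matching the slide
```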