Visual Recognition: Objects, Actions and Scenes


Transcript of Visual Recognition: Objects, Actions and Scenes

Page 1: Visual Recognition: Objects, Actions and Scenes

Includes slides from: O. Chum, K. Grauman, S. Lazebnik, B. Leibe, D. Lowe, J. Philbin,

J. Ponce, D. Nister, J. Sivic, N. Snavely and A. Zisserman

Lecture 2:

Local invariant image features

Ivan Laptev

[email protected]

INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548

Laboratoire d’Informatique, Ecole Normale Supérieure, Paris, France

University of Trento

July 7-10, 2014

Trento, Italy

Visual Recognition:

Objects, Actions and Scenes

Page 2: Visual Recognition: Objects, Actions and Scenes

Image matching and recognition with local features

The goal: establish correspondence between two or more

images

Image points x and x′ are in correspondence if they are
projections of the same 3D scene point X.

[Figure: two cameras P and P′ imaging the same scene point X at image points x and x′.]

Images courtesy A. Zisserman

Page 3: Visual Recognition: Objects, Actions and Scenes

Example I: Wide baseline matching and 3D reconstruction

Establish correspondence between two (or more) images.

[Schaffalitzky and Zisserman ECCV 2002]

Page 4: Visual Recognition: Objects, Actions and Scenes

Example I: Wide baseline matching and 3D reconstruction

Establish correspondence between two (or more) images.

[Schaffalitzky and Zisserman ECCV 2002]


Page 5: Visual Recognition: Objects, Actions and Scenes

[Agarwal, Snavely, Simon, Seitz, Szeliski, ICCV’09] –

Building Rome in a Day

57,845 downloaded images, 11,868 registered images. This video: 4,619 images.

Page 6: Visual Recognition: Objects, Actions and Scenes

Example II: Object recognition

[D. Lowe, 1999]

Establish correspondence between the target image and

(multiple) images in the model database.

Target

image

Model

database

Page 7: Visual Recognition: Objects, Actions and Scenes

Example III: Visual search

Find these landmarks ... in these images and 1M more

Given a query image, find images depicting the same place /

object in a large unordered image collection.

Page 8: Visual Recognition: Objects, Actions and Scenes

Establish correspondence between the query image and all

images from the database depicting the same object / scene.

Query image

Database image(s)

Page 9: Visual Recognition: Objects, Actions and Scenes

Bing visual scan

Mobile visual search

and others… Snaptell.com, Millpix.com

Page 10: Visual Recognition: Objects, Actions and Scenes

Applications

Take a picture of a product or advertisement

find relevant information on the web

[Pixee – Milpix]

Page 11: Visual Recognition: Objects, Actions and Scenes

Applications

Finding stolen/missing objects in a large collection…

Page 12: Visual Recognition: Objects, Actions and Scenes

Applications

Copy detection for images and videos

Query video

Search in 200h of video

Page 13: Visual Recognition: Objects, Actions and Scenes

K. Grauman, B. Leibe

Sony Aibo – Robotics

• Recognize docking station

• Communicate with visual cards

• Place recognition

• Loop closure in SLAM

Slide credit: David Lowe

Applications

Page 14: Visual Recognition: Objects, Actions and Scenes

Why is it difficult?

We want to establish correspondence despite possibly large changes in scale, viewpoint, lighting and partial occlusion.

[Figure: example image pairs showing viewpoint, scale, lighting and occlusion changes.]

… and the image collection can be very large (e.g. 1M images)

Page 15: Visual Recognition: Objects, Actions and Scenes

How does it work?

Approach:
1. Compute scale / affine co-variant local features
2. Estimate pairwise best matches between local features
3. Enforce geometric constraints between local features


Page 18: Visual Recognition: Objects, Actions and Scenes

Why extract features?

• Motivation: panorama stitching
• We have two images – how do we combine them?

Slide: S. Lazebnik

Page 19: Visual Recognition: Objects, Actions and Scenes

Why extract features?

• Motivation: panorama stitching
• We have two images – how do we combine them?

Step 1: extract features

Step 2: match features

Slide: S. Lazebnik

Page 20: Visual Recognition: Objects, Actions and Scenes

Why extract features?

• Motivation: panorama stitching
• We have two images – how do we combine them?

Step 1: extract features

Step 2: match features

Step 3: align images

Slide: S. Lazebnik

Page 21: Visual Recognition: Objects, Actions and Scenes

Characteristics of good features

• Repeatability: the same feature can be found in several images despite geometric and photometric transformations
• Saliency: each feature is distinctive
• Compactness and efficiency: many fewer features than image pixels
• Locality: a feature occupies a relatively small area of the image; robust to clutter and occlusion

Slide: S. Lazebnik

Page 22: Visual Recognition: Objects, Actions and Scenes

A hard feature matching problem

NASA Mars Rover images

Slide: S. Lazebnik

Page 23: Visual Recognition: Objects, Actions and Scenes

NASA Mars Rover images

with SIFT feature matches

Figure by Noah Snavely

Answer below (look for tiny colored squares…)

Slide: S. Lazebnik

Page 24: Visual Recognition: Objects, Actions and Scenes

Corner Detection: Basic Idea

• We should easily recognize the point by looking through a small window

• Shifting a window in any direction should give a large change in intensity

“edge”:

no change

along the edge

direction

“corner”:

significant

change in all

directions

“flat” region:

no change in

all directions

Source: A. Efros

Page 25: Visual Recognition: Objects, Actions and Scenes

Corner Detection: Mathematics

Change in appearance of window W for the shift [u,v]:

[Figure: image I(x, y) and the error surface E(u, v), with the value E(3, 2) marked.]

E(u, v) = \sum_{(x,y) \in W} [I(x + u, y + v) - I(x, y)]^2

Slide: S. Lazebnik
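The sum above can be evaluated directly. A minimal NumPy sketch (the toy image, window and shifts are invented for illustration, not from the lecture):

```python
import numpy as np

def ssd_error(img, window, u, v):
    """E(u, v): sum of squared differences over window W for shift (u, v)."""
    ys, xs = window  # row/column indices of the pixels in W
    return np.sum((img[ys + v, xs + u] - img[ys, xs]) ** 2)

# Toy image: a bright square on a dark background.
img = np.zeros((10, 10))
img[3:7, 3:7] = 1.0
ys, xs = np.mgrid[3:7, 3:7]
window = (ys.ravel(), xs.ravel())

assert ssd_error(img, window, 0, 0) == 0.0  # no shift, no change
assert ssd_error(img, window, 2, 0) > 0     # shifting off the square changes intensity
```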

Page 26: Visual Recognition: Objects, Actions and Scenes

Corner Detection: Mathematics

[Figure: image I(x, y) and the error surface E(u, v), with E(0, 0) = 0 marked.]

Change in appearance of window W for the shift [u, v]:

E(u, v) = \sum_{(x,y) \in W} [I(x + u, y + v) - I(x, y)]^2

Slide: S. Lazebnik

Page 27: Visual Recognition: Objects, Actions and Scenes

Corner Detection: Mathematics

We want to find out how this function behaves for small shifts.

Change in appearance of window W for the shift [u, v]:

E(u, v) = \sum_{(x,y) \in W} [I(x + u, y + v) - I(x, y)]^2

Slide: S. Lazebnik

Page 28: Visual Recognition: Objects, Actions and Scenes

Corner Detection: Mathematics

• First-order Taylor approximation for small motions [u, v]:

I(x + u, y + v) \approx I(x, y) + I_x u + I_y v + \text{higher order terms}
              \approx I(x, y) + I_x u + I_y v
              = I(x, y) + [I_x \; I_y] \begin{bmatrix} u \\ v \end{bmatrix}

• Let’s plug this into

E(u, v) = \sum_{(x,y) \in W} [I(x + u, y + v) - I(x, y)]^2

Slide: S. Lazebnik

Page 29: Visual Recognition: Objects, Actions and Scenes

Corner Detection: Mathematics

E(u, v) = \sum_{(x,y) \in W} [I(x + u, y + v) - I(x, y)]^2
        \approx \sum_{(x,y) \in W} [I(x, y) + I_x u + I_y v - I(x, y)]^2
        = \sum_{(x,y) \in W} (I_x^2 u^2 + 2 I_x I_y u v + I_y^2 v^2)
        = [u \; v] \left( \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \right) \begin{bmatrix} u \\ v \end{bmatrix}

Slide: S. Lazebnik

Page 30: Visual Recognition: Objects, Actions and Scenes

Corner Detection: Mathematics

The quadratic approximation simplifies to

E(u, v) \approx [u \; v] \, M \begin{bmatrix} u \\ v \end{bmatrix}

where M is a second moment matrix computed from image derivatives:

M = \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

Slide: S. Lazebnik
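In code, M is just an accumulation of gradient products over the window. An illustrative NumPy sketch (central differences via `np.gradient`, a made-up step-edge image):

```python
import numpy as np

def second_moment_matrix(img, window):
    """Accumulate M = sum over W of [[Ix^2, Ix*Iy], [Ix*Iy, Iy^2]]."""
    Iy, Ix = np.gradient(img)        # derivatives along rows (y) and columns (x)
    ys, xs = window
    ix, iy = Ix[ys, xs], Iy[ys, xs]
    return np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                     [np.sum(ix * iy), np.sum(iy * iy)]])

# Vertical step edge: gradient energy only in the x direction.
img = np.zeros((9, 9))
img[:, 4:] = 1.0
ys, xs = np.mgrid[2:7, 2:7]
M = second_moment_matrix(img, (ys.ravel(), xs.ravel()))
assert M[0, 0] > 0 and np.isclose(M[1, 1], 0.0)  # one large, one zero eigenvalue
```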

Page 31: Visual Recognition: Objects, Actions and Scenes

Interpreting the second moment matrix

• The surface E(u, v) is locally approximated by a quadratic form. Let’s try to understand its shape.
• Specifically, in which directions does it have the smallest/greatest change?

E(u, v) \approx [u \; v] \, M \begin{bmatrix} u \\ v \end{bmatrix}, \qquad M = \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

[Figure: surface plot of E(u, v).]

Slide: S. Lazebnik

Page 32: Visual Recognition: Objects, Actions and Scenes

Interpreting the second moment matrix

First, consider the axis-aligned case (gradients are either horizontal or vertical):

M = \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}

If either a or b is close to 0, then this is not a corner, so look for locations where both are large.

Slide: S. Lazebnik

Page 33: Visual Recognition: Objects, Actions and Scenes

Interpreting the second moment matrix

Consider a horizontal “slice” of E(u, v):

[u \; v] \, M \begin{bmatrix} u \\ v \end{bmatrix} = \mathrm{const}

This is the equation of an ellipse.

Slide: S. Lazebnik

Page 34: Visual Recognition: Objects, Actions and Scenes

Visualization of second moment matrices

Slide: S. Lazebnik

Page 35: Visual Recognition: Objects, Actions and Scenes

Visualization of second moment matrices

Slide: S. Lazebnik

Page 36: Visual Recognition: Objects, Actions and Scenes

Interpreting the second moment matrix

Consider a horizontal “slice” of E(u, v):

[u \; v] \, M \begin{bmatrix} u \\ v \end{bmatrix} = \mathrm{const}

This is the equation of an ellipse.

Diagonalization of M:

M = R^{-1} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} R

The axis lengths of the ellipse are determined by the eigenvalues and the orientation is determined by R: the half-axes have lengths (\lambda_{\max})^{-1/2} along the direction of fastest change and (\lambda_{\min})^{-1/2} along the direction of slowest change.

Slide: S. Lazebnik

Page 37: Visual Recognition: Objects, Actions and Scenes

Interpreting the eigenvalues

Classification of image points using the eigenvalues of M:

• “Corner”: \lambda_1 and \lambda_2 are large, \lambda_1 \sim \lambda_2; E increases in all directions
• “Edge”: \lambda_1 \gg \lambda_2 (or \lambda_2 \gg \lambda_1)
• “Flat” region: \lambda_1 and \lambda_2 are small; E is almost constant in all directions

Slide: S. Lazebnik
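This classification is easy to express in code. A sketch with NumPy; the thresholds `flat_thresh` and `edge_ratio` are illustrative values, not from the lecture:

```python
import numpy as np

def classify(M, flat_thresh=1e-3, edge_ratio=10.0):
    """Classify a 2x2 second moment matrix by its eigenvalues."""
    l1, l2 = sorted(np.linalg.eigvalsh(M), reverse=True)  # l1 >= l2
    if l1 < flat_thresh:
        return "flat"            # both eigenvalues small: E almost constant
    if l1 / max(l2, 1e-12) > edge_ratio:
        return "edge"            # one dominant eigenvalue
    return "corner"              # two large, comparable eigenvalues

assert classify(np.diag([5.0, 4.0])) == "corner"
assert classify(np.diag([5.0, 0.0])) == "edge"
assert classify(np.zeros((2, 2))) == "flat"
```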

Page 38: Visual Recognition: Objects, Actions and Scenes

Corner response function

R = \det(M) - \alpha \, \mathrm{trace}(M)^2 = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2

\alpha: constant (0.04 to 0.06)

• “Corner”: R > 0
• “Edge”: R < 0
• “Flat” region: |R| small

Slide: S. Lazebnik
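The response is cheap to compute because it needs no eigendecomposition, only the determinant and trace. A short sketch (test matrices are made up):

```python
import numpy as np

def harris_response(M, alpha=0.05):
    """R = det(M) - alpha * trace(M)^2, equal to l1*l2 - alpha*(l1 + l2)^2."""
    return np.linalg.det(M) - alpha * np.trace(M) ** 2

assert harris_response(np.diag([5.0, 4.0])) > 0        # corner: R > 0
assert harris_response(np.diag([5.0, 0.0])) < 0        # edge: R < 0
assert abs(harris_response(np.zeros((2, 2)))) < 1e-9   # flat: |R| small
```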

Page 39: Visual Recognition: Objects, Actions and Scenes

Harris detector: Steps

1. Compute Gaussian derivatives at each pixel

2. Compute second moment matrix M in a

Gaussian window around each pixel

3. Compute corner response function R

4. Threshold R

5. Find local maxima of response function

(nonmaximum suppression)

C. Harris and M. Stephens. “A Combined Corner and Edge Detector.” Proceedings of the 4th Alvey Vision Conference, pages 147–151, 1988.

Slide: S. Lazebnik
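The five steps above can be sketched end to end in NumPy. This is an illustrative implementation, not Harris and Stephens' original code; the parameter values and the toy test image are assumptions made for the example:

```python
import numpy as np

def gaussian_kernel(sigma):
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def smooth(img, sigma):
    """Separable Gaussian smoothing: 1-D convolution along rows, then columns."""
    g = gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, out)

def harris(img, sigma=1.0, alpha=0.05, thresh=1e-4):
    Iy, Ix = np.gradient(smooth(img, sigma))          # 1. Gaussian derivatives
    Sxx = smooth(Ix * Ix, 2 * sigma)                  # 2. entries of M, averaged
    Sxy = smooth(Ix * Iy, 2 * sigma)                  #    in a Gaussian window
    Syy = smooth(Iy * Iy, 2 * sigma)
    R = Sxx * Syy - Sxy**2 - alpha * (Sxx + Syy)**2   # 3. corner response R
    corners = []
    for y in range(1, R.shape[0] - 1):                # 4. threshold R and
        for x in range(1, R.shape[1] - 1):            # 5. keep local maxima (NMS)
            if R[y, x] > thresh and R[y, x] == R[y-1:y+2, x-1:x+2].max():
                corners.append((y, x))
    return corners

img = np.zeros((20, 20))
img[8:14, 8:14] = 1.0        # bright square: expect detections near its 4 corners
corners = harris(img)
assert len(corners) > 0
```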

Page 40: Visual Recognition: Objects, Actions and Scenes

Harris Detector: Steps

Slide: S. Lazebnik

Page 41: Visual Recognition: Objects, Actions and Scenes

Harris Detector: Steps

Compute corner response R

Slide: S. Lazebnik

Page 42: Visual Recognition: Objects, Actions and Scenes

Harris Detector: Steps

Find points with large corner response: R>threshold

Slide: S. Lazebnik

Page 43: Visual Recognition: Objects, Actions and Scenes

Harris Detector: Steps

Take only the points of local maxima of R

Slide: S. Lazebnik

Page 44: Visual Recognition: Objects, Actions and Scenes

Harris Detector: Steps

Slide: S. Lazebnik

Page 45: Visual Recognition: Objects, Actions and Scenes

Invariance and covariance

• We want corner locations to be invariant to photometric

transformations and covariant to geometric transformations

• Invariance: image is transformed and corner locations do not change

• Covariance: if we have two transformed versions of the same image,

features should be detected in corresponding locations

Slide: S. Lazebnik

Page 46: Visual Recognition: Objects, Actions and Scenes

Affine intensity change

• Only derivatives are used => invariance to intensity shift I -> I + b
• Intensity scaling: I -> a I

[Figure: corner response R along an image row with a fixed threshold; scaling the intensities rescales R, so different points pass the threshold.]

Partially invariant to affine intensity change I -> a I + b

Slide: S. Lazebnik
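Both claims can be checked numerically; a small sketch on a random test image (the image and constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((12, 12))

# Intensity shift I -> I + b: derivatives (and hence M and R) are unchanged.
gy0, gx0 = np.gradient(img)
gy1, gx1 = np.gradient(img + 0.3)
assert np.allclose(gx0, gx1) and np.allclose(gy0, gy1)

# Intensity scaling I -> a*I: derivatives scale by a, so M scales by a^2 and
# R = det(M) - alpha*trace(M)^2 by a^4; a fixed threshold on R is therefore
# not invariant to scaling.
gy2, gx2 = np.gradient(2.0 * img)
assert np.allclose(gx2, 2.0 * gx0) and np.allclose(gy2, 2.0 * gy0)
```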

Page 47: Visual Recognition: Objects, Actions and Scenes

Image translation

• Derivatives and window function are shift-invariant

Corner location is covariant w.r.t. translation

Slide: S. Lazebnik

Page 48: Visual Recognition: Objects, Actions and Scenes

Image rotation

Second moment ellipse rotates but its shape

(i.e. eigenvalues) remains the same

Corner location is covariant w.r.t. rotation

Slide: S. Lazebnik

Page 49: Visual Recognition: Objects, Actions and Scenes

Scaling

[Figure: a curved corner seen at fine scale, where every window along the contour looks like an edge, vs. at coarse scale, where the window covers the whole corner.]

At the fine scale, all points will be classified as edges.

Corner location is not covariant to scaling!

Slide: S. Lazebnik

Page 50: Visual Recognition: Objects, Actions and Scenes

Page 51: Visual Recognition: Objects, Actions and Scenes

Blob detection

Slide: S. Lazebnik

Page 52: Visual Recognition: Objects, Actions and Scenes

Feature detection with scale selection

• We want to extract features with characteristic

scale that is covariant with the image

transformation

Slide: S. Lazebnik

Page 53: Visual Recognition: Objects, Actions and Scenes

Blob detection: basic idea

• To detect blobs, convolve the image with a

“blob filter” at multiple scales and look for

maxima of filter response in the resulting

scale space

Slide: S. Lazebnik

Page 54: Visual Recognition: Objects, Actions and Scenes

Blob filter

Laplacian of Gaussian: Circularly symmetric

operator for blob detection in 2D

\nabla^2 g = \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2}

Slide: S. Lazebnik
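For a 2D Gaussian g, the Laplacian has the closed form ((x^2 + y^2 - 2\sigma^2)/\sigma^4) g, which can be sampled directly to build the blob filter. A NumPy sketch (kernel size is an assumption):

```python
import numpy as np

def log_kernel(sigma, radius):
    """Sampled Laplacian of Gaussian: (x^2 + y^2 - 2*sigma^2)/sigma^4 * G."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return (x**2 + y**2 - 2 * sigma**2) / sigma**4 * g

k = log_kernel(sigma=2.0, radius=10)
assert k[10, 10] < 0          # center of the circularly symmetric "blob filter"
assert abs(k.sum()) < 1e-3    # responses cancel on constant image regions
```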

Page 55: Visual Recognition: Objects, Actions and Scenes

Recall: Edge detection

[Figure: 1D signal f containing an edge; the derivative-of-Gaussian filter d/dx g; and the smoothed derivative f * (d/dx g).]

Edge = maximum of the derivative-of-Gaussian response f * (d/dx g)

Source: S. Seitz

Page 56: Visual Recognition: Objects, Actions and Scenes

Edge detection, Take 2

[Figure: 1D signal f containing an edge; the second derivative of Gaussian (Laplacian) d^2/dx^2 g; and the response f * (d^2/dx^2 g).]

Edge = zero crossing of the second-derivative response f * (d^2/dx^2 g)

Source: S. Seitz

Page 57: Visual Recognition: Objects, Actions and Scenes

From edges to blobs

• Edge = ripple

• Blob = superposition of two ripples

Spatial selection: the magnitude of the Laplacian

response will achieve a maximum at the center of

the blob, provided the scale of the Laplacian is

“matched” to the scale of the blob

[Figure: Laplacian responses at several scales; the maximum occurs at the blob center when the scale is matched.]

Slide: S. Lazebnik

Page 58: Visual Recognition: Objects, Actions and Scenes

Scale-space blob detector: Example

Slide: S. Lazebnik

Page 59: Visual Recognition: Objects, Actions and Scenes

Scale-space blob detector: Example

Slide: S. Lazebnik

Page 60: Visual Recognition: Objects, Actions and Scenes

Scale-space blob detector

1. Convolve image with scale-normalized

Laplacian at several scales

2. Find maxima of squared Laplacian response

in scale-space

Slide: S. Lazebnik
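The two steps above can be sketched with a naive convolution in NumPy. This is illustrative only: the disk image, the candidate scales, and the edge-padding choice are assumptions for the example, and a real detector would also search over spatial positions:

```python
import numpy as np

def scale_normalized_log_response(img, sigma):
    """sigma^2 * (img convolved with the Laplacian of Gaussian at scale sigma)."""
    r = int(3 * sigma)
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    log = (x**2 + y**2 - 2 * sigma**2) / sigma**4 * g
    pad = np.pad(img, r, mode="edge")                 # avoid fake border blobs
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):                     # naive 'same' convolution
        for j in range(img.shape[1]):
            out[i, j] = np.sum(pad[i:i + 2*r + 1, j:j + 2*r + 1] * log)
    return sigma**2 * out

# Dark disk of radius ~4 on a bright background: the squared response at the
# disk center should peak near the characteristic scale sigma = radius/sqrt(2).
img = np.ones((21, 21))
yy, xx = np.mgrid[:21, :21]
img[(yy - 10)**2 + (xx - 10)**2 <= 16] = 0.0
scores = {s: scale_normalized_log_response(img, s)[10, 10]**2
          for s in (1.0, 3.0, 6.0)}
assert scores[3.0] > scores[1.0] and scores[3.0] > scores[6.0]
```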

Page 61: Visual Recognition: Objects, Actions and Scenes

Scale-space blob detector: Example

Slide: S. Lazebnik

Page 62: Visual Recognition: Objects, Actions and Scenes

Efficient implementation

• Approximating the Laplacian with a difference of Gaussians:

L = \sigma^2 \left( G_{xx}(x, y, \sigma) + G_{yy}(x, y, \sigma) \right)   (Laplacian)

\mathrm{DoG} = G(x, y, k\sigma) - G(x, y, \sigma)   (Difference of Gaussians)

Slide: S. Lazebnik
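The approximation rests on the heat-equation identity \partial G / \partial \sigma = \sigma \nabla^2 G, so G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\sigma^2 \nabla^2 G. The identity can be verified numerically; a small sketch (grid size and step h are arbitrary choices):

```python
import numpy as np

y, x = np.mgrid[-12:13, -12:13].astype(float)

def gaussian(sigma):
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

def laplacian_of_gaussian(sigma):
    return (x**2 + y**2 - 2 * sigma**2) / sigma**4 * gaussian(sigma)

# Verify dG/dsigma = sigma * Laplacian(G) by a finite difference in sigma;
# this is what makes DoG a cheap stand-in for the scale-normalized Laplacian.
sigma, h = 2.0, 1e-4
dG = (gaussian(sigma + h) - gaussian(sigma)) / h
assert np.abs(dG - sigma * laplacian_of_gaussian(sigma)).max() < 1e-4
```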

Page 63: Visual Recognition: Objects, Actions and Scenes

Efficient implementation

David G. Lowe. "Distinctive image features from scale-invariant

keypoints.” IJCV 60 (2), pp. 91-110, 2004.

Slide: S. Lazebnik

Page 64: Visual Recognition: Objects, Actions and Scenes

From feature detection to feature description

• Scaled and rotated versions of the same

neighborhood will give rise to blobs that are related

by the same transformation

• What to do if we want to compare the appearance of

these image regions?

• Normalization: transform these regions into same-size circles

• Problem: rotational ambiguity

Slide: S. Lazebnik

Page 65: Visual Recognition: Objects, Actions and Scenes

Eliminating rotation ambiguity

• To assign a unique orientation to circular image windows:
  • Create histogram of local gradient directions in the patch
  • Assign canonical orientation at peak of smoothed histogram

[Figure: orientation histogram over 0 to 2\pi.]

Slide: S. Lazebnik
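The two bullet points can be sketched directly in NumPy (the bin count and the ramp test patch are illustrative; histogram smoothing, which SIFT applies before taking the peak, is omitted for brevity):

```python
import numpy as np

def dominant_orientation(patch, n_bins=36):
    """Gradient-direction histogram weighted by magnitude; peak = canonical angle."""
    gy, gx = np.gradient(patch.astype(float))
    ang = np.arctan2(gy, gx) % (2 * np.pi)          # directions in [0, 2*pi)
    mag = np.hypot(gx, gy)
    hist, edges = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    peak = np.argmax(hist)
    return (edges[peak] + edges[peak + 1]) / 2      # center of the peak bin

# A horizontal ramp: the gradient points along +x, so the canonical angle ~ 0.
patch = np.tile(np.arange(16, dtype=float), (16, 1))
theta = dominant_orientation(patch)
assert 0 <= theta < 0.2
```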

Page 66: Visual Recognition: Objects, Actions and Scenes

SIFT features

• Detected features with characteristic scales

and orientations:

David G. Lowe. "Distinctive image features from scale-invariant keypoints.” IJCV 60(2), pp. 91–110, 2004.

Slide: S. Lazebnik

Page 67: Visual Recognition: Objects, Actions and Scenes

SIFT descriptors

David G. Lowe. "Distinctive image features from scale-invariant keypoints.” IJCV 60(2), pp. 91–110, 2004.

Slide: S. Lazebnik

Page 68: Visual Recognition: Objects, Actions and Scenes

Invariance vs. covariance

Invariance:
• features(transform(image)) = features(image)

Covariance:
• features(transform(image)) = transform(features(image))

Covariant detection => invariant description

Slide: S. Lazebnik

Page 69: Visual Recognition: Objects, Actions and Scenes

Affine adaptation

• Affine transformation approximates viewpoint

changes for roughly planar objects and

roughly orthographic cameras

Slide: S. Lazebnik

Page 70: Visual Recognition: Objects, Actions and Scenes

Affine adaptation

Consider the second moment matrix of the window containing the blob:

M = \sum_{(x,y)} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} = R^{-1} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} R

Recall:

[u \; v] \, M \begin{bmatrix} u \\ v \end{bmatrix} = \mathrm{const}

This ellipse visualizes the “characteristic shape” of the window: its half-axes have lengths (\lambda_{\max})^{-1/2} along the direction of fastest change and (\lambda_{\min})^{-1/2} along the direction of slowest change.

Slide: S. Lazebnik

Page 71: Visual Recognition: Objects, Actions and Scenes

Affine adaptation example

Scale-invariant regions (blobs)

Slide: S. Lazebnik

Page 72: Visual Recognition: Objects, Actions and Scenes

Affine adaptation example

Affine-adapted blobs

Slide: S. Lazebnik

Page 73: Visual Recognition: Objects, Actions and Scenes

Software

VLFeat: Vision Library Features, http://www.vlfeat.org/
(will be used in this course)

– Local image features (Harris, SIFT, MSER, …)
– Local image descriptors (SIFT, LBP, …)
– Feature encoding (VLAD, Fisher)
– Machine learning tools (k-means, GMM, SVM)
– Matlab and C interfaces

Page 74: Visual Recognition: Objects, Actions and Scenes

References

– Schmid, C. and Mohr, R. (1997). Local grayvalue invariants for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(5): 530–535.

– Lindeberg, T. (1998). Feature detection with automatic scale selection, International Journal of Computer Vision 30(2): 77–116.

– Lowe, D. (2004). Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60(2): 91–110.

– Mikolajczyk, K. and Schmid, C. (2002). An affine invariant interest point detector, Proc. Seventh European Conference on Computer Vision, Copenhagen, Denmark, Vol. 2350 of Lecture Notes in Computer Science, Springer, pp. I:128–142.

– Mikolajczyk, K. and Schmid, C. (2003). A performance evaluation of local descriptors, Proc. Computer Vision and Pattern Recognition, pp. II:257–263.

– Sivic, J. and Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos, Proc. Ninth International Conference on Computer Vision, Nice, France, pp. 1470–1477.

– VLFeat (Vision Library Features), http://www.vlfeat.org/