Transcript of Lecture 4: Explicit and implicit 3D object models
6.870 Object Recognition and Scene Understanding

Page 1:

Lecture 4

Explicit and implicit 3D object models

6.870 Object Recognition and Scene Understanding http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm

Page 2:

Monday

Recognition of 3D objects

• Presenter: Alec Rivers

• Evaluator:

Page 3:

2D frontal face detection

Amazing how far they have gotten with so little…

Page 4:

People have the bad taste of not being rotationally symmetric

Examples of un-collaborative subjects

Page 5:

Objects are not flat*

*In the old days, some toy makers and a few people working on face detection suggested that flat objects could be a good approximation to real objects.

Page 6:

Solution to deal with 3D variations: "do not deal with it"

"Not"-dealing with rotations and pose:

Train a different model for each view.

The combined detector is invariant to pose variations without an explicit 3D model.
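The "train a model per view, then pool" idea can be sketched with toy scoring functions (hypothetical stand-ins, not the course's actual detectors): a bank of view-specific classifiers whose maximum response serves as the pose-pooled detector.

```python
# Sketch: pose "invariance" by pooling a bank of view-specific detectors.
# The scoring functions below are hypothetical stand-ins for trained models.

def make_view_detector(view_angle):
    """A toy detector whose response peaks at its trained view angle."""
    def score(observed_angle):
        d = abs(observed_angle - view_angle) % 360
        angular_distance = min(d, 360 - d)
        # Response falls off linearly, reaching 0 at 90 degrees away.
        return max(0.0, 1.0 - angular_distance / 90.0)
    return score

# "Train" one detector every 30 degrees around the object.
detectors = [make_view_detector(a) for a in range(0, 360, 30)]

def combined_score(observed_angle):
    # The combined detector: maximum response over all view-specific models.
    return max(d(observed_angle) for d in detectors)

# The pooled response stays high regardless of pose:
print(combined_score(0), combined_score(17), combined_score(200))
```

Note the cost this sketch makes explicit: covering pose alone already takes twelve models, before multiplying in classes and styles.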

Page 7:

viewpoints

Need to detect Nclasses × Nviews × Nstyles, in clutter. Lots of variability within classes, and across viewpoints.

Object classes

And why should we stop with pose? Let's do the same with styles, lighting conditions, etc, etc, etc…

So, how many classifiers?

Page 8:

Depth without objects: random dot stereograms (Bela Julesz)

Julesz, 1971

3D is so important for humans that we decided to grow two eyes in front of the face instead of having one looking to the front and another to the back. (This is not something that Julesz said… but he could have. Maybe he did.)

Page 9:

Objects: 3D shape priors

by H. Bülthoff, Max-Planck-Institut für biologische Kybernetik, Tübingen

Video taken from http://www.michaelbach.de/ot/fcs_hollow-face/index.html

Page 10:

3D drives perception of important object attributes

by Roger Shepard ("Turning the Tables")

Depth processing is automatic, and we cannot shut it down…

Page 11:

3D drives perception of important object attributes

Frederick Kingdom, Ali Yoonessi, and Elena Gheorghiu of the McGill Vision Research unit.

The two Towers of Pisa

Page 12:

It is not all about objects

The 3D percept is driven by the scene, which imposes its rules on the objects.

Page 13:

Class experiment

Page 14:

Class experiment

Experiment 1: draw a horse (the entire body, not just the head) on a white piece of paper.

Do not look at your neighbor! You already know what a horse looks like… no need to cheat.

Page 15:

Class experiment

Experiment 2: draw a horse (the entire body, not just the head), but this time choose a viewpoint as weird as possible.

Page 16:

Anonymous participant

Page 17:

3D object categorization

Wait: object categorization in humans is not invariant to 3D pose

Page 18:

3D object categorization

by Greg Robbins

Although we can categorize all three pictures as views of a horse, they do not look like equally typical views of horses, and they do not seem to be recognizable with the same ease.

Page 19:

Observations about pose invariance in humans

Two main families of effects have been observed:

• Canonical perspective

• Priming effects

Page 20:

Canonical Perspective

From Vision Science, Palmer

Experiment (Palmer, Rosch & Chase 81): participants are shown views of an object and are asked to rate "how much each one looked like the object it depicts" (scale: 1 = very much like, 7 = very unlike).


Page 21:

Canonical Perspective

From Vision Science, Palmer

Examples of canonical perspective:

In a recognition task, reaction time correlated with the ratings.

Canonical views are recognized faster at the entry level.

Why?

Page 22:

Canonical Viewpoint

• Frequency hypothesis

• Maximal information hypothesis

Page 23:

Canonical Viewpoint

Frequency hypothesis: ease of recognition is related to the number of times we have seen the object from each viewpoint.

For a computer, using its Google memory, a horse looks like:

It is not a uniform sampling over viewpoints (some artificial datasets might contain non-natural statistics).

Page 24:

Canonical Viewpoint

Frequency hypothesis: ease of recognition is related to the number of times we have seen the object from each viewpoint.

Can you think of some examples in which this hypothesis might be wrong?

Page 25:

Canonical Viewpoint

Maximal information hypothesis: some views provide more information than others about the object.

From Vision Science, Palmer

Best views tend to show multiple sides of the object.

Can you think of some examples in which this hypothesis might be wrong?

Page 26:

Canonical Viewpoint

Maximal information hypothesis:

Clocks are preferred as purely frontal

Page 27:

Canonical Viewpoint

• Frequency hypothesis

• Maximal information hypothesis

Probably both are correct. Edelman & Bülthoff (92) created novel objects to control familiarity.

1. When all viewpoints were presented with the same frequency, observers still had preferences for specific viewpoints.
2. When only a few viewpoints were presented, recognition was better for previously seen viewpoints.

Page 28:

Observations about pose invariance in humans

Two main families of effects have been observed:

• Canonical perspective

• Priming effects

Page 29:

Priming effects

Priming paradigm: recognition of an object is faster the second time that you see it.

Biederman & Gerhardstein 93

Page 30:

Priming effects

Same exemplars

Different exemplars

Biederman & Gerhardstein 93

Page 31:

Priming effects

Biederman & Gerhardstein 93

Page 32:

Object representations

Explicit 3D models use a volumetric representation: an explicit model of the 3D geometry of the object.

Appealing, but hard to get to work…

Page 33:

Object representations

Implicit 3D models: matching the input 2D view to view-specific representations.

Not very appealing, but somewhat easy to get to work*…

* we all know what I mean by “work”

Page 34:

Object representations

Implicit 3D models: matching the input 2D view to view-specific representations.

The object is represented as a collection of 2D views (maybe the most frequent views seen in the past).

Tarr & Pinker (89) showed that people are faster at recognizing previously seen views, as if they were storing them. People were also able to recognize unseen views, so they generalize to new views as well. It is not just template matching.

Page 35:

Why do I explain all this?

• As we build systems and develop algorithms, it is good to:
  – Get inspiration from what others have thought.
  – Get intuitions about what can work, and how things can fail.

Page 36:

Explicit 3D model

Object Recognition in the Geometric Era: a Retrospective, Joseph L. Mundy

Page 37:

Explicit 3D model

Not all explicit 3D models were disappointing.

For some object classes, with accurate geometric and appearance models, it is possible to get remarkable results.

Page 38:

A Morphable Model for the Synthesis of 3D Faces

Blanz & Vetter, Siggraph 99

Page 39:
Page 40:

A Morphable Model for the Synthesis of 3D Faces

Blanz & Vetter, Siggraph 99

Page 41:

We have not yet achieved the same level of description for other object classes.

Page 42:

Implicit 3D models

Page 43:

Aspect Graphs

“The nodes of the graph represent object views that are adjacent to each other on the unit sphere of viewing directions but differ in some significant way. The most common view relationship in aspect graphs is based on the topological structure of the view, i.e., edges in the aspect graph arise from transitions in the graph structure relating vertices, edges and faces of the projected object.” Joseph L. Mundy

Page 44:

Aspect Graphs

Page 45:

Affine patches

Revisit invariants as a local description of 3D objects: indeed, although smooth surfaces are almost never planar in the large, they are always planar in the small.

3D Object Modeling and Recognition Using Local Affine-Invariant Image Descriptors and Multi-View Spatial Constraints. F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce, IJCV 2006

Page 46:

Affine patches

Two steps:

1. Detection of salient image regions

2. Extraction of a descriptor around the detected locations

Page 47:

Affine patches

Two steps:

1. Detection of salient image regions (Garding and Lindeberg, 96; Mikolajczyk and Schmid, 02)

a) An elliptical image region is deformed to maximize the isotropy of the corresponding brightness pattern.

b) Its characteristic scale is determined as a local extremum of the normalized Laplacian in scale space.

c) The Harris (1988) operator is used to refine the position of the ellipse's center.

The elliptical region obtained at convergence can be shown to be covariant under affine transformations.
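The isotropy criterion in step (a) can be illustrated with the second-moment matrix of image gradients: the brightness pattern is isotropic when the matrix's two eigenvalues are equal. A minimal sketch on synthetic gradient samples (not the actual detector), assuming numpy:

```python
import numpy as np

def second_moment_matrix(Ix, Iy):
    """Second-moment (structure) matrix of a patch from its gradients."""
    return np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                     [np.sum(Ix * Iy), np.sum(Iy * Iy)]])

def isotropy_ratio(M):
    """Ratio of smallest to largest eigenvalue; 1.0 means fully isotropic."""
    evals = np.linalg.eigvalsh(M)  # ascending order
    return evals[0] / evals[1]

rng = np.random.default_rng(0)
# Isotropic brightness pattern: gradients equally strong in all directions.
Ix, Iy = rng.standard_normal(10000), rng.standard_normal(10000)
# Anisotropic pattern: gradients dominated by the x direction.
Jx, Jy = rng.standard_normal(10000) * 3.0, rng.standard_normal(10000) * 0.3

r_iso = isotropy_ratio(second_moment_matrix(Ix, Iy))
r_aniso = isotropy_ratio(second_moment_matrix(Jx, Jy))
print(r_iso, r_aniso)  # close to 1 vs. close to 0
```

The affine adaptation loop deforms the ellipse until this ratio approaches 1, which is what makes the region covariant under affine transformations.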

Page 48:

Affine patches

Page 49:

Affine patches

Page 50:

Affine patches

Page 51:

Affine patches

Page 52:

Affine patches

Each region is represented with the SIFT descriptor.

Page 53:

Affine patches

A coherent 3D interpretation of all the matches is obtained using a formulation derived from structure-from-motion, with RANSAC to deal with outliers.
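The RANSAC ingredient follows the generic hypothesize-and-verify loop; here is a minimal sketch on the simplest possible model, a 2D line with gross outliers (not the paper's multi-view geometry):

```python
import random

def ransac_line(points, n_iters=200, inlier_tol=0.1):
    """Fit y = a*x + b to points, robust to outliers, via RANSAC."""
    best_model, best_inliers = None, []
    rng = random.Random(0)
    for _ in range(n_iters):
        # 1. Hypothesize: fit a model to a minimal random sample (2 points).
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample, cannot define a slope
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # 2. Verify: count points consistent with the hypothesis.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) < inlier_tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Points on y = 2x + 1, plus gross outliers.
pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 40), (7, -5), (2, 17)]
(a, b), inliers = ransac_line(pts)
print(a, b, len(inliers))
```

In the paper, the same loop operates on affine-patch matches, and the model being hypothesized and verified is a 3D geometric configuration rather than a line.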

Page 54:

Affine patches

Page 55:

Patch-based single view detector

Car model / Screen model

Vidal-Naquet, Ullman (2003)

Page 56:

For a single view

First we collect a set of part templates from a set of training objects.

Vidal-Naquet, Ullman (2003)

Page 57:

Extended fragments

View-Invariant Recognition Using Corresponding Object Fragments. E. Bart, E. Byvatov, & S. Ullman

Page 58:

Extended fragments

View-Invariant Recognition Using Corresponding Object Fragments. E. Bart, E. Byvatov, & S. Ullman

Page 59:

Extended fragments

View-Invariant Recognition Using Corresponding Object Fragments. E. Bart, E. Byvatov, & S. Ullman

Page 60:

Extended fragments

Extended patches are extracted using short sequences.

Use Lucas-Kanade motion estimation to track patches across the sequence.
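A single Lucas-Kanade step for pure translation reduces to a least-squares problem on image gradients. A toy version on a smooth synthetic image (a sketch assuming numpy, not the authors' implementation):

```python
import numpy as np

def lk_translation(I0, I1):
    """One Lucas-Kanade step: estimate the translation (dx, dy) that maps
    image I0 onto I1, from spatial and temporal gradients."""
    Iy, Ix = np.gradient(I0)          # spatial gradients (rows = y, cols = x)
    It = I1 - I0                      # temporal gradient
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    # Brightness constancy linearizes to A [dx, dy]^T = -It; solve by least squares.
    (dx, dy), *_ = np.linalg.lstsq(A, b, rcond=None)
    return dx, dy

# Smooth synthetic image and a copy shifted by a small sub-pixel amount.
y, x = np.mgrid[0:64, 0:64]
I0 = np.sin(x / 10.0) + np.cos(y / 12.0)
dx_true, dy_true = 0.4, -0.25
I1 = np.sin((x - dx_true) / 10.0) + np.cos((y - dy_true) / 12.0)

dx, dy = lk_translation(I0, I1)
print(dx, dy)  # approximately the true shift
```

Real trackers iterate this step, warp the patch, and use image pyramids for larger motions; a single linearized step only recovers small, sub-pixel shifts like the one above.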

Page 61:

Learning

Once a large pool of extended fragments is created, there is a training stage to select the most informative fragments.

For each fragment F, evaluate the mutual information I(C; F) between the class label C and the fragment's presence/absence F.

Select the fragment B with the maximal mutual information.

In the subsequent rounds, select the fragment that adds the most information given the fragments already chosen.

All these operations are easy to compute. It is just counting.

If C and F are independent, then I(C; F) = 0.
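The selection stage can be sketched as a greedy loop over counting-based information measures. This is a simplified stand-in for the actual criterion: plain mutual information in the first round, and in later rounds the minimum information a candidate adds beyond each already-selected fragment.

```python
import math
from itertools import product

def mi(C, F):
    """Mutual information I(C;F) in bits for two binary sequences."""
    n = len(C)
    total = 0.0
    for c, f in product((0, 1), repeat=2):
        p_cf = sum((ci, fi) == (c, f) for ci, fi in zip(C, F)) / n
        p_c = sum(ci == c for ci in C) / n
        p_f = sum(fi == f for fi in F) / n
        if p_cf > 0:
            total += p_cf * math.log2(p_cf / (p_c * p_f))
    return total

def conditional_mi(C, F, G):
    """I(C;F|G): expected MI of C and F within each value of G."""
    n = len(C)
    total = 0.0
    for g in (0, 1):
        idx = [i for i in range(n) if G[i] == g]
        if idx:
            total += (len(idx) / n) * mi([C[i] for i in idx], [F[i] for i in idx])
    return total

def select_fragments(C, fragments, k):
    """Greedy selection: max MI first, then max of the information a
    candidate adds beyond each already-selected fragment."""
    selected = []
    remaining = list(range(len(fragments)))
    for _ in range(k):
        if not selected:
            gain = lambda i: mi(C, fragments[i])
        else:
            gain = lambda i: min(conditional_mi(C, fragments[i], fragments[s])
                                 for s in selected)
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: fragment 1 duplicates fragment 0; fragment 2 complements it.
C = [1, 1, 1, 1, 0, 0, 0, 0]
frags = [[1, 1, 0, 0, 0, 0, 0, 0],   # fires on half the positives
         [1, 1, 0, 0, 0, 0, 0, 0],   # redundant duplicate
         [0, 0, 1, 1, 0, 0, 0, 0]]   # fires on the other half
print(select_fragments(C, frags, 2))  # the redundant duplicate is skipped
```

The conditioning is what prevents the pool from filling up with near-copies of the single best fragment.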

Page 62:

C: 1 0 1 1 0 0 0 0 0 1
F: 1 1 1 1 1 0 0 0 0 0

P(C=1, F=1) = 3 / 10
P(C=1, F=0) = 1 / 10
P(C=0, F=1) = 2 / 10
P(C=0, F=0) = 4 / 10
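These joint probabilities follow directly from counting the paired C and F columns; a quick check in Python:

```python
# The slide's counting exercise: joint probabilities from the paired columns.
C = [1, 0, 1, 1, 0, 0, 0, 0, 0, 1]
F = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
n = len(C)

# Count how often each (class, fragment) combination occurs.
joint = {(c, f): sum(1 for ci, fi in zip(C, F) if (ci, fi) == (c, f)) / n
         for c in (1, 0) for f in (1, 0)}

for (c, f), p in joint.items():
    print(f"P(C={c}, F={f}) = {p}")
```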

Page 63:

Training without sequences

Challenges:

- We do not know which fragments are in correspondence (we cannot use motion estimation due to the strong transformations).

Fragments that are in correspondence will have detections that are correlated across viewpoints.

The same approach can be used for arbitrary transformations.

Bart & Ullman
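The correlation cue can be sketched directly: fragments from two viewpoints are paired when their detection indicators over the same set of training objects co-vary. A toy version with hypothetical detection vectors:

```python
def correlation(a, b):
    """Pearson correlation of two binary detection vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / (va * vb) ** 0.5

# Detections of frontal-view fragments (rows: fragments, cols: objects)...
frontal = [[1, 1, 0, 0, 1], [0, 1, 1, 0, 0]]
# ...and of profile-view fragments on the same objects.
profile = [[0, 1, 1, 0, 1], [1, 1, 0, 0, 1]]

# Pair each frontal fragment with the most correlated profile fragment.
pairs = [max(range(len(profile)), key=lambda j: correlation(f, profile[j]))
         for f in frontal]
print(pairs)
```

No motion estimation is needed: only which objects trigger which fragments, which is why the idea extends beyond viewpoint change to arbitrary transformations.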

Page 64:

Shared features for Multi-view object detection

View-invariant features

View-specific features

Training does not require having different views of the same object.

Torralba, Murphy, Freeman. PAMI 07

Page 65:

Sharing is not a tree; it also depends on 3D symmetries.

Shared features for Multi-view object detection

Torralba, Murphy, Freeman. PAMI 07

Page 66:

Multi-view object detection

Strong learner H response for a car as a function of assumed view angle.

Torralba, Murphy, Freeman. PAMI 07

Page 67:

Voting schemes

Towards Multi-View Object Class Detection. Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, Bernt Schiele, Luc Van Gool

Page 68:

Viewpoint-Independent Object Class Detection using 3D Feature Maps

Training dataset: synthetic objects

Features

Voting scheme and detection: each cluster casts votes for the voting bins of the discrete poses contained in its internal list.

Liebelt, Schmid, Schertler. CVPR 2008
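The voting step can be sketched with a hypothetical codebook mapping feature clusters to weighted pose-bin votes (the cluster names, bins, and weights here are invented for illustration, not taken from the paper):

```python
from collections import Counter

# Hypothetical codebook: each feature cluster stores a list of
# (pose_bin, weight) entries from its internal list of training poses.
codebook = {
    "cluster_a": [("frontal", 1.0), ("left_45", 0.5)],
    "cluster_b": [("left_45", 1.0)],
    "cluster_c": [("frontal", 0.5), ("rear", 0.5)],
}

def vote_pose(matched_clusters):
    """Accumulate weighted votes over discrete pose bins; return the winner."""
    votes = Counter()
    for c in matched_clusters:
        for pose, w in codebook.get(c, []):
            votes[pose] += w
    return votes.most_common(1)[0][0], votes

# Clusters matched in a test image:
best, votes = vote_pose(["cluster_a", "cluster_b", "cluster_b"])
print(best, dict(votes))
```

Detection then reduces to finding pose bins whose accumulated vote mass exceeds a threshold, which is the usual Hough-style formulation.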

Page 69:

Monday

Recognition of 3D objects

• Presenter: Alec Rivers

• Evaluator: