Transcript of: Tamara Berg, Object Recognition – BoF models. 790-133: Recognizing People, Objects, & Actions.

Page 1:

Tamara Berg
Object Recognition – BoF models

790-133: Recognizing People, Objects, & Actions

Page 2:

Topic Presentations

• Hopefully you have met your topic presentation group members by now.

• Group 1 – see me to run through slides this week or Monday at the latest (I’m traveling Thurs/Friday). Send me links to 2-3 papers for the class to read.

• Sign up for the class Google group (790-133). To find the group, go to groups.google.com and search for 790-133 (sort by date). Use it to post and answer questions related to the class.

Page 3:

Bag-of-features models

Object → Bag of ‘features’

source: Svetlana Lazebnik

Page 4:

Exchangeability

• De Finetti’s theorem of exchangeability (the “bag of words” theorem): the joint probability distribution underlying the data is invariant to permutation of the observations.

Page 5:

Origin 2: Bag-of-words models

US Presidential Speeches Tag Cloud: http://chir.ag/phernalia/preztags/

• Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983)

source: Svetlana Lazebnik

Page 6:

Bag of words for text

• Represent documents as “bags of words”

Page 7:

Example

• Doc1 = “the quick brown fox jumped”
• Doc2 = “brown quick jumped fox the”

Would a bag of words model represent these two documents differently?
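A minimal sketch (illustrative code, not from the original slides) showing that the two documents get identical bag-of-words representations:

```python
from collections import Counter

doc1 = "the quick brown fox jumped"
doc2 = "brown quick jumped fox the"

# A bag of words keeps only word frequencies and discards word order,
# so both documents map to the same representation.
bow1 = Counter(doc1.split())
bow2 = Counter(doc2.split())

print(bow1 == bow2)  # True
```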

Page 8:

Bag of words for images

• Represent images as “bags of features”

Page 9:

Bag of features: outline
1. Extract features

source: Svetlana Lazebnik

Page 10:

Bag of features: outline
1. Extract features
2. Learn “visual vocabulary”

source: Svetlana Lazebnik

Page 11:

Bag of features: outline
1. Extract features
2. Learn “visual vocabulary”
3. Represent images by frequencies of “visual words”

source: Svetlana Lazebnik

Page 12:

2. Learning the visual vocabulary

Clustering

Slide credit: Josef Sivic

Page 13:

2. Learning the visual vocabulary

Clustering

Slide credit: Josef Sivic

Visual vocabulary

Page 14:

K-means clustering (reminder)

• Want to minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers m_k:

D(X, M) = Σ_k Σ_{x_i in cluster k} ||x_i − m_k||²

Algorithm:
• Randomly initialize K cluster centers
• Iterate until convergence:
  • Assign each data point to the nearest center
  • Recompute each cluster center as the mean of all points assigned to it

source: Svetlana Lazebnik
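A minimal NumPy sketch of this loop (illustrative only; the function and variable names are mine):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Cluster the rows of X (n_points x dim) into K clusters."""
    rng = np.random.default_rng(seed)
    # Randomly initialize K cluster centers by picking K distinct data points
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest center under squared Euclidean distance
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return centers, labels
```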

Page 15:

Example visual vocabulary

Fei-Fei et al. 2005

Page 16:

Image Representation

• For a query image:
  – Extract features
  – Associate each feature with the nearest cluster center (visual word)
  – Accumulate visual word frequencies over the image

[Figure: feature points (×) in descriptor space, quantized against the visual vocabulary cluster centers]
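A short sketch of this quantization step (illustrative; it assumes the vocabulary is simply an array of k-means cluster centers):

```python
import numpy as np

def bof_histogram(descriptors, vocabulary):
    """Quantize local descriptors against a visual vocabulary.

    descriptors: (n_features x dim) array of local descriptors (e.g. SIFT)
    vocabulary:  (n_words x dim) array of cluster centers (visual words)
    Returns a normalized histogram of visual-word frequencies.
    """
    # Distance of every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)              # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)     # frequency representation
```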

Page 17:

3. Image representation

[Figure: bag-of-features histogram – y-axis: frequency, x-axis: codewords]

source: Svetlana Lazebnik

Page 18:

4. Image classification

[Figure: bag-of-features histogram – y-axis: frequency, x-axis: codewords]

source: Svetlana Lazebnik

Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?

CAR

Page 19:

Image Categorization

Choose from many categories

What is this? helicopter

Page 20:

Image Categorization

Choose from many categories

What is this?

SVM / NB – Csurka et al. (Caltech 4/7)

Nearest Neighbor – Berg et al. (Caltech 101)

Kernel + SVM – Grauman et al. (Caltech 101)

Multiple Kernel Learning + SVMs – Varma et al. (Caltech 101) …

Page 21:

Visual Categorization with Bags of Keypoints
Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray

Page 22:

Data

• Images in 7 classes: faces, buildings, trees, cars, phones, bikes, books

• Caltech 4 dataset: faces, airplanes, cars (rear and side), motorbikes, background

Page 23:

Method

Steps:
– Detect and describe image patches.
– Assign patch descriptors to a set of predetermined clusters (a visual vocabulary).
– Construct a bag of keypoints, which counts the number of patches assigned to each cluster.
– Apply a classifier (SVM or Naïve Bayes), treating the bag of keypoints as the feature vector.
– Determine which category or categories to assign to the image.
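A rough end-to-end sketch of these steps using off-the-shelf tools (OpenCV SIFT plus scikit-learn k-means and a linear SVM); this is a stand-in under those assumptions, not the authors' exact detectors or classifier:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

sift = cv2.SIFT_create()

def describe(image_path):
    """Detect keypoints and compute patch descriptors for one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return desc

def build_vocabulary(train_paths, n_words=1000):
    """Cluster all training descriptors into a visual vocabulary."""
    all_desc = np.vstack([describe(p) for p in train_paths])
    return KMeans(n_clusters=n_words, n_init=4).fit(all_desc)

def bag_of_keypoints(path, vocab):
    """Count how many patches fall into each visual-word cluster."""
    words = vocab.predict(describe(path))
    return np.bincount(words, minlength=vocab.n_clusters)

def train_classifier(train_paths, labels, vocab):
    """Train a multi-class classifier on the bag-of-keypoints vectors."""
    X = np.array([bag_of_keypoints(p, vocab) for p in train_paths])
    return LinearSVC().fit(X, labels)
```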

Page 24:

Bag-of-Keypoints Approach

Pipeline: Interesting Point Detection → Key Patch Extraction → Feature Descriptors → Bag of Keypoints → Multi-class Classifier

Slide credit: Yun-hsueh Liu

Page 25:

SIFT Descriptors

Pipeline: Interesting Point Detection → Key Patch Extraction → Feature Descriptors → Bag of Keypoints → Multi-class Classifier

Slide credit: Yun-hsueh Liu

Page 26:

Bag of Keypoints (1)

• Construction of a vocabulary
  – K-means clustering finds “centroids” (on all the descriptors we find from all the training images).
  – Define a “vocabulary” as the set of “centroids”, where every centroid represents a “word”.

Pipeline: Interesting Point Detection → Key Patch Extraction → Feature Descriptors → Bag of Keypoints → Multi-class Classifier

Slide credit: Yun-hsueh Liu

Page 27:

Bag of Keypoints (2)

• Histogram
  – Counts the number of occurrences of the different visual words in each image

Pipeline: Interesting Point Detection → Key Patch Extraction → Feature Descriptors → Bag of Keypoints → Multi-class Classifier

Slide credit: Yun-hsueh Liu

Page 28:

Multi-class Classifier

• In this paper, classification is based on conventional machine learning approaches:
  – Support Vector Machine (SVM)
  – Naïve Bayes

Pipeline: Interesting Point Detection → Key Patch Extraction → Feature Descriptors → Bag of Keypoints → Multi-class Classifier

Slide credit: Yun-hsueh Liu

Page 29:

SVM

Page 30:

Reminder: Linear SVM

[Figure: linear SVM in (x1, x2) space – decision boundary w^T x + b = 0, margin boundaries w^T x + b = +1 and w^T x + b = −1, positive (x+) and negative (x−) examples, support vectors lying on the margin]

Slide credit: Jinwei Gu

g(x) = w^T x + b

minimize (1/2)||w||²   subject to   y_i (w^T x_i + b) ≥ 1

Page 31:

Nonlinear SVMs: The Kernel Trick

With a nonlinear feature mapping φ(x), our discriminant function becomes:

g(x) = w^T φ(x) + b = Σ_{i ∈ SV} α_i φ(x_i)^T φ(x) + b

No need to know this mapping explicitly, because we only use the dot product of feature vectors in both the training and test.

A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

K(x_i, x_j) = φ(x_i)^T φ(x_j)

Slide credit: Jinwei Gu

Page 32:

Nonlinear SVMs: The Kernel Trick

Examples of commonly used kernel functions:

Linear kernel:      K(x_i, x_j) = x_i^T x_j

Polynomial kernel:  K(x_i, x_j) = (1 + x_i^T x_j)^p

Gaussian (Radial Basis Function, RBF) kernel:  K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))

Sigmoid:            K(x_i, x_j) = tanh(β_0 x_i^T x_j + β_1)

Slide credit: Jinwei Gu
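A small Python sketch of these kernels for two vectors (illustrative; the parameter defaults p, σ, β_0, β_1 are arbitrary choices of mine):

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, p=3):
    return (1.0 + np.dot(x, y)) ** p

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian / RBF kernel on the squared Euclidean distance
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * np.dot(x, y) + beta1)
```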

Page 33:

SVM for image classification

• Train k binary 1-vs-all SVMs (one per class)
• For a test instance, evaluate with each classifier
• Assign the instance to the class with the largest SVM output
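A minimal sketch of this 1-vs-all scheme (assuming scikit-learn's LinearSVC as the binary SVM; my illustration, not the paper's implementation):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, classes):
    """Train one binary SVM per class (class c vs. everything else)."""
    y = np.asarray(y)
    svms = {}
    for c in classes:
        labels = (y == c).astype(int)
        svms[c] = LinearSVC().fit(X, labels)
    return svms

def predict(svms, x):
    """Assign x to the class whose SVM gives the largest output."""
    x = np.asarray(x).reshape(1, -1)
    scores = {c: clf.decision_function(x)[0] for c, clf in svms.items()}
    return max(scores, key=scores.get)
```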

Page 34:

Naïve Bayes

Page 35:

Naïve Bayes Model

C – class, F – features

We only specify (parameters):
  – P(C): the prior over class labels
  – P(F_i | C): how each feature depends on the class
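For reference, the factorization this model assumes (standard Naïve Bayes; the slide itself showed it only graphically):

$$P(C, F_1, \dots, F_n) = P(C) \prod_{i} P(F_i \mid C)$$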

Page 36:

Slide from Dan Klein

Example:

Page 37:

Slide from Dan Klein

Page 38:

Slide from Dan Klein

Page 39:

Percentage of documents in training set labeled as spam/ham

Slide from Dan Klein

Page 40:

In the documents labeled as spam, occurrence percentage of each word (e.g. # times “the” occurred/# total words).

Slide from Dan Klein

Page 41:

In the documents labeled as ham, occurrence percentage of each word (e.g. # times “the” occurred/# total words).

Slide from Dan Klein

Page 42:

Classification

The class that maximizes P(C) Π_i P(F_i | C), i.e. the prior times the product of the per-word likelihoods.

Page 43:

Classification

• In practice

Page 44:

Classification

• In practice
  – Multiplying lots of small probabilities can result in floating point underflow

Page 45:

Classification

• In practice
  – Multiplying lots of small probabilities can result in floating point underflow
  – Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.

Page 46:

Classification

• In practice
  – Multiplying lots of small probabilities can result in floating point underflow
  – Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.
  – Since log is a monotonic function, the class with the highest score does not change.

Page 47:

Classification

• In practice
  – Multiplying lots of small probabilities can result in floating point underflow
  – Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.
  – Since log is a monotonic function, the class with the highest score does not change.
  – So, what we usually compute in practice is:
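Presumably the standard log-space decision rule:

$$\hat{C} = \arg\max_{C}\Big[\log P(C) + \sum_{i}\log P(F_i \mid C)\Big]$$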

Page 48:

Naïve Bayes on images

Page 49:

Naïve Bayes

C – class, F – features

We only specify (parameters):
  – P(C): the prior over class labels
  – P(F_i | C): how each feature depends on the class

Page 50:

Naive Bayes Parameters

Problem: Categorize images as one of k object classes using a Naïve Bayes classifier:
  – Classes: object categories (face, car, bicycle, etc.)
  – Features: images represented as histograms of visual words; the features are the visual words.
  – P(C): treated as uniform.
  – P(F_i | C): learned from training data (images labeled with their category) – the probability of a visual word given an image category.

Page 51:

Multi-class Classifier – Naïve Bayes (1)

• Let V = {v_t}, t = 1, …, N, be a visual vocabulary, in which each v_t represents a visual word (cluster center) from the feature space.

• A set of labeled images I = {I_i}.

• Denote by C_j our classes, where j = 1, …, M.

• N(t, i) = number of times v_t occurs in image I_i.

• Compute P(C_j | I_i):

Slide credit: Yun-hsueh Liu
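The quantity being computed is presumably the standard Naïve Bayes posterior over categories, with the word counts as exponents:

$$P(C_j \mid I_i) \propto P(C_j) \prod_{t=1}^{N} P(v_t \mid C_j)^{N(t,i)}$$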

Page 52:

Multi-class Classifier – Naïve Bayes (2)

• Goal – find the maximum-probability class C_j:

• In order to avoid zero probability, use Laplace smoothing:

Slide credit: Yun-hsueh Liu
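In the usual presentation (e.g. in Csurka et al.), the decision rule and the Laplace-smoothed word likelihood are (with N the vocabulary size):

$$\hat{C} = \arg\max_{C_j} P(C_j) \prod_{t=1}^{N} P(v_t \mid C_j)^{N(t,i)}$$

$$P(v_t \mid C_j) = \frac{1 + \sum_{I_i \in C_j} N(t,i)}{N + \sum_{s=1}^{N} \sum_{I_i \in C_j} N(s,i)}$$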

Page 53:

Results

Page 54:

Results

Page 55:

Results

Page 56:

Results

Results on Dataset 2

Page 57:

Results

Page 58:

Results

Page 59:

Results

Page 60:

Thoughts?

• Pros?

• Cons?

Page 61:

Related BoF models: pLSA, LDA, …

Page 62:

pLSA

[Figure: pLSA graphical model – document → topic → word]

Page 63:

pLSA

Page 64:

pLSA on images

Page 65:

Discovering objects and their location in images
Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman, William T. Freeman

Documents – images
Words – visual words (vector-quantized SIFT descriptors)
Topics – object categories

Images are modeled as a mixture of topics (objects).
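The underlying pLSA decomposition, in its standard form (w = visual word, d = image, z = topic):

$$P(w \mid d) = \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d)$$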

Page 66:

Goals

They investigate three areas:
  – (i) Topic discovery, where categories are discovered by pLSA clustering on all available images.
  – (ii) Classification of unseen images, where topics corresponding to object categories are learnt on one set of images, and then used to determine the object categories present in another set.
  – (iii) Object detection, where the goal is to determine the location and approximate segmentation of the object(s) in each image.

Page 67:

(i) Topic Discovery

Most likely words for 4 learnt topics (face, motorbike, airplane, car)

Page 68:

(ii) Image Classification

Confusion table for unseen test images against pLSA trained on images containing four object categories, but no background images.

Page 69:

(ii) Image Classification

Confusion table for unseen test images against pLSA trained on images containing four object categories, and background images. Performance is not quite as good.

Page 70:

(iii) Topic Segmentation

Page 71:

(iii) Topic Segmentation

Page 72:

(iii) Topic Segmentation