
Object Recognition in Images and Video: Advanced Bag-of-Words

http://www.micc.unifi.it/bagdanov/obrec

Prof. Andrew D. Bagdanov

Dipartimento di Ingegneria dell’Informazione
Università degli Studi di Firenze
andrew.bagdanov AT unifi.it

27 April 2017

Prof. Andrew D. Bagdanov (DINFO), Object Recognition in Images and Video: Advanced Bag-of-Words, 27 April 2017


Outline

1 Comments

2 Overview

3 The Bag-of-Words Model

4 Interlude

5 Advanced BOW: Spatial Pyramids

6 Advanced BOW: Sparse Coding

7 Advanced BOW: Fisher Vectors

8 Detection: Deformable Part Models

9 Discussion


Comments


Comments: Final Exam

For the exam (your presentations), we can be flexible.
We’re all busy, and I recognize that.
For those in Florence: we can arrange to do your presentations at any time (preferably by mid-June).
For those outside of Florence: we can also do the final exam by Skype, if that’s easier.
I would strongly prefer to finish all exam presentations by mid-June.


Overview

In this lesson I will first pick up where I left off with an explanation of the basic Bag-of-Words (BOW) model.
Then I will explain three extensions of this basic model (which is why the lecture is called Advanced Bag-of-Words).
After that I will cover an article on detection (which is only loosely connected with the Bag-of-Words).
Since you have all read the articles, I will cover the details briefly, then open the floor for discussion and questions.
Together, we will work to reach a deeper understanding of each contribution.


The Bag-of-Words Model


Three Magic Ingredients

Now we will shift our discussion to one of the first Big Breakthroughs in modern object recognition:
Visual Categorization with Bags of Keypoints, Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray. In: Workshop on Statistical Learning in Computer Vision, European Conference on Computer Vision (ECCV), 2004.
These ideas were developed independently, in many places, at the same time.
This paper is one of the first, and in my opinion the simplest, explanations of the basic Bag-of-Words pipeline.
Again returning to our analogy with text retrieval, we now have a reasonably invariant way to describe local image structure.
However, we still don’t have a concept corresponding to words.
SIFT descriptors are 128-dimensional real-valued vectors, which are not discrete enough to use in a TF*IDF model.


Feature Quantization

Key idea: use clustering to identify groups of similar SIFT descriptors in a training set.
The cluster centers are used as a visual vocabulary: the words in our model.
All SIFT descriptors extracted from training or test images are quantized to the closest visual word in our vocabulary.
We have gone from an infinite class of SIFT descriptors to a finite class of visual words.
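The vocabulary-building and quantization steps can be sketched in a few lines. This is a minimal illustration, not the setup used by Csurka et al.: it assumes plain Lloyd's k-means with a deterministic, evenly spaced initialization (real pipelines typically use random or k-means++ seeding), and it uses toy 2-D points in place of 128-D SIFT descriptors. The names `kmeans` and `nearest_word` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster center) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: sq_dist(descriptor, vocabulary[k]))


def kmeans(descriptors, k, iterations=20):
    """Plain Lloyd's k-means; returns k cluster centers (the visual vocabulary)."""
    # Assumption: evenly spaced initial centers, chosen only for reproducibility.
    step = len(descriptors) // k
    centers = [list(descriptors[j * step]) for j in range(k)]
    for _ in range(iterations):
        # Assignment step: group descriptors by their nearest center.
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest_word(d, centers)].append(d)
        # Update step: move each center to the mean of its group.
        for j, group in enumerate(groups):
            if group:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


# Toy 2-D "descriptors" standing in for 128-D SIFT vectors.
train = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
vocabulary = kmeans(train, k=2)
words = [nearest_word(d, vocabulary) for d in train]
print(words)
```

Each descriptor is replaced by a single integer, which is what makes the text-retrieval machinery applicable.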


Feature Pooling

One last problem: the number of SIFT descriptors is variable, since each image will yield a different number of points.
Also, any direct comparison of two descriptor sets depends crucially on the order of the points.
This makes it hard to apply standard machine learning techniques to our representation (e.g. SVM, naive Bayes, nearest neighbor, etc.).
The solution: as in text retrieval, use pooling to build a fixed-length descriptor of images that is invariant to descriptor order.
Our descriptor is a histogram of frequencies of visual word occurrences in the image.
To compare images we can now use inner products (like TF*IDF), SVMs, and a vast array of tried and true classifiers.
This last point is the most important: given a training set of images labeled with object categories, we can train classifiers to recognize objects in unseen test images.
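A small sketch of the pooling step makes the two properties concrete: the output length depends only on the vocabulary size, and reordering the words changes nothing. The function name `bow_histogram` and the choice to normalize counts to frequencies are assumptions for illustration.

```python
def bow_histogram(word_indices, vocabulary_size):
    """Pool a variable-length list of visual-word indices into a
    fixed-length histogram of relative frequencies."""
    counts = [0] * vocabulary_size
    for w in word_indices:
        counts[w] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * vocabulary_size
    return [c / total for c in counts]


# Three toy "images", already quantized to visual-word indices.
image_a = [2, 0, 2, 1, 2]
image_b = [2, 2, 1, 2, 0]   # same words as image_a, different order
image_c = [0, 0, 1]         # different number of descriptors

h_a = bow_histogram(image_a, vocabulary_size=4)
h_b = bow_histogram(image_b, vocabulary_size=4)
h_c = bow_histogram(image_c, vocabulary_size=4)
print(h_a, h_b, h_c)
```

Here `h_a` and `h_b` are identical, and all three histograms have length 4 regardless of how many descriptors each image produced.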


The Bag-of-Words Model

This full pipeline is best explained graphically


The Bag-of-Words Model

Csurka et al. demonstrated the BOW approach on a dataset with 7 object categories.
They extract BOW descriptors from training images and train a one-versus-all linear SVM for each category.


The Bag-of-Words

The punchline: the results on this challenging dataset are impressive.
The approach uses a small vocabulary of 1000 visual words (in text retrieval, 100K+ word dictionaries are common).
It also uses an extremely simple linear SVM for classification.


The Bag-of-Words

Added bonus: visual words are semantically meaningful (note: this example from Csurka et al. is highly cherry-picked):


The Bag-of-Words

Another bonus: the one-versus-all SVM architecture can recognize multiple object categories in images.


BOW: Discussion

Discussion
As with the SIFT descriptor, it is hard to overstate the impact and influence the Bag-of-Words model has had on the development of modern object recognition.
It is a landmark result, despite its extreme simplicity (in hindsight).
The paper of Csurka et al. was the first to demonstrate the plausibility of efficient, accurate, and robust object recognition over many categories with extreme visual variance.
Clearly, this simple BOW model was only the beginning.
The next ten years of computer vision were dominated by incremental improvements and refinements of this model.
In the next lecture we will head off in that direction with a survey of advanced Bag-of-Words models that came after.
Note: see the course website for the required reading for next week.


Interlude


Interlude: Ten Years of Progress

The Bag-of-Words was quickly adopted by the community as the method for object recognition.
A long series of improvements over the basic BOW model rapidly followed.
In the following, we will look at some important examples.
With the adoption of the BOW model, and the growing interest in object recognition, several standard benchmark datasets were also established.
Many of these datasets were developed in the context of international competitions:

  PASCAL VOC: five years of competitions, many high-quality benchmark datasets.
  ImageNet: first version with 1000 object categories, over 1M images. Current version has 10M+ images.


Advanced BOW: Spatial Pyramids


BOW is an Orderless Image Representation

The next paper we will examine is:
Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, S Lazebnik, C Schmid, J Ponce. In: Computer Vision and Pattern Recognition (CVPR), 2006.
The motivation behind this work is that the Bag-of-Words is an orderless image representation.
We can think of the BOW histogramming process as marginalizing spatial information away.
Assume we have encoded an image in quantized visual words: that is, each spatial location is represented by a single integer, the index of the cluster center closest to its SIFT descriptor.


BOW is an Orderless Image Representation

Assume we have $K = 1000$ visual words in our vocabulary, and let $I_q$ be the quantized image (i.e. each location is represented by an integer index).
Let $\delta_i$ be the (curried) Kronecker delta function:

$$\delta_i(j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

We can express the image as a discrete field of one-hot vectors, and the histogram as a sum over the field:

$$I_{\text{1-hot}}(x, y) = \left[ \delta_i(I_q(x, y)) \right]_{i=1}^{1000}$$

$$H(I_{\text{1-hot}}) = \sum_x \sum_y I_{\text{1-hot}}(x, y)$$


BOW is an Orderless Image Representation

Feature: we have a fixed-length representation of images.
Feature: this representation has strong invariance.
Bug: we lose all spatial coherence in this representation (it’s too invariant).


Spatial Pyramids: Impose Some Order

The main idea:
  revisit global non-invariant representations based on aggregating statistics of local features; and
  use kernel-based recognition that computes rough geometric correspondence on a global scale.

Once you sweep away the mumbo jumbo in the paper, the method is quite simple:

  repeatedly subdivide the image; and
  compute histograms of local features at increasingly fine resolutions.

The spatial pyramid technique is simple and extremely effective.
After its publication, it became ubiquitous in nearly all BOW pipelines.
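The two steps above (subdivide, then histogram each cell) can be sketched as follows. This is an illustrative, unweighted version: it assumes image coordinates have been normalized to [0, 1), that each keypoint is given as an `(x, y, word)` triple, and the function name `spatial_pyramid` is hypothetical. The per-level weights of the actual match kernel are left out here.

```python
def spatial_pyramid(words, vocabulary_size, levels):
    """Concatenate per-cell visual-word histograms for pyramid levels 0..levels.

    words: list of (x, y, word) with x and y normalized to [0, 1).
    Level l splits the image into 2**l by 2**l cells.
    """
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level
        hist = [[0] * vocabulary_size for _ in range(cells * cells)]
        for x, y, w in words:
            cx = min(int(x * cells), cells - 1)  # column of the cell
            cy = min(int(y * cells), cells - 1)  # row of the cell
            hist[cy * cells + cx][w] += 1
        for cell in hist:
            feature.extend(cell)
    return feature


# Two visual words, three keypoints.
image = [(0.1, 0.1, 0), (0.9, 0.1, 1), (0.9, 0.9, 1)]
# The same words with the x coordinate mirrored.
mirrored = [(1.0 - x - 1e-9, y, w) for x, y, w in image]

f_image = spatial_pyramid(image, vocabulary_size=2, levels=1)
f_mirrored = spatial_pyramid(mirrored, vocabulary_size=2, levels=1)
print(f_image)
print(f_mirrored)
```

The first two entries (level 0) are the ordinary BOW histogram and agree for both images; the remaining entries (level 1) differ, which is exactly the spatial information the orderless model throws away.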


Spatial Pyramids: A Technical Aside

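The Support Vector Machine (SVM) is the standard classifier for BOW.
The linear SVM objective is to find the $w$ and $b$ that minimize:

$$\left[ \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\ 1 - y_i (w \cdot x_i + b)\right) \right] + \lambda \|w\|^2.$$

This can be rewritten as a constrained optimization problem:

$$\text{minimize } \frac{1}{n} \sum_{i=1}^{n} \zeta_i + \lambda \|w\|^2$$

$$\text{subject to } y_i (w \cdot x_i + b) \ge 1 - \zeta_i \text{ and } \zeta_i \ge 0, \text{ for all } i.$$

And the dual formulation:

$$\text{maximize } f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i (x_i \cdot x_j) y_j c_j,$$

$$\text{subject to } \sum_{i=1}^{n} c_i y_i = 0, \text{ and } 0 \le c_i \le \frac{1}{2n\lambda} \text{ for all } i.$$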
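To make the primal objective concrete, here is a small sketch that evaluates the hinge-loss term plus the regularizer for a given w and b. It only evaluates the objective and does not train anything; the function name `svm_objective` and the toy data are assumptions.

```python
def svm_objective(w, b, xs, ys, lam):
    """Soft-margin linear SVM objective:
    mean hinge loss over the samples plus lam * ||w||^2."""
    hinge_total = 0.0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge_total += max(0.0, 1.0 - y * score)
    regularizer = lam * sum(wi * wi for wi in w)
    return hinge_total / len(xs) + regularizer


# Three toy samples with labels in {+1, -1}.
xs = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
ys = [1, -1, -1]

# The first two samples lie outside the margin (zero hinge loss);
# the third is misclassified and contributes 1 + 0.5 = 1.5.
value = svm_objective(w=(1.0, 0.0), b=0.0, xs=xs, ys=ys, lam=0.5)
print(value)
```

Only samples that violate the margin contribute to the loss, which is why the slack variables in the constrained form are zero for well-classified points.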


Spatial Pyramids: A Technical Aside

The $c_i$ in the dual are defined so that we can write the classifier vector as:

$$w = \sum_{i=1}^{n} c_i y_i x_i.$$

Kernel trick: embed our feature vectors $x_i$ in a Hilbert space via a map $\varphi(x_i)$.

$$\text{maximize } f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i\, k(x_i, x_j)\, y_j c_j$$

$$\text{subject to } \sum_{i=1}^{n} c_i y_i = 0, \text{ and } 0 \le c_i \le \frac{1}{2n\lambda} \text{ for all } i.$$

Where $k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$.
So, we never have to actually embed our features; we just need to compute the kernel matrix of all pairs of inner products.


Spatial Pyramids: A Technical Aside
Some popular kernels

Linear:

$$k(x, y) = x \cdot y$$

Gaussian RBF:

$$k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$$

$\chi^2$ (distance form):

$$\chi^2(x, y) = \frac{1}{2} \sum_{i=1}^{d} \frac{(x_i - y_i)^2}{x_i + y_i}$$

Exponential $\chi^2$:

$$k(x, y) = \exp\left( -\beta\, \chi^2(x, y) \right)$$
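As a sketch of how these kernels feed the dual SVM, the code below implements each one and builds the kernel (Gram) matrix for a small set of histograms. The function names are hypothetical, the chi-squared function is taken in its distance form, and the exponential kernel is assumed to be exp(-beta * chi2(x, y)); bins where both histograms are zero are skipped to avoid dividing by zero.

```python
import math


def linear(x, y):
    return sum(a * b for a, b in zip(x, y))


def gaussian_rbf(x, y, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def chi2(x, y):
    """Chi-squared distance between two non-negative histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def exp_chi2(x, y, beta=1.0):
    return math.exp(-beta * chi2(x, y))


def gram_matrix(xs, kernel):
    """Matrix of kernel values for all pairs of inputs."""
    return [[kernel(a, b) for b in xs] for a in xs]


histograms = [(0.5, 0.5), (1.0, 0.0)]
G = gram_matrix(histograms, exp_chi2)
print(G)
```

The Gram matrix is all a kernel SVM solver needs, but note that it has n squared entries for n training images.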


Spatial Pyramids: Structured Matching

The inspiration for Spatial Pyramids comes from a technique called pyramid matching for measuring similarity between $d$-dimensional point sets $X$ and $Y$.
It constructs a sequence of grids at resolutions $0, \ldots, L$ such that the grid at level $\ell$ has $2^\ell$ cells along each of the $d$ dimensions.
Thus, there is a total of $D = 2^{d\ell}$ cells at level $\ell$.
Finally, let $H_X^\ell$ and $H_Y^\ell$ denote the histograms of $X$ and $Y$ at level $\ell$.
So: $H_X^\ell(i)$ and $H_Y^\ell(i)$ are the numbers of points from $X$ and $Y$ that fall into the $i$th cell of the grid.


Spatial Pyramids: Structured Matching

The main tool in defining Spatial Pyramids is the Histogram Intersection Kernel:

$$I(H_X^\ell, H_Y^\ell) = \sum_{i=1}^{D} \min\left( H_X^\ell(i),\ H_Y^\ell(i) \right)$$

What we’re doing here is simply counting the number of points that hit the same cells in the dyadic spatial decomposition.
Write $I^\ell$ for $I(H_X^\ell, H_Y^\ell)$. If we notice that the matches found at level $\ell$ also include all matches at the finer level $\ell + 1$, we can write the pyramid match kernel:

$$\kappa^L(X, Y) = I^L + \sum_{\ell=0}^{L-1} \frac{1}{2^{L-\ell}} \left( I^\ell - I^{\ell+1} \right) = \frac{1}{2^L} I^0 + \sum_{\ell=1}^{L} \frac{1}{2^{L-\ell+1}} I^\ell$$
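A short sketch can confirm that the two forms of the pyramid match kernel agree. It assumes points with coordinates normalized to [0, 1), stores each level's histogram sparsely as a `Counter` keyed by cell index, and counts only the matches that are new at each level (those found at level l but not at the finer level l + 1). The function names are hypothetical.

```python
from collections import Counter


def level_histogram(points, level):
    """Sparse histogram of points over a grid with 2**level cells per dimension."""
    cells = 2 ** level
    return Counter(tuple(min(int(c * cells), cells - 1) for c in p) for p in points)


def intersection(hx, hy):
    """Histogram intersection: number of points landing in the same cells."""
    return sum(min(count, hy[cell]) for cell, count in hx.items())


def pyramid_match(X, Y, L):
    """Return both forms of the pyramid match kernel and the per-level intersections."""
    matches = [intersection(level_histogram(X, l), level_histogram(Y, l))
               for l in range(L + 1)]
    # Form 1: finest-level matches, plus new matches at coarser levels, down-weighted.
    first = matches[L] + sum((matches[l] - matches[l + 1]) / 2 ** (L - l)
                             for l in range(L))
    # Form 2: the same quantity as a weighted sum of intersections.
    second = matches[0] / 2 ** L + sum(matches[l] / 2 ** (L - l + 1)
                                       for l in range(1, L + 1))
    return first, second, matches


X = [(0.1, 0.1), (0.6, 0.6), (0.9, 0.2)]
Y = [(0.2, 0.2), (0.7, 0.9), (0.4, 0.8)]
first, second, matches = pyramid_match(X, Y, L=2)
print(first, second, matches)
```

The per-level intersections can only decrease as the grid gets finer, so every difference in the first form is non-negative.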


Spatial Pyramids: Structured Matching

To extend this idea to the BOW model, we can quantize all salient locations in the image (SIFT descriptors).
Let $X_m$ and $Y_m$ represent the spatial locations of all visual words of type $m$ in images $X$ and $Y$, respectively.
Then, we can write the Spatial Pyramid Match Kernel as:

$$K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m)$$

Note that $\kappa^L$ is just a weighted sum of histogram intersections.
Note also that, for positive numbers, $c \min(a, b) = \min(ca, cb)$.
Thus, we can write the above as a single histogram intersection of concatenations of appropriately weighted channel histograms at all levels.
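The last step can be spelled out. The following is a sketch of the argument using the weighted-sum form of the pyramid match kernel; the symbols $\alpha_\ell$ and $\Phi$ are introduced here for illustration only.

```latex
\begin{align*}
K^L(X, Y)
  &= \sum_{m=1}^{M} \Big[ \tfrac{1}{2^{L}}\, I\big(H^{0}_{X_m}, H^{0}_{Y_m}\big)
     + \sum_{\ell=1}^{L} \tfrac{1}{2^{L-\ell+1}}\, I\big(H^{\ell}_{X_m}, H^{\ell}_{Y_m}\big) \Big] \\
  &= \sum_{m=1}^{M} \sum_{\ell=0}^{L} \sum_{i}
     \min\big( \alpha_\ell H^{\ell}_{X_m}(i),\ \alpha_\ell H^{\ell}_{Y_m}(i) \big)
   = I\big( \Phi(X), \Phi(Y) \big),
\end{align*}
% with weights \alpha_0 = 2^{-L} and \alpha_\ell = 2^{-(L-\ell+1)} for \ell \ge 1,
% and \Phi(X) the concatenation of \alpha_\ell H^{\ell}_{X_m} over all channels m and levels \ell.
```

With two spatial dimensions, level $\ell$ has $4^\ell$ cells, so the concatenated vector $\Phi(X)$ has $M \sum_{\ell=0}^{L} 4^\ell = M (4^{L+1} - 1)/3$ entries.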


Spatial Pyramids: What’s Going On

This is the diagram from the paper:

But this is how people usually think of the Spatial Pyramid:


Spatial Pyramids: Experiments

New Trend: More Data is Better.
Scenes-15: Fifteen categories with strong inter-class variability, intra-class similarity.
200–300 images per category.


Spatial Pyramids: Experiments

New Trend: More Data is Better.
Caltech-101: 101 categories with strong inter-class variability, intra-class similarity.
31–800 images per category.


Spatial Pyramids: Experiments

Results are impressive and interesting on Scenes-15:

And on Caltech-101:


Spatial Pyramids: Reflections

Despite the simplicity of the method, it consistently achieves an improvement over an orderless image representation.
This also despite the fact that it works not by constructing explicit object models, but by using global cues as indirect evidence about the presence of an object.
This is not a trivial accomplishment, given that a well-designed bag-of-features method can outperform more sophisticated approaches based on parts and relations.
As I mentioned before, the Spatial Pyramid technique became a standard trick to significantly and consistently improve results for BOW models.
In the next two papers, we will look at more sophisticated coding techniques.


Advanced BOW: Sparse Coding


Locality-constrained Linear Coding

The original BOW model uses global pooling of descriptors, hence a single, global image representation.
Spatial Pyramids add some spatial structure to the image representation.
Can we improve the way features themselves are coded before pooling into the final image representation?
We will look at one approach to better feature coding in our next paper:

Locality-constrained linear coding for image classification, J Wang, J Yang, K Yu, F Lv, T Huang, Y Gong. In: Computer Vision and Pattern Recognition (CVPR), 2010.


Locality-constrained Linear Coding

Some observations:
  BOW + SPM works really well.
  But: it requires non-linear SVMs to achieve state-of-the-art performance.
  As datasets grow larger, computing and storing the kernel matrix for solving the dual SVM formulation is onerous.
  We hope for a better feature encoding that allows us to achieve state-of-the-art results, but with linear SVMs.

The key insight is to use the codebook (visual vocabulary) more effectively.
This is done through sparse coding to encode local features.
This is followed by max pooling (as opposed to average pooling) to arrive at the global image description.


LLC: Basic Definitions

Let $X = [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N}$ be the $D$-dimensional local descriptors extracted from an image.
Given a set of codewords $B = [b_1, \ldots, b_M] \in \mathbb{R}^{D \times M}$, we want to encode each $x_i$ into an $M$-dimensional code.
Vector Quantization is used in the BOW, resulting in a set of 1-hot codes $C = [c_1, \ldots, c_N] \in \mathbb{R}^{M \times N}$:

Sparse coding can be used instead (reference 22 in the paper):

This leads to lower quantization loss by using more elements of the codebook to encode local features.
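For reference, the two coding schemes compared here are commonly written as the optimization problems below, with $X$, $B$ and $C$ as defined on this slide. This is a sketch based on the standard formulations of vector quantization and $\ell_1$ sparse coding; treat the exact constraint set and the weight $\lambda$ as assumptions rather than a transcription of the slide.

```latex
% Vector quantization: each code has exactly one non-zero entry, equal to 1.
\[
\arg\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2
\quad \text{s.t. } \|c_i\|_{\ell_0} = 1,\ \|c_i\|_{\ell_1} = 1,\ c_i \succeq 0,\ \forall i
\]

% Sparse coding: the cardinality constraint is relaxed to an l1 penalty,
% so several codewords may share the reconstruction of x_i.
\[
\arg\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \|c_i\|_{\ell_1}
\]
```

In the first problem each descriptor is reconstructed by a single codeword; in the second, the reconstruction error can be reduced by combining a few codewords, which is the lower quantization loss mentioned above.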


LLC: Paying a Price

LLC proposes to use an additional locality constraint (in feature space):

Where locality is smoothly modeled with an exponential:

So: code features, but pay a cost for using codewords far from the descriptor we are encoding.
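As a sketch, the locality-constrained objective and its exponential locality adaptor are usually stated as follows. The notation ($d_i$, $\odot$ for element-wise multiplication, the bandwidth $\sigma$) follows the common presentation of LLC and should be read as an assumption about what the slide showed.

```latex
% LLC: reconstruction error plus a locality penalty, with a sum-to-one constraint.
\[
\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \| d_i \odot c_i \|^2
\quad \text{s.t. } \mathbf{1}^\top c_i = 1,\ \forall i
\]

% Locality adaptor: the penalty on each codeword grows exponentially
% with its distance from the descriptor being encoded.
\[
d_i = \exp\!\left( \frac{\operatorname{dist}(x_i, B)}{\sigma} \right),
\qquad
\operatorname{dist}(x_i, B) = \big[ \operatorname{dist}(x_i, b_1), \ldots, \operatorname{dist}(x_i, b_M) \big]^\top
\]
```

Coefficients on distant codewords are multiplied by large entries of $d_i$, so the minimizer concentrates its weight on codewords near $x_i$.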


LLC: What’s Going On

Here is a comparison:


LLC: What’s Going On

The full pipeline:


LLC: Implementation

In practice, solving the constrained optimization problem for every descriptor is too costly.
Solution: select the k nearest codewords in feature space, and solve a constrained least-squares problem using only those k codewords.
Codebook optimization: maybe k-means doesn’t yield an optimal codebook for LLC coding:

  Section 3 gives an iterative algorithm for building a codebook.
  In practice, everyone uses k-means.

Classification: the LLC embedding is rich enough to allow it to perform well with linear SVMs as classifiers.
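The k-nearest-codeword approximation can be sketched as follows. For each descriptor, the constrained least-squares problem over the selected codewords has a closed-form solution: build the local covariance of the codewords shifted to the descriptor, solve a small k-by-k linear system against a vector of ones, and rescale so the coefficients sum to one. The function names, the tiny Gaussian-elimination solver, and the `ridge` term (added only for numerical stability) are assumptions, not details taken from the paper.

```python
def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def llc_code(x, codebook, k=2, ridge=1e-4):
    """Approximate LLC: code x using only its k nearest codewords."""
    order = sorted(range(len(codebook)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])))
    nearest = order[:k]
    # Shift the selected codewords so that x sits at the origin.
    Z = [[b - a for a, b in zip(x, codebook[j])] for j in nearest]
    # Local covariance C = Z Z^T, regularized on the diagonal (assumption).
    C = [[sum(p * q for p, q in zip(zi, zj)) for zj in Z] for zi in Z]
    trace = sum(C[i][i] for i in range(k))
    for i in range(k):
        C[i][i] += ridge * trace + 1e-12
    w = solve(C, [1.0] * k)
    total = sum(w)
    code = [0.0] * len(codebook)   # all other codewords get coefficient 0
    for j, wj in zip(nearest, w):
        code[j] = wj / total       # enforce the sum-to-one constraint
    return code


codebook = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
code = llc_code((0.25, 0.0), codebook, k=2)
print(code)
```

In this toy case the descriptor lies a quarter of the way from the first codeword to the second, so the code is close to (0.75, 0.25, 0): the distant third codeword is never used.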


LLC: Results

Extract dense HOG descriptors (8-pixel stride), at three scales.
Use k = 5 for approximate LLC encoding.
They also use a Spatial Pyramid (but don’t explain the configuration).
The authors consider two pooling methods to arrive at the final image representation:

  sum-pooling: just sum all codes (this is the BOW pooling).
  max-pooling: take the maximum coefficient for each codeword.
  They then report results using max-pooling with L2 normalization (since they use linear SVMs).

Use a linear SVM to train one-versus-all classifiers for each category.
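The two pooling options and the normalization step are easy to compare side by side. This is a minimal sketch over a toy list of per-descriptor codes; the function names are hypothetical.

```python
import math


def sum_pool(codes):
    """Sum the coefficient of each codeword over all descriptors (BOW-style)."""
    return [sum(col) for col in zip(*codes)]


def max_pool(codes):
    """Keep the largest coefficient of each codeword over all descriptors."""
    return [max(col) for col in zip(*codes)]


def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


# Three descriptors coded over a three-word codebook.
codes = [
    [0.75, 0.25, 0.0],
    [0.0, 0.6, 0.4],
    [0.5, 0.5, 0.0],
]
pooled_sum = sum_pool(codes)
pooled_max = max_pool(codes)
feature = l2_normalize(pooled_max)
print(pooled_sum, pooled_max, feature)
```

Sum-pooling grows with the number of descriptors that use a codeword, while max-pooling only records the strongest response, so it is less sensitive to how many descriptors an image yields.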


LLC: Results

On Caltech-101:

And on Caltech-256:


LLC: Results

A new dataset (which became the benchmark for object recognition).
The PASCAL Visual Object Classes (VOC) competitions defined the state-of-the-art for five years.
20 object categories, high-quality annotations, recognition, segmentation, detection, etc.
Introduced the use of average precision to evaluate object recognition.


LLC: Results

Results on PASCAL 2007:


LLC: Reflections

The LLC encoding technique takes a different approach to enriching the image representation.
It uses sparse codes, but these codes are sparse in that only local codewords can contribute to the encoding of features.
Global pooling is done using a max operation, which helps ensure global quasi-sparsity.
The resulting codes can be used with linear SVMs, which is a huge win for large datasets.
It beats other BOW/SPM approaches, and achieves results comparable to more complex state-of-the-art methods.


Advanced BOW: Fisher Vectors


FV: Clusters are Not Points

The main observation in Fisher Vector coding is that the quantization process is imprecise.
More precisely: clusters are distributions of points.


FV: Start with a Generative Model

Let $X = \{ x_t,\ t = 1 \ldots T \}$ be the descriptors extracted from an image.
Assume there is a generation process for $X$ modeled by a probability density function $u_\lambda$ with parameters $\lambda$.
$X$ can be characterized by the gradient:

$$G_\lambda^X = \frac{1}{T} \nabla_\lambda \log u_\lambda(X)$$

Why the gradient? Because it describes the contribution of each parameter to the generation process (also, the gradient of the data log likelihood is the Fisher Score).
Plus, the dimensionality depends only on the number of parameters in $\lambda$ and not on the number of patches in the image.


FV: Then define a kernel

A natural kernel to use for gradients of generative model likelihoods is the Fisher kernel:

$$K(X, Y) = {G^X_\lambda}' \, F_\lambda^{-1} \, G^Y_\lambda$$

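Here $F_\lambda$ is the Fisher Information Matrix:

$$F_\lambda = E_{x \sim u_\lambda}\left[\nabla_\lambda \log u_\lambda(x) \, \nabla_\lambda \log u_\lambda(x)'\right]$$

Since $F_\lambda$ is symmetric and positive definite, its inverse has a Cholesky decomposition:

$$F_\lambda^{-1} = L_\lambda' L_\lambda$$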

So, we can rewrite $K(X, Y)$ as a dot product between normalized vectors:

$$\mathcal{G}^X_\lambda = L_\lambda G^X_\lambda$$
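The equivalence between the kernel and a plain dot product can be checked numerically. The sketch below uses a random symmetric positive-definite matrix as a stand-in for the Fisher Information Matrix; all variable names are illustrative. Note that `np.linalg.cholesky` returns a lower-triangular factor `C` with `F_inv = C @ C.T`, so the factor satisfying $F^{-1} = L'L$ is its transpose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the Fisher Information Matrix: any symmetric positive-definite matrix.
A = rng.normal(size=(6, 6))
F = A @ A.T + 6 * np.eye(6)
F_inv = np.linalg.inv(F)

# numpy gives lower-triangular C with F_inv = C C'; take L = C' so that F_inv = L' L.
L = np.linalg.cholesky(F_inv).T

# Two stand-in gradient vectors G^X and G^Y.
g_x = rng.normal(size=6)
g_y = rng.normal(size=6)

k_direct = g_x @ F_inv @ g_y       # K(X, Y) computed from its definition
k_linear = (L @ g_x) @ (L @ g_y)   # dot product of the normalized vectors
```

The two values agree up to floating-point error, which is why a linear classifier on the normalized vectors is enough.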

FV: The Fisher Vector

This normalized gradient $\mathcal{G}^X_\lambda = L_\lambda G^X_\lambda$ is the Fisher Vector of image $X$.

Big win: learning a kernel classifier with the non-linear kernel $K$ is equivalent to learning a linear classifier on Fisher Vectors.

Implementation:

Assume the generative model is a mixture of Gaussians. And assume that all $x_t$ are generated independently:

$$G^X_\lambda = \frac{1}{T} \sum_{t=1}^{T} \nabla_\lambda \log u_\lambda(x_t)$$

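Compute gradients with respect to the mean and diagonal covariance of all mixture components. The final descriptor is the concatenation of:

$$\mathcal{G}^X_{\mu,i} = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i) \left(\frac{x_t - \mu_i}{\sigma_i}\right)$$

$$\mathcal{G}^X_{\sigma,i} = \frac{1}{T\sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i) \left[\frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1\right]$$

where $w_i$, $\mu_i$ and $\sigma_i$ are the weight, mean and standard deviation of Gaussian $i$, and $\gamma_t(i)$ is the soft assignment (posterior probability) of descriptor $x_t$ to Gaussian $i$.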
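The mean and variance gradients can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes the GMM parameters are already fitted, uses a $1/(T\sqrt{w_i})$ normalizer for the mean term and $1/(T\sqrt{2 w_i})$ for the variance term, and the function name `fisher_vector` and its argument layout are hypothetical.

```python
import numpy as np

def fisher_vector(X, w, mu, sigma):
    """Fisher Vector of descriptors X under a diagonal-covariance GMM.

    X: (T, D) descriptors; w: (K,) mixture weights;
    mu, sigma: (K, D) component means and standard deviations.
    Returns a vector of length 2 * K * D.
    """
    T = X.shape[0]
    # Whitened offsets (x_t - mu_i) / sigma_i, shape (T, K, D).
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    # log(w_i * N(x_t; mu_i, sigma_i)), dropping the constant shared by all components.
    log_p = np.log(w) - np.log(sigma).sum(axis=1) - 0.5 * (diff ** 2).sum(axis=2)
    log_p -= log_p.max(axis=1, keepdims=True)   # stabilize the exponentials
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)   # soft assignments gamma_t(i), shape (T, K)
    # Soft-assignment-weighted sums over descriptors, then per-component normalization.
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w)[:, None])
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])
```

Calling the function on one descriptor at a time and averaging the results gives the same vector as calling it on the whole set, because the encoding is an average of per-descriptor gradients.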

FV: What’s Going On

The Fisher Vector is a weighted average of gradients. These gradients are defined at every point in descriptor space. First compute the FV for each individual descriptor. Then average pool the vectors to compute the FV encoding for the image.

FV: Improvements

There are really only two improvements to the Fisher Vector proposed in this paper. Improvement 1: L2 normalize the Fisher Vectors before training an SVM (not surprising). Improvement 2: power normalization. Also known as the "signed square root" or the Hellinger kernel, it compensates for bursty features:

$$f(z) = \operatorname{sign}(z) \sqrt{|z|}$$
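Both normalizations are one-liners. The sketch below assumes the power normalization is applied element-wise first and the L2 normalization second; the function name `normalize_fv` and the `alpha` and `eps` parameters are illustrative, with `alpha = 0.5` giving the signed square root.

```python
import numpy as np

def normalize_fv(fv, alpha=0.5, eps=1e-12):
    """Power-normalize, then L2-normalize, a Fisher Vector."""
    fv = np.sign(fv) * np.abs(fv) ** alpha     # signed power; alpha = 0.5 is the signed square root
    return fv / max(np.linalg.norm(fv), eps)   # unit L2 norm (eps guards the all-zero vector)
```

The signed square root compresses large entries much more than small ones, which is how it dampens bursty features before the vectors are handed to a linear SVM.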

FV: Baseline Comparison

It has become standard practice to do a baseline comparison. In these experiments, you want to evaluate the contribution of your work.

FV: Us versus Them

The proof is in the pudding.

[Results: PASCAL 2007 and Caltech 256]

FV: Reflections

The Fisher Vector is an alternative coding method. Instead of quantizing descriptors, you represent each descriptor with a gradient. This gradient represents the relationship between a local descriptor and all clusters in a generative model. This encoding can significantly improve performance over the BOW framework. Bonus: everything is linear, and we can use efficient solvers, even for large-scale datasets. The Fisher Vector was the state-of-the-art in Bag-of-Features coding until 2012.

Detection: Deformable Part Models

Deformable Part Models

[OTHER PRESENTATIONS]

Discussion


Discuss
