
Object Recognition in Images and Video: Advanced Bag-of-Words

http://www.micc.unifi.it/bagdanov/obrec

Prof. Andrew D. Bagdanov

Dipartimento di Ingegneria dell’Informazione
Università degli Studi di Firenze
andrew.bagdanov AT unifi.it

27 April 2017


Outline

1 Comments

2 Overview

3 The Bag-of-Words Model

4 Interlude

5 Advanced BOW: Spatial Pyramids

6 Advanced BOW: Sparse Coding

7 Advanced BOW: Fisher Vectors

8 Detection: Deformable Part Models

9 Discussion


Comments


Comments: Final Exam

For the exam (your presentations), we can be flexible.
We’re all busy, and I recognize that.
For those in Florence: we can arrange to do your presentations at any time (preferably by mid-June).
For those outside of Florence: we can also do the final exam by Skype, if that’s easier.
I strongly desire to finish all exam presentations by mid-June.


Overview


Overview

In this lesson I will first pick up where I left off with an explanation of the basic Bag-of-Words (BOW) model.
Then, I will explain three extensions of this basic model (which is why the lecture is called Advanced Bag-of-Words).
And then I will cover an article on detection (which is only loosely connected with the Bag-of-Words).
Since you have all read the articles, I will cover the details briefly, then I will open the floor for discussion and questions.
Together, we will work to reach a deeper understanding of each contribution.


The Bag-of-Words Model


Three Magic Ingredients

Now we will shift our discussion to one of the first Big Breakthroughs in modern object recognition:

  Visual Categorization with Bags of Keypoints, Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray. In: European Conference on Computer Vision (ECCV), 2004.

These ideas were developed independently, in many places, at the same time.
This paper is one of the first, and in my opinion the simplest, explanations of the basic Bag-of-Words pipeline.
Again returning to our analogy with text retrieval, we now have a reasonably invariant way to describe local image structure.
However, we still don’t have a concept corresponding to words.
SIFT features are 128-dimensional vectors, which are not discrete enough to use in a TF*IDF model.


Feature Quantization

Key idea: use clustering to identify groups of SIFT points using a training set.
The cluster centers are used as a visual vocabulary – words in our model.
All SIFT descriptors extracted from training or test images are quantized to the closest visual word in our vocabulary.
We have gone from an infinite class of SIFT descriptors to a finite class of visual words.
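The quantization step can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline from the paper: it uses plain Lloyd's k-means, random 128-dimensional vectors stand in for SIFT descriptors, and the function names `build_vocabulary` and `quantize` are hypothetical.

```python
import numpy as np

def quantize(descriptors, centers):
    """Map each descriptor to the index of its nearest center (its visual word)."""
    # Squared Euclidean distances between all descriptors and centers, shape (N, k).
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_vocabulary(descriptors, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, D) array of cluster centers."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct training descriptors.
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)].copy()
    for _ in range(n_iter):
        words = quantize(descriptors, centers)
        for j in range(k):
            members = descriptors[words == j]
            if len(members) > 0:  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
    return centers

# Toy stand-in for SIFT: random 128-dimensional vectors (an assumption).
rng = np.random.default_rng(1)
train_descriptors = rng.random((500, 128))
vocabulary = build_vocabulary(train_descriptors, k=10)
words = quantize(train_descriptors, vocabulary)  # one integer word per descriptor
```

The same `quantize` call is applied to descriptors from test images, using the vocabulary learned on the training set.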


Feature Pooling

One last problem: the number of SIFT descriptors is variable: each image will yield a different number of points.
Also, the order of points (for comparison, for example) is crucial.
This problem makes it hard to apply standard machine learning techniques to our representation (e.g. SVM, naive Bayes, nearest neighbor, etc.).
The solution: like in text retrieval, use pooling to build a fixed-length descriptor of images that is invariant to descriptor order.
Our descriptor is a histogram of frequencies of visual word occurrences in the image.
To compare images we can now use: inner products (like TF*IDF), SVMs, and a vast array of tried and true classifiers.
This last point is most important: given a training set of images labeled with object categories, we can train classifiers to recognize objects in unseen test images.
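A small sketch of the pooling step, assuming the descriptors have already been quantized to integer word indices. The helper name `bow_histogram` is hypothetical. It shows that the pooled descriptor has a fixed length regardless of how many points an image yields, and that it does not depend on their order.

```python
import numpy as np

def bow_histogram(words, vocab_size):
    """Histogram of visual word frequencies, normalised to sum to 1."""
    counts = np.bincount(words, minlength=vocab_size).astype(float)
    return counts / max(counts.sum(), 1.0)

# Two "images" with different numbers of quantized descriptors.
words_a = np.array([0, 2, 2, 3, 2])
words_b = np.array([2, 3, 2, 0, 2, 2, 0, 3, 2, 2])

h_a = bow_histogram(words_a, vocab_size=4)
h_b = bow_histogram(words_b, vocab_size=4)
h_a_reversed = bow_histogram(words_a[::-1], vocab_size=4)  # same points, other order
```

Both histograms have length 4, so they can be compared with an inner product or fed to any standard classifier.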



This full pipeline is best explained graphically



Csurka et al. demonstrated the BOW approach on a dataset with 7 object categories.
They extract BOW descriptors from training images and train a multiclass, one-versus-all, linear SVM for each.


The Bag-of-Words

The punchline: the results on this challenging dataset are impressive.
The approach uses a small vocabulary of 1000 visual words (in text retrieval, 100K+ word dictionaries are common).
It also uses an extremely simple linear SVM for classification.


The Bag-of-Words

Added bonus: visual words are semantically meaningful (note, this example from Csurka et al. is highly cherry-picked):


The Bag-of-Words

Another bonus: the one-versus-all SVM architecture can recognize multiple object categories in images.


BOW: Discussion

Like the SIFT descriptor, it is hard to overstate the impact and influence the Bag-of-Words model has had on the development of modern object recognition.
It is a hallmark result, despite its extreme simplicity (in hindsight).
The paper of Csurka et al. was the first to demonstrate the plausibility of efficient, accurate, and robust object recognition over many categories with extreme visual variance.
Clearly, this simple BOW model was only the beginning.
The next ten years of computer vision were dominated by incremental improvements and refinements of this model.
In the next lecture we will head off in that direction with a survey of advanced Bag-of-Words models that came after.
Note: see the course website for the required reading for next week.


http://www.micc.unifi.it/bagdanov/obrec/

Interlude


Interlude: Ten Years of Progress

The Bag-of-Words was quickly adopted by the community as *the* method for object recognition.
There rapidly followed a series of many, many improvements over the basic BOW model.
In the following, we will look at some important examples.
With the adoption of the BOW model, and the growing interest in object recognition, several standard benchmark datasets were also established.
Many of these datasets were developed in the context of international competitions:

  PASCAL VOC: five years of competitions, many high-quality benchmark datasets.
  ImageNet: first version with 1000 object categories, over 1M images. Current version has 10M+ images.


Advanced BOW: Spatial Pyramids


BOW is an Orderless Image Representation

The next paper we will examine is:

  Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, S Lazebnik, C Schmid, J Ponce. In: Computer Vision and Pattern Recognition (CVPR), 2006.

The motivation behind this work is that the Bag-of-Words is an orderless image representation.
We can think of the BOW histogramming process as marginalizing spatial information away.
Assume we have encoded an image in quantized visual words – that is, each spatial location is represented by a single integer representing the cluster center the SIFT descriptor is closest to.



Assume we have K = 1000 visual words in our vocabulary, and let $I_q$ be the quantized image (i.e. each location is represented by an integer index).
Let $\delta_i$ be the (curried) Kronecker delta function:

$$\delta_i(j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

We can express the image as a one-hot discrete field of vectors, and the histogram as a sum over the field:

$$I_{\text{1-hot}}(x, y) = \left[ \delta_i(I_q(x, y)) \right]_{i=1}^{1000}$$

$$H(I_{\text{1-hot}}) = \sum_x \sum_y I_{\text{1-hot}}(x, y)$$
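The one-hot field and its sum can be checked numerically. The sketch below assumes a toy vocabulary of K = 5 words and a random 4×6 quantized image, and uses 0-based word indices rather than the 1-based indices of the formula.

```python
import numpy as np

K = 5  # toy vocabulary size (the lecture uses K = 1000)
rng = np.random.default_rng(0)
I_q = rng.integers(0, K, size=(4, 6))  # quantized image: one word index per location

# One-hot field: I_1hot[x, y, i] = delta_i(I_q[x, y]).
I_1hot = (I_q[:, :, None] == np.arange(K)[None, None, :]).astype(int)

# Histogram as a sum over the field: x and y are marginalised away.
H = I_1hot.sum(axis=(0, 1))
```

The result is the same vector that `np.bincount(I_q.ravel(), minlength=K)` produces, which makes the loss of spatial information explicit: any permutation of the locations gives the same `H`.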



Feature: we have a fixed-length representation of images.
Feature: this representation has strong invariance.
Bug: we lose all spatial coherence in this representation – it’s too invariant.


Spatial Pyramids: Impose Some Order

The main idea:

  revisit global non-invariant representations based on aggregating statistics of local features; and
  use kernel-based recognition that computes rough geometric correspondence on a global scale.

Once you sweep away the supercazzola (the gobbledygook) in the paper, the method is quite simple:

  repeatedly subdivide the image; and
  compute histograms of local features at increasingly fine resolutions.

The spatial pyramid technique is simple and extremely effective.
After its publication, it became ubiquitous in BOW pipelines.


Spatial Pyramids: A Technical Aside

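The Support Vector Machine (SVM) is the standard classifier for BOW.
The linear SVM objective is to find w (and b) that minimize:

$$\left[ \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\, 1 - y_i (w \cdot x_i + b)\right) \right] + \lambda \|w\|^2.$$

This can be rewritten as a constrained optimization problem:

$$\text{minimize } \frac{1}{n} \sum_{i=1}^{n} \zeta_i + \lambda \|w\|^2$$

$$\text{subject to } y_i (w \cdot x_i + b) \ge 1 - \zeta_i \text{ and } \zeta_i \ge 0, \text{ for all } i.$$

And the dual formulation:

$$\text{maximize } f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i (x_i \cdot x_j) y_j c_j,$$

$$\text{subject to } \sum_{i=1}^{n} c_i y_i = 0, \text{ and } 0 \le c_i \le \frac{1}{2n\lambda} \text{ for all } i.$$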
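To make the primal objective concrete, here is a minimal sketch that evaluates it and minimizes it with plain subgradient descent on toy data. The optimizer, step size, iteration count and the two Gaussian blobs are all illustrative assumptions; real BOW pipelines use dedicated SVM solvers.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Mean hinge loss plus lam * ||w||^2, as in the primal objective."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return hinge.mean() + lam * np.dot(w, w)

def svm_subgradient_step(w, b, X, y, lam, lr):
    """One subgradient descent step on the primal objective."""
    active = y * (X @ w + b) < 1.0  # points that violate the margin
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / len(y) + 2.0 * lam * w
    grad_b = -y[active].sum() / len(y)
    return w - lr * grad_w, b - lr * grad_b

# Toy data: two well-separated 2-D blobs with labels +1 / -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

w, b = np.zeros(2), 0.0
initial = svm_objective(w, b, X, y, lam=0.01)
for _ in range(200):
    w, b = svm_subgradient_step(w, b, X, y, lam=0.01, lr=0.1)
final = svm_objective(w, b, X, y, lam=0.01)
accuracy = np.mean(np.sign(X @ w + b) == y)
```

At w = 0, b = 0 every hinge term equals 1, so the initial objective is exactly 1; training drives it down as the margin constraints become satisfied.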


Spatial Pyramids: A Technical Aside

The $c_i$ in the dual are formulated so that we can write the classifier vector as:

$$\tilde{w} = \sum_{i=1}^{n} c_i y_i x_i.$$

Kernel trick: embed our feature vectors $x_i$ in a Hilbert space via a map $\phi(x_i)$:

$$\text{maximize } f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i \, k(x_i, x_j) \, y_j c_j$$

$$\text{subject to } \sum_{i=1}^{n} c_i y_i = 0, \text{ and } 0 \le c_i \le \frac{1}{2n\lambda} \text{ for all } i,$$

where $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$.
So, we never have to actually embed our features; we just need to compute the kernel matrix of all pairwise inner products.
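A quick numerical check of the kernel trick. As an illustrative assumption (this kernel is not discussed in the lecture), take the homogeneous degree-2 polynomial kernel k(x, y) = (x · y)², whose explicit embedding φ(x) is the flattened outer product of x with itself. The kernel matrix computed without ever embedding matches the matrix of inner products in the embedded space.

```python
import numpy as np

def phi(x):
    """Explicit embedding: all pairwise products x_a * x_b, flattened."""
    return np.outer(x, x).ravel()

def k_poly2(x, y):
    """Degree-2 polynomial kernel, computed without any embedding."""
    return np.dot(x, y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # six toy feature vectors in R^3

# Kernel matrix from the kernel function alone.
gram_kernel = np.array([[k_poly2(a, b) for b in X] for a in X])

# Same matrix via explicit embedding into R^9 and ordinary inner products.
Phi = np.array([phi(a) for a in X])
gram_explicit = Phi @ Phi.T
```

Here the embedding is only 9-dimensional, but for kernels such as the Gaussian RBF the embedded space is infinite-dimensional and the kernel matrix is the only practical route.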


Spatial Pyramids: A Technical Aside

Some popular kernels:

Linear:

$$k(x, y) = x \cdot y$$

Gaussian RBF:

$$k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$$

χ²:

$$k(x, y) = \chi^2(x, y) = \frac{1}{2} \sum_{i=1}^{d} \frac{(x_i - y_i)^2}{x_i + y_i}$$

Exponential χ²:

$$k(x, y) = \exp\left( -\beta \, \chi^2(x, y) \right)$$
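The kernels listed above are short enough to write out directly. The sketch below is for non-negative histogram vectors. It treats the χ² expression as a dissimilarity (it is 0 for identical histograms) and the exponential form as the similarity built from it. The default values of `sigma` and `beta`, and the choice to skip bins that are empty in both histograms, are assumptions.

```python
import numpy as np

def linear_kernel(x, y):
    return float(np.dot(x, y))

def rbf_kernel(x, y, sigma=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def chi2(x, y):
    """The chi-squared expression for non-negative histograms."""
    num = (x - y) ** 2
    den = x + y
    mask = den > 0  # bins empty in both histograms contribute nothing
    return float(0.5 * np.sum(num[mask] / den[mask]))

def exp_chi2_kernel(x, y, beta=1.0):
    return float(np.exp(-beta * chi2(x, y)))

x = np.array([0.5, 0.5, 0.0, 0.0])
y = np.array([0.25, 0.25, 0.5, 0.0])

values = {
    "linear": linear_kernel(x, y),
    "rbf": rbf_kernel(x, y),
    "chi2": chi2(x, y),
    "exp_chi2": exp_chi2_kernel(x, y),
}
```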

Spatial Pyramids: Structured Matching

The inspiration for Spatial Pyramids comes from a technique called pyramid matching for measuring similarity between d-dimensional point sets X and Y.
It constructs a sequence of grids at resolutions 0, …, L such that the grid at level $\ell$ has $2^\ell$ cells along each of the d dimensions.
Thus, there is a total of $D = 2^{d\ell}$ cells at level $\ell$.
Finally, let $H_X^\ell$ and $H_Y^\ell$ denote the histograms of X and Y at level $\ell$.
So: $H_X^\ell(i)$ and $H_Y^\ell(i)$ are the numbers of points from X and Y that fall into the i-th cell of the grid.



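The main tool in defining Spatial Pyramids is the Histogram Intersection Kernel:

$$I(H_X^\ell, H_Y^\ell) = \sum_{i=1}^{D} \min\left( H_X^\ell(i),\, H_Y^\ell(i) \right)$$

What we’re doing here is simply counting the number of points that hit the same cells in the dyadic spatial decomposition.
Write $I^\ell$ for the intersection at level $\ell$. If we notice that the matches found at level $\ell$ also include all matches at finer levels (so the new matches at level $\ell$ number $I^\ell - I^{\ell+1}$), we can write the pyramid match kernel:

$$\kappa^L(X, Y) = I^L + \sum_{\ell=0}^{L-1} \frac{1}{2^{L-\ell}} \left( I^\ell - I^{\ell+1} \right) = \frac{1}{2^L} I^0 + \sum_{\ell=1}^{L} \frac{1}{2^{L-\ell+1}} I^\ell$$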
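A minimal sketch of the pyramid match kernel for point sets in the unit cube [0, 1)^d. It evaluates both forms of the kernel: the finest-level matches plus down-weighted new matches at coarser levels, and the same sum regrouped by level. The function names and the restriction to the unit cube are assumptions made for illustration.

```python
import numpy as np

def grid_histogram(points, level):
    """Histogram of points in [0, 1)^d on a grid with 2**level cells per dimension."""
    cells = 2 ** level
    d = points.shape[1]
    idx = np.minimum((points * cells).astype(int), cells - 1)  # per-dimension cell index
    flat = np.ravel_multi_index(tuple(idx.T), (cells,) * d)
    return np.bincount(flat, minlength=cells ** d)

def intersection(h_x, h_y):
    """Histogram intersection: number of points matched cell by cell."""
    return np.minimum(h_x, h_y).sum()

def pyramid_match(X, Y, L):
    """Return the pyramid match kernel computed in both equivalent forms."""
    I = [intersection(grid_histogram(X, l), grid_histogram(Y, l)) for l in range(L + 1)]
    # First form: finest-level matches plus down-weighted new matches at coarser levels.
    first = I[L] + sum((I[l] - I[l + 1]) / 2 ** (L - l) for l in range(L))
    # Second form: the same sum regrouped by level.
    second = I[0] / 2 ** L + sum(I[l] / 2 ** (L - l + 1) for l in range(1, L + 1))
    return first, second

rng = np.random.default_rng(0)
X = rng.random((50, 2))
Y = rng.random((40, 2))
first, second = pyramid_match(X, Y, L=3)
self_match, _ = pyramid_match(X, X, L=3)
```

Matching a set against itself gives every point a match at the finest level, so the kernel value equals the number of points.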



To extend this idea to the BOW model, we can quantize all salient locations in the image (SIFT descriptors).
Let $X_m$ and $Y_m$ represent the spatial locations of all visual words of type m in images X and Y, respectively.
Then, we can write the Spatial Pyramid Match Kernel as:

$$K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m)$$

Note that $\kappa^L$ is just a weighted sum of histogram intersections.
Note also that, for positive numbers, c min(a, b) = min(ca, cb).
Thus, we can write the above as a single histogram intersection of concatenations of appropriately weighted channel histograms at all levels.
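The concatenation described above can be sketched as follows, assuming word locations are normalized to [0, 1)² and using level weights 1/2^L for level 0 and 1/2^(L−ℓ+1) for level ℓ ≥ 1. The function name `spatial_pyramid` and the toy data are hypothetical.

```python
import numpy as np

def spatial_pyramid(words, xy, vocab_size, L):
    """Concatenate weighted per-cell word histograms for levels 0..L."""
    feats = []
    for level in range(L + 1):
        cells = 2 ** level
        weight = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
        # Grid cell of each location at this level.
        cx = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
        # Joint (cell, word) bin index, then one histogram over all bins.
        bin_index = (cy * cells + cx) * vocab_size + words
        hist = np.bincount(bin_index, minlength=cells * cells * vocab_size)
        feats.append(weight * hist)
    return np.concatenate(feats)

rng = np.random.default_rng(0)
N, M, L = 200, 8, 2
feat_x = spatial_pyramid(rng.integers(0, M, size=N), rng.random((N, 2)), M, L)
feat_y = spatial_pyramid(rng.integers(0, M, size=N), rng.random((N, 2)), M, L)

# Spatial pyramid match kernel as one histogram intersection of the long vectors.
similarity = np.minimum(feat_x, feat_y).sum()
```

With M = 8 words and L = 2 the feature has 8 × (1 + 4 + 16) = 168 dimensions. The level weights sum to 1, so an image matched against itself scores N.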


Spatial Pyramids: What’s Going On

This is the diagram from the paper:

But this is how people usually think of the Spatial Pyramid:


Spatial Pyramids: Experiments

New Trend: More Data is Better.
Scenes-15: Fifteen categories with strong inter-class variability, intra-class similarity.
200–300 images per category.



New Trend: More Data is Better.
Caltech-101: 101 categories with strong inter-class variability, intra-class similarity.
31–800 images per category.



Results are impressive and interesting on Scenes-15:

And on Caltech-101:


Spatial Pyramids: Reflections

Despite the simplicity of the method, it consistently achieves an improvement over an orderless image representation.
This also despite the fact that it works not by constructing explicit object models, but by using global cues as indirect evidence about the presence of an object.
This is not a trivial accomplishment, given that a well-designed bag-of-features method can outperform more sophisticated approaches based on parts and relations.
As I mentioned before, the Spatial Pyramid technique became a standard trick to significantly and consistently improve results for BOW models.
In the next two papers, we will look at more sophisticated coding techniques.


Advanced BOW: Sparse Coding


Locality-constrained Linear Coding

The original BOW model uses global pooling of descriptors, hence a single, global image representation.
Spatial Pyramids add some spatial structure to the image representation.
Can we improve the way features themselves are coded before pooling into the final image representation?
We will look at one approach to better feature coding in our next paper:

  Locality-constrained linear coding for image classification, J Wang, J Yang, K Yu, F Lv, T Huang, Y Gong. In: Computer Vision and Pattern Recognition (CVPR), 2010.


Locality-constrained Linear Coding

Some observations:

  BOW + SPM works really well.
  But: it requires non-linear SVMs to achieve state-of-the-art performance.
  As datasets grow larger, computing and storing the kernel matrix for solving the dual SVM formulation is onerous.
  We hope for a better feature encoding that allows us to achieve state-of-the-art results, but with linear SVMs.

The key insight is to use the codebook (visual vocabulary) more effectively.
This is done through sparse coding to encode local features.
Followed by max pooling (as opposed to average pooling) to arrive at the global image description.


LLC: Basic Definitions

Let $X = [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N}$ be the D-dimensional local descriptors extracted from an image.
Given a set of codewords $B = [b_1, \ldots, b_M] \in \mathbb{R}^{D \times M}$, we want to encode each $x_i$ into an M-dimensional code.
Vector Quantization is used in the BOW, resulting in a set of 1-hot codes $C = [c_1, \ldots, c_N] \in \mathbb{R}^{M \times N}$:

Sparse coding can be used instead (reference 22 in the paper):

This leads to lower quantization loss by using more elements of the codebook to encode local features.
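For reference, a sketch of the two coding objectives in the notation above, written in the form in which they are usually stated for these methods. The symbol λ is the sparsity weight here (not the SVM regularizer used earlier); check the exact constraints against the paper.

```latex
% Vector quantization: each code c_i has exactly one non-zero entry, equal to 1.
\[
  \min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2
  \quad \text{s.t. } \|c_i\|_{0} = 1,\ \|c_i\|_{1} = 1,\ c_i \ge 0 \ \ \forall i
\]

% Sparse coding: the hard one-hot constraint is relaxed to an L1 penalty,
% so several codewords can share the reconstruction of x_i.
\[
  \min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \| c_i \|_{1}
\]
```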


LLC: Paying a Price

LLC proposes to use an additional locality (in feature space) constraint:

Where locality is smoothly modeled with an exponential:

So: code features, but pay a cost for using codewords far from the descriptor we are encoding.
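A sketch of the LLC objective and its locality weights, in the same notation as the codebook B and codes C above. Here ⊙ is the element-wise product, σ is the assumed name of the bandwidth that controls how fast the penalty grows with distance, and dist(x_i, B) collects the distances from x_i to every codeword. The exact form should be checked against the paper.

```latex
% LLC: reconstruction error plus a penalty that grows for codewords far from x_i.
\[
  \min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \| d_i \odot c_i \|^2
  \quad \text{s.t. } \mathbf{1}^{\top} c_i = 1 \ \ \forall i
\]

% Locality adaptor: one weight per codeword, exponential in its distance to x_i.
\[
  d_i = \exp\!\left( \frac{\operatorname{dist}(x_i, B)}{\sigma} \right),
  \qquad
  \operatorname{dist}(x_i, B) = \left[ \operatorname{dist}(x_i, b_1), \ldots, \operatorname{dist}(x_i, b_M) \right]^{\top}
\]
```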


LLC: What’s Going On

Here is a comparison:


LLC: What’s Going On

The full pipeline:


LLC: Implementation

In practice, solving the constrained optimization problem for every descriptor is too costly.
Solution: select the k nearest codewords in feature space, and solve a constrained least-squares problem using only those k codewords.
Codebook optimization: maybe k-means doesn’t yield an optimal codebook for LLC coding:

  Section 3 gives an iterative algorithm for building a codebook.
  In practice, everyone uses k-means.

Classification: the LLC embedding is rich enough to allow it to perform well with linear SVMs as classifiers.
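The approximate encoding can be sketched as follows: pick the k nearest codewords, then solve a small least-squares problem whose coefficients are constrained to sum to 1. Note that the codebook is stored with one codeword per row, i.e. shape (M, D), the transpose of the convention in the definitions. The small ridge term `beta` is an assumption added for numerical stability, and the function name is hypothetical.

```python
import numpy as np

def llc_code(x, B, k=5, beta=1e-4):
    """Approximate LLC code of descriptor x (D,) over codebook B (M, D)."""
    M = B.shape[0]
    # The k nearest codewords in feature space.
    d2 = ((B - x) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:k]
    # Constrained least squares: minimise ||x - c^T B_k||^2 subject to sum(c) = 1.
    Z = B[nearest] - x                      # codewords shifted to the descriptor
    G = Z @ Z.T                             # local Gram matrix, shape (k, k)
    G = G + beta * np.trace(G) * np.eye(k)  # small ridge term for stability
    c_local = np.linalg.solve(G, np.ones(k))
    c_local /= c_local.sum()                # enforce the sum-to-one constraint
    code = np.zeros(M)
    code[nearest] = c_local                 # all other codewords get weight 0
    return code

rng = np.random.default_rng(0)
B = rng.normal(size=(50, 16))   # toy codebook: 50 codewords in R^16
x = rng.normal(size=16)         # one toy descriptor
code = llc_code(x, B, k=5)

err_llc = np.sum((x - code @ B) ** 2)         # LLC reconstruction error
err_vq = np.min(((B - x) ** 2).sum(axis=1))   # error of hard vector quantization
```

Because the one-hot code on the nearest codeword is a feasible solution of the same problem, the LLC reconstruction error is no worse than the vector quantization error (up to the tiny ridge term).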


LLC: Results

Extract dense HOG descriptors (8-pixel stride), at three scales.
Use k = 5 for approximate LLC encoding.
They also use a Spatial Pyramid (but don’t explain the configuration).
The authors consider two pooling methods to arrive at the final image representation:

  sum-pooling: just sum all codes (this is the BOW pooling).
  max-pooling: take the maximum coefficient for each codeword.

They then report results with max-pooling and L2 normalization (since they use linear SVMs).

Use a linear SVM to train one-versus-all classifiers for each category.
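The two pooling methods differ by a single reduction. A toy sketch, with three made-up codes over a three-word codebook:

```python
import numpy as np

# One code per row: three local descriptors, three codewords (toy values).
codes = np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.4, 0.6],
                  [0.1, 0.9, 0.0]])

sum_pooled = codes.sum(axis=0)  # BOW-style pooling: accumulate all codes
max_pooled = codes.max(axis=0)  # keep the strongest response per codeword

# L2 normalization of the pooled vector before the linear SVM.
normalized = max_pooled / np.linalg.norm(max_pooled)
```

Here `sum_pooled` is [0.8, 1.6, 0.6] and `max_pooled` is [0.7, 0.9, 0.6]: the frequently used second codeword dominates the sum but not the max.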


LLC: Results

On Caltech-101:

And on Caltech-256:


LLC: Results

A new dataset (which became the benchmark for object recognition).
The PASCAL Visual Object Classes (VOC) competitions defined the state-of-the-art for five years.
20 object categories, high-quality annotations, recognition, segmentation, detection, etc.
Introduced the use of average precision to evaluate object recognition.


LLC: Results

Results on PASCAL 2007:


LLC: Reflections

The LLC encoding technique takes a different approach to enriching the image representation.
It uses sparse codes, but these codes are sparse in that only local codewords can contribute to the encoding of features.
Global pooling is done using a max operation, which helps ensure global quasi-sparsity.
The resulting codes can be used with linear SVMs, which is a huge win for large datasets.
Beats other BOW/SPM approaches, and achieves results comparable to more complex methods at the state-of-the-art.


Advanced BOW: Fisher Vectors


FV: Clusters are Not Points

The main observation in Fisher Vector coding is that the quantization process is imprecise.
More precisely: clusters are distributions of points.


FV: Start with a Generative Model

Let $X = \{ x_t,\ t = 1 \ldots T \}$ be the descriptors extracted from an image.
Assume there is a generation process for X modeled by a probability density function $u_\lambda$ with parameters $\lambda$.
X can be characterized by the gradient:

$$G_\lambda^X = \frac{1}{T} \nabla_\lambda \log u_\lambda(X)$$

Why the gradient? Because it describes the contribution of each parameter to the generation process (also, the gradient of the data log likelihood is the Fisher Score).
Plus, the dimensionality depends only on the number of parameters in $\lambda$ and not on the number of patches in the image.


FV: Then define a kernel

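A natural kernel to use for gradients of generative model likelihoods is the Fisher kernel:

$$K(X, Y) = {G_\lambda^X}' \, F_\lambda^{-1} \, G_\lambda^Y$$

where $F_\lambda$ is the Fisher Information Matrix:

$$F_\lambda = E_{x \sim u_\lambda}\left[ \nabla_\lambda \log u_\lambda(x) \, \nabla_\lambda \log u_\lambda(x)' \right]$$

Since $F_\lambda^{-1}$ is symmetric and positive definite, it has a Cholesky decomposition:

$$F_\lambda^{-1} = L_\lambda' L_\lambda$$

So, we can rewrite K(X, Y) as a dot-product between normalized vectors:

$$\mathcal{G}_\lambda^X = L_\lambda G_\lambda^X$$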


FV: The Fisher Vector

This normalized gradient $\mathcal{G}_\lambda^X = L_\lambda G_\lambda^X$ is the Fisher Vector of image X.

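Big win: learning a kernel classifier with the non-linear kernel K is equivalent to learning a linear classifier on Fisher Vectors.
Implementation:

  Assume the generative model is a mixture of Gaussians.
  And assume that all $x_t$ are generated independently:

$$G_\lambda^X = \frac{1}{T} \sum_{t=1}^{T} \nabla_\lambda \log u_\lambda(x_t)$$

  Compute gradients with respect to the mean and diagonal covariance of all mixture components.
  The final descriptor is the concatenation, over all components i, of:

$$\mathcal{G}_{\mu,i}^X = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i) \left( \frac{x_t - \mu_i}{\sigma_i} \right)$$

$$\mathcal{G}_{\sigma,i}^X = \frac{1}{T\sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right]$$

where $w_i$, $\mu_i$ and $\sigma_i$ are the weight, mean and standard deviation of mixture component i, and $\gamma_t(i)$ is the soft assignment (posterior probability) of descriptor $x_t$ to component i.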
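A compact sketch of the Fisher Vector for a diagonal-covariance Gaussian mixture whose parameters are assumed to be already fitted. It uses the normalizers 1/(T√w_i) for the mean part and 1/(T√(2w_i)) for the variance part, as in the usual statement of the method. The function name and the toy mixture are hypothetical.

```python
import numpy as np

def fisher_vector(X, w, mu, sigma):
    """X: (T, D) descriptors; w: (K,) weights; mu, sigma: (K, D) means and std devs."""
    T, D = X.shape
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]  # (T, K, D)
    # log of w_i * N(x_t; mu_i, diag(sigma_i^2)), shape (T, K).
    log_p = (np.log(w)[None, :]
             - 0.5 * (diff ** 2).sum(axis=2)
             - np.log(sigma).sum(axis=1)[None, :]
             - 0.5 * D * np.log(2.0 * np.pi))
    log_p -= log_p.max(axis=1, keepdims=True)  # stabilise before exponentiating
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)  # soft assignments gamma_t(i)
    # Gradients with respect to the means and standard deviations.
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    g_sigma = ((gamma[:, :, None] * (diff ** 2 - 1.0)).sum(axis=0)
               / (T * np.sqrt(2.0 * w)[:, None]))
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                   # toy descriptors
w = np.array([0.4, 0.6])                        # toy mixture with K = 2
mu = rng.normal(size=(2, 4))
sigma = np.ones((2, 4))
fv = fisher_vector(X, w, mu, sigma)             # length 2 * K * D = 16

# Sanity check: one Gaussian fitted by maximum likelihood gives a zero gradient.
fv_ml = fisher_vector(X, np.array([1.0]), X.mean(axis=0)[None], X.std(axis=0)[None])
```

The descriptor length is 2KD regardless of T. The second call illustrates what the gradient measures: at the maximum likelihood parameters the data suggest no change to the model, so the Fisher Vector vanishes.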


FV: What’s Going On

The Fisher Vector is a weighted average of gradients.
These gradients are defined at every point in descriptor space.
First compute the FV for each individual descriptor.
Then average pool the vectors to compute the FV encoding for the image.


FV: Improvements

There are really only two improvements to the Fisher Vector proposed in this paper.
Improvement 1: L2 normalize the Fisher Vectors before training an SVM (not surprising).
Improvement 2: power normalization. Also known as "signed square-root" or the Hellinger kernel, it compensates for bursty features:
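Both normalizations are one-liners. The sketch below assumes the signed power form sign(z)|z|^α with α = 0.5 (the "signed square-root") applied element-wise, followed by L2 normalization. The ordering and the toy vector are assumptions for illustration.

```python
import numpy as np

def power_normalize(v, alpha=0.5):
    """Signed power normalization; alpha = 0.5 gives the signed square-root."""
    return np.sign(v) * np.abs(v) ** alpha

def l2_normalize(v):
    """Scale to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

fv = np.array([4.0, -9.0, 0.0, 0.25])  # toy Fisher Vector with one large entry
pn = power_normalize(fv)               # large magnitudes are compressed
final = l2_normalize(pn)
```

The signed square-root maps [4, −9, 0, 0.25] to [2, −3, 0, 0.5]: signs are preserved while the gap between large and small entries shrinks, which is what dampens bursty features.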


FV: Baseline Comparison

It has become standard practice to do a baseline comparison.
In these experiments, you want to evaluate the contribution of your work.


FV: Us versus Them

The proof is in the pudding.
PASCAL 2007:

Caltech 256:


FV: Reflections

The Fisher Vector is an alternative coding method.
Instead of quantizing descriptors, you represent each descriptor with a gradient.
This gradient represents the relationship between a local descriptor and all clusters in a generative model.
This encoding can significantly improve performance over the BOW framework.
Bonus: everything is linear, and we can use efficient solvers – even for large-scale datasets.
The Fisher Vector was the state-of-the-art in Bag-of-Features coding until 2012.


Detection: Deformable Part Models


Deformable Part Models

[OTHER PRESENTATIONS]


Discussion


Discussion

Discuss

