Object Recognition in Images and Video: Advanced Bag-of-Words
http://www.micc.unifi.it/bagdanov/obrec
Prof. Andrew D. Bagdanov
Dipartimento di Ingegneria dell’Informazione
Università degli Studi di Firenze
andrew.bagdanov AT unifi.it
27 April 2017
Prof. Andrew D. Bagdanov (DINFO), Object Recognition in Images and Video: Advanced Bag-of-Words, 27 April 2017
Outline
1 Comments
2 Overview
3 The Bag-of-Words Model
4 Interlude
5 Advanced BOW: Spatial Pyramids
6 Advanced BOW: Sparse Coding
7 Advanced BOW: Fisher Vectors
8 Detection: Deformable Part Models
9 Discussion
Comments
Comments: Final Exam
For the exam (your presentations), we can be flexible. We’re all busy, and I recognize that.
For those in Florence: we can arrange to do your presentations at any time (preferably by mid-June).
For those outside of Florence: we can also do the final exam by Skype, if that’s easier.
I would strongly prefer to finish all exam presentations by mid-June.
Overview
Overview
In this lesson I will first pick up where I left off, with an explanation of the basic Bag-of-Words (BOW) model.
Then, I will explain three extensions of this basic model (which is why the lecture is called Advanced Bag-of-Words).
Then I will cover an article on detection (which is only loosely connected with Bag-of-Words).
Since you have all read the articles, I will cover the details briefly, then open the floor for discussion and questions.
Together, we will work to reach a deeper understanding of each contribution.
The Bag-of-Words Model
Three Magic Ingredients
Now we will shift our discussion to one of the first big breakthroughs in modern object recognition:
Visual Categorization with Bags of Keypoints, Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray. In: European Conference on Computer Vision (ECCV), 2004.
These ideas were developed independently, in many places, at the same time.
This paper is one of the first, and in my opinion the simplest, explanation of the basic Bag-of-Words pipeline.
Returning again to our analogy with text retrieval: we now have a reasonably invariant way to describe local image structure.
However, we still don’t have a concept corresponding to words.
SIFT features are 128-dimensional vectors, which are not discrete enough to use in a TF*IDF model.
Feature Quantization
Key idea: use clustering on a training set to identify groups of SIFT points.
The cluster centers are used as a visual vocabulary – the words in our model.
All SIFT descriptors extracted from training or test images are quantized to the closest visual word in our vocabulary.
We have gone from an infinite class of SIFT descriptors to a finite class of visual words.
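As a rough sketch of this step (plain NumPy on toy data; the function names and the naive Lloyd iteration are mine, not from the paper):

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Toy k-means: cluster training descriptors; the centers become visual words."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center (squared Euclidean distance)
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def quantize(descriptors, centers):
    """Map each descriptor to the index of its closest visual word."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)
```

In practice one would run an optimized k-means implementation on 128-dimensional SIFT descriptors; the toy version above just makes the two steps explicit: learn centers, then snap every descriptor to its nearest word.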
Feature Pooling
One last problem: the number of SIFT descriptors is variable: each image will yield a different number of points.
Also, the ordering of the points is arbitrary, which makes direct comparison difficult.
This makes it hard to apply standard machine learning techniques to our representation (e.g. SVMs, naive Bayes, nearest neighbor, etc.).
The solution: as in text retrieval, use pooling to build a fixed-length descriptor of images that is invariant to descriptor order.
Our descriptor is a histogram of the frequencies of visual-word occurrences in the image.
To compare images we can now use inner products (as in TF*IDF), SVMs, and a vast array of tried-and-true classifiers.
This last point is the most important: given a training set of images labeled with object categories, we can train classifiers to recognize objects in unseen test images.
The Bag-of-Words Model
The full pipeline is best explained graphically:
The Bag-of-Words Model
Csurka et al. demonstrated the BOW approach on a dataset with 7 object categories.
They extract BOW descriptors from training images and train a multiclass, one-versus-all linear SVM (one per category).
The Bag-of-Words
The punchline: the results on this challenging dataset are impressive.
The approach uses a small vocabulary of 1000 visual words (in text retrieval, dictionaries of 100K+ words are common).
It also uses an extremely simple linear SVM for classification.
The Bag-of-Words
Added bonus: visual words are semantically meaningful (note: this example from Csurka et al. is highly cherry-picked):
The Bag-of-Words
Another bonus: the one-versus-all SVM architecture can recognize multiple object categories in images.
BOW: Discussion
Discussion
Like the SIFT descriptor, it is hard to overstate the impact and influence the Bag-of-Words model has had on the development of modern object recognition.
It is a hallmark result, despite its extreme simplicity (in hindsight).
The paper of Csurka et al. was the first to demonstrate the plausibility of efficient, accurate, and robust object recognition over many categories with extreme visual variance.
Clearly, this simple BOW model was only the beginning.
The next ten years of computer vision were dominated by incremental improvements and refinements of this model.
In the next lecture we will head off in that direction with a survey of the advanced Bag-of-Words models that came after.
Note: see the course website for the required reading for next week.
Interlude
Interlude: Ten Years of Progress
The Bag-of-Words was quickly adopted by the community as *the* method for object recognition.
There rapidly followed a series of many, many improvements over the basic BOW model.
In the following, we will look at some important examples.
With the adoption of the BOW model, and the growing interest in object recognition, several standard benchmark datasets were also established.
Many of these datasets were developed in the context of international competitions:
PASCAL VOC: five years of competitions, many high-quality benchmark datasets.
ImageNet: the first version had 1000 object categories and over 1M images. The current version has 10M+ images.
Advanced BOW: Spatial Pyramids
BOW is an Orderless Image Representation
The next paper we will examine is:
Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, S. Lazebnik, C. Schmid, J. Ponce. In: Computer Vision and Pattern Recognition (CVPR), 2006.
The motivation behind this work is that the Bag-of-Words is an orderless image representation.
We can think of the BOW histogramming process as marginalizing spatial information away.
Assume we have encoded an image in quantized visual words – that is, each spatial location is represented by a single integer identifying the cluster center its SIFT descriptor is closest to.
BOW is an Orderless Image Representation
Assume we have K = 1000 visual words in our vocabulary, and let I_q be the quantized image (i.e. each location is represented by an integer index).
Let \delta_i be the (curried) Kronecker delta function:

\delta_i(j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}

We can express the image as a discrete field of one-hot vectors, and the histogram as a sum over the field:

I_{\text{1-hot}}(x, y) = \left[ \delta_i(I_q(x, y)) \right]_{i=1}^{1000}

H(I_{\text{1-hot}}) = \sum_x \sum_y I_{\text{1-hot}}(x, y)
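The two equations above can be checked mechanically; a minimal NumPy sketch (toy 8-word vocabulary instead of 1000; variable names are mine):

```python
import numpy as np

K = 8                                  # toy vocabulary size (1000 in the slides)
rng = np.random.default_rng(0)
Iq = rng.integers(0, K, size=(4, 6))   # quantized image: one word index per location

# one-hot field: row j of eye(K) is [delta_i(j)] for i = 1..K,
# indexed here by the visual word at each location (x, y)
I_onehot = np.eye(K, dtype=int)[Iq]    # shape (4, 6, K)

# the histogram is a sum over the spatial field
H = I_onehot.sum(axis=(0, 1))
```

Summing the one-hot field gives exactly the visual-word histogram (a `bincount` of the word indices), which is the point of the marginalization view.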
BOW is an Orderless Image Representation
Feature: we have a fixed-length representation of images.
Feature: this representation has strong invariance.
Bug: we lose all spatial coherence in this representation (it’s too invariant).
Spatial Pyramids: Impose Some Order
The main idea:
revisit global non-invariant representations based on aggregating statistics of local features; and
use kernel-based recognition that computes rough geometric correspondence on a global scale.
Once you sweep away the verbiage in the paper, the method is quite simple:
repeatedly subdivide the image; and
compute histograms of local features at increasingly fine resolutions.
The spatial pyramid technique is simple and extremely effective. After its publication, it became ubiquitous in nearly all BOW pipelines.
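The subdivide-and-histogram recipe can be sketched as follows (plain NumPy on a quantized image; the function name and the equal-split slicing are my choices):

```python
import numpy as np

def spatial_pyramid(Iq, K, L=2):
    """Concatenated visual-word histograms over 2^l x 2^l grids, for l = 0..L."""
    h, w = Iq.shape
    feats = []
    for l in range(L + 1):
        n = 2 ** l                       # number of cells per dimension at level l
        for r in range(n):
            for c in range(n):
                cell = Iq[r * h // n:(r + 1) * h // n,
                          c * w // n:(c + 1) * w // n]
                feats.append(np.bincount(cell.ravel(), minlength=K))
    return np.concatenate(feats)
```

For L = 2 this yields (1 + 4 + 16) · K = 21K dimensions; in the full method the per-level histograms are also weighted before concatenation, as discussed later in the lecture.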
Spatial Pyramids: A Technical Aside
The Support Vector Machine (SVM) is the standard classifier for BOW.
The linear SVM objective is to find w minimizing:

\left[ \frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i (w \cdot x_i + b)\right) \right] + \lambda \|w\|^2

This can be rewritten as a constrained optimization problem:

\text{minimize } \frac{1}{n} \sum_{i=1}^{n} \zeta_i + \lambda \|w\|^2
\text{subject to } y_i (w \cdot x_i + b) \geq 1 - \zeta_i \text{ and } \zeta_i \geq 0, \text{ for all } i.

And the dual formulation:

\text{maximize } f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i (x_i \cdot x_j) y_j c_j,
\text{subject to } \sum_{i=1}^{n} c_i y_i = 0, \text{ and } 0 \leq c_i \leq \frac{1}{2n\lambda} \text{ for all } i.
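A minimal sketch of minimizing the primal objective above by subgradient descent (plain NumPy, toy data; real BOW pipelines used dedicated SVM solvers, and the fixed step size here is my simplification):

```python
import numpy as np

def linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Subgradient descent on (1/n) sum_i max(0, 1 - y_i (w.x_i + b)) + lam ||w||^2.
    Labels y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                               # margin violators
        # subgradient of the hinge term plus gradient of the regularizer
        gw = 2 * lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        gb = -y[viol].sum() / n
        w -= lr * gw
        b -= lr * gb
    return w, b
```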
Spatial Pyramids: A Technical Aside
The c_i in the dual are formulated so that we can write the classifier vector as:

\tilde{w} = \sum_{i=1}^{n} c_i y_i x_i

Kernel trick: embed our feature vectors x_i in a Hilbert space via \phi(x_i):

\text{maximize } f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i \, k(x_i, x_j) \, y_j c_j
\text{subject to } \sum_{i=1}^{n} c_i y_i = 0, \text{ and } 0 \leq c_i \leq \frac{1}{2n\lambda} \text{ for all } i,

where k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j).
So, we never have to actually embed our features; we just need to compute the kernel matrix of all pairs of inner products.
Spatial Pyramids: A Technical Aside
Some popular kernels:

Linear:
k(x, y) = x \cdot y

Gaussian RBF:
k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)

\chi^2:
k(x, y) = \frac{1}{2} \sum_{i=1}^{d} \frac{(x_i - y_i)^2}{x_i + y_i}

Exponential \chi^2:
k(x, y) = \exp\left( -\beta \, \chi^2(x, y) \right)
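These kernels are essentially one-liners in NumPy (a sketch; the `eps` guard against empty histogram bins is my addition, and the slide's χ² formula is used as the distance inside the exponential kernel):

```python
import numpy as np

def linear_k(x, y):
    return x @ y

def rbf_k(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def chi2(x, y, eps=1e-12):
    # chi-squared distance between nonnegative histograms
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def exp_chi2_k(x, y, beta=1.0):
    return np.exp(-beta * chi2(x, y))
```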
Spatial Pyramids: Structured Matching
The inspiration for Spatial Pyramids comes from a technique called pyramid matching for measuring similarity between d-dimensional point sets X and Y.
It constructs a sequence of grids at resolutions 0, \ldots, L such that the grid at level \ell has 2^\ell cells along each of the d dimensions.
Thus, there is a total of D = 2^{d\ell} cells at level \ell.
Finally, let H^\ell_X and H^\ell_Y denote the histograms of X and Y at level \ell.
So: H^\ell_X(i) and H^\ell_Y(i) are the number of points from X and Y that fall into the ith cell of the grid.
Spatial Pyramids: Structured Matching
The main tool in defining Spatial Pyramids is the Histogram Intersection Kernel:

I(H^\ell_X, H^\ell_Y) = \sum_{i=1}^{D} \min\left( H^\ell_X(i), H^\ell_Y(i) \right)

What we’re doing here is simply counting the number of points that hit the same cells in the dyadic spatial decomposition.
If we notice that the matches found at level \ell also include all matches found at the finer level \ell + 1, we can write the pyramid match kernel (abbreviating I^\ell = I(H^\ell_X, H^\ell_Y)):

\kappa^L(X, Y) = I^L + \sum_{\ell=0}^{L-1} \frac{1}{2^{L-\ell}} \left( I^\ell - I^{\ell+1} \right)
= \frac{1}{2^L} I^0 + \sum_{\ell=1}^{L} \frac{1}{2^{L-\ell+1}} I^\ell
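The kernel above, in its second (weighted-sum) form, is a few lines of NumPy (a sketch; the per-level histograms are assumed precomputed, e.g. by the grid subdivision described earlier):

```python
import numpy as np

def intersection(hx, hy):
    """Histogram intersection: sum_i min(H_X(i), H_Y(i))."""
    return np.minimum(hx, hy).sum()

def pyramid_match(hists_x, hists_y):
    """hists_x[l], hists_y[l] are the level-l histograms, l = 0..L."""
    L = len(hists_x) - 1
    I = [intersection(hists_x[l], hists_y[l]) for l in range(L + 1)]
    k = I[0] / 2 ** L                    # coarsest level gets weight 1/2^L
    for l in range(1, L + 1):
        k += I[l] / 2 ** (L - l + 1)     # finer levels are weighted more heavily
    return k
```

For identical point sets every intersection equals the number of points N, and the weights sum to 1, so κ^L(X, X) = N.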
Spatial Pyramids: Structured Matching
To extend this idea to the BOW model, we quantize all salient locations in the image (SIFT descriptors) into M visual-word channels.
Let X_m and Y_m represent the spatial locations of all visual words of type m in images X and Y, respectively.
Then, we can write the Spatial Pyramid Match Kernel as:

K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m)

Note that \kappa^L is just a weighted sum of histogram intersections.
Note also that, for positive numbers, c \min(a, b) = \min(ca, cb).
Thus, we can write the above as a single histogram intersection of the concatenation of appropriately weighted channel histograms at all levels.
Spatial Pyramids: What’s Going On
This is the diagram from the paper:
But this is how people usually think of the Spatial Pyramid:
Spatial Pyramids: Experiments
New Trend: More Data is Better.
Scenes-15: fifteen categories with strong intra-class variability and inter-class similarity.
200–300 images per category.
Spatial Pyramids: Experiments
New Trend: More Data is Better.
Caltech-101: 101 categories with strong intra-class variability and inter-class similarity.
31–800 images per category.
Spatial Pyramids: Experiments
Results are impressive and interesting on Scenes-15:
And on Caltech-101:
Spatial Pyramids: Reflections
Despite the simplicity of the method, it consistently achieves an improvement over an orderless image representation.
This despite the fact that it works not by constructing explicit object models, but by using global cues as indirect evidence for the presence of an object.
This is not a trivial accomplishment, given that a well-designed bag-of-features method can outperform more sophisticated approaches based on parts and relations.
As I mentioned before, the Spatial Pyramid technique became a standard trick to significantly and consistently improve results for BOW models.
In the next two papers, we will look at more sophisticated coding techniques.
Advanced BOW: Sparse Coding
Locality-constrained Linear Coding
The original BOW model uses global pooling of descriptors, hence a single, global image representation.
Spatial Pyramids add some spatial structure to the image representation.
Can we improve the way the features themselves are coded before pooling into the final image representation?
We will look at one approach to better feature coding in our next paper:
Locality-constrained linear coding for image classification, J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong. In: Computer Vision and Pattern Recognition (CVPR), 2010.
Locality-constrained Linear Coding
Some observations:
BOW + SPM works really well.
But: it requires non-linear SVMs to achieve state-of-the-art performance.
As datasets grow larger, computing and storing the kernel matrix for solving the dual SVM formulation becomes onerous.
We hope for a better feature encoding that achieves state-of-the-art results, but with linear SVMs.
The key insight is to use the codebook (visual vocabulary) more effectively.
This is done through sparse coding to encode local features,
followed by max pooling (as opposed to average pooling) to arrive at the global image description.
LLC: Basic Definitions
Let X = [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N} be the local descriptors extracted from an image.
Given a set of codewords B = [b_1, \ldots, b_M] \in \mathbb{R}^{D \times M},
we want to encode each x_i into an M-dimensional code.
Vector Quantization is used in the BOW, resulting in a set of one-hot codes C = [c_1, \ldots, c_N] \in \mathbb{R}^{M \times N}:

\min_C \sum_{i=1}^{N} \|x_i - B c_i\|^2 \quad \text{s.t. } \|c_i\|_{\ell_0} = 1, \; \|c_i\|_{\ell_1} = 1, \; c_i \succeq 0, \; \forall i

Sparse coding can be used instead (reference 22 in the paper):

\min_C \sum_{i=1}^{N} \|x_i - B c_i\|^2 + \lambda \|c_i\|_{\ell_1}

This leads to lower quantization loss by using more elements of the codebook to encode each local feature.
LLC: Paying a Price
LLC proposes to use an additional locality (in feature space) constraint:

\min_C \sum_{i=1}^{N} \|x_i - B c_i\|^2 + \lambda \|d_i \odot c_i\|^2 \quad \text{s.t. } \mathbf{1}^\top c_i = 1, \; \forall i

where locality is smoothly modeled with an exponential:

d_i = \exp\left( \frac{\mathrm{dist}(x_i, B)}{\sigma} \right)

with \mathrm{dist}(x_i, B) the vector of Euclidean distances from x_i to each codeword.
So: code features, but pay a cost for using codewords far from the descriptor being encoded.
LLC: What’s Going On
Here is a comparison:
LLC: What’s Going On
The full pipeline:
LLC: Implementation
In practice, solving the constrained optimization problem for every descriptor is too costly.
Solution: select the k nearest codewords in feature space, and solve a small constrained least-squares problem using only those k codewords.
Codebook optimization: maybe k-means doesn’t yield an optimal codebook for LLC coding:
Section 3 gives an iterative algorithm for building a codebook.
In practice, everyone uses k-means.
Classification: the LLC embedding is rich enough to perform well with linear SVMs as classifiers.
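A sketch of the k-NN approximation (plain NumPy; the regularization constant and the function name are my choices — the paper solves the same small constrained least-squares system):

```python
import numpy as np

def llc_encode(x, B, k=5, reg=1e-4):
    """Approximate LLC code for a descriptor x (D,) over a codebook B (M, D).
    Returns an M-dimensional code that sums to one and is nonzero only on
    the k nearest codewords."""
    M = B.shape[0]
    d2 = ((B - x) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:k]               # k nearest codewords in feature space
    z = B[idx] - x                         # shift the selected codewords to the origin
    C = z @ z.T                            # local covariance
    C += np.eye(k) * reg * np.trace(C)     # regularize for numerical stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                           # enforce the sum-to-one constraint
    code = np.zeros(M)
    code[idx] = w
    return code
```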
LLC: Results
Extract dense HOG descriptors (8-pixel stride) at three scales.
Use k = 5 for approximate LLC encoding.
They also use a Spatial Pyramid (but don’t explain the configuration).
The authors consider two pooling methods to arrive at the final image representation:
sum-pooling: just sum all codes (this is the BOW pooling).
max-pooling: take the maximum coefficient for each codeword.
They report results with max-pooling and L2 normalization (since they use linear SVMs).
Use a linear SVM to train one-versus-all classifiers for each category.
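The difference between the two pooling schemes, in a toy example (NumPy; the code matrix is made up for illustration):

```python
import numpy as np

codes = np.array([[0.0, 0.7, 0.3],   # LLC code of local descriptor 1
                  [0.2, 0.8, 0.0],   # LLC code of local descriptor 2
                  [0.6, 0.0, 0.4]])  # LLC code of local descriptor 3

sum_pool = codes.sum(axis=0)             # BOW-style sum pooling
max_pool = codes.max(axis=0)             # LLC max pooling
max_pool /= np.linalg.norm(max_pool)     # L2-normalize for the linear SVM
```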
LLC: Results
On Caltech-101:
And on Caltech-256:
LLC: Results
A new dataset (which became the benchmark for object recognition).
The PASCAL Visual Object Categorization (VOC) competitions defined the state-of-the-art for five years.
20 object categories; high-quality annotations; recognition, segmentation, detection, etc.
Introduced the use of average precision to evaluate object recognition.
LLC: Results
Results on PASCAL 2007:
LLC: Reflections
The LLC encoding technique takes a different approach to enriching the image representation.
It uses sparse codes, but these codes are sparse in that only local codewords can contribute to the encoding of a feature.
Global pooling is done using a max operation, which helps ensure global quasi-sparsity.
The resulting codes can be used with linear SVMs, which is a huge win for large datasets.
It beats other BOW/SPM approaches, and achieves results comparable to more complex methods at the state-of-the-art.
Advanced BOW: Fisher Vectors
FV: Clusters are Not Points
The main observation in Fisher Vector coding is that the quantization process is imprecise.
More precisely: clusters are distributions of points.
FV: Start with a Generative Model
Let X = \{ x_t, t = 1 \ldots T \} be the descriptors extracted from an image.
Assume there is a generation process for X modeled by a probability density function u_\lambda with parameters \lambda.
X can be characterized by the gradient:

G^X_\lambda = \frac{1}{T} \nabla_\lambda \log u_\lambda(X)

Why the gradient? Because it describes the contribution of each parameter to the generation process (also, the gradient of the data log-likelihood is the Fisher score).
Plus, the dimensionality depends only on the number of parameters in \lambda, and not on the number of patches in the image.
FV: Then define a kernel
A natural kernel on gradients of generative-model log-likelihoods is the Fisher kernel:

K(X, Y) = G^{X\,\prime}_\lambda F^{-1}_\lambda G^Y_\lambda

where F_\lambda is the Fisher Information Matrix:

F_\lambda = E_{x \sim u_\lambda}\left[ \nabla_\lambda \log u_\lambda(x) \, \nabla_\lambda \log u_\lambda(x)' \right]

Since F_\lambda is symmetric and positive definite, we can write F^{-1}_\lambda = L'_\lambda L_\lambda (e.g. by Cholesky decomposition).
So, we can rewrite K(X, Y) as a dot product between normalized vectors:

\mathcal{G}^X_\lambda = L_\lambda G^X_\lambda
FV: The Fisher Vector
This \mathcal{G}^X_\lambda is the Fisher Vector of image X.
Big win: learning a kernel classifier with the non-linear kernel K is equivalent to learning a linear classifier on Fisher Vectors.
Implementation:
Assume the generative model is a mixture of Gaussians with weights w_i, means \mu_i, and diagonal covariances \sigma_i^2.
And assume that all x_t are generated independently:

G^X_\lambda = \frac{1}{T} \sum_{t=1}^{T} \nabla_\lambda \log u_\lambda(x_t)

Compute gradients with respect to the means and diagonal covariances of all mixture components.
The final descriptor is the concatenation of:

\mathcal{G}^X_{\mu,i} = \frac{1}{T \sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i) \left( \frac{x_t - \mu_i}{\sigma_i} \right)

\mathcal{G}^X_{\sigma,i} = \frac{1}{T \sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i) \left( \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right)

where \gamma_t(i) is the posterior probability (responsibility) of component i for descriptor x_t.
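A compact NumPy sketch of the two gradient formulas above, assuming GMM parameters (weights w, means mu, diagonal standard deviations sigma) are given; the responsibility computation drops constants that cancel in the softmax, and the normalization follows the slide (1/(T√w_i) for both terms):

```python
import numpy as np

def fisher_vector(X, w, mu, sigma):
    """Fisher Vector w.r.t. means and diagonal variances of a GMM.
    X: (T, D) descriptors; w: (K,) mixture weights; mu, sigma: (K, D)."""
    T = X.shape[0]
    # log of unnormalized component densities (shared constants cancel below)
    logp = (-0.5 * (((X[:, None, :] - mu) / sigma) ** 2).sum(-1)
            - np.log(sigma).sum(-1) + np.log(w))
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)       # responsibilities, (T, K)
    u = (X[:, None, :] - mu) / sigma                # standardized residuals, (T, K, D)
    G_mu = (gamma[..., None] * u).sum(0) / (T * np.sqrt(w)[:, None])
    G_sigma = (gamma[..., None] * (u ** 2 - 1)).sum(0) / (T * np.sqrt(w)[:, None])
    return np.concatenate([G_mu.ravel(), G_sigma.ravel()])
```

The output has 2KD dimensions: one mean gradient and one variance gradient per component and per descriptor dimension.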
FV: What’s Going On
The Fisher Vector is a weighted average of gradients.
These gradients are defined at every point in descriptor space.
First compute the FV for each individual descriptor.
Then average-pool the vectors to compute the FV encoding for the image.
FV: Improvements
There are really only two improvements to the Fisher Vector proposed in this paper.
Improvement 1: L2-normalize the Fisher Vectors before training an SVM (not surprising).
Improvement 2, power normalization: also known as the "signed square root" or the Hellinger kernel, it compensates for bursty features:

f(z) = \mathrm{sign}(z) |z|^\alpha \quad \text{with } 0 \leq \alpha \leq 1

(\alpha = 1/2 gives the signed square root).
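Both normalizations together, as a sketch (the function name and the default α are my choices; α = 1/2 gives the signed square root):

```python
import numpy as np

def power_l2_normalize(fv, alpha=0.5):
    """Signed power normalization followed by L2 normalization."""
    fv = np.sign(fv) * np.abs(fv) ** alpha
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```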
FV: Baseline Comparison
It has become standard practice to do a baseline comparison.
In these experiments, you want to evaluate the contribution of each component of your work.
FV: Us versus Them
The proof is in the pudding.
PASCAL 2007:
Caltech 256:
FV: Reflections
The Fisher Vector is an alternative coding method.
Instead of quantizing descriptors, you represent each descriptor with a gradient.
This gradient represents the relationship between a local descriptor and all clusters in a generative model.
This encoding can significantly improve performance over the BOW framework.
Bonus: everything is linear, and we can use efficient solvers – even for large-scale datasets.
The Fisher Vector was the state-of-the-art in Bag-of-Features coding until 2012.
Detection: Deformable Part Models
Deformable Part Models
[OTHER PRESENTATIONS]
Discussion
Discussion
Discuss