22nd June 2009, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2009
Center for Machine Perception, Czech Technical University in Prague
Efficient Representation of Local Geometry for Large Scale Object Retrieval
Michal Perďoch, Ondřej Chum and Jiří Matas
Perďoch M., Chum O., Matas J.: Efficient Representation of Local Geometry for Large Scale Object Retrieval 2/20
Large Scale Object Retrieval
Large (web) scale “real-time” search involves millions (billions) of images.
The indexing structure should fit into RAM; failing to do so results in an order of magnitude increase in response time.
1 bit per feature requires ~1.16 GB in a retrieval system with 5 million images and ~2000 features per image.
Scalability and the cost of the search engine are determined by the memory footprint -> memory is the limiting factor.
Methods based on local features:
- are able to retrieve objects, not only images
- are robust to occlusions and background changes
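A quick back-of-envelope check of the footprint figures quoted above (a sketch for illustration, not from the talk):

```python
# Index memory footprint for the numbers quoted above:
# 5 million images, ~2000 features per image.
n_images = 5_000_000
feats_per_image = 2000
n_features = n_images * feats_per_image

def gib(bits_per_feature):
    """Total index size in GiB for a given per-feature bit budget."""
    return n_features * bits_per_feature / 8 / 2**30

print(f"1 bit/feature:        {gib(1):.2f} GiB")   # every extra bit costs ~1.16 GiB
print(f"32 bits (appearance): {gib(32):.1f} GiB")
print(f"160 bits (geometry):  {gib(160):.1f} GiB")
```

Each additional bit per feature thus costs over a gigabyte of RAM at this scale, which is why the geometry representation is worth compressing.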
Image vs. Object Retrieval
[Figure: example of an image query vs. an object query]
Object Retrieval vs. Global Methods
[Figure: query example]
Image Representation by Local Features
Local features = appearance + geometry. Bag-of-words approaches focus on appearance; geometry benefits object retrieval in spatial verification [Philbin’07].
The problem: how to efficiently store the geometric information of local features?
Currently, the straightforward representation of local geometry requires roughly 4-5x more memory than the appearance, i.e. ~32 bits for appearance vs. ~160 bits for geometry.
Our solution: a learnt geometric vocabulary
Image Representation by Local Features
Bag of words, Video Google [Šivic’03]
[Figure: pipeline – image (graffiti) → keypoint detection → local appearance (SIFT description [Lowe’04]) → visual vocabulary → visual words (word1, word2, word8, …, word948534, word998125); local geometry → ?]
Object Retrieval with Spatial Verification
[Figure: query (graffiti) → visual words → inverted file with TF-IDF weights (postings for words 128, …, 948534, 998125 pointing to images graffiti, graffiti2, bark, bikes, all_souls, …) → TF-IDF ranking (graffiti 0.85, graffiti2 0.81, …, bark 0.013, all_souls 0.001) → spatial verification of the short-list with LO-RANSAC, where a correspondence (Ai, Aj) hypothesizes a transformation H → final ranking by verified matches (graffiti 210, graffiti2 138, …, bark 3, all_souls 1)]
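The TF-IDF ranking stage of this pipeline can be sketched as follows (a toy example; the image names and word ids are illustrative, and a real system would also L2-normalize the vectors and rerank the short-list with LO-RANSAC):

```python
import math
from collections import Counter, defaultdict

# Toy inverted file with TF-IDF scoring.
database = {
    "graffiti":  [128, 948534, 998125, 948534],
    "graffiti2": [128, 948534, 998125],
    "bark":      [128, 7, 42],
    "all_souls": [9, 9, 42],
}

n_docs = len(database)
df = Counter()                      # document frequency per visual word
for words in database.values():
    df.update(set(words))
idf = {w: math.log(n_docs / df[w]) for w in df}

inverted = defaultdict(list)        # word -> [(image, term frequency), ...]
for name, words in database.items():
    for w, tf in Counter(words).items():
        inverted[w].append((name, tf))

def score(query_words):
    """Accumulate TF-IDF scores by traversing only the query words' postings."""
    scores = defaultdict(float)
    for w, qtf in Counter(query_words).items():
        # idf weights both the query and the database term (dot product).
        for name, tf in inverted.get(w, []):
            scores[name] += qtf * tf * idf.get(w, 0.0) ** 2
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranking = score([128, 948534, 998125])
```

Only images sharing at least one visual word with the query receive a score, which is what makes the inverted file efficient at scale.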
Local Geometry of Affine Features
A local affine feature is given by its center x0 and an ellipse E. A transformation A mapping the unit circle onto the ellipse satisfies A A⊤ = E and is determined only up to an arbitrary rotation Q. We fix the ambiguity in Q by choosing A = Â R, where Â = E^(1/2) is symmetric positive definite and the rotation R is given by a dominant orientation as in [Lowe’04]. A local affine transformation is defined by a single correspondence of such features.
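A minimal sketch of this decomposition, assuming the formulation A A⊤ = E above (our notation, not the authors' code):

```python
import numpy as np

def sqrtm_spd(E):
    """Symmetric positive definite square root via eigendecomposition."""
    w, V = np.linalg.eigh(E)
    return V @ np.diag(np.sqrt(w)) @ V.T

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

E = np.array([[4.0, 1.0], [1.0, 2.0]])   # example ellipse matrix (SPD)
A_hat = sqrtm_spd(E)                     # rotation-free shape, A_hat @ A_hat = E
R = rotation(0.7)                        # e.g. from the dominant orientation
A = A_hat @ R                            # full local affine transformation

# The rotation cancels in A A^T, so the mapped ellipse is unchanged:
assert np.allclose(A @ A.T, E)
```

This is why Â alone captures the ellipse shape: any choice of R maps the unit circle onto the same ellipse.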
Geometry Compression – Representing the Transformation Ai
Âi (independent of the rotation R) captures the shape of the ellipse. Let a transformation Bi be a representative of Âi – ideally Bi = Âi. However, to compress the representation, Bi is chosen from a vocabulary and shared by multiple Âi, so the hypothesized transformation only approximates the exact one. How do we measure and minimize the error E of H?
Minimizing the Geometric Error of H
We capture the quality of H by integrating the reprojection error over the unit circle. Using simple manipulations, this reduces to a Frobenius norm, which can therefore be used as a similarity in the space of transformations.
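A numerical check of this step (a sketch, not the authors' derivation): integrating the squared reprojection error ‖M u‖² of a 2×2 matrix M over the unit circle u(t) = (cos t, sin t) gives π·‖M‖_F², which is what lets the Frobenius norm stand in for the reprojection error.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((2, 2))        # stand-in for a difference of transformations

t = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
u = np.stack([np.cos(t), np.sin(t)])   # points on the unit circle
err = np.sum((M @ u) ** 2, axis=0)     # ||M u||^2 at each point
integral = err.mean() * 2.0 * np.pi    # uniform-grid quadrature over [0, 2*pi)

assert np.isclose(integral, np.pi * np.linalg.norm(M, "fro") ** 2)
```

The integrand is a trigonometric polynomial, so the uniform-grid quadrature here is exact up to floating point.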
K-means-like Clustering
For a set of transformations A of cardinality N, we look for the best set B of K representative candidates Bj that minimizes the error E over all possible assignments of the Ai:
1. Assignment step: assign each Ai to its closest representative Bj.
2. Refinement step: re-estimate each Bj from its assigned transformations, which leads to a closed-form solution.
3. Go to 1 until no reassignments occur or the error is below a threshold.
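The loop above can be sketched for 2×2 shape matrices under the Frobenius error, for which the closed-form refinement is the per-cluster mean (an illustrative sketch; the actual method additionally constrains the representatives to be valid ellipse shapes):

```python
import numpy as np

def cluster_transformations(A, K, iters=50, seed=0):
    """K-means over a set A of 2x2 matrices with Frobenius distance."""
    rng = np.random.default_rng(seed)
    B = A[rng.choice(len(A), K, replace=False)]       # init from the data
    assign = np.full(len(A), -1)
    for _ in range(iters):
        # 1. assignment step: nearest representative in Frobenius norm
        d = ((A[:, None] - B[None]) ** 2).sum(axis=(2, 3))
        new = d.argmin(axis=1)
        if np.array_equal(new, assign):               # 3. stop criterion
            break
        assign = new
        # 2. refinement step: closed-form minimizer is the cluster mean
        for j in range(K):
            if (assign == j).any():
                B[j] = A[assign == j].mean(axis=0)
    return B, assign

A = np.random.default_rng(1).standard_normal((200, 2, 2))
B, assign = cluster_transformations(A, K=8)
```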
Geometric “Vocabularies”
Assuming that the scale and the shape of the ellipse are independent, the scale can be extracted and quantized separately. Each geometric vocabulary is characterized by two numbers (x, y) and denoted SxEy:
x – number of bits for encoding the scale
y – number of bits for encoding the ellipse shape, K = 2^y
Geometric vocabularies were learnt on a set of local geometries of 10M affine covariant points (for visualisation, the scale was quantized separately).
[Figure: learnt shape vocabularies for y = 2 (K = 4), y = 3 (K = 8), y = 4 (K = 16) and y = 5 (K = 32)]
Geometric “Vocabularies”
[Figure: example vocabularies – S0E8: 256 ellipses; S4E4: 16 scales, 16 ellipses; S8E0: 256 scales, circles only; S4E12: 16 scales, 4k ellipses]
Efficient Geometry Representation
The coordinates xi, yi are quantized separately and encoded using 16 bits.
[Figure: the pipeline of slide 7 extended with geometry – keypoint detection → SIFT description [Lowe’04] → visual vocabulary → visual words; in parallel, local geometry → geometric vocabulary → per-feature records (x1, y1, B1), (x2, y2, B5), (x3, y3, B3), …, (xN, yN, BN)]
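A hypothetical packing of the resulting 24-bit geometry record (x, y, B). The 8+8 bit split of the position is our assumption for illustration; the talk only specifies 16 bits for position and 8 bits for the shape index:

```python
def pack(qx, qy, shape_id):
    """Pack quantized coordinates and a shape index into a 24-bit code."""
    assert 0 <= qx < 256 and 0 <= qy < 256 and 0 <= shape_id < 256
    return (qx << 16) | (qy << 8) | shape_id      # fits in 24 bits

def unpack(code):
    """Recover (qx, qy, shape_id) from a 24-bit code."""
    return (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF

code = pack(137, 42, 211)
assert code < 2**24 and unpack(code) == (137, 42, 211)
```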
Gravity Vector Assumption
The transformation A has one eigenvector (0, 1)⊤ pointing down in the image – we call it the gravity vector. Surprisingly, the upright direction is often preserved (R = I). Previously, the gravity vector assumption was used solely in spatial verification [Philbin’07].
Gravity Vector Assumption – Performance Comparison
SIFT dominant orientation [Lowe’04] produces 50% more descriptors. The robustness of the gravity vector assumption comes from:
- the robustness of SIFT to small imprecision in orientation
- the LO-step in RANSAC, which corrects the final geometry
Experiments – Datasets and Protocol
Oxford Buildings Dataset (Ox5K) [Philbin’07]: 5062 images of buildings collected from Flickr; 55 queries with ground truth – 11 landmarks, each with 5 different queries.
mAP – mean Average Precision, the average area below the precision-recall curves over all queries.
An additional 100k distractor images form, together with Ox5K, the Ox105K dataset.
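One common way to compute the mAP measure described above (a sketch; the relevance flags are toy values, not the Oxford ground truth):

```python
def average_precision(ranked_relevance, n_relevant):
    """AP of one ranked list: mean precision at each relevant result."""
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / n_relevant

def mean_ap(queries):
    """mAP over queries; each query must have at least one relevant result."""
    return sum(average_precision(r, sum(r)) for r in queries) / len(queries)

# e.g. relevant results retrieved at ranks 1 and 3:
ap = average_precision([1, 0, 1], n_relevant=2)   # (1/1 + 2/3) / 2
```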
Experiments – Geometric Vocabularies
Two different visual vocabularies were used, trained on Oxford 5K and on ~6000 images of tourist spots in Paris; protocol from [Philbin’07].
Spatial verification with an 8-bits-per-feature geometric vocabulary for the ellipse shape performs almost as well as exact geometry.
It might be necessary to use more challenging datasets to highlight the benefit of affine shape: in the Oxford Buildings dataset, 86% of all ground-truth pairs have anisotropy under 1.5 (72% under 1.25) and 90% of all ground-truth pairs have skew under 20° (76% under 10°).
[Plot: mAP for the different geometric vocabularies, with “no geometry” and “full geometry” baselines]
Experiments – Performance Comparison on the Oxford Buildings Dataset
The proposed method outperforms all state-of-the-art methods; the geometry representation S0E8 was used.
Conclusions – Contributions
We have proposed a method for learning an efficient, compact representation of local geometry.
The method achieved state-of-the-art results (0.916 mAP) on the Oxford Buildings dataset and competitive results on the INRIA Holidays dataset.
The proposed method requires only 45.2 bits per feature on average: local geometry in 24 bits (16 bits for position, 8 bits for shape), visual word labels and TF-IDF weights in 21.2 bits.
We have shown that using the gravity vector assumption in feature description significantly improves the recognition rate and further reduces the memory footprint.
Future work: compare the performance of scale vs. affine covariant points on a standard dataset.