Similarity Search
Database & Information Systems Group
University of Basel
October 10th, 2007
Similarity Search
Michael Springmann
PhD Seminar
October 11th, 2007
October 10th, 2007, Michael Springmann - Database & Information Systems Group
Projects
I. DELOS (EU FP6): Network of Excellence on Digital Libraries, http://www.delos.info/
   Task 1.6: Management of and Access to Virtual Electronic Health Records
   Task 1.8: DelosDLMS
II. DILIGENT (EU FP6): A Digital Library Infrastructure on Grid Enabled Technology
   Work Package 1.4: Index & Search – Feature Extraction
   ARTE Scenario
What is similarity search?
From a collection, return a ranked list of items for a given reference object.
[Figure: a reference object and the ranked result list with similarity scores: 1. 0.999, 2. 0.873, 3. 0.722, 4. 0.712, 5. 0.503, 6. 0.442, 7. 0.392]
Steps to compute similarity
I. Define query (reference object)
II. Select feature to use for comparison, e.g. Color Histogram
III. Extract feature of reference object, e.g. (203, 236, 172, 210, 78)
IV. Compare feature with each element of collection, e.g. $\sum_{i=0}^{d} (a_i - b_i)^2$
V. Return (subset of) ranked list, e.g. 5-NN
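The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single-channel intensity histogram stands in for a real color histogram, and the names `histogram`, `squared_distance` and `rank` are hypothetical helpers, not part of any system mentioned in the talk.

```python
def histogram(pixels, bins=4, max_value=256):
    """Step III: extract a feature, here a coarse intensity histogram (assumed stand-in)."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    return counts


def squared_distance(a, b):
    """Step IV: sum of squared differences between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank(query_pixels, collection, k=5):
    """Steps I to V: compare the query feature with every item and return the k nearest."""
    query_feature = histogram(query_pixels)
    scored = [
        (squared_distance(query_feature, histogram(pixels)), name)
        for name, pixels in collection.items()
    ]
    return sorted(scored)[:k]


if __name__ == "__main__":
    collection = {
        "dark": [0, 10, 20, 30],
        "bright": [200, 220, 240, 250],
        "mixed": [0, 100, 150, 250],
    }
    # Smaller distance means more similar; the list is ordered best first.
    print(rank([5, 15, 25, 35], collection, k=2))
```

A linear scan like this touches every element of the collection, which is why the efficiency question raised later in the talk matters.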
Similarity Search: Media Types
I. Image – Color, Texture, Shape
II. Text – TF/IDF, Edit Distance
III. Audio – Spectrum, Rhythm, Beat, Pitch
IV. Video Sequences – Visual, Subtitles / Audio Transcripts, (rich) Meta Data
Combinations of several types: Complex Documents
High dimensional feature vectors
Goals
Effectiveness
   Theme: Find good/better results!
   Measure: Quality, e.g. for benchmark collections Precision, Recall, MAP
   Question: How can we find better results w.r.t. the information need of the user?
Efficiency
   Theme: Retrieve the results fast!
   Measure: Execution time
   Question: How can we achieve this with algorithmic optimizations?
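The quality measures named under Effectiveness can be made concrete with a small sketch. It uses the standard textbook definitions of precision, recall and (mean) average precision; the function names and the toy data are assumptions for illustration and do not come from the talk.

```python
def precision(ranked, relevant):
    """Fraction of returned items that are relevant."""
    return sum(1 for item in ranked if item in relevant) / len(ranked)


def recall(ranked, relevant):
    """Fraction of relevant items that were returned."""
    return sum(1 for item in ranked if item in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP: average of the per-query average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]
    relevant = {"a", "c"}
    print(precision(ranked, relevant), recall(ranked, relevant))
    print(average_precision(ranked, relevant))
```

Average precision rewards rankings that place relevant items early, which is why it suits ranked result lists better than precision or recall alone.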
Similarity Search: What it is...
A way to order / rank things
May help to group objects
Limitations:
1. Feature matches categorization criterion
2. No sharp borders
ISIS (Interactive Similarity Search)
I. Originated at ETH Zurich, continued at UMIT and UNIBAS
II. VA-File can handle collections of more than 600,000 images while still achieving interactive response times
III. Used image features: Color Moments, Texture Moments
IV. Global and 5 Fuzzy Regions
5 Fuzzy Regions
Similarity Search: What it is ... and what it ain't?
A way to order / rank things
May help to group objects
Limitations:
1. Feature matches categorization criterion
2. No sharp borders
Feature extraction will not find out: "One person sleeping ...", at least not without application specific adjustments / training
ImageCLEF (http://www.imageclef.org)
I. Ad-hoc photographic retrieval task: IAPR TC-12 Benchmark, 20,000 (tourist) images, multi-lingual descriptions. Main challenge: Short annotations.
II. Object Retrieval Task: PASCAL Visual Object, 2,617 images, 4,754 objects in realistic scenes. Main challenge: Pure visual, not pre-segmented.
III. Medical Image Retrieval: c@simage, PEIR, MIR, PathoPic, mypacs.net: > 70,000 images, heterogeneous case notes in XML
IV. Medical Automatic Annotation Task: IRMA Database, 11,000 medical images, annotated with IRMA Code (116 classes). Main challenge: Pure visual, classification domain specific. Example code: 1123-127-500-000
IRMA Code Classification Example
1123-127-500-000
4 independent axes:
Technical code (T) describes the image modality, e.g. 1 = x-ray, 11 = plain radiography, 112 = analog, 1123 = high beam energy
Directional code (D) models body orientation, here: anteroposterior (AP, coronal), supine
Anatomical code (A) refers to the body region examined, here: chest
Biological code (B) describes the biological system examined.
0 always means unspecific and therefore is always followed by other 0s or -.
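The hierarchical structure of such a code can be illustrated with a short sketch. It assumes that the four hyphen-separated groups appear in the order T, D, A, B (the order in which the axes are listed on the slide) and that a 0 ends the specific part of an axis; `parse_irma` and `hierarchy` are hypothetical helper names.

```python
AXES = ("T", "D", "A", "B")  # assumed order: technical, directional, anatomical, biological


def parse_irma(code):
    """Split an IRMA code such as '1123-127-500-000' into its four axes."""
    parts = code.split("-")
    if len(parts) != len(AXES):
        raise ValueError("expected %d hyphen-separated groups" % len(AXES))
    return dict(zip(AXES, parts))


def hierarchy(axis_code):
    """List the prefixes of one axis from coarse to fine, stopping at the first 0 (unspecific)."""
    levels = []
    for length, character in enumerate(axis_code, start=1):
        if character == "0":
            break
        levels.append(axis_code[:length])
    return levels


if __name__ == "__main__":
    axes = parse_irma("1123-127-500-000")
    for name in AXES:
        print(name, axes[name], hierarchy(axes[name]))
```

For the technical axis this yields the chain 1, 11, 112, 1123, matching the refinement from x-ray down to high beam energy described above.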
Image Distortion Model (IDM)
Uses reduced size images of at most 32 pixels width/height
[Figure: corresponding pixels]
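The idea of matching corresponding pixels can be sketched as follows. This is a simplified illustration, not the implementation discussed in the talk: it works on plain intensities rather than gradients, assumes both images have the same size, and clamps coordinates at the border. The parameters `warp` (how far a pixel may be displaced) and `context` (radius of the local patch that is compared) are assumed names.

```python
def _pixel(img, y, x):
    """Read a pixel, clamping coordinates to the image border."""
    y = min(max(y, 0), len(img) - 1)
    x = min(max(x, 0), len(img[0]) - 1)
    return img[y][x]


def idm_distance(query, ref, warp=2, context=1):
    """For every query pixel, take the best-matching local patch within the warp range."""
    total = 0
    for y in range(len(query)):
        for x in range(len(query[0])):
            best = None
            for dy in range(-warp, warp + 1):
                for dx in range(-warp, warp + 1):
                    cost = 0
                    for cy in range(-context, context + 1):
                        for cx in range(-context, context + 1):
                            diff = _pixel(query, y + cy, x + cx) - _pixel(ref, y + dy + cy, x + dx + cx)
                            cost += diff * diff
                    if best is None or cost < best:
                        best = cost
            total += best
    return total


if __name__ == "__main__":
    a = [[0] * 4 for _ in range(4)]
    b = [[0] * 4 for _ in range(4)]
    a[1][1] = 255
    b[1][2] = 255  # same image content, shifted one pixel to the right
    print(idm_distance(a, b, warp=0, context=0))  # no displacement allowed
    print(idm_distance(a, b, warp=1, context=1))  # displacement of one pixel allowed
```

With `warp=0` and `context=0` the measure reduces to a pixel-wise sum of squared differences; allowing a small displacement makes it tolerant to local shifts, at the price of many more pixel comparisons per image pair.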
Edge Detection (Sobel Filter)
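A Sobel filter estimates the horizontal and vertical intensity gradient at each pixel with two 3×3 kernels. The following minimal sketch computes the gradient magnitude for interior pixels and leaves the border at zero; the border handling and the function name `sobel_magnitude` are assumptions made for illustration.

```python
import math

# Standard Sobel kernels for the horizontal (GX) and vertical (GY) gradient.
GX = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
GY = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(img):
    """Return the gradient magnitude per pixel; border pixels are left at 0.0."""
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    value = img[y + j - 1][x + i - 1]
                    gx += GX[j][i] * value
                    gy += GY[j][i] * value
            out[y][x] = math.hypot(gx, gy)
    return out


if __name__ == "__main__":
    # Vertical step edge: dark on the left, brighter on the right.
    step = [[0, 0, 10, 10, 10] for _ in range(5)]
    for row in sobel_magnitude(step):
        print(row)
```

The response is large next to the step edge and zero in flat regions, which makes gradient images less sensitive to overall brightness than raw intensities.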
Efficiency: Speeding up IDM
Algorithmic optimization
Idea: Only k ≤ 5 of the 10,000 reference images are used for subsequent kNN classification.
Early termination of the distance computation for images that will not be used
Base decision on threshold derived from best k images seen so far
[Figure: pixels not evaluated due to exceeded threshold]
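The early termination idea can be sketched independently of IDM. In the illustration below a simple sum of absolute differences stands in for the per-pixel IDM cost, and the threshold is the distance of the k-th best image seen so far. The parameter name `max_sum` echoes the "maxSum" label used in the result charts, but the code itself is an assumed reconstruction, not the original implementation.

```python
import heapq


def bounded_distance(a, b, max_sum):
    """Accumulate the distance value by value; give up once the partial sum exceeds max_sum.

    Returns (distance or None, number of values evaluated).
    """
    total = 0
    for count, (x, y) in enumerate(zip(a, b), start=1):
        total += abs(x - y)
        if total > max_sum:
            return None, count
    return total, len(a)


def knn_early_termination(query, refs, k):
    """Find the k nearest references, aborting computations that cannot reach the top k."""
    heap = []  # max-heap of the best k so far, stored as (-distance, index)
    evaluated = 0
    for index, ref in enumerate(refs):
        max_sum = -heap[0][0] if len(heap) == k else float("inf")
        distance, count = bounded_distance(query, ref, max_sum)
        evaluated += count
        if distance is None:
            continue  # terminated early: cannot be among the k best
        if len(heap) < k:
            heapq.heappush(heap, (-distance, index))
        else:
            heapq.heapreplace(heap, (-distance, index))
    return sorted((-d, i) for d, i in heap), evaluated


if __name__ == "__main__":
    refs = [[1, 1, 1, 1], [9, 9, 9, 9], [0, 0, 0, 1], [5, 5, 5, 5], [2, 0, 0, 0]]
    print(knn_early_termination([0, 0, 0, 0], refs, k=2))
```

Because the partial sum can only grow, aborting at the threshold never changes the k nearest neighbours; it only skips work on images that are already known to be too far away.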
Early Termination Strategy - Experimental results
[Figure: chart with x-axis from 0 to 10,000, left axis from 0 to 14,000,000,000 (pixels), right axis from 0% to 100%; series: pixels max, IDM, L1, IDM %, L1 %]
For IDM: Less than 30% of all pixels need to get evaluated
Speaking of numbers…
I. The original RWTH Aachen implementation of IDM requires, for X×32 images with IDM (gradients, 5×5 window, 3×3 context), about 190 seconds per sample (= comparison) on a standard Pentium 4 PC running at 2.6 GHz.
II. Using the L2 distance in a sieve function, they reduced this to 16.8 seconds, but this causes a slight degradation of results.
III. Our Java implementation takes, for the same window & context area on a standard Pentium 4 PC at 2.4 GHz, only 16.0 seconds using the threshold (no degradation). The L2 distance can also benefit from the threshold: our sieve function implementation takes only 2.0 seconds per sample.
IV. We cached all features in main memory (only 60 MB). Reading directly from disk takes less than 5 seconds in total. Since reading is performed in parallel to the computation, the penalty for IDM is only about 0.3 seconds; the sieve function becomes I/O-bound.
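A sieve function of this kind can be read as a two-stage filter: a cheap distance preselects candidates, and the expensive distance is computed only for those. The sketch below is an assumed reading of that idea; the names `sieve_rank`, `keep`, `cheap` and `expensive` are hypothetical, and in the example a sum of absolute differences merely stands in for IDM.

```python
def squared_l2(a, b):
    """Cheap preselection distance: sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l1(a, b):
    """Stand-in for the expensive distance (IDM in the talk)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sieve_rank(query, refs, cheap, expensive, keep, k):
    """Keep the `keep` best references under `cheap`, then rank those with `expensive`."""
    candidates = sorted(range(len(refs)), key=lambda i: cheap(query, refs[i]))[:keep]
    scored = sorted((expensive(query, refs[i]), i) for i in candidates)
    return scored[:k]


if __name__ == "__main__":
    refs = [[1, 1], [5, 5], [0, 3], [9, 9]]
    print(sieve_rank([0, 0], refs, squared_l2, l1, keep=2, k=1))
```

The expensive distance runs on only `keep` references instead of all of them. The price is that a reference ranked low by the cheap distance is dropped even if the expensive distance would have ranked it high, which is one way a slight degradation of results can arise.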
Multithreading - Implementation
I. Several Java Worker Threads, each computes similarity between one reference image and query.
II. Dispatcher keeps track of distance threshold for early termination.
III. IDM with early termination takes 4.3 seconds on Fujitsu-Siemens Celsius M450 Workstation (Intel Core 2 Duo E6600, 2.4 GHz) – and only 1.5 seconds on IBM xSeries 445, 8x Intel Xeon MP 2.8 GHz.
IV. Opens possibility for optimizing second goal…
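The worker and dispatcher structure can be sketched as follows. The talk's implementation is in Java; this Python version only illustrates the shared-threshold design and is not expected to show the same speedup, since CPython threads do not run pure Python code in parallel. The class and function names are assumptions, and a sum of absolute differences again stands in for IDM.

```python
import heapq
import threading
from concurrent.futures import ThreadPoolExecutor


class Dispatcher:
    """Keeps the k best distances seen so far and exposes the current threshold."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # max-heap stored as (-distance, index)
        self._lock = threading.Lock()

    def threshold(self):
        with self._lock:
            return -self._heap[0][0] if len(self._heap) == self.k else float("inf")

    def report(self, distance, index):
        with self._lock:
            if len(self._heap) < self.k:
                heapq.heappush(self._heap, (-distance, index))
            elif distance < -self._heap[0][0]:
                heapq.heapreplace(self._heap, (-distance, index))

    def best(self):
        with self._lock:
            return sorted((-d, i) for d, i in self._heap)


def worker(query, ref, index, dispatcher):
    """Compare one reference with the query, stopping once the threshold is exceeded."""
    max_sum = dispatcher.threshold()
    total = 0
    for x, y in zip(query, ref):
        total += abs(x - y)
        if total > max_sum:
            return  # cannot be among the k best
    dispatcher.report(total, index)


def parallel_knn(query, refs, k, threads=4):
    dispatcher = Dispatcher(k)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(worker, query, ref, i, dispatcher) for i, ref in enumerate(refs)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return dispatcher.best()


if __name__ == "__main__":
    refs = [[1, 1, 1, 1], [9, 9, 9, 9], [0, 0, 0, 1], [5, 5, 5, 5], [2, 0, 0, 0]]
    print(parallel_knn([0, 0, 0, 0], refs, k=2))
```

A worker may read a threshold that is already slightly out of date. This is harmless because the threshold only ever decreases: a stale value makes a worker do a little extra work but never discards a true nearest neighbour.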
Effectiveness: Adjusting IDM
Uses reduced size images of at most 32 pixels width/height
[Figure: corresponding pixels]
Multithreading - Results
[Figure: Multithreading on 8-way Xeon MP 2.8 GHz Server. x-axis: number of used threads (0 to 16); y-axis: overall execution time for 1,000 queries in minutes (logarithmic, 1 to 10,000); series: IDM, IDM + Sieve, IDM W=3,LC=3, W=3,LC=3 + Sieve, Euclidean, IDM no maxSum, W=3,LC=3 no maxSum]
ImageCLEF 2007 Results
RWTH_mi - KNN: IDM + CCF + TTF
BLOOM - SVM: SIFT + Pixels
RWTHi6 - SVM/ME: Image Patches
UFR - SVM: Color Moments + Texture (DWT) + Edge Orientation
UNIBAS_DBIS - KNN: IDM
OHSU - Neural Network: GIST; SVM: SIFT
BIOMOD - Decision Trees: Random Subwindow
Use Machine Learning
Experts on Domain – Provided Dataset, won 2005
Use Machine Learning
No Machine Learning… yet
What’s next?
Blobworld (http://elib.cs.berkeley.edu/blobworld/)
More expressive query definition!
Region of interest
Query by Sketch (SNF Project)
Compound Document Matching
[Figure: two compound documents made up of text and image parts; labels in the first: Text 1 to Text 4, Image 1 to Image 3, i1 to i4; labels in the second: Text 1 and Text 2, Image 1 to Image 4, i1 to i3]
E.g. patient records
Conclusion
I. Similarity Search allows for a variety of applications: New means for browsing, data mining, classification
II. Is computationally intensive
   Algorithmic optimization can speed up IDM by factors 3.5-4.9
   Multithreading / distributed execution
III. Query requires example object
   QbS (Query by Sketch) may help