Similarity Search

Database & Information Systems Group

University of Basel

October 10th, 2007

Michael Springmann

PhD Seminar

October 11th, 2007


Projects

I. DELOS (EU FP6)
Network of Excellence on Digital Libraries, http://www.delos.info/
Task 1.6: Management of and Access to Virtual Electronic Health Records
Task 1.8: DelosDLMS

II. DILIGENT (EU FP6)
A Digital Library Infrastructure on Grid Enabled Technology
Work Package 1.4: Index & Search – Feature Extraction
ARTE Scenario


What is similarity search?

From a collection, return a ranked list of items for a given reference object.

Reference Object

1. 0.999
2. 0.873
3. 0.722
4. 0.712
5. 0.503
6. 0.442
7. 0.392


Steps to compute similarity

I. Define query (reference object)

II. Select feature to use for comparison, e.g. Color Histogram

III. Extract feature of reference object, e.g. (203, 236, 172, 210, 78)

IV. Compare feature with each element of collection, e.g. $\sum_{i=0}^{d} (a_i - b_i)^2$

V. Return (subset of) ranked list, e.g. 5-NN
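The five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 2-bins-per-channel histogram layout and the toy "images" (plain lists of RGB tuples) are hypothetical and not taken from the talk; the distance is the sum of squared differences between feature vectors.

```python
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 2) -> List[float]:
    """Steps II/III: normalized RGB histogram (bin layout is an illustrative choice)."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1.0
    total = float(len(pixels)) or 1.0
    return [h / total for h in hist]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Step IV: sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(query: Sequence[Pixel], collection: Dict[str, Sequence[Pixel]], k: int = 5):
    """Steps I-V: rank every item by distance to the query and keep the k best."""
    query_feature = color_histogram(query)
    scored = [(squared_distance(query_feature, color_histogram(p)), name)
              for name, p in collection.items()]
    scored.sort()
    return [(name, dist) for dist, name in scored[:k]]


# Toy collection: three tiny "images".
red = [(250, 10, 10)] * 4
dark_red = [(200, 20, 20)] * 3 + [(20, 20, 20)]
blue = [(10, 10, 250)] * 4
collection = {"dark_red": dark_red, "blue": blue, "red": red}
ranking = k_nearest(red, collection, k=2)
print(ranking)
```

Here smaller distances mean more similar items, so the ranked list is sorted in ascending order of distance.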


Similarity Search: Media Types

I. Image – Color, Texture, Shape
II. Text – TF/IDF, Edit Distance
III. Audio – Spectrum, Rhythm, Beat, Pitch
IV. Video Sequences – Visual, Subtitles / Audio Transcripts, (rich) Meta Data

Combinations of several types: Complex Documents

High dimensional feature vectors


Goals

Effectiveness
Theme: Find good/better results!
Measure: Quality, e.g. for benchmark collections Precision, Recall, MAP
Question: How can we find better results w.r.t. the information need of the user?

Efficiency
Theme: Retrieve the results fast!
Measure: Execution time
Question: How can we achieve this with algorithmic optimizations?
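The quality measures named above can be made concrete with a small sketch. The function names and the toy rankings are hypothetical; the definitions follow the usual textbook ones (precision and recall at a cutoff, average precision per query, MAP as the mean over queries).

```python
from typing import List, Sequence, Set, Tuple


def precision_recall(ranked: Sequence[str], relevant: Set[str], cutoff: int) -> Tuple[float, float]:
    """Precision and recall over the first `cutoff` results."""
    retrieved = ranked[:cutoff]
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: average precision averaged over all queries."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries with their ranked results and relevant sets.
run_1 = (["a", "b", "c", "d"], {"a", "c"})
run_2 = (["x", "y"], {"y"})
p_at_2, r_at_2 = precision_recall(run_1[0], run_1[1], cutoff=2)
map_score = mean_average_precision([run_1, run_2])
print(p_at_2, r_at_2, map_score)
```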


Similarity Search: What it is...

A way to order / rank things

May help to group objects

Limitations:
1. Feature matches categorization criterion
2. No sharp borders


ISIS (Interactive Similarity Search)

I. Originated at ETH Zurich, continued at UMIT and UNIBAS

II. VA-File can handle collections of size > 600,000 images while still achieving interactive answering times

III. Used image features: Color Moments, Texture Moments

IV. Global and 5 Fuzzy Regions


5 Fuzzy Regions


Similarity Search: What it is ... and what it ain’t?

A way to order / rank things

May help to group objects

Limitations:
1. Feature matches categorization criterion
2. No sharp borders

Feature extraction will not find out: One person sleeping ... at least not without application specific adjustments / training


ImageCLEF (http://www.imageclef.org)

I. Ad-hoc photographic retrieval task: IAPR TC-12 Benchmark, 20,000 (tourist) images, multi-lingual descriptions. Main challenge: Short annotations.

II. Object Retrieval Task: PASCAL Visual Object, 2617 images, 4754 objects in realistic scenes. Main challenge: Pure visual, not pre-segmented.

III. Medical Image Retrieval: c@simage, PEIR, MIR, PathoPic, mypacs.net: > 70,000 images, heterogeneous case notes in XML

IV. Medical Automatic Annotation Task: IRMA Database, 11,000 medical images, annotated with IRMA Code (116 classes). Main challenge: Pure visual, classification domain specific.


IRMA Code Classification Example

1123-127-500-000

4 independent axes:

Technical code (T) describes the image modality, e.g. 1 = x-ray, 11 = plain radiography, 112 = analog, 1123 = high beam energy

Directional code (D) models body orientation, here: anteroposterior (AP, coronal), supine

Anatomical code (A) refers to the body region examined, here: chest

Biological code (B) describes the biological system examined. 0 always means unspecific and is therefore always followed by other 0s or -.
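A short sketch shows how a code such as 1123-127-500-000 splits into its four axes and how each axis can be read from coarse to fine. The helper names are hypothetical, and the sketch assumes that the axes appear in the order T-D-A-B and that the digit 0 marks "unspecific" from that position onward.

```python
from typing import Dict, List

AXES = ("T", "D", "A", "B")  # technical, directional, anatomical, biological


def parse_irma(code: str) -> Dict[str, str]:
    """Split an IRMA code into its four axes (assumed order: T-D-A-B)."""
    parts = code.split("-")
    if len(parts) != len(AXES):
        raise ValueError(f"expected {len(AXES)} axes, got {len(parts)}: {code!r}")
    return dict(zip(AXES, parts))


def specified_levels(axis_code: str) -> List[str]:
    """Hierarchy prefixes from coarse to fine, stopping at the first 0 (unspecific)."""
    levels = []
    for i, ch in enumerate(axis_code, start=1):
        if ch == "0":
            break
        levels.append(axis_code[:i])
    return levels


axes = parse_irma("1123-127-500-000")
for name in AXES:
    print(name, axes[name], specified_levels(axes[name]))
```

For the technical axis this yields the chain 1, 11, 112, 1123, which mirrors the x-ray, plain radiography, analog, high beam energy refinement on the slide.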


Image Distortion Model (IDM)

Uses reduced size images of at most 32 pixels width/height

Corresponding pixels


Edge Detection (Sobel Filter)
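The slide names the Sobel filter without further detail. As a minimal sketch, the standard 3×3 Sobel kernels can be applied to a grayscale image stored as a list of rows; the function name and the choice to leave border pixels at zero are assumptions made for illustration.

```python
import math
from typing import List

Image = List[List[float]]

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image: Image) -> Image:
    """Gradient magnitude per pixel; border pixels are left at 0 (a simplification)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    v = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * v
                    gy += SOBEL_Y[dy][dx] * v
            out[y][x] = math.hypot(gx, gy)
    return out


# A 4x4 image with a vertical edge between columns 1 and 2.
edge_image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = sobel_magnitude(edge_image)
flat = sobel_magnitude([[5.0] * 4 for _ in range(4)])
print(edges[1])
```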


Efficiency: Speeding up IDM

Algorithmic optimization

Idea: Only k ≤ 5 of the 10,000 reference images are used for subsequent kNN classification.
Early termination of distance computation of unused images
Base decision on threshold derived from best k images seen so far

Pixels not evaluated due to exceeded threshold
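The early termination idea can be sketched as follows. This is a simplified stand-in, not the implementation from the talk: it assumes both images have the same size, compares raw intensities instead of gradients, clamps coordinates at the border, and uses hypothetical function names. The defaults `warp=2` and `context=1` correspond to a 5×5 window and a 3×3 context.

```python
import heapq
import math
from typing import List, Optional, Sequence, Tuple

Image = Sequence[Sequence[float]]


def _pixel(img: Image, x: int, y: int) -> float:
    """Read a pixel, clamping coordinates to the image border."""
    h, w = len(img), len(img[0])
    return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]


def idm_distance(query: Image, ref: Image, warp: int = 2, context: int = 1,
                 threshold: float = math.inf) -> Optional[float]:
    """Sum over query pixels of the best local match in ref; None if threshold is exceeded."""
    total = 0.0
    for y in range(len(query)):
        for x in range(len(query[0])):
            best = math.inf
            for dy in range(-warp, warp + 1):
                for dx in range(-warp, warp + 1):
                    cost = 0.0
                    for cy in range(-context, context + 1):
                        for cx in range(-context, context + 1):
                            diff = (_pixel(query, x + cx, y + cy)
                                    - _pixel(ref, x + dx + cx, y + dy + cy))
                            cost += diff * diff
                    best = min(best, cost)
            total += best
            if total > threshold:
                return None  # early termination: cannot be among the k best
    return total


def k_nearest(query: Image, references: Sequence[Tuple[str, Image]], k: int = 5):
    """Keep the k best references; the worst of them is the termination threshold."""
    heap: List[Tuple[float, str]] = []  # max-heap via negated distances
    for label, ref in references:
        threshold = -heap[0][0] if len(heap) == k else math.inf
        d = idm_distance(query, ref, threshold=threshold)
        if d is None:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (-d, label))
        else:
            heapq.heapreplace(heap, (-d, label))
    return sorted((-neg, label) for neg, label in heap)


q = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
shifted = [[9, 0, 0], [0, 0, 0], [0, 0, 0]]
different = [[9, 9, 9], [9, 9, 9], [9, 9, 9]]
refs = [("same", q), ("shifted", shifted), ("different", different)]
best_two = k_nearest(q, refs, k=2)
print(best_two)
```

Because every per-pixel cost is non-negative, the running sum can only grow, so aborting once it exceeds the threshold never discards an image that belongs among the k best.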


Early Termination Strategy - Experimental results

[Chart: number of evaluated pixels, absolute (left axis, 0 to 14,000,000,000) and as a percentage (right axis, 0% to 100%), over a horizontal axis from 0 to 10,000. Series: max, IDM, L1, IDM %, L1 %]

For IDM: Less than 30% of all pixels need to get evaluated


Speaking of numbers…

I. The original RWTH Aachen implementation of IDM requires, for X×32 and IDM (gradients, 5×5 window, 3×3 context), about 190 seconds per sample (= comparison) on a standard Pentium 4 PC running at 2.6 GHz.

II. Using L2-Distance in a Sieve function, they reduced this to 16.8 seconds, but this causes a slight degradation of results.

III. Our Java implementation takes only 16.0 seconds for the same window & context area on a standard Pentium 4 PC at 2.4 GHz using the threshold (no degradation). L2-Distance can also benefit from the threshold: our Sieve function implementation takes only 2.0 seconds per sample.

IV. We cached all features in main memory (only 60 MB). Reading directly from disk takes less than 5 seconds in total. Since this is performed in parallel to the computation, the penalty for IDM is only about 0.3 seconds; the Sieve function becomes I/O-bound.


Multithreading - Implementation

I. Several Java Worker Threads, each computes similarity between one reference image and query.

II. Dispatcher keeps track of distance threshold for early termination.

III. IDM with early termination takes 4.3 seconds on Fujitsu-Siemens Celsius M450 Workstation (Intel Core 2 Duo E6600, 2.4 GHz) – and only 1.5 seconds on IBM xSeries 445, 8x Intel Xeon MP 2.8 GHz.

IV. Opens possibility for optimizing second goal…
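The worker/dispatcher split can be sketched as below. The talk describes a Java implementation; this Python version only illustrates the coordination pattern, and the class and function names are hypothetical. To stay short it uses a plain squared distance on feature vectors with early termination in place of IDM.

```python
import heapq
import math
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Sequence, Tuple


class Dispatcher:
    """Tracks the k best distances seen so far and exposes the current threshold."""

    def __init__(self, k: int) -> None:
        self._k = k
        self._heap: List[Tuple[float, str]] = []  # max-heap via negated distances
        self._lock = threading.Lock()

    def threshold(self) -> float:
        with self._lock:
            return -self._heap[0][0] if len(self._heap) == self._k else math.inf

    def report(self, distance: float, label: str) -> None:
        with self._lock:
            if len(self._heap) < self._k:
                heapq.heappush(self._heap, (-distance, label))
            elif distance < -self._heap[0][0]:
                heapq.heapreplace(self._heap, (-distance, label))

    def results(self) -> List[Tuple[float, str]]:
        with self._lock:
            return sorted((-neg, label) for neg, label in self._heap)


def bounded_distance(a: Sequence[float], b: Sequence[float], threshold: float) -> Optional[float]:
    """Squared distance that gives up (returns None) once it exceeds the threshold."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > threshold:
            return None
    return total


def search(query: Sequence[float], references: Sequence[Tuple[str, Sequence[float]]],
           k: int = 5, workers: int = 4) -> List[Tuple[float, str]]:
    dispatcher = Dispatcher(k)

    def work(item: Tuple[str, Sequence[float]]) -> None:
        label, ref = item
        d = bounded_distance(query, ref, dispatcher.threshold())
        if d is not None:
            dispatcher.report(d, label)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, references))  # wait for all workers to finish
    return dispatcher.results()


references = [(f"r{i}", [float(i), float(i)]) for i in range(20)]
top = search([3.2, 3.2], references, k=3)
print(top)
```

A worker may read a threshold that is already outdated, but since the threshold only ever decreases, a stale value merely prunes less and never changes the result. In standard CPython, threads do not execute CPU-bound code in parallel, so this sketch shows the structure rather than the speed-up.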


Effectiveness: Adjusting IDM

Uses reduced size images of at most 32 pixels width/height

Corresponding pixels


Multithreading - Results

Multithreading on 8-way Xeon MP 2.8 GHz Server

[Chart: overall execution time for 1,000 queries in minutes (logarithmic axis, 1 to 10,000) over the number of used threads (0 to 16). Series: IDM; IDM + Sieve; IDM W=3, LC=3; W=3, LC=3 + Sieve; Euclidean; IDM no maxSum; W=3, LC=3 no maxSum]


ImageCLEF 2007 Results

RWTH_mi – KNN: IDM + CCF + TTF
BLOOM – SVM: SIFT + Pixels
RWTHi6 – SVM/ME: Image Patches
UFR – SVM: Color Moments + Texture (DWT) + Edge Orientation
UNIBAS_DBIS – KNN: IDM
OHSU – Neural Network: GIST, SVM: SIFT
BIOMOD – Decision Trees: Random Subwindow

Annotations on the slide:
Use Machine Learning
Experts on Domain – Provided Dataset, won 2005
Use Machine Learning
No Machine Learning… yet


What’s next?

Blobworld (http://elib.cs.berkeley.edu/blobworld/)

More expressive query definition!

Region of interest


Query by Sketch (SNF Project)


Compound Document Matching

[Figure: two compound documents composed of text blocks and images. Labels in the first: Text 1 to Text 4, Image 1 to Image 3, i1 to i4. Labels in the second: Text 1, Text 2, Image 1 to Image 4, i1 to i3]

E.g. patient records


Conclusion

I. Similarity Search allows for a variety of applications: New means for browsing, data mining, classification

II. Is computationally intensive
Algorithmic optimization can speed up IDM by factors 3.5-4.9
Multithreading / distributed execution

III. Query requires example object
QbS (Query by Sketch) may help
