Similarity Search

Database & Information Systems Group, University of Basel, October 10th, 2007. Similarity Search. Michael Springmann, PhD Seminar, October 11th, 2007.


Page 1: Similarity Search

Database & Information Systems Group

University of Basel

October 10th, 2007

Similarity Search

Michael Springmann

PhD Seminar

October 11th, 2007

Page 2: Similarity Search

October 10th, 2007, Michael Springmann - Database & Information Systems Group

Projects

I. DELOS (EU FP6): Network of Excellence on Digital Libraries, http://www.delos.info/
    Task 1.6: Management of and Access to Virtual Electronic Health Records
    Task 1.8: DelosDLMS

II. DILIGENT (EU FP6): A Digital Library Infrastructure on Grid Enabled Technology
    Work Package 1.4: Index & Search – Feature Extraction
    ARTE Scenario

Page 3: Similarity Search

What is similarity search?

From a collection, return a ranked list of items for a given reference object.

Reference Object

Ranked results (scores shown in the figure): 1. 0.999, 2. 0.873, 3. 0.722, 4. 0.712, 5. 0.503, 6. 0.442, 7. 0.392

Page 4: Similarity Search

Steps to compute similarity

I. Define query (reference object)

II. Select feature to use for comparison, e.g. Color Histogram

III. Extract feature of reference object, e.g. (203, 236, 172, 210, 78)

IV. Compare feature with each element of collection, e.g. $\sum_{i=0}^{d} (a_i - b_i)^2$

V. Return (subset of) ranked list, e.g. 5-NN
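The five steps above can be sketched in a few lines of Python. This is a minimal illustration and not the system presented in the talk: the 4-bin gray-level histogram, the squared Euclidean distance, the toy collection and the helper names `histogram`, `distance` and `k_nearest` are all assumptions chosen for brevity.

```python
def histogram(pixels, bins=4, max_value=256):
    """Step III: extract a normalized gray-level histogram from a flat pixel list."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    total = len(pixels)
    return [c / total for c in counts]


def distance(a, b):
    """Step IV: squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(query_pixels, collection, k=5):
    """Steps I-V: rank the collection by distance to the query, return the k best."""
    q = histogram(query_pixels)
    ranked = sorted(
        (distance(q, histogram(pixels)), name) for name, pixels in collection.items()
    )
    return ranked[:k]


# Hypothetical toy collection: each "image" is a flat list of gray values.
collection = {
    "dark": [10, 20, 30, 40],
    "mixed": [10, 100, 150, 250],
    "bright": [200, 220, 240, 250],
}
result = k_nearest([15, 25, 35, 200], collection, k=2)
print(result)
```

The smallest distance ranks first here; a similarity score, as shown on the earlier slide, would be ordered the other way round.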

Page 5: Similarity Search

Similarity Search: Media Types

I. Image – Color, Texture, Shape

II. Text – TF/IDF, Edit Distance

III. Audio – Spectrum, Rhythm, Beat, Pitch

IV. Video Sequences – Visual, Subtitles / Audio Transcripts, (rich) Meta Data

Combinations of several types: Complex Documents

High dimensional feature vectors

Page 6: Similarity Search

Goals

Effectiveness

Theme: Find good/better results!

Measure: Quality, e.g. for benchmark collections: Precision, Recall, MAP

Question: How can we find better results w.r.t. the information need of the user?

Efficiency

Theme: Retrieve the results fast!

Measure: Execution time

Question: How can we achieve this with algorithmic optimizations?
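The effectiveness measures named above can be computed as follows. This is a minimal sketch using the standard definitions of precision, recall and (mean) average precision; the function names and the toy ranking are assumptions for illustration and do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a set of relevant items."""
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Sum of precision at each rank holding a relevant item, divided by |relevant|."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP: average of the per-query average precision values."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical query result: "a" and "b" are relevant hits, "c" was missed.
ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}
p, r = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
print(p, r, ap)
```

Precision and recall ignore the order of the result list, while average precision rewards rankings that place relevant items early.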

Page 7: Similarity Search

Similarity Search: What it is...

A way to order / rank things

May help to group objects

Limitations:

1. Feature matches categorization criterion

2. No sharp borders

Page 8: Similarity Search


Page 9: Similarity Search


Page 10: Similarity Search

ISIS (Interactive Similarity Search)

I. Originated at ETH Zurich, continued at UMIT and UNIBAS

II. VA-File can handle collections of size > 600,000 images while still achieving interactive answering times

III. Used image features: Color Moments, Texture Moments

IV. Global and 5 Fuzzy Regions

Page 11: Similarity Search


5 Fuzzy Regions

Page 12: Similarity Search

Similarity Search: What it is ... and what it ain’t?

A way to order / rank things

May help to group objects

Limitations:

1. Feature matches categorization criterion

2. No sharp borders

Feature extraction will not find out: One person sleeping ...

at least not without application specific adjustments / training

Page 13: Similarity Search

ImageCLEF (http://www.imageclef.org)

I. Ad-hoc photographic retrieval task: IAPR TC-12 Benchmark, 20,000 (tourist) images, multi-lingual descriptions. Main challenge: Short annotations.

II. Object Retrieval Task: PASCAL Visual Object, 2617 images, 4754 objects in realistic scenes. Main challenge: Pure visual, not pre-segmented.

III. Medical Image Retrieval: c@simage, PEIR, MIR, PathoPic, mypacs.net: > 70,000 images, heterogeneous case notes in XML.

IV. Medical Automatic Annotation Task: IRMA Database, 11,000 medical images, annotated with IRMA Code (116 classes). Main challenge: Pure visual, classification domain specific.

Example IRMA code: 1123-127-500-000

Page 14: Similarity Search

IRMA Code Classification Example

1123-127-500-000

4 independent axes:

Technical code (T) describes the image modality, e.g. 1 = x-ray, 11 = plain radiography, 112 = analog, 1123 = high beam energy

Directional code (D) models body orientation, here: anteroposterior (AP, coronal), supine

Anatomical code (A) refers to the body region examined, here: chest

Biological code (B) describes the biological system examined.

0 always means unspecified and is therefore always followed by other 0s or -.
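The structure of the example code can be made explicit with a small parser. This is a sketch under assumptions: the axis lengths (4-3-3-3) are read off the single example 1123-127-500-000, and the function name `parse_irma` and the axis labels are chosen here for illustration, not taken from the IRMA specification.

```python
AXES = ("technical", "directional", "anatomical", "biological")
LENGTHS = (4, 3, 3, 3)  # assumption: read off the example 1123-127-500-000


def parse_irma(code):
    """Split an IRMA code into its four axes and list each axis's hierarchy levels."""
    parts = code.split("-")
    if len(parts) != len(AXES):
        raise ValueError("expected four axes separated by '-'")
    parsed = {}
    for axis, part, length in zip(AXES, parts, LENGTHS):
        if len(part) != length:
            raise ValueError(f"{axis} axis must have {length} positions")
        # Once a position is 0 (unspecified), all following positions must be 0.
        specified = part.rstrip("0")
        if "0" in specified:
            raise ValueError(f"{axis} axis: 0 may only be followed by 0")
        # Hierarchy levels, e.g. 1123 -> 1, 11, 112, 1123
        parsed[axis] = [specified[:i] for i in range(1, len(specified) + 1)]
    return parsed


example = parse_irma("1123-127-500-000")
print(example["technical"])
print(example["anatomical"])
```

Listing the prefixes mirrors the hierarchy described for the technical axis, where each additional position refines the previous one.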

Page 15: Similarity Search

Image Distortion Model (IDM)

Uses reduced size images of at most 32 pixels width/height

Corresponding pixels
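The idea of matching each pixel to a corresponding pixel in a small neighborhood can be sketched as follows. This is a simplified illustration and not the implementation discussed in the talk: it works on raw gray values rather than gradients, assumes both images have the same size, clamps coordinates at the borders, and the name `idm_distance` and its parameters `warp` and `context` are assumptions.

```python
def idm_distance(a, b, warp=2, context=1):
    """Sketch of a distortion-tolerant distance between equally sized images.

    a, b: 2D lists of gray values. warp=2 searches a 5x5 window in b,
    context=1 compares 3x3 neighborhoods. Coordinates are clamped at borders.
    """
    height, width = len(a), len(a[0])

    def pixel(img, y, x):
        return img[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    def patch_cost(y, x, yy, xx):
        # Squared difference of the local context around (y, x) in a and (yy, xx) in b.
        return sum(
            (pixel(a, y + dy, x + dx) - pixel(b, yy + dy, xx + dx)) ** 2
            for dy in range(-context, context + 1)
            for dx in range(-context, context + 1)
        )

    total = 0
    for y in range(height):
        for x in range(width):
            # Best matching position for this pixel inside the warp window.
            total += min(
                patch_cost(y, x, yy, xx)
                for yy in range(max(0, y - warp), min(height, y + warp + 1))
                for xx in range(max(0, x - warp), min(width, x + warp + 1))
            )
    return total


# A vertical bar shifted by one pixel: pixel-wise comparison is costly,
# while the distortion-tolerant distance is much smaller.
a = [[0, 9, 0, 0] for _ in range(4)]
b = [[0, 0, 9, 0] for _ in range(4)]
print(idm_distance(a, b, warp=0, context=0), idm_distance(a, b, warp=1, context=0))
```

With `warp=0` and `context=0` the function reduces to a plain squared pixel difference, which makes the effect of the warp window easy to compare.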

Page 16: Similarity Search


Edge Detection (Sobel Filter)
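A Sobel filter estimates horizontal and vertical gradients with two 3×3 kernels. The sketch below uses the standard kernels on a 2D list of gray values; leaving the border pixels at 0 and returning the gradient magnitude are simplifying assumptions made here, and the function name `sobel` is illustrative.

```python
def sobel(image):
    """Return the gradient magnitude of a 2D gray-value image (borders set to 0)."""
    gx_kernel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    gy_kernel = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    height, width = len(image), len(image[0])
    result = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for ky in range(3):
                for kx in range(3):
                    value = image[y + ky - 1][x + kx - 1]
                    gx += gx_kernel[ky][kx] * value
                    gy += gy_kernel[ky][kx] * value
            result[y][x] = (gx * gx + gy * gy) ** 0.5
    return result


# A vertical edge between columns 1 and 2 produces a purely horizontal gradient.
image = [[0, 0, 10, 10] for _ in range(4)]
edges = sobel(image)
print(edges[1])
```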

Page 17: Similarity Search

Efficiency: Speeding up IDM

Algorithmic optimization

Idea: Only the k ≤ 5 nearest of the 10,000 reference images are used for subsequent kNN classification.

Early termination of the distance computation for images that will not be used

Base the decision on a threshold derived from the best k images seen so far

Figure: Pixels not evaluated due to exceeded threshold
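The early termination idea can be sketched as follows. To keep the example short it accumulates a simple L1 distance over flat pixel lists instead of IDM; the function names and the toy data are assumptions. The threshold is the distance of the k-th best reference seen so far, and a comparison is abandoned as soon as its partial sum exceeds that threshold.

```python
import heapq


def bounded_distance(a, b, threshold):
    """Accumulate an L1 distance pixel by pixel; stop once it exceeds threshold.

    Returns (distance, or None if abandoned; number of pixels evaluated).
    """
    total = 0
    for evaluated, (x, y) in enumerate(zip(a, b), start=1):
        total += abs(x - y)
        if total > threshold:
            return None, evaluated
    return total, len(a)


def knn_early_termination(query, references, k=5):
    """Return the k nearest references and the number of pixels evaluated."""
    best = []  # max-heap via negation: entries are (-distance, -index)
    evaluated_pixels = 0
    for index, ref in enumerate(references):
        threshold = -best[0][0] if len(best) == k else float("inf")
        dist, evaluated = bounded_distance(query, ref, threshold)
        evaluated_pixels += evaluated
        if dist is None:
            continue  # cannot be among the k nearest
        if len(best) < k:
            heapq.heappush(best, (-dist, -index))
        else:
            heapq.heappushpop(best, (-dist, -index))
    ranked = sorted((-d, -i) for d, i in best)
    return ranked, evaluated_pixels


query = [0, 0, 0, 0]
references = [[1, 1, 1, 1], [9, 9, 9, 9], [0, 0, 0, 1], [5, 5, 5, 5], [2, 0, 0, 0]]
ranked, evaluated = knn_early_termination(query, references, k=2)
print(ranked, evaluated)
```

The result is identical to an exhaustive comparison because an abandoned reference is already worse than k others; only the number of evaluated pixels changes.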

Page 18: Similarity Search

Early Termination Strategy - Experimental results

[Figure: evaluated pixels, absolute (axis 0 to 14,000,000,000) and as percentage (axis 0% to 100%), plotted over 0 to 10,000; series: pixels max, IDM, L1, IDM %, L1 %]

For IDM: Less than 30% of all pixels need to be evaluated

Page 19: Similarity Search

Speaking of numbers…

I. The original RWTH Aachen implementation of IDM requires, for X×32 images and IDM (gradients, 5×5 window, 3×3 context), about 190 seconds per sample (= comparison) on a standard Pentium 4 PC running at 2.6 GHz.

II. Using L2-Distance in a Sieve function, they reduced this to 16.8 seconds – but this causes a slight degradation of results.

III. Our Java implementation takes, for the same window & context area on a standard Pentium 4 PC at 2.4 GHz, only 16.0 seconds using the threshold (no degradation). L2-Distance can also benefit from the threshold – our Sieve function implementation takes only 2.0 seconds per sample.

IV. We cached all features in main memory (only 60 MB). Reading directly from disk takes less than 5 seconds in total. Since this is performed in parallel to the computation, the penalty for IDM is only about 0.3 seconds; the Sieve function becomes I/O-bound.
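A sieve of the kind mentioned in item II can be sketched as a two-stage search. The details below are assumptions, since the talk does not spell them out: the sieve keeps a fixed number of candidates under the cheap L2 distance and ranks only those with the expensive distance. `shift_tolerant` is a hypothetical stand-in for an expensive, deformation-tolerant distance such as IDM.

```python
def l2_squared(a, b):
    """Cheap distance used as the sieve."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def shift_tolerant(a, b):
    """Hypothetical stand-in for an expensive, deformation-tolerant distance."""
    n = len(b)
    return min(
        sum(abs(x - b[(i + s) % n]) for i, x in enumerate(a)) for s in (-1, 0, 1)
    )


def sieve_search(query, references, expensive_distance, candidates=3, k=1):
    """Keep the `candidates` nearest references under the cheap distance,
    then rank only those with the expensive distance."""
    order = sorted(range(len(references)), key=lambda i: l2_squared(query, references[i]))
    shortlist = order[:candidates]
    ranked = sorted((expensive_distance(query, references[i]), i) for i in shortlist)
    return ranked[:k]


query = [0, 9, 0, 0]
references = [[0, 0, 9, 0], [0, 8, 0, 0], [9, 9, 9, 9], [5, 5, 5, 5]]
print(sieve_search(query, references, shift_tolerant, candidates=3))
print(sieve_search(query, references, shift_tolerant, candidates=2))
```

With three candidates the shifted copy (index 0) is still found as the best match; with only two it is filtered out by the cheap distance, which illustrates how a sieve can trade result quality for speed.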

Page 20: Similarity Search

Multithreading - Implementation

I. Several Java Worker Threads, each computes similarity between one reference image and query.

II. Dispatcher keeps track of distance threshold for early termination.

III. IDM with early termination takes 4.3 seconds on Fujitsu-Siemens Celsius M450 Workstation (Intel Core 2 Duo E6600, 2.4 GHz) – and only 1.5 seconds on IBM xSeries 445, 8x Intel Xeon MP 2.8 GHz.

IV. Opens possibility for optimizing second goal…
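The worker/dispatcher structure described above can be sketched as follows. The talk's implementation is in Java; this Python version only illustrates the structure, with a plain L1 distance standing in for IDM. The class and function names are assumptions, and because of the global interpreter lock in CPython the sketch will not reproduce the reported speed-ups.

```python
import heapq
import threading
from concurrent.futures import ThreadPoolExecutor


class Dispatcher:
    """Collects results and exposes the current distance threshold to workers."""

    def __init__(self, k):
        self.k = k
        self._best = []  # max-heap via negation: entries are (-distance, -index)
        self._lock = threading.Lock()

    def threshold(self):
        with self._lock:
            return -self._best[0][0] if len(self._best) == self.k else float("inf")

    def report(self, dist, index):
        with self._lock:
            if len(self._best) < self.k:
                heapq.heappush(self._best, (-dist, -index))
            else:
                heapq.heappushpop(self._best, (-dist, -index))

    def ranked(self):
        with self._lock:
            return sorted((-d, -i) for d, i in self._best)


def worker(query, ref, index, dispatcher):
    """Compare one reference image with the query; stop once past the threshold."""
    total = 0
    for x, y in zip(query, ref):
        total += abs(x - y)
        if total > dispatcher.threshold():
            return  # early termination: cannot be among the k nearest
    dispatcher.report(total, index)


def parallel_knn(query, references, k=5, threads=4):
    dispatcher = Dispatcher(k)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for index, ref in enumerate(references):
            pool.submit(worker, query, ref, index, dispatcher)
    return dispatcher.ranked()


query = [0, 0, 0, 0]
references = [[1, 1, 1, 1], [9, 9, 9, 9], [0, 0, 0, 1], [5, 5, 5, 5], [2, 0, 0, 0]]
print(parallel_knn(query, references, k=2, threads=3))
```

The threshold can only decrease over time, so a worker that stops early has discarded a reference that cannot enter the final result, regardless of the order in which threads finish.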

Page 21: Similarity Search

Effectiveness: Adjusting IDM

Uses reduced size images of at most 32 pixels width/height

Corresponding pixels

Page 22: Similarity Search

Multithreading - Results

[Figure: Multithreading on 8-way Xeon MP 2.8 GHz Server. x-axis: Number of used threads (0 to 16); y-axis: Overall execution time for 1,000 queries (minutes), logarithmic scale from 1 to 10000; series: IDM, IDM + Sieve, IDM W=3,LC=3, W=3, LC=3 + Sieve, Euclidean, IDM no maxSum, W=3,LC=3 no maxSum]

Page 23: Similarity Search

ImageCLEF 2007 Results

RWTH_mi – KNN: IDM + CCF + TTF

BLOOM – SVM: SIFT + Pixels

RWTHi6 – SVM/ME: Image Patches

UFR – SVM: Color Moments + Texture (DWT) + Edge Orientation

UNIBAS_DBIS – KNN: IDM

OHSU – Neural Network: GIST, SVM: SIFT

BIOMOD – Decision Trees: Random Subwindow

Use Machine Learning

Experts on Domain – Provided Dataset, won 2005

Use Machine Learning

No Machine Learning… yet

Page 24: Similarity Search

What’s next?

Blobworld (http://elib.cs.berkeley.edu/blobworld/)

More expressive query definition!

Region of interest

Page 25: Similarity Search


Query by Sketch (SNF Project)

Page 26: Similarity Search

Compound Document Matching

[Figure: two compound documents composed of text blocks and images (first: Text 1 to Text 4, Image 1 to Image 3, items i1 to i4; second: Text 1 and Text 2, Image 1 to Image 4, items i1 to i3)]

E.g. patient records

Page 27: Similarity Search

Conclusion

I. Similarity Search allows for a variety of applications: New means for browsing, data mining, classification

II. Is computationally intensive
    Algorithmic optimization can speed up IDM by factors 3.5-4.9
    Multithreading / distributed execution

III. Query requires example object
    Query by Sketch (QbS) may help