Similarity Search on Bregman Divergence: Towards Non-Metric Indexing

Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung

Metric vs. Non-Metric

Euclidean distance dominates DB queries

Similarity in human perception

Metric distance is not enough!

23/4/18 Similarity Search on Bregman Divergence: Towards Non-Metric Indexing 2

Outline

Bregman Divergence

Solution

Basic solution

Better pruning bounds

Query distribution

Experiments

Conclusion

Bregman Divergence

[Figure: a convex function f(x) with the points (p, f(p)) and (q, f(q)) and the hyperplane h, contrasting the Euclidean distance between p and q with the Bregman divergence Df(p,q)]

Bregman Divergence

Mathematical Interpretation

The divergence between p and q is defined as the difference between f(p) and the first-order Taylor expansion of f at q, evaluated at p

[Figure: the original f(x) and the first-order Taylor expansion of f(x) at q]
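Written out, and assuming the standard definition for a strictly convex, differentiable f (the gradient notation is not on the slide and is introduced here), the interpretation above reads:

```latex
D_f(p, q) \;=\; f(p) \;-\; \Bigl[\, f(q) + \langle \nabla f(q),\; p - q \rangle \,\Bigr]
```

The bracketed term is the first-order Taylor expansion of f at q, evaluated at p. As a quick check in one dimension with f(x) = x^2: D_f(p, q) = p^2 - q^2 - 2q(p - q) = (p - q)^2.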

Bregman Divergence

General Properties

Uniqueness

A function f(x) uniquely determines Df(p,q)

Non-Negativity

Df(p,q)≥0 for any p, q

Identity

Df(p,p)=0 for any p

Symmetry and the triangle inequality do NOT hold in general
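A small numeric check can make the last property concrete. The sketch below uses the one-dimensional closed forms of two divergences from this deck (Itakura-Saito and squared Euclidean); the function names and the sample values are chosen here purely for illustration.

```python
import math

def itakura_saito(p: float, q: float) -> float:
    """1-D Itakura-Saito distance, generated by f(x) = -log x."""
    ratio = p / q
    return ratio - math.log(ratio) - 1.0

def squared_euclidean(p: float, q: float) -> float:
    """1-D squared Euclidean distance, generated by f(x) = x^2."""
    return (p - q) ** 2

# Symmetry fails: swapping the arguments changes the value.
forward = itakura_saito(1.0, 2.0)
backward = itakura_saito(2.0, 1.0)

# Triangle inequality fails: going through 1 is "shorter" than going directly.
direct = squared_euclidean(0.0, 2.0)                                # 4.0
detour = squared_euclidean(0.0, 1.0) + squared_euclidean(1.0, 2.0)  # 2.0

print(forward, backward)
print(direct, detour)
```

Non-negativity and identity still hold in both cases, which is what the pruning bounds later in the deck rely on.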

Examples

Distance | f(x) | Df(p,q) | Usage
KL-Divergence | x log x | p log(p/q) | distributions, color histograms
Itakura-Saito Distance | -log x | p/q - log(p/q) - 1 | signals, speech
Squared Euclidean | x^2 | (p-q)^2 | traditional queries
Von Neumann Entropy | tr(X log X - X) | tr(X log X - X log Y - X + Y) | symmetric matrices
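The vector-valued rows of the table can all be produced by one generic routine. The following is a minimal sketch, assuming the standard definition D_f(p,q) = f(p) - f(q) - <grad f(q), p - q> and decomposable generators summed over dimensions; all function names are hypothetical and the matrix case is left out.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]

def bregman(f: Callable[[Vector], float],
            grad_f: Callable[[Vector], List[float]],
            p: Vector, q: Vector) -> float:
    """D_f(p, q) = f(p) - f(q) - <grad f(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_f(q), p, q))
    return f(p) - f(q) - inner

# Generators from the table, applied per dimension and summed.
def f_squared(x): return sum(v * v for v in x)
def grad_squared(x): return [2.0 * v for v in x]

def f_entropy(x): return sum(v * math.log(v) for v in x)       # x log x
def grad_entropy(x): return [math.log(v) + 1.0 for v in x]

def f_burg(x): return -sum(math.log(v) for v in x)             # -log x
def grad_burg(x): return [-1.0 / v for v in x]

p = [0.2, 0.8]
q = [0.5, 0.5]
sq = bregman(f_squared, grad_squared, p, q)   # squared Euclidean
kl = bregman(f_entropy, grad_entropy, p, q)   # KL-divergence
isd = bregman(f_burg, grad_burg, p, q)        # Itakura-Saito distance
print(sq, kl, isd)
```

For x log x the per-dimension result is p log(p/q) - p + q; the extra terms cancel when p and q each sum to 1, which gives the p log(p/q) form in the table.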

Why in a DB system?

Database applications

Retrieval of similar images, speech signals, or time series

Optimization on matrices in machine learning

Efficiency is important!

Query Types

Nearest Neighbor Query

Range Query
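Both query types can be stated as a plain linear scan, which is the baseline any index has to beat. This is a sketch only: the function names, the choice of Itakura-Saito as the divergence, and the toy data are assumptions made for illustration.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Divergence = Callable[[Sequence[float], Sequence[float]], float]

def itakura_saito(p: Sequence[float], q: Sequence[float]) -> float:
    """Itakura-Saito distance summed over dimensions."""
    return sum(pi / qi - math.log(pi / qi) - 1.0 for pi, qi in zip(p, q))

def knn_query(data: List[Point], q: Point, k: int, d: Divergence) -> List[Point]:
    """Return the k data points p with the smallest d(p, q)."""
    return heapq.nsmallest(k, data, key=lambda p: d(p, q))

def range_query(data: List[Point], q: Point, theta: float, d: Divergence) -> List[Point]:
    """Return every data point p with d(p, q) <= theta."""
    return [p for p in data if d(p, q) <= theta]

data = [(1.0, 1.0), (2.0, 2.0), (4.0, 1.0)]
query = (1.0, 1.0)
nearest_two = knn_query(data, query, 2, itakura_saito)
within_one = range_query(data, query, 1.0, itakura_saito)
print(nearest_two, within_one)
```

Because the divergence is not symmetric, the argument order d(p, q) (data point first, query second) has to be kept fixed throughout.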

Euclidean Space

How to answer the queries

R-Tree

Euclidean Space

How to answer the queries

VA File

Our goal

Re-use the infrastructure of existing DB systems to support Bregman divergence

Storage management

Indexing structures

Query processing algorithms

Basic Solution

Extended Space

Convex function f(x) = x^2

Original points:

point | A1 | A2
p | 0 | 1
q | 0.5 | 0.5
r | 1 | 0.8
t | 1.5 | 0.3

Extended points, with A3 = A1^2 + A2^2 appended:

point | A1 | A2 | A3
p+ | 0 | 1 | 1
q+ | 0.5 | 0.5 | 0.5
r+ | 1 | 0.8 | 1.64
t+ | 1.5 | 0.3 | 2.34
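The extension step amounts to appending f(x) as one extra coordinate. A minimal sketch, assuming f is the sum of squared coordinates as in the example (the helper names are hypothetical):

```python
from typing import Callable, Dict, Sequence, Tuple

def sum_of_squares(x: Sequence[float]) -> float:
    """f(x) = x^2, taken as the sum of squared coordinates."""
    return sum(v * v for v in x)

def extend(point: Sequence[float],
           f: Callable[[Sequence[float]], float]) -> Tuple[float, ...]:
    """Map a d-dimensional point x to the (d+1)-dimensional point (x, f(x))."""
    return tuple(point) + (f(point),)

points: Dict[str, Tuple[float, float]] = {
    "p": (0.0, 1.0),
    "q": (0.5, 0.5),
    "r": (1.0, 0.8),
    "t": (1.5, 0.3),
}
extended = {name + "+": extend(pt, sum_of_squares) for name, pt in points.items()}
for name, pt in extended.items():
    print(name, pt)
```

For t = (1.5, 0.3) the sum of squares evaluates to 2.25 + 0.09 = 2.34. The extended points are ordinary (d+1)-dimensional vectors, so they can be handed to an unmodified R-Tree or VA File.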

Basic Solution

After the extension

Index extended points with R-Tree or VA File

Re-use existing algorithms with new lower and upper bound computations

How to improve?

Reformulation of Bregman divergence

Tighter bounds are derived

No change to index construction or query processing algorithms

A New Formulation

[Figure: points p and q with the hyperplanes h and h’ and the query vector vq, showing D*f(p,q) and Df(p,q)+Δ]

Mathematical Interpretation

Reformulation of similarity search queries

k-NN query: query q, data set P, divergence Df

Find the point p, minimizing

Range query: query q, threshold θ, data set P

Return any point p that
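One way to see why such a reformulation is possible is to split the standard definition of the divergence into a part that depends on the data point and a part that depends only on the query. The derivation below is a sketch under that standard definition; the symbols p^+, v_q and c_q are labels introduced here, and the formulas on the original slide may use different offsets or signs.

```latex
\begin{aligned}
D_f(p, q) &= f(p) - f(q) - \langle \nabla f(q),\; p - q \rangle \\
          &= \underbrace{f(p) - \langle \nabla f(q),\; p \rangle}_{\langle p^{+},\, v_q \rangle}
             \;+\; \underbrace{\langle \nabla f(q),\; q \rangle - f(q)}_{c_q},
\qquad p^{+} = (p,\, f(p)), \quad v_q = (-\nabla f(q),\, 1).
\end{aligned}
```

Since c_q is the same for every data point, a nearest neighbor query can minimize the inner product of p^+ and v_q over P, and a range query with threshold θ can test whether that inner product is at most θ - c_q. Both conditions are linear in the extended point.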

Naïve Bounds

Check the corners of the bounding rectangles
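Corner checking works because a linear function over an axis-aligned rectangle attains its extremes at corners. The sketch below assumes the divergence is written as an inner product between the extended point and a query vector plus a query-only offset, for f(x) = sum of squares; the helper names and the example rectangle (the bounding box of the four extended example points, with A3 computed as A1^2 + A2^2) are assumptions for illustration.

```python
from itertools import product
from typing import Sequence, Tuple

def query_vector(q: Sequence[float]) -> Tuple[float, ...]:
    """v_q = (-grad f(q), 1) for f(x) = sum of squares, where grad f(q) = 2q."""
    return tuple(-2.0 * qi for qi in q) + (1.0,)

def query_offset(q: Sequence[float]) -> float:
    """c_q = <grad f(q), q> - f(q), which is sum(q_i^2) for this f."""
    return sum(qi * qi for qi in q)

def corner_bounds(lo: Sequence[float], hi: Sequence[float],
                  v: Sequence[float]) -> Tuple[float, float]:
    """Min and max of <x, v> over the box [lo, hi], found by enumerating corners."""
    values = [sum(c * w for c, w in zip(corner, v))
              for corner in product(*zip(lo, hi))]
    return min(values), max(values)

# Bounding rectangle of the extended points in (A1, A2, A3).
lo = (0.0, 0.3, 0.5)
hi = (1.5, 1.0, 2.34)
q = (0.5, 0.5)

low, high = corner_bounds(lo, hi, query_vector(q))
lower_bound = low + query_offset(q)
upper_bound = high + query_offset(q)
print(lower_bound, upper_bound)
```

Enumerating all corners costs 2^(d+1) evaluations; picking the better endpoint per dimension by the sign of each coefficient gives the same result in linear time. The lower bound here comes out negative, which is vacuous for a non-negative divergence: the rectangle ignores the fact that real extended points lie on the surface A3 = f(A1, A2).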

Tighter Bounds

Take the curve f(x) into consideration
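To illustrate what taking the curve into account can buy, the sketch below bounds a decomposable divergence over a rectangle in the original space. It relies only on the fact that, for fixed q, the one-dimensional divergence is convex in its first argument with minimum 0 at x = q. This is an illustration of the idea, not necessarily the bound derived by the authors; the function names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

def tight_bounds(lo: Sequence[float], hi: Sequence[float], q: Sequence[float],
                 d: Callable[[float, float], float]) -> Tuple[float, float]:
    """Exact min and max of sum_i d(x_i, q_i) over the box lo <= x <= hi.

    Assumes d(., q_i) is convex with its minimum at q_i, as holds for any
    one-dimensional Bregman divergence in its first argument.
    """
    lower = 0.0
    upper = 0.0
    for a, b, qi in zip(lo, hi, q):
        nearest = min(max(qi, a), b)           # q_i clamped into [a, b]
        lower += d(nearest, qi)                # convex minimum on the interval
        upper += max(d(a, qi), d(b, qi))       # convex maximum is at an endpoint
    return lower, upper

def squared_euclidean(x: float, q: float) -> float:
    return (x - q) ** 2

# Bounding rectangle of the four example points in (A1, A2), query q = (0.5, 0.5).
bounds = tight_bounds((0.0, 0.3), (1.5, 1.0), (0.5, 0.5), squared_euclidean)
print(bounds)  # (0.0, 1.25)
```

For the same query, checking corners of the three-dimensional extended rectangle with A3 in [0.5, 2.34] gives the interval [-1.5, 2.54], so using the curve narrows it considerably.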

Query distribution

Distortion of rectangles

The difference between maximum and minimum distances from inside the rectangle to the query

Can we improve it more?

When Building R-Tree in Euclidean space

Minimize the volume/edge length of MBRs

Does it remain valid?

Query distribution

Distortion of bounding rectangles

Invariant in Euclidean space (triangle inequality)

Query-dependent for Bregman Divergence
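A one-dimensional example shows the contrast. The sketch below measures the distortion (maximum minus minimum distance) of the interval [1, 2] for two queries outside it, once with Euclidean distance and once with the Itakura-Saito distance; the interval, the query positions and the function names are chosen here only for illustration.

```python
import math
from typing import Callable

def interval_distortion(a: float, b: float, q: float,
                        d: Callable[[float, float], float]) -> float:
    """Max minus min of d(x, q) over x in [a, b].

    Assumes d(., q) is convex with its minimum at x = q.
    """
    nearest = min(max(q, a), b)
    return max(d(a, q), d(b, q)) - d(nearest, q)

def euclidean(x: float, q: float) -> float:
    return abs(x - q)

def itakura_saito(x: float, q: float) -> float:
    ratio = x / q
    return ratio - math.log(ratio) - 1.0

results = {
    q: (interval_distortion(1.0, 2.0, q, euclidean),
        interval_distortion(1.0, 2.0, q, itakura_saito))
    for q in (0.5, 4.0)
}
for q, (euc, isd) in results.items():
    print(q, euc, isd)
```

For both outside queries the Euclidean distortion equals the interval length, while the Itakura-Saito distortion is roughly three times larger for q = 0.5 than for q = 4. A rectangle that is "tight" for one query workload can therefore be loose for another.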

Utilize Query Distribution

Summarize the query distribution with O(d) real numbers

Estimate the expected distortion of any bounding rectangle in O(d) time

Allows better indexes to be constructed for both R-Tree and VA File

Experiments

Data Sets

KDD’99 data

Network data: the proportion of packets in 72 different TCP/IP connection types

DBLP data

Use the co-authorship graph to generate the probabilities of authors being related to 8 different areas

Experiments

Data Sets

Uniform synthetic data

Generate synthetic data with a uniform distribution

Clustered synthetic data

Generate synthetic data with a Gaussian mixture model

Experiments

Methods to compare

Index | Basic | Improved Bounds | Query Distribution
R-Tree | R | R-B | R-BQ
VA File | V | V-B | V-BQ
Linear Scan | LS | |
BB-Tree | BBT | |

Experiments

Index Construction Time

Experiments

Varying dimensionality

Experiments

Varying dimensionality (cont.)

Experiments

Varying k for nearest neighbor query

Conclusion

A general technique for similarity search on Bregman divergence

All techniques build on the existing infrastructure of commercial databases

Extensive experiments compare the performance of R-Tree and VA File under different optimizations

Acknowledgment

Zhenjie Zhang, Anthony K. H. Tung and Beng Chin Ooi were supported by Singapore NRF grant R-252-000-376-279.

Srinivasan Parthasarathy was supported by NSF IIS-0347662 (CAREER) and NSF CCF-0702587.

Q & A