Similarity Ranking


Transcript of Similarity Ranking

Page 1: Similarity Ranking

Similarity-based Ranking and Query Processing

in Multimedia Databases

K. Selcuk Candan Wen-Syan Li

Computer Science and Engineering Dept. C&C Research Laboratories

Arizona State University, Box 875406 NEC USA, Inc., MS/SJ10

Tempe, AZ 85287-5406, USA. San Jose, CA 95134,USA.

[email protected] [email protected]

M. Lakshmi Priya

Computer Science and Engineering Dept.

Arizona State University, Box 875406

Tempe, AZ 85287-5406, USA.

[email protected]

Abstract

Since media-based evaluation yields similarity values, the result of a multimedia database query, Q(Y_1, ..., Y_n), is defined as an ordered list S_Q of n-tuples of the form ⟨X_1, ..., X_n⟩. The query Q itself is composed of a set

of fuzzy and crisp predicates, constants, variables, and conjunction, disjunction, and negation operators. Since

many multimedia applications require partial matches, S_Q includes results which do not satisfy all predicates.

Due to the ranking and partial match requirements, traditional query processing techniques do not apply to

multimedia databases. In this paper, we first focus on the problem of “given a multimedia query which consists

of multiple fuzzy and crisp predicates, providing the user with a meaningful final ranking.” More specifically,

we study the problem of merging similarity values in queries with multiple fuzzy predicates. We describe the

essential multimedia retrieval semantics, compare these with the known approaches, and propose a semantics which captures the requirements of the multimedia retrieval problem. We then build on these results in answering

the related problem of “given a multimedia query which consists of multiple fuzzy and crisp predicates, finding

an efficient way to process the query.” We develop an algorithm to efficiently process queries with unordered

fuzzy predicates (sub-queries). Although this algorithm can work with different fuzzy semantics, it benefits

from the statistical properties of the semantics proposed in this paper. We also present experimental results

for evaluating the proposed algorithm in terms of quality of results and search space reduction.


Page 2: Similarity Ranking

Figure 1: Fuzzy media modeling example (image omitted in transcript; a video frame, frame20, whose objects carry candidate semantics with confidence values: man 0.87, woman 0.43, computer 0.84, television 0.75)

1 Introduction

Multimedia data includes image and video data which are very complex in terms of their visual and semantic

contents. Depending on the application, multimedia objects are modeled and indexed using their (1) visual properties (or a set of relevant visual features), (2) semantic properties, and/or (3) the spatial/temporal relationships

of subobjects.

Example 1.1 For instance, Figure 1 gives an example where multimedia data is modeled using both visual and

semantic features.


Figure 1 shows a video frame whose structure can be viewed as a hierarchy with two image components

identified based on color/shape region division (visual interpretation) and their real world meanings (semantic

interpretation). The spatial relationships of these components can be described using various techniques [1, 2].

These components, or objects, have both visual and semantic properties that can be used in retrieval. In this

example, objects have multiple candidate semantics, each with an associated confidence value (smaller than 1.0 due to the limitations of the recognition engine). □

Therefore, retrieval in multimedia databases is inherently fuzzy, due to:

• similarity of media features, such as correlation between color (red vs. orange) or shape (circle vs. ellipse) features,

• imperfections in the feature extraction algorithms, such as the high error rate in motion estimation due to the multitude of factors involved, including camera and object speed, and camera effects,

• imperfections in the query formulation methods, such as the Query by Example (QBE) method, where the user provides an example but is not aware of which features will be used for retrieval,

• partial match requirements, where objects in the database fail to satisfy all requirements in the query, and


Page 3: Similarity Ranking

• imperfections in the available index structures, such as low precision or recall rates due to the imperfections

in clustering algorithms.

In many multimedia applications, more than one of these reasons coexist and, consequently, the system must

take each of them into consideration. This requires quantifying the different sources of fuzziness and merging them into a single combined value for the user's reference. The following example describes this requirement in greater

detail.

Example 1.2 A query for retrieving images containing Fuji Mountain and a lake can be specified with an SQL3-like query statement [3, 4, 5] as follows:

select image P, object object1, object object2
where P contains object1
and P contains object2
and object1.semantical_property s_like "mountain"
and object1.image_property image_match "Fuji_mountain.gif"
and object2.semantical_property is "lake"
and object2.image_property image_match "lake_image_sample.gif"
and object1.position is_above object2.position

The above query contains two crisp query predicates: contains and is¹. It also contains a set of fuzzy query

predicates:

• s_like (i.e., semantically similar), which evaluates the degree of semantic similarity between two terms. It helps resolve correlations between semantic features, as well as imperfections in the semantics extraction algorithms, in the index structures, and in the user queries;

• image_match (i.e., visually like), which evaluates the visual similarity between two images. It helps resolve correlations between visual features and imperfections in the index structures; and

• is_above (a spatial condition), which compares the spatial positions of two objects. It helps resolve correlations between spatial features, imperfections in the spatial information extraction algorithms, imperfections in the index structures, and imperfections in the user queries.

This query returns a set of 3-tuples of the form ⟨P, object1, object2⟩ that satisfy all crisp conditions and that have a combined fuzzy score above a given threshold. If users desire, the results may be sorted based on their overall scores for quicker access to relevant results.

Figure 2(Query) shows the conceptual representation of the above query. Figures 2(a), (b), (c), and (d) show examples of candidate images that may match this query. The numbers next to the objects in these candidate

images denote the similarity values for the object level matching. As explained earlier, in this example, the

comparisons on spatial relationships are also fuzzy to account for correlations between spatial features.

¹ The keyword where, used in the query, does not correspond to a predicate.


Page 4: Similarity Ranking

Figure 2: Partial matches (image omitted in transcript; the figure shows the conceptual query, a Fuji Mountain above a Lake, and four candidate images (a)-(d), with object-level similarity values such as 0.98, 0.8, 0.5, and 0.0 next to the objects)

The candidate image in Figure 2(a) satisfies the object matching conditions but its layout does not match the user specification. Figures 2(b) and (d) satisfy the image layout condition but their objects do not perfectly match the specification. Figure 2(c) has structural and object matching with low scores. Note that in Figure 2(a) the spatial predicate, and in Figure 2(c) the image similarity predicate for lake, completely fail (i.e., the match is 0.0). A multimedia database engine must consider all four images as candidates and must rank them according to a certain unified criterion. □
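The unified-ranking requirement can be sketched numerically (a hypothetical illustration: the per-predicate scores below are invented, and the geometric mean is used merely as one possible merging function, discussed later in the paper):

```python
# Ranking partially matching candidates under a single unified
# criterion: merge the per-predicate scores of each candidate and
# sort the candidates by the merged score, in decreasing order.

def geometric_mean(scores):
    prod = 1.0
    for s in scores:
        prod *= s
    return prod ** (1.0 / len(scores))

# Hypothetical per-predicate scores (object1 match, object2 match, spatial):
candidates = {"a": [0.98, 0.98, 0.0], "b": [0.5, 0.5, 1.0],
              "c": [0.8, 0.0, 0.8],   "d": [0.8, 0.5, 1.0]}

ranking = sorted(candidates,
                 key=lambda c: geometric_mean(candidates[c]),
                 reverse=True)
print(ranking)  # (a) and (c) collapse to 0.0: the partial match problem
```

Note how the candidates with a single completely failing predicate sink to a combined score of 0.0; handling this is exactly the partial match issue the paper addresses.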

In this paper, we first address the problem of “given a query which consists of multiple fuzzy and crisp

predicates, how to provide a meaningful final ranking to the users.” We propose an alternative scoring approach

which captures the multimedia semantics well and which does not face the above problem while handling partial

matches. Although it is not based on weighting, the proposed approach can be used along with weighting strategies, if weighting is requested by the user.

We then focus on the problem of “given a query which consists of multiple fuzzy and crisp predicates and a

scoring function, how to efficiently process the query.” It is clear that current database engines are not designed

to answer the needs of these kinds of queries. Recently, there have been attempts to address the challenges associated with processing queries of the above kind. In [9], Adalı et al. propose an algebra for similarity-based queries.

In [10, 11], Fagin proposes a set of efficient query execution algorithms for databases with fuzzy (similarity

based) queries. The algorithms proposed by Fagin assume that (1) individual sources can progressively (in

decreasing order of score) output results, and (2) users are interested in the best k matches to the query. This

for instance would require both the s_like and image_match predicates, used in the above example, to return

ordered results. These assumptions, however, may be invalid due to limited binding and processing capabilities of the sources. For instance, the first assumption may be invalid due to the binding rules imposed by the predicates.

Example 1.3 (Motivating Example) Consider the following SQL-like query, which asks for 10 pairs of visually similar images such that the two images in a given pair contain at least one object, a "mountain" and a "tree", respectively:

select 10 P1, P2
where semantically_like(P1.semantical_property, "mountain")
and semantically_like(P2.semantical_property, "tree")
and image_match(P1.image_property, P2.image_property).

This query contains three fuzzy conditions: two semantically_like predicates and one image_match predicate. Let us assume that the image_match predicate is implemented as an external function which can be invoked only by providing two input images (i.e., both arguments have to be bound). The image_match predicate then returns a score denoting the visual similarity of its inputs. Hence, the predicate is fuzzy, but it cannot generate results in the order of score; i.e., it is a non-progressive fuzzy predicate. Let us also assume that the semantically_like predicate is implemented as an external function such that, when invoked with the second argument bound, it can return matching images (using an index structure) in order of decreasing score.

Consequently, in this example, we have two sources (semantically_like predicates) which can output images progressively through database access and one source (image_match) which cannot, and results from all these sources have to be merged to get the final set of results. Unfortunately, due to the existence of a non-progressive predicate, finding and returning the 10 best matching pairs of pictures would require a complete scan of the database, which is clearly undesirable. □

In this paper, we propose a query processing algorithm which, given a query Q, uses the available score distribution function estimates/statistics [12] of the individual predicates and the statistical properties of the corresponding score merging function (μ_Q) to compute approximate top-k results when some of the predicates in Q cannot return ordered results. More specifically, we provide a solution to the following problem: "Given a query Q, a positive number k, and an error threshold ε, return a set R of k results, such that each result r ∈ R is most probably (prob(r ∈ R_k) > 1 − ε) in the top-k results, R_k."

The paper is structured as follows. In Section 2, we provide an overview of the multimedia retrieval semantics. We show the similarities between multimedia queries and fuzzy logic statements. Then, in Section 3, we provide an overview of the popular fuzzy logic semantics and compare them with respect to the essential requirements of the multimedia retrieval problem. In Section 4, we propose an algorithm for generating approximate results for queries with non-progressive fuzzy predicates. In Section 5, we investigate the statistical properties of popular fuzzy logic semantics and describe a generic function that approximates the score distributions used in the algorithm when such distributions are not readily available. In Section 6, we experimentally evaluate the proposed algorithm. In Section 7, we compare our approach with existing work. Finally, we present our concluding remarks.


Page 6: Similarity Ranking

Figure 3: Clustering error which results in imperfections in the index structures: the squares denote the matching objects, circles denote the non-matching objects, and the dashed rectangle denotes the cluster used by the index for efficient storage and retrieval (image omitted in transcript)

2 Multimedia Retrieval Semantics

In this section, we first provide an overview of the multimedia retrieval semantics; i.e., we describe what we

mean by retrieval of multimedia data. We then review some of the approaches to deal with multimedia queries

that involve multiple fuzzy predicates.

2.1 Fuzziness in Multimedia Retrieval

It is possible to classify the fuzziness in multimedia queries into three categories: precision-related, recall-related, and partiality-related fuzziness. The precision-related class captures fuzziness due to similarity of features, imperfections in the feature extraction algorithms, imperfections in the query formulation methods, and the precision rate of the utilized index structures (Figure 3). The recall- and partiality-related classes are self-explanatory.

Note that, in information retrieval, the precision/recall values are mainly used in evaluating the effectiveness of a given retrieval operation or the effectiveness of a given index structure. Here, we are using these terms more as statistics which can be utilized to estimate the quality of query results. We have used this approach, in the SEMCOG image retrieval system [4, 3, 5], to provide pre- and post-query feedback to users. The two examples given below show why precision and recall are important in processing multimedia queries.

Example 2.1 (Handling precision related fuzziness) Let us assume that we are given a query of the form

    Q(X) ← s_like(man, X.semantic_property) ∧ image_match(X.image_property, "a.gif").

Let, for a given object I, the corresponding semantic_property be woman and the image_property be "im.gif". Let us assume that the semantic precision of s_like(man, woman) is 0.8 and the image matching precision of image_match("im.gif", "a.gif") is 0.6. This means that the index structure and semantic clusters used to implement the predicate s_like guarantee that 80% of the returned results are semantically similar to man. These semantic similarities can be evaluated using various algorithms [14, 15]. Similarly, the predicate image_match guarantees that 60% of the returned results are visually similar to "a.gif". Then, assuming that the two predicates are not correlated, Q(I) should be 0.8 × 0.6 = 0.48. By replacing X in s_like(man, X) with woman, we maintain 80% precision; then, by replacing the X in image_match(X, "a.gif") with I, we maintain 60% of the remaining precision. The final precision (or confidence) is 0.48. □
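The precision combination of Example 2.1 can be sketched in a few lines (a hypothetical illustration; the 0.8 and 0.6 precision values are those of the example):

```python
# Combined precision (confidence) of a conjunctive query, under the
# assumption that the predicates are uncorrelated: the per-predicate
# precisions simply multiply.

def combined_precision(*precisions):
    result = 1.0
    for p in precisions:
        result *= p
    return result

# Example 2.1: s_like(man, woman) has precision 0.8 and
# image_match("im.gif", "a.gif") has precision 0.6.
score = round(combined_precision(0.8, 0.6), 2)
print(score)  # 0.48, the final confidence Q(I)
```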

Example 2.2 (Handling recall related fuzziness) The recall rate is the ratio of the number of returned relevant results to the number of all relevant results. Let us assume that we have the query used in the earlier example. Let us also assume that s_like(man, X) returns 60% and image_match(X, "a.gif") returns 50% of all applicable results in the database. Let us further assume that both functions work perfectly when both variables are bound. Then (1) a left to right query execution plan would return 0.6 × 1.0 = 0.6, (2) a right to left query execution plan would return 1.0 × 0.5 = 0.5, and (3) a parallel execution of the predicates followed by a join would return 0.6 × 0.5 = 0.3 of all the applicable results. Consequently, given a query execution plan, and assuming that the predicates are independent, the recall rate can be found by multiplying the recall rates of the predicates. □
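The recall computation of Example 2.2 can be sketched similarly (a hypothetical illustration; the recall values are those of the example):

```python
# Recall of a query execution plan, assuming independent predicates:
# the plan's recall is the product of the recall rates of the predicate
# evaluations that go through database access; a predicate applied to
# already-bound arguments filters perfectly (recall 1.0).

def plan_recall(recalls):
    result = 1.0
    for r in recalls:
        result *= r
    return result

# Example 2.2: s_like has recall 0.6, image_match has recall 0.5.
left_to_right = plan_recall([0.6, 1.0])  # image_match sees bound tuples
right_to_left = plan_recall([1.0, 0.5])  # s_like sees bound tuples
parallel_join = plan_recall([0.6, 0.5])  # independent access, then join
print(left_to_right, right_to_left, parallel_join)
```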

2.2 Query Semantics in the Presence of Imperfections and Similarities

Traditional query languages are based on boolean logic, where a predicate is treated as a propositional function,

which returns one of the two values: true or false. However, due to the stated imperfections, predicates related

to visual or semantic features do not correspond to propositional functions, but to functions which return values

between 0.0 and 1.0. Consequently, consider a query of the form

    Q(Y_1, ..., Y_n) ← Φ(p_1(Y_1, ..., Y_n), ..., p_m(Y_1, ..., Y_n)),

where the p_i are fuzzy or crisp predicates, Φ is a logic formula, and the Y_j are free variables. The solution is defined as an ordered set S_Q of n-tuples of the form ⟨X_1, X_2, ..., X_n⟩, where (1) n is the number of variables in query Q, (2) each X_i corresponds to a variable Y_i in Q, and (3) each X_i satisfies the type constraints of the corresponding predicates in Q. The order of the set S_Q denotes the relevance ranking of the solutions.

The first, trivial, way to process multimedia queries is to transform similarity functions into propositional functions by choosing a cutoff point, r_true, and by mapping all values in [0.0, r_true) to false and all values in [r_true, 1.0] to true. Intuitively, such a cutoff point corresponds to the similarity degree below which objects are deemed dissimilar. The main advantage of this approach is that predicates can refute (conjunctive queries) or validate (disjunctive queries) solutions as soon as they are evaluated, so that optimizations can be employed. In [16], Chaudhuri and Gravano discuss cost-based query optimization techniques for such filter queries². If users only consider full matches, this method is preferred because it allows query optimization by early pruning of the search space. However, when partial matches are also acceptable, this approach fails to produce appropriate solutions. For instance, in Example 1.2, the candidate images shown in Figures 2(a) and (c) will not be considered.
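The cutoff-based filtering described above can be sketched as follows (a minimal illustration; the cutoff value and the candidates' per-predicate scores are invented for the sketch):

```python
# Cutoff-based processing: map each fuzzy score to true/false using a
# cutoff point r_true, then keep only candidates for which every
# conjunctive predicate passes; failing candidates are refuted early.

R_TRUE = 0.5  # scores in [0.0, 0.5) -> false, scores in [0.5, 1.0] -> true

def crisp(score, r_true=R_TRUE):
    return score >= r_true

def passes_conjunction(scores, r_true=R_TRUE):
    # A single failing predicate refutes the whole candidate, so
    # partial matches are discarded outright.
    return all(crisp(s, r_true) for s in scores)

# Hypothetical per-predicate scores for four candidate images:
candidates = {"a": [0.98, 0.98, 0.0], "b": [0.5, 0.5, 1.0],
              "c": [0.8, 0.0, 0.8],   "d": [0.8, 0.5, 1.0]}
survivors = [name for name, s in candidates.items() if passes_conjunction(s)]
print(survivors)  # ['b', 'd']: the partial matches (a) and (c) are lost
```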

² Their techniques assume that queries do not contain negation.


Page 8: Similarity Ranking

The second way to process multimedia queries is to leave the final decision not to the constituent predicates but to the n-tuple as a whole. This can be done by defining a scoring function μ_Q which maps a given object to a value between 0.0 and 1.0. In this method a candidate object o is returned as a solution if μ_Q(o) ≥ sol_true, where sol_true is the solution acceptance threshold. Since determining an appropriate threshold is not always possible, an alternative approach is to rank all candidate objects according to their scores and return the first k candidates, where k is a user defined parameter. Related problems are studied within the domains of fuzzy set theory [17], fuzzy relational databases [18], and probabilistic databases [19].

If users are also interested in partial matches, the second method is more suitable. The techniques proposed in [16] do not address queries with partial matches since, for a given object, if any filter condition fails, the object is omitted from the set of results. Furthermore, the proposed algorithms are tailored towards the min semantics which, as discussed later, despite its many proven advantages, is not suitable for multimedia applications. In this paper, we focus on the second approach. As discussed above, multimedia predicates, by their

nature, have associated scoring functions. Fuzzy sets and the corresponding fuzzy predicates also have similar

scoring or membership functions. Consequently, we will examine the use of fuzzy logic for multimedia retrieval.

The major challenge with the use of the second approach is that early filtering cannot always be applied. As

mentioned earlier, Fagin [11, 10] has proposed algorithms for efficient query processing when all fuzzy predicates

are ordered. In contrast, our aim is to develop an algorithm that uses statistics to efficiently process such queries

even when not all of the fuzzy predicates are ordered.

3 Application of Fuzzy Logic for Multimedia Databases

In this section, we provide an overview of fuzzy logic and then introduce properties of different fuzzy logic operators. A fuzzy set F with domain D can be defined using a membership function μ_F : D → [0, 1]. A crisp (or conventional) set C, on the other hand, has a membership function of the form μ_C : D → {0, 1}. When, for an element d ∈ D, μ_C(d) = 1, we say that d is in C (d ∈ C); otherwise, we say that d is not in C (d ∉ C). Note that a crisp set is a special case of a fuzzy set.

A fuzzy predicate is defined as a predicate which corresponds to a fuzzy set. Instead of returning true (1) or false (0) values, as propositional functions (or conventional predicates, which correspond to crisp sets) do, fuzzy predicates return the corresponding membership values. Binary logical operators (∧, ∨) take two truth values and return a new truth value. The unary logical operator, ¬, on the other hand, takes one truth value and returns another truth value. Similarly, binary fuzzy logical operators take two values between 0.0 and 1.0 and return a third value between 0.0 and 1.0. A unary fuzzy logical operator, on the other hand, takes one value between 0.0 and 1.0 and returns another value between 0.0 and 1.0.
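These definitions can be made concrete with a small sketch (the membership functions below are invented for illustration; the operators shown follow the min semantics discussed in the next subsection):

```python
# A fuzzy predicate corresponds to a fuzzy set: it returns a membership
# value in [0, 1]. A crisp predicate is the special case returning 0 or 1.

def tall(height_cm):
    """Fuzzy predicate: degree of membership in 'tall' (ramp 160-200 cm)."""
    return max(0.0, min(1.0, (height_cm - 160) / 40))

def adult(age):
    """Crisp predicate: membership function into {0, 1}."""
    return 1 if age >= 18 else 0

# Binary fuzzy operators map [0,1] x [0,1] -> [0,1];
# the unary negation maps [0,1] -> [0,1].
def f_and(a, b): return min(a, b)
def f_or(a, b):  return max(a, b)
def f_not(a):    return 1.0 - a

print(tall(180), adult(30), f_and(tall(180), adult(30)))  # 0.5 1 0.5
```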


Page 9: Similarity Ranking

Min semantics:
    μ_{P_i ∧ P_j}(x) = min{ μ_i(x), μ_j(x) }
    μ_{P_i ∨ P_j}(x) = max{ μ_i(x), μ_j(x) }
    μ_{¬P_i}(x) = 1 − μ_i(x)

Product semantics (with parameter γ ∈ [0, 1]):
    μ_{P_i ∧ P_j}(x) = ( μ_i(x) · μ_j(x) ) / max{ μ_i(x), μ_j(x), γ }
    μ_{P_i ∨ P_j}(x) = ( μ_i(x) + μ_j(x) − μ_i(x)·μ_j(x) − min{ μ_i(x), μ_j(x), 1−γ } ) / max{ 1−μ_i(x), 1−μ_j(x), γ }
    μ_{¬P_i}(x) = 1 − μ_i(x)

Table 1: Min and product semantics for fuzzy logical operators
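A small sketch of the two semantics in Table 1 (the γ parameter and formulas follow the table; the sample scores are invented):

```python
# Min and (parameterized) product semantics for binary fuzzy
# conjunction and disjunction, following Table 1.

def min_and(mi, mj):
    return min(mi, mj)

def min_or(mi, mj):
    return max(mi, mj)

def prod_and(mi, mj, gamma):
    return (mi * mj) / max(mi, mj, gamma)

def prod_or(mi, mj, gamma):
    return (mi + mj - mi * mj - min(mi, mj, 1 - gamma)) / \
           max(1 - mi, 1 - mj, gamma)

def f_not(mi):
    return 1 - mi

# gamma = 0 makes the product conjunction behave like min (idempotent);
# gamma = 1 gives the plain product (strictly increasing, Archimedean).
assert abs(prod_and(0.7, 0.7, 0.0) - min_and(0.7, 0.7)) < 1e-9
assert abs(prod_and(0.8, 0.5, 1.0) - 0.8 * 0.5) < 1e-9
print(min_and(0.8, 0.5), prod_and(0.8, 0.5, 1.0))
```

The two assertions preview the γ = 0 and γ = 1 special cases discussed below.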

T-norm binary function N (for ∧):
    Boundary conditions: N(0, 0) = 0; N(x, 1) = N(1, x) = x
    Commutativity: N(x, y) = N(y, x)
    Monotonicity: x ≤ x′ and y ≤ y′ imply N(x, y) ≤ N(x′, y′)
    Associativity: N(x, N(y, z)) = N(N(x, y), z)

T-conorm binary function C (for ∨):
    Boundary conditions: C(1, 1) = 1; C(x, 0) = C(0, x) = x
    Commutativity: C(x, y) = C(y, x)
    Monotonicity: x ≤ x′ and y ≤ y′ imply C(x, y) ≤ C(x′, y′)
    Associativity: C(x, C(y, z)) = C(C(x, y), z)

Table 2: Properties of triangular-norm and triangular-conorm functions

3.1 Relevant Fuzzy Logic Operator Semantics

There are a multitude of functions [20, 21, 22], each useful in a different application domain, proposed as semantics for fuzzy logic operators (∧, ∨, ¬). In this section, we introduce the popular scoring functions, discuss their properties, and show why these semantics may not be suitable for multimedia retrieval.

Two of the most popular scoring functions are the min and product semantics of fuzzy logical operators.

We can state these two semantics in the form of a table as follows: given a set P = {P_1, ..., P_m} of fuzzy sets and a set F = {μ_1(x), ..., μ_m(x)} of corresponding membership functions, Table 1 shows the min and product semantics.

These two semantics (along with some others) have the following properties: defined as such, the binary conjunction and disjunction operators are triangular-norms (t-norms) and triangular-conorms (t-conorms), respectively. Table 2

shows the properties of t-norm and t-conorm functions. Intuitively, t-norm functions reflect the properties of the

crisp conjunction operation and t-conorm functions reflect those of the crisp disjunction operation.

Although the property of capturing crisp semantics is desirable in many cases, for multimedia applications,

this is not always true. For instance, the partial match requirements invalidate the boundary conditions. In

addition, monotonicity is too weak a condition for multimedia applications. An increase in the score of a single query criterion should increase the combined score, whereas the monotonicity condition dictates such a combined increase only if the scores for all of the query criteria increase simultaneously. A stronger condition (N(x, y) increases even if only x or only y increases) is called the strictly increasing property³. Clearly, min(x, y) is not

strictly increasing. Another desirable property for fuzzy conjunction and disjunction operators is distributivity.

The min semantics is known [23, 10, 24] to be the only semantics for conjunction and disjunction that preserves

logical equivalence (in the absence of negation) and is monotone at the same time. This property of the min

³ This is not the same definition of strictness used in [10].


Page 10: Similarity Ranking

Arithmetic average semantics:
    μ_{P_{i_1} ∧ ... ∧ P_{i_n}}(x) = ( μ_{i_1}(x) + ... + μ_{i_n}(x) ) / n
    μ_{¬P_i}(x) = 1 − μ_i(x)
    μ_{P_{i_1} ∨ ... ∨ P_{i_n}}(x) = 1 − ( (1 − μ_{i_1}(x)) + ... + (1 − μ_{i_n}(x)) ) / n

Geometric average semantics:
    μ_{P_{i_1} ∧ ... ∧ P_{i_n}}(x) = ( μ_{i_1}(x) × ... × μ_{i_n}(x) )^(1/n)
    μ_{¬P_i}(x) = 1 − μ_i(x)
    μ_{P_{i_1} ∨ ... ∨ P_{i_n}}(x) = 1 − ( (1 − μ_{i_1}(x)) × ... × (1 − μ_{i_n}(x)) )^(1/n)

Table 3: N-ary arithmetic average and geometric average semantics
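The n-ary semantics of Table 3 can be sketched as follows (a hedged reconstruction of the table's formulas; the sample scores are invented):

```python
# N-ary arithmetic average and geometric average semantics (Table 3).

def arith_and(scores):
    return sum(scores) / len(scores)

def arith_or(scores):
    return 1 - sum(1 - s for s in scores) / len(scores)

def geo_and(scores):
    prod = 1.0
    for s in scores:
        prod *= s
    return prod ** (1.0 / len(scores))

def geo_or(scores):
    prod = 1.0
    for s in scores:
        prod *= 1 - s
    return 1 - prod ** (1.0 / len(scores))

scores = [0.9, 0.8, 0.5]
print(arith_and(scores), geo_and(scores))
# The geometric average penalizes low scores more heavily, and a single
# 0 score nullifies the whole conjunction: the partial match problem.
assert geo_and([0.9, 0.8, 0.0]) == 0.0
```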

semantics makes it the preferred fuzzy semantics in most cases. Furthermore, in addition to satisfying the properties of being a t-norm and a t-conorm, the min semantics also has the property of being idempotent.

Although it has nice features, the min semantics is not suitable for multimedia applications. As discussed

in Example 1.2, according to the min semantics, the score of the candidate images given in Figures 2(a) and (c)

would be 0:0 although they partially match the query. Furthermore, scores of the images in Figure 2(b) and (d)

would both be 0:5, although Figure 2(d) intuitively has a higher score.

The product semantics [21] satisfies idempotency only if γ = 0. When γ = 1, on the other hand, it has the property of being strictly increasing (when x or y is different from 1) and Archimedean (N(x, x) < x and C(x, x) > x). The Archimedean property is weaker than idempotency, yet it provides an upper bound on the combined score, allowing for optimizations.

3.2 N-ary Operator Semantics

In information retrieval research (which also shows the characteristics of multimedia applications), other fuzzy

semantics, including the arithmetic mean [26], have been suggested. The arithmetic mean semantics (Table 3) provides an n-ary scoring function (|{P_{i_1}, ..., P_{i_n}}| = n). Note that the binary version of the arithmetic mean does not satisfy

the requirements of being a t-norm: it does not satisfy boundary conditions and it is not associative. Hence, it

does not subsume crisp semantics. On the other hand, it is idempotent and strictly increasing.

The arithmetic average semantics emulates the behavior of the dot-product based similarity calculation popular in information retrieval: effectively, each predicate is treated like an independent dimension in an n-dimensional space (where n is the number of predicates), and the merged score is defined as the dot-product distance between the complete truth, ⟨1, 1, ..., 1⟩, and the given values of the predicates, ⟨μ_1(x), ..., μ_n(x)⟩. Although this

approach is shown to be suitable for many information retrieval applications it does not capture the semantics of

multimedia retrieval applications, introduced in Section 2, which are multiplicative in nature. Therefore, we can

use the n-ary geometric average semantics instead.

Note that, as was the case for the original product semantics, the geometric average semantics is also not distributive. Therefore, if Q(Y_1, ..., Y_n) is a query which consists of a set of fuzzy predicates, variables, constants, and conjunction, disjunction, and negation operators, and if Q_∨(Y_1, ..., Y_n) is the disjunctive normal representation of Q, then we define the normal fuzzy semantics of Q(Y_1, ..., Y_n) as the fuzzy semantics of Q_∨(Y_1, ..., Y_n). In general, μ_Q ≠ μ_{Q_∨}. The former semantics would be used when logical equivalence of queries is not expected. The latter, on the other hand, would be used when logical equivalence is required.


Page 11: Similarity Ranking

3.3 Accounting for Partial Matches

Both the min and the geometric mean functions have weaknesses in supporting partial matches. When one of the involved predicates returns zero, both of these functions return 0 as the combined score. However, in multimedia retrieval, partial matches are required (see Section 1). In such cases, having a few terms with 0 score value in a conjunction should not eliminate the whole conjunctive term from consideration.

One proposed [6] way to deal with the partial match requirement is to weight different query criteria in such a way that those criteria that are not important for the user are effectively omitted. For instance, in the query given in Figure 2(Query), if the user knows that spatial information is not important, then the user can choose to assign a lower weight to spatial constraints. Consequently, using a weighting technique, the image given in Figure 2(a) can be retained although the spatial condition is not satisfied. This approach, however, presupposes that users can identify and weight different query criteria. This assumption may not be applicable in many situations, including databases for naive users or retrieval by QBE (query by example). Furthermore, it is always possible that, for each feature or criterion in the query, there is a set of images in the database that fails it. In such a case, no weighting scheme will be able to handle the partial match requirement for all images.

Example 3.1 Let us assume that a user wants to find all images in the database that are similar to image I_example. Let us also assume that the database uses three features, color, shape, and edge distribution, to compare images, and that the database contains three images, I₁, I₂, and I₃. Finally, let us assume that the following table gives the matching degrees of the images in the database for each feature:

Image   Shape   Color   Edge
I₁      0.0     0.9     0.8
I₂      0.8     0.0     0.9
I₃      0.9     0.8     0.0

According to this table, it is clear that if the user does not specify a priority among the three features, the system should treat all three candidates equally. On the other hand, since for each of the three features there is a different image which fails it completely, even if we have a priori knowledge regarding the feature distribution of the data, we cannot use feature weighting to eliminate low-scoring features. □

To account for partial matches, we need to modify the semantics of the n-ary logical operators and eliminate the undesirable nullifying effect of 0 [3]. Note that a similar modification can also be done for the min semantics. Given a set P = {P₁, …, P_m} of fuzzy sets and a set F = {μ₁(x), …, μ_m(x)} of corresponding scoring functions, the semantics of the n-ary fuzzy conjunction operator is as follows:

μ_{(P_{i₁} ∧ … ∧ P_{iₙ})}(t, r_true, Δ) = [ ( ∏_{μ_k(t) ≥ r_true} μ_k(t) × ∏_{μ_k(t) < r_true} Δ )^{1/n} − Δ ] / (1 − Δ)

where

• n is the number of predicates in Q,

• r_true is the truth cutoff point, i.e., it is the minimum valid score,

• Δ is an offset value greater than 0.0 and less than r_true; it corresponds to the fuzzy value of false and it prevents the combined score from being 0 when one of the predicates has a 0 score, and

• μ_k(t) is the score of predicate P_k ∈ {P_{i₁}, …, P_{iₙ}} for n-tuple t.

The term ( ∏_{μ_k(t) ≥ r_true} μ_k(t) × ∏_{μ_k(t) < r_true} Δ )^{1/n} returns a value between Δ and 1.0. Subtracting Δ from it and then dividing by 1 − Δ normalizes the result to values between 0.0 and 1.0. An additional improvement could be to allow the conjunction to give extra importance to the predicates which are above the truth cutoff value, r_true. To achieve this, one can modify the scoring function as follows:

μ_{(P_{i₁} ∧ … ∧ P_{iₙ})}(t, r_true, λ₁, λ₂, Δ) = λ₁ × [ ( ∏_{μ_k(t) ≥ r_true} μ_k(t) × ∏_{μ_k(t) < r_true} Δ )^{1/n} − Δ ] / (1 − Δ) + λ₂ × ( Σ_{μ_k(t) ≥ r_true} 1 ) / n,

where λ₁ and λ₂ are values between 0.0 and 1.0 such that λ₁ + λ₂ = 1.0.
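As an illustration, the two cutoff-based scoring functions above can be sketched as follows (a minimal sketch; the function and parameter names are ours, with r_true, Δ, λ₁, and λ₂ as defined above):

```python
def conj_cutoff(scores, r_true=0.5, delta=0.1):
    """Geometric-average conjunction with truth cutoff r_true and offset
    delta (0 < delta < r_true): scores below the cutoff are replaced by
    delta, so a single 0 cannot nullify the conjunct; the n-th root is
    then shifted and scaled back onto [0, 1]."""
    n = len(scores)
    prod = 1.0
    for s in scores:
        prod *= s if s >= r_true else delta
    return (prod ** (1.0 / n) - delta) / (1.0 - delta)

def conj_cutoff_weighted(scores, r_true=0.5, delta=0.1, lam1=0.8, lam2=0.2):
    """Weighted variant (lam1 + lam2 = 1.0) that also rewards the
    fraction of predicates whose scores clear the truth cutoff."""
    frac_true = sum(1 for s in scores if s >= r_true) / len(scores)
    return lam1 * conj_cutoff(scores, r_true, delta) + lam2 * frac_true
```

For instance, with scores = [0.9, 0.8, 0.0], the plain product or geometric average would return 0, while conj_cutoff returns a positive score, preserving the partial match.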

The following is an example where various semantics for the logical operators are compared. Among other things, this example clearly shows that a semantics which captures the crisp semantics is not suitable for multimedia retrieval.

Example 3.2 In this example, we use the query which was presented in the introduction to compare the scores corresponding to different approaches. Figure 4 shows a set of candidate images and the associated scores computed by different methods. Numbers next to the objects in the candidate images denote the similarity values for the object-level matching. The figure shows the score of the candidate images as well as their relative ranks. The cutoff parameters used in this example are r_true = 0.4 and Δ = 0.4, and the structural weights are λ₁ = 0.8 and λ₂ = 0.2. □

4 Evaluation of Queries with Unordered Fuzzy Predicates

In the earlier sections, we have investigated characteristics of multimedia retrieval and semantic properties of

different fuzzy retrieval options. In this section, we focus on the query processing requirements for multimedia

retrieval and provide an efficient algorithm. Although the algorithm is independent of the chosen semantics of the

fuzzy logic operators described above, it uses their statistical properties to deal with unknown system parameters.

4.1 Essentials of Multimedia Query Processing

Recently, many researchers have studied query optimization (mostly through algebraic manipulations) in non-traditional forms of databases. [28, 29, 30, 31, 32] provide overviews of techniques used for query processing and retrieval


[Figure 4 image panels omitted: the query (Fuji, Mountain, Lake) and four candidate images, with object-level matching scores annotated next to the objects.]

Semantics                        Score  Rank   Score  Rank   Score  Rank   Score  Rank
min                              0.50   1-2    0.00   3-4    0.50   1-2    0.00   3-4
product                          0.40   1      0.00   3-4    0.25   2      0.00   3-4
arithmetic average               0.76   1      0.65   3      0.66   2      0.43   4
geometric average                0.74   1      0.00   3-4    0.63   2      0.00   3-4
geometric average with cutoff    0.56   1      0.55   2      0.38   3      0.24   4
geometric average with weights   0.65   1      0.57   2      0.51   3      0.32   4

Figure 4: Comparison of different scoring mechanisms (each Score/Rank pair corresponds to one of the four candidate images)

in such databases. Solutions in such non-traditional databases vary from the use of database statistics and domain knowledge to facilitate query rewriting and the intelligent use of cached information [33], to the use of domain knowledge to discover and prune redundant queries [34]. Li et al. also used off-line feedback mechanisms to prevent users from asking redundant or irrelevant queries [35]. In [16], Chaudhuri and Gravano discuss query optimization issues and ranking in multimedia databases. In [36], Chaudhuri and Shim discuss approaches for query processing in the presence of external predicates, or user-defined functions, which are very common in multimedia systems.

The above work and others in the literature collectively point to the following essential requirements for multimedia query processing:

• As discussed in earlier sections, fuzziness is inherent in multimedia retrieval for many reasons, including similarity of features, imperfections in the feature extraction algorithms, imperfections in the query formulation methods, partial match requirements, and imperfections in the available index structures.

• Users are usually not interested in a single result, but in k ≥ 1 ranked results, where k is provided by the user. This is mainly due to the inherent fuzziness: users want to have more alternatives from which to choose what they are interested in.

• We would prefer to generate the kth result, after we generate the (k−1)th result, as progressively as possible.

• Since the solution space is large, we cannot perform any processing which would require us to touch or enumerate all solutions.


In [10, 11], Fagin proposes a set of efficient query execution algorithms for databases with fuzzy queries.

These algorithms assume that

� the query has a monotone increasing combined scoring function,

� individual sources can progressively (in decreasing order of score) output results, and

� the user is interested in the best k matches to the query.

If all these conditions hold, then these algorithms can be used to progressively find the best k matches to the

given query.

Note that, if the min semantics for conjunction is used and if the query does not contain negation, then the scoring functions of queries are guaranteed to be monotone increasing. Similarly, for the arithmetic average, product, and geometric average semantics, if the query does not contain negation, the combined scoring function will be monotone. Consequently, the algorithms proposed by Fagin can be applied.
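To make the sorted-access/random-access pattern concrete, the following is a minimal, threshold-style top-k merge in the spirit of Fagin's algorithms for the min semantics. This is our own sketch; the data layout and names are assumptions, not the paper's:

```python
import heapq

def top_k_min(sorted_lists, score_tables, k):
    """sorted_lists[i]: predicate i's (object, score) pairs in decreasing
    score order (sorted access); score_tables[i]: object -> score for
    predicate i (random access). Combined score is min over predicates."""
    m = len(sorted_lists)
    best = {}                              # object -> combined score
    depth = 0
    while depth < max(len(l) for l in sorted_lists):
        frontier = []
        for i in range(m):
            if depth < len(sorted_lists[i]):
                obj, s = sorted_lists[i][depth]
                frontier.append(s)
                if obj not in best:        # probe the other predicates
                    best[obj] = min(t[obj] for t in score_tables)
            else:
                frontier.append(0.0)
        depth += 1
        topk = heapq.nlargest(k, best.values())
        # no unseen object can score above the min of the frontier scores
        if len(topk) >= k and topk[-1] >= min(frontier):
            break
    return sorted(best.items(), key=lambda kv: -kv[1])[:k]
```

The stopping test mirrors the monotonicity assumption: once the k-th best combined score reaches the combined score of the current sorted-access frontier, no unseen object can do better.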

4.2 Negation

If the query contains negation, on the other hand, then the scoring function may not be monotone increasing,

invalidating one of the assumptions.

This, however, can be taken care of if we can assume that some of the sources (the negated ones) can also output results in increasing order of score. Since a multimedia predicate is more likely to return lower scores, the execution cost of such a query is expected to be higher. Although the algorithm we introduce in this paper can take negated goals into account when such an index is available, the actual focus of the algorithm is dealing with unordered subgoals.

4.3 Unordered Subgoals

The second assumption can also be invalid for various reasons, including the binding rules imposed by the

predicates.

Example 4.1 For example, consider the following query, which is aimed at retrieving all pairs of images, each containing at least one object, a "mountain" and a "tree", respectively, that are visually similar to each other:

select image P1, P2
where P1.semantical_property s_like "mountain"
and P2.semantical_property s_like "tree"
and P1.image_property image_match P2.image_property

The above query contains three fuzzy conditions (two s_like predicates and one image_match predicate). Let us assume that the image_match predicate is implemented as an external function, which can be invoked only by providing two input images. The image_match predicate then returns a score denoting the visual similarity of its inputs. In this case, we have two sources (the s_like predicates) which can output images progressively through database access and one source (image_match) which cannot. □


[Figure 5 diagram omitted: panel (a) shows the query P(X) and Q(Y) and R(X,Y) with all three predicate streams ordered, so results such as ⟨x1,y1,0.78⟩ can be merged in order; panel (b) shows the same query with one stream unordered, so only two of the three streams can be merged in order.]

Figure 5: The effect of having non-progressive fuzzy predicates (numbers in the figure denote scores of the results): In (a) all predicates are progressive (results are ordered); hence, the results can be merged using the algorithms presented in [11] to find the k top-ranking results. In (b) only two of the three fuzzy predicates are progressive; hence, the merging can be done on two predicates only. However, in this case, the ranking generated by merging two predicates may not be equal to the ranking that would be generated by merging all three of the scores.

As shown in Figure 5, because of such non-progressive fuzzy predicates, finding and returning the k best

matching results may require a complete scan of the database. Consequently, in order to avoid the complete scan

of the database, our algorithm uses the score distribution estimates/statistics of the individual predicates and the

statistical properties of the score merging function to compute an approximate set of top-k results.

In [16], Chaudhuri and Gravano discuss query optimization issues and ranking in multimedia databases. In their framework, they differentiate between top search (access through an index) and probe (testing the predicate for a given object). Top search and probe can be viewed as analogous to the sorted access and random access described in [11].

In the next subsection, we provide an algorithm that has a similar search/probe structure. However, unlike the techniques proposed in [16], the algorithm we propose is aimed at dealing with queries with partial matches, is not tied to the min semantics, and can deal with negation as long as the sources (the negated ones) can also output results in increasing order of score. In order to deal with the non-progressive predicates, the algorithm we propose takes a probability threshold, Θ, as input and returns a set R of k results, such that each result r ∈ R is most probably (prob(r ∈ R_k) > 1 − Θ) in the top k results, R_k.

Note that the algorithm uses score distribution function estimates/statistics [33, 35] and the statistical properties of the score merging functions for computing approximate top-k results. Note also that the algorithm is flexible in the sense that both strict (such as product) and monotone (such as min) semantics of fuzzy logic operators are acceptable.


Algorithm 4.1 Query Evaluation Algorithm

Input:

• A query, Q,
• A set of ordered predicates, P_O, a set of non-ordered predicates, P_U, and a set of crisp predicates, P_C,
• A positive integer, k,
• A threshold, Θ.

1. Identify a scoring function, μ_Q. Find the sign (sign_i ∈ {+, −}) of the slope of μ_Q with respect to each predicate P_i ∈ P.

2. Visited = ∅. solNum = 0. Marked = ∅.

3. while solNum < k do

(a) Use the decreasingly (if sign_i = +) or increasingly (if sign_i = −) ordered outputs of predicates P_i ∈ P_O to construct a set, S, of at least 1 n-tuple satisfying all crisp predicates, P_C.

(b) Order the resulting tuples with respect to μ_Q, which encapsulates the predicates in P_O as well as those in P_U. Let μ_{i,j} denote the score of the n-tuple s_i ∈ S with respect to the predicate P_j ∈ P_U.

(c) Visited = Visited ∪ S.

(d) For every n-tuple s_i ∈ S, find P_∃(i) = prob(∃ j > i such that μ_Q(s_i) < μ_Q(s_j)) (assuming that the predicates in Q are independent).

(e) Put the n-tuples in S such that P_∃(i) ≤ Θ into Marked.

(f) Let the number of n-tuples put into Marked in the previous step be k′. solNum = solNum + k′.

4. Let R be the best k solutions in Visited; output R.

Figure 6: Query evaluation algorithm.

4.4 Query Evaluation Algorithm

Let Q(Y₁, …, Yₙ) be a query and let P = P_O ∪ P_U be the set of all fuzzy predicates in Q, such that predicates in P_O can output ordered results and those in P_U cannot. Let the query also contain a set, P_C, of crisp (non-fuzzy) predicates, including those which check the equality of variables. Let us also assume that the user is interested in a set R of k results, such that each result r ∈ R is most probably (prob(r ∈ R_k) > 1 − Θ) in the top k results, R_k. The proposed query execution algorithm is given in Figure 6.

The main inputs to this algorithm are

• a query, Q,
• a set of ordered predicates, P_O, a set of non-ordered predicates, P_U, and a set of crisp predicates, P_C,
• a positive integer, k,
• an error threshold, Θ,

and the output is a set, R, of k n-tuples. Below, we provide a detailed description of the algorithm.


The first step of the algorithm calculates the combined scoring function for the query. If this combined scoring function is not monotone (i.e., there are some negated subgoals), the algorithm identifies the predicates which are negated. For each negated predicate which has an inversely ordered index (if such an index is available), the algorithm will use its results in increasing order of score. For each negated predicate which does not have such an inverse index (which is more likely), the algorithm will treat it as a non-progressive predicate.

The second step of the algorithm initializes certain temporary data structures: solNum keeps track of the number of candidate solutions generated, Visited is the set of all tuples generated (candidate solutions or not), and Marked is the set of all candidate solutions.

In steps 3(a) through 3(d), the algorithm uses an algorithm similar to the one presented in [11, 10] to merge results using the ordered predicates (see Figure 5(b)). The algorithm stops when it finds k n-tuples which satisfy the condition in step 3(e). Note that an n-tuple, s_i, satisfies this condition if and only if

• the probability of having another n-tuple, s_j, with a better score is less than Θ, i.e., when P_∃(i) = prob(∃ j > i such that μ_Q(s_i) < μ_Q(s_j)) ≤ Θ.

Let us refer to this condition as F(i; Θ) = (P_∃(i) ≤ Θ). Intuitively, if F(i; Θ) is true, then the probability of having another n-tuple, s_j, with a better combined score than the score of s_i is less than Θ.

At the end of the third step, Marked contains k n-tuples which satisfy the condition. However, in the fourth step, the algorithm revisits all the n-tuples generated and placed into Visited earlier, to see whether there are any better solutions among those that were visited. The best k n-tuples generated during this process are returned as the output, R.

Intuitively, the algorithm uses a technique similar to the ones presented in [11, 10] to generate a sequence of results that are ranked with respect to the ordered predicates. As mentioned above, this order does not necessarily correspond to the final order of the results, because it does not take the unordered scores into account. Therefore, for each tuple generated in the first stage, using the database statistics and the statistical properties of μ_Q, the algorithm estimates the probability of there being a better result in the remainder of the database. If, for a given tuple, this probability is below a certain level, then this tuple is said to be a candidate for the top k results.

Note that, for the algorithm to work, we need to be able to calculate F(i; Θ) for the fuzzy semantics chosen for the query. For each different semantics, F(i; Θ) must be calculated in a different way. The following two examples show how to calculate F(i; Θ) for the product and min semantics we covered in Section 3.

Example 4.2 Let us assume that μ_Q is a scoring function that has the product semantics. Let μ_{Q,o} be the combined score of the ordered predicates and let μ_{Q,u} be the combined score of the unordered predicates. Then, given the ith tuple, s_i, ranked with respect to the ordered predicates, F_×(i; Θ) is equal to

F_×(i; Θ) = (prob(∃ j > i : μ_{Q,o}(s_j) × μ_{Q,u}(s_j) > μ_{Q,o}(s_i) × μ_{Q,u}(s_i)) ≤ Θ)
          = (prob(∀ j > i : μ_{Q,o}(s_j) × μ_{Q,u}(s_j) ≤ μ_{Q,o}(s_i) × μ_{Q,u}(s_i)) ≥ 1 − Θ)
          = (prob(∀ j > i : μ_{Q,u}(s_j) ≤ (μ_{Q,o}(s_i) × μ_{Q,u}(s_i)) / μ_{Q,o}(s_j)) ≥ 1 − Θ). □

Example 4.3 Let us assume that μ_Q is a scoring function that has the min semantics. Then, given the ith tuple, s_i, ranked with respect to the ordered predicates, F_min(i; Θ) = (P_∃(i) ≤ Θ) is equal to

F_min(i; Θ) = (prob(∃ j > i : min{μ_{Q,o}(s_j), μ_{Q,u}(s_j)} > min{μ_{Q,o}(s_i), μ_{Q,u}(s_i)}) ≤ Θ)
            = (prob(∀ j > i : min{μ_{Q,o}(s_j), μ_{Q,u}(s_j)} ≤ min{μ_{Q,o}(s_i), μ_{Q,u}(s_i)}) ≥ 1 − Θ).

Since μ_{Q,o} is a non-increasing function, μ_{Q,o}(s_j) is smaller than or equal to μ_{Q,o}(s_i). Consequently,

if μ_{Q,o}(s_i) ≤ μ_{Q,u}(s_i), then
    F_min(i; Θ) = true
else
    F_min(i; Θ) = (prob(∀ j > i : min{μ_{Q,o}(s_j), μ_{Q,u}(s_j)} ≤ μ_{Q,u}(s_i)) ≥ 1 − Θ). □

As seen in the above examples, in order to find the value of F(i; Θ), we need to know the distributions of the scoring functions μ_{Q,o} and μ_{Q,u}. In Section 5.1, we will show how these statistical values can be calculated for different fuzzy semantics. However, in some cases, such a score distribution function may not be readily available. In such cases, we need to approximate the score distribution using database statistics. Obviously, such an approximation is likely to cause deviations from the expected results obtained using the algorithm. In Section 5, we describe a method for approximating score distributions.
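As a sketch of how this check might be evaluated in practice for the product semantics, the function below combines the paper's independence assumption with a conservative bound of our own: since the ordered component is non-increasing, μ_{Q,o}(s_j) ≤ μ_{Q,o}(s_i) for j > i, so a later tuple can beat s_i only if its unordered score exceeds μ_{Q,u}(s_i). The function and parameter names are ours:

```python
def f_product(mu_o_i, mu_u_i, remaining, cdf_u, theta):
    """Conservative check of F(i; Theta) for the product semantics.
    A later tuple s_j can beat s_i only if its unordered score exceeds
    mu_o(s_i) * mu_u(s_i) / mu_o(s_j) >= mu_u(s_i); with `remaining`
    independent tuples left and unordered-score CDF cdf_u, the chance
    that none of them does is at least cdf_u(mu_u_i) ** remaining."""
    p_exists_better = 1.0 - cdf_u(mu_u_i) ** remaining
    return p_exists_better <= theta
```

For example, with uniformly distributed unordered scores (cdf_u = lambda x: x), a tuple whose unordered score is 0.99 with 10 tuples remaining passes the check for Θ = 0.1, while one with score 0.5 does not.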

4.5 Correctness of the Algorithm

In this subsection, we show that the expected ratio of the relevant results, within the k top results returned by the

algorithm, is within the error bounds; i.e., we prove the correctness of the algorithm.

Theorem 4.1 Given an n-tuple, r_i, which is among the top k n-tuples returned in R, r_i is most probably (prob(r_i ∈ R_k) > 1 − Θ) in the set, R_k, of top k results. □

Proof 4.1 Given the ith n-tuple, r_i (i ≤ k), in R, r_i is either in Marked or not.

If r_i ∈ Marked, then it satisfies the condition in step 3(e), and the probability that there is another tuple with a better combined score is less than or equal to Θ. Since i ≤ k, we can conclude that the probability that the subsequent searches will yield k − i + 1 tuples with a better combined score is also less than or equal to Θ. In other words, the probability that the subsequent searches will push r_i out of R_k is less than or equal to Θ. Consequently, the probability that r_i is in R_k is greater than 1 − Θ.

If r_i ∉ Marked, then there is an m_i ∈ Marked such that r_i scores higher than m_i and m_i satisfies the condition in step 3(e). Consequently, the probability that r_i is in R_k is again greater than 1 − Θ.

Theorem 4.2 The expected number of n-tuples in R that are also in R_k is greater than (1 − Θ) × k. □


Proof 4.2 Given a set, R, of k n-tuples such that each one is in R_k with a worst-case probability 1 − Θ (Theorem 4.1), the number, j, of the elements of R that are also in R_k is a random variable that has a binomial distribution with parameters p > 1 − Θ and q ≤ Θ. Consequently, the expected value of j is greater than (1 − Θ) × k.

4.6 Complexity of the Algorithm

Note that the main loop of the proposed algorithm can be iterated many times during which no tuples are added to the Marked set. The following theorem states the effect of this on the complexity of the algorithm.

Theorem 4.3 If N is the database size (the number of all possible n-tuples), m is |P_O|, prob(F(i; Θ) = true | 1 ≤ i ≤ N) is geometrically distributed with parameter F, and the predicates are independent, then the expected running time of the algorithm, with arbitrarily high probability, is O(N^{(m−1)/m} × (k/F) + (k/F) × log(k/F)). □

Proof 4.3 If we assume that the probability of having one n-tuple satisfying the condition in step 3(e) is geometrically distributed with parameter F, the expected number of n-tuples to be tested until the first suitable n-tuple is 1/F. (In reality, F is not a constant and tends to be higher for lower values of i; consequently, the actual number of visited tuples is lower than the value provided by this assumption. See the results in Section 6 for details.) Consequently, for k matches, the expected number of times the loop will be repeated is k/F. Step 3(a) of the algorithm takes O(N^{(m−1)/m}) time (according to Fagin [10, 11]), with arbitrarily high probability, where N is the database size, i.e., the number of all possible n-tuples. Consequently, the expected running time of the while loop (or the expected number of tuples to be evaluated), with arbitrarily high probability, is O(N^{(m−1)/m} × (k/F)).

The final selection of the k best solutions among the candidates then takes O((k/F) × log(k/F)) time. Consequently, the expected running time of the algorithm, with arbitrarily high probability, is O(N^{(m−1)/m} × (k/F) + (k/F) × log(k/F)). □

Note that when N^{(m−1)/m} × (k/F) ≪ N, the proposed algorithm accomplishes its task: it visits a very small portion of the database, yet generates approximately good results. This means that if k/F ≪ N^{1/m}, then the algorithm will work most efficiently. In Section 6, we show experimental results that verify these expectations.
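As a back-of-the-envelope illustration of this bound with hypothetical values (these numbers are ours, not from the paper's experiments):

```python
# Expected tuples touched by the main loop vs. the database size,
# using the bound from Theorem 4.3 with assumed values.
N, m, k, F = 1_000_000, 2, 10, 0.5
visited = N ** ((m - 1) / m) * (k / F)   # N^((m-1)/m) * k/F
print(visited, visited / N)              # 20000.0 0.02 -> 2% of the database
```

Since k/F = 20 ≪ N^{1/m} = 1000 here, this falls in the regime where the algorithm is expected to be efficient.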

Note that the theorem assumes that prob(F(i; Θ) = true | 1 ≤ i ≤ N) is geometrically distributed. The results presented in Section 6 will show that, if we replace the geometric distribution assumption with a more skewed distribution, the complexity of the algorithm relative to the database size will be less than predicted by the above theorem. Note also that, similar to the case in [11], although positive correlations between predicates may help reduce the complexity predicted above, negative correlations between predicates may result in higher complexities.

5 Approximating the Score Distribution using Statistics

In the previous section, we have seen that in order to use the proposed query evaluation algorithm, we need to calculate the score distribution for the combined scoring functions, μ_{Q,o} and μ_{Q,u}. In this section, we discuss the


[Figure 7 surface plots omitted: the panels show the conjunction value over two predicate scores for (a) the geometric average, (b) the arithmetic average, and (c) the minimum.]

Figure 7: The effect of the (a) geometric average, (b) arithmetic average, and (c) minimum function with two predicates. The horizontal axes correspond to the values of the two input predicates and the vertical axis corresponds to the value of the conjunct according to the respective function.

differences in score distributions for various fuzzy semantics, followed by our proposed method for approximating the score distribution using statistics.

5.1 Statistical Properties of Fuzzy Semantics

We first investigate the score distributions and statistical properties of different fuzzy semantics. Figure 7 depicts

three mechanisms to evaluate conjunction. Figure 7(a) depicts the geometric averaging method (which is product

followed by a root operation), (b) depicts the arithmetic averaging mechanism used by other researchers [26],

and (c) the minimum function as described by Zadeh [17] and Fagin [10, 11]. In this section, we compare various

statistical properties of these semantics. These properties describe the shape of the combined score distribution

histograms.

5.2 Relative Importance of Query Criteria

An important advantage of the geometric average over the arithmetic average and the min functions is the following: although it shows a linear behavior when the similarity values of the predicates are close to each other,

(x = y) ⟹ dμ(x,y)/dx dy = (1/(2√(x × x))) × (x + x) = (1/(2x)) × (2x) = 1,

it shows a non-linear behavior when one of the predicates has a lower similarity than the others:

(x ≪ y) ⟹ dμ(x,y)/dx = √y/(2√x) = ∞,    (x ≫ y) ⟹ dμ(x,y)/dx = 0.

Example 5.1 The first item below shows the linear increase in the score of the geometric average when the input values are close to each other. The second item, on the other hand, shows the non-linearity of the increase when the input values are different:

• (0.5 × 0.5 × 0.5)^{1/3} = 0.5, (0.6 × 0.6 × 0.6)^{1/3} = 0.6, and (0.7 × 0.7 × 0.7)^{1/3} = 0.7.


Arithmetic average:  ∫₀¹∫₀¹ ((x + y)/2) dy dx / ∫₀¹∫₀¹ dy dx = 1/2
Min:                 ∫₀¹∫₀¹ min{x, y} dy dx / ∫₀¹∫₀¹ dy dx = 1/3
Geometric average:   ∫₀¹∫₀¹ √(x × y) dy dx / ∫₀¹∫₀¹ dy dx = 4/9

Table 4: Average score of various scoring semantics

Arithmetic average:  1 − (8 × α³)/3 (α ≤ 0.5);  4/3 − 4 × α² + (8 × α³)/3 (α ≥ 0.5)
Min:                 1 − 3 × α² + 2 × α³
Geometric average:   1 − α³ + 3 × α³ × ln(α)

Table 5: Score distribution (relative cardinality of a strong α-cut) of various scoring semantics

• (1.0 × 1.0 × 0.5)^{1/3} = 0.79, (1.0 × 1.0 × 0.6)^{1/3} = 0.85, and (1.0 × 1.0 × 0.7)^{1/3} = 0.88. □
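The asymmetric sensitivity of the geometric average can be checked numerically with a finite-difference approximation of ∂μ/∂x for the two-predicate case μ(x, y) = √(x × y) (our own check, not from the paper):

```python
import math

def d_mu_dx(x, y, h=1e-6):
    """Finite-difference estimate of the partial derivative of the
    two-predicate geometric average mu(x, y) = sqrt(x * y) w.r.t. x."""
    return (math.sqrt((x + h) * y) - math.sqrt(x * y)) / h
```

At x = y = 0.5 the slope is 1/2 in each input (so the total slope along the diagonal is 1), while the slope is large for the low-scoring predicate (x ≪ y) and small for the high-scoring one (x ≫ y).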

It has been claimed that, according to real-world and artificial nearest-neighbor workloads, the highest-scoring predicates are interesting and the rest are not [27]. This implies that the min semantics, which places the highest importance on the lowest-scoring predicate, may not be suitable for real workloads. The geometric average semantics, unlike the min semantics, does not suffer from this behavior.

Furthermore, the effect of an increase in the score of a sub-query (due to a modification/relaxation of the query by the user or the system) with a small score value is larger than an equivalent increase in the score of a sub-query with a large score value. This implies that, although the sub-queries with a high score have a larger role in determining the final score, relaxing a non-satisfied sub-query may have a significant impact on improving the final score. This makes sense, as an increase in a low-scoring sub-query increases the interestingness of the sub-query itself.

Average score: The first statistical property that we consider in this section is the average score, which measures, assuming a uniform distribution of input scores, the average output score. The average score, or the relative cardinality, of a fuzzy set with respect to its discourse (or domain) is defined as the cardinality of the set divided by the cardinality of its discourse. We can define the relative cardinality of a fuzzy set S with a scoring function μ(x), where x ranges between 0 and 1, as ∫₀¹ μ(x) dx / ∫₀¹ dx. Consequently, the average score of each conjunction semantics can be computed as shown in Table 4. Note that, if analogously defined, the relative cardinality of the crisp conjunction is

(μ(false ∧ false) + μ(false ∧ true) + μ(true ∧ false) + μ(true ∧ true)) / |{(false ∧ false), (false ∧ true), (true ∧ false), (true ∧ true)}| = (0 + 0 + 0 + 1)/4 = 1/4.

If no score distribution information is available, the average score (assuming a uniform distribution of inputs)

can be used to calculate a crude estimate of the ratio of the inputs that have a score greater than a given value.

However, a better choice is obviously to look at the expected score distributions of the merge functions.


[Figure 8 plots omitted.]

Figure 8: Score distribution of various score merging semantics (lower curve = min, middle curve = geometric average, and upper curve = arithmetic average)

[Figure 9 plots omitted.]

Figure 9: Various score distributions: (a) uniform distribution, (b) Zipf's distribution, (c) vector-space distribution

Score Distribution: The second property we investigate is the score distribution. The study of the score distribution of fuzzy algebraic operators is essential for creating the histograms used in the algorithm we proposed. The strong α-cut of a fuzzy set is defined as the set of elements of the discourse that have score values equal to or larger than α. The relative cardinality of a strong α-cut (μ_{x∧y} ≥ α) of a conjunction with respect to its overall cardinality (μ_{x∧y} ≥ 0) describes the concentration of scores above a threshold α. Table 5 and Figure 8 show the score distributions of various scoring semantics (assuming a uniform distribution of inputs). Note that when α is close to 1, the relative cardinality of the strong α-cut of the geometric average behaves like that of the arithmetic average.
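The closed forms in Table 5 can be spot-checked numerically; below is a small Monte-Carlo sketch (our own verification code) for the min and geometric average entries with two uniform inputs:

```python
import math
import random

def alpha_cut_fraction(combine, alpha, trials=200_000, seed=7):
    """Monte-Carlo estimate of the relative cardinality of the strong
    alpha-cut of a two-predicate conjunction under uniform inputs: the
    integral of the combined score over {score >= alpha}, divided by
    its integral over the whole unit square."""
    rng = random.Random(seed)
    total = above = 0.0
    for _ in range(trials):
        s = combine(rng.random(), rng.random())
        total += s
        if s >= alpha:
            above += s
    return above / total

# closed forms from Table 5, evaluated at alpha = 0.3
a = 0.3
min_closed = 1 - 3 * a**2 + 2 * a**3            # min semantics
geo_closed = 1 - a**3 + 3 * a**3 * math.log(a)  # geometric average
print(alpha_cut_fraction(min, a), min_closed)
print(alpha_cut_fraction(lambda x, y: math.sqrt(x * y), a), geo_closed)
```

Both estimates land within Monte-Carlo noise of the closed forms.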

5.3 Approximation of Score Distributions

In the previous subsection, we studied the score distribution of various merge functions. These distributions were

generated assuming a uniform distribution of input values and can be used when there is no other information. An

alternative approach, on the other hand, is to approximate the score distribution function using database statistics

or domain knowledge.

5.3.1 Score Distributions

There are various ways in which the scoring function of a predicate can behave. The following is a selection of

possible distributions:


[Figure 10 plot omitted.]

Figure 10: The equi-distance spheres enveloping a query point in a three-dimensional space

Uniform distribution of scores: In this case, the probability that a fuzzy predicate will return a score within the range $[\frac{a}{k}, \frac{a+1}{k})$, where $0 \leq a < k$, is $\frac{1}{k}$. Figure 9(a) shows this behavior.

Zipf's distribution of scores: It is generally the case that multimedia predicates have a small number of high-scoring and a very large number of low-scoring inputs. For instance, according to Zipf's distribution [37, 38], the probability that, for a given data element, a fuzzy predicate will return a score within the range $[\frac{a}{k}, \frac{a+1}{k})$, where $0 \leq a < k$, is $\frac{1}{(a+1) \cdot \log(1.78 \cdot k)}$ (Figure 9(b)).
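These Zipfian bin probabilities can be tabulated directly; the following small sketch (illustrative, not from the paper) shows that the low-score bins carry most of the mass and that the bin masses sum to approximately one:

```python
import math

def zipf_bin_probabilities(k):
    """Probability that a score falls in bin [a/k, (a+1)/k), 0 <= a < k,
    following the Zipfian model 1 / ((a+1) * log(1.78 * k)).
    Bin a = 0 holds the lowest scores and is the most populated."""
    return [1.0 / ((a + 1) * math.log(1.78 * k)) for a in range(k)]

probs = zipf_bin_probabilities(10)
print(round(probs[0], 3), round(probs[-1], 3), round(sum(probs), 3))
```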

Vector-space distribution of scores: In information retrieval or multimedia databases, documents or media objects are generally represented as points in an $n$-dimensional space. This is called the vector-model representation of the data [28, ?]. The predicates use the distance (Euclidean, city-block, etc.) between the points in this space to evaluate the dissimilarities of the objects; the further apart two objects are from each other, the more dissimilar they are. Hence, given two objects, $o_1$ and $o_2$, in the vector space, their similarity can be measured as
$$sim(o_1, o_2) = 1 - \frac{\Delta(o_1, o_2)}{max_\Delta},$$
where $\Delta(o_1, o_2)$ is the distance between $o_1$ and $o_2$ and $max_\Delta$ is the maximum distance between any two points in the database. Note that $sim(o_1, o_1)$ is equal to $1.0$ and the minimum possible score is $0.0$.

Given a nearest-neighbor query $q$ (a very common type of query that asks for objects that fall near a given query point), we can divide the space into $k$ slices using $k$ spheres, each of which is $\frac{max_\Delta}{k}$ apart from the next. Figure 10 shows three spheres enveloping a query point in a three-dimensional space. In this example, each of these spheres is $\frac{max_\Delta}{10}$ units apart from the next; i.e., $k$ is equal to 10.

Figure 11: Two example queries to an image database

Note that in an $n$-dimensional space, the volume of a sphere of radius $r$ is $C \cdot r^n$, where $C$ is some constant (for example, in a two-dimensional space, the area of a circle is $\pi r^2$, and in a three-dimensional space, the volume of a sphere is $\frac{4}{3}\pi r^3$). Consequently, in an $n$-dimensional space, the volume between two consecutive spheres, the $i$th and the $(i-1)$th, can be calculated as
$$C \cdot \left(i \cdot \frac{max_\Delta}{k}\right)^n - C \cdot \left((i-1) \cdot \frac{max_\Delta}{k}\right)^n = C' \cdot i^{n-1} + o(i^{n-1}).$$

Hence, assuming that the points are uniformly distributed in the space, the expected ratio of the points that fall into this volume to the points that fall into the next larger slice is approximately $\left(\frac{i}{i+1}\right)^{n-1}$. Therefore, if the number of points in the innermost sphere is $I$, then the number of points in

- the second slice is $O(I \cdot 2^{n-1})$,
- the third slice is $O(I \cdot 3^{n-1})$,
- the fourth slice is $O(I \cdot 4^{n-1})$, and so on.

Hence, assuming a uniform distribution of points in the space, the probability that, for a given data element, a fuzzy predicate will return a score within the range $[\frac{a}{k}, \frac{a+1}{k})$, where $0 \leq a < k$, is $M \cdot (k - a)^{n-1}$, where $M$ is a positive constant and $n$ is the number of dimensions in the vector space. Figure 9(c) shows this behavior. Note that, when the number of dimensions is higher, the curve becomes steeper.
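The resulting bin populations can be sketched as follows; normalizing the $(k-a)^{n-1}$ weights stands in for the constant $M$ (an illustration under the uniform-points assumption, not the paper's code):

```python
def vector_space_bin_weights(k, n):
    """Relative number of points whose score falls in bin [a/k, (a+1)/k):
    proportional to (k - a)**(n - 1), the leading term of the volume of
    the corresponding distance slice in an n-dimensional space."""
    weights = [(k - a) ** (n - 1) for a in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

# With more dimensions the mass concentrates in the low-score bins
# (the curve gets steeper).
for n in (2, 10):
    probs = vector_space_bin_weights(10, n)
    print(n, round(probs[0], 3), round(probs[-1], 6))
```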

Clustered Score Distributions: In the real world, the assumption that points are uniformly distributed across the vector space does not always hold. Figure 11 provides two example score distributions from an image database, ImageRoadMap [?]. This particular database contains 25,000 images. The retrieval predicate omits those images that are beyond a certain threshold distance from the query point and then appropriately scales the scores to cover the range $[0, 1]$. Therefore, although not shown in the figure, the bin corresponding to the score 0 is, in actuality, very large.

Figures 11(a) and (b) both show a rapid increase in the score distribution, as predicted by the Zipfian and vector-based models. However, both figures show another phenomenon unpredicted by these models: the distribution starts decreasing after a point instead of continuously increasing. This is due to the fact that points in the vector space are not uniformly distributed; instead, they tend to form clusters. This is especially apparent in Figure 11(b), where there are not one, but two local maxima, corresponding to two different clusters.

Figure 12: A clustered distribution and the enveloping curve

Note, however, that the fact that clusters exist in the vector space does not mean that the vector-space model described earlier cannot be used to reason about score distributions: since the volume between slices in the vector space increases as we get further away from the center, slices away from the query point are likely to contain more clusters than the closer ones (assuming that the clusters themselves are uniformly distributed). Therefore, a Zipfian or vector-based model can be used to model an envelope curve for a large vector space with clusters of points (Figure 12).

Merged Score Distributions: When cached results and materialized views are used for answering queries or sub-queries, a single predicate in a given query may correspond to a cached combination of multiple sub-predicates put together (in advance) using a merge function. As we have seen in Section 5.1, even when their inputs are uniformly distributed, the score distributions corresponding to such score merge functions show a skewed behavior: the number of low-scoring inputs is much larger than the number of high-scoring inputs.

Summary: Since, in most cases, the number of low-scoring inputs is much larger than the number of high-scoring inputs, in the rest of the paper we focus our attention on these kinds of scoring functions. However, this does not mean that scoring functions with other behaviors cannot exist.

5.3.2 Selection of an Appropriate Scoring Function

Since in this section we concentrate on score distributions where the number of low-scoring inputs is much larger than the number of high-scoring inputs, in order to model and approximate $\mu_{Q,o}$ and $\mu_{Q,u}$, we need to find a generic scoring function with a similar behavior. Clearly, various functions can be used for approximating the scoring function. We see that two good candidates are

- $f^a_{\alpha,\beta,\phi}(rank) = \frac{\alpha}{rank + \beta} + \phi$ and
- $f^b_{\alpha,\beta,\phi}(rank) = \alpha \cdot \left(\frac{max\_rank - rank}{max\_rank}\right)^{\beta} + \phi$.

Intuitively, in both cases, $f(i)$ gives the $i$th highest score of the predicate. Depending on the values of the $\alpha$, $\beta$, and $\phi$ parameters, these functions can describe rapidly and slowly decreasing scoring functions. The advantage of the first function over the second is that it does not need the maximum rank information ($max\_rank$) and it can be evaluated much more easily. Note also that the first function can describe a large set of curvatures, both concave and convex (Figure 13(a,b,c,d)). Therefore, we use $f^a$ in the rest of this paper.
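A sketch of the first candidate function; the parameter values are those shown in Figure 13(a), and the name `f_a` is ours:

```python
def f_a(rank, alpha, beta, phi):
    """First approximation function: the score of the item at the given
    rank, f(rank) = alpha / (rank + beta) + phi."""
    return alpha / (rank + beta) + phi

# Parameters from Figure 13(a): a rapidly decreasing curve.
scores = [f_a(r, alpha=207, beta=197, phi=0) for r in range(1, 1001)]
# With positive alpha the scoring function is monotone non-increasing in rank.
assert all(s1 >= s2 for s1, s2 in zip(scores, scores[1:]))
print(round(scores[0], 3), round(scores[-1], 3))
```

Different signs of $\alpha$ and $\phi$ yield the concave and convex shapes the paper refers to.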


Figure 13: The first score approximation function: two functions (a and c, with parameters $\langle\alpha = 207, \beta = 197, \phi = 0\rangle$ and $\langle\alpha = -1333.9, \beta = 447.3, \phi = 1.34\rangle$) and the corresponding score distributions (b and d); (e) approximation of the combined score of a query, $P(X) \wedge Q(Y)$

Figure 13(e) shows two possible approximations, each with different parameters, for the combined score of a query of the form $P(X) \wedge Q(Y)$, where $P(X)$ and $Q(Y)$ are ordered and both satisfy Zipf's distribution. This figure shows that the function not only approximates the predicate scores, but can also approximate combined scores for a query. Therefore, we can conclude that the proposed approximation function (1) is not limited to predicates with a Zipfian distribution and (2) can handle merged scores with non-uniformly distributed inputs.

5.3.3 Calculating $F_\mu(i, \Delta)$ using the Approximate Scoring Function

Note that an approximate scoring function, $f_{\alpha,\beta,\phi}(i)$, for the $i$th highest score is not enough for the algorithm proposed in Section 4.4. Instead, the algorithm needs the value of $F_\mu(i, \Delta)$ (see footnote 4), assuming that the behaviors of $\mu_{Q,o}$ and $\mu_{Q,u}$ are approximated with the functions $f_{\alpha_o,\beta_o,\phi_o}(x)$ and $f_{\alpha_u,\beta_u,\phi_u}(x)$.

Let us assume that $\mu_Q$ is a scoring function that has the product semantics. Let $\mu_{Q,o}$ be the combined score of the ordered predicates and let $\mu_{Q,u}$ be the combined score of the unordered predicates. Let $\mu_{Q,o}(x)$ and $\mu_{Q,u}(x)$ be approximated with the functions $f_{\alpha_o,\beta_o,\phi_o}(x)$ and $f_{\alpha_u,\beta_u,\phi_u}(x)$. Then, given the $i$th tuple, $s_i$, ranked with respect to the ordered predicates in $Q$, $F_\mu(i, \Delta) = (P_\exists(i) \leq \Delta)$ is equal to
$$prob\left(\forall j > i \;\; \mu_{Q,o}(s_j) \times \mu_{Q,u}(s_j) \leq \mu_{Q,o}(s_i) \times \mu_{Q,u}(s_i)\right) \geq 1 - \Delta,$$

Footnote 4 (reminder): If $F(i, \Delta)$ is true, then given a tuple $s_i$, the probability of having another tuple, $s_j$, with a better combined score than the score of $s_i$, is less than $\Delta$.

Figure 14: The ratio of the unordered tuples that could satisfy the constraint is $\frac{N - l_j}{N}$

$$prob\left(\forall j > i \;\; \mu_{Q,u}(s_j) \leq \frac{\mu_{Q,o}(s_i) \times \mu_{Q,u}(s_i)}{\mu_{Q,o}(s_j)}\right) \geq 1 - \Delta,$$
$$prob\left(\forall j > i \;\; \mu_{Q,u}(s_j) \leq \frac{\mu_{Q,o}(s_i) \times \mu_{Q,u}(s_i)}{\frac{\alpha_o}{j + \beta_o} + \phi_o}\right) \geq 1 - \Delta,$$
$$prob\left(\forall j > i \;\; \mu_{Q,u}(s_j) \leq \frac{\mu_{Q,o}(s_i) \times \mu_{Q,u}(s_i) \times (j + \beta_o)}{\alpha_o + \phi_o \times (j + \beta_o)}\right) \geq 1 - \Delta,$$
or, in other words,
$$\prod_{j=i+1}^{N} prob\left(\mu_{Q,u}(s_j) \leq \frac{a_i \cdot j + b_i}{c \cdot j + d}\right) \geq 1 - \Delta.$$

Note that $a_i$, $b_i$, $c$, and $d$ are used as shorthands. To simplify the calculations, let us split the above inequality into two parts as follows:
$$F_\mu(i, \Delta) = (S_i \geq 1 - \Delta), \quad \text{where} \quad S_i = \prod_{j=i+1}^{N} prob\left(\mu_{Q,u}(s_j) \leq \frac{a_i \cdot j + b_i}{c \cdot j + d}\right).$$

Since $\mu_{Q,u}(s_j)$ is the combined score of the unordered predicates, given a rank $j$ calculated with respect to the ordered predicates, we cannot estimate the value of $\mu_{Q,u}(s_j)$ directly. However, using the function $f_{\alpha_u,\beta_u,\phi_u}$, we can estimate the implicit rank $l_j$ at which $f_{\alpha_u,\beta_u,\phi_u}(l_j)$ is equal to $\kappa_{i,j} = \frac{a_i \cdot j + b_i}{c \cdot j + d}$ as follows:
$$f_{\alpha_u,\beta_u,\phi_u}(l_j) = \frac{a_i \cdot j + b_i}{c \cdot j + d}, \quad \text{i.e.,} \quad \frac{\alpha_u}{l_j + \beta_u} + \phi_u = \frac{a_i \cdot j + b_i}{c \cdot j + d},$$
which gives us
$$l_j = \frac{a'_i \cdot j + b'_i}{c'_i \cdot j + d'_i}.$$


Note, on the other hand, that, as shown by the shaded region in Figure 14, the ratio of all tuples, $s_j$, such that $\mu_{Q,u}(s_j) \leq \kappa_{i,j}$ is $\frac{N - l_j}{N}$. Hence, we have
$$prob\left(\mu_{Q,u}(s_j) \leq \kappa_{i,j}\right) = \frac{N - l_j}{N} \quad \text{and thus} \quad S_i = \prod_{j=i+1}^{N} \frac{N - l_j}{N}.$$

Since this ratio corresponds to a probability, we must make sure that $0.0 \leq \frac{N - l_j}{N} \leq 1.0$. In other words,
$$0.0 \leq 1 - \frac{1}{N} \cdot \left(\frac{a'_i \cdot j + b'_i}{c'_i \cdot j + d'_i}\right) \leq 1.0,$$
which means that we can find two limits such that
$$l^{\perp}_i \leq j \leq l^{\top}_i.$$

Consequently,

- if $i + 1 \leq l^{\perp}_i$, then there will be at least one $i + 1 \leq j \leq N$ such that $\frac{N - l_j}{N}$ will be $0.0$. Hence $S_i = 0.0 < 1 - \Delta$ and, assuming that the error ratio, $\Delta$, is less than $1.0$, the condition $F_\mu(i, \Delta) = $ false.
- If, on the other hand, $i + 1 \geq l^{\top}_i$, then for all $i + 1 \leq j \leq N$, $\frac{N - l_j}{N}$ will be $1.0$. Hence, $S_i = 1.0 \geq 1 - \Delta$ and $F_\mu(i, \Delta) = $ true.
- Finally, if neither is the case and $l^{\top}_i < N$, then since $\frac{N - l_j}{N} = 1.0$ for all $j \geq l^{\top}_i$, we have $S_i = \prod_{j=i+1}^{N} \frac{N - l_j}{N} = \prod_{j=i+1}^{l^{\top}_i} \frac{N - l_j}{N}$. Therefore, $S_i$ can be rewritten as $\prod_{j=i+1}^{lim_i} \frac{N - l_j}{N}$, where $lim_i = \min\{l^{\top}_i, N\}$.

Note that we can further rewrite $S_i$ as
$$\prod_{j=i+1}^{lim_i} \frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i},$$
where $a''_i$, $b''_i$, $c''_i$, and $d''_i$ are used as shorthands (see footnote 5).

Since we have guaranteed that the ratio $\frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i}$ is always between $0.0$ and $1.0$, we can convert the product into a summation by taking the logarithm of both sides of the equation:
$$\ln(S_i) = \sum_{j=i+1}^{lim_i} \ln\left(\frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i}\right).$$

Note that since it is not straightforward to solve the above summation, we can instead replace the equality with two inequalities that bound the value of $\ln(S_i)$ from above and below. These two inequalities are
$$\ln(S_i) \leq \int_{i+1}^{lim_i} \ln\left(\frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i}\right) dj \quad \text{and} \quad \ln(S_i) \geq \int_{i}^{lim_i - 1} \ln\left(\frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i}\right) dj.$$

Footnote 5: $S_i$ can also be calculated in closed form as
$$S_i = \left(\frac{a''_i}{c''_i}\right)^{lim_i - i} \cdot \frac{\Gamma\left(lim_i + 1 + \frac{b''_i}{a''_i}\right) \cdot \Gamma\left(i + 1 + \frac{d''_i}{c''_i}\right)}{\Gamma\left(i + 1 + \frac{b''_i}{a''_i}\right) \cdot \Gamma\left(lim_i + 1 + \frac{d''_i}{c''_i}\right)},$$
where $\Gamma(x) = (x - 1)!$. But we will provide an alternative formulation.


Using the second of the two inequalities, we get
$$S_i \geq e^{\int_{i}^{lim_i - 1} \ln\left(\frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i}\right) dj} = T_i.$$
Since $F_\mu(i, \Delta) = (S_i \geq 1 - \Delta)$, we can conclude that
$$F_\mu(i, \Delta) = (T_i \geq 1 - \Delta).$$

Consequently, putting together the different cases encountered so far, we have

- if $i + 1 \leq l^{\perp}_i$, then $F_\mu(i, \Delta) = $ false;
- else if $i + 1 \geq l^{\top}_i$, then $F_\mu(i, \Delta) = $ true;
- else $F_\mu(i, \Delta) = (T_i \geq 1 - \Delta)$.

Therefore, a tuple $s_i$, ranked $i$th with respect to the ordered predicates, which satisfies the boundary conditions, is most probably in the result ($prob(s_i \in R_k) > 1 - \Delta$) if
$$T_i \geq 1 - \Delta.$$
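The relationship between the product $S_i$ and the integral bound $T_i$ can be checked numerically. The following sketch uses hypothetical shorthand values (`a`, `b`, `c`, `d`, `i`, and `lim` are illustrative, not taken from the paper) and a midpoint-rule integral:

```python
import math

def S_exact(a, b, c, d, i, lim):
    """S_i as the exact product of (a*j + b)/(c*j + d) over j = i+1 .. lim."""
    s = 1.0
    for j in range(i + 1, lim + 1):
        s *= (a * j + b) / (c * j + d)
    return s

def T_bound(a, b, c, d, i, lim, steps=10_000):
    """T_i = exp of the integral of ln((a*j + b)/(c*j + d)) over
    [i, lim - 1], evaluated with the midpoint rule."""
    lo, hi = i, lim - 1
    h = (hi - lo) / steps
    total = sum(
        math.log((a * (lo + (k + 0.5) * h) + b) / (c * (lo + (k + 0.5) * h) + d))
        for k in range(steps))
    return math.exp(total * h)

# Hypothetical shorthands chosen so (a*j + b)/(c*j + d) lies in (0, 1].
a, b, c, d = 1.0, 5.0, 1.0, 8.0
i, lim = 10, 40
s_i, t_i = S_exact(a, b, c, d, i, lim), T_bound(a, b, c, d, i, lim)
delta = 0.8
print(round(s_i, 4), round(t_i, 4), t_i >= 1 - delta)
```

With these values the integral approximation stays very close to the exact product, which is what makes the $T_i$ test a usable surrogate for $S_i$.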

Final note: If the approximate score function overestimates $\mu_{Q,u}$, then the value of $lim_i$ (the limit rank below which the probability of finding a larger combined score goes to 0) is also overestimated. An important consequence of this overestimation is that $F(i, \Delta)$ may evaluate to false in more situations than necessary. Therefore, in order to balance such overestimations, we can introduce a new parameter, $win$, and set $lim_i$ to $i + win$. However, since the approximation function given in the previous section is flexible enough to fit various situations, we believe that this parameter will not be necessary in most cases. Nevertheless, in the experiments section, we experimented with different values of the $win$ parameter as well.

6 Experimental Evaluation

We have conducted a set of experiments to evaluate the algorithm we proposed to reduce the number of tuples searched when looking for the best $k$ matches to a fuzzy query. The primary goal of the experiments was to see whether (a) the proposed algorithm returns the expected percentage of the results by (b) exploring only a small fraction of the entire search space. These experiments were also aimed at investigating whether (c) the proposed approach of approximating predicates through their statistical properties negatively affects the solutions obtained by the algorithm. In this section, we report on the observations we obtained through simulations.

Figure 15: The distribution of $R(X, Y)$ and its overestimating approximation (with parameters $\alpha = 197315$, $\beta = 207807$, $\phi = 0$)

6.1 Experiment Setup

In our experiments, we have varied (1) the number of top results required (k), (2) the database size (or the

search space size), and (3) the error threshold. The query that we used for the simulations contains two ordered

predicates, P (X) and Q(Y ), and one unordered predicate, R(X;Y ). We have ranged the sizes of predicates P

and Q between 100 and 1,000. Note that this means that there are up to 1,000,000 tuples in the database [X;Y ℄.

This is similar to the case when there are 1,000,000 images in the database, and we are using features with 1,000

different values for retrieval.

We ran the experiments on a Linux platform with a 300MHz Pentium machine with 128 MB main memory.

Each experiment was run 20 times and the results were averaged.

We generated the fuzzy values for the predicate scores according to Zipf's distribution [37, 38]. This distribution guarantees that the number of high scores returned by a predicate is less than the number of low scores, fitting the profile of many multimedia predicates. More specifically, we set the probability that, for a given data element, a fuzzy predicate will return a score within the range $[\frac{a}{10}, \frac{a+1}{10})$, where $0 \leq a < 10$, to $\frac{1}{(a+1) \cdot \log(1.78 \cdot 10)}$.

We used the product semantics as the fuzzy semantics for retrieval. We approximated $\mu_{Q,o}$ and $\mu_{Q,u}$ with the functions $f_{\alpha_o,\beta_o,\phi_o}(x)$ and $f_{\alpha_u,\beta_u,\phi_u}(x)$. For instance, for the case when the size of the database is 1,000,000, the corresponding approximation parameters are chosen as $\langle \alpha_o = 7913.20, \beta_o = 8000.39, \phi_o = 0.0 \rangle$ (for $P(X) \wedge Q(Y)$, as shown in Figure 13(e)) and $\langle \alpha_u = 197315.00, \beta_u = 207807.59, \phi_u = 0.0 \rangle$ (for $R(X, Y)$). Note that the parameters are chosen such that the approximation for $R(X, Y)$ overestimates the scores (Figure 15). To account for this overestimation, we varied $win$ between 1 and 5.

For comparison purposes, we have also implemented a brute-force algorithm which explores the entire search space and returns the actual rank of each element in the database with respect to a given query. In the next section, we describe our observations.

Figure 16: (a)-(c) Percentage of real top-k results among the tuples returned by the algorithm and (b)-(d) the number of tuples enumerated (panels (a),(c): exp 20%, win 2; panels (b),(d): exp 80%, win 4; figures are plotted to the square-root scale for database size)

6.2 Experiment Results

The initial set of experiments showed that, as expected due to the overestimation of $\mu_{Q,u}$ (Figure 15), although the algorithm found all the data elements in the top $k$, it performed significantly more comparisons than required and, in the worst case, degenerated to checking all tuples. This condition, however, gave us a reasonable framework in which to study the effect of correcting imperfect approximations through the use of the $win$ parameter, which limits the look-ahead.

Figures 16(a) and (b) show the percentage of real top-k results among the tuples returned by the algorithm for two different expected percentages: (a) 20% and (b) 80% (i.e., 0.8 and 0.2 error thresholds, respectively). For the first case we used $win = 2$, and for the second case we used $win = 5$. The reason why, in this particular figure, we use a larger $win$ value for the smaller error threshold is that, when we want smaller errors, the algorithm must make its estimations using more information. Results for the other parameter assignments are provided in the Appendix. Both of these figures show that, as the value of $k$ increases, the algorithm performs better, in the sense that the ratio of real top-k results it returns is closer to the expected one. Note, on the other hand, that for the 20% case, when the database is small, the effect is a higher number of top-k results (which means that the algorithm visits more tuples than it should for the percentage provided by the user). Note that in both cases, when $k$ and the database size are sufficiently large (50 and 5,000, respectively), the algorithm provides at least the expected ratio of top-k results.

Figures 16(c) and (d) show the number of tuples visited by the algorithm. According to these figures, the proposed algorithm works very efficiently. For instance, when the database size is 1,000,000, $k$ is 100, and the expected ratio of top-k matches is 80%, the algorithm visits around 2,000 tuples; i.e., 0.2% of the database.

Note that Theorem 4.3 implies that the number of tuples visited should be $O(N^{\frac{m-1}{m}} \cdot \frac{k}{F})$, where $N$ is the database size, $m$ is the number of ordered predicates (2 in our case), $k$ is the number of results, and $F = prob(F(i, \Delta) = true \mid 1 \leq i \leq N)$. According to this, the number of tuples visited must increase linearly with $k$. This expectation is confirmed by both figures. Again according to the theorem, since $m = 2$, for a given $k$ and assuming that $F$ is a constant, the number of visited tuples must be $O(\sqrt{N})$. In the simulations, we actually observed an even smaller increase in the number of tuples with increasing database size (note that the figures are plotted to the square-root scale for database size to provide a better visualization of the results). This is because the theorem gives an upper bound on the number of tuples visited: the proof assumes that the $n$-tuples satisfying the condition $F(i, \Delta)$ are geometrically distributed with parameter $F$. In the simulations, however, since we used Zipf's distribution for the predicates, the actual distribution of $F$ is not geometric. Consequently, the algorithm finds the matches much earlier than the upper bound suggests.

A complete set of experiment results with various parameter settings is given in the Appendix. The results presented in the Appendix mimic the sample of results presented in this section.

7 Comparison with the Related Work

In addition to the various related works we mention along with the presentation of our approach, a notable recent work by Donjerkovic and Ramakrishnan [?] aims at optimizing a "top k results" type of query probabilistically by viewing the selectivity estimates as probability distributions. However, unlike our approach, it does not address fuzzy queries, but exact-match queries on data that have an inherent order (such as salaries). Also, instead of allowing a limited error on the results themselves, it uses the selectivity estimates to optimize the cost of exact retrieval of the first $k$ results by providing probabilistic guarantees for the optimization cost. Also recently, Chaudhuri and Gravano [?] considered the problem of processing top-k selection queries efficiently; their work, however, has a different focus from ours: it concentrates on providing a way to convert top-k queries into regular database queries. A more relevant work is by Acharya et al. [?], which proposes algorithms aimed at providing approximate answers to warehouse queries using statistics about database content. Unlike our approach, it does not provide mechanisms to deal with similarity-based query processing, and it does not address the problem of finding "top k results" incrementally.


8 Conclusion

In this paper, we have first presented the differences between the general fuzzy query and multimedia query evaluation problems. More specifically, we have pointed to the multimedia precision/recall semantics, the partial match requirement, and the unavoidable necessity of fuzzy but non-progressive predicates. Next, we have presented an approximate query evaluation algorithm that builds on [10, 11] to address the existence of non-progressive fuzzy predicates. The proposed algorithm returns a set $R$ of $k$ results, such that each result $r \in R$ is most probably in the set, $R_k$, of the top $k$ results of the query. This algorithm uses the statistical properties of the fuzzy predicates as well as of the merge functions used to combine the fuzzy values returned by individual predicates. It minimizes unnecessary accesses to non-progressive predicates, while providing error bounds on the top-k retrieval results. Since the performance of the algorithm depends on the accuracy of the database statistics, we have discussed techniques to generate and maintain relevant statistics. Finally, we presented simulation results for evaluating the proposed algorithm in terms of quality of results and search space reduction.

Acknowledgements

We thank Dr. Golshani and Y.-C. Park for providing us with data distributions from their image database, ImageRoadMap, which we used for evaluating the score approximations.

References

[1] S. Y. Lee, M. K. Shan, and W. P. Yang. Similarity Retrieval of ICONIC Image Databases Systems. Pattern

Recognition, 22(6):675–682, 1989.

[2] A. Prasad Sistla, Clement Yu, Chengwen Liu, and King Liu. Similarity based Retrieval of Pictures Using Indices on Spatial Relationships. In Proceedings of the 1995 VLDB Conference, Zurich, Switzerland, September 23-25, 1995.

[3] Wen-Syan Li and K. Selcuk Candan. SEMCOG: A Hybrid Object-based Image Database System and Its

Modeling, Language, and Query Processing. In Proceedings of the 14th International Conference on Data

Engineering, Orlando, Florida, USA, February 1998.

[4] Wen-Syan Li, K. Selcuk Candan, Kyoji Hirata, and Yoshinori Hara. Facilitating Multimedia Database Exploration through Visual Interfaces and Perpetual Query Reformulations. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 538–547, Athens, Greece, August 1997. VLDB.

[5] Wen-Syan Li, K. Selcuk Candan, Kyoji Hirata, and Yoshinori Hara. Hierarchical Image Modeling for

Object-based Media Retrieval. Data & Knowledge Engineering, 27(2):139–176, July 1998.


[6] R. Fagin and E. L. Wimmers. Incorporating user preferences in multimedia queries. In F. Afrati and

P. Koliatis, editors, Database Theory – ICDT ’97, volume 1186 of LNCS, pages 247–261, Berlin, Germany,

1997. Springer Verlag.

[7] R. Fagin and Y. S. Maarek. Allowing users to weight search terms. Technical Report RJ10108, IBM

Almaden Research Center, San Jose, CA, 1998.

[8] S. Y. Sung. A linear transform scheme for combining weights into scores. Technical Report TR98-327,

Rice University, Houston, TX, 1998.

[9] S. Adali, Piero A. Bonatti, Maria Luisa Sapino, and V. S. Subrahmanian. A Multi-Similarity Algebra. In

Proceedings of the 1998 ACM SIGMOD Conference, pages 402–413, Seattle, WA, USA, June 1998.

[10] Ronald Fagin. Fuzzy Queries in Multimedia Database Systems. In 17th ACM Symposium on Principles of

Database Systems, pages 1–10, June 1998.

[11] Ronald Fagin. Combining Fuzzy Information from Multiple Systems. In 15th ACM Symposium on Princi-

ples of Database Systems, pages 216–226, 1996.

[12] Wen-Syan Li, K. Selcuk Candan, Kyoji Hirata, and Yoshinori Hara. Facilitating Multimedia Database Exploration through Visual Interfaces and Perpetual Query Reformulations. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 538–547, Athens, Greece, August 1997. VLDB.

[13] Christos Faloutsos. Searching Multimedia Databases by Content. Kluwer Academic Publishers, Boston,

1996.

[14] R. Richardson, Alan Smeaton, and John Murphy. Using Wordnet as a Knowledge base for Measuring

Conceptual Similarity between Words. In Proceedings of Artificial Intelligence and Cognitive Science

Conference, Trinity College, Dublin, 1994.

[15] Weining Zhang, Clement Yu, Bryan Reagan, and Hiroshi Nakajima. Context-Dependent Interpretations of

Linguistic Terms in Fuzzy Relational Databases. In Proceedings of the 11th International Conference on

Data Engineering, Taipei, Taiwan, March 1995. IEEE.

[16] Surajit Chaudhuri and Luis Gravano. Optimizing Queries over Multimedia Repositories. In Proceedings of

the 1996 ACM SIGMOD Conference, pages 91–102, Montreal, Canada, June 1996.

[17] L. Zadeh. Fuzzy Sets. Information and Control, pages 338–353, 1965.

[18] H. Nakajima. Development of Efficient Fuzzy SQL for Large Scale Fuzzy Relational Database. In Pro-

ceedings of the 5th International Fuzzy Systems Association World Conference, pages 517–520, 1993.


[19] V.S. Lakshmanan, N. Leone, R. Ross, and V.S. Subrahmanian. ProbView: A Flexible Probabilistic Database

System. ACM Transactions on Database Systems, 22(3):419–469, September 1997.

[20] H. Bandermer and S. Gottwald. Fuzzy Sets, Fuzzy Logic, Fuzzy Methods with Applications. John Wiley

and Sons Ltd., England, 1995.

[21] U. Thole, H.-J. Zimmerman, and P. Zysno. On the Suitability of Minimum and Product operators for the

Intersection of Fuzzy Sets. Fuzzy Sets and systems, pages 167–180, 1979.

[22] J. Yen. Fuzzy logic–a modern perspective. IEEE Transactions on Knowledge and Data Engineering,

11(1):153–165, January 1999.

[23] D. Dubois and H. Prade. Criteria Aggregation and Ranking of Alternatives in the Framework of Fuzzy Set

Theory. Fuzzy Sets and Decision Analysis, TIMS Studies in Management Sciences, 20:209–240, 1984.

[24] R.R. Yager. Some Procedures for Selecting Fuzzy Set-Theoretic Operations. International Journal of General Systems, pages 115–124, 1965.

[25] S. Adali, K.S. Candan, Y. Papakonstantinou, and V.S. Subrahmanian. Query Caching and Optimization in

Distributed Mediator Systems. In Proceedings of the 1996 ACM SIGMOD Conference, pages 137–147,

Montreal, Canada, June 1996.

[26] Y. Alp Aslandogan, Chuck Thier, Clement Yu, Chengwen Liu, and Krishnakumar R. Nair. Design, Im-

plementation and Evaluation of SCORE. In Proceedings of the 11th International Conference on Data

Engineering, Taipei, Taiwan, March 1995. IEEE.

[27] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest neighbor” meaningful? In

Database Theory – ICDT ’99, Berlin, Germany, 1999. Springer Verlag. to appear.

[28] C. Faloutsos. Searching Multimedia Databases by Content. Kluwer Academic Publishers, 1996.

[29] C.T. Yu and W. Meng. Principles of Database Query Processing for Advanced Applications. Morgan

Kauffman Publishers, 1998.

[30] A. Yoshitaka and T. Ichikawa. A survey on content-based retrieval for multimedia databases. IEEE Trans-

actions on Knowledge and Data Engineering, 11(1):81–93, January 1999.

[31] F. Idris and S. Panchanathan. Review of image and video indexing techniques. Journal of Visual Commu-

nication and Image Representation-Special Issue on Indexing, Storage and Retrieval of Images and Video -

Part II, 8(2):146–166, June 1997.

[32] Y.A. Aslandogan and C.T. Yu. Techniques and systems for image and video retrieval. IEEE Transactions

on Knowledge and Data Engineering, 11(1):56–63, 1999.


[33] S. Adalı, K.S. Candan, Y. Papakonstantinou, and V.S. Subrahmanian. Query Caching and Optimization in

Distributed Mediator System. In Proc. of the 1996 ACM SIGMOD Conference, Montreal, Canada, June

1996.

[34] O. Etzioni, K. Golden, and D. Weld. Sound and efficient closed-world reasoning for planning. Artificial

Intelligence, 89(1-2):113–148, 1997.

[35] W.-S. Li, K.S. Candan, K. Hirata, and Y. Hara. Facilitating multimedia database exploration through visual interfaces and perpetual query reformulations. In Proceedings of the 23rd International Conference on VLDB, pages 538–547, Athens, Greece, August 1997.

[36] S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. In VLDB’96, pages 87–98,

1996.

[37] G. K. Zipf. Relative Frequency as a Determinant of Phonetic Change. Harvard Studies in Classical Philology, 1929.

[38] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. On the implications of zipf’s law for

web caching. In INFOCOM ’99, New York, USA, March 1999.

36

Page 37: Similarity Ranking

Appendix

[Figure 17: Number of tuples enumerated by the algorithm. Six surface plots, "Tuples visited", for (exp:%20, win:1) through (exp:%20, win:5) and (exp:%50, win:1); axes: K, sqrt(database size), # of tuples.]


[Figure 18: Number of tuples enumerated by the algorithm. Six surface plots, "Tuples visited", for (exp:%50, win:2) through (exp:%50, win:5) and (exp:%80, win:1) through (exp:%80, win:2); axes: K, sqrt(database size), # of tuples.]


[Figure 19: Number of tuples enumerated by the algorithm. Three surface plots, "Tuples visited", for (exp:%80, win:3) through (exp:%80, win:5); axes: K, sqrt(database size), # of tuples.]


[Figure 20: Percentage of real top-k results among the tuples returned by the algorithm. Six surface plots, "Ratio of tuples in top K", for (exp:%20, win:1) through (exp:%20, win:5) and (exp:%50, win:1); axes: K, sqrt(database size), ratio of tuples.]


[Figure 21: Percentage of real top-k results among the tuples returned by the algorithm. Six surface plots, "Ratio of tuples in top K", for (exp:%50, win:2) through (exp:%50, win:5) and (exp:%80, win:1) through (exp:%80, win:2); axes: K, sqrt(database size), ratio of tuples.]


[Figure 22: Percentage of real top-k results among the tuples returned by the algorithm. Three surface plots, "Ratio of tuples in top K", for (exp:%80, win:3) through (exp:%80, win:5); axes: K, sqrt(database size), ratio of tuples.]
