Searching in Metric Spaces

EDGAR CHÁVEZ

Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana

GONZALO NAVARRO, RICARDO BAEZA-YATES

Depto. de Ciencias de la Computación, Universidad de Chile

AND

JOSÉ LUIS MARROQUÍN

Centro de Investigación en Matemáticas (CIMAT)

The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of “intrinsic dimensionality.”

Categories and Subject Descriptors: F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems—computations on discrete structures, geometrical problems and computations, sorting and searching; H.2.1 [Database management]: Physical design—access methods; H.3.1 [Information storage and retrieval]: Content analysis and indexing—indexing methods; H.3.2 [Information storage and retrieval]: Information storage—file organization; H.3.3 [Information storage and retrieval]: Information search and retrieval—clustering, search process; I.5.1 [Pattern recognition]: Models—geometric; I.5.3 [Pattern recognition]: Clustering

General Terms: Algorithms

Additional Key Words and Phrases: Curse of dimensionality, nearest neighbors, similarity searching, vector spaces

This project has been partially supported by CYTED VII.13 AMYRI Project (all authors), by CONACyT grant R-28923A (first author) and by Fondecyt grant 1-000929 (second and third authors).

Authors’ addresses: E. Chávez, Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana, Edificio “B”, Ciudad Universitaria, Morelia, Mich. México 58000; email: [email protected]; G. Navarro and R. Baeza-Yates, Depto. de Ciencias de la Computación, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile; email: {gnavarro,rbaeza}@dcc.uchile.cl; J. L. Marroquín, Centro de Investigación en Matemáticas (CIMAT), Callejón de Jalisco S/N, Valenciana, Guanajuato, Gto. México 36000; email: [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 2001 ACM 0360-0300/01/0900-0273 $5.00

ACM Computing Surveys, Vol. 33, No. 3, September 2001, pp. 273–321.


We also present a unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework. Most approaches turn out to be variations on a few different concepts. We organize those works in a taxonomy that allows us to devise new algorithms from combinations of concepts not noticed before because of the lack of communication between different communities. We present experiments validating our results and comparing the existing approaches. We finish with recommendations for practitioners and open questions for future development.

1. INTRODUCTION

Searching is a fundamental problem in computer science, present in virtually every computer application. Simple applications pose simple search problems, whereas a more complex application will require, in general, a more sophisticated form of searching.

The search operation traditionally has been applied to “structured data,” that is, numerical or alphabetical information that is searched for exactly. That is, a search query is given and the number or string that is exactly equal to the search query is retrieved. Traditional databases are built around the concept of exact searching: the database is divided into records, each record having a fully comparable key. Queries to the database return all the records whose keys match the search key. More sophisticated searches such as range queries on numerical keys or prefix searching on alphabetical keys still rely on the concept that two keys are or are not equal, or that there is a total linear order on the keys. Even in recent years, when databases have included the ability to store new data types such as images, the search has still been done on a predetermined number of keys of numerical or alphabetical types.

With the evolution of information and communication technologies, unstructured repositories of information have emerged. Not only do new data types such as free text, images, audio, and video have to be queried, but also it is no longer possible to structure the information in keys and records. Such structuring is very difficult (either manually or computationally) and restricts beforehand the types of queries that can be posed later. Even when a classical structuring is possible, new applications such as data mining require accessing the database by any field, not only those marked as “keys.” Hence, new models for searching in unstructured repositories are needed.

The above scenarios require more general search algorithms and models than those classically used for simple data. A unifying concept is that of “similarity searching” or “proximity searching,” that is, searching for database elements that are similar or close to a given query element.1 Similarity is modeled with a distance function that satisfies the triangle inequality, and the set of objects is called a metric space. Since the problem has appeared in many diverse areas, solutions have appeared in many unrelated fields, such as statistics, computational geometry, artificial intelligence, databases, computational biology, pattern recognition, and data mining, to name a few. Since the current solutions come from such diverse fields, it is not surprising that the same solutions have been reinvented many times, that obvious combinations of solutions have not been noticed, and that no thorough comparisons have been done. More important, there have been no attempts to conceptually unify all those solutions.

In some applications the metric space turns out to be of a particular type called “vector space,” where the elements consist of k real-valued coordinates. A lot of work has been done on vector spaces by exploiting their geometric properties, but normally these cannot be extended to general metric spaces where the only available information is the distance among objects. In this general case, moreover, the distance is normally quite expensive to compute, so the general goal is to reduce the number of distance evaluations. In contrast, the operations in vector spaces tend to be simple and hence the goal is mainly to reduce I/O. Some important advances have been made for general metric spaces, mainly around the concept of building an index, that is, a data structure to reduce the number of distance evaluations at query time. Some recent work [Ciaccia et al. 1997; Prabhakar et al. 1998] tries to achieve simultaneously the goals of reducing the number of distance evaluations and the amount of I/O performed.

1 The term “approximate searching” is also used, but it is misleading and we use it here only when referring to approximation algorithms.

The main goal of this work is to present a unifying framework to describe and analyze all the existing solutions to this problem. We show that all the existing indexing algorithms for proximity searching consist of building a set of equivalence classes, discarding some classes, and exhaustively searching the rest. Two main techniques based on equivalence relations, namely, pivoting and compact partitions, are shown to encompass all the existing methods. As a consequence of the analysis we are able to build a taxonomy on the existing algorithms for proximity search, to classify them according to their essential features, and to analyze their efficiency. We are able to identify essentially similar approaches, to point out combinations of ideas that have not previously been noticed, and to identify the main open problems in this area. We also present quantitative methods to assert the intrinsic difficulty in searching on a given metric space and provide lower bounds on the search problem. This includes a quantitative definition of the previously conceptual notion of “intrinsic dimensionality,” which we show to be very appropriate. We present some experimental results that help to validate our assertions.

We remark that we are concerned with the essential features of the search algorithms for general metric spaces. That is, we try to extract the basic features from the wealth of existing solutions, so as to be able to categorize and analyze them under a common framework. We focus mainly on the number of distance evaluations needed to execute range queries (i.e., with fixed tolerance radius), which are the most basic ones. However, we also pay some attention to the total CPU time, time and space cost to build the indexes, nearest neighbor queries, dynamic capabilities of the indexes, and I/O considerations. There are some features that we definitely do not cover in order to keep our scope reasonably bounded, such as (1) complex similarity queries involving more than one similarity predicate [Ciaccia et al. 1998b], as few works on them exist and they are an elaboration over the simple similarity queries (a particular case is polygon searching in vector spaces); (2) subqueries (i.e., searching a small element inside a larger element), since the solutions are basically the same after a domain-dependent transformation is done; and (3) inverse queries (i.e., finding the elements for which q is their closest neighbor) and total queries (e.g., finding all the closest neighbors), since the algorithms are, again, built over the simple ones.

This article is organized as follows. The first part (Sections 2 through 5) is a pure survey of the state of the art in searching metric spaces, with no attempt to provide a new way of thinking about the problem. The second part (Sections 6 through 8) presents our basic results on the difficulty of the search problem and our unifying model that allows understanding the essential features of the problem and its existing solutions. Finally, Section 9 gives our conclusions and points out future research directions.

2. MOTIVATING APPLICATIONS

We now present a sample of applications where the concept of proximity searching appears. Since we have not presented a formal model yet, we do not try to explain the connections between the different applications. We rather delay this discussion to Section 3.


2.1. Query by Content in Structured Databases

In general, the query posed to a database presents a piece of a record of information, and it needs to retrieve the entire record. In the classical approach, the piece presented is fixed (the key). Moreover, it is not allowed to search with an incomplete or an erroneous key. On the other hand, in the more general approach required nowadays, the concept of searching with a key is generalized to searching with an arbitrary subset of the record, allowing errors or not.

Possible types of searches are point or key search (all the key information is given), range search (only some fields are given or only a range of values is specified for them), and proximity search (in addition, records “close” to the query are considered interesting). These types of search are of use in data mining (where the interesting parts of the record cannot be predetermined), when the information is not precise, when we are looking for a range of values, when the search key may have errors (e.g., a misspelled word), and so on.

A general solution to the problem of range queries by any record field is the grid file [Nievergelt and Hinterberger 1984]. The domain of the database is seen as a hyperrectangle of k dimensions (one per record field), where each dimension has an ordering according to the domain of the field (numerical or alphabetical). Each record present in the database is considered as a point inside the hyperrectangle. A query specifies a subrectangle (i.e., a range along each dimension), and all the points inside the specified query are retrieved. This does not address the problem of searching on nontraditional data types, nor allowing errors that cannot be recovered with a range query. However, it converts the original search problem to a problem of obtaining, in a given space, all the points “close” to a given query point. Grid files are essentially a disk organization technique to efficiently retrieve range queries in secondary memory.

2.2. Query by Content in Multimedia Objects

New data types such as images, fingerprints, audio, and video (called “multimedia” data types) cannot be meaningfully queried in the classical sense. Not only can they not be ordered, but they cannot even be compared for equality. There is no interest in an application for searching an audio segment exactly equal to a given one. The probability that two different images are pixelwise equal is negligible unless they are digital copies of the same source. In multimedia applications, all the queries ask for objects similar to a given one. Some example applications are face recognition, fingerprint matching, voice recognition, and, in general, multimedia databases [Apers et al. 1997; Yoshitaka and Ichikawa 1999].

Think, for example, of a repository of images. Interesting queries are of the type, “Find an image of a lion with a savanna background.” If the repository is tagged, and each tag contains a full description of what is inside the image, then our example query can be solved with a classical scheme. Unfortunately, such a classification cannot be done automatically with the available image processing technology. The state of object recognition in real-world scenes is still too immature to perform such complex tasks. Moreover, we cannot predict all the possible queries that will be posed so as to tag the image for every possible query. An alternative to automatic classification consists of considering the query as an example image, so that the system searches all the images similar to the query. This can be built inside a more complex feedback system where the user approves or rejects the images found, and a new query is submitted with the approved images. It is also possible that the query is just part of an image and the system has to retrieve the whole image.

These approaches are based on the definition of a similarity function among objects. Those functions are provided by an expert, but they pose no assumptions on the type of queries that can be answered. In many cases, the distance is obtained via a set of k “features” that are extracted from the object (e.g., in an image a useful feature is the average color). Then each object is represented as its k features (i.e., a point in a k-dimensional space), and we are again in a case of range queries on vector spaces.

There is a growing community of scientists deeply involved with the development of such similarity measures [Cascia et al. 1998; Bhanu et al. 1998; Bimbo and Vicario 1998].

2.3. Text Retrieval

Although not considered a multimedia data type, unstructured text retrieval poses similar problems to those of multimedia retrieval. This is because textual documents are in general not structured to easily provide the desired information. Text documents may be searched for strings that are present or not, but in many cases they are searched for semantic concepts of interest. For instance, an ideal scenario would allow searching a text dictionary for a concept such as, “to free from obligation,” retrieving the word “redeem.” This search problem cannot be properly stated with classical tools.

A large community of researchers has been working on this problem for a long time [Salton and McGill 1983; Frakes and Baeza-Yates 1992; Baeza-Yates and Ribeiro-Neto 1999]. A number of measures of similarity have emerged. The problem is solved basically by retrieving documents similar to a given query. The user can even present a document as a query, so that the system finds similar documents. Some similarity approaches are based on mapping a document to a vector of real values, so that each dimension is a vocabulary word and the relevance of the word to the document (computed using some formula) is the coordinate of the document along that dimension. Similarity functions are then defined in that space. Notice, however, that the dimensionality of the space is very high (thousands of dimensions).

Another problem related to text retrieval is spelling. Since huge text databases with low quality control are emerging (e.g., the Web), and typing, spelling, or OCR (optical character recognition) errors are commonplace in the text and the query, we have cases where documents that contain a misspelled word are no longer retrievable by a correctly written query. Models of similarity among words exist (variants of the “edit distance” [Sankoff and Kruskal 1983]) that capture very well those types of errors. In this case, we give a word and want to retrieve all the words close to it. Another related application is spelling checkers, where we look for close variants of the misspelled word.

In particular, OCR can be done using a low-level classifier, so that misspelled words can be corrected using the edit distance to find promising alternatives to replace incorrect words [Chávez and Navarro 2002].
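As a concrete illustration of such a distance, the following is a minimal sketch of the classical edit (Levenshtein) distance between two strings; this unit-cost dynamic-programming formulation is one common variant, not necessarily the exact measure used in the works cited above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))            # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                            # distance from a[:i] to the empty prefix
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,      # delete ca
                            curr[j - 1] + 1,  # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[len(b)]
```

For example, edit_distance("color", "colour") is 1; the function is symmetric and satisfies the triangle inequality, so it defines a metric over strings.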

2.4. Computational Biology

DNA and protein sequences are the basic objects of study in molecular biology. As they can be modeled as texts, we have the problem of finding a given sequence of characters inside a longer sequence. However, an exact match is unlikely to occur, and computational biologists are more interested in finding parts of a longer sequence that are similar to a given short sequence. The fact that the search is not exact is due to minor differences in the genetic streams that describe beings of the same or closely related species. The measure of similarity used is related to the probability of mutations such as reversals of pieces of the sequences and other rearrangements [Waterman 1995; Sankoff and Kruskal 1983].

Other related problems are to build phylogenetic trees (a tree sketching the evolutionary path of the species), to search patterns for which only some properties are known, and others.

2.5. Pattern Recognition and Function Approximation

A simplified definition of pattern recognition is the construction of a function approximator. In this formulation of the problem one has a finite sample of the data, and each data sample is labeled as belonging to a certain class. When a fresh data sample is provided, the system is required to label this new sample with one of the known data labels. In other words, the classifier can be thought of as a function defined from the object (data) space to the set of labels. In this sense all the classifiers are considered as function approximators.

If the objects are m-dimensional vectors of real numbers then the natural choices are neural nets and fuzzy function approximators. Another popular universal function approximator, the k-nearest-neighbor classifier, consists of finding the k objects nearest to the unlabeled sample, and assigning to this sample the label having the majority among the k nearest objects. As opposed to neural nets and fuzzy classifiers, the k-nearest-neighbor rule has zero training time, but if no indexing algorithm is used it has linear complexity [Duda and Hart 1973].
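To make the rule concrete, here is a minimal brute-force sketch of the k-nearest-neighbor classifier just described (all names are illustrative); without an index it performs one distance evaluation per stored sample, which is the linear complexity mentioned above.

```python
from collections import Counter

def knn_classify(sample, labeled_data, dist, k):
    """Brute-force k-NN rule. labeled_data is a list of (object, label) pairs
    and dist is the distance function; the result is the majority label
    among the k objects nearest to the unlabeled sample."""
    neighbors = sorted(labeled_data, key=lambda pair: dist(sample, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```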

Other applications of this universal function approximator are density estimation [Devroye 1987] and reinforcement learning [Sutton and Barto 1998]. In general, any problem where we want to infer a function based on a finite set of samples is a potential application.

2.6. Audio and Video Compression

Audio and video transmission over a narrowband channel is an important problem, for example, in Internet-based audio and video conferencing or in wireless communication. A frame (a static picture in a video, or a fragment of the audio) can be thought of as formed by a number of (possibly overlapped) subframes (16 × 16 subimages in a video, for example). In a very succinct description, the problem can be solved by sending the first frame as-is and, for the next frames, sending only the subframes that differ significantly from the previously sent subframes. This description encompasses the MPEG standard.

The algorithms use, in fact, a subframe buffer. Each time a subframe is about to be sent, it is searched for (with a tolerance) in the subframe buffer; if it is not found then the entire subframe is added to the buffer, and if it is found then only the index of the similar subframe is sent. This implies, naturally, that a fast similarity search algorithm has to be incorporated into the server to maintain a minimum frames-per-second rate.

3. BASIC CONCEPTS

All the applications presented in the previous section share a common framework, which is in essence to find close objects, under some suitable similarity function, among a finite set of elements. In this section we present the formal model comprising all the above cases.

3.1. Metric Spaces

We now introduce the basic notation for the problem of satisfying proximity queries and for the model used to group and analyze the existing algorithms.

The set X denotes the universe of valid objects. A finite subset of it, U, of size n = |U|, is the set of objects where we search. U is called the dictionary, database, or simply our set of objects or elements. The function

d : X × X → R

denotes a measure of “distance” between objects (i.e., the smaller the distance, the closer or more similar are the objects). Distance functions have the following properties.

(p1) ∀x, y ∈ X, d(x, y) ≥ 0   (positiveness),
(p2) ∀x, y ∈ X, d(x, y) = d(y, x)   (symmetry),
(p3) ∀x ∈ X, d(x, x) = 0   (reflexivity),

and in most cases

(p4) ∀x, y ∈ X, x ≠ y ⇒ d(x, y) > 0   (strict positiveness).

The similarity properties enumerated above only ensure a consistent definition of the function, and cannot be used to save comparisons in a proximity query. If d is indeed a metric, that is, if it satisfies

(p5) ∀x, y, z ∈ X, d(x, y) ≤ d(x, z) + d(z, y)   (triangle inequality),

then the pair (X, d) is called a metric space. If the distance does not satisfy the strict positiveness property (p4) then the space is called a pseudometric space. Although for simplicity we do not consider pseudometric spaces in this work, all the presented techniques are easily adapted to them by simply identifying all the objects at distance zero as a single object. This works because, if (p5) holds, one can easily prove that d(x, y) = 0 ⇒ ∀z, d(x, z) = d(y, z).

In some cases we may have a quasi-metric, where the symmetry property (p2) does not hold. For instance, if the objects are corners in a city and the distance corresponds to how much a car must travel to move from one to the other, then the existence of one-way streets makes the distance asymmetric. There exist techniques to derive a new, symmetric, distance function from an asymmetric one, such as d′(x, y) = d(x, y) + d(y, x). However, to be able to bound the search radius of a query when using the symmetric function we need specific knowledge of the domain.

Finally, we can relax the triangle inequality (p5) to d(x, y) ≤ α d(x, z) + β d(z, y) + δ, and after some scaling we can search in this space using the same algorithms designed for metric spaces. If the distance is symmetric we need α = β for consistency.

In the rest of the article we use the term distance in the understanding that we refer to a metric.

3.2. Proximity Queries

There are basically three types of queries of interest in metric spaces.

Range query (q, r)d. Retrieve all elements that are within distance r to q. This is {u ∈ U / d(q, u) ≤ r}.

Nearest neighbor query NN(q). Retrieve the closest elements to q in U. This is {u ∈ U / ∀v ∈ U, d(q, u) ≤ d(q, v)}. In some cases we are satisfied with one such element (in continuous spaces there is normally just one answer anyway). We can also give a maximum distance r∗ such that if the closest element is at distance more than r∗ we do not want any one reported.

k-Nearest neighbor query NNk(q). Retrieve the k closest elements to q in U. That is, retrieve a set A ⊆ U such that |A| = k and ∀u ∈ A, v ∈ U − A, d(q, u) ≤ d(q, v). Note that in case of ties we are satisfied with any set of k elements satisfying the condition.
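Stated operationally, all these query types can be answered by a brute-force scan over U; the sketch below (with illustrative names) is the baseline that any index described later tries to beat in terms of distance evaluations.

```python
def range_query(U, q, r, d):
    """(q, r)_d: every u in U with d(q, u) <= r."""
    return [u for u in U if d(q, u) <= r]

def knn_query(U, q, k, d):
    """NN_k(q): the k elements of U closest to q (ties broken arbitrarily)."""
    return sorted(U, key=lambda u: d(q, u))[:k]
```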

The most basic type of query is the range query. The left part of Figure 1 illustrates a query on a set of points which is our running example, using R^2 as the metric space for clarity.

A range query is therefore a pair (q, r)d with q an element in X and r a real number indicating the radius (or tolerance) of the query. The set {u ∈ U, d(q, u) ≤ r} is called the outcome of the range query.

We use “NN” as an abbreviation of “nearest neighbor,” and give the generic name “NN-query” to the last two types of queries and “NN searching” to the techniques to solve them. As we see later, NN-queries can be systematically built over range queries.

The total time to evaluate a query can be split as

T = distance evaluations × complexity of d() + extra CPU time + I/O time

and we would like to minimize T. In many applications, however, evaluating d() is so costly that the other components of the cost can be neglected. This is the model we use in this article, and hence the number of distance evaluations performed is the measure of the complexity of the algorithms. We can even allow a linear in n (but reasonable) amount of CPU work and a linear traversal over the database on disk, as long as the number of distance computations is kept low.


Fig. 1. On the left, an example of a range query on a set of points. On the right, the set of points at the same distance to a center point, for different Ls distances.

However, we pay some marginal attention to the so-called extra CPU time. The I/O time can be the dominant factor in some applications and negligible in others, depending on the amount of main memory available and the relative cost to compute the distance function. We cover the little existing work on relating metric spaces and I/O considerations in Section 5.3.2.

It is clear that either type of query can be answered by examining the entire dictionary U. In fact if we are not allowed to preprocess the data (i.e., to build an index data structure), then this exhaustive examination is the only way to proceed. An indexing algorithm is an offline procedure to build beforehand a data structure (called an index) designed to save distance computations when answering proximity queries later. This data structure can be expensive to build, but this will be amortized by saving distance evaluations over many queries to the database. The aim is therefore to design efficient indexing algorithms to reduce the number of distance evaluations. All these structures work on the basis of discarding elements using the triangle inequality (the only property that allows saving distance evaluations).

4. THE CASE OF VECTOR SPACES

If the elements of the metric space (X, d) are indeed tuples of real numbers (actually tuples of any field) then the pair is called a finite-dimensional vector space, or vector space for short.

A k-dimensional vector space is a particular metric space where the objects are identified with k real-valued coordinates (x1, ..., xk). There are a number of options for the distance function to use, but the most widely used is the family of Ls distances, defined as

Ls((x1, ..., xk), (y1, ..., yk)) = ( Σ_{i=1..k} |xi − yi|^s )^{1/s}.

The right part of Figure 1 illustrates some of these distances. For instance, the L1 distance accounts for the sum of the differences along the coordinates. It is also called “block” or “Manhattan” distance, since in two dimensions it corresponds to the distance to walk between two points in a city of rectangular blocks. The L2 distance is better known as “Euclidean” distance, as it corresponds to our notion of spatial distance. The other most used member of the family is L∞, which corresponds to taking the limit of the Ls formula when s goes to infinity. The result is that the distance between two points is the maximum difference along a coordinate:

L∞((x1, ..., xk), (y1, ..., yk)) = max_{i=1..k} |xi − yi|.

Searching with the L∞ distance corresponds directly to a classical range search query, where the range is the k-dimensional hyperrectangle. This distance plays a special role in this survey.
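A direct transcription of the Ls family and its L∞ limit, assuming points are given as equal-length sequences of coordinates, is:

```python
def ls_distance(x, y, s):
    """L_s distance between two k-dimensional points x and y."""
    return sum(abs(xi - yi) ** s for xi, yi in zip(x, y)) ** (1.0 / s)

def linf_distance(x, y):
    """L_inf distance: maximum coordinate-wise difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))
```

Here ls_distance(x, y, 1) is the Manhattan distance, ls_distance(x, y, 2) the Euclidean distance, and linf_distance(x, y) the limit of the Ls formula as s grows.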

In many applications the metric space is indeed a vector space; that is, the objects are k-dimensional points and the similarity is interpreted geometrically. A vector space permits more freedom than a general metric space when designing search approaches, since it is possible to use geometric and coordinate information that is unavailable in a general metric space.

In this framework optimal algorithms (on the database size) exist in both the average and the worst case [Bentley et al. 1980] for closest point search. Search structures for vector spaces are called spatial access methods (SAM). Among the most popular are kd-trees [Bentley 1975, 1979], R-trees [Guttman 1984], quadtrees [Samet 1984], and the more recent X-trees [Berchtold et al. 1996]. These techniques make extensive use of coordinate information to group and classify points in the space. For example, kd-trees divide the space along different coordinates and R-trees group points in hyperrectangles. Unfortunately the existing techniques are very sensitive to the vector space dimension. Closest point and range search algorithms have an exponential dependency on the dimension of the space [Chazelle 1994] (this is called the curse of dimensionality).

Vector spaces may suffer from large differences between their representational dimension (k) and their intrinsic dimension (i.e., the real number of dimensions in which the points can be embedded while keeping the distances among them). For example, a plane embedded in a 50-dimensional space has intrinsic dimension 2 and representational dimension 50. This is, in general, the case of real applications, where the data are clustered, and it has led to attempts to measure the intrinsic dimension such as the concept of “fractal dimension” [Faloutsos and Kamel 1994]. Despite the fact that no technique can cope with intrinsic dimension higher than 20, much higher representational dimensions can be handled by dimensionality reduction techniques [Faloutsos and Lin 1995; Cox and Cox 1994; Hair et al. 1995].

Since efficient techniques to cope with vector spaces exist, application designers try to give their problems a vector space structure. However, this is not always easy or feasible at all. For example, experts in image processing try to express the similarity between images as the distance between “vectors” of features extracted from the images, although in many cases better results are obtained by coding specific functions that compare two images, despite their inability to be easily expressed as the distance between two vectors (e.g., crosstalk between features [Faloutsos et al. 1994]). Another example that resists conversion into a vector space is similarity functions between strings, to compare DNA sequences, for instance. A step towards improving this situation is Chávez and Navarro [2002].

For this reason several authors resort to general metric spaces, even knowing that the search problem is much more difficult. Of course it is also possible to treat a vector space as a general metric space, by using only the distances between points. One immediate advantage is that the intrinsic dimension of the space shows up, independent of any representational dimension (this requires extra care in vector spaces). It is interesting to remark that Ciaccia et al. [1997] present preliminary results showing that a metric space data structure (the M-tree) can outperform a well-known vector space data structure (the R∗-tree) when applied to a vector space.

Specific techniques for vector spaces are a whole different world which we do not intend to cover in this work (see Samet [1984], White and Jain [1996], Gaede and Günther [1998], and Böhm et al. [2002] for good surveys). However, we discuss in the next section a technique that, instead of treating a vector space as a metric space, tries to embed a general metric space into a vector space. This concept is central in this survey, although specific knowledge of vector space techniques is, as we shortly show, not necessary to understand it.

4.1. Resorting to Vector Spaces

An interesting and natural reduction of the similarity search problem consists of a mapping Φ from the original metric space into a vector space. In this way, each element of the original metric space is represented as a point in the target vector space. The two spaces are related by two distances: the original one d(x, y) and the distance in the projected space D(Φ(x), Φ(y)). If the mapping is contractive (i.e., D(Φ(x), Φ(y)) ≤ d(x, y) for any pair of elements), then one can process range queries in the projected space with the same radius. Since some spurious elements can be captured in the target space, the outcome of the query in the projected space is a candidate list, which is later verified elementwise with the original distance to obtain the actual outcome of the query.

Intuitively, the idea is to “invent” k coordinates and map the points onto a vector space, using some vector space technique as a first filter to the actual answer to a query. One of the main theses of this work is that a large subclass of the existing algorithms can be regarded as relying on some mapping of this kind. A widely used method (explained in detail in Section 6.6) is to select {p1, ..., pk} ⊆ U and map each u ∈ U to (R^k, L∞) using Φ(u) = (d(u, p1), ..., d(u, pk)). It can be seen that this mapping is contractive but not proximity preserving.
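A minimal sketch of this pivot mapping, assuming the k pivots p1, ..., pk have already been chosen from U (pivot selection is discussed later): the candidate list is produced with the contractive L∞ distance in the projected space and then verified with the original distance.

```python
def project(u, pivots, d):
    """Phi(u) = (d(u, p1), ..., d(u, pk))."""
    return [d(u, p) for p in pivots]

def build_pivot_table(U, pivots, d):
    """Index construction: k distance evaluations per database element."""
    return [(u, project(u, pivots, d)) for u in U]

def pivot_range_query(table, q, r, pivots, d):
    """Range query (q, r)_d through the contractive mapping onto (R^k, L_inf)."""
    phi_q = project(q, pivots, d)              # k "internal" evaluations
    outcome = []
    for u, phi_u in table:
        # Contractiveness: L_inf(Phi(u), Phi(q)) > r implies d(q, u) > r,
        # so only the surviving candidates need an "external" evaluation.
        if max(abs(a - b) for a, b in zip(phi_u, phi_q)) <= r:
            if d(q, u) <= r:
                outcome.append(u)
    return outcome
```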

If, on the other hand, the mapping is proximity preserving (i.e., d(x, y) ≤ d(x, z) ⇒ D(Φ(x), Φ(y)) ≤ D(Φ(x), Φ(z))), then NN-queries can be directly performed in the projected space. Indeed, most current algorithms for NN-queries are based on range queries, and with some care they can be done in the projected space if the mapping is contractive, even if it is not proximity preserving.

This type of mapping is a special case of a general idea in the literature that says it can be cheaper to compute distances that lower-bound the real one, and use the cheaper distance to filter out most elements (e.g., for images, the average color is cheaper to compute than the differences in the color histograms). While in general this is domain-dependent, mapping onto a vector space can be done without knowledge of the domain. After the mapping is done and we have identified each data element with a point in the projected space, we can use a general-purpose spatial access method for vector spaces to retrieve the candidate list. The elements found in the projected space must be finally checked using the original distance function.

Therefore, there are two types of distance evaluations: first to obtain the coordinates in the projected space and later to check the final candidates. These are called “internal” and “external” evaluations, respectively, later in this work. Clearly, increasing the number of internal evaluations improves the quality of the filter and reduces the external evaluations, and therefore we seek a balance.

Notice finally that the search itself in the projected space does not use evaluations of the original distance, and hence it is costless under our complexity measure. Therefore, the use of kd-trees, R-trees, or other data structures aims at reducing the extra CPU time, but it makes no difference in the number of evaluations of the d distance.

How well do metric space techniques perform in comparison to vector space methods? It is difficult to give a formal answer because of the different cost models involved. In metric spaces we use the number of distance evaluations as the basic measure of complexity, while vector space techniques may very well use many coordinate manipulations and not a single evaluation of the distance. Under our model, the cost of a method that maps to a vector space to trim the candidate list is measured as the number of distance evaluations to realize the mapping plus the final distances to filter the trimmed candidate list, whereas the work on the artificial coordinates is seen as just extra CPU time.

A central question related to this reduction is: how well can a metric space be embedded into a vector space? How many coordinates have to be considered so that the original metric space and the target vector space are similar enough that the candidate list given by the vector space is not much larger than the actual outcome of the query in the original space? This is a very difficult question that lies behind this entire article, and we return to it in Section 7.

The issue is better developed in vector spaces. There are different techniques to reduce the dimensionality of a set of points while preserving the original distances as much as possible [Cox and Cox 1994; Hair et al. 1995; Faloutsos and Lin 1995], that is, to find the intrinsic dimension of the data.

5. CURRENT SOLUTIONS FOR METRIC SPACES

In this section we explain the existing indexes to structure metric spaces and how they are used for range and NN searching. Since we have not yet developed the concepts of a unifying perspective, the description is kept at an intuitive level, without any attempt to analyze why some ideas are better or worse. We add a final subsection devoted to more advanced issues such as dynamic capabilities, I/O considerations, and approximate and probabilistic algorithms.

5.1. Range Searching

We divide the presentation in several parts. The first one deals with tree indexes for discrete distance functions, that is, functions that deliver a small set of values. The second part corresponds to tree indexes for continuous distance functions, where the set of alternatives is infinite or very large. Third, we consider other methods that are not tree-based.

Table I summarizes the complexities of the different structures. These are obtained from the source papers, which use different (and incompatible) assumptions and in many cases give just gross analyses or no analysis at all (just heuristic considerations). Therefore, we give the complexities as claimed by the authors of each paper, not as a proven fact. At best, the results are analytical but rely on diverse simplifying assumptions. At worst, the results are based on a few incomplete experiments. Keep also in mind that there are hidden factors depending (in many cases exponentially) on the dimension of the space, and that the query complexity is always on average, as in the worst case we can be forced to compare all the elements. Even in the simple case of orthogonal range searching on vector spaces there exist Ω(n^α) lower bounds for the worst case [Melhorn 1984].

5.1.1. Trees for Discrete Distance Functions. We start by describing tree data structures that apply to distance functions which return a small set of different values. At the end we show how to cope with the general case with these trees.

5.1.1.1. BKT. Probably the first general solution to search in metric spaces was presented in Burkhard and Keller [1973]. They propose a tree (thereafter called the Burkhard–Keller Tree, or BKT), which is suitable for discrete-valued distance functions. It is defined as follows. An arbitrary element p ∈ U is selected as the root of the tree. For each distance i > 0, we define Ui = {u ∈ U, d(u, p) = i} as the set of all the elements at distance i to the root p. Then, for any nonempty Ui, we build a child of p (labeled i), where we recursively build the BKT for Ui. This process can be repeated until there is only one element to process, or until there are no more than b elements (and we store a bucket of size b). All the elements selected as roots of subtrees are called pivots.

When we are given a query q and a distance r, we begin at the root and enter into all children i such that d(p, q) − r ≤ i ≤ d(p, q) + r, and proceed recursively. If we arrive at a leaf (bucket of size one or more) we compare sequentially all its elements.


Table I. Average Complexities of Existing Approaches (a)

Data Structure   Space Complexity    Construction Complexity   Claimed Query Complexity   Extra CPU Query Time
BKT              n ptrs              n log n                   n^α                        —
FQT              n..n log n ptrs     n log n                   n^α                        —
FHQT             n..nh ptrs          nh                        log n (b)                  n^α
FQA              nhb bits            nh                        log n (b)                  n^α log n
VPT              n ptrs              n log n                   log n (c)                  —
MVPT             n ptrs              n log n                   log n (c)                  —
VPF              n ptrs              n^(2−α)                   n^(1−α) log n (c)          —
BST              n ptrs              n log n                   not analyzed               —
GHT              n ptrs              n log n                   not analyzed               —
GNAT             nm^2 dsts           nm log_m n                not analyzed               —
VT               n ptrs              n log n                   not analyzed               —
MT               n ptrs              n(m..m^2) log_m n         not analyzed               —
SAT              n ptrs              n log n / log log n       n^(1−Θ(1/log log n))       —
AESA             n^2 dsts            n^2                       O(1) (d)                   n..n^2
LAESA            kn dsts             kn                        k + O(1) (d)               log n..kn
LC               n ptrs              n^2/m                     not analyzed               n/m

(a) Time complexity considers only n, not other parameters such as dimension. Space complexity mentions the most expensive storage units used (“ptrs” is short for “pointers” and “dsts” for “distances”). α is a number between 0 and 1, different for each structure, whereas the other letters are parameters particular to each structure.
(b) If h = log n.
(c) Only valid when searching with very small radii.
(d) Empirical conclusions without analysis; in the case of LAESA, for “large enough” k.

Each time we perform a comparison (against pivots or bucket elements u) where d(q, u) ≤ r, we report the element u. The triangle inequality ensures that we cannot miss an answer. All the subtrees not traversed contain elements u that are at distance d(u, p) = i from some node p, where |d(p, q) − i| > r. By the triangle inequality, d(p, q) ≤ d(p, u) + d(u, q), and therefore d(u, q) ≥ d(p, q) − d(p, u) > r.
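A compact sketch of the BKT, assuming an integer-valued distance function; for brevity the bucket of size b is omitted and recursion continues down to single elements.

```python
def build_bkt(elements, d):
    """Burkhard-Keller tree: the first element acts as pivot; the rest are
    grouped by their (discrete) distance to it and built recursively."""
    if not elements:
        return None
    pivot, rest = elements[0], elements[1:]
    groups = {}
    for u in rest:
        groups.setdefault(d(pivot, u), []).append(u)
    return (pivot, {i: build_bkt(g, d) for i, g in groups.items()})

def bkt_range_query(node, q, r, d, out):
    """Enter only the children labeled i with d(p, q) - r <= i <= d(p, q) + r."""
    if node is None:
        return
    pivot, children = node
    dpq = d(pivot, q)
    if dpq <= r:
        out.append(pivot)
    for i, child in children.items():
        if dpq - r <= i <= dpq + r:
            bkt_range_query(child, q, r, d, out)
```

Calling bkt_range_query(tree, q, r, d, out) with an empty list out leaves the outcome of (q, r)d in out.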

Figure 2 shows an example where the element u11 has been selected as the root. We have built only the first level of the BKT for simplicity. A query q is also shown, and we have emphasized the branches of the tree that would have to be traversed. In this and all the examples of this section we discretize the distances of our example, so that they return integer values.

The results of Table I for BKTs are extrapolated from those made for fixed queries trees [Baeza-Yates et al. 1994], which can be easily adapted to this case. The only difference is that the space overhead of BKTs is O(n) because there is exactly one element of the set per tree node.

5.1.1.2. FQT. A further development over BKTs is the fixed queries tree or FQT [Baeza-Yates et al. 1994]. This tree is basically a BKT where all the pivots stored in the nodes of the same level are the same (and of course do not necessarily belong to the set stored in the subtree). The actual elements are all stored at the leaves. The advantage of such a construction is that some comparisons between the query and the nodes are saved along the backtracking that occurs in the tree. If we visit many nodes of the same level, we do not need to perform more than one comparison because all the pivots in that level are the same. This is at the expense of somewhat taller trees. FQTs are shown experimentally in Baeza-Yates et al. [1994] to perform fewer distance evaluations at query time using a couple of different metric space examples. Under some simplifying assumptions (experimentally validated in the paper) they show that FQTs built over n elements are O(log n) height on average, are built using O(n log n) distance evaluations, and that the average number of distance computations is O(n^α), where 0 < α < 1 is a number that depends on the range of the search and the structure of the space (this analysis is easy to extend to BKTs as well).


Fig. 2. On the left, the division of the space obtained when u11 is taken as a pivot. On the right, the first level of a BKT with u11 as root. We also show a query q and the branches that it has to traverse. We have discretized the distances so they return integer values.

The space complexity is superlinear since, unlike BKTs, it is not true that a different element is placed at each node of the tree. An upper bound is O(n log n) since the average height is O(log n).

5.1.1.3. FHQT. In Baeza-Yates et al. [1994] and Baeza-Yates [1997], the authors propose a variant called “fixed-height FQT” (or FHQT for short), where all the leaves are at the same depth h, regardless of the bucket size. This makes some leaves deeper than necessary, which makes sense because we may have already performed the comparison between the query and the pivot of an intermediate level, therefore eliminating for free the need to consider the leaf. In Baeza-Yates [1997] and Baeza-Yates and Navarro [1998] it is shown that by using O(log n) pivots, the search takes O(log n) distance evaluations (although the cost depends exponentially on the search radius r). The extra CPU time, that is, the number of nodes traversed, remains, however, O(n^α). The space, like FQTs, is somewhere between O(n) and O(nh). In practice the optimal h = O(log n) cannot be achieved because of space limitations.

5.1.1.4. FQA. In Chávez et al. [2001], the fixed queries array (FQA) is presented. The FQA, although not properly a tree, is no more than a compact representation of the FHQT. Imagine that an FHQT of fixed height h is built on the set. If we traverse the leaves of the tree left to right and put the elements in an array, the result is the FQA. For each element of the array we compute h numbers representing the branches to take in the tree to reach the element from the root (i.e., the distances to the h pivots). Each of these h numbers is coded in b bits and they are concatenated in a single (long) number so that the higher levels of the tree are the most significant digits.

As a result the FQA is sorted by the resulting hb-bit numbers, each subtree of the FHQT corresponds to an interval in the FQA, and each movement in the FHQT is simulated with two binary searches in the FQA (at an O(log n) extra CPU cost factor, but no extra distances are computed). There is a similarity between this idea and suffix trees versus suffix arrays [Frakes and Baeza-Yates 1992]. This idea of using fewer bits to represent the distances appeared also in the context of vector spaces [Weber et al. 1998].

Using the same memory, the FQA simulation is able to use many more pivots than the original FHQT, which improves the efficiency. The b bits needed by each pivot can be lowered by merging branches of the FHQT, trying to leave about the same number of elements in each cell of the next level. This allows using even more pivots with the same space usage. For reasons that are made clear later, the FQA is also called FMVPA in this work.
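A hedged sketch of the FQA key layout only, assuming h pivots and b bits per coded distance, and simply clamping each distance into b bits instead of the branch-merging scheme described above; the binary searches that simulate FHQT traversal are omitted.

```python
def fqa_key(u, pivots, d, b):
    """Concatenate the distances to the h pivots, b bits each, with the
    first pivot (highest tree level) in the most significant position."""
    key = 0
    for p in pivots:
        key = (key << b) | min(d(u, p), (1 << b) - 1)   # clamp to b bits
    return key

def build_fqa(U, pivots, d, b):
    """The FQA is the set of elements sorted by their hb-bit keys."""
    return sorted(((fqa_key(u, pivots, d, b), u) for u in U), key=lambda t: t[0])
```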


Fig. 3. Example BKT, FQT, FHQT, and FQA for our set of points. We use b = 2 for the BKT and FQT, and h = 2 for the FHQT and FQA.

Figure 3 shows an arbitrary BKT, FQT, FHQT, and FQA built on our set of points. Notice that, while in the BKT there is a different pivot per node, in the others there is a different pivot per level, the same for all the nodes of that level.

5.1.1.5. Hybrid. In Shapiro [1977], the use of more than one element per node of the tree is proposed. Those k elements allow eliminating more elements per level at the cost of doing more distance evaluations. The same effect would be obtained if we had a mixture between BKTs and FQTs, so that for k levels we had fixed keys per level, and then we allowed a different key per node of the level k + 1, continuing the process recursively on each subtree of the level k + 1. The authors conjecture that the pivots should be selected to be outside the clusters.

5.1.1.6. Adapting to Continuous Functions. If we have a continuous distance or if it gives too many different values, it is not possible to have a child of the root for any such value. If we did that, the tree would degenerate into a flat tree of height 2, and the search algorithm would be almost like sequential searching for the BKT and FQT. FHQTs and FQAs do not degenerate in this sense, but they lose their sublinear extra CPU time.

In Baeza-Yates et al. [1994] the authors mention that the structures can be adapted to a continuous distance by assigning a range of distances to each branch of the tree. However, they do not specify how to do this. Some approaches explicitly defined for continuous functions are explained later (VPTs and others), which assign the ranges trying to leave the same number of elements at each class.

5.1.2. Trees for Continuous Distance Functions. We now present the data structures designed for the continuous case. They also can be used for discrete distance functions with virtually no modifications.

5.1.2.1. VPT. The “metric tree” is presented in Uhlmann [1991b] as a tree data structure designed for continuous distance functions. More complete work on the same idea [Yianilos 1993; Chiueh 1994] calls them “vantage-point trees” or VPTs. They build a binary tree recursively, taking any element p as the root and taking the median of the set of all distances, M = median{d(p, u) / u ∈ U}.


Fig. 4. Example VPT with root u11. We plot the radius M used for the root. For the first levels we show explicitly the radii used in the tree.

elements u such that d (p, u) ≤ M are in-serted into the left subtree, while thosesuch that d (p, u) > M are inserted intothe right subtree. The VPT takes O(n)space and is built in O(n log n) worst casetime, since it is balanced. To solve a queryin this tree, we measure d = d (q, p). Ifd − r ≤ M we enter into the left subtree,and if d + r > M we enter into the rightsubtree (notice that we can enter into bothsubtrees). We report every element consid-ered that is close enough to the query. SeeFigure 4.
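The construction and search just described fit in a few lines; the following Python sketch (our own illustration, not code from the cited papers) builds a VPT by the median rule and answers range queries with the two entrance conditions d − r ≤ M and d + r > M.

```python
import statistics

def build_vpt(elements, dist):
    # Take the first element as the vantage point and split the rest by the
    # median M of their distances to it (left: <= M, right: > M).
    if not elements:
        return None
    p, rest = elements[0], elements[1:]
    if not rest:
        return {"pivot": p, "median": 0.0, "left": None, "right": None}
    dists = [dist(p, u) for u in rest]
    m = statistics.median(dists)
    left  = [u for u, du in zip(rest, dists) if du <= m]
    right = [u for u, du in zip(rest, dists) if du > m]
    return {"pivot": p, "median": m,
            "left": build_vpt(left, dist), "right": build_vpt(right, dist)}

def vpt_range_query(node, q, r, dist, out=None):
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(q, node["pivot"])
    if d <= r:
        out.append(node["pivot"])
    if d - r <= node["median"]:      # the query ball may reach the left part
        vpt_range_query(node["left"], q, r, dist, out)
    if d + r > node["median"]:       # ... and/or the right part
        vpt_range_query(node["right"], q, r, dist, out)
    return out
```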

The query complexity is argued to be O(log n) in Yianilos [1993], but as pointed out, this is true only for very small search radii, too small to be an interesting case.

In trees for discrete distance functions, the exact distance between an element in the leaves and any pivot in the path to the root can be inferred. However, here we only know that the distance is larger or smaller than M. Unlike the discrete case, it is possible that we arrive at an element in a leaf which we do not need to compare, but the tree does not have enough information to discover that. Some of those lost exact distances can be stored explicitly, as proposed in Yianilos [1993], to prune more elements before checking them. Finally, Yianilos [1993] considers the problem of pivot selection and argues that it is better to take elements far away from the set.

5.1.2.2. MVPT. The VPT can be extended to m-ary trees by using the m − 1 uniform percentiles instead of just the median. This is suggested in Brin [1995] and Bozkaya and Ozsoyoglu [1997]. In the latter, the “multi-vantage-point tree” (MVPT) is presented. They propose the use of many elements in a single node, much as in Shapiro [1977]. It can be seen that the space is O(n), since each internal node needs to store the m percentiles but the leaves do not. The construction time is O(n log n) if we search the m percentiles hierarchically at O(n log m) instead of O(mn) cost. Bozkaya and Ozsoyoglu [1997] show experimentally that the idea of m-ary trees slightly improves over VPTs (and not in all cases), while a larger improvement is obtained by using many pivots per node. The analysis of query time for VPTs can be extrapolated to MVPTs in a straightforward way.

5.1.2.3. VPF. Another generalization of the VPT is given by the VPF (shorthand for excluded middle vantage point forest) [Yianilos 1999]. This algorithm is designed for radii-limited NN search (an NN(q) query with a maximum radius r*), but in fact the technique is perfectly compatible with a range search query. The method consists of excluding, at each level, the elements at intermediate distances to their pivot (this is the most populated part of the set): if r0 and rn stand for the closest and farthest elements to the pivot p, the elements u ∈ U such that d(p, r0) + δ ≤ d(p, u) ≤ d(p, rn) − δ are excluded from the tree. A second tree is built with the excluded “middle part” of the first tree, and so on to obtain a forest. With this idea they eliminate the backtracking when searching with a radius r* ≤ (rn − r0 − 2δ)/2, and in return they have to search all the trees of the forest. The VPF, of O(n) size, is built using O(n^(2−ρ)) time and answers queries in O(n^(1−ρ) log n) distance evaluations, where 0 < ρ < 1 depends on r*. Unfortunately, to achieve ρ > 0, r* has to be quite small.

Fig. 5. Example of the first level of a BST or GHT and a query q. Either using covering radii (BST) or hyperplanes (GHT), both subtrees have to be considered in this example.

5.1.2.4. BST. In Kalantari and McDonald [1983], the “bisector trees” (BSTs) are proposed. The BST is a binary tree built recursively as follows. At each node, two “centers” c1 and c2 are selected. The elements closer to c1 than to c2 go into the left subtree and those closer to c2 into the right subtree. For each of the two centers, its “covering radius” is stored, that is, the maximum distance from the center to any other element in its subtree. At search time, we enter into each subtree if d(q, ci) − r is not larger than the covering radius of ci. That is, we can discard a branch if the query ball (i.e., the hypersphere of radius r centered at the query) does not intersect the ball that contains all the elements inside that branch. In Noltemeier et al. [1992], the “monotonous BST” is proposed, where one of the two elements at each node is indeed the parent center. This makes the covering radii decrease as we move downward in the tree. Figure 5 illustrates the first step of the tree construction.

5.1.2.5. GHT. Proposed in Uhlmann [1991b], the “generalized-hyperplane tree” (GHT) is identical in construction to a BST. However, the algorithm uses the hyperplane between c1 and c2 as the pruning criterion at search time, instead of the covering radius. At search time we enter into the left subtree if d(q, c1) − r < d(q, c2) + r and into the right subtree if d(q, c2) − r ≤ d(q, c1) + r. Again, it is possible to enter into both subtrees. In Uhlmann [1991b] it is argued that GHTs could work better than VPTs in high dimensions. The same idea of reusing the parent node is proposed in Bugnion et al. [1993], this time to avoid performing two distance evaluations at each node.
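A minimal sketch of the GHT traversal follows; the node layout (GhtNode, with two centers and two subtrees) is a hypothetical choice of this illustration, and the pruning tests are literal transcriptions of the two conditions above.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class GhtNode:                       # hypothetical node layout for this sketch
    c1: Any
    c2: Any
    left: Optional["GhtNode"] = None
    right: Optional["GhtNode"] = None

def ght_range_query(node, q, r, dist, out):
    if node is None:
        return out
    d1, d2 = dist(q, node.c1), dist(q, node.c2)
    if d1 <= r:
        out.append(node.c1)
    if d2 <= r:
        out.append(node.c2)
    if d1 - r < d2 + r:              # the ball may lie on c1's side of the hyperplane
        ght_range_query(node.left, q, r, dist, out)
    if d2 - r <= d1 + r:             # the ball may lie on c2's side
        ght_range_query(node.right, q, r, dist, out)
    return out
```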

5.1.2.6. GNAT. The GHT is extended in Brin [1995] to an m-ary tree, called GNAT (geometric near-neighbor access tree), keeping the same essential idea. We select, for the first level, m centers c1, . . . , cm, and define Ui = {u ∈ U, d(ci, u) < d(cj, u), ∀j ≠ i}. That is, Ui are the elements closer to ci than to any other cj. From the root, m children numbered i = 1 . . m are built, each one recursively as a GNAT for Ui. Figure 6 shows a simple example of the first level of a GNAT. Notice the relationship between this idea and a Voronoi-like partition of a vector space [Aurenhammer 1991].


Fig. 6. Example of the first level of a GNAT with m = 4.

The search algorithm, however, is quite different. At indexing time, the GNAT stores at each node an O(m²) size table range_ij = [min{d(ci, u) : u ∈ Uj}, max{d(ci, u) : u ∈ Uj}], which stores minimum and maximum distances from each center to each class. At search time the query q is compared against some center ci and then it discards any other center cj such that d(q, ci) ± r does not intersect range_ij. The whole subtree Uj can be discarded using the triangle inequality. The process is repeated with random centers until none can be discarded. The search then enters recursively into each nondiscarded subtree. In the process, any center close enough to q is reported.
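As an illustration of the table-based pruning, the sketch below (hypothetical names, not the original code) assumes ranges[i][j] stores the precomputed (minimum, maximum) of d(ci, u) over u ∈ Uj; the loop mirrors the repeated random-center elimination and returns the indices of the classes that still must be entered recursively.

```python
import random

def gnat_prune(q, r, centers, ranges, dist, report):
    """One GNAT node: eliminate classes with the precomputed range table."""
    candidates = set(range(len(centers)))
    for i in random.sample(range(len(centers)), len(centers)):
        if i not in candidates:
            continue                        # this center was already discarded
        dqi = dist(q, centers[i])           # one distance evaluation
        if dqi <= r:
            report(centers[i])              # the center itself is an answer
        for j in tuple(candidates):
            lo, hi = ranges[i][j]
            if dqi + r < lo or dqi - r > hi:
                candidates.discard(j)       # [dqi - r, dqi + r] misses U_j entirely
    return candidates                       # classes still to be entered recursively
```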

The authors use a gross analysis to show that the tree takes O(nm²) space and is built in close to O(nm log_m n) time. Experimental results show that the GHT is worse than the VPT, which is only beaten with GNATs of arities between 50 and 100. Finally, they mention that the arities of the subtrees could depend on their depth in the tree, but give no clear criteria to do this.

5.1.2.7. VT. The “Voronoi tree” (VT) is proposed in Dehne and Noltemeier [1987] as an improvement over BSTs, where this time the tree has two or three elements (and children) per node. When a new tree node has to be created to hold an inserted element, its closest element from the parent node is also inserted in the new node. VTs have the property that the covering radius is reduced as we move downwards in the tree, which provides better packing of elements in subtrees. It is shown in Dehne and Noltemeier [1987] that VTs are superior and better balanced than BSTs. In Noltemeier [1989] they show that balanced VTs can be obtained by insertion procedures similar to those of B-trees, a fact later exploited in M-trees (see next).

5.1.2.8. MT. The M-tree (MT) data structure is presented in Ciaccia et al. [1997], aiming at providing dynamic capabilities and good I/O performance in addition to few distance computations. The structure has some resemblances to a GNAT, since it is a tree where a set of representatives is chosen at each node and the elements closer to each representative² are organized into a subtree rooted by that representative. The search algorithm, however, is closer to BSTs. Each representative stores its covering radius. At query time, the query is compared to all the representatives of the node and the search algorithm enters recursively into all those that cannot be discarded using the covering radius criterion.

The main difference of the MT is the way in which insertions are handled. An element is inserted into the “best” subtree, defined as that causing the subtree covering radius to expand less (zero expansion is the ideal), and in case of ties selecting the closest representative. Finally, the element is added to the leaf node and if the node overflows (i.e., becomes of size m + 1) it is split in two and one node element is promoted upwards, as in a B-tree or an R-tree [Guttman 1984]. Hence the MT is a balanced data structure, much as the VP family. There are many criteria to select the representative and to split the node, the best results being obtained by trying a split that minimizes the maximum of the two covering radii obtained. They show experimentally that the MT is resistant to the dimensionality of the space and that it is competitive against R*-trees.

² There are many variants but this is reported as the most effective.

5.1.2.9. SAT. The SAT algorithm (spatial approximation tree) [Navarro 1999] does not use centers to split the set of candidate objects, but rather relies on “spatial” approximation. An element p is selected as the root of a tree, and it is connected to a set of “neighbors” N, defined as a subset of elements u ∈ U such that u is closer to p than to any other element in N (note that the definition is self-referential). The other elements (not in N ∪ {p}) are assigned to their closest element in N. Each element in N is recursively the root of a new subtree containing the elements assigned to it.

This allows searching for elements with radius zero by simply moving from the root to its “neighbor” (i.e., connected element) which is closest to the query q. If a radius r > 0 is allowed, then we consider that an unknown element q′ ∈ U is searched with tolerance zero, from which we only know that d(q, q′) ≤ r. Hence, we search as before for q and consider that any distance measure may have an “error” of at most ±r. Therefore, we may have to enter into many branches of the tree (not only the closest one), since the measuring “error” could permit a different neighbor to be the closest one. That is, if c ∈ N is the closest neighbor of q, we enter into all c′ ∈ N such that d(q, c′) − r ≤ d(q, c) + r. The tree is built in O(n log n / log log n) time, takes O(n) space, and inspects O(n^(1−Θ(1/log log n))) elements at query time. Covering radii are also used to increase pruning. Figure 7 shows an example and the search path for a query.

Fig. 7. Example of a SAT and the traversal towards a query q, starting at u11.

5.1.3. Other Techniques

5.1.3.1. AESA. An algorithm that is close to many of the presented ideas but performs surprisingly better, by an order of magnitude, is that of Vidal [1986] (called AESA, for approximating eliminating search algorithm). The structure is simply a matrix with the n(n − 1)/2 precomputed distances among the elements of U. At search time, they select an element p ∈ U at random and measure rp = d(p, q), eliminating all elements u of U that do not satisfy rp − r ≤ d(u, p) ≤ rp + r. Notice that all the d(u, p) distances are precomputed, so only d(p, q) has been calculated at search time. This process of taking a random pivot among the (not yet eliminated) elements of U and eliminating more elements from U is repeated until the candidate set is empty and the query has been satisfied. Figure 8 shows an example with a first pivot u11.
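A compact sketch of this search loop follows, assuming (our choice for the illustration) that the precomputed distances are available as a nested mapping precomp[u][v] and that database elements are hashable; the only distance computed per iteration is d(q, p).

```python
import random

def aesa_range_query(db, precomp, q, r, dist):
    alive = set(db)                         # not yet eliminated elements
    out = []
    while alive:
        p = random.choice(tuple(alive))     # next pivot, taken from the survivors
        alive.discard(p)
        dqp = dist(q, p)                    # the only distance computed this round
        if dqp <= r:
            out.append(p)
        # Keep only the u that satisfy dqp - r <= d(u, p) <= dqp + r,
        # using the precomputed d(u, p) values.
        alive = {u for u in alive if abs(precomp[p][u] - dqp) <= r}
    return out
```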

Fig. 8. Example of the first iteration of AESA. The points between both rings centered at u11 qualify for the next iteration.

Although this idea seems very similar to FQTs, there are some key differences. The first one, only noticeable in continuous spaces, is that there are no predefined “rings” such that all the intersected rings qualify (recall Figure 2); instead, only the minimal necessary area of the rings qualifies. The second difference is that the second element to compare against q is selected from the qualifying set, instead of from the whole set as in FQTs. Finally, the algorithm uses every distance computation as a pivoting operation, while FQTs must precompute that decision (i.e., bucket size).

The problem with the algorithm [Vidal 1986] is that it needs O(n²) space and construction time. This is unacceptably high for all but very small databases. In this sense the approach is close to Sasha and Wang [1990], although in this latter case they may take fewer distances and bound the unknown ones. AESA is experimentally shown to have O(1) query time.

5.1.3.2. LAESA and Variants. In a newer version of AESA, called LAESA (for linear AESA), Mico et al. [1994] propose using k fixed pivots, so that the space and construction time is O(kn). In this case, the only difference with an FHQT is that fixed rings are not used, but the exact set of elements in the range is retrieved. FHQT uses fixed rings to reduce the extra CPU time, while in this case no such algorithm is given. In LAESA, the elements are simply linearly traversed, and those that cannot be eliminated after considering the k pivots are directly compared against the query.
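The linear traversal can be sketched as follows (an illustration under hypothetical names), assuming pivot_dists[u] holds the k precomputed distances of u to the fixed pivots, i.e., the O(kn) table.

```python
def laesa_range_query(db, pivots, pivot_dists, q, r, dist):
    dq = [dist(q, p) for p in pivots]       # k distance evaluations
    out = []
    for u in db:                            # linear scan, as in the text
        du = pivot_dists[u]
        # u survives only if no pivot can exclude it by the triangle inequality.
        if all(abs(dq[i] - du[i]) <= r for i in range(len(pivots))):
            if dist(q, u) <= r:             # survivor: compare directly against q
                out.append(u)
    return out
```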

A way to reduce the extra CPU time is presented later in Mico et al. [1994], which builds a GHT-like structure using the same pivots. The algorithm is argued to be sublinear in CPU time. Alternative search structures to reduce the CPU time without losing information on distances are presented in Nene and Nayar [1997] and Chavez et al. [1999], where the distances to each pivot are sorted separately so that the relevant range [d(q, p) − r, d(q, p) + r] can be binary searched.³ Extra pointers are added to be able to trace an element across the different orderings for each pivot (this needs more space, however).

5.1.3.3. Clustering and LCs. Clustering is a very wide area with many applications [Jain and Dubes 1988]. The general goal is to divide a set into subsets of elements close to each other in the same subset. A few approaches to index metric spaces based on clustering exist.

One of them, LC (for “list of clusters”) [Chavez and Navarro 2000], chooses an element p and finds the m closest elements in U. This is the cluster of p. The process continues recursively with the remaining elements until a list of n/(m + 1) clusters is obtained. The covering radius cr() of each cluster is stored. At search time the clusters are inspected one by one. If d(q, pi) − r > cr(pi) we do not enter the cluster of pi; otherwise we verify it exhaustively. If d(q, pi) + r < cr(pi) we do not need to consider subsequent clusters. The authors report unbeaten performance on high dimensional spaces, at the cost of O(n²/m) construction time.
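A sketch of the LC search is given below, assuming (our convention for the illustration) that the index is a list of (center, covering radius, bucket) triples in construction order; both rules above appear literally.

```python
def lc_range_query(clusters, q, r, dist):
    out = []
    for center, cr, bucket in clusters:
        dqc = dist(q, center)
        if dqc <= r:
            out.append(center)
        if dqc - r <= cr:                   # the query ball intersects this cluster
            out.extend(u for u in bucket if dist(q, u) <= r)
        if dqc + r < cr:                    # the ball lies fully inside: stop here
            break
    return out
```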

5.2. Nearest-Neighbor Queries

We have concentrated on range search queries up to now. This is because, as we show in this section, most of the existing solutions for NN-queries are built systematically over range searching techniques, and indeed can be adapted to any of the data structures presented (despite having been originally designed for specific ones).

³ Although Nene and Nayar [1997] consider only vector spaces, the same technique can be used here.


5.2.1. Increasing Radius. The simplest NN search algorithm is based on using a range searching algorithm as follows. Search q with fixed radii r = a^i ε (a > 1), starting with i = 0 and increasing it until at least the desired number of elements (1 or k) lies inside the search radius r = a^i ε.

Since the complexity of the range search normally grows sharply with the search radius, the cost of this method can be very close to the cost of a range search with the appropriate r (which is not known in advance). The increasing steps can be made smaller (a → 1) to avoid searching with a radius much larger than necessary.
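As a sketch, the strategy only needs an existing range-search routine; here range_search(q, r) is assumed (our convention) to return the elements within distance r of q paired with their distances.

```python
def nn_by_increasing_radius(range_search, q, k=1, eps=0.01, a=2.0):
    # range_search(q, r) is assumed to return [(u, d(q, u)), ...] for d(q, u) <= r.
    r = eps
    while True:
        res = range_search(q, r)
        if len(res) >= k:
            return sorted(res, key=lambda t: t[1])[:k]   # the k closest found
        r *= a                       # r = eps * a^i, as in the text
```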

5.2.2. Backtracking with Decreasing Radius. A more elaborate technique is as follows. We first explain the search for the closest neighbor. Start the search on any data structure using r* = ∞. Each time q is compared to some element p, update the search radius as r* ← min(r*, d(q, p)) and continue the search with this reduced radius. This has been proposed, for example, for BKTs and FQTs [Burkhard and Keller 1973; Baeza-Yates et al. 1994] as well as for vector spaces [Roussopoulos et al. 1995].

As closer and closer elements to q are found, we search with a smaller radius and the search becomes cheaper. For this reason it is important to try to find quickly elements that are close to the query (which is unimportant in range queries). The way to achieve this is dependent on the particular data structure. For example, in BKTs and FQTs we can begin at the root and measure i = d(p, q). Now, we consider the edges labeled i, i − 1, i + 1, i − 2, i + 2, and so on, and proceed recursively in the children (other heuristics may work better). Therefore, the exploration ends just after considering the branch i + r* (r* is reduced along the process). At the end r* is the distance to the closest neighbors and we have already seen all of them.

NN_k(q) queries are solved as an extension of the above technique, where we keep the k elements seen that are closest to q and set r* as the maximum distance between those elements and q (clearly we are not interested in elements farther away than the current kth closest element). Each time a new element is seen whose distance is relevant, it is inserted as one of the k nearest neighbors known up to now (possibly displacing one of the old candidates from the list) and r* is updated. In the beginning we start with r* = ∞ and keep this value until the first k elements are found.
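For concreteness, here is the shrinking-radius idea applied to the VPT sketched earlier (a hypothetical illustration, not taken from the cited works): the side where q falls is visited first so that r* decreases quickly, and the other side is entered only if the current r* still allows it.

```python
import math

def vpt_nn(node, q, dist, best=None):
    if best is None:
        best = [None, math.inf]             # [closest element seen, r*]
    if node is None:
        return best
    d = dist(q, node["pivot"])
    if d < best[1]:
        best[0], best[1] = node["pivot"], d
    # Visit first the side where q falls, so that r* shrinks quickly.
    near, far = ("left", "right") if d <= node["median"] else ("right", "left")
    vpt_nn(node[near], q, dist, best)
    # Enter the far side only if the (possibly reduced) r* still requires it.
    if (far == "right" and d + best[1] > node["median"]) or \
       (far == "left" and d - best[1] <= node["median"]):
        vpt_nn(node[far], q, dist, best)
    return best
```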

A variant of this type of query is limited-radius NN searching. Here we start with the maximum expected distance between the query element and its nearest neighbor. This type of query has been the focus of Yianilos [1999, 2000].

5.2.3. Priority Backtracking. The previous technique can be improved by a smarter selection of which elements to consider first. For clarity we consider backtracking in a tree, although the idea is general. Instead of following the normal backtracking order of the range query, modifying at most the order in which the subtrees are traversed, we give much more freedom to the traversal order. The goal is to increase the probability of quickly finding elements close to q and therefore reduce r* fast. This technique has been used in vector and metric spaces several times [Uhlmann 1991a; Hjaltason and Samet 1995; Ciaccia et al. 1997].

At each point of the search we have a set of possible subtrees that can be traversed (not necessarily all at the same level). For each subtree, we know a lower bound to the distance between q and any element of the subtree. We first select subtrees with the smallest lower bound. Once a subtree has been selected we compare q against its root, update r* and the candidates for output if necessary, and determine which of the children of the considered root deserve traversal. Unlike the normal backtracking, those children are not immediately traversed but added to the set of subtrees that have to be traversed at some moment. Then we select again a subtree from the set.

The best way to implement this search is with a priority queue ordered by the heuristic “goodness,” where the subtrees are inserted and removed. We start with an empty queue where we insert the root of the tree. Then we repeat the step of removing the most promising subtree, processing it, and inserting the relevant subtrees until the lower bound of the best subtree exceeds r*.
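A generic sketch of this best-first traversal follows; lower_bound(node, q) and children(node) are assumed hooks supplied by the particular data structure, and each node is assumed to hold one element (node.element) that is compared against q — all of these names are choices of this illustration.

```python
import heapq, itertools, math

def best_first_nn(root, q, dist, lower_bound, children):
    tie = itertools.count()                      # break ties without comparing nodes
    heap = [(0.0, next(tie), root)]              # (lower bound, tie-breaker, subtree)
    best, r_star = None, math.inf
    while heap:
        bound, _, node = heapq.heappop(heap)
        if bound > r_star:                       # no remaining subtree can improve
            break
        d = dist(q, node.element)
        if d < r_star:
            best, r_star = node.element, d
        for child in children(node):
            lb = lower_bound(child, q)
            if lb <= r_star:                     # only relevant subtrees are queued
                heapq.heappush(heap, (lb, next(tie), child))
    return best, r_star
```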

If applied to a BKT or an FQT, this method yields the same result as that of the previous section, but this technique is superior for dealing with continuous distances.

5.2.4. Specific NN Algorithms. The techniques described above cover almost all the existing proposals for solving NN-queries. The only exception we are aware of was presented in Clarkson [1999], which is a GNAT-like data structure where the points are inserted into more than one subtree to limit backtracking (hence the space requirement is superlinear).

After selecting the representatives for the root of the tree, each element u is not only inserted into the subtree of its closest representative p, but also into the tree of any other representative p′ such that d(u, p′) ≤ 3d(u, p). At search time, the query q enters not only into its nearest representative p but also into every other representative p′ such that d(q, p′) ≤ 3d(q, p). As shown in Clarkson [1999], this is enough to guarantee that the nearest neighbor will be reached.

By using subsets of size n^(1/2^(k+1)) at depth k in the tree, the search time is polylogarithmic in n and the space requirement is O(n polylog n) if some conditions hold in the metric space.

5.3. Extensions

We cover in this section the work that has been pursued on extensions of the basic problems or in alternative models. None of these are the main focus of our survey.

5.3.1. Dynamic Capabilities. Many of the data structures for metric spaces are designed to be built on a static data set. In many applications this is not reasonable because elements have to be inserted and deleted dynamically. Some data structures tolerate insertions well, but not deletions.

We first consider insertions. Among the structures that we have surveyed, the least dynamic is the SAT, which needs full knowledge of the complete set at index construction time and has difficulty in handling later insertions (some solutions are described in Navarro and Reyes [2001]). The VP family (VPT, MVPT, VPF) has the problem of relying on global statistics (such as the median) to build the tree, so later insertions can be performed but the performance of the structure may deteriorate. Finally, the FQA needs, in principle, insertion in the middle of an array, but this can be handled by using standard techniques. All the other data structures can handle insertions in a reasonable way. There are some structure parameters that may depend on n and thus require periodical structural reorganization, but we disregard this issue here (e.g., adding or removing pivots is generally problematic).

Deletion is a little more complicated. In addition to the above structures, which present the same problems as for insertion, BKTs, GHTs, BSTs, VTs, GNATs, and the VP family cannot tolerate deletion of an internal tree node because it plays an essential role in organizing the subtree. Of course this can be handled by just marking the node as removed and actually keeping it for routing purposes, but the quality of the data structure is affected over time.

Therefore, the only structures that fully support insertions and deletions are those of the FQ family (FQT, FHQT, FQA, since there are no truly internal nodes), the AESA and LAESA approaches (since they are just vectors of coordinates), the MT (which is designed with dynamic capabilities in mind and whose insertion/deletion algorithms are reminiscent of those of the B-tree), and a variant of GHTs designed to support dynamic operations [Verbarg 1995]. The analysis of this latter structure shows that dynamic insertions can be done in O(log² n) amortized worst case time, and that deletions can be done at similar cost under some restrictions.


5.3.2. I/O Considerations. Most of the research on metric spaces deals with reducing the number of distance evaluations or at most the total CPU time. However, depending on the application, the I/O cost may play an important role. As most of the research on metric spaces has focused on algorithms to discard elements, I/O considerations have normally been left aside.

The only exception to the rule is MT, designed specifically for secondary memory. The tree nodes in the MT are to be stored in a single disk page (indeed, the MT does not fix an arity but rather a node capacity in bytes). Earlier balanced trees exist (such as the VP family), but the purpose of this balancing is to keep the extra CPU costs low. As we show later, unbalanced data structures perform much better in high dimensions and it is unlikely that the reduced CPU costs may play an important role. The purpose of balancing the MT, on the other hand, is to keep I/O costs low, and depending on the application this may be even more important than the number of distance evaluations.

Other data structures could probably be adapted to perform well in secondary memory, but the authors simply have not considered the problem. For instance, it is not hard to imagine strategies for the tree data structures to pack “compact” subtrees in disk pages, so as to make as good use as possible of a page that is read from disk. When a subtree grows larger than a page it is split into two pages of approximately the same number of nodes. Of course, a B-tree-like scheme such as that of the MT has to be superior in this respect. Finally, array-oriented approaches such as FQA, AESA, and LAESA are likely to read all the disk pages of the index for each query, and hence have bad I/O performance.

5.3.3. Approximate and Probabilistic Algorithms. For the sake of a complete overview we include a brief description of an important branch of similarity searching, where a relaxation on the query precision is allowed to obtain a speedup in the query time complexity. This is reasonable in some applications because the metric space modelization already involves an approximation to the true answer (recall Section 2), and therefore a second approximation at search time may be acceptable.

In addition to the query, one specifies a precision parameter ε to control how far away (in some sense) we want the outcome of the query from the correct result. A reasonable behavior for this type of algorithm is to asymptotically approach the correct answer as ε goes to zero, and complementarily to speed up the algorithm, losing precision, as ε moves in the opposite direction.

This alternative to exact similarity searching is called approximate similarity searching, and encompasses approximate and probabilistic algorithms. We do not cover them in depth here but present a few examples. Approximation algorithms for similarity searching are considered in depth in White and Jain [1996].

As a first example, we mention an approximate algorithm for NN search in real vector spaces using any Ls metric by Arya et al. [1994]. They propose a data structure, the BBD-tree, inspired by kd-trees, that can be used to find “(1 + ε)-nearest neighbors”: instead of finding u such that d(u, q) ≤ d(v, q) ∀v ∈ U, they find an element u*, a (1 + ε)-nearest neighbor, differing from u by a factor of (1 + ε), that is, u* such that d(u*, q) ≤ (1 + ε) d(v, q) ∀v ∈ U.

The essential idea behind this algorithm is to locate the query q in a cell (each leaf in the tree is associated with a cell in the decomposition). Every point inside the cell is processed to obtain the current nearest neighbor (u). The search stops when no promising cells are encountered, that is, when the radius of any ball centered at q and intersecting a nonempty cell exceeds the radius d(q, p)/(1 + ε). The query time is O(⌈1 + 6k/ε⌉^k k log n).

A second example is a probabilistic algorithm for vector spaces [Yianilos 2000]. The data structure is like a standard kd-tree, using “aggressive pruning” to improve the performance. The idea is to increase the number of branches pruned at the expense of losing some candidate points in the process. This is done in a controlled way, so the probability of success is always known. In spite of the vector space focus of the algorithm, it could be generalized to metric spaces as well. The data structure is useful for finding only limited-radius nearest neighbors, that is, neighbors within a fixed distance of the query.

Fig. 9. The unified model for indexing and querying metric spaces.

Finally, an example of a probabilistic NN algorithm for general metric spaces is that of Clarkson [1999]. The original intention is to build a Voronoi-like data structure on a metric space. As this is not possible in general because there is not enough knowledge of the characteristics of the queries that will come later [Navarro 1999], Clarkson [1999] chooses to have a “training set” of queries and to build a data structure able to answer correctly only queries belonging to the training set. The idea is that this is enough to answer an arbitrary query correctly with high probability. Under some probabilistic assumptions on the distribution of the queries, it is shown that the probability of not finding the nearest neighbor is O((log n)²/K), where K can be made arbitrarily large at the expense of O(Knρ) space and O(Kρ log n) expected search time. Here ρ is the logarithm of the ratio between the farthest and the nearest pairs of points in the union of U and the training set.

6. A UNIFYING MODEL

At first sight, the indexes and the search algorithms seem to emerge from a great diversity, and different approaches are analyzed separately, often under different assumptions. Currently, the only realistic way to compare two different algorithms is to apply them to the same data set.

In this section we make a formal introduction to our unifying model. Our intention is to provide a common framework to analyze all the existing approaches to proximity searching. As a result, we are able to capture the similarities of apparently different approaches. We also obtain truly new ways of viewing the problem.

The conclusion of this section can be summarized in Figure 9. All the indexing algorithms partition the set U into subsets. An index is built that allows determining a set of candidate subsets where the elements relevant to the query can appear. At query time, the index is searched to find the relevant subsets (the cost of doing this is called the “internal complexity”) and those subsets are checked exhaustively (which corresponds to the “external complexity” of the search).

The last two subsections describe the two main approaches to similarity searching in abstract terms.

6.1. Equivalence Relations

The relevance of equivalence classes for this survey comes from the possibility of partitioning a metric space so that a new metric space is derived from the quotient set. Readers familiar with equivalence relations can safely skip this short section.

Given a set X, a partition π(X) = {π1, π2, . . .} is a collection of pairwise disjoint subsets whose union is X; that is, ∪πi = X and ∀i ≠ j, πi ∩ πj = ∅.

A relation, denoted by ∼, is a subset of the crossproduct X × X (the set of ordered pairs) of X. Two elements x, y are said to be related, denoted by x ∼ y, if the pair (x, y) is in the subset. A relation ∼ is said to be an equivalence relation if it satisfies, for all x, y, z ∈ X, the properties of reflexivity (x ∼ x), symmetry (x ∼ y ⇔ y ∼ x), and transitivity (x ∼ y ∧ y ∼ z ⇒ x ∼ z).

Every partition π(X) induces an equivalence relation ∼ and, conversely, every equivalence relation induces a partition: two elements are related if they belong to the same partition element. Every element πi of the partition is then called an equivalence class. An equivalence class is often named after one of its representatives (any element of πi can be taken as a representative). An alternative definition of the equivalence class of an element x is the set of all y such that x ∼ y. We denote the equivalence class of x as [x] = {y, x ∼ y}.

Given the set X and an equivalence relation ∼, we obtain the quotient π(X) = X/∼. It indicates the set of equivalence classes (or just classes) obtained when applying the equivalence relation to the set X.

6.2. Indexing and Partitions

The equivalence classes in the quotient set π(X) of a metric space X can be considered themselves as objects in a new metric space, provided we define a distance function in π(X).

We introduce a new function D0 : π(X) × π(X) → R, now defined on the quotient.

Definition 1. Given a metric space (X, d) and a partition π(X), the extension of d to π(X) is defined as D0([x], [y]) = inf_{x∈[x], y∈[y]} {d(x, y)}.

D0 gives the maximum possible values that keep the mapping contractive (i.e., D0([x], [y]) ≤ d(x, y) for any x, y). Unfortunately, D0 does not satisfy the triangle inequality, just (p1) to (p3), and in most cases (p4) (recall Section 3.1). Hence, D0 itself is not suitable for indexing purposes.

However, we can use any metric D that lower bounds D0 (i.e., D([x], [y]) ≤ D0([x], [y])). Since D is a metric, (π(X), D) is a metric space and therefore we can make queries in π(X) in the same way we have done in X. We redefine the outcome of a query in π(X) as ([q], r)_D = {u ∈ U, D([u], [q]) ≤ r} (although formally we should retrieve classes, not elements).

Since the mapping is contractive (D([x], [y]) ≤ d(x, y)), we can convert one search problem into another, hopefully simpler, search problem. For a given query (q, r)_d we find out to which equivalence class the query q belongs (i.e., [q]). Then, using the new distance function D, the query is transformed into ([q], r)_D. As the mapping is contractive, we have (q, r)_d ⊆ ([q], r)_D. That is, ([q], r)_D is indeed a candidate list, so it is enough to perform an exhaustive search on that candidate list (now using the original distance d) to obtain the actual outcome of the query (q, r)_d.

Our main thesis is that the above procedure is in fact used in virtually every indexing algorithm (recall Figure 9).

PROPOSITION. All the existing indexing algorithms for proximity searching consist of building an equivalence relation, so that at search time some classes are discarded and the others are exhaustively searched.

As we show shortly, the most important tradeoff when designing the partition is to balance the cost to find ([q], r)_D and the cost to verify this candidate list.

Fig. 10. Two points x and y, and their equivalence classes (the shaded rings). D gives the minimal distance among rings, which lower bounds the distance between x and y.

In Figure 10 we can see a schematic example of the idea. We divide the space into several regions (equivalence classes). The objects inside each region become indistinguishable. We can consider them as elements in a new metric space. To find the answer, instead of exhaustively examining the entire dictionary we just examine the classes that contain potentially interesting objects. In other words, if a class can contain an element that should be returned in the outcome of the query, then the class will be examined (see also the rings considered in Figure 2).

We recall that this property is not enough for an arbitrary NN search algorithm to work (since the mapping would have to preserve proximity instead), but most existing algorithms for NN are based on range queries (recall Section 5.2), and these algorithms can be applied as well.

Some examples may help to understand the above definitions, for both the concept of equivalence relation and the obtained distance function.

Example 1. Say that we have an arbitrary reference pivot p ∈ X and the equivalence relation is given by x ∼ y ⇔ d(p, x) = d(p, y). In this case D([x], [y]) = |d(x, p) − d(y, p)| is a safe lower bound for the D0 distance (guaranteed by the triangle inequality). For a query of the form (q, r)_d the candidate list ([q], r)_D consists of all elements x such that D([q], [x]) ≤ r, or, which is the same, |d(q, p) − d(x, p)| ≤ r. Graphically, this distance represents a ring centered at p containing a ball centered at q with radius r (recall Figures 10 and 8). This is the familiar rule used in many independent algorithms to trim the space.

Example 2. As explained, the similarity search problem was first introduced in vector spaces, and the very first family of algorithms used there was based on a partition operation. These algorithms were called bucketing methods, and consisted of the construction of cells or buckets [Bentley et al. 1980]. Searching for an arbitrary point in R^k is converted into an exhaustive search in a finite set of cells. The procedure uses two steps: they find which cell the query point belongs to and build a set of candidate cells using the query range; then they inspect this set of candidate cells exhaustively to find the actual points inside the query range.⁴ In this case the equivalence classes are the cells, and the tradeoff is that the larger the cells, the cheaper it is to find the appropriate ones, but the more costly is the final exhaustive search.

6.3. Coarsening and Refining a Partition

We start by defining the concepts of refinement and coarsening.

Definition 2. Let ∼1 and ∼2 be two equivalence relations over a set X. We say that ∼1 is a refinement of ∼2, or that ∼2 is a coarsening of ∼1, if for any pair x, y ∈ X such that x ∼1 y it holds that x ∼2 y. The same terms can be applied to the corresponding partitions π1(X) and π2(X).

Refinement and coarsening are important concepts for the topic we are discussing. The following lemma shows the effect of coarsening on the effectiveness of the partition for searching purposes.

⁴ The algorithm is in fact a little more sophisticated because they try to find the nearest neighbor of a point. However, the version presented here for range queries is in the same spirit as the original one.


LEMMA 1. If ∼1 is a coarsening of ∼2, then their extended distances D1 and D2 have the property D1([x], [y]) ≤ D2([x], [y]).

PROOF. Let us denote [x]_i and [y]_i the equivalence classes of x and y under the equivalence relation ∼i. Then

D1([x], [y]) = inf_{x∈[x]_1, y∈[y]_1} {d(x, y)} ≤ inf_{x∈[x]_2, y∈[y]_2} {d(x, y)} = D2([x], [y]),

since [x]_2 ⊆ [x]_1 and [y]_2 ⊆ [y]_1.

An interesting idea arising from the above lemma is to build a hierarchy of coarsening operations. Using this hierarchy we could proceed downwards from a very coarse level, building a candidate list of equivalence classes of the next level. This candidate list will be refined using the next distance function, and so on, until we reach the bottom level.

6.4. Discriminative Power

As sketched previously, most indexing algorithms rely on building an equivalence relation. The corresponding search algorithms have the following parts.

1. Find the classes that may be relevant for the query.

2. Exhaustively search all the elements of these classes.

The first part involves performing some evaluations of the d distance, as shown in Example 1 above. It may also involve some extra CPU time (which, although not the central point in this article, must be kept reasonable). The second part consists of directly comparing the query against the candidate list. The following definition gives a name to both parts of the search cost.

Definition 3. Let A be a search algorithm over (X, d) based on a mapping to (π(X), D), and let (q, r)_d be a range query. Then the internal complexity of A is the number of evaluations of d necessary to compute ([q], r)_D, and the external complexity is |([q], r)_D|.

We recall that |([q], r)_D| refers to the number of elements in the original metric space, not the number of classes retrieved.

There is a concept related to the external complexity of a search algorithm, which we define next.

Definition 4. The discriminative power of a search algorithm based on a mapping from (X, d) to (π(X), D), with regard to a query (q, r)_d of nonempty outcome, is defined as |(q, r)_d| / |([q], r)_D|.

Although the definition depends on q and r, we can speak in general terms of the discriminative power by averaging over the qs and rs of interest. The discriminative power serves as an indicator of the performance or fitness of the equivalence relation.

In general, it will be more costly to have more discriminative power. The indexing scheme needs to find a balance between the complexity to find the relevant classes and the discriminative power.

Let us consider Example 1. The internal complexity is one distance evaluation (the distance from q to p), and the external complexity will correspond to the number of elements that lie in the selected ring. We could intersect it with more rings (increasing the internal complexity) to reduce the external complexity.

The tradeoff is partially formalized with this lemma.

LEMMA 2. If A1 and A2 are search algorithms based on equivalence relations ∼1 and ∼2, respectively, and ∼1 is a coarsening of ∼2, then A1 has higher external complexity than A2.

PROOF. We have to show that, for any r, ([q], r)_{D2} ⊆ ([q], r)_{D1}. But this is clear, since D1([x], [y]) ≤ D2([x], [y]) implies ([q], r)_{D2} = {y ∈ U, D2([q], [y]) ≤ r} ⊆ {y ∈ U, D1([q], [y]) ≤ r} = ([q], r)_{D1}.

Although having more discriminative power normally costs more internal evaluations, one can make better or worse use of the internal complexity. We elaborate more on this next.


Fig. 11. An equivalence relation induced by intersecting rings centered in two pivots, and how a query is transformed.

6.5. Locality of a Partition

The equivalence classes can be thought of as a set of nonintersecting cells in the space, where every element inside a given cell belongs to the same equivalence class. However, the mathematical definition of an equivalence class is not confined to a single cell. We define “locality” as a property of a partition that stands for how much the classes resemble cells.

Definition 5. The nonlocality of a partition π(X) = {π1, π2, . . .} with respect to a finite dictionary U is defined as max_i {max_{x,y∈πi∩U} d(x, y)}, that is, as the maximum distance between elements lying in the same class.

We say that a partition is “local” or “nonlocal,” meaning that it has high or low locality. Figure 11 shows an example of a nonlocal partition (u5 and u12 lie in separate fragments of a single class). It is natural to expect more discriminative power from a local partition than from a nonlocal one. This is because in a nonlocal partition the candidate list tends to contain elements actually far away from the query.

Notice that in Figure 11 the locality would improve sharply if we added a third pivot. In a vector space of k dimensions, it suffices to consider k + 1 pivots in general position⁵ to obtain a highly local partition. In general metric spaces we can also take a sufficient number of pivots so as to obtain highly local partitions.

⁵ That is, not lying on a (k − 1)-hyperplane.

However, obtaining local partitions may be expensive in internal complexity and not enough to achieve low external complexity; otherwise the bucketing method for vector spaces [Bentley et al. 1980] explained in Example 2 would have excellent performance. Even with such a local partition and assuming uniformly distributed data, a number of empty cells are verified, whose volume grows exponentially with the dimension. We return later to this issue.

6.6. The Pivot Equivalence Relation

A large class of algorithms to build the equivalence relations is based on pivoting. This consists of considering the distances between an element and a number of preselected “pivots” (i.e., elements of U or even X, also called reference points, vantage points, keys, queries, etc. in the literature).

The equivalence relation is defined in terms of the distances of the elements to the pivots, so that two elements are equivalent if they are at the same distance from all the pivots. If we consider one pivot p, then this equivalence relation is

x ∼_p y ⇔ d(x, p) = d(y, p).


Fig. 12. Mapping from a metric space onto a vector space under the L∞ metric, using two pivots.

The equivalence classes correspond to the intuitive notion of the family of sphere shells with center p. Points falling in the same sphere shell (i.e., at the same distance from p) are equivalent from the viewpoint of p.

The above equivalence relation is easily generalized to k pivots.

Definition 6. The pivot equivalence relation based on elements {p1, . . . , pk} (the k pivots) is defined as

x ∼_{pi} y ⇔ ∀i, d(x, pi) = d(y, pi).

A graphical representation of the class in the general case corresponds to the intersection of several sphere shells centered at the points pi (recall Figure 11).

The distance d(x, y) cannot be smaller than |d(x, p) − d(y, p)| for any element p, because of the triangle inequality. Hence D([x], [y]) = |d(x, p) − d(y, p)| is a safe lower bound to the D0 function corresponding to the class of sphere shells centered at p. With k pivots, this becomes D([x], [y]) = max_i {|d(x, pi) − d(y, pi)|}. This D distance lower bounds d and hence can be used as our distance in the quotient space.

Alternatively, we can consider the equivalence relation as a projection to the vector space R^k, k being the number of pivots used. The ith coordinate of an element is the distance of the element to the ith pivot. Once this is done, we can identify the elements of the original space with points in R^k and use the L∞ distance there. As we have described in Section 6, the indexing algorithm will consist of finding the set of equivalence classes that fall inside the radius of the search when using D in the quotient space. In this particular case, for a query of the form (q, r)_d we have to find the candidate list as the set ([q], r)_D, that is, the set of equivalence classes [y] such that D([q], [y]) ≤ r. In other words, we want the set of objects y such that max_i {|d(q, pi) − d(y, pi)|} ≤ r. This is equivalent to a search with the L∞ distance in the vector space R^k where the equivalence classes are projected. Figure 12 illustrates this concept (Figure 11 is also useful).

Yet a third way to see the technique, less formal but perhaps more intuitive, is as follows. To check if an element u ∈ U belongs to the query outcome, we try a number of random pivots pi. If, for any such pi, we have |d(q, pi) − d(u, pi)| > r, then by the triangle inequality we know that d(q, u) > r without the need to actually evaluate d(q, u). At indexing time we precompute the d(u, pi) values and at search time we compute the d(q, pi) values. Only those elements u that cannot be discarded by looking at the pivots are actually checked against q.
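All three views collapse into a few lines of code; the Python sketch below (our own illustration, with hypothetical names) maps every element to its vector of pivot distances, filters with the L∞ lower bound D, and reports the internal and external complexities in the sense of Definition 3.

```python
def pivot_mapping_search(db, pivots, dist, q, r):
    # Indexing time: the k coordinates of every element (precomputed off-line).
    phi = {u: tuple(dist(u, p) for p in pivots) for u in db}
    # Query time: k distance evaluations give the coordinates of q.
    qphi = tuple(dist(q, p) for p in pivots)                   # internal complexity: k
    D = lambda a, b: max(abs(x - y) for x, y in zip(a, b))     # L_inf, lower bounds d
    candidates = [u for u in db if D(phi[u], qphi) <= r]       # ([q], r)_D
    external = len(candidates)                                 # external complexity
    answer = [u for u in candidates if dist(q, u) <= r]        # (q, r)_d
    return answer, len(pivots), external
```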


Fig. 13. A compact partition using four centers and two query balls intersecting some classes. Note that the class of c3 could be excluded from consideration for q1 by using the covering radius criterion but not the hyperplane criterion, whereas the opposite happens when discarding c4 for q2.

6.7. The Compact Equivalence Relation

A different type of equivalence relation, used by another large class of algorithms, is defined with respect to the proximity to a set of elements (which we call “centers” to distinguish them from the pivots of the previous section).

Definition 7. The compact equivalence relation based on {c1, . . . , cm} (the centers) is

x ∼_{ci} y ⇔ closest(x, {ci}) = closest(y, {ci}), where closest(z, S) = {w ∈ S, ∀w′ ∈ S, d(z, w) ≤ d(z, w′)}. The associated partition is called a compact partition.

That is, we divide the space with one partition element for each ci, and the class of ci is that of the points that have ci as their closest center. Figure 13 shows an example in (R², L2). In particular, note that we can select U as the set of centers, in which case the partition has optimal locality. Even if {ci} is not U, compact partitions have good locality.

In vector spaces the compact partition coincides with the Voronoi partition if the whole set U is the set of centers. Its associated concept, the “Delaunay tesselation,” is a graph whose nodes are the elements of U and whose edges connect nodes whose classes share a border. The Delaunay tesselation is the basis of very good algorithms for proximity searching [Aurenhammer 1991; Yao 1990]. For example, an O(log n) time NN algorithm exists in two dimensions. Unfortunately, this algorithm does not generalize efficiently to more than two dimensions. The Delaunay tesselation, which has O(n) edges in two dimensions, can have O(n²) edges in three and more dimensions.

In a general metric space, the D0([x], [y]) distance in the space π(X) of the compact classes is, as before, the smallest distance between points x ∈ [x] and y ∈ [y]. To find [q] we basically need to find the nearest neighbor of q in the set of centers. The outcome of the query ([q], r)_{D0} is the set of classes intersected by a query ball (see Figure 13).

A problem in general metric spaces is that it is not easy to bound a class so as to determine whether the query ball intersects it. From the many possible criteria, two are the most popular.

6.7.1. Hyperplane Criterion. This is the most basic one and the one that best expresses the idea of compactness. In essence, if c is the center of the class [q] (i.e., the center closest to q), then (1) the query ball of course intersects [c]; (2) the query ball does not intersect [ci] if d(q, c) + r < d(q, ci) − r. Graphically, if the query ball does not intersect the hyperplane dividing its closest neighbor and another center ci, then the ball is totally outside the class of ci.

6.7.2. Covering Radius Criterion. This tries to bound the class [ci] by considering a ball centered at ci that contains all the elements of U that lie in the class.

Definition 8. The covering radius of c for U is cr(c) = max_{u∈[c]∩U} d(c, u).

Now it is clear that we can discard ci if d(q, ci) − r > cr(ci).
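Combining both criteria over a single level of centers can be sketched as follows (an illustration, not taken from any particular structure); cov_radius[c] is assumed to hold cr(c) computed at construction time.

```python
def compact_candidates(q, r, centers, cov_radius, dist):
    dq = {c: dist(q, c) for c in centers}
    closest = min(dq, key=dq.get)                    # the center of [q]
    survivors = []
    for c in centers:
        if dq[closest] + r < dq[c] - r:              # hyperplane criterion
            continue
        if dq[c] - r > cov_radius[c]:                # covering radius criterion
            continue
        survivors.append(c)                          # class [c] must be inspected
    return survivors
```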


Fig. 14. A low-dimensional (left) and high-dimensional (right) histogram of distances, showing that in high dimensions virtually all the elements become candidates for exhaustive evaluation. Moreover, we should use a larger r in the second plot in order to retrieve some elements.

7. THE CURSE OF DIMENSIONALITY

As explained, one of the major obstacles for the design of efficient search techniques on metric spaces is the existence and ubiquity in real applications of the so-called high-dimensional spaces. Traditional indexing techniques for vector spaces (such as kd-trees) have an exponential dependency on the representational dimension of the space (as the volume of a box or hypercube containing the answers grows exponentially with the dimension).

More recent indexing techniques for vector spaces and those for generic metric spaces can get rid of the representational dimension of the space. This makes a big difference in many applications that handle vector spaces of high representational dimension but low intrinsic dimension (e.g., a plane immersed in a 50-dimensional vector space, or simply clustered data). However, in some cases even the intrinsic dimension is very high and the problem becomes intractable for exact algorithms, and we have to resort to approximate or probabilistic algorithms (Section 5.3.3).

Our aim in this section is (a) to show that the concept of intrinsic dimensionality can be conceived even in a general metric space; (b) to give a quantitative definition of the intrinsic dimensionality; (c) to show analytically the reason for the so-called “curse of dimensionality”; and (d) to discuss the effects of pivot selection techniques. This is partially based on previous work [Chavez and Navarro 2001a].

7.1. Intrinsic Dimensionality

Let us start with a well-known example. Consider a distance such that d(x, x) = 0 and d(x, y) = 1 for all x ≠ y. Under this distance (in fact an equality test), we do not obtain any information from a comparison except that the element considered is or is not our query. It is clear that it is not possible to avoid a sequential search in this case, no matter how smart our indexing technique is.

Let us consider the histogram of distances between points in the metric space X. This can be approximated by using the dictionary U as a random sample of X. This histogram is mentioned in many papers, for example, Brin [1995], Chavez and Marroquin [1997], and Ciaccia et al. [1998a], as a fundamental measure related to the intrinsic dimensionality of the metric space. The idea is that, as the space has higher intrinsic dimension, the mean µ of the histogram grows and its variance σ² is reduced. Our previous example is an extreme case.

Figure 14 gives an intuitive explanation of why the search problem is harder when the histogram is concentrated. If we consider a random query q and an indexing scheme based on random pivots, then the possible distances between q and a pivot p are distributed according to the histogram of the figure. The elimination rule says that we can discard any point u such that d(p, u) ∉ [d(p, q) − r, d(p, q) + r]. The grayed areas in the figure show the points that we cannot discard. As the histogram is more and more concentrated around its mean, fewer points can be discarded using the information given by d(p, q).

This phenomenon is independent of the nature of the metric space (vectorial or not, in particular) and gives us a way to quantify how hard it is to search on an arbitrary metric space.

Definition 9. The intrinsic dimensionality of a metric space is defined as ρ = µ²/(2σ²), where µ and σ² are the mean and variance of its histogram of distances.

The technical convenience of the exact definition is made clear shortly. The important part is that the intrinsic dimensionality grows with the mean and decreases with the variance of the histogram.

The particular cases of the Ls distances in vector spaces are useful illustrations. As shown in Yianilos [1999], a uniformly distributed k-dimensional vector space under the Ls distance has mean Θ(k^{1/s}) and standard deviation Θ(k^{1/s−1/2}). Therefore its intrinsic dimensionality is Θ(k) (although the constant is not necessarily 1). So the intuitive concept of dimensionality in vector spaces matches our general concept of intrinsic dimensionality.
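As a concrete illustration of Definition 9, the following sketch (ours, not part of the original paper; the dataset, distance function, and sample sizes are arbitrary choices) estimates ρ by sampling random pairs of a dataset and plugging the empirical mean and variance of their distances into ρ = µ²/(2σ²).

    import random

    def intrinsic_dimensionality(sample, dist, pairs=10_000, rng=random.Random(0)):
        # Estimate rho = mu^2 / (2 sigma^2) from the empirical histogram of
        # distances between random pairs of the sample (Definition 9).
        ds = []
        for _ in range(pairs):
            x, y = rng.sample(sample, 2)
            ds.append(dist(x, y))
        mu = sum(ds) / len(ds)
        var = sum((d - mu) ** 2 for d in ds) / len(ds)
        return mu * mu / (2 * var)

    # Uniform points in [0,1]^k under L2: the estimate grows roughly linearly with k.
    def l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    for k in (2, 8, 32):
        pts = [[random.random() for _ in range(k)] for _ in range(2000)]
        print(k, round(intrinsic_dimensionality(pts, l2), 1))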

7.2. A Lower Bound for Pivoting Algorithms

Our main result in this section relates the intrinsic dimensionality with the difficulty of searching with a given search radius r using a pivoting equivalence relation that chooses the pivots at random. As we show next, the difficulty of the problem is related to r and the intrinsic dimensionality ρ.

We are considering independent identically distributed random variables for the distribution of distances between points. Although not accurate, this simplification is optimistic and hence can be used to lower bound the performance of the indexing algorithms. We come back to this shortly.

Let (q, r)_d be a range query over a metric space indexed by means of k random pivots, and let u be an element of U. The probability that u cannot be excluded from direct verification after considering the k pivots is exactly

(a) Pr(|d(q, p1) − d(u, p1)| ≤ r, . . . , |d(q, pk) − d(u, pk)| ≤ r).

Since all the pivots are assumed to be random and their distance distributions independent identically distributed random variables, this expression is the product of probabilities

(b) Pr(|d(q, p1) − d(u, p1)| ≤ r) × · · · × Pr(|d(q, pk) − d(u, pk)| ≤ r),

which for the same reason can be simplified to

Pr(not discarding u) = Pr(|d(q, p) − d(u, p)| ≤ r)^k

for a random pivot p. If X and Y are two independent identically distributed random variables with mean µ and variance σ², then the mean of X − Y is 0 and its variance is 2σ². Using Chebyschev’s inequality (for an arbitrary distribution Z with mean µ_z and variance σ_z², Pr(|Z − µ_z| > ε) < σ_z²/ε²) we have that Pr(|X − Y| > ε) < 2σ²/ε². Therefore,

Pr(|d(q, p) − d(u, p)| ≤ r) ≥ 1 − 2σ²/r²,

where σ² is precisely the variance of the distance distribution in our metric space. The argument that follows is valid for 2σ²/r² < 1, that is, r > √2 σ (large enough radii); otherwise the lower bound is zero. Then we have

Pr(not discarding u) ≥ (1 − 2σ²/r²)^k.

We have now that the total search cost is the number of internal distance evaluations (k) plus the external evaluations, whose number is on average n × Pr(not discarding u). Therefore

Cost ≥ k + n (1 − 2σ²/r²)^k

is a lower bound to the average search cost by using random pivots. Optimizing we obtain that the best k is

k* = (ln n + ln ln(1/t)) / ln(1/t),

where t = 1 − 2σ²/r². Using this optimal k*, we obtain an absolute (i.e., independent of k) lower bound for the average cost of any random pivot-based algorithm:

Cost ≥ (ln n + ln ln(1/t) + 1) / ln(1/t) ≥ ln n / ln(1/t) ≥ (r²/(2σ²)) ln n,

which shows that the cost depends strongly on σ/r. As r increases, t tends to 1, and the scheme requires more and more pivots and is anyway more and more costly.
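The following sketch (ours, with arbitrary example values for n, σ, and r) simply evaluates the optimal k* and the resulting lower bound k + n·t^k derived above, to show how quickly the bound grows once σ/r becomes small.

    import math

    def pivoting_lower_bound(n, sigma, r):
        # Evaluate t = 1 - 2*sigma^2/r^2, the optimal k*, and the bound k* + n*t^(k*).
        # Only meaningful for r > sqrt(2)*sigma; otherwise the bound is zero.
        t = 1.0 - 2.0 * sigma ** 2 / r ** 2
        if t <= 0.0:
            return 0.0, 0.0
        k_star = (math.log(n) + math.log(math.log(1.0 / t))) / math.log(1.0 / t)
        return k_star, k_star + n * t ** k_star

    for r in (1.5, 2.0, 3.0):                    # sigma = 1, n = 100,000 (assumed values)
        k_star, cost = pivoting_lower_bound(100_000, 1.0, r)
        print(f"r = {r}: k* ~ {k_star:.1f}, lower bound ~ {cost:.1f} evaluations")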

A nice way to represent this result is to assume that we are interested in retrieving a fixed fraction of at most f of the elements, in which case r can be written as r = µ − σ/√f by Chebyschev’s inequality. In this case the lower bound becomes

(r²/(2σ²)) ln n = ((µ − σ/√f)²/(2σ²)) ln n = ρ (1 − 1/√(2ρf))² ln n = (√ρ − 1/√(2f))² ln n,

which is valid for f ≥ 1/(2ρ). We have just proved the following.

THEOREM 1. Any pivot-based algorithm using random pivots has a lower bound (√ρ − 1/√(2f))² ln n in the average number of distance evaluations performed for a random range query retrieving at most a fraction f of the set, where ρ is the intrinsic dimension of the metric space.

This result matches that of Baeza-Yates [1997] and Baeza-Yates and Navarro [1998] on FHQTs, about obtaining Θ(log n) search time using Θ(log n) pivots, but here we are more interested in the “constant” term, which depends on the dimension.

The theorem shows clearly that the parameters governing the performance of range-searching algorithms are ρ and f. One can expect that f keeps constant as ρ grows, but it is possible that in some applications the query q is known to be a perturbation of some element of U and therefore we can keep a constant search radius r as the dimension grows. Even in those cases the lower bound may grow if σ shrinks with the dimension, as in the Ls vector spaces with s > 2.

We have considered independent identically distributed random variables for each pivot and the query. This is a reasonable approximation, as we do not expect much difference between the “viewpoints” from the general distribution of distances to the individual distributions (see Section 7.4). The expression given in Equation (7.2b) cannot be obtained without this simplification.

A stronger assumption comes from considering all the variables as independent. This is an optimistic consideration equivalent to assuming that in order to discard each element u of the set we take k new pivots at random. The real algorithm fixes k random pivots and uses them to try to discard all the elements u of the set. The latter alternative can suffer from dependencies from a point u to another, which cannot happen in the former case (e.g., if u is close to the third pivot and u′ is close to u, then the distance from u′ to the third pivot carries less information). Since the assumption is optimistic, using it to reduce the joint distribution in Equation (7.2a) to the expression given in Equation (7.2b) keeps the lower bound valid.

Figures 15 and 16 show an experiment on the search cost in (R^ℓ, L2) using a different number of pivots k and dimensions ℓ.


Fig. 15. Internal, external, and overall distance evaluations in eight dimensions, using different numbers of pivots k.

Fig. 16. Overall distance evaluations as the dimension grows for fixed k.


The n = 100,000 elements are generated at random and the pivots are randomly chosen from the set. We average over 1,000 random queries whose radius is set to retrieve 10 elements of the set. We count the number of distance evaluations. Figure 15 shows the existence of an optimum k* = 110, and Figure 16 shows the predicted O(n(1 − 1/Θ(ℓ))^k) behavior. We do not have enough memory in our machine to show the predicted growth in k* ≈ Θ(ℓ) ln n.
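A scaled-down version of this experiment is easy to reproduce. The sketch below is ours, not the authors' experimental code; the data size, radius, and number of queries are much smaller than in the figures. It builds a table of distances to k random pivots over uniform data in ([0,1]^dim, L2) and counts internal plus external distance evaluations per query.

    import math, random

    def simulate(n=5_000, dim=8, k=16, queries=20, radius=0.4, rng=random.Random(1)):
        # Plain pivot filtering: k internal evaluations per query, plus one external
        # evaluation for every element not discarded by the k pivots.
        def l2(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        data = [[rng.random() for _ in range(dim)] for _ in range(n)]
        pivots = rng.sample(data, k)
        table = [[l2(u, p) for p in pivots] for u in data]    # the index
        total = 0
        for _ in range(queries):
            q = [rng.random() for _ in range(dim)]
            qcoord = [l2(q, p) for p in pivots]                # internal complexity
            total += k
            for row in table:                                  # external complexity
                if all(abs(x - y) <= radius for x, y in zip(row, qcoord)):
                    total += 1
        return total / queries

    for k in (4, 16, 64):
        print(k, round(simulate(k=k)))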

Figure 17 shows the effect in a different way. As the dimension grows, the histogram of L2 moves to the right (µ = Θ(√ℓ)). Yet the pivot distance D (in the projected space (R^k, L∞)) remains about the same for fixed k. Increasing k from 32 to 512 moves the histogram slightly to the right. This shift is effective in low-dimensional metric spaces, but it is ineffective in high dimensions. The plots of these two histograms can measure how good the pivoting algorithms are. Intuitively, the overlap between the histogram for the pivot distance and the histogram for the original distance is directly proportional to the discriminative power of the pivot mapping. As the overlap increases the algorithms become more effective.

The particular behavior of D in Figure 17 is due to the fact that D is the maximum of k random variables whose mean is Θ(σ) (i.e., |d(p, q) − d(p, x)|). The fact that D does not grow with ℓ means that, as a lower bound for d, it gets less effective in higher dimensions.
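The pivot distance D and its gap with the original distance d can also be examined numerically. This sketch (ours, on uniform data with small sample sizes; only loosely inspired by the setup of Figure 17) computes the mean of d and of D(x, y) = max_i |d(x, p_i) − d(y, p_i)| over random pairs.

    import math, random

    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def mean_d_vs_D(dim, k, n=500, pairs=1_000, rng=random.Random(2)):
        # D is contractive (D <= d); the smaller the gap between the two means,
        # the more discriminative the pivot mapping is.
        data = [[rng.random() for _ in range(dim)] for _ in range(n)]
        pivots = rng.sample(data, k)
        coords = [[l2(u, p) for p in pivots] for u in data]
        sum_d = sum_D = 0.0
        for _ in range(pairs):
            i, j = rng.sample(range(n), 2)
            sum_d += l2(data[i], data[j])
            sum_D += max(abs(a - b) for a, b in zip(coords[i], coords[j]))
        return sum_d / pairs, sum_D / pairs

    for dim in (16, 64):
        for k in (16, 256):
            print(dim, k, [round(v, 2) for v in mean_d_vs_D(dim, k)])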

7.3. A Lower Bound for Compact Partitioning Algorithms

We now try to obtain a lower bound to the search cost of algorithms based on the compact partition. The result is surprisingly similar to that of the previous section. Our lower bound considers only the hyperplane criterion, which is the most purely associated with the compact partition. We assume just the same facts about the distribution of distances as in the previous section: all of them are independent identically distributed random variables.

Let (q, r)_d be a range query over a metric space indexed by means of m random centers {c1, . . . , cm}. The m distances d(q, ci) can be considered as random variables X1, . . . , Xm whose distribution is that of the histogram of the metric space. The distribution of the distance from q to its closest center c is that of Y = min{X1, . . . , Xm}. The hyperplane criterion specifies that a class [ci] cannot be excluded if d(q, c) + r ≥ d(q, ci) − r. The probability that this happens is Pr(Y ≥ Xi − 2r). But since Y is the minimum over m variables with the same distribution, the probability is Pr(Z ≥ X − 2r)^m, where X and Z are two random variables distributed according to the histogram. Using Chebyschev’s inequality and noticing that if Z < X − 2r then X or Z are at distance at least r from their mean, we can say that

Pr(not discarding [ci]) = Pr(Z ≥ X − 2r)^m ≥ (1 − σ²/r²)^m.

On average each class has n/m elements, so that the external complexity is n × Pr(not discarding [ci]). The internal cost to find the intersected classes deserves some discussion. In all the hierarchical schemes that exist, we consider that the real partition is that induced by the leaves of the trees, that is, the most refined ones. We see all the rest of the hierarchy as a mechanism to reduce the internal complexity of finding the small classes (hence the m we use here is not, say, the m of GNATs, but the total number of final classes). It is difficult to determine this internal complexity (an upper bound is m), so we call it CI(m), knowing that it is between Ω(log m) and O(m). Then a lower bound to the search complexity is

Cost ≥ CI(m) + n (1 − σ²/r²)^m,

which indeed is very similar to the lower bound on pivot-based algorithms. Optimizing on m yields

m* = (ln n + ln ln(1/t′) − ln C′_I(m*)) / ln(1/t′),


Fig. 17. Histograms comparing the L2 distance in different ℓ-dimensional Euclidean spaces and the pivot distance (MaxDist) obtained using different numbers k of pivots. In the top row ℓ = 16 and in the bottom row ℓ = 128. On the left k = 32 and on the right k = 512.


Fig. 18. Overall distance evaluations using a hierarchical compact partitioning with different arities.

where t′ = 1 − σ²/r². Using the optimal m* the search cost is lower bounded by

Cost = Ω(CI(log_{1/t′} n)) = Ω(CI((r²/σ²) ln n)),

which also shows an increase in the cost as the dimensionality grows. As before we can write r = µ − σ/√f. We have just proved the following.

THEOREM 2. Any compact partitioning algorithm based on random centers has a lower bound CI(Θ((√ρ − 1/√(2f))²)) in the average number of distance evaluations performed for a random range query retrieving a fraction of at most f of the database, where ρ is the intrinsic dimension of the space and CI() is the internal complexity to find the relevant classes, satisfying Ω(log m) = CI(m) = O(m).

This result is weaker than Theorem 1 because of our inability to give a good lower bound on CI, so we cannot ensure more than a logarithmic increase with respect to ρ. However, even assuming CI(m) = Θ(m) (i.e., exhaustive search of the classes), in which case the theorem becomes very similar to Theorem 1, there is an important reason that explains why the compact partitioning algorithms can in practice be better than pivot-based ones. We can in this case achieve the optimal number of centers m*, which is impossible in practice for pivot-based algorithms. The reason is that it is much more economical to represent the compact partition using m centers than the pivot partition using k pivots.

Figure 18 shows an experiment on the same dataset, where we have used different m values and a hierarchical compact partitioning based on them. We have used the hyperplane and the covering radius criteria to prune the search. As can be seen, the dependency on the dimension of the space is not so sharp as for pivot-based algorithms, and is closer to a dependency of the form Θ(ℓ).

The general conclusion is that, even if the lower bounds using pivot-based or compact partitioning algorithms look similar, the first ones need much more space to store the classes resulting from k pivots than the last ones using the same number of partitions. Hence, the latter can realistically use the optimal number of classes, whereas the former cannot. If pivot-based algorithms are given all the necessary memory, then using the optimal number of pivots they can improve over compact partitioning algorithms, because t is better than t′, but this is more and more difficult as the dimension grows.

Figure 19 compares both types of partitioning. As can be seen, the pivoting algorithm improves over the compact partitioning if we give it enough pivots. However, “enough” is a number that increases with the dimension and with the fraction retrieved (i.e., ρ and f). For ρ and f large enough, the required number of pivots will be unacceptably high in terms of memory requirements.

7.4. Pivot and Center Selection Techniques

Farago et al. [1993] prove formally that if the dimension is constant, then after properly selecting (not at random!) a constant number k of pivots the exhaustive search costs O(1). This contrasts with our Ω(log n) lower bound. The difference is that they do not take the pivots at random but select a set of pivots that is assumed to have certain selectivity properties. This shows that the way in which the pivots are selected can affect the performance. Unfortunately, their pivot selection technique works for vector spaces only.

Little is known about pivot/center (let us call them collectively, “references”) selection policies, and in practice most methods choose them at random, with a few exceptions. For instance, in Shapiro [1977] it is recommended to select pivots outside the clusters, and Baeza-Yates et al. [1994] suggest using one pivot from each cluster. All authors agree that the references should be far apart from each other, which is evident since close references will give almost the same information. On the other hand, references selected at random are already far apart in a high-dimensional space.

The histogram of distances gives a formal characterization of good references. Let us start with a definition.

Definition 10. The local histogram of an element u is the distribution of distances from u to every x ∈ X.

A good reference has a flatter histogram, which means that it will discard more elements at query time. The measure ρ = µ²/(2σ²) of intrinsic dimensionality (now defined on the local histogram of u) can be used as a good parameter to evaluate how good a reference is (good references have local histograms with small ρ).

This is related to the difference in viewpoints (histograms) between different references, a subject discussed in depth in Ciaccia et al. [1998a]. The idea is that the local histogram of a reference u may be quite different from the global histogram (especially if u is not selected at random). This is used in Ciaccia et al. [1998a] to show that if the histograms for different references are similar then they can accurately predict the behavior of instances of their data structure (the MT), a completely different issue.

Note also that good reference selection becomes harder as the intrinsic dimension grows. As the global histogram reduces its variance, it becomes more difficult to find references that deviate significantly from it, as already noted in Ciaccia et al. [1998a].

The histogram characterization explains a well-known phenomenon: to discriminate among the elements in the class of a local partition (or in a cluster), it is a good idea to select a pivot from the same class. This makes it more probable to select an element close to them (the ideal would be a centroid). In this case, the distances tend to be smaller and the histogram is not so concentrated in large values. For instance, for LAESA, Mico et al. [1994] do not use the pivots in a fixed order, but the next one is that with minimal L1 distance to the current candidates. On the other hand, outliers can be good at a first approximation, in order to discriminate among clusters, but later they are unable to separate well the elements of the same cluster.
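As a simple illustration of this criterion (ours, not a selection policy from the paper), one can rank candidate references by the ρ of their local histograms and keep the flattest ones. As the next paragraph explains, a realistic policy must additionally ensure diversity among the chosen references.

    def local_rho(u, sample, dist):
        # rho = mu^2 / (2 sigma^2) of the local histogram of distances from u
        # (Definition 10); smaller values suggest a flatter, more useful histogram.
        ds = [dist(u, x) for x in sample if x is not u]
        mu = sum(ds) / len(ds)
        var = sum((d - mu) ** 2 for d in ds) / len(ds)
        return mu * mu / (2 * var)

    def pick_references(candidates, sample, dist, how_many):
        # Greedy ranking by local rho alone; diversity among references is ignored here.
        ranked = sorted(candidates, key=lambda u: local_rho(u, sample, dist))
        return ranked[:how_many]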

Fig. 19. Distance evaluations for increasing dimension. We compare the compact partitioning algorithm of Figure 18 using 128 centers per level against the filtration using k pivots for k = 4, 16, 64. On top the search radius captures 0.01% of the set, on the bottom 0.1%.

Selecting good individual references, that is, with a flatter histogram, is not enough to ensure good performance. For example, a reference close to a good one is probably good as well, but using both of them gives almost the same information as using only one. It is necessary to include a variety of independent viewpoints in the selection. If we consider the selection of references as an optimization problem, the goal is to obtain a set of references that are collectively good for indexing purposes. The set of references is good if it approximates the original distance well. This is more evident in the case of pivoting algorithms, where a contractive distance is implicitly used. A practical way to decide whether our set of references is effective is to compare the histograms of the original distance and the pivot distance as in Figure 17. A good reference set will give a pivot distance histogram with a large intersection with the histogram of the original distance.

7.5. The Effect on the Discriminative Power

Another reason that explains the curse of dimensionality is related to the decrease in the discriminative power of a partition, because of the odd “shapes” of the classes.

As explained before, a nonlocal partition suffers from the problem of being unable to discriminate between points that are actually far away from each other, which leads to unnecessarily high external complexity. The solution is to select enough pivots or to use a compact partitioning method that yields a local partition. However, even in a local partition, the shape of the cell is fixed at indexing time, while the shape of the space region defined as interesting for a query q is a ball dynamically determined at query time. The discriminative power can be visualized as the volume of the query ball divided by the total volume of the classes intersected by that ball.

A good example is that of a ball inside a box in a k-dimensional L2 space. The box is the class obtained after using k orthogonal pivots, and the partition obtained is quite local. Yet the volume of the ball is smaller, and the ratio with respect to the volume of the box (i.e., the discriminative power) decreases exponentially with k.

This means that we have to examine a volume (and a number of candidate elements) that, with respect to the size of the final result, grows exponentially with the dimension k. This fact is behind the exponential dependency on the dimension in data structures for vector spaces such as the kd-tree or the R-tree.
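The ball-inside-a-box example can be made concrete with the standard volume formulas (our numerical illustration, not part of the paper): the unit-radius L2 ball has volume π^{k/2}/Γ(k/2 + 1), while its bounding box of side 2 has volume 2^k, so the ratio vanishes quickly as k grows.

    import math

    def ball_to_box_ratio(k):
        # Volume of the unit-radius L2 ball divided by the volume of its
        # bounding box (side 2) in k dimensions.
        return math.pi ** (k / 2) / (math.gamma(k / 2 + 1) * 2 ** k)

    for k in (2, 4, 8, 16, 32):
        print(k, f"{ball_to_box_ratio(k):.2e}")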

The same phenomenon occurs in general metric spaces, where the “shapes” of the cells cannot possibly fit an unknown query ball. We can add more pivots to better bound any possible ball, but this increases the internal complexity and becomes more difficult as the intrinsic dimension grows (the smaller the variance, the more pivots are needed to make a difference).

8. A TAXONOMY OF SEARCH ALGORITHMS

In this section we apply our unifying model to organize all the known approaches in a taxonomy. This helps to identify the essential features of all the existing techniques, to find possible combinations of algorithms not noticed up to now, and to detect which are the most promising areas for optimization.

We first concentrate on pivoting algorithms. They differ in their method to select the pivots, in when the selection is made, and in how much information on the comparisons is used. Later, we consider the compact partitioning algorithms. These differ in their methods to select the centers and to bound the equivalence classes.

Figure 20 summarizes our classification of methods attending to their most important features (explained throughout the section).

8.1. Pivot-Based Algorithms

8.1.1. Search Algorithms. Once we have determined the equivalence relation to use (i.e., the k pivots), we preprocess the dictionary by storing, for each element of U, its k coordinates (i.e., distances to the k pivots). This takes O(kn) preprocessing time and space overhead. The “index” can be seen as a table of n rows and k columns storing d(ui, pj).

Fig. 20. Taxonomy of the existing algorithms. The methods in italics are combinations that appear naturally as soon as the taxonomy is built.

At query time, we first compare the query q against the k pivots, hence obtaining its k coordinates [q] = (y1, . . . , yk) in the target space, that is, its equivalence class. The cost of this is k evaluations of the distance function d, which corresponds to the internal complexity of the search. We have now to determine, in the target space, which classes may be relevant to the query (i.e., which ones are at distance r or less in the L∞ metric, which corresponds to the D distance). This does not use further evaluations of d, but it may take extra CPU cost. Finally, the elements belonging to the qualifying classes (i.e., those that cannot be discarded after considering the k pivots) are directly compared against q (external complexity).

The simplest search algorithm proceeds rowwise: consider each element of the set (i.e., each row (x1, . . . , xk) of the table) and see if the triangle inequality allows discarding that row, that is, whether max_{i=1..k} |xi − yi| > r. For each row not discarded using this rule, compare the element directly against q. This is equivalent to traversing the quotient space, using D to discard uninteresting classes.
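A minimal sketch of this rowwise traversal (ours; the layout follows the table used informally in the text, table[i][j] = d(ui, pj)) could look as follows.

    def range_query_rowwise(table, data, pivots, dist, q, r):
        # Rowwise pivot filtering: discard row i when some coordinate differs from
        # the query coordinate by more than r; verify the survivors directly.
        qcoord = [dist(q, p) for p in pivots]      # internal complexity: k evaluations
        result, evaluations = [], len(pivots)
        for u, row in zip(data, table):
            if all(abs(x - y) <= r for x, y in zip(row, qcoord)):
                evaluations += 1                   # external complexity
                if dist(q, u) <= r:
                    result.append(u)
        return result, evaluations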

Although this traversal does not perform more evaluations of d than necessary, it is not the best choice. The reasons are made clear later, as we discover the advantages of alternative approaches. First, notice that the amount of CPU work is O(kn) in the worst case. However, as we abandon a row as soon as we find a difference larger than r along a coordinate, the average case is much closer to O(n) for queries of reasonable selectivity.

The first improvement is to process the set columnwise. That is, we compare the query against the first pivot p1. Now, we consider the first column of the table and discard all the elements that satisfy |x1 − y1| > r. Then, we consider the second pivot p2 and repeat the process only on the elements not discarded up to now. An algorithm implementing this idea is LAESA.


It is not hard to see that the amount of evaluations of d and the total CPU work remains the same as for the rowwise case. However, we can do better now, since each column can be sorted so that the range of qualifying rows can be binary searched instead of sequentially searched [Nene and Nayar 1997; Chavez et al. 1999]. This is possible because we are interested, at column i, in the values [yi − r, yi + r]. The extra CPU cost gets closer to O(k log n) than to O(n) by using this technique.
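A sketch of the columnwise variant with sorted columns is shown below (ours; stop_at is a made-up tuning parameter, and in a real implementation each query coordinate y_j would be computed lazily, at the cost of one distance evaluation per pivot actually used).

    from bisect import bisect_left, bisect_right

    def build_columns(table):
        # For each pivot (column), keep its coordinate values sorted together with
        # the row index, so the range [y - r, y + r] can be binary searched.
        k = len(table[0])
        return [sorted((row[j], i) for i, row in enumerate(table)) for j in range(k)]

    def candidates_columnwise(columns, qcoord, r, stop_at=50):
        # Intersect, pivot by pivot, the rows whose jth coordinate lies in
        # [y_j - r, y_j + r]; stop early once few candidates remain and switch
        # to direct verification with d.
        alive = None
        for col, y in zip(columns, qcoord):
            lo = bisect_left(col, (y - r, -1))
            hi = bisect_right(col, (y + r, float("inf")))
            hits = {i for _, i in col[lo:hi]}
            alive = hits if alive is None else alive & hits
            if len(alive) <= stop_at:
                break
        return alive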

This is not the only improvement allowed by a columnwise evaluation that cannot be done rowwise. A very important one is that it is not necessary to consider all the k coordinates (recall that we have to perform one evaluation of d to obtain each new query coordinate yi). As soon as the remaining set of candidates is small enough, we can stop considering the remaining coordinates and directly verify the candidates using the d distance. This point is difficult to estimate beforehand: despite the (few) theoretical results existing [Farago et al. 1993; Baeza-Yates 1997; Baeza-Yates and Navarro 1998], one cannot normally understand the application well enough to predict the actual optimal number of pivots k* (i.e., the point where it is better to switch to exhaustive evaluation).

Another improvement that can be done with columnwise evaluation is that the selection of the pivots can be done on the fly instead of beforehand as we have presented it. That is, once we have selected the first pivot p1 and discarded all the uninteresting elements, the second pivot p2 may depend on which was the result of p1. However, for each potential pivot p we have to store the coordinates of all the elements of the set for p (or at least some, as we show later). That is, we select k potential pivots and precompute the table as before, but we can choose in which order the pivots are used (according to the current state of the search) and where we stop using pivots and compare directly.

An extreme case of this idea is AESA, where k = n (i.e., all the elements are potential pivots), and the new pivot at each iteration is selected among the remaining elements. Despite its practical inapplicability because of its O(n²) preprocessing time and space overhead (i.e., all the distances among the known elements are precomputed), the algorithm performs a surprisingly low number of distance evaluations, much better than when the pivots are fixed. This shows that it is a good idea to select pivots from the current set of candidates (as discussed in the previous sections).

Finally, we notice that instead of a sequential search in the mapped space, we could use an algorithm to search in vector spaces of k dimensions (e.g., kd-trees or R-trees). Depending on their ability to handle larger k values, we could be able to use more pivots without significantly increasing the extra CPU cost. Recall also that, as more pivots are used, the search structures for vector spaces perform worse. This is a very interesting subject that has not been pursued yet, and that accounts for balancing between distance evaluations and CPU time.

8.1.2. Coarsening the Equivalence Relation. The alternative of not considering all the k pivots if the remaining set of candidates is small is an example of coarsening an equivalence relation. That is, if we do not consider a given pivot p, we are merging all the classes that differ only in that coordinate. In this case we prefer to coarsen the pivot equivalence relation because computing it with more precision is worse than checking it as is.

There are many other ways to coarsen the equivalence relation, and we cover them here. However, in these cases the coarsening is not done for the sake of reducing the number of distance evaluations, but to improve space usage and precomputation time, as O(kn) can be prohibitively expensive for some applications. Another reason is that, via coarsening, we obtain search algorithms that are sublinear in their extra CPU time. We consider in this section range coarsening, bucket coarsening, and scope coarsening. Their ideas are roughly illustrated in Figure 21.

Fig. 21. Different coarsification methods.

It must be clear that all these types of coarsenings reduce the discriminative power of the resulting equivalence classes, making it necessary to exhaustively consider more elements than in the uncoarsened versions of the relations. In the example of the previous section this is amortized by the lower cost to obtain the coarsened equivalence relation. Here we reduce the effectiveness of the algorithms via coarsening, for the sake of reduced preprocessing time and space overhead.

However, space reduction may have a counterpart in time efficiency. If we use less space, then with the same amount of memory we can have more pivots (i.e., larger k). This can result in an overall improvement. The fair way to compare two algorithms is to give them the same amount of memory to use.

8.1.3. Range Coarsening. The auxiliary data structures proposed by most authors for continuous distance functions are aimed at reducing the amount of space needed to store the coordinates of the elements in the mapped space, as well as the time to find the relevant classes. The most popular form is to reduce the precision of d. This is written as

x ∼_{p,{ri}} y ⇔ ∃i, r_i ≤ d(x, p) < r_{i+1} and r_i ≤ d(y, p) < r_{i+1},

with {r_i} a partition of the interval [0, ∞). That is, we assign the same equivalence class to elements falling in the same range of distances with respect to the same pivot p. This is obviously a coarsening of the previous relation ∼p and can be naturally extended to more than one pivot.

Figure 2 exemplifies a pivoting equivalence relation where range coarsening is applied, for one pivot. Points in the same ring are in the same equivalence class, despite the fact that their exact distance to the pivot may be different.

A number of actual algorithms use one or another form of this coarsening technique. VPTs and MVPTs divide the distances in slices so that the same number of elements lie in each slice (note that the slices are different for each pivot). VPTs use two slices and MVPTs use many. Their goal is to obtain balanced trees. BKTs, FQTs, and FHQTs, on the other hand, propose range coarsening for continuous distance functions but do not specify how to coarsen.

In this work we consider that the “natural” extension of BKTs, FQTs, and FHQTs assigns slices of the same width to each branch, and that the tree has the same arity across all its nodes. At each node, the slice width is recomputed so that using slices of that fixed width the node has the desired arity.

Therefore, we can have slices of fixed width (BKT, FQT, FHQT) or determined by percentiles (VPT, MVPT, FQA). We may have a different pivot per node (BKT, VPT, MVPT) or per level (FQT, FHQT, FQA). Among the last, we can define the slices at each node (FQT, FHQT) or for the whole level (FQA). All these choices are drawn in Table II. The empty slots have been filled with new structures that are now defined.

Table II. Different Options for Range Coarsening

                                                 Fixed Percentiles    Fixed Width
  Different pivot per node (scope coarsening)    VPT, MVPT            BKT
  Different pivot per level, slice per node      FMVPT                FQT, FHQT
  Different pivot per level, slice per level     FMVPA (FQA)          FHQA

  (The new structures created to fill the empty slots are defined next.)

8.1.3.1. FHQA. This is similar to an FQA except that the slices are of fixed width. At each level the slice width is recomputed so that a maximum arity is guaranteed. In the FHQT, instead, each node has a different slice width.

8.1.3.2. FMVPT. This is a cross between an MVPT and a FHQT. The range of values is divided using the m − 1 uniform percentiles to balance the tree, as in MVPTs. The tree has a fixed height h, as FHQTs. At each node the ranges are recomputed according to the elements lying in that subtree. The particular case where m = 2 is called FHVPT.

8.1.3.3. FMVPA. This is just a new name for the FQA, more appropriate for our discussion since it is to MVPTs as FHQAs are to FHQTs: the FMVPA uses variable width slices to ensure that the subtrees are balanced in size, but the same slices are used for all the nodes of a single level, so that the balance is only levelwise.

The combinations we have just created allow us to explain some important concepts.

8.1.3.4. Amount of Range Coarsening. Let us consider FHQAs and FMVPAs. They are no more than LAESA with different forms of range coarsening. They use k fixed pivots and use b bits to represent the coordinates (i.e., the distances from each point to each of the k pivots). So only 2^b different values can be expressed. The two structures differ only in how they coarsen the distances to put them into 2^b ranges. Their total space requirement is then reduced to kbn bits.

However, range coarsening is not just a technique to reduce space, but the same space can be used to accommodate more pivots. It is not immediate how convenient it is to coarsen in order to use more pivots, but it is clear that this technique can improve the overall effectiveness of the algorithm.
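The two coarsening policies compared here can be sketched in a few lines (ours; the function names are illustrative). Each maps a column of distances to slice identifiers that fit in b bits, either with equal-width slices (FHQA-style) or with slices holding roughly the same number of elements (FMVPA/FQA-style).

    from bisect import bisect_right

    def fixed_width_slices(distances, b):
        # 2^b equal-width slices over [0, max(distances)].
        m = 2 ** b
        width = (max(distances) / m) or 1.0
        return [min(int(d / width), m - 1) for d in distances]

    def percentile_slices(distances, b):
        # 2^b slices holding roughly the same number of elements each,
        # with cut points taken from the sorted sample of distances.
        m = 2 ** b
        srt = sorted(distances)
        cuts = [srt[(i * len(srt)) // m] for i in range(1, m)]
        return [bisect_right(cuts, d) for d in distances]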

8.1.3.5. Percentiles Versus Fixed Width. Another unclear issue is whether fixed slices are better or worse than percentile splitting. A balanced data structure has obvious advantages because the internal complexity may be reduced. Fixed slices produce unbalanced structures since the outer rings have many more elements (especially on high dimensions). On the other hand, in high dimensions the outer rings tend to be too narrow if percentile splitting is used (because a small increment in radius gets many new elements inside the ring). If the rings are too narrow, many rings will be frequently included in the radius of interest of the queries (see Figure 22). An alternative idea is shown in Chavez [1999], where the slices are optimized to minimize the number of branches that must be considered. In this case, each class can be an arbitrary set of slices.

Fig. 22. The same query intersects more rings when using uniform percentiles.

8.1.3.6. Trees Versus Arrays. FHQTs and FMVPTs are almost tree versions of FHQAs and FMVPAs, respectively. They are m-ary trees where all the elements belonging to the same coarsened class are stored in the same subtree. Instead of explicitly storing the m coordinates, the trees store them implicitly: the elements are at the leaves, and their paths from the root spell out the coarsened coordinate values. This makes the space requirements closer to O(n) in practice, instead of O(bkn) (although the constant is very low for the array versions, which may actually take less space). Moreover, the search for the relevant elements can be organized using the tree: if all the interesting elements have their first coordinate in the ith ring, then we just traverse the ith subtree. This reduces the extra CPU time. If the distance is too fine-grained, however, the root will have nearly n children and the subtrees will have just one child. The structure will be very similar to a table of k coordinates per element and the search will degenerate into a linear rowwise traversal. Hence, range coarsening is also a tool to reduce the extra CPU time.

We have given the trees the ability to define the slices at each node instead of at each level as do the array versions. This allows them to adapt better to the data, but the values of the slices used need more space. It is not clear whether it pays to store all these slice values.

Summarizing, range coarsening can be applied using fixed-width or fixed-percentile slices. They can reduce the space necessary to store the coordinates, which can allow the use of more pivots with the same amount of memory. Therefore, it is not just a technique to reduce space but it can improve the search complexity. Range coarsening can also be used to organize treelike search schemes that are sublinear in extra CPU time.

8.1.4. Bucket Coarsening. To reduce space requirements in the above trees, we can avoid building subtrees that have few elements. Instead, all their elements are stored in a bucket. When the search arrives at a bucket, it has to exhaustively consider all the elements.

This is a form of coarsening, since for the elements in the bucket we do not consider the last pivots, and it resembles the previous idea (Section 8.1.1) of not computing the k pivots. However, in this case the decision is taken offline, at index construction time, and this allows reducing space by not storing those coordinates. In the previous case the decision was taken at search time. The crucial difference is that if the decision is taken at search time, we can know exactly the total amount of exhaustive work to do by not taking further coordinates. However, in an offline decision we can only consider the search along this branch of the tree, and we cannot predict how many branches will be considered at search time.

This idea is used for discrete distance functions in FQTs, which are similar to FHQTs except for the use of buckets. It has also been applied to continuous setups to reduce space requirements further.

8.1.5. Scope Coarsening. The last and least obvious form of coarsening is the one we call “scope coarsening.” In fact, the use of this form of coarsening makes it difficult to notice that many algorithms based on trees are in fact pivoting algorithms.


This coarsening is based on, instead of storing all the coordinates of all the elements, just storing some of them. Hence, comparing the query against some pivots helps to discriminate on some subset of the database only. To use this fact to reduce space, we must determine offline which elements will store their distance to which pivots. There are many ways to use this idea, but it has been used only in the following way.

In FHVPTs there is a single pivot per level of the tree, as in FHQTs. The left and right subtrees of VPTs, on the other hand, use different pivots. That is, if we have to consider both the left and right subtrees (because the radius r does not allow us to completely discard one), then comparing the query q against the left pivot will be useful for the left subtree only. There is no information stored about the distances from the left pivot to the elements of the right subtree, and vice versa. Hence, we have to compare q against both pivots. This continues recursively. The same idea is used for BKTs and MVPTs.

Although at first sight it is clear that we reduce space, this is not the direct way in which the idea is used in those schemes. Instead, they combine it with a huge increase in the number of potential pivots. For each subtree, an element belonging to the subtree is selected as the pivot and deleted from the set. If no bucketing is used, the result is a tree where each element is a node somewhere in the tree and hence a potential pivot. The tree takes O(n) space, which shows that we can successfully combine a large number of pivots with scope coarsening to have low space requirements (n instead of n² as in AESA).

The possible advantage (apart from guaranteed linear space and slightly reduced space in practice) of these structures over those that store all the coordinates (as FQTs and FHQTs) is that the pivots are better suited to the searched elements in each subtree, since they are selected from the same subset. This same property makes AESA a good (although impractical) algorithm.

Shapiro [1977] and Bozkaya and Ozsoyoglu [1997] propose hybrids (for BKT and VPT, resp.) where a number of fixed pivots are used at each node, and for each resulting class a new set of pivots is selected. Note that, historically, FQTs and FHQTs are an evolution over BKTs.

8.2. Compact Partitioning Algorithms

All the remaining algorithms (GHTs, BSTs, GNATs, VTs, MTs, SATs, LCs) rely on a hierarchical compact partition of the metric space. A first source of differences is in how the centers are selected at each node. GHTs and BSTs take two elements per level. VTs repeat previous centers when creating new nodes. GNATs select m centers far apart. MTs try to minimize covering radii. SATs select a variable number of close neighbors of the parent node. LC chooses at random one center per cluster.

The main difference, however, lies in the search algorithm. While GHTs use purely the hyperplane criterion, BSTs, VTs, MTs, and LCs use only the covering radius criterion. SATs use both criteria to increase pruning. In all these algorithms the query q is compared against all the centers of the current node and the criteria are used to discard subtrees.
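Both pruning criteria are easy to state in code. The sketch below (ours, with illustrative names) shows one node of a hierarchical compact partition applying first the hyperplane criterion and then the covering radius criterion, as SATs do; GHTs would use only the first test, and BSTs, VTs, MTs, and LCs only the second.

    def prune_subtrees(q, r, dist, centers, cover_radius, children):
        # Compare q against all centers of the node, then keep only the subtrees
        # that survive both discarding criteria.
        dq = [dist(q, c) for c in centers]
        d_closest = min(dq)
        to_visit = []
        for i in range(len(centers)):
            if dq[i] - r > d_closest + r:        # hyperplane criterion: discard [c_i]
                continue
            if dq[i] - r > cover_radius[i]:      # covering radius criterion: discard
                continue
            to_visit.append(children[i])
        return dq, to_visit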

GNATs are a little different, as they use none of the above criteria. Instead, they apply an AESA-like search over the m centers considering their “range” values. That is, a center ci is selected, and if the query ball does not intersect a ring around ci that contains all the elements of cj, then cj and all its class (subtree) can be safely discarded. In other words, GNATs limit the class of each center by intersecting rings around the other centers. This way of limiting the extent of a class is different from both the hyperplane and the covering radius criteria. It is probably more efficient, but it requires storing O(m²) distances at each level.

Table III summarizes the differences. It is clear that there are many possible combinations that have not been tried, but we do not attempt to enumerate all of them.

Table III. Different Options for Limiting Classes

  With hyperplanes    GHT, SAT
  With balls          BST, VT, MT, SAT, LC
  With rings          GNAT

The compact partition is an attempt to obtain local classes, more local than those based on pivots. A general technique to do this is to identify clusters of close objects in the set. There exist many clustering algorithms to build equivalence relations. However, most are defined on vector spaces instead of general metric spaces. An exception is Brito et al. [1996], who report very good results. However, it is not clear that good clustering algorithms directly translate into good algorithms for proximity searching. Another clustering algorithm, based on cliques, is presented in Burkhard and Keller [1973], but the results are similar to the simpler BKT. This area is largely unexplored, and the developments here could be converted into improved search algorithms.

9. CONCLUSIONS

Metric spaces are becoming a popular model for similarity retrieval in many unrelated areas. We have surveyed the algorithms that index metric spaces to answer proximity queries. We have not just enumerated the existing approaches to discuss their good and bad points. We have, in addition, presented a unified framework that allows understanding the existing approaches under a common view. It turns out that most of the existing algorithms are indeed variations on a few common ideas, and by identifying them, previously unnoticed combinations have naturally appeared. We have also analyzed the main factors that affect the efficiency when searching metric spaces. The main conclusions of our work are summarized as follows.

1. The concept of intrinsic dimensionality can be defined on general metric spaces as an abstract and quantifiable measure that affects the search performance.

2. The main factors that affect the efficiency of the search algorithms are the intrinsic dimensionality of the space and the search radius.

3. Equivalence relations are a common ground underlying all the indexing algorithms, and they divide the search cost in terms of internal and external complexity.

4. A large class of search algorithms relies on taking k pivots and mapping the metric space onto R^k using the L∞ distance. Another important class uses compact partitioning.

5. The equivalence relations can be coarsened to save space or to improve the overall efficiency by making better use of the pivots. A hierarchical refinement of classes can improve performance.

6. Although there is an optimal number of pivots to use, this number is too high in terms of space requirements. In practical terms, a pivot-based index can outperform a compact partitioning index if it has enough memory.

7. As this amount of memory becomes infeasible as the dimension grows, compact partitioning algorithms normally outperform pivot-based ones in high dimensions.

8. In high dimensions the search radius needed to retrieve a fixed percentage of the database is very large. This is the reason for the failure to overcome the brute force search with an exact indexing algorithm.

A number of open issues require further attention. The main ones follow.

—For pivot-based algorithms, understand better the effect of pivot selection, devising methods to choose effective pivots. The subject of the appropriate number of pivots and its relation to the intrinsic dimensionality of the space plays a role here. The histogram of distances may be a good tool for pivot selection.

—For compact partitioning algorithms, work more on clustering schemes in order to select good centers. Find ways to reduce construction times (which are in many cases too high).

—Search for good hybrids between compact partitioning and pivoting algorithms. The first ones cope better with high dimensions and the second ones improve as more memory is given to them. After the space is clustered the intrinsic dimension of the clusters is smaller, so a top-level clustering structure joined with a pivoting scheme for the clusters is an interesting alternative. Those pivots should be selected from the cluster because the clusters have high locality.

—Take extra CPU complexity into account, which we have barely considered in this work. In some applications the distance is not so expensive that one can disregard any other type of CPU cost. The use of specialized search structures in the mapped space (especially R^k) and the resulting complexity tradeoff deserves more attention.

—Take I/O costs into account, which may very well dominate the search time in some applications. The only existing work on this is the M-tree [Ciaccia et al. 1997].

—Focus on nearest neighbor search. Most current algorithms for this problem are based on range searching, and despite the fact that the existing heuristics seem difficult to improve, truly independent ways to address the problem could exist.

—Consider approximate and probabilistic algorithms, which may give much better results at a cost that, especially for this problem, seems acceptable. A first promising step is Chavez and Navarro [2001b].

REFERENCES

APERS, P., BLANKEN, H., AND HOUTSMA, M. 1997. Multimedia Databases in Perspective. Springer-Verlag, New York.

ARYA, S., MOUNT, D., NETANYAHU, N., SILVERMAN, R., AND WU, A. 1994. An optimal algorithm for approximate nearest neighbor searching in fixed dimension. In Proceedings of the Fifth ACM–SIAM Symposium on Discrete Algorithms (SODA’94), 573–583.

AURENHAMMER, F. 1991. Voronoi diagrams—A survey of a fundamental geometric data structure. ACM Comput. Surv. 23, 3.

BAEZA-YATES, R. 1997. Searching: An algorithmic tour. In Encyclopedia of Computer Science and Technology, vol. 37, A. Kent and J. Williams, Eds., Marcel Dekker, New York, 331–359.

BAEZA-YATES, R. AND NAVARRO, G. 1998. Fast approximate string matching in a dictionary. In Proceedings of the Fifth South American Symposium on String Processing and Information Retrieval (SPIRE’98), IEEE Computer Science Press, Los Alamitos, Calif., 14–22.

BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, Mass.

BAEZA-YATES, R., CUNTO, W., MANBER, U., AND WU, S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the Fifth Combinatorial Pattern Matching (CPM’94), Lecture Notes in Computer Science, vol. 807, 198–212.

BENTLEY, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509–517.

BENTLEY, J. 1979. Multidimensional binary search trees in database applications. IEEE Trans. Softw. Eng. 5, 4, 333–340.

BENTLEY, J., WEIDE, B., AND YAO, A. 1980. Optimal expected-time algorithms for closest point problems. ACM Trans. Math. Softw. 6, 4, 563–580.

BERCHTOLD, S., KEIM, D., AND KRIEGEL, H. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22nd Conference on Very Large Databases (VLDB’96), 28–39.

BHANU, B., PENG, J., AND QING, S. 1998. Learning feature relevance and similarity metrics in image databases. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 14–18.

BIMBO, A. D. AND VICARIO, E. 1998. Using weighted spatial relationships in retrieval by visual contents. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 35–39.

BOHM, C., BERCHTOLD, S., AND KEIM, A. 2002. Searching in high-dimensional spaces—index structures for improving the performance of multimedia databases. ACM Comput. Surv. To appear.

BOZKAYA, T. AND OZSOYOGLU, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of ACM SIGMOD International Conference on Management of Data, SIGMOD Rec. 26, 2, 357–368.

BRIN, S. 1995. Near neighbor search in large metric spaces. In Proceedings of the 21st Conference on Very Large Databases (VLDB’95), 574–584.

BRITO, M., CHAVEZ, E., QUIROZ, A., AND YUKICH, J. 1996. Connectivity of the mutual k-nearest neighbor graph in clustering and outlier detection. Stat. Probab. Lett. 35, 33–42.

BUGNION, E., FHEI, S., ROOS, T., WIDMAYER, P., AND WIDMER, F. 1993. A spatial index for approximate multiple string matching. In Proceedings of the First South American Workshop on String Processing (WSP’93), R. Baeza-Yates and N. Ziviani, Eds., 43–53.


BURKHARD, W. AND KELLER, R. 1973. Some ap-proaches to best-match file searching. Commun.ACM 16, 4, 230–236.

CASCIA, M. L., SETHI, S., AND SCLAROFF, S. 1998.Combining textual and visual cues for content-based image retrieval on the world wide web.In Proceedings of the IEEE Workshop onContent-Based Access of Image and Video Li-braries (Santa Barbara, Calif.), IEEE ComputerSociety, Los Alamitos, Calif., 24–28.

CHAVEZ, E. 1999. Optimal discretization forpivot based algorithms. Manuscript. ftp://garota.fismat.umich.mx/pub/users/elchavez/minimax.ps.gz.

CHAVEZ, E. AND MARROQUıN, J. 1997. Proximityqueries in metric spaces. In Proceedings of theFourth South American Workshop on StringProcessing (WSP’97), R. Baeza-Yates, Ed.Carleton University Press, Ottawa, 21–36.

CHAVEZ, E. AND NAVARRO, G. 2000. An effectiveclustering algorithm to index high dimensionalmetric spaces. In Proceedings of the SeventhString Processing and Informational Retrieval(SPIRE’00), IEEE CS Press, 75–86.

CHAVEZ, E. AND NAVARRO, G. 2001a. Towardsmeasuring the searching complexity of met-ric spaces. In Proceedings of the MexicanComputing Meeting, vol. II, 969–972.

CHAVEZ, E. AND NAVARRO, G. 2001b. A proba-bilistic spell for the curse of dimensionality.In Proceedings of the Third Workshop on Al-gorithm Engineering and Experimentation(ALENEX’01), 147–160. Lecture Notes inComputer Science v. 2153.

CHAVEZ, E. AND NAVARRO, G. 2002. A metric indexfor approximate string matching. In Proceed-ings of the Fifth Latin American Symposium onTheoretical Informatics (LATIN’02), To appear,Lecture Notes in Computer Science.

CHAVEZ, E., MARROQUıN, J., AND BAEZA-YATES, R.1999. Spaghettis: An array based algorithmfor similarity queries in metric spaces. In Pro-ceedings of String Processing and InformationRetrieval (SPIRE’99), IEEE Computer SciencePress, Los Alamitos, Calif., 38–46.

CHAVEZ, E., MARROQUIN, J. L., AND NAVARRO, G. 2001.Fixed queries array: a fast and economical datastructure for proximity searching. MultimediaTools and Applications 14 (2), 113–135. Kluwer.

CHAZELLE, B. 1994. Computational geometry:A retrospective. In Proceedings of the 26thACM Symposium on the Theory of Computing(STOC’94), 75–94.

CHIUEH, T. 1994. Content-based image indexing.In Proceedings of the Twentieth Conference onVery Large Databases (VLDB’94), 582–593.

CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1997. M-tree: An efficient access method for similaritysearch in metric spaces. In Proceedings ofthe 23rd Conference on Very Large Databases(VLDB’97), 426–435.

CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998a. Acost model for similarity queries in metricspaces. In Proceedings of the Seventeenth ACMSIGACT–SIGMOD-SIGART Symposium onPrinciples of Database Systems (PODS’98).

CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1998b.Processing complex similarity queries withdistance-based access methods. In Proceed-ings of the Sixth International Conference onExtending Database Technology (EDBT ’98).

CLARKSON, K. 1999. Nearest neighbor queries inmetric spaces. Discrete Comput. Geom. 22, 1,63–93.

COX, T. AND COX, M. 1994. MultidimensionalScaling. Chapman and Hall, London.

DEHNE, F. AND NOLTEMEIER, H. 1987. Voronoitrees and clustering problems. Inf. Syst. 12, 2,171–175.

DEVROYE, L. 1987. A Course in Density Estimation.Birkhauser, Boston.

DUDA, R. AND HART, P. 1973. Pattern Classificationand Scene Analysis. Wiley, New York.

FALOUTSOS, C. AND KAMEL, I. 1994. Beyond uni-formity and independence: Analysis of R-treesusing the concept of fractal dimension. In Pro-ceedings of the Thirteenth ACM Symposium onPrinciples of Database Principles (PODS’94),4–13.

FALOUTSOS, C. AND LIN, K. 1995. Fastmap: A fastalgorithm for indexing, data mining and visual-ization of traditional and multimedia datasets.ACM SIGMOD Rec. 24, 2, 163–174.

FALOUTSOS, C., EQUITZ, W., FLICKNER, M., NIBLACK, W.,PETKOVIC, D., AND BARBER, R. 1994. Efficientand effective querying by image content. J.Intell. Inf. Syst. 3, 3/4, 231–262.

FARAGO, A., LINDER, T., AND LUGOSI, G. 1993. Fastnearest-neighbor search in dissimilarity spaces.IEEE Trans. Patt. Anal. Mach. Intell. 15, 9,957–962.

FRAKES, W. AND BAEZA-YATES, R., EDS. 1992. Infor-mation Retrieval: Data Structures and Algo-rithms. Prentice-Hall, Englewood Cliffs, N.J.

GAEDE, V. AND GÜNTHER, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170–231.

GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 47–57.

HAIR, J., ANDERSON, R., TATHAM, R., AND BLACK, W. 1995. Multivariate Data Analysis with Readings, 4th ed. Prentice-Hall, Englewood Cliffs, N.J.

HJALTASON, G. AND SAMET, H. 1995. Ranking in spatial databases. In Proceedings of the Fourth International Symposium on Large Spatial Databases, 83–95.

JAIN, A. AND DUBES, R. 1988. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, N.J.


KALANTARI, I. AND MCDONALD, G. 1983. A data structure and an algorithm for the nearest point problem. IEEE Trans. Softw. Eng. 9, 5.

MEHLHORN, K. 1984. Data Structures and Algorithms, Volume III: Multidimensional Searching and Computational Geometry. Springer-Verlag, New York.

MICÓ, L., ONCINA, J., AND CARRASCO, R. 1996. A fast branch and bound nearest neighbour classifier in metric spaces. Patt. Recogn. Lett. 17, 731–739.

MICÓ, L., ONCINA, J., AND VIDAL, E. 1994. A new version of the nearest-neighbor approximating and eliminating search (AESA) with linear preprocessing-time and memory requirements. Patt. Recogn. Lett. 15, 9–17.

NAVARRO, G. 1999. Searching in metric spaces by spatial approximation. In Proceedings of String Processing and Information Retrieval (SPIRE'99), IEEE Computer Science Press, Los Alamitos, Calif., 141–148.

NAVARRO, G. AND REYES, N. 2001. Dynamic spatial approximation trees. In Proceedings of the Eleventh Conference of the Chilean Computer Science Society (SCCC'01), IEEE CS Press, 213–222.

NENE, S. AND NAYAR, S. 1997. A simple algorithm for nearest neighbor search in high dimensions. IEEE Trans. Patt. Anal. Mach. Intell. 19, 9, 989–1003.

NIEVERGELT, J. AND HINTERBERGER, H. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1, 38–71.

NOLTEMEIER, H. 1989. Voronoi trees and applications. In Proceedings of the International Workshop on Discrete Algorithms and Complexity (Fukuoka, Japan), 69–74.

NOLTEMEIER, H., VERBARG, K., AND ZIRKELBACH, C. 1992. Monotonous bisector∗ trees: A tool for efficient partitioning of complex scenes of geometric objects. In Data Structures and Efficient Algorithms, Lecture Notes in Computer Science, vol. 594, Springer-Verlag, New York, 186–203.

PRABHAKAR, S., AGRAWAL, D., AND ABBADI, A. E. 1998. Efficient disk allocation for fast similarity searching. In Proceedings of ACM SPAA'98 (Puerto Vallarta, Mexico).

ROUSSOPOULOS, N., KELLEY, S., AND VINCENT, F. 1995. Nearest neighbor queries. In Proceedings of ACM SIGMOD'95, 71–79.

SALTON, G. AND MCGILL, M. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, New York.

SAMET, H. 1984. The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 2, 187–260.

SANKOFF, D. AND KRUSKAL, J., EDS. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Mass.

SHAPIRO, M. 1977. The choice of reference points in best-match file searching. Commun. ACM 20, 5, 339–343.

SHASHA, D. AND WANG, T. 1990. New techniques for best-match retrieval. ACM Trans. Inf. Syst. 8, 2, 140–158.

SUTTON, R. AND BARTO, A. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Mass.

UHLMANN, J. 1991a. Implementing metric trees to satisfy general proximity/similarity queries. Manuscript.

UHLMANN, J. 1991b. Satisfying general proximity/similarity queries with metric trees. Inf. Proc. Lett. 40, 175–179.

VERBARG, K. 1995. The C-Tree: A dynamically balanced spatial index. Comput. Suppl. 10, 323–340.

VIDAL, E. 1986. An algorithm for finding nearest neighbors in (approximately) constant average time. Patt. Recogn. Lett. 4, 145–157.

WATERMAN, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.

WEBER, R., SCHEK, H.-J., AND BLOTT, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Databases (VLDB'98).

WHITE, D. AND JAIN, R. 1996. Algorithms and strategies for similarity retrieval. Tech. Rep. VCL-96-101 (July), Visual Computing Laboratory, University of California, La Jolla.

YAO, A. 1990. Computational geometry. J. van Leeuwen, Ed. Elsevier Science, New York, 345–380.

YIANILOS, P. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth ACM–SIAM Symposium on Discrete Algorithms (SODA'93), 311–321.

YIANILOS, P. 1999. Excluded middle vantage point forests for nearest neighbor search. In DIMACS Implementation Challenge, ALENEX'99 (Baltimore, Md.).

YIANILOS, P. 2000. Locally lifting the curse of dimensionality for nearest neighbor search. In Proceedings of the Eleventh ACM–SIAM Symposium on Discrete Algorithms (SODA'00), 361–370.

YOSHITAKA, A. AND ICHIKAWA, T. 1999. A survey on content-based retrieval for multimedia databases. IEEE Trans. Knowl. Data Eng. 11, 1, 81–93.

Received July 1999; revised March 2000; accepted December 2000
