
Knowl Inf Syst (2011) 28:227–248 DOI 10.1007/s10115-010-0322-z

REGULAR PAPER

Distance approximation techniques to reduce the dimensionality for multimedia databases

Yongkwon Kim · Chin-Wan Chung · Seok-Lyong Lee · Deok-Hwan Kim

Received: 14 August 2009 / Revised: 17 April 2010 / Accepted: 25 June 2010 / Published online: 9 July 2010
© Springer-Verlag London Limited 2010

Abstract Recently, databases have been used to store multimedia data such as images, maps, video clips, and music clips. In order to search them, they should be represented by various features, which are composed of high-dimensional vectors. As a result, the dimensionality of data is increased considerably, which causes 'the curse of dimensionality'. The increase of data dimensionality causes poor performance of index structures. To overcome the problem, research on dimensionality reduction has been conducted. However, some reduction methods do not guarantee no false dismissal, while others incur high computational cost. This paper proposes dimensionality reduction techniques that guarantee no false dismissal while providing considerable efficiency by approximating distances with a few values. To provide the no false dismissal property, approximated distances should always be smaller than the original distances. The Cauchy–Schwarz inequality and two trigonometric equations are used, and the dimension partitioning technique is applied, to approximate distances in such a way as to reduce the difference between the approximated distance and the original distance. As a result, the proposed techniques reduce the candidate set of a query result for efficient query processing.

Y. Kim · C.-W. Chung (B) Division of Computer Science, KAIST, Daejeon 305-701, Korea e-mail: [email protected]

Y. Kim e-mail: [email protected]

S.-L. Lee School of Industrial and Information Engineering, Hankuk University of Foreign Studies, Yongin-si, Gyeonggi-do 449-701, Korea e-mail: [email protected]

D.-H. Kim School of Electronics and Electrical Engineering, Inha University, Incheon 402-751, Korea e-mail: [email protected]


Keywords Dimensionality reduction · Inner product · Dimension partition · Selectivity · Multimedia databases

1 Introduction

For many decades, databases have been used to store numeric or string data and to search them efficiently. The key values used to search such data have only a few dimensions.

But, recently, databases are used to store and search various types of data, especially multimedia data such as images, maps, video clips, and music clips. These data are usually represented by some features in multi-dimensional vectors. For images, features about color, shape, and texture are mainly used. Features that describe images or video clips are composed of high-dimensional data. For example, the color-structure descriptor (color histogram), the edge histogram, and ART (Angular Radial Transformation) are defined in MPEG-7 as color, texture, and shape features for images. They are composed of 256, 150, and 140-dimensional data, respectively. If only these features are used, a 546-dimensional vector is required to describe an image. As more and more features are used, the dimensionality increases rapidly. In fields other than image search, many techniques have also been studied to reduce the dimensionality, for example for data streams [14], back-propagation networks [20], and text retrieval [24].

The R-tree [13] was the start of multi-dimensional index structures. Following it, many tree-based index structures have been proposed [2–4,23,30]. But, as the dimensionality of data increases, the efficiency of the index structures decreases; this is caused by 'the curse of dimensionality' [6].

To mitigate this problem, index structures using vector-file compression have been proposed, such as the VA-file [29] and the LPC-file [5]. But they are not fundamental solutions since they suffer from the same problem in very high-dimensional spaces. According to the A-tree study [23], the VA-file shows performance degradation as the dimensionality of data increases.

While some researchers studied index structures, others were interested in dimensionality reduction techniques. Dimensionality reduction techniques reduce the dimensionality of data to cope with 'the curse of dimensionality' by using new values instead of the original values of the data. Singular Value Decomposition (SVD) [15] is one of the most famous techniques. DFT (Discrete Fourier Transform) [1] and DWT (Discrete Wavelet Transform) [31] are also well-known dimensionality reduction techniques. We explain them more specifically in the next section.

The contributions of this paper are as follows:

• Efficient distance approximation techniques: We devise efficient distance approximation techniques to reduce the dimensionality of data while still allowing the use of traditional index structures. The proposed techniques use no matrix operation and reduce the dimensionality of datasets using simple mathematical equations. Therefore, they can be used in many areas that deal with high-dimensional data.

• Guarantee of no false dismissal: In the proposed techniques, we devise a distance scheme such that the approximated distance is always less than or equal to the distance in the original data. This means that the techniques guarantee no false dismissal. Therefore, they are suitable for accuracy-critical applications.

• An indexing technique that uses existing index structures: Approximated distances are obtained by the devised distance scheme, which is different from the Euclidean distance scheme. We devise a technique to use the traditional index structures, which use the Euclidean distance scheme. Thus, the proposed techniques do not require a new index structure since the existing index structures can be used with no modification.


• Extensive experimental study: Through extensive experiments, the reduction of the candidate set is validated and the characteristics of datasets for which the techniques are well applicable are identified.

The rest of the paper is organized as follows. Section 2 reviews related work. In Sect. 3, the proposed techniques, inner2 and inner1, are presented. In Sect. 4, the indexing algorithm of the proposed techniques is provided. In Sect. 5, the relationship between inner2 and MS is presented. Section 6 shows experimental results. Section 7 concludes the paper.

2 Related work

The research on dimensionality reduction can be divided into two main streams. The first uses transformations, which map data in a high-dimensional space into data in a low-dimensional one, such as DFT (Discrete Fourier Transform) [1], DWT (Discrete Wavelet Transform) [31], and SVD [15,18,19,32]. DFT is a kind of Fourier transform and is used to reduce the dimensionality of time series data. DFT transforms the time series data into the frequency domain; using the first few frequencies, it reduces the dimensionality. DWT is similar to DFT, but DWT transforms the time series data into the time/frequency or space/frequency domain. Each component of the transformed data is mapped to a coefficient of the wavelets. SVD represents the data as a matrix and then finds the most effective axes among all dimensions of the data by using matrix operations. It is very powerful, but it is very costly. It is hard for SVD to use index structures since SVD is very sensitive to data updates, and the entire index structure must be rebuilt.

But these techniques have a big drawback. Usually, they use a part of the original dimensions, or a few dimensions that are simple combinations (e.g., linear combinations) of the original dimensions, as the axes of a low-dimensional space. So there can be some loss of information. This can cause false dismissals, that is, data that should be included in the answer set can be missed. To guarantee no false dismissal, the distance after the transformation should be a lower bound of the distance before the transformation. But the above techniques cannot guarantee this.

The second stream consists of techniques that do not use transformations. Data are represented by a few values derived from the original vectors for indexing. The technique of Egecioglu and Ferhatosmanoglu [7] belongs to this category; it proposed an inner product approximation technique to reduce the dimensionality of data, but it cannot guarantee no false dismissal. MS [26,27] and the OMNI-family technique [10] also belong to this category. MS uses the mean and standard deviation of each data vector for indexing; it is described below. OMNI represents data points by their distances to a set of foci. Both MS and OMNI use a nonlinear mapping of the data and a bounded approximation, which is proved using geometric tools, to guarantee no false dismissal. Since no false dismissal is an important property, our research focuses on this category.

OMNI was compared with MS in [27], where OMNI was shown to be less effective than MS. So, we do not compare our approaches with OMNI.

MS is a dimensionality reduction technique that can be applied to the spherical search. The spherical search is to find all points whose Euclidean distances from the query are less than a user-specified search radius ε. Let P be an n-dimensional point in the database and Q be the query. MS uses μ, which is the average of the values of the dimensions of a point, and σ, which is their standard deviation. First, MS bounds the search space using lower and upper bounds of μ and σ of P, which are calculated from μ and σ of Q and ε. Then, it finds the


points that satisfy the following formula.

[(μ_P − μ_Q)² + (σ_P − σ_Q)²]^(1/2) ≤ ε/√n

where μ_P and σ_P denote the average and standard deviation of the values of the dimensions of P, respectively, ε is the query radius, and n is the dimensionality of P and Q. The space that satisfies the above formula always includes the entire answer set. It is proved geometrically in Vu et al. [27].

MS uses a dimension partitioning technique to get a more accurate candidate set. MS uses two values (μ, σ) for each dimension partition.
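To make the pruning test concrete, the following minimal Python sketch (our reading of the formula above, not code from MS or from this paper) checks whether a point survives the MS lower bound for a given query and radius; the function name and NumPy representation are assumptions for illustration.

    import numpy as np

    def ms_candidate(p, q, eps):
        """Keep p if the MS lower bound does not exceed eps (no false dismissal)."""
        n = len(p)
        mu_p, sigma_p = np.mean(p), np.std(p)
        mu_q, sigma_q = np.mean(q), np.std(q)
        # [(mu_P - mu_Q)^2 + (sigma_P - sigma_Q)^2]^(1/2) <= eps / sqrt(n)
        return np.sqrt((mu_p - mu_q) ** 2 + (sigma_p - sigma_q) ** 2) <= eps / np.sqrt(n)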

3 Proposed techniques

In this section, we focus on our proposed techniques, inner2 and inner1. They originated from the same idea. The features that represent images or video clips are composed of high-dimensional vectors, as mentioned before. They can also be represented by points in a high-dimensional space. So, we use the terms point and vector interchangeably to indicate features. The distance between two n-dimensional vectors P and Q using the Euclidean distance (L2) is defined as Formula (1). The last term of Formula (1) is the inner product of the two vectors.

‖P − Q‖² = ‖P‖² + ‖Q‖² − 2⟨P, Q⟩     (1)

To calculate the inner product of P and Q, the values of all dimensions of each vector need to be known. But there are two ideas to approximate the inner product with a smaller number of values. These are the main ideas of inner2 and inner1 for reducing the dimensionality of data for indexing. In the case of inner2, only 2 values are required; for inner1, just 1 value.

But the approximation with 1 or 2 values is not accurate. So, we use the dimension partitioning which is used in MS. We first divide the dimension set into multiple partitions. From each dimension partition, we produce a subvector, which is a 0-masked n-dimensional copy of the original vector: the dimensions of a subvector that are not included in the corresponding dimension partition are masked by 0. We calculate the approximated distance (called the subdistance) using a subvector for each dimension partition. Then, we combine the approximated subdistances into an approximated distance over the whole dimension set. The formulas below show the relationship between the original distance and the approximated distance.

‖P − Q‖² = d²
         = (p_1 − q_1)² + (p_2 − q_2)² + · · · + (p_n − q_n)²
         = Σ_{t=1}^{m} (p_t − q_t)² + Σ_{t=m+1}^{2m} (p_t − q_t)² + · · · + Σ_{t=n−m+1}^{n} (p_t − q_t)²
         = d_1² + d_2² + · · · + d_k²
         ≥ ad_1² + ad_2² + · · · + ad_k² = ad²

In the above formulas, d is the total distance over the whole dimension set, d_i is the subdistance for the i-th subvector, k is the number of subvectors, and m (= n/k) is the number of dimensions included in each dimension partition of P and Q. When n/k is not an exact integer division, dummy 0s can be appended to the vector to make it one. In inner1 and inner2, the summation of the values of the dimensions and the summation of their squares are used, so the appended zeros do not influence the converted vectors or the search results. ad_i and ad are the approximated distances for d_i and d, respectively. Inner2 requires 2 values to calculate an approximated subdistance of each subvector. So, inner2 reduces the required dimensionality for indexing to 2k (n ≫ k). Similarly, inner1 requires only k dimensions for indexing, i.e., inner2 and inner1 can reduce the dimensionality of vectors from n to 2k and k, respectively.

In the case of both inner2 and inner1, approximated distances are less than or equal to the original distances. This means that there is no false dismissal in the result set when inner2 or inner1 is used for the range search. Assume the query radius of a given range query equals ε. The answer set of the range query contains all data whose distances to the query are less than or equal to ε. Similarly, a candidate set contains all data whose approximated distances to the query are less than or equal to ε. So all data in the answer set are included in the candidate set since ad² ≤ d² ≤ ε².

In the next sections, the proposed techniques are presented in detail. Notations used in this paper are shown in Table 1.

Table 1 Notations

  Notation              Meaning
  P                     A data vector in the DB
  Q                     The query issued by a user
  Subvector             Modified vector corresponding to each dimension partition
  Subdistance           Distance between subvectors of the query and of data in the DB
  P_i, Q_i              The i-th subvectors of P and Q
  S_i^P, S_i^Q          Sum of the values of the i-th subvector of P and Q
  S_i,sq^P, S_i,sq^Q    Sum of the squares of the values of the i-th subvector of P and Q
  I_i                   The i-th axis used to approximate cos θ_i

3.1 Inner2

As mentioned in the previous section, inner2 uses 2 values for calculating each approximated subdistance by approximating an inner product between two vectors. When X = (x_1, x_2, . . ., x_n) is an n-dimensional vector, those two values are as follows.

S_i^X = Σ_{t=m·i+1}^{m·(i+1)} x_t ,      S_i,sq^X = Σ_{t=m·i+1}^{m·(i+1)} x_t²

S_i^X is the sum of the values of each dimension of the i-th subvector of X, and S_i,sq^X is the sum of the squares of the same values. In the rest of this section, we explain how they are used.
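As a concrete illustration (our own sketch, not the authors' code), the following Python function zero-pads a vector so that n/k becomes an exact integer division and computes S_i^X and S_i,sq^X for every dimension partition; the function name is assumed for illustration.

    import numpy as np

    def partition_sums(x, k):
        """Return (S_i, S_i_sq) for the k contiguous dimension partitions of x."""
        x = np.asarray(x, dtype=float)
        m = -(-len(x) // k)              # dimensions per partition (ceiling division)
        padded = np.zeros(m * k)
        padded[:len(x)] = x              # appended dummy 0s change neither sum
        parts = padded.reshape(k, m)
        return parts.sum(axis=1), (parts ** 2).sum(axis=1)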

Inner2 uses the following trigonometric formulas.

sin²θ + cos²θ = 1     (2)

cos(θ1 ± θ2) = cos θ1 cos θ2 ∓ sin θ1 sin θ2 (3)


Fig. 1 Optimal case

Fig. 2 Approximated case

Formulas (2) and (3) are related to the cosine function since the inner product of two n-dimensional vectors can be defined as follows:

⟨P, Q⟩ = Σ_{t=1}^{n} p_t · q_t = ‖P‖ ‖Q‖ cos θ

If the lengths of the vectors P and Q are already known, one additional value, cos θ, is required to calculate the inner product, as shown in the above formula. The same applies to each subvector. The inner product between the i-th subvectors of P and Q can be calculated using their lengths, obtained from S_i,sq^P and S_i,sq^Q, and cos θ_i. The inner2 technique approximates cos θ_i with just S_i^P, S_i^Q, S_i,sq^P, and S_i,sq^Q in order to approximate the distance between the i-th subvectors of P and Q (i.e., to find ad_i²).

The subdistance that we want to obtain is the distance between subvectors of a query vector Q and a data vector P in the database. Because index structures are built before the query processing steps, all values for indexing are computed before the query is given. But cos θ_i cannot be calculated before the query is issued. Therefore, cos θ_i is not suitable for indexing. So, we approximate cos θ_i using already known values and replace cos θ_i with the precomputed value.

To approximate cos θ_i, some vectors are defined as axes for the dimension partitions. It is natural to use an appropriate subvector of a query vector as an axis, but, as mentioned earlier, this is not a good idea for indexing since the query is not determined when the index structure is already built. So, predefined vectors are used as axes for each dimension partition. In this way, it is possible to calculate the cosine values between the subvectors of each data point and the predefined axes when they are stored in the database.

The gap between the actual cos θ_i and the approximated one is shown in Figs. 1 and 2. They show the volume of the candidate set in which the cosine similarities between all data and the query are greater than or equal to cos θ_i (as θ increases, cos θ decreases).

Figure 1 shows an optimal case that uses the query point as the axis. The volume of the cone-shaped space is minimal. On the other hand, the volume of the space which uses a


Fig. 3 Approximated case using two axes

predefined vector is greater than that of the optimal case, as shown in Fig. 2. In the case of Fig. 1, the two subvectors of a query point and a data point, and the axis, are always in the same plane, since the subvector of the query point is used as the axis. In Fig. 2, however, they may not be in the same plane depending on the position of the subvector of a data point. This makes the candidate set of the approximated case larger than that of the optimal case.

The concept of dimension partitioning is effective in reducing the gap between the volumes of the spaces generated by the two cases. Figure 3 shows the effect when two axes are used. Each cone-shaped space represents the candidate set of each dimension partition. The intersection of the two cone-shaped spaces corresponds to the candidate set using two axes, since all points in the final candidate set should be in the candidate sets of every dimension partition. The size of the candidate set can decrease quickly if more axes are used.

To approximate the cosine value (cos θ_i), Formulas (2) and (3) are used. Let α be the angle between an axis and the subvector of a query point and β be the angle between an axis and the subvector of a data point in the database. Then, θ_i can be approximated by |α − β|, even though θ_i ≥ |α − β|. So cos θ_i can be bounded as follows using Formula (3).

cos θ_i ≤ cos(|α − β|) = cos(α − β) = cos α cos β + sin α sin β

Assume that all values in the data vectors are positive. Most features used in image search are represented as histograms, and all of their values are positive. However, there can be negative values in the case of general datasets. Such negative values can be removed easily by simple normalization techniques such as min-max normalization. Then, α and β are at most 90°, so α − β lies in [−90°, 90°], and cos(|α − β|) = cos(α − β) holds in that range. cos θ_i can be calculated if cos α and cos β are known; sin α and sin β can be calculated by Formula (2). cos α is calculated as follows. We use the concept of the inner product again. Using the inner product of the subvector of the query point and an axis,


cos α can be represented as follows.

⟨Q_i, I_i⟩ = ‖Q_i‖ ‖I_i‖ cos α

cos α = ⟨Q_i, I_i⟩ / (‖Q_i‖ ‖I_i‖)

Q_i and I_i stand for the i-th subvector of the query Q and the i-th axis, respectively. I_i is defined as follows.

I_i = (i_i1, i_i2, . . . , i_in), where i_ij = 1 if the i-th partition contains the j-th dimension, and i_ij = 0 otherwise.

Similarly, cos β can be represented as follows. P_i also stands for the i-th subvector of the data point P.

cos β = ⟨P_i, I_i⟩ / (‖P_i‖ ‖I_i‖)

Using cos α, cos β, Formula (2), and Formula (3), we can obtain the following expression for the approximation of cos θ_i.

cos θ_i ≤ cos α cos β + sin α sin β
        = cos α cos β + √(1 − cos²α) · √(1 − cos²β)
        = (S_i^Q / (√m · √S_i,sq^Q)) · (S_i^P / (√m · √S_i,sq^P))
          + √(1 − (S_i^Q)² / (m · S_i,sq^Q)) · √(1 − (S_i^P)² / (m · S_i,sq^P))

θ_i is the angle between the i-th subvectors of the query point Q and a data point P. So, ⟨P_i, Q_i⟩ can be approximated using cos θ_i.

⟨P_i, Q_i⟩ = ‖P_i‖ ‖Q_i‖ cos θ_i = √S_i,sq^P · √S_i,sq^Q · cos θ_i
          ≤ (S_i^P S_i^Q)/m + √(S_i,sq^P − (S_i^P)²/m) · √(S_i,sq^Q − (S_i^Q)²/m)
          = (S_i^P S_i^Q + √((m·S_i,sq^P − (S_i^P)²) (m·S_i,sq^Q − (S_i^Q)²))) / m     (4)

Finally, we can approximate ⟨P_i, Q_i⟩ using only S_i^P, S_i^Q, S_i,sq^P, and S_i,sq^Q. We are able to calculate the approximated subdistance between the subvectors of P and Q using Formulas (4) and (1) as follows. The approximated subdistance is less than or equal to the original subdistance.

‖P_i − Q_i‖² ≥ S_i,sq^P + S_i,sq^Q − (2/m) · (S_i^P S_i^Q + √((m·S_i,sq^P − (S_i^P)²) (m·S_i,sq^Q − (S_i^Q)²)))     (5)

The approximated distance is obtained by summing the approximated subdistances. Thus, to calculate the approximated distance using inner2, only 2k values are required, where k is the number of dimension partitions.
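A minimal Python sketch of this computation (an assumption for illustration, not the paper's implementation) evaluates the inner2 bound of Formulas (4) and (5) for every partition and sums the approximated subdistances; it presumes nonnegative data, as discussed above, and per-partition sums such as those produced by partition_sums.

    import numpy as np

    def inner2_approx_sq_dist(sP, sP_sq, sQ, sQ_sq, m):
        """Lower bound on ||P - Q||^2 from the 2k inner2 values (arrays of length k)."""
        # Approximated inner product of the i-th subvectors, Formula (4);
        # m*S_sq - S^2 is clamped at 0 for numerical safety.
        cross = np.sqrt(np.maximum(m * sP_sq - sP ** 2, 0.0) *
                        np.maximum(m * sQ_sq - sQ ** 2, 0.0))
        inner = (sP * sQ + cross) / m
        # Formula (5) summed over all partitions never exceeds the true squared distance.
        return float(np.sum(sP_sq + sQ_sq - 2.0 * inner))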


3.2 Inner1

Inner1 is a dimensionality reduction technique using a single value for each dimension partition. Using this value, inner1 approximates the inner product of the query and the data in the database. The basic framework of inner1 is similar to that of inner2.

The Cauchy–Schwarz inequality, |⟨P, Q⟩| ≤ ‖P‖ · ‖Q‖, is utilized for the approximation of the inner product of two vectors. Inner1 also uses S_i,sq^P and S_i,sq^Q as follows, since √S_i,sq^P equals the length of the i-th subvector of P and √S_i,sq^Q equals that of Q.

⟨P_i, Q_i⟩ ≤ ‖P_i‖ ‖Q_i‖ = √S_i,sq^P · √S_i,sq^Q

The approximated distance using inner1 is defined with the above formula and Formula (1) as follows. Using the following equation, inner1 can reduce the dimensionality of vectors from n to k.

‖P_i − Q_i‖² ≥ S_i,sq^P + S_i,sq^Q − 2 · √S_i,sq^P · √S_i,sq^Q     (6)

Inner1 is simpler than inner2. This simplicity means using less information about each subvector. So, when inner1 is used, the gap between a subdistance and its approximated subdistance is larger than that of inner2.
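For comparison, the inner1 bound of Formula (6) reduces to the following sketch (again an illustrative assumption), which needs only the k values S_i,sq of each vector.

    import numpy as np

    def inner1_approx_sq_dist(sP_sq, sQ_sq):
        """Lower bound on ||P - Q||^2 from the k inner1 values:
        each approximated subdistance is (sqrt(S_i,sq^P) - sqrt(S_i,sq^Q))^2."""
        return float(np.sum((np.sqrt(sP_sq) - np.sqrt(sQ_sq)) ** 2))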

A strong point of inner1 is that it uses a single value for each subvector. With the same indexing dimensionality, inner1 can use twice as many subvectors as inner2. For example, to search 100-dimensional data with a 20-dimensional index, inner1 uses 20 subvectors of 5 dimensions each to obtain the approximated distance, while inner2 can use only 10 subvectors. If inner1 and inner2 use the same number of dimension partitions, inner2 is more effective than inner1. However, in general, the indexing dimensionality is more important than the number of dimension partitions. The comparison is shown in Sect. 6.

4 Indexing for inner2 and inner1

The indexing process for inner1 and inner2 consists of two main phases. First, to roughly prune data points that cannot belong to the candidate set, a rectangular query derived from the query point is sent to the index structure. After that, the result of the rectangular query is refined to find the candidate set. This is based on the filter-and-refine technique developed in Faloutsos [8]. The distance functions of the proposed techniques are different from the Euclidean distance scheme which is generally used in the existing index structures. Thus, there is no existing index structure for the techniques, and the approximated distance function shown before cannot be used directly in the indexing steps. For the generality of our techniques, we use the filter-and-refine technique to utilize existing index structures which use the Euclidean distance instead of devising our own index structure. After the filtering step, the approximated distance function is used to refine the results which are returned by the index structure.

The obtained candidate set always includes the whole answer set since the approximated distances are less than or equal to the exact distances. In other words, there is no false dismissal.


The rest of this section is about the indexing process for inner1 and inner2. S_i,sq^P is used in both inner1 and inner2, and S_i^P is used only in inner2. The formulas below give the range of S_i,sq^P values, starting from a relationship between the length of the point P and that of the query Q.

|√S_i,sq^P − √S_i,sq^Q| ≤ ε
−ε ≤ √S_i,sq^P − √S_i,sq^Q ≤ ε
max(0, √S_i,sq^Q − ε) ≤ √S_i,sq^P ≤ √S_i,sq^Q + ε
{max(0, √S_i,sq^Q − ε)}² ≤ S_i,sq^P ≤ (√S_i,sq^Q + ε)²     (7)

In the above formula, the max function can be used since S_i,sq^P values are always positive. Without the max function, the range of S_i,sq^P shrinks if √S_i,sq^Q − ε < 0, which can cause false dismissals.

In the case of S_i^P, we use the range calculated using the formula in Vu et al. [27], |μ_i^P − μ_i^Q| ≤ ε/√m, where μ_i^P and μ_i^Q denote the average of the values included in the i-th dimension partition of P and Q, respectively, ε is the query radius, and m is the number of dimensions included in each dimension partition. We can easily see that S_i^P = m × μ_i^P. Applying these, we obtain the range of S_i^P.

|μ_i^P − μ_i^Q| ≤ ε/√m
m · |μ_i^P − μ_i^Q| ≤ ε√m
|S_i^P − S_i^Q| ≤ ε√m
max(0, S_i^Q − ε√m) ≤ S_i^P ≤ S_i^Q + ε√m     (8)

Figure 4 shows the algorithm for building an index structure and searching for the candidate set. The ConvertVector procedure converts a vector into a k-dimensional (using inner1) or 2k-dimensional (using inner2) vector. If inner1 is used, each S_i^v does not have to be calculated, so a k-dimensional vector is returned (line 13). But if inner2 is used, each S_i^v is also used for indexing, so a 2k-dimensional vector which includes not only each S_i,sq^v but also S_i^v is returned.

When building an index structure, each data point in P is converted by the ConvertVector procedure using method m (inner2 or inner1). Then, the converted vectors are inserted into the index structure (lines 1–4, BuildIndex procedure). Searching for the candidate set is composed of two phases. First, the query q is also converted using ConvertVector. Lines 6–7 are the steps of the first phase, a rectangular query. Lines 8–14 present the second phase, where all data points whose approximated distances are less than or equal to the query radius ε are included in the candidate set.

After searching for the candidate set, we fetch the real vectors of all data in the candidate set from the DB and calculate the actual distances using Formula (1).
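The two-phase search can be outlined as follows in Python. This is only an illustrative sketch under several assumptions: the converted vectors are kept in a plain NumPy array standing in for the index structure, the columns are ordered [S_1..S_k, S_1,sq..S_k,sq] for inner2 (only the S_i,sq columns for inner1), and the rectangular filter of Formulas (7) and (8) is applied by direct comparison. A real deployment would send the rectangular query to an index such as the SR-tree and would use the actual ConvertVector of Fig. 4.

    import numpy as np

    def convert_vector(v, k, method="inner2"):
        """k values (inner1) or 2k values (inner2) used for indexing."""
        s, s_sq = partition_sums(v, k)                     # sketch given in Sect. 3.1
        return s_sq if method == "inner1" else np.concatenate([s, s_sq])

    def search_candidates(index, q, eps, k, m, method="inner2"):
        """Phase 1: rectangular filter; Phase 2: approximated-distance refinement."""
        cq = convert_vector(q, k, method)
        sQ_sq = cq if method == "inner1" else cq[k:]
        lo_sq = np.maximum(np.sqrt(sQ_sq) - eps, 0.0) ** 2           # Formula (7)
        hi_sq = (np.sqrt(sQ_sq) + eps) ** 2
        if method == "inner1":
            mask = np.all((index >= lo_sq) & (index <= hi_sq), axis=1)
        else:
            sQ = cq[:k]
            lo_s = np.maximum(sQ - eps * np.sqrt(m), 0.0)            # Formula (8)
            hi_s = sQ + eps * np.sqrt(m)
            mask = (np.all((index[:, k:] >= lo_sq) & (index[:, k:] <= hi_sq), axis=1)
                    & np.all((index[:, :k] >= lo_s) & (index[:, :k] <= hi_s), axis=1))
        candidates = []
        for idx in np.where(mask)[0]:
            row = index[idx]
            if method == "inner1":
                ad2 = inner1_approx_sq_dist(row, sQ_sq)
            else:
                ad2 = inner2_approx_sq_dist(row[:k], row[k:], cq[:k], cq[k:], m)
            if ad2 <= eps ** 2:                                      # ad^2 <= d^2 <= eps^2
                candidates.append(idx)
        return candidates

The candidate identifiers returned here would then be used to fetch the original vectors and compute the exact Euclidean distances.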


Fig. 4 An indexing algorithm for inner2 and inner1

5 The relationship between inner2 and MS

This section shows the relationship between inner2 and MS. Their formulas for calculating approximated distances look very different from each other, but they are equivalent.

As shown before, inner2 uses the following formula to calculate the i-th approximated subdistance between P and Q.

d²_inner2,i = S_i,sq^P + S_i,sq^Q − (2/m) · (S_i^P S_i^Q + √((m·S_i,sq^P − (S_i^P)²) (m·S_i,sq^Q − (S_i^Q)²)))

where m is the number of dimensions used to calculate the i-th approximated subdistance. MS uses μ and σ, which denote the average and standard deviation of a vector's values, respectively. With MS, the i-th approximated subdistance between P and Q is calculated as follows.

d²_ms,i = m · ((μ_P,i − μ_Q,i)² + (σ_P,i − σ_Q,i)²)


Theorem 7.1 d²_inner2,i and d²_ms,i are equivalent.

Proof The variables of inner2, S_i and S_i,sq, can be expressed by those of MS, μ_i and σ_i.

S_i,sq = m(σ_i² + μ_i²),      S_i = m·μ_i

Then,

d²_inner2,i = S_i,sq^P + S_i,sq^Q − (2/m) · (S_i^P S_i^Q + √((m·S_i,sq^P − (S_i^P)²) (m·S_i,sq^Q − (S_i^Q)²)))
            = m(σ_P,i² + μ_P,i² + σ_Q,i² + μ_Q,i²) − (2/m) · m² μ_P,i μ_Q,i
              − (2/m) · √(m²(σ_P,i² + μ_P,i² − μ_P,i²) · m²(σ_Q,i² + μ_Q,i² − μ_Q,i²))
            = m(σ_P,i² + μ_P,i² + σ_Q,i² + μ_Q,i² − 2μ_P,i μ_Q,i − 2σ_P,i σ_Q,i)
            = m((μ_P,i − μ_Q,i)² + (σ_P,i − σ_Q,i)²)
            = d²_ms,i    □
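As a quick numerical sanity check (an illustrative sketch, not part of the paper), the equivalence can be verified on a random dimension partition:

    import numpy as np

    rng = np.random.default_rng(0)
    m = 16
    p, q = rng.random(m), rng.random(m)        # one dimension partition of P and Q

    # inner2 form of the approximated subdistance
    sP, sP_sq = p.sum(), (p ** 2).sum()
    sQ, sQ_sq = q.sum(), (q ** 2).sum()
    d2_inner2 = sP_sq + sQ_sq - (2.0 / m) * (
        sP * sQ + np.sqrt((m * sP_sq - sP ** 2) * (m * sQ_sq - sQ ** 2)))

    # MS form of the approximated subdistance (np.std is the population deviation)
    d2_ms = m * ((p.mean() - q.mean()) ** 2 + (p.std() - q.std()) ** 2)

    assert np.isclose(d2_inner2, d2_ms)        # the two formulas agree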

Although the formulas used in inner2 and MS are equivalent, inner2 can find the candidate set more efficiently than MS. MS also uses an indexing scheme which consists of two phases, like inner2. The size of the intermediate result means the number of data in the result of the 1st phase of the indexing scheme. According to the experimental results in Sect. 6.3, the size of the intermediate result of MS is larger than that of inner2. The size of the intermediate result represents the cost required to find the candidate set. Thus, the query processing cost of MS is higher than that of inner2. From the point of view of the speed of the database mapping, MS shows slightly better performance than inner2. We checked the index building time of MS and inner2 for the hist and samp data. For the hist data, MS takes 64.59 s and inner2 takes 66.05 s. For the samp data, MS takes 65.23 s and inner2 takes 67.20 s. However, the database mapping steps are conducted in the preprocessing steps, before any query is issued, while the purpose of indexing is to reduce the query processing time.

6 Experimental results

6.1 Experimental setup

We performed experiments on several datasets. Two datasets, hist and samp, were obtained from the authors of Vu et al. [27]; they are two of the four datasets used there. Both hist and samp include 15766 data points with 256 dimensions. The third dataset is optdigits, a normalized bitmap of handwritten digits (NIST dataset) [25]. It consists of 5620 data points with 64 dimensions. The keypoints dataset contains keypoints [21] of a gray-scale satellite image, with 128 dimensions. Using the Caltech-101 image set [9,28], two datasets were obtained, 101-color and 101-edge. These datasets are color histogram and edge histogram data. The color histogram and edge histogram are defined in the MPEG-7 document [22]. Table 2 shows the description of all datasets. The indexing dimensionality means the number of dimensions used for indexing. For example, an 8-dimensional index structure is used for the hist dataset. Inner1 uses 8 subvectors for the hist dataset, but inner2 can use only 4 subvectors.

Table 2 Dataset description

  Name of dataset   # of data points   Original dimensions   Indexing dimensionality   Description
  hist              15766              256                   8                         From MS
  samp              15766              256                   8                         From MS
  Optdigits         5620               64                    4                         Normalized bitmap
  Keypoints         21444              128                   8                         SIFT keypoints
  101-color         9197               256                   8                         Color histogram of Caltech-101
  101-edge          9197               150                   6                         Edge histogram of Caltech-101
  Caltech           34029              256                   8                         Color histograms of Caltech-101 and Caltech-256
  ALOI              72000              256                   8                         Color histogram of ALOI
  ObjCateg          106029             256                   8                         Merge of Caltech and ALOI

In this experiment, we use the SR-tree [16] as the index structure. The SR-tree performs well in spaces of up to 48 dimensions [17]. Any other index structure which uses the Euclidean distance scheme can be used instead of the SR-tree. As mentioned before, there is a phase that refines the result of the range search performed by the index structure. Thus, which sort of index structure is used in the experiments does not influence the experimental results.

We compare the selectivity of inner1 and inner2 for every dataset. We repeat the experiments using 100 queries for each dataset and then average the results over all the queries. The selectivity means the ratio of the size of the candidate set to the size of the whole dataset. Dimensionality reduction techniques are used to speed up query processing by reducing the candidate set. The size of each candidate set indicates each dimensionality reduction technique's performance, which means that the selectivity can be used as a metric for efficiency.

Although the scope of this research is dimensionality reduction supporting efficient query processing, we do not compromise accuracy for efficiency. The proposed techniques efficiently perform the index search and reduce the candidate set as much as possible under the condition that the candidate set contains the entire query result (i.e., no false dismissal). Therefore, the proposed techniques do not affect the accuracy negatively.

6.2 Experimental results

Figures 5 and 6 provide the results for the hist and samp datasets. These two datasets were used in the MS paper. Two dashed lines in these figures show the intermediate results of the 1st phase of the indexing algorithm. Inner2-1 and inner1-1 represent the selectivities of the intermediate results of inner2 and inner1, respectively. In Fig. 5, inner1 shows a better result than inner2 for the hist dataset. But, for the samp dataset, inner2 performs better. This indicates the trade-off mentioned in Sect. 3.2: inner2 uses a more precise approximation technique than inner1, but inner1 uses more subvectors than inner2. In the case of the hist dataset, the number of subvectors is more important than the accuracy of the approximation, while for the samp dataset it is the opposite. The intermediate results of the two datasets are in proportion to the corresponding final results.

Figures 7, 8, 9, and 10 show experimental results on the other datasets: optdigits, keypoints, 101-color, and 101-edge. Dashed lines show the effect of the 1st stage of indexing. Inner1 shows better selectivity than inner2 for most of the datasets, but inner2 shows better selectivity than inner1 for the 101-edge dataset. The intermediate results of these datasets also show the same tendency as the final results.


Fig. 5 Selectivity of hist dataset

Fig. 6 Selectivity of samp dataset

As shown before, inner1 performs better than inner2 for most of the datasets, but there are some datasets for which inner2 shows better results than inner1. To explain this, we use the correlation between all pairs of dimensions for each dataset.

The correlation indicates the strength and direction of a linear relationship between two random variables. In this paper, it is applied to each pair of dimensions instead of random variables. After obtaining the whole correlation matrix, we calculate the average of all correlation coefficients. Correlation coefficients are in the range [−1, 1]. We use the absolute value of every correlation coefficient so that they show the property of the datasets more accurately. The average values of the datasets are shown in Table 3, sorted in ascending order.
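A minimal sketch of this statistic (our assumption about the exact procedure; whether the diagonal entries are excluded is not stated in the text, and this version excludes them):

    import numpy as np

    def avg_abs_correlation(data):
        """Average absolute correlation over all pairs of distinct dimensions.
        data has shape (number of points, number of dimensions)."""
        corr = np.corrcoef(data, rowvar=False)     # dimension-by-dimension correlations
        off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
        return float(np.mean(np.abs(corr[off_diagonal])))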

Fig. 7 Selectivity of optdigits dataset

Fig. 8 Selectivity of keypoints dataset

As shown before, inner2 performs better than inner1 only for the samp and 101-edge datasets, which also have large correlation coefficients. In general, as the dimensionality of data increases, the average of the correlation coefficients of the data becomes smaller. Therefore, hist and samp are unusual. A dimension may be correlated with a few dimensions, but it is impossible that every dimension is correlated with all other dimensions.

We experimented with about 20 datasets and most of them show the same tendency: inner1 performs better than inner2. These datasets have small correlation coefficients and high dimensionality.

In addition, we performed more experiments using larger datasets, Caltech and ALOI. Caltech consists of the color histograms of 34029 images, which are from the Caltech-101 image set mentioned before and the Caltech-256 image set [12]. ALOI is also color histogram data of 72000 images, which are from the ALOI (Amsterdam Library of Object Images) image set [11]. We merged the two large datasets into the ObjCateg dataset in order to obtain a larger dataset with 106029 images. Figures 11, 12 and 13 show the experimental results for Caltech, ALOI, and ObjCateg. Inner1 performs better than inner2 for all three datasets. Their correlation values are smaller than that of the Keypoints dataset.


Fig. 9 Selectivity of 101-color dataset

Fig. 10 Selectivity of 101-edge dataset

Table 3 Correlation values of the datasets

  Dataset      Dimensionality   Average
  ALOI         256              0.067735345
  101-color    256              0.077969812
  Caltech      256              0.079698056
  ObjCateg     256              0.083181496
  Keypoints    128              0.097685576
  Optdigits    64               0.111339427
  101-edge     150              0.152093143
  hist         256              0.209989637
  samp         256              0.270788264


Fig. 11 Selectivity of CalTech dataset

Fig. 12 Selectivity of ALOI dataset

6.3 Pruning effects of the indexing of inner2 and MS

The indexing process proposed in this paper consists of two main phases: the rectangular query and the intermediate result processing. Inner2 and MS generate the same result set after the intermediate result processing; their intermediate results, however, can be different. Figures 14 and 15 show the pruning effect of the 1st phase (i.e., the sizes of the intermediate results) of the indexing process, with respect to the hist and samp datasets, respectively. In these figures, the 2nd phase indicates the result set size after the intermediate result processing of inner2 and MS. According to Figs. 14 and 15, inner2 produces smaller results in the 1st phase of the indexing scheme than MS. This means that inner2 can find the candidate sets more efficiently than MS.


Fig. 13 Selectivity of ObjCateg dataset

Fig. 14 Intermediate result of hist dataset

Fig. 15 Intermediate result of samp dataset


Fig. 16 Results using various indexing dimensionalities (hist)

Fig. 17 Results using various indexing dimensionalities (samp)

6.4 Effects of the indexing dimensionality

The indexing dimensionality can change according to the dimensionality of the data, the feature types, the index structure type, and so on. The indexing dimensionality may affect the size of the candidate set.

Figures 16 and 17 show the selectivities for various indexing dimensionalities. Figure 16 is the result on the hist dataset, and Fig. 17 is on the samp dataset. In both figures, the more indexing dimensions are used, the smaller the size of each candidate set is.

This shows that we can obtain better results if we use a larger indexing dimensionality. However, because of 'the curse of dimensionality' [6], most index structures suffer from performance degradation as the indexing dimensionality increases. Thus, the indexing performance will be best when the indexing dimensionality is set to the largest value before the performance degradation starts.

7 Conclusion

This paper proposed two dimensionality reduction techniques to cope with 'the curse of dimensionality'. The proposed techniques approximate the distance between two vectors by


approximating the inner product of the two vectors. They are able to reduce the dimensionality by using a few values instead of all the values of the two vectors to approximate the inner product. They can find candidate sets of spherical queries in large databases efficiently.

In our techniques, simple formulas are used to approximate the distances between two data vectors. The Cauchy–Schwarz inequality is used for inner1, and simple trigonometric equations are used for inner2.

They are useful for finding candidate sets with no false dismissal. To increase the efficiency, we use a dimension partitioning technique. We also provide an algorithm to use existing index structures directly. Using it, inner1 or inner2 can easily be used for dimensionality reduction in various situations.

Acknowledgments This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2010-0000863).

References

1. Agrawal R, Faloutsos C, Swami AN (1993) Efficient similarity search in sequence databases. In: Proceedings of the International Conference on Foundations of Data Organization and Algorithms, pp 69–84
2. Beckmann N, Kriegel HP, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp 322–331
3. Berchtold S, Keim DA, Kriegel HP (1996) The X-tree: an index structure for high-dimensional data. In: Proceedings of the International Conference on Very Large Data Bases, pp 28–39
4. Cha GH, Chung CW (2002) The GC-tree: a high-dimensional index structure for similarity search in image databases. IEEE Trans Multimed 4(2):235–247
5. Cha GH, Zhu X, Petkovic P, Chung CW (2002) An efficient indexing method for nearest neighbor searches in high-dimensional image databases. IEEE Trans Multimed 4(1):76–87
6. Donoho DL (2000) High-dimensional data analysis: the curses and blessings of dimensionality. In: AMS Conference on Mathematical Challenges of the 21st Century
7. Egecioglu O, Ferhatosmanoglu H (2004) Dimensionality reduction and similarity computation by inner product approximations. IEEE Trans Knowl Data Eng 16(6):714–726
8. Faloutsos C (1996) Searching multimedia databases by content. Kluwer Academic Publishers, Dordrecht
9. Fei-Fei L, Fergus R, Perona P (2004) Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: CVPR 2004, Workshop on Generative-Model Based Vision
10. Filho RFS, Traina AJM, Traina C Jr, Faloutsos C (2001) Similarity search without tears: the OMNI family of all-purpose access methods. In: Proceedings of the Seventeenth International Conference on Data Engineering, pp 623–630
11. Geusebroek JM, Burghouts GJ, Smeulders AWM (2005) The Amsterdam library of object images. Int J Comput Vis 61(1):103–112
12. Griffin G, Holub A, Perona P (2007) Caltech-256 object category dataset. TR-7694, California Institute of Technology
13. Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp 47–57
14. Huang Z, Sun S, Wang W (2009) Efficient mining of skyline objects in subspaces over data streams. Knowledge and Information Systems, published online
15. Kanth KVR, Agrawal D, Abbadi AE, Singh A (1999) Dimensionality reduction for similarity searching in dynamic databases. Comput Vis Image Underst 75(1-2):59–72
16. Katayama N, Satoh S (1997) The SR-tree: an index structure for high-dimensional nearest neighbor queries. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp 369–380
17. Katayama N, Satoh S (2000) Application of multidimensional indexing methods to massive processing of multimedia information. Syst Comput Jpn 31(13):31–41
18. Keogh EJ, Chakrabarti K, Mehrotra S, Pazzani MJ (2001) Locally adaptive dimensionality reduction for indexing large time series databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp 369–380
19. Keogh EJ, Chakrabarti K, Pazzani MJ, Mehrotra S (2001) Dimensionality reduction for fast similarity search in large time series databases. Knowl Inform Syst 3(3):263–286
20. Lin S, Chen S, Wu W, Chen C (2009) Parameter determination and feature selection for back-propagation network by particle swarm optimization. Knowl Inform Syst 21(2):249–266
21. Lowe D (2003) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
22. Martinez JM (2002) MPEG-7: overview of MPEG-7 description tools, part 2. IEEE Multimed 9(3):83–93
23. Sakurai Y, Yoshikawa M, Uemura S, Kojima H (2000) The A-tree: an index structure for high-dimensional spaces using relative approximation. In: Proceedings of the International Conference on Very Large Data Bases, pp 516–526
24. Song G, Cui B, Zheng B, Xie K, Yang D (2009) Accelerating sequence searching: dimensionality reduction method. Knowl Inform Syst 20(3):301–322
25. UCI Machine Learning Repository (1998) ftp://ftp.ics.uci.edu/pub/machine-learning-databases/optdigits/
26. Vu K, Hua KA, Cheng H, Lang SD (2008) Bounded approximation: a new criterion for dimensionality reduction approximation in similarity search. IEEE Trans Knowl Data Eng 20(6):768–783
27. Vu K, Hua KA, Cheng H, Lang SD (2006) A non-linear dimensionality-reduction technique for fast similarity search in large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp 527–538
28. Wang JZ, Boujemaa N, Bimbo AD, Geman D, Hauptmann AG, Tesic J (2006) Diversity in multimedia information retrieval research. In: Proceedings of the ACM International Workshop on Multimedia Information Retrieval, pp 5–12
29. Weber R, Schek HJ, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of the International Conference on Very Large Data Bases, pp 194–205
30. White DA, Jain R (1996) Similarity indexing with the SS-tree. In: Proceedings of the International Conference on Data Engineering, pp 516–523
31. Wu YL, Agrawal D, Abbadi AE (2000) A comparison of DFT and DWT based similarity search in time-series databases. In: Proceedings of the ACM CIKM International Conference on Information and Knowledge Management, pp 488–495
32. Yi BK, Faloutsos C (2000) Fast time sequence indexing for arbitrary Lp norms. In: Proceedings of the International Conference on Very Large Data Bases, pp 385–394

Author Biographies

Yongkwon Kim received B.S. and M.S. degrees in computer science from Korea Advanced Institute of Science and Technology (KAIST). He is currently a Ph.D. candidate at KAIST. His research interests include multimedia databases, dimensionality reduction, and data mining.


Chin-Wan Chung received a B.S. degree in electrical engineering from Seoul National University, Korea, in 1973, and a Ph.D. degree in computer engineering from the University of Michigan, Ann Arbor, USA, in 1983. From 1983 to 1993, he was a Senior Research Scientist and a Staff Research Scientist in the Computer Science Department at the General Motors Research Laboratories. Since 1993, he has been a professor in the Department of Computer Science at the Korea Advanced Institute of Science and Technology, Korea. He was in the program committees of major database conferences including ACM SIGMOD, VLDB, and IEEE ICDE. He was an associate editor of ACM TOIT, and is an associate editor of the WWW Journal. His current research interests include the semantic Web, mobile Web, social networks, sensor networks and stream data management, and multimedia databases.

Seok-Lyong Lee is a professor in the School of Industrial and Management Engineering at Hankuk University of Foreign Studies. He received his Ph.D. in Information and Communication Engineering from Korea Advanced Institute of Science and Technology (KAIST). He received a B.S. degree in Mechanical Engineering and an M.S. degree in Industrial Engineering from Yonsei University. He was an Advisory S/W Engineer at IBM Korea from 1984 to 1995. His research interests include multimedia databases, image analysis, data mining and warehousing, and information retrieval.

Deok-Hwan Kim is an associate professor in the School of Electronic Engineering at Inha University, Incheon, Korea. From March 1987 to Feb. 1997, he was with LG Electronics as a senior engineer. From Jan. 2004 to Feb. 2005, he was with the University of Arizona, Tucson, in a postdoctoral position to work on multimedia systems and embedded software. He received the B.S. degree in computer science and statistics from Seoul National University, Korea, in 1987 and the M.S. and Ph.D. degrees in computer engineering from Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 1995 and 2003, respectively. His research interests include embedded systems, intelligent storage systems, multimedia systems, and data mining. He is a member of the IEICE and IEEE Computer Society.
