
Title Page

Estimating the Performance of Multidimensional Access Methods based on Non-Overlapping Regions

IJIS #: 6021

A final manuscript accepted for publication in the INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS (IJIS)

Byunggu Yu1*, Thomas Bailey2, and Ratko Orlandic3

1,2Department of Computer Science

Dept. 3315 1000 E. University Avenue

University of Wyoming Laramie, WY 82071

Email: [email protected], Phone: (307) 766-2440, Fax: (307) 766-4036 Email: [email protected]

3Computer Science Department

University of Illinois at Springfield One University Plaza, MS UHB 3100

Springfield, IL 62703, U.S.A. Phone: (217) 206-6788 Fax: (217) 206-6217

Email: [email protected]

* The corresponding author

Acknowledgements: This research was supported in part by the National Science Foundation (NSF), grant IIS-0312266; and the NSF Wyoming EPSCoR, grant NSFLOC4304.

Jun 08, 2006


ABSTRACT Most cost models for the prediction of the performance of multidimensional access methods are based on Minkowski operations. However, this approach does not correspond to the space-partitioning strategy of multidimensional access methods that produce non-overlapping index regions, e.g. the KDB-tree and its variants. This paper proposes a new cost model for the prediction of the selection (search) performance of the related access methods, including its adaptation for non-uniform data. The results of an extensive set of experiments conducted on both simulated and real data show that the cost model and its extension provide a basis for a highly accurate analysis of these structures in both low- and high-dimensional situations.

1 INTRODUCTION Multidimensional access methods (MAMs) have been deployed in various systems, including information (document) retrieval systems, biomedical information systems, multimedia database systems, and data-warehousing systems. An increasing number of applications must deal with large sets of high-dimensional data, and these environments require high-performance multidimensional access methods. Cost models are required for run-time estimation of query performance and for query optimization. For example, the query processing modules of a database management system (DBMS) rely on the cost models of the underlying access methods to select the best possible access path for a given query. Thus, when a multidimensional access method is integrated into a DBMS, an accurate cost model for the prediction of its performance is also required. In multidimensional access methods, each data record (or object) is represented as a point or region in a multidimensional space. A multidimensional query is often specified as a rectangle q = [l, h], where l and h are the low and high endpoints of q, respectively. Observe that a multidimensional exact-match selection is a special case of range selection (i.e., l = h). To efficiently support multidimensional range selections, such as "select every data record whose corresponding point (or region) is enclosed by (or, respectively, intersects) the given region q", most multidimensional access methods build a tree-like index structure (i.e., a hierarchy of index pages) on each data set. Each index page is associated with a bounding region, called the index region, that encloses all data points (or regions) stored in the leaf-level pages of the subtree rooted at the index page. A multidimensional selection starts at the root page and propagates downward, possibly traversing multiple paths in the tree. 
Every index page whose index region intersects the given selection range q is accessed in the process.16,20,29,30 In database systems, the typical performance measure of an access path is the number of page accesses.16,17 Popular cost models for the performance prediction of multidimensional access methods are based on the probability of overlap (or not-disjoint) between two regions in the space. Given a selection predicate represented as a region in the space, the probability that an index region overlaps the selection region represents the probability that the index region is accessed.


However, most existing cost models4,7,11,13,16,18,19,26 are designed for multidimensional access methods that allow overlapping index regions, e.g., R-trees9 and R*-trees2. Since overlapping index regions degrade the search performance rapidly as the number of dimensions increases5,29,30, many multidimensional access methods, such as X-trees5, R+-trees25, LSD-trees10, Buddy-trees24, KD-trees23, QSF-trees17,30, KDBHD-trees16, the r2G object-transformation scheme29, KDBKD-trees28, and Γ techniques15, try to eliminate (or minimize) the region overlap and produce disjoint index regions. As our experimental results in this paper show, when applied to these methods, the conventional cost models incur significant errors that become progressively worse as data dimensionality increases. This paper proposes an accurate cost model for the prediction of the query (selection) performance for an important subclass of multidimensional access methods. Our targeted access methods are the above mentioned access methods that typically produce disjoint rectangular index regions. The rest of the paper is organized as follows. Section 2 reviews the targeted access methods. Section 3 presents the conventional cost models based on Minkowski operations. Section 4 presents our new cost model. Section 5 improves our cost model for non-uniformly distributed data. Finally, Section 6 concludes the paper. 2 TARGETED ACCESS METHODS As shown in Figure 1, a KDB-tree is a height-balanced hierarchy (tree) of index pages, each of which represents a portion of the space. At every level of the structure, the d-dimensional universe is recursively divided into non-intersecting hyper-rectangles (index regions) by means of (d-1)-dimensional hyper-planes, each of which is perpendicular to one axis (dimension). Each leaf-level page contains <P, Tp> pairs (called data entries), where P is a d-dimensional vector (point in the space), and Tp is a pointer to the corresponding data record. 
Each interior page contains <R, Cp> pairs (called index entries), where Cp is a pointer to a child page in the next level, and R is the hyper-rectangular space (index region) of the child. To insert a new point, the KDB-tree finds the leaf-level page whose region encloses the point and adds the data entry to the page. When the page overfills, it is split along a selected dimension, which is usually chosen in a circular fashion. That is, an overfilled leaf-level page is split along dimension (i % d) + 1, where i is the axis along which the space was split to create the overfilled page (initially, i=0). A leaf-level split can propagate up to the root. This is called upward split propagation. In more recent KDB-tree variants, including KDBFD-trees16, KDBHD-trees16, and KDBKD-trees28, each overfilled interior page is split along the dividing hyper-plane by which the index region R of the page was split for the first time. Since this hyper-plane divides the entire region R, no other subspace needs to be split. This eliminates possible downward split propagation and improves page utilization16.
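The circular choice of the split axis described above can be illustrated with a short Python sketch. The function names `next_split_axis` and `split_sequence` are hypothetical and are not taken from the KDB-tree literature; axes are numbered 1..d as in the text, and i = 0 denotes the initial, unsplit page.

```python
def next_split_axis(i, d):
    """Axis along which an overfilled leaf page is split.

    i is the axis whose split created the page (0 for the initial page);
    axes are numbered 1..d, as in the text.
    """
    return (i % d) + 1


def split_sequence(d, count):
    """Successive split axes along one chain of repeated leaf splits."""
    axes, i = [], 0
    for _ in range(count):
        i = next_split_axis(i, d)
        axes.append(i)
    return axes
```

In a 3-dimensional space, five successive splits of the same chain of pages would cycle through axes 1, 2, 3 and then start over at axis 1.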


To locate each data point r enclosed by the selection region q, the selection algorithm visits every index page whose region R has a non-empty intersection with the minimum bounding rectangle (MBR) q' of q. When the search reaches the leaves, every data entry whose point P is enclosed by q' is selected, and the entries that are not enclosed by q are discarded.16,30 In the rest of the paper, we assume a normalized d-dimensional universe U, which is the unit cube $[0, 1]^d$. The typical performance metric is the number of pages accessed during the search. In other words, the cost of a multidimensional selection is the total number of index pages that are not-disjoint with q'. Since the root page is always accessed, the number Q(d) of page accesses per selection in a KDB-tree is:

$$Q(d) = 1 + \sum_{k=1}^{h-1} Q(d, k) \qquad (1)$$

where Q(d, k) is the number of pages accessed at level k < h of the tree. Note that level k = 1 is the leaf level and level k = h is the root level. Q(d, k) is determined by the number N(d, k) of pages at level k and the probability that a page at this level is accessed. Given a selection MBR q', the probability that a page is accessed depends on the size of the index region of the page. Under the assumption that points are uniformly distributed, our analysis of KDB-trees has shown that, at any level k < h with N(d, k) pages, the KDB-tree division of the space produces only two types of index regions. Let $j = \lfloor \log_2(N(d,k)) / d \rfloor$ and $j' = \lfloor \log_2(N(d,k)) \rfloor - d \cdot j$. Observe that j ≥ 0 is the maximum integer for which $2^{d \cdot j} \le N(d,k)$ and that j' ≥ 0 is the maximum integer for which $2^{d \cdot j + j'} \le N(d,k)$. Then, at each level k < h, there are: (1) approximately $2(N(d,k) - 2^{d \cdot j + j'})$ index regions whose average extent is $1/2^{j+1}$ along the first j' + 1 dimensions and $1/2^{j}$ along the remaining dimensions; and (2) about $2^{d \cdot j + j' + 1} - N(d,k)$ index regions whose average extent is $1/2^{j+1}$ along the first j' dimensions and $1/2^{j}$ along the remaining dimensions. A formal proof of this can be found in the literature.16 Now, let P(R_i, q'_i) denote the probability that, along dimension i, an index region R, whose extent along axis i is R_i, is not-disjoint with the selection MBR q'. Then, the number Q(d, k) of page accesses at each level k < h of a KDB-tree can be estimated as follows:

$$
\begin{aligned}
Q(d,k) \approx{} & 2\left(N(d,k) - 2^{d \cdot j + j'}\right) \prod_{i=1}^{j'+1} P\!\left(1/2^{j+1}, q'_i\right) \prod_{i=j'+2}^{d} P\!\left(1/2^{j}, q'_i\right) \\
& + \left(2^{d \cdot j + j' + 1} - N(d,k)\right) \prod_{i=1}^{j'} P\!\left(1/2^{j+1}, q'_i\right) \prod_{i=j'+1}^{d} P\!\left(1/2^{j}, q'_i\right). \qquad (2)
\end{aligned}
$$
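The per-level estimate of Eq. (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, j and j' are taken as integer floors (consistent with their description as maximum integers), n_pages is assumed to be at least 1, and the simple probability of Eq. (3) from Section 3 serves only as a placeholder for P.

```python
import math


def region_mix(n_pages, d):
    """Split N(d, k) pages into the two region types described in the text.

    Returns (j, j_prime, n_small, n_large): n_small regions are halved along
    j'+1 dimensions and n_large regions along j' dimensions.
    """
    log_n = math.floor(math.log2(n_pages))
    j = log_n // d
    j_prime = log_n - d * j
    n_small = 2 * (n_pages - 2 ** (d * j + j_prime))
    n_large = 2 ** (d * j + j_prime + 1) - n_pages
    return j, j_prime, n_small, n_large


def p_eq3(r_len, q_len):
    """Placeholder per-axis probability (Eq. (3) of Section 3)."""
    return min(r_len + q_len, 1.0)


def level_accesses(n_pages, q_lengths, prob=p_eq3):
    """Eq. (2): expected page accesses at one level holding n_pages pages."""
    d = len(q_lengths)
    j, j_prime, n_small, n_large = region_mix(n_pages, d)
    half, full = 1 / 2 ** (j + 1), 1 / 2 ** j

    def hit_probability(n_halved):
        # The first n_halved axes have extent 1/2^(j+1); the rest have 1/2^j.
        return math.prod(
            prob(half if i < n_halved else full, q_lengths[i]) for i in range(d)
        )

    return n_small * hit_probability(j_prime + 1) + n_large * hit_probability(j_prime)
```

With this placeholder probability, a point query (all query lengths zero) yields one expected page access per level, which is what disjoint regions covering the whole universe should give.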

Substituting Eq. (2) in Eq. (1), one arrives at the following theorem: Theorem 1. Given a multi-dimensional range selection with a selection MBR q', the expected number of page accesses in a KDB-tree is:

$$
Q(d) \approx 1 + \sum_{k=1}^{h-1} \left[ 2\left(N(d,k) - 2^{d \cdot j + j'}\right) \prod_{i=1}^{j'+1} P\!\left(1/2^{j+1}, q'_i\right) \prod_{i=j'+2}^{d} P\!\left(1/2^{j}, q'_i\right) + \left(2^{d \cdot j + j' + 1} - N(d,k)\right) \prod_{i=1}^{j'} P\!\left(1/2^{j+1}, q'_i\right) \prod_{i=j'+1}^{d} P\!\left(1/2^{j}, q'_i\right) \right],
$$

where $j = \lfloor \log_2(N(d,k)) / d \rfloor$ and $j' = \lfloor \log_2(N(d,k)) \rfloor - d \cdot j < d$. The probability function P in the theorem is discussed in Sections 3 (conventional approach) and 4 (new approach proposed in this paper). The only remaining task is to estimate N(d, k). Let C_I(d) and C_L(d) be the page capacity (the maximum number of entries in a single page) of the interior pages and of the leaf-level pages, respectively, of a d-dimensional index tree. In KDBFD-trees, for a large number of uniformly distributed objects, the average page utilization produced by the splitting algorithm is approximately ln 2 (about 0.69).16 Therefore, for all k < h, $N(d,k) \approx (n / (\ln 2 \cdot C_L(d))) / (\ln 2 \cdot C_I(d))^{k-1}$, where n is the total number of data points. In KDBFD-trees, each region R is represented by two d-dimensional endpoints. The KDBKD-tree is the same as the KDBFD-tree except for the index entries. Assuming that a KDBKD-tree interior page contains Nc' index entries, it will maintain only Nc' – 1 coordinate values and Nc' child pointers. Thus, given the same data set, the page capacity of the KDBKD-tree is greater than that of the equivalent KDBFD-tree, and the difference grows as dimensionality increases.28 The KDBHD-tree is identical to the KDBFD-tree except that the KDBHD-tree applies a compaction method to its interior pages, which is different from that of the KDBKD-tree. While KDB-tree variants are designed for point data, the QSF-tree and its variants17,29,30 are designed for regional data (multidimensional regions). In QSF-trees, each region is approximated by its MBR represented by two endpoints. Any KDB-tree variant can be used to index the low endpoints of the MBRs, and the corresponding high endpoints are stored only in the data entries. Given a selection MBR q', the search algorithm computes the L-region and the H-region30, which represent, respectively, the MBR of the low endpoints and the MBR of the high endpoints of all data MBRs that can possibly intersect q'. 
The selection algorithm searches the tree with the L-region; the H-region is consulted when the search reaches the leaves to dismiss false entries. Therefore, for QSF-trees, we replace q' in Theorem 1 with the corresponding L-region. Since the data entry size is different due to an additional high endpoint, the leaf-level page capacity is also different. In a similar manner, Theorem 1 can be modified for object transformation schemes29 employing a KDB-tree variant. In addition, this model can be easily applied to many other multidimensional access methods, such as X-trees5, R+-trees25, LSD-trees10, Buddy-trees24, KD-trees23, the r2G object-transformation scheme29, and Γ techniques15, which produce disjoint index regions at each level of the index structure.
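The estimate of N(d, k) given in this section can be sketched as follows. The sketch makes several assumptions that are not stated in the paper: page headers are ignored when deriving capacities, each entry carries exactly one pointer, and the root is taken to be the first level whose estimated page count drops to one or less. All function names are hypothetical.

```python
import math

LN2 = math.log(2)  # average page utilization quoted in the text


def pages_at_level(n, k, cap_leaf, cap_interior):
    """N(d, k) ~ (n / (ln2 * C_L)) / (ln2 * C_I)^(k-1); k = 1 is the leaf level."""
    return (n / (LN2 * cap_leaf)) / (LN2 * cap_interior) ** (k - 1)


def estimated_height(n, cap_leaf, cap_interior):
    """Smallest k with N(d, k) <= 1 (an assumption used here to locate the root)."""
    if LN2 * cap_interior <= 1:
        raise ValueError("interior capacity too small for the estimate to converge")
    k = 1
    while pages_at_level(n, k, cap_leaf, cap_interior) > 1:
        k += 1
    return k


def kdb_fd_capacities(d, page_bytes=2048, value_bytes=4):
    """Hypothetical capacities: headers ignored, one pointer per entry."""
    cap_leaf = page_bytes // ((d + 1) * value_bytes)          # point + pointer
    cap_interior = page_bytes // ((2 * d + 1) * value_bytes)  # two endpoints + pointer
    return cap_leaf, cap_interior
```

For example, `estimated_height(131072, *kdb_fd_capacities(2))` evaluates the height estimate for a 2-dimensional tree with 2K-byte pages and 4-byte values under these simplifying assumptions.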


3 NOT-DISJOINT PROBABILITY We define two functions l(r') and h(r') that give, respectively, the low endpoint and the high endpoint of a rectangle r'. In addition, let length_i(r') = h_i(r') – l_i(r') be the extent of a rectangle r' along axis i. Given two d-dimensional geometric figures r and q, the Minkowski sum of r and q, denoted by r + q, is the region defined by the union of every possible q whose "anchor" (any point in the interior or on the boundary of q) is covered by r. Note that, once the anchor of q is set, it never changes, and that q + r produces the same region. Using this operation, one can compute a region called the configuration space (or C-space). As shown in Figure 2, the C-space of two regions r and q is the region of anchor points of every possible q that is not-disjoint with r (i.e., overlaps or meets r). The C-space of a pair <r, q> is defined as r – q (note that r – q = r + (–q), where + and – are the Minkowski sum and the Minkowski difference, respectively). The object q meets (or overlaps) r if and only if the anchor of q is on the boundary (or, respectively, in the interior) of the C-space. Figure 3 illustrates the relatively simple computation of the C-space for a pair of hyper-rectangles R (index region) and q' (selection MBR). A d-dimensional rectangle q' is not-disjoint with a d-dimensional rectangle R (in the rest of the paper, the anchor of a hyper-rectangle is the low endpoint of the rectangle) iff l(q') is in the interior or on the boundary of the C-space. The not-disjoint probability P(R_i, q'_i) in Section 2 is, therefore, the probability that the low endpoint l_i(q') is located in the interior or on the boundary of the C-space [l_i(R) – length_i(q'), h_i(R)] (in this paper, we denote this C-space as C(R_i, q'_i)). Since length_i(C(R, q')) is length_i(R) + length_i(q'), Eq. (3) below gives a popular estimate of P(R_i, q'_i) that is used in the analysis of the R-tree and its variants.7,11,13 Note that two hyper-rectangles R and q' satisfy not-disjoint(R, q') iff they satisfy not-disjoint(R_i, q'_i) for every axis i.

$$P(R_i, q'_i) = \min\{length_i(R) + length_i(q'),\ 1\}. \qquad (3)$$

Depending on the size of q', the low endpoint of C(R_i, q'_i) along an axis i can be smaller than the low endpoint of the universe l_i(U) (i.e., 0). In addition, because h_i(q') is always smaller than or equal to the high endpoint of the universe h_i(U) (i.e., 1), l_i(q') cannot appear in the interior or on the boundary of the range (h_i(U) – length_i(q'), h_i(U)]. Therefore, Eq. (3) can be modified as Eq. (4) (a similar investigation was presented in other R-tree analysis papers13).

$$P(R_i, q'_i) = \max\left\{ \frac{\min\{l_i(R) + R_i,\ 1 - q'_i\} - \max\{l_i(R) - q'_i,\ 0\}}{1 - q'_i},\ 0 \right\}, \qquad (4)$$

where R_i = length_i(R) and q'_i = length_i(q'). Eq. (4) was used in the performance analysis of point access methods (i.e., the KDB-tree and its variants).4 However, this equation assumes that the low and high endpoints of every index region R are known. As noted in Section 2, accurately estimating the low and high endpoints of the individual index regions is not feasible. Alternatively, one can swap R and q' in Eq. (4) as shown in Eq. (5).

$$P(R_i, q'_i) = \max\left\{ \frac{\min\{l_i(q') + q'_i,\ 1 - R_i\} - \max\{l_i(q') - R_i,\ 0\}}{1 - R_i},\ 0 \right\}. \qquad (5)$$
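The three Minkowski-based per-axis probabilities, Eq. (3), Eq. (4), and Eq. (5), can be compared side by side with the following Python sketch. The function and parameter names are hypothetical, and the guard for a query or region that spans the whole axis (where the denominator would be zero) is an added assumption that the equations do not address.

```python
def p_eq3(r_len, q_len):
    """Eq. (3): length of the C-space, capped at the size of the universe."""
    return min(r_len + q_len, 1.0)


def p_eq4(r_low, r_len, q_len):
    """Eq. (4): C-space of a known region R, clipped to valid anchors of q'."""
    if q_len >= 1.0:
        return 1.0  # added guard: q' spans the axis, so it meets every region
    span = min(r_low + r_len, 1.0 - q_len) - max(r_low - q_len, 0.0)
    return max(span / (1.0 - q_len), 0.0)


def p_eq5(q_low, q_len, r_len):
    """Eq. (5): Eq. (4) with the roles of R and q' swapped."""
    if r_len >= 1.0:
        return 1.0  # added guard: R spans the axis, so it meets every query
    span = min(q_low + q_len, 1.0 - r_len) - max(q_low - r_len, 0.0)
    return max(span / (1.0 - r_len), 0.0)
```

Note that Eq. (4) needs the position of the index region (`r_low`), whereas Eq. (5) needs only the position of the query (`q_low`) and the average region extent.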

Eq. (5) is very useful, especially for MAMs, such as the R-tree and its variants, whose index regions are allowed to overlap.16 This is because Eq. (5) does not require the exact coordinates of the low and high endpoints of individual index regions. Although Minkowski operations are not specifically mentioned, almost all cost models for the prediction of the performance of multidimensional access methods are based on either Eq. (3), or Eq. (4), or Eq. (5).4,7,11,13,16,18,19,26 Some additional techniques, such as fractal dimensions7 and buffer model13, are used to accommodate real (i.e., non-uniformly distributed) data or to take into account the existence of a buffer. Although these models can accurately estimate the performance of region-overlapping access methods, such as the R-tree and some of its variants, the basic assumption of random index regions that can overlap does not hold for a large class of important MAMs, including the access methods discussed in Section 2. 4 EQUALLY SPACED DIVIDING HYPER-PLANES Eq. (5) assumes unconstrained index regions, i.e. that the endpoints of index regions can appear in the interior or on the boundary of another index region. However, when the targeted MAM is based on a multilevel (tree-like) index structure in which index regions at the same level do not overlap, Eq. (5) can lead to a greater error. For example, in Figure 4a, which shows 16 disjoint subregions that cover the entire universe, Eq. (5) would give the probability of 0.64 that an index region is not-disjoint with the selection MBR (shaded region) along each axis i. However, one can easily see that at most two index regions can overlap or meet the shaded region along each axis, which means that the probability must be smaller than 0.5 along each dimension. The error in the estimate of Eq. 
(5) originates in the assumption of overlapping index regions (i.e., that the low endpoint of an index region can appear in the interior or on the boundary of another index region; or, alternatively, that any position of the low endpoint of an index region is equally likely). This assumption is less harmful when there is a virtually infinite number of infinitesimally small index regions or when the query region overlaps almost all index regions along each dimension. However, since two hyper-rectangles intersect iff they intersect along every axis i, for the same data-set size, the error incurred by the assumption of unconstrained index regions is much more pronounced in higher-dimensional spaces. Thus, for MAMs (including those mentioned in Section 2) that produce non-overlapping index regions at each level of the index tree, one needs a more accurate estimate of the probability that two hyper-rectangles are not-disjoint. For MAMs producing non-overlapping index regions, assuming equally spaced dividing hyper-planes, one can compute the expected probability of not-disjoint between a given selection MBR q' and an arbitrary index region R at an index level k by counting the k-level dividing hyper-planes that intersect q'. Let us assume that the coordinate of the projection of a dividing hyper-plane on axis i represents the low endpoint of the right-hand-side index region R along dimension i. Furthermore, let F_i(R, q') be the coordinate of the first dividing hyper-plane along axis i for which F_i(R, q') > l_i(q') (i.e., the first dividing plane that lies beyond the left boundary of q' on axis i). Assuming that the average size of each R along i is approximately R'_i, $F_i(R, q') = (\lfloor l_i(q')/R'_i \rfloor + 1) \times R'_i$. Now, let I_i(R, q') denote the number of index regions that intersect q' along axis i. One can observe that I_i(R, q') is $\lfloor (h_i(q') - F_i(R, q'))/R'_i \rfloor + 2$ if h_i(q') ≥ F_i(R, q'). However, if h_i(q') < F_i(R, q'), there is no dividing hyper-plane crossing the interior of q' and, thus, I_i(R, q') = 1. Now, the probability of not-disjoint along this dimension is I_i(R, q') divided by the number (1/R'_i) of index regions along axis i. That is,

$$P(R_i, q'_i) \approx R'_i \times I_i(R, q'), \qquad (6)$$

where

$$I_i(R, q') = \begin{cases} \left\lfloor (h_i(q') - F_i(R, q')) / R'_i \right\rfloor + 2, & \text{if } h_i(q') \ge F_i(R, q'), \\ 1, & \text{otherwise,} \end{cases}$$

where

$$F_i(R, q') = R'_i \times \left( \left\lfloor l_i(q') / R'_i \right\rfloor + 1 \right).$$

Figure 4b shows an example of this new estimate. In order to verify the probability equations, we performed a set of experiments with KDBFD-trees and various analytical models. As described in Section 2, unlike the original KDB-trees22, KDBFD-trees use first-division splitting of the interior pages.16 All values (coordinate values and pointer values) stored in the KDB-trees were 4 bytes long. The number of dimensions was varied between 2 and 30. The page size was fixed at 2K bytes. For each d-dimensional space, the experiments were conducted on the same file of exactly 131,072 ($2^{17}$) randomly generated points. The data objects (records) of each file were inserted into the KDBFD-tree. The retrieval performance of KDBFD-trees was measured over 2,000 spatial selections with the same set of random selection MBRs. Each side of a selection MBR was obtained as a pair of random numbers between 0 and 1 without any size restriction. The results of the experiment given in Figure 5 show that our new model E3 (i.e., Eq. (6)) increases the accuracy dramatically. The error rate of E3 rarely reached 5%. The estimates E1, E2, and E3 were computed using Theorem 1 with the not-disjoint probability based on Eq. (3), Eq. (5), and Eq. (6), respectively. Estimating the processing cost of multidimensional selections on a non-uniformly distributed data set is discussed in Section 5. Even though Eq. (5) represents selection MBRs more accurately than Eq. (3), the estimate E2 showed worse results than the estimate E1. As we have seen, both Eq. (3) and Eq. (5) use the assumption of overlapping index regions, which leads to an error for the disjoint index regions of the KDB-tree variants. However, by assuming that every position of the low endpoint of any given selection predicate q' is equally likely, which expands the event space of the low endpoints, Eq. (3) leads to an additional error, which has a tendency to diminish the first error. By carefully inspecting Eq. (5), one can see that the expanded event space always results in a smaller probability of not-disjoint. Still, Eq. (5) gives a much better estimate for overlapping-region schemes, such as the R-tree and its variants. 5 ENHANCEMENTS As presented in Section 4, our new probability model of Eq. (6) increases the accuracy of the cost model. However, the assumption of a uniform data distribution is still embodied in the model. Practical data sets tend to have highly skewed distributions. For a skewed data set, the assumption of uniformity may result in a substantial error in the selection-cost estimate. This section extends our cost model to accommodate non-uniform data distributions. Two approaches to representing multidimensional data distributions have been investigated: one is the fractal dimension approach7; the other is the histogram-based approach1 for selectivity estimation. In this paper, we present our investigation based on histograms. Multidimensional histograms try to group multidimensional data into buckets in such a way that data objects in each bucket are more uniformly distributed. 
One of the multidimensional histogram techniques considered in the literature uses the root page of a multidimensional tree-like access method as a histogram representing the distribution of the indexed data objects.1 This technique is often less accurate than other approaches with a separate histogram structure due to several reasons: (1) index structures are not designed to be used as a histogram – their index regions do not necessarily contain more uniformly distributed data objects; (2) the number of index regions represented in the root widely varies – as a result, the accuracy can fluctuate widely. Although more efficient histograms have been developed in recent years, for the purposes of our study of cost models for the prediction of the selection performance of MAMs, this technique of using the top few index levels that are pinned in the buffer as a histogram is more practical than other histogram techniques. This is because: (1) there is no separate histogram that needs to be created and maintained; (2) generally, at least a few uppermost levels of the index structures that are currently in use can be always found in main memory (pinning index pages in the buffer13); and, most importantly, (3) the histogram buckets are, in fact, upper level index regions of the underlying index structure; therefore, each bucket represents the actual region of the corresponding subtree of the index structure. For these reasons, we have extended the proposed cost model using index regions as histogram buckets.


To further enhance the model, we also modified Eq. (6), which assumes a constant uniform distribution (i.e., the data distribution remains uniform over time). This leads to the assumption that a split always occurs at the center of the given local range of an index region (i.e., the assumption of equally spaced dividing hyper-planes). However, with less uniformly distributed data, the split coordinates tend to be distributed more widely. According to our extensive simulations, when the page capacity is reasonably large (e.g., greater than eight entries), the local positions of the KDB-tree split coordinates, which divide a set of randomly generated points (that need not be constantly uniform over time) into two equal-size subsets, typically resemble a bounded normal distribution along each dimension within the local index-page range. This can be represented by the beta distribution21 with the two adjusting parameters α and β satisfying the condition α = β > 1. Figure 6 shows this typical probability density function of a split within the range of an index page. The curve in Figure 7a shows a cumulative example of such distributions. Based on this finding (Figure 6), we defined two extreme cases that bound the space in which the actual distribution of splits is likely confined. In Figure 7a, the curve is confined to the shaded region delimited by the two extreme cases (Figures 7b and 7c). Figure 7b represents the case in which a split always occurs at the center of the given range (i.e., α = β → ∞); Figure 7c is the case in which all split coordinates are equally likely (i.e., α = β = 1). The following equation, which is a weighted sum of the two extreme cases, gives a good estimate of the probability that two hyper-rectangles are not-disjoint for arbitrary data distributions:

$$P(R_i, q'_i) \approx W_1 \cdot \text{Eq.}(6) + W_2 \cdot \left( 1 + q'_i \cdot \left( \frac{1}{R'_i} - 1 \right) \right) \cdot R'_i, \qquad (7)$$

where 0 < W1 < 1, 0 < W2 < 1, and W1 + W2 = 1. Observe that, for each individual split within a local index-region interval, Eq. (6) represents the case of Figure 7b, which results in equally spaced dividing hyper-planes, and the other term represents the case of Figure 7c. In the latter case, given a query interval q'_i, the expected number of splits that fall within the query interval is simply the query interval size multiplied by the number of random split points (i.e., $q'_i \cdot (1/R'_i - 1)$). In this section, we approximate the actual split coordinates by the arithmetic average of these two cases (i.e., W1 = W2 = 0.5). Now, given a selection MBR q', we can estimate the number of index page accesses with Algorithm 1, which makes use of the histogram technique and Eq. (7). Algorithm 1. Estimating selection costs over non-uniform data distributions

(1) Set Q_M(d) to 0;
(2) Find the set S of all index regions that are not-disjoint with q' from an upper level M of the d-dimensional index structure;
(3) For each element S_i of S, do:
    (3.1) Compute the overlapping region O of q' and S_i;
    (3.2) Compute the expected number n_i of data objects in S_i as follows: n_i = n/N_a(d, M), where n is the number of data objects and N_a(d, M) is the actual number of index pages constituting level M of the index tree;
    (3.3) Compute Q(d) (Theorem 1 with Eq. (7)) with n = n_i, h = M–1, and q' = O;
    (3.4) Q_M(d) = Q_M(d) + Q(d);
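The per-axis estimates of Eq. (6) and Eq. (7) and the control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the floor operations follow the description of F_i as the first dividing hyper-plane beyond l_i(q'), the clamp to 1 in `p_eq6` is an added safeguard, boxes are represented as (low, high) pairs of coordinate tuples, and `subtree_cost` is a hypothetical callable standing in for Theorem 1 evaluated with Eq. (7).

```python
import math


def p_eq6(q_low, q_high, r_avg):
    """Eq. (6): equally spaced dividing hyper-planes along one axis."""
    first = (math.floor(q_low / r_avg) + 1) * r_avg  # F_i(R, q')
    if q_high >= first:
        hit = math.floor((q_high - first) / r_avg) + 2  # I_i(R, q')
    else:
        hit = 1
    return min(1.0, r_avg * hit)  # clamp is an added safeguard, not in Eq. (6)


def p_eq7(q_low, q_high, r_avg, w1=0.5, w2=0.5):
    """Eq. (7): weighted sum of the centred-split and random-split cases."""
    random_split = (1 + (q_high - q_low) * (1 / r_avg - 1)) * r_avg
    return w1 * p_eq6(q_low, q_high, r_avg) + w2 * random_split


def overlap(a, b):
    """Intersection of two boxes, or None if they are disjoint.

    Boxes that merely touch count as not-disjoint, as in the text.
    """
    low = tuple(max(x, y) for x, y in zip(a[0], b[0]))
    high = tuple(min(x, y) for x, y in zip(a[1], b[1]))
    return (low, high) if all(l <= h for l, h in zip(low, high)) else None


def algorithm1(query, level_m_regions, n_objects, m, subtree_cost):
    """Algorithm 1 with a caller-supplied cost function for each subtree."""
    n_i = n_objects / len(level_m_regions)      # step 3.2
    total = 0.0                                 # step 1
    for region in level_m_regions:              # steps 2 and 3
        o = overlap(query, region)              # step 3.1
        if o is not None:
            total += subtree_cost(n_i, m - 1, o)  # steps 3.3 and 3.4
    return total
```

Passing the cost function as a parameter keeps the sketch short; in a complete implementation it would evaluate Theorem 1 with n = n_i, h = M–1, and q' = O, using `p_eq7` as the probability function P.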

If only the root page is used as the histogram, then M = h. If the child pages of the root are available in memory, M should be set to h–1. In the experiments of Sections 5.1 and 5.2, M was set to h–1 (i.e., the child pages of the root were pinned in the buffer). The experiments were conducted over both synthetic and real data sets of different sizes, with varying data dimensionality. 5.1 Synthetic data In the experiments with synthetic data, the number of dimensions was varied between 2 and 30. The page size was fixed at 2K bytes. For each d-dimensional space, the experiments were conducted over the same file of 131,072 ($2^{17}$) randomly generated points concentrated mainly in ten different clusters. Each cluster was randomly located in the universe and had an extent that varied between 0.05 and 0.3 along each dimension. The clusters were populated using 101,072 random data objects. Then 30,000 points were randomly scattered through the universe. Figure 8 shows a reduced set of $2^{14}$ 2-dimensional points produced by our synthetic data generator. The data objects of each file were inserted into a KDBFD-tree. The selection performance of the KDBFD-tree was measured after inserting all 131,072 objects. The expected performance of spatial selections was estimated with the same set of 2,000 randomly generated selection MBRs used in the earlier experiments. In the experiment, each side of a selection MBR was obtained as a pair of random numbers between 0 and 1 without any size restriction. Figures 9 and 10 show the results. The estimates E1, E2, E3, and E4 were computed using Eq. (3), Eq. (5), Eq. (6), and Eq. (7), respectively. Figure 9a shows the estimates based on Theorem 1, and Figure 9b shows the estimates based on Algorithm 1. By comparing Figures 9a and 9b, one can see that Algorithm 1 improved the accuracy of every estimate. 
However, the improvements of Algorithm 1 were affected by the number of available histogram buckets (i.e., the number of index regions at level M). In particular, Figure 10 shows that the number of histogram buckets drops at dimensionalities 5, 12, 21, and 28. By closely inspecting Figure 9b, one can see that the deviations between the estimates and the real performance increase for the same dimensionalities. In addition, by comparing E3 and E4 in Figure 9b, one can see that Eq. (7) is better than Eq. (6). In all cases, our new model was the best performer, showing the highest accuracy. 5.2 Real data Our real data set represented a database table of 1,028,872 records. This data table was obtained from a database of a telecommunication company. The original table has 19 attributes of different types. However, by breaking the values of certain attributes and applying simple transformations, we obtained an array of 25-dimensional points with 4-byte unsigned long coordinates (nulls were replaced by a value of zero). Using an order-preserving domain transformation, the values of each attribute were normalized to a range of floating-point numbers between 0 and 1. Then, by multiplying each attribute value by $2^{32}$ and rounding the result, we obtained the data points in a universe with 25 dimensions, each of which was 4 bytes long. Analyzing this set, we found that exactly 303,278 objects had no null values and that each of the remaining data objects (i.e., 725,594 records) had about 2.18 nulls on average. The objects having one or more nulls appear on the boundary of the universe. That is, over 70% of the data points were located on the boundary of the universe. For the experiments conducted over this high-dimensional real data, the page size of the KDB-trees was set to 4K bytes. The performance of KDBFD-trees was measured after inserting 8192, 16384, 32768, 65536, 131072, 262144, 524288, and 1028872 objects. For each tree size, the expected performance of spatial selections was estimated with the same set of selection MBRs. For each measure, 10,000 selections were generated with random selection MBRs. In the experiment, each side of a selection MBR was obtained as a pair of random numbers between 0 and 1 without any size restriction. The experimental results shown in Figures 11 and 12 confirm that our new model E4 is the best, that the impact of Algorithm 1 is determined by the number of index regions at level M (see Section 5.1), and that the modification of Eq. (7) is effective. 
6 SUMMARY AND DISCUSSION
Most cost models for the performance prediction of multidimensional access methods (MAMs) are based on the probability that two multidimensional rectangles overlap (i.e., are not disjoint) and were designed for overlapping-region schemes. These models usually rely on Minkowski operations, which provide good estimates when the index regions are uniformly distributed in the space. However, for performance reasons, many MAMs produce non-overlapping index regions, which are not uniformly distributed in the space. In these MAMs, conventional probability models may incur significant errors that tend to increase as data dimensionality grows.
In this study, we developed an effective, yet simple, cost model for an important group of MAMs that rely on non-overlapping rectangular index regions. As we have seen in Section 4, our new probability model can significantly increase the accuracy of performance predictions. However, the assumption of a uniform data distribution is still embodied in the model. In Section 5, we presented an extended cost model for non-uniformly distributed data sets. Even for highly skewed data, the extended model can provide a good estimate of the selection cost. The proposed model is both easy to implement and quick to solve, thus providing a useful methodology for further studies.
The new estimates for the performance prediction of an important class of MAMs are the main contributions of this paper to the area of multidimensional access methods and multidimensional
databases in general. They provide the basis for a highly accurate analysis of these MAMs even in high-dimensional situations. This research also shows that there are no universally acceptable estimates of the probability of overlap: To enable tractable analysis, each estimate makes certain assumptions not only about the data set, but also about the strategy used to partition the space. In the analysis of a spatial structure, it is important to use the estimate whose assumptions reflect the space-partitioning policy of the indexing structure.
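The mismatch between Minkowski-based estimates and non-overlapping partitions can be illustrated with a small sketch. The code below uses the commonly cited Minkowski-sum form, in which the probability that a query with side lengths q_i intersects a region with side lengths r_i in the unit space is taken as the product of min(1, r_i + q_i); this form may differ in detail from the equations used in this paper. The four-quadrant partition, the query size, and the assumption that each query is placed uniformly so that it lies entirely inside the universe are hypothetical choices made only for this illustration.

```python
import random
from math import prod


def minkowski_overlap_probability(region_sides, query_sides):
    """Conventional estimate: product over dimensions of min(1, r_i + q_i)."""
    return prod(min(1.0, r + q) for r, q in zip(region_sides, query_sides))


def estimated_page_accesses(regions, query_sides):
    """Sum the Minkowski-based probability over all (low, high) regions."""
    total = 0.0
    for low, high in regions:
        sides = [h - l for l, h in zip(low, high)]
        total += minkowski_overlap_probability(sides, query_sides)
    return total


def simulated_page_accesses(regions, query_sides, trials, rng):
    """Average number of regions hit by queries placed fully inside [0, 1]^d."""
    hits = 0
    for _ in range(trials):
        # Assumption: the low corner is uniform over positions that keep
        # the whole query inside the unit universe.
        q_low = [rng.uniform(0.0, 1.0 - q) for q in query_sides]
        q_high = [lo + q for lo, q in zip(q_low, query_sides)]
        for low, high in regions:
            if all(ql <= h and qh >= l
                   for ql, qh, l, h in zip(q_low, q_high, low, high)):
                hits += 1
    return hits / trials


# Hypothetical example: the unit square split into four non-overlapping quadrants.
quadrants = [
    ([0.0, 0.0], [0.5, 0.5]),
    ([0.5, 0.0], [1.0, 0.5]),
    ([0.0, 0.5], [0.5, 1.0]),
    ([0.5, 0.5], [1.0, 1.0]),
]
query = [0.25, 0.25]

estimate = estimated_page_accesses(quadrants, query)
observed = simulated_page_accesses(quadrants, query, 20000, random.Random(1))
print(f"Minkowski estimate: {estimate:.3f}, simulated: {observed:.3f}")
```

Under these assumptions the Minkowski-based estimate is 4 × 0.75^2 = 2.25 region accesses, whereas the expected number is (4/3)^2 ≈ 1.78, because a query confined to the universe crosses the split in each dimension with probability 1/3. Generalizing the same toy partition to d dimensions gives 1.5^d against (4/3)^d, so the relative overestimate grows with dimensionality, in line with the behavior noted above.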

REFERENCES
1. Acharya S, Poosala V, Ramaswamy S. Selectivity estimation in spatial databases. In: Proceedings of ACM SIGMOD International Conference on Management of Data. Philadelphia, Pennsylvania, June 1-3. New York: ACM Press; 1999. pp 13-24.
2. Beckmann N, Kriegel H, Schneider R, Seeger B. The R*-tree: an efficient and robust access method for points and rectangles. In: Proceedings of ACM SIGMOD International Conference on Management of Data. Atlantic City, NJ, May 23-25. New York: ACM Press; 1990. pp 322-331.
3. Berchtold S, Bohm C, Keim D, Kriegel HP. A cost model for nearest neighbor search in high-dimensional data space. In: Proceedings of ACM PODS Symposium on Principles of Database Systems. Tucson, Arizona, May 12-14. New York: ACM Press; 1997. pp 78-86.
4. Berchtold S, Bohm C, Kriegel HP. The Pyramid-technique: towards breaking the curse of dimensionality. In: Proceedings of ACM SIGMOD International Conference on Management of Data. Seattle, Washington, June 2-4. New York: ACM Press; 1998. pp 142-153.
5. Berchtold S, Keim D, Kriegel HP. The X-tree: an index structure for high-dimensional data. In: Proceedings of VLDB International Conference on Very Large Data Bases. Bombay, India, Sept. 3-6. San Francisco: Morgan Kaufmann; 1996. pp 28-39.
6. Beyer KS, Goldstein J, Ramakrishnan R, Shaft U. When is "nearest neighbor" meaningful? In: Proceedings of IEEE ICDT International Conference on Database Theory. Jerusalem, Israel, Jan. 10-12. Los Alamitos: IEEE Computer Society; 1999. pp 217-235.
7. Faloutsos C, Kamel I. Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension. In: Proceedings of ACM PODS Symposium on Principles of Database Systems. Minneapolis, Minnesota, May 24-26. New York: ACM Press; 1994. pp 4-13.
8. Gunther O, Bilmes J. Tree-based access methods for spatial databases: implementation and performance evaluation. IEEE Transactions on Knowledge and Data Engineering 1991;3(3):342-356.
9. Guttman A. R-trees: a dynamic index structure for spatial searching. In: Proceedings of ACM SIGMOD International Conference on Management of Data. Boston, Massachusetts, June 18-21. New York: ACM Press; 1984. pp 47-54.
10. Henrich A, Six HW, Widmayer P. The LSD-tree: spatial access to multidimensional point and non-point objects. In: Proceedings of VLDB International Conference on Very Large Data Bases. Amsterdam, The Netherlands, Aug. 22-25. San Francisco: Morgan Kaufmann; 1989. pp 45-53.
11. Kamel I, Faloutsos C. On packing R-trees. In: Proceedings of ACM CIKM International Conference on Information and Knowledge Management. Washington, DC, Nov. 1-5. New York: ACM Press; 1993. pp 490-499.
12. Kamel I, Faloutsos C. Hilbert R-tree: an improved R-tree using fractals. In: Proceedings of VLDB International Conference on Very Large Data Bases. Santiago de Chile, Chile, Sept. 12-15. San Francisco: Morgan Kaufmann; 1994. pp 500-509.
13. Leutenegger ST, López MA. The effect of buffering on the performance of R-Trees. IEEE Transactions on Knowledge and Data Engineering 2000;12(1):33-44.
14. Lin K, Jagadish H, Faloutsos C. The TV-tree: an index structure for high-dimensional data. VLDB Journal 1995;3:517-542.
15. Orlandic R, Lukaszuk J, Swietlik C. The design of a retrieval technique for high-dimensional data on tertiary storage. SIGMOD Record 2002;31(2):15-21.
16. Orlandic R, Yu B. A retrieval technique for high-dimensional data and partially specified queries. Data & Knowledge Engineering 2002;42(2):1-21.
17. Orlandic R, Yu B. Scalable QSF-Trees: retrieving regional objects in high-dimensional spaces. Journal of Database Management 2004;15(3):45-59.
18. Pagel BU, Six HW, Toben H, Widmayer P. Towards an analysis of range query performance in spatial data structures. In: Proceedings of ACM PODS Symposium on Principles of Database Systems. Washington, DC, May 25-28. New York: ACM Press; 1993. pp 214-221.
19. Pagel BU, Six HW, Winter M. Window query-optimal clustering of spatial objects. In: Proceedings of ACM PODS Symposium on Principles of Database Systems. San Jose, California, May 22-25. New York: ACM Press; 1995. pp 86-94.
20. Papadias D, Theodoridis Y, Sellis T, Egenhofer MJ. Topological relations in the world of minimum bounding rectangles: a study with R-trees. In: Proceedings of ACM SIGMOD International Conference on Management of Data. San Jose, California, May 22-25. New York: ACM Press; 1995. pp 92-103.
21. Pritsker AAB. Introduction to simulation and SLAM II. Third Edition. New York: John Wiley & Sons; 1986.
22. Robinson JT. The K-D-B tree: a search structure for large multidimensional dynamic indexes. In: Proceedings of ACM SIGMOD International Conference on Management of Data. Ann Arbor, Michigan, April 29-May 1. New York: ACM Press; 1981. pp 10-18.
23. Samet H. Applications of spatial data structures. Reading: Addison Wesley; 1995.
24. Seeger B, Kriegel HP. The Buddy-tree: an efficient and robust access method for spatial data base systems. In: Proceedings of VLDB International Conference on Very Large Data Bases. Brisbane, Queensland, Australia, Aug. 13-16. San Francisco: Morgan Kaufmann; 1990. pp 590-601.
25. Sellis T, Roussopoulos N, Faloutsos C. The R+-tree: a dynamic index for multidimensional objects. In: Proceedings of VLDB International Conference on Very Large Data Bases. Brighton, England, Sept. 1-4. San Francisco: Morgan Kaufmann; 1987. pp 507-518.
26. Theodoridis Y, Sellis T. A model for the prediction of R-tree performance. In: Proceedings of ACM PODS Symposium on Principles of Database Systems. Montreal, Canada, June 3-5. New York: ACM Press; 1996. pp 161-171.
27. White DA, Jain R. Similarity indexing with the SS-tree. In: Proceedings of IEEE ICDE International Conference on Data Engineering. New Orleans, Louisiana, Feb. 26-March 1. Los Alamitos: IEEE Computer Society; 1996. pp 516-523.
28. Yu B, Bailey T, Orlandic R, Somavaram J. KDBKD-tree: a compact KDB-tree structure for indexing multidimensional data. In: Proceedings of IEEE ITCC International Conference on Information Technology: Coding and Computing. Las Vegas, Nevada, April 28-30. Los Alamitos: IEEE Computer Society; 2003. pp 676-680.
29. Yu B, Orlandic R. Object and query transformation: supporting multidimensional queries through code reuse. In: Proceedings of ACM CIKM International Conference on Information and Knowledge Management. McLean, VA, Nov. 6-11. New York: ACM Press; 2000. pp 141-149.
30. Yu B, Orlandic R, Evens M. Simple QSF-trees: an efficient and scalable spatial access method. In: Proceedings of ACM CIKM International Conference on Information and Knowledge Management. Kansas City, Missouri, Nov. 2-6. New York: ACM Press; 1999. pp 5-14.

FIGURE LEGENDS
Please find two footnotes (endnotes) i and ii after this figure section.

Figure 1. A KDB-tree and its partition of space

[Figure 2 graphic omitted. Panels: (a), (b), (c). Labels: r, q, -q, Anchor, r + (-q).]
Figure 2. The C-space of two extended objects r and q

[Figure 3 graphic omitted. Labels: R, r', length1(q'), length2(q').]
Figure 3. The C-space of two rectangles R and q'

[Figure 4 graphic omitted. Panels: (a), (b). Annotations: P = 0.25 × 0.5; P = 0.75 × 0.5; P(R_i, q'_i) = (min{0.49, 0.79} - max{0.01, 0}) / 0.75 = 0.64. Other labels: 0.25, 0.26, 0.49, 0.5.]
Figure 4. (a) Eq. (5) vs. (b) Eq. (6)

[Figure 5 plot omitted. x-axis: dimensionality (2 to 30); y-axis: page accesses (0 to 1600); series: E1, E2, E3, KDBFD-Tree.]
Figure 5. Conventional approaches (i.e., E1 and E2) vs. new approach (i.e., E3)

[Figure 6 plot omitted. x-axis: split coordinates; y-axis: probability.]
Figure 6. A typical split coordinate distribution (probability density function)

[Figure 7 plots omitted. Panels: (a), (b), (c). In each panel, x-axis: split coordinates; y-axis: cumulative probability.]
Figure 7. A cumulative version of Figure 6 confined in a shaded region (a) represented by two extreme cases (b, c)

Figure 8. An example of synthetically generated skewed data distribution

[Figure 9 plots omitted. Panels: (a), (b). In both panels, x-axis: dimensionality (2 to 30); y-axis: page accesses (0 to 1400). Series in (a): E1, E2, E3, KDBFD-Tree; series in (b): E1, E2, E3, E4, KDBFD-Tree.]
Figure 9. (a) E1, E2, and E3 represent the results of Theorem 1 with Eq. (3), Eq. (5), and Eq. (6), respectively; (b) E1, E2, E3, and E4 represent the results of Algorithm 1 with Eq. (3), Eq. (5), Eq. (6), and Eq. (7), respectively

[Figure 10 plot omitted. x-axis: dimensionality (2 to 30); y-axis: number of histogram buckets (0 to 2000).]
Figure 10. The number of index regions (histogram buckets) at level M=h-1

[Figure 11 plots omitted. Panels: (a), (b). In both panels, x-axis: data size (8192, 16384, 32768, 65536, 131072, 262144, 524288, 1028872); y-axis: page accesses (0 to 3000). Series in (a): E1, E2, E3, KDBFD-Tree; series in (b): E1, E2, E3, E4, KDBFD-Tree.]
Figure 11. Experimentally observed performance of E1, E2, E3, and E4 for real data sets: (a) E1, E2, and E3 are Theorem 1 with Eq. (3), Theorem 1 with Eq. (5), and Theorem 1 with Eq. (6), respectively; (b) E1, E2, E3, and E4 are Algorithm 1 with Eq. (3), Algorithm 1 with Eq. (5), Algorithm 1 with Eq. (6), and Algorithm 1 with Eq. (7), respectively

[Figure 12 plot omitted. x-axis: data size (8192, 16384, 32768, 65536, 131072, 262144, 524288, 1028872); y-axis: number of histogram buckets (0 to 200).]
Figure 12. The number of index regions at level M=h-1 of the tested index structures

i An access path represents a way in which records can be found and retrieved from a relation (i.e., a set of records). Typical access paths include index-scan, which utilizes an access method to find the record-IDs, and table-scan, which traverses the entire data table.

ii Total cost involves both CPU cost and I/O cost. However, because accessing secondary storage (performing disk I/Os) is usually much more expensive, disk accesses are the dominant performance factor in large databases. Ignoring the effects of buffering and pre-fetching, each page access represents a disk access. The reader interested in the effect of buffering on the performance of multidimensional access methods is referred to reference 13 (Leutenegger and López).
