Indexing Multidimensional Data
![Page 1: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/1.jpg)
Indexing Multidimensional Data
Rui Zhang
http://www.csse.unimelb.edu.au/~rui
The University of Melbourne
Aug 2006
![Page 2: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/2.jpg)
Outline
- Backgrounds
  - Multidimensional data and queries
- Approaches
  - Mapping based indexing: Z-curve, iDistance
  - Hierarchical-tree based indexing: R-tree, k-d-tree, Quad-tree
  - Compression based indexing: VA-file
![Page 3: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/3.jpg)
Multidimensional Data
- Spatial data (low-dimensionality)
  - Geographic information: Melbourne (37, 145). Which city is at (30, 140)?
  - Computer-aided design: width and height (40, 50). Is there any part that has a width of 40 and a height of 50?
- Records with multiple attributes (medium-dimensionality)
  - Employee (ID, age, score, salary, …). Is there any employee whose age is under 25, performance score is greater than 80, and salary is between 3000 and 5000?
- Multimedia data (high-dimensionality)
  - Multimedia features: color, shape, texture
  - Color histograms of images. Give me the most similar image to [query image].
![Page 4: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/4.jpg)
Multidimensional Queries
- Point query
  - Return the objects located at Q(x1, x2, …, xd).
  - E.g. Q = (3.4, 6.6).
- Window query
  - Return all the objects enclosed or intersected by the hyper-rectangle W{[L1, U1], [L2, U2], …, [Ld, Ud]}.
  - E.g. W = {[0,4],[2,5]}
- K-Nearest Neighbor query (KNN query)
  - Return k objects whose distances to Q are no larger than any other object's distance to Q.
  - E.g. 3NN of Q = (4,1)
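The three query types can be pinned down with a brute-force sketch over the ten buildings (A to J) from the story table later in this transcript. The function names are illustrative and not from the slides, and a linear scan stands in for any index.

```python
import math

# Building coordinates taken from the story table in this transcript.
POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}

def point_query(points, q):
    """Names of the objects located exactly at q."""
    return sorted(name for name, p in points.items() if p == tuple(q))

def window_query(points, window):
    """Names of the objects inside the hyper-rectangle [(L1, U1), ..., (Ld, Ud)]."""
    return sorted(
        name for name, p in points.items()
        if all(lo <= v <= hi for v, (lo, hi) in zip(p, window))
    )

def knn_query(points, q, k):
    """The k objects closest to q by Euclidean distance, nearest first."""
    return sorted(points, key=lambda name: math.dist(points[name], q))[:k]

print(point_query(POINTS, (3.4, 6.6)))         # ['J']
print(window_query(POINTS, [(0, 4), (2, 5)]))  # ['C', 'F', 'G']
print(knn_query(POINTS, (4, 1), 3))            # ['B', 'C', 'D']
```

Every query here touches all ten objects; the indexing techniques in the rest of the slides aim to answer the same queries while inspecting far fewer.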
![Page 5: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/5.jpg)
Mapping Based Multidimensional Indexing
- Story
  - The CBD: {[0,4],[2,5]}
  - Blocks in the CBD are: [8,15], [32,33] and [36,37]
- General strategy: three steps
  - Data mapping and indexing
  - Query mapping and data retrieval
  - Filtering out false positives
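Buildings in their original order:

| Name | x | y | Block | Height |
|------|-----|-----|-------|--------|
| A | 0.7 | 1.2 | 2 | 100 |
| B | 5.8 | 1.2 | 19 | 50 |
| C | 2.7 | 2.3 | 12 | 80 |
| D | 5.5 | 2.4 | 25 | 90 |
| E | 6.6 | 2.5 | 28 | 40 |
| F | 1.7 | 3.8 | 11 | 120 |
| G | 2.8 | 4.7 | 36 | 100 |
| H | 0.6 | 5.8 | 34 | 50 |
| I | 1.6 | 6.7 | 41 | 60 |
| J | 3.4 | 6.6 | 45 | 40 |

After sorting by block number:

| Name | x | y | Block | Height |
|------|-----|-----|-------|--------|
| A | 0.7 | 1.2 | 2 | 100 |
| F | 1.7 | 3.8 | 11 | 120 |
| C | 2.7 | 2.3 | 12 | 80 |
| B | 5.8 | 1.2 | 19 | 50 |
| D | 5.5 | 2.4 | 25 | 90 |
| E | 6.6 | 2.5 | 28 | 40 |
| H | 0.6 | 5.8 | 34 | 50 |
| G | 2.8 | 4.7 | 36 | 100 |
| I | 1.6 | 6.7 | 41 | 60 |
| J | 3.4 | 6.6 | 45 | 40 |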
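The three-step strategy can be sketched on the table above. This is a minimal illustration, assuming a sorted Python list with binary search as a stand-in for the B+-tree; the names `range_scan` and `window_query` are made up for the example.

```python
from bisect import bisect_left, bisect_right

# Step 1 (data mapping and indexing): rows keyed by block number, kept sorted.
# Each row is (block, name, x, y), copied from the table above.
ROWS = sorted([
    (2, "A", 0.7, 1.2), (19, "B", 5.8, 1.2), (12, "C", 2.7, 2.3),
    (25, "D", 5.5, 2.4), (28, "E", 6.6, 2.5), (11, "F", 1.7, 3.8),
    (36, "G", 2.8, 4.7), (34, "H", 0.6, 5.8), (41, "I", 1.6, 6.7),
    (45, "J", 3.4, 6.6),
])
KEYS = [row[0] for row in ROWS]  # the one-dimensional search keys

def range_scan(lo, hi):
    """Step 2 (data retrieval): rows whose block number lies in [lo, hi]."""
    return ROWS[bisect_left(KEYS, lo):bisect_right(KEYS, hi)]

def window_query(window, block_ranges):
    """Fetch candidates from the mapped ranges, then filter with the real window."""
    (x_lo, x_hi), (y_lo, y_hi) = window
    candidates = [row for lo, hi in block_ranges for row in range_scan(lo, hi)]
    # Step 3 (filtering): a block may only partly overlap the window,
    # so each candidate is checked against the original coordinates.
    return [name for _, name, x, y in candidates
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi]

CBD = ((0, 4), (2, 5))
CBD_BLOCKS = [(8, 15), (32, 33), (36, 37)]  # query mapping given on the slide
print(window_query(CBD, CBD_BLOCKS))  # ['F', 'C', 'G']
```

In this particular example the blocks line up with the window, so the filtering step removes nothing; it matters when a window cuts through the middle of a block.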
![Page 6: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/6.jpg)
The Z-curve and Other Space-Filling Curves
- The Z-curve
  - Z-value calculation: bit-interleaving
  - Supports efficient window queries
  - Disadvantage: jumps
- Other space-filling curves
  - Hilbert curves
  - Gray-code
  - Column-wise scan
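Bit-interleaving can be sketched as follows. The sketch assumes unit-sized grid cells, 3 bits per dimension, and the y bit placed above the x bit in each pair; these choices are not stated on the slides but reproduce the Block column of the story table. `z_value` and `window_to_ranges` are illustrative names.

```python
def z_value(x_cell, y_cell, bits=3):
    """Interleave the bits of two cell indices, y bit above x bit in each pair."""
    z = 0
    for i in range(bits):
        z |= ((x_cell >> i) & 1) << (2 * i)
        z |= ((y_cell >> i) & 1) << (2 * i + 1)
    return z

def window_to_ranges(x_cells, y_cells, bits=3):
    """Z-values of every cell in a window, merged into runs of consecutive values."""
    zs = sorted(z_value(x, y, bits) for x in x_cells for y in y_cells)
    ranges = [[zs[0], zs[0]]]
    for z in zs[1:]:
        if z == ranges[-1][1] + 1:
            ranges[-1][1] = z      # extend the current run
        else:
            ranges.append([z, z])  # a jump in the curve starts a new run
    return [tuple(r) for r in ranges]

# Building C at (2.7, 2.3) falls in cell (2, 2); building B at (5.8, 1.2) in cell (5, 1).
print(z_value(2, 2))  # 12
print(z_value(5, 1))  # 19

# The CBD window {[0,4],[2,5]} covers x cells 0..3 and y cells 2..4.
print(window_to_ranges(range(0, 4), range(2, 5)))  # [(8, 15), (32, 33), (36, 37)]
```

Enumerating every cell is the naive way to map a window and is only practical for small grids. The three separate runs in the output show the jumps named above as the disadvantage of the Z-curve.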
![Page 7: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/7.jpg)
Mapping for KNN Queries
- Story continued
  - New factory at Q = (4,1)
  - Find the 3 nearest buildings to Q
- Termination condition
  - K candidates
  - All in the current search circle
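Buildings with their street numbers:

| Name | x | y | Street | Height |
|------|-----|-----|--------|--------|
| A | 0.7 | 1.2 | 14 | 100 |
| B | 5.8 | 1.2 | 32 | 50 |
| C | 2.7 | 2.3 | 12 | 80 |
| D | 5.5 | 2.4 | 31 | 90 |
| E | 6.6 | 2.5 | 32 | 40 |
| F | 1.7 | 3.8 | 13 | 120 |
| G | 2.8 | 4.7 | 24 | 100 |
| H | 0.6 | 5.8 | 23 | 50 |
| I | 1.6 | 6.7 | 22 | 60 |
| J | 3.4 | 6.6 | 24 | 40 |

Street numbers shown in the figure: 11, 12, 13, 14; 21, 22, 23, 24; 31, 32.

After sorting by street number:

| Name | x | y | Street | Height |
|------|-----|-----|--------|--------|
| C | 2.7 | 2.3 | 12 | 80 |
| F | 1.7 | 3.8 | 13 | 120 |
| A | 0.7 | 1.2 | 14 | 100 |
| I | 1.6 | 6.7 | 22 | 60 |
| H | 0.6 | 5.8 | 23 | 50 |
| G | 2.8 | 4.7 | 24 | 100 |
| J | 3.4 | 6.6 | 24 | 40 |
| D | 5.5 | 2.4 | 31 | 90 |
| B | 5.8 | 1.2 | 32 | 50 |
| E | 6.6 | 2.5 | 32 | 40 |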
Candidate list as the search proceeds (candidates found in the order A, F, B, E, C, D; distance to Q in parentheses):

| Rank 1 | Rank 2 | Rank 3 |
|--------|--------|--------|
| A (3.31) | | |
| A (3.31) | F (3.62) | |
| B (1.81) | A (3.31) | F (3.62) |
| B (1.81) | E (3.00) | A (3.31) |
| B (1.81) | C (1.84) | E (3.00) |
| B (1.81) | C (1.84) | D (2.05) |

Distances to Q: ||AQ|| = 3.31, ||FQ|| = 3.62, ||BQ|| = 1.81, ||EQ|| = 3.00, ||CQ|| = 1.84, ||DQ|| = 2.05

Search radius: R = 0.35, 0.70, 1.05, 1.40, 1.75, 2.10. At R = 2.10 the three candidates B, C and D all lie inside the search circle, which satisfies the termination condition.
![Page 8: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/8.jpg)
The iDistance
- Data partitioned into a number of clusters
  - Streets are concentric circles
- Data mapping
  - Objects mapped to street numbers
- Query mapping
  - Search circle mapped to the streets it intersects
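The data and query mapping can be sketched in the style of the iDistance paper listed in the references, where an object's one-dimensional key is its cluster number times a constant plus its distance to the cluster's reference point. The reference points `REFS` and the constant `C` below are hypothetical (the slides do not give them); the radius step of 0.35 follows the radii listed in the KNN story. A sorted list again stands in for the B+-tree.

```python
import math
from bisect import bisect_left, bisect_right

POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
REFS = [(1.7, 2.4), (2.1, 6.0), (6.0, 2.0)]  # hypothetical cluster reference points
C = 100.0  # hypothetical constant that keeps the key ranges of clusters apart

def build_index(points, refs):
    """Data mapping: key = cluster * C + distance to the nearest reference point."""
    entries, max_dist = [], [0.0] * len(refs)
    for name, p in points.items():
        i = min(range(len(refs)), key=lambda j: math.dist(p, refs[j]))
        d = math.dist(p, refs[i])
        max_dist[i] = max(max_dist[i], d)
        entries.append((i * C + d, name))
    return sorted(entries), max_dist

def knn(points, refs, q, k, step=0.35):
    """Enlarge the search circle until k candidates all lie inside it."""
    if k > len(points):
        raise ValueError("k exceeds the number of points")
    entries, max_dist = build_index(points, refs)
    keys = [key for key, _ in entries]
    rounds = 0
    while True:
        rounds += 1
        r = rounds * step
        found = {}
        for i, ref in enumerate(refs):
            dq = math.dist(q, ref)
            if dq - r > max_dist[i]:
                continue  # the search circle does not reach this cluster
            # Query mapping: the circle becomes one key interval per cluster.
            lo = i * C + max(0.0, dq - r)
            hi = i * C + min(max_dist[i], dq + r)
            for _, name in entries[bisect_left(keys, lo):bisect_right(keys, hi)]:
                found[name] = math.dist(points[name], q)
        best = sorted(found.items(), key=lambda item: item[1])[:k]
        if len(best) == k and best[-1][1] <= r:
            return [name for name, _ in best], r

names, radius = knn(POINTS, REFS, (4, 1), 3)
print(names, round(radius, 2))  # ['B', 'C', 'D'] 2.1
```

By the triangle inequality, every object within distance r of Q has a key inside one of the scanned intervals, so once the k-th candidate is within r no closer object can have been missed.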
![Page 9: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/9.jpg)
Hierarchical Tree Structures
- R-tree
  - Minimum bounding rectangle (MBR)
  - Incomplete and overlapping partitioning
  - Disk-based; Balanced
[Figure: R-tree examples over points A to G, showing two alternative groupings of the points into MBRs]
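How MBRs prune a window query can be sketched with a single level of leaf nodes. The grouping in `LEAVES` is hypothetical, and the coordinates are borrowed from the story buildings rather than from the R-tree figure; a real R-tree is multi-level, paged, and forms its nodes through insertion and split rules.

```python
def mbr(points):
    """Minimum bounding rectangle of 2-D points as ((x_lo, x_hi), (y_lo, y_hi))."""
    xs, ys = zip(*points)
    return (min(xs), max(xs)), (min(ys), max(ys))

def intersects(a, b):
    """True if two rectangles share at least one point."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))

def contains(rect, p):
    """True if point p lies inside the rectangle."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(p, rect))

POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
# Hypothetical leaf nodes, chosen by hand for the example.
LEAVES = [["A", "C", "F"], ["G", "H", "I", "J"], ["B", "D", "E"]]
LEAF_MBRS = [mbr([POINTS[n] for n in leaf]) for leaf in LEAVES]

def window_query(window):
    """Visit only the leaves whose MBR intersects the window."""
    hits, visited = [], 0
    for leaf, box in zip(LEAVES, LEAF_MBRS):
        if not intersects(box, window):
            continue  # the whole leaf is pruned by its MBR
        visited += 1
        hits += [n for n in leaf if contains(window, POINTS[n])]
    return sorted(hits), visited

print(window_query(((0, 4), (2, 5))))  # (['C', 'F', 'G'], 2)
```

The MBRs cover only the area around the data, which is the "incomplete" partitioning named above; when MBRs overlap, a query may have to descend into several nodes, which is the overlap problem.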
- K-d-tree
  - Space division recursively
  - Complete and disjoint partitioning
  - In-memory; Unbalanced
  - There are algorithms to page and balance the tree, but with more complex manipulations
[Figure: step-by-step construction of the trees over points A to G, with internal nodes N1 to N5 and the values 0.5 and 0.3 marked]
Problem: Overlap
Problem: Empty space
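A minimal k-d-tree sketch shows the recursive, disjoint space division and why the tree can end up unbalanced. It assumes plain insertion without rebalancing, with the splitting dimension alternating by depth, and reuses the story buildings as data; all names are illustrative.

```python
class Node:
    def __init__(self, name, point):
        self.name, self.point = name, point
        self.left = self.right = None

def insert(node, name, point, depth=0):
    """Insert a point; the splitting dimension alternates with depth."""
    if node is None:
        return Node(name, point)
    axis = depth % len(point)
    if point[axis] < node.point[axis]:
        node.left = insert(node.left, name, point, depth + 1)
    else:
        node.right = insert(node.right, name, point, depth + 1)
    return node

def window_query(node, window, depth=0, out=None):
    """Collect points inside the window, skipping half-spaces it cannot reach."""
    out = [] if out is None else out
    if node is None:
        return out
    if all(lo <= v <= hi for v, (lo, hi) in zip(node.point, window)):
        out.append(node.name)
    axis = depth % len(window)
    lo, hi = window[axis]
    if lo < node.point[axis]:    # window reaches below the split
        window_query(node.left, window, depth + 1, out)
    if hi >= node.point[axis]:   # window reaches the split or above
        window_query(node.right, window, depth + 1, out)
    return out

def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
root = None
for name, point in POINTS.items():
    root = insert(root, name, point)

print(sorted(window_query(root, ((0, 4), (2, 5)))))  # ['C', 'F', 'G']
print(height(root))  # taller than the 4 levels a balanced tree of 10 nodes needs
```

Because the shape depends on insertion order, the tree here grows deeper than a balanced one would, which is what the "Unbalanced" entry above refers to.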
![Page 10: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/10.jpg)
Hierarchical Tree Structures (continued)
- Quad-tree
  - Space divided into 4 rectangles recursively
  - Complete and disjoint partitioning
  - In-memory; Unbalanced
  - There are algorithms to page and balance the tree, but with more complex manipulations
The point quad-tree
[Figure: a point quad-tree over points A to G, with child quadrants labelled NW, NE, SW and SE]
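A point quad-tree can be sketched as below: each stored point splits its region into four quadrants. The tie rule (points on a dividing line go north or east) is an assumption made for the example, and the data are the story buildings rather than the points in the figure.

```python
class QuadNode:
    def __init__(self, name, x, y):
        self.name, self.x, self.y = name, x, y
        self.children = {}  # quadrant label ("NW", "NE", "SW", "SE") -> QuadNode

def quadrant(node, x, y):
    """Quadrant of (x, y) relative to the node's point (ties go north and east)."""
    return ("N" if y >= node.y else "S") + ("E" if x >= node.x else "W")

def insert(root, name, x, y):
    """Walk down by quadrant and attach the new point at the first empty slot."""
    if root is None:
        return QuadNode(name, x, y)
    node = root
    while True:
        q = quadrant(node, x, y)
        if q not in node.children:
            node.children[q] = QuadNode(name, x, y)
            return root
        node = node.children[q]

def point_query(root, x, y):
    """Name of the object stored at (x, y), or None if there is none."""
    node = root
    while node is not None:
        if (node.x, node.y) == (x, y):
            return node.name
        node = node.children.get(quadrant(node, x, y))
    return None

POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
root = None
for name, (x, y) in POINTS.items():
    root = insert(root, name, x, y)

print(point_query(root, 3.4, 6.6))  # J
print(point_query(root, 1.0, 1.0))  # None
```

As with the k-d-tree, the shape depends on insertion order, so nothing keeps the four subtrees of a node the same size.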
![Page 11: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/11.jpg)
Compression Based Indexing
The dimensionality curse
The Vector Approximation File (VA-File)
[Figures: VA-File; skewed data]
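The slide only names the VA-File, so the following is a simplified sketch of the general filter-and-refine idea described in the Weber et al. paper listed in the references: each vector is replaced by a few bits per dimension, distance bounds are computed from those approximations alone, and only the surviving candidates are looked up exactly. The data range [0, 8), the uniform grid, and all function names are assumptions made for the example.

```python
import math

def approximate(vec, bits, lo=0.0, hi=8.0):
    """Quantise each coordinate to a cell index of `bits` bits on a uniform grid."""
    cells = 1 << bits
    width = (hi - lo) / cells
    return tuple(min(cells - 1, int((v - lo) / width)) for v in vec)

def bounds(q, approx, bits, lo=0.0, hi=8.0):
    """Lower and upper bound on the distance from q to any point in the cell."""
    width = (hi - lo) / (1 << bits)
    lb2 = ub2 = 0.0
    for qv, c in zip(q, approx):
        c_lo, c_hi = lo + c * width, lo + (c + 1) * width
        near = max(c_lo - qv, 0.0, qv - c_hi)          # 0 if q is inside the slice
        far = max(abs(qv - c_lo), abs(qv - c_hi))
        lb2 += near * near
        ub2 += far * far
    return math.sqrt(lb2), math.sqrt(ub2)

def knn(points, q, k, bits=2):
    """Filter on approximations, then refine the candidates with exact distances."""
    va = {name: approximate(p, bits) for name, p in points.items()}
    bnd = {name: bounds(q, a, bits) for name, a in va.items()}
    # At least k objects lie within the k-th smallest upper bound, so any
    # object whose lower bound exceeds it cannot be among the k nearest.
    kth_ub = sorted(ub for _, ub in bnd.values())[k - 1]
    candidates = [n for n, (lb, _) in bnd.items() if lb <= kth_ub]
    exact = sorted(candidates, key=lambda n: math.dist(points[n], q))
    return exact[:k], len(candidates)

POINTS = {
    "A": (0.7, 1.2), "B": (5.8, 1.2), "C": (2.7, 2.3), "D": (5.5, 2.4),
    "E": (6.6, 2.5), "F": (1.7, 3.8), "G": (2.8, 4.7), "H": (0.6, 5.8),
    "I": (1.6, 6.7), "J": (3.4, 6.6),
}
nearest, n_candidates = knn(POINTS, (4, 1), 3)
print(nearest)  # ['B', 'C', 'D']
print(n_candidates, "of", len(POINTS), "vectors needed an exact distance")
```

More bits per dimension give tighter bounds and fewer candidates at the cost of a larger approximation file.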
![Page 12: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/12.jpg)
Summary of the Indexing Techniques

| Index | Disk-based / In-memory | Balanced | Efficient query type | Dimensionality | Comments |
|-------|------------------------|----------|----------------------|----------------|----------|
| R-tree | Disk-based | Yes | Point, window, kNN | Low | Disadvantage is overlap |
| K-d-tree | In-memory | No | Point, window, kNN(?) | Low | Inefficient for skewed data |
| Quad-tree | In-memory | No | Point, window, kNN(?) | Low | Inefficient for skewed data |
| Z-curve + B+-tree | Disk-based | Yes | Point, window | Low | Order of the Z-curve affects performance |
| iDistance | Disk-based | Yes | Point, kNN | High | Not good for uniform data in very high-D |
| VA-File | Disk-based | - | Point, window, kNN | High | Not good for skewed data |
![Page 13: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/13.jpg)
Index Implementations in major DBMS
- SQL Server
  - B+-Tree data structure
  - Clustered indexes are sparse
  - Indexes maintained as updates/insertions/deletes are performed
- Oracle
  - B+-tree, hash, bitmap, spatial extender for R-Tree
  - Clustered index
    - Index organized table (unique/clustered)
    - Clusters used when creating tables
- DB2
  - B+-Tree data structure, spatial extender for R-tree
  - Clustered indexes are dense
  - Explicit command for index reorganization
![Page 14: Indexing Multidimensional Data](https://reader030.fdocuments.net/reader030/viewer/2022033104/56812cf0550346895d91bc79/html5/thumbnails/14.jpg)
Recommended Readings and References
- Survey on multidimensional indexing techniques
  - Christian Böhm, Stefan Berchtold, Daniel A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 2001.
  - Volker Gaede, Oliver Günther. Multidimensional Access Methods. ACM Computing Surveys, 1998.
- Mapping based indexing
  - Rui Zhang, Panos Kalnis, Beng Chin Ooi, Kian-Lee Tan. Generalized Multi-dimensional Data Mapping and Query Processing. ACM Transactions on Database Systems (TODS), 30(3), 2005.
- Space-filling curves
  - H. V. Jagadish. Linear Clustering of Objects with Multiple Attributes. ACM SIGMOD Conference (SIGMOD), 1990.
- iDistance
  - H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, Rui Zhang. iDistance: An Adaptive B+-tree Based Indexing Method for Nearest Neighbor Search. ACM Transactions on Database Systems (TODS), 30(2), 2005.
- R-tree
  - Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD Conference (SIGMOD), 1984.
- Quad-tree
  - Hanan Samet. The Quadtree and Related Hierarchical Data Structures. ACM Computing Surveys, 1984.
- VA-File
  - Roger Weber, Hans-Jörg Schek, Stephen Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. International Conference on Very Large Data Bases (VLDB), 1998.