Spatial data requires special data structures, similar to B-trees.

50

Transcript of Spatial data requires special data structures, similar to B-trees.

Page 1: Spatial data requires special data structures, similar to B-trees.
Page 2: Spatial data requires special data structures, similar to B-trees.

Spatial data requires special data structures, similar to B-trees

Page 3: Spatial data requires special data structures, similar to B-trees.

Types of spatial data Example of spatial and geometric data –

splines and Voronoi diagrams One-dimensional index – interval trees R-trees T-tree variants

Page 4: Spatial data requires special data structures, similar to B-trees.

Point Data◦ Points in a multidimensional space◦ E.g., Raster data such as satellite imagery,

where each pixel stores a measured value◦ E.g., Feature vectors extracted from text

Region Data◦ Objects have spatial extent with location and

boundary◦ DB typically uses geometric approximations

constructed using line segments, polygons, etc., called vector data.

Page 5: Spatial data requires special data structures, similar to B-trees.

Different types of sampling are used to collect data

Page 6: Spatial data requires special data structures, similar to B-trees.
Page 7: Spatial data requires special data structures, similar to B-trees.
Page 8: Spatial data requires special data structures, similar to B-trees.
Page 9: Spatial data requires special data structures, similar to B-trees.
Page 10: Spatial data requires special data structures, similar to B-trees.

Low-pass filter – the value for the cell is computed as average of other cells

High-pass-continuous surface –low pass

Page 11: Spatial data requires special data structures, similar to B-trees.

Window size has effect on filtering

Page 12: Spatial data requires special data structures, similar to B-trees.
Page 13: Spatial data requires special data structures, similar to B-trees.

Spatial Range Queries◦ Find all cities within 50 miles of Madison◦ Query has associated region (location, boundary)◦ Answer includes ovelapping or contained data regions

Nearest-Neighbor Queries◦ Find the 10 cities nearest to Madison◦ Results must be ordered by proximity

Spatial Join Queries◦ Find all cities near a lake◦ Expensive, join condition involves regions and proximity

Page 14: Spatial data requires special data structures, similar to B-trees.

Geographic Information Systems (GIS)◦ E.g., ESRI’s ArcInfo; OpenGIS Consortium◦ Geospatial information◦ All classes of spatial queries and data are common

Computer-Aided Design/Manufacturing◦ Store spatial objects such as surface of airplane fuselage◦ Range queries and spatial join queries are common

Multimedia Databases◦ Images, video, text, etc. stored and retrieved by content◦ First converted to feature vector form; high

dimensionality◦ Nearest-neighbor queries are the most common

Page 15: Spatial data requires special data structures, similar to B-trees.

B+ trees are fundamentally single-dimensional indexes. When we create a composite search key B+ tree, e.g.,

an index on <age, sal>, we effectively linearize the 2-dimensional space since we sort entries first by age and then by sal.

Consider entries:<11, 80>, <12, 10><12, 20>, <13, 75>

11 12 13

70605040302010

80

B+ treeorder

Page 16: Spatial data requires special data structures, similar to B-trees.

A multidimensional index clusters entries so as to exploit “nearness” in multidimensional space.

Keeping track of entries and maintaining a balanced index structure presents a challenge!

Consider entries:<11, 80>, <12, 10><12, 20>, <13, 75>

Spatialclusters

70605040302010

80

B+ treeorder

11 12 13

Page 17: Spatial data requires special data structures, similar to B-trees.

Spatial queries (GIS, CAD).◦ Find all hotels within a radius of 5 miles from the

conference venue.◦ Find the city with population 500,000 or more that is

nearest to Kalamazoo, MI.◦ Find all cities that lie on the Nile in Egypt.◦ Find all parts that touch the fuselage (in a plane

design). Multidimensional range queries.

◦ 50 < age < 55 AND 80K < sal < 90K

Page 18: Spatial data requires special data structures, similar to B-trees.

Similarity queries (content-based retrieval).◦ Given a face, find the five

most similar faces/expressions.

Page 19: Spatial data requires special data structures, similar to B-trees.

Geometric, 1-dimensional tree Interval is defined by (x1,x2) Split at the middle (5), again at the middle

(3,7), again at the middle (2,8) All intervals intersecting a middle point are

stored at the corresponding root.

1 2 3 4 5 6 7 8 9

(4,6) (4,8)

(6,9)

(7.5,8.5)

(2,4)

Page 20: Spatial data requires special data structures, similar to B-trees.

Finding intervals – by finding x1, x2 against the nodes

Find interval containing specific value – from the root

Sort intervals within each node of the tree according to their coorsinates

Cost of the “stabbing query”– finding all intervals containing the specified value is O(log n + k), where k is the number of reported intervals.

Page 21: Spatial data requires special data structures, similar to B-trees.

Constructs the minimal bounding box (mbb)

Check validity (predicate) on mbb Refinement step verifies if actual

objects satisfy the predicate.

Page 22: Spatial data requires special data structures, similar to B-trees.

An index based on spatial location needed.◦ One-dimensional indexes don’t support

multidimensional searching efficiently. ◦ Hash indexes only support point queries; want

to support range queries as well.◦ Must support inserts and deletes gracefully.

Ideally, want to support non-point data as well (e.g., lines, shapes).

The R-tree meets these requirements, and variants are widely used today.

Page 23: Spatial data requires special data structures, similar to B-trees.

The R-tree is a tree-structured index that remains balanced on inserts and deletes.

Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension.

Example in 2-D:

X

Y

Root ofR Tree

Leaf level

Page 24: Spatial data requires special data structures, similar to B-trees.

Leaf entry = < n-dimensional box, rid >◦ This is Alternative (2), with key value being a box.◦ Box is the tightest bounding box for a data object.

Non-leaf entry = < n-dim box, ptr to child node >◦ Box covers all boxes in child node (in fact, subtree).

All leaves at same distance from root. Nodes can be kept 50% full (except root).

◦ Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Page 25: Spatial data requires special data structures, similar to B-trees.

R8R9

R10

R11

R12

R17

R18

R19

R13

R14

R15

R16

R1

R2

R3

R4

R5

R6

R7

Leaf entry

Index entry

Spatial objectapproximated by bounding box R8

Page 26: Spatial data requires special data structures, similar to B-trees.

R1 R2

R3 R4 R5 R6 R7

R8 R9 R10 R11R12 R13R14 R15R16 R17R18R19

Page 27: Spatial data requires special data structures, similar to B-trees.

Start at root.1. If current node is non-leaf, for each entry <E, ptr>, if box E overlaps Q, search subtree identified by ptr.2. If current node is leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.

Note: May have to search several subtrees at each node!(In contrast, a B-tree equality search goes to just one leaf.)

Page 28: Spatial data requires special data structures, similar to B-trees.

It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?◦ Will reduce overlap with nodes in tree, and reduce

the number of nodes fetched by avoiding some branches altogether.

◦ Cost of overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Page 29: Spatial data requires special data structures, similar to B-trees.

Start at root and go down to “best-fit” leaf L.◦ Go to child whose box needs least enlargement to

cover B; resolve ties by going to smallest area child.

If best-fit leaf L has space, insert entry and stop. Otherwise, split L into L1 and L2.◦ Adjust entry for L in its parent so that the box now

covers (only) L1.◦ Add an entry (in the parent node of L) for L2. (This

could cause the parent node to recursively split.)

Page 30: Spatial data requires special data structures, similar to B-trees.

The entries in node L plus the newly inserted entry must be distributed between L1 and L2.

Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries.

Idea: Redistribute so as to minimize area of L1 plus area of L2.

Exhaustive algorithm is too slow; quadratic and linear heuristics are popular in research. GOOD SPLIT!

BAD!

Page 31: Spatial data requires special data structures, similar to B-trees.

The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:◦ Remove some (say, 30% of the) entries and reinsert them

into the tree. ◦ Could result in all reinserted entries fitting on some existing

pages, avoiding a split. R* trees also use a different heuristic, minimizing box

perimeters rather than box areas during insertion. Another variant, the R+ tree, avoids overlap by

inserting an object into multiple leaves if necessary.◦ Searches now take a single path to a leaf, at cost of

redundancy.

Page 32: Spatial data requires special data structures, similar to B-trees.

The Generalized Search Tree (GiST) abstracts the “tree” nature of a class of indexes including B+ trees and R-tree variants.◦ Striking similarities in insert/delete/search and even

concurrency control algorithms make it possible to provide “templates” for these algorithms that can be customized to obtain the many different tree index structures.

◦ B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.

◦ GiST provides an alternative for implementing other tree indexes.

Page 33: Spatial data requires special data structures, similar to B-trees.

Typically, high-dimensional datasets are collections of points, not regions.◦ E.g., Feature vectors in multimedia applications.◦ Very sparse

Nearest neighbor queries are common.◦ R-tree becomes worse than sequential scan for most

datasets with more than a dozen dimensions. As dimensionality increases contrast (ratio of

distances between nearest and farthest points) usually decreases; “nearest neighbor” is not meaningful.

Page 34: Spatial data requires special data structures, similar to B-trees.

Spatial data management has many applications, including GIS, CAD/CAM, multimedia indexing.◦ Point and region data◦ Overlap/containment and nearest-neighbor

queries Many approaches to indexing spatial data

◦ R-tree approach is widely used in GIS systems◦ Other approaches include Grid Files, Quad trees,

and techniques based on “space-filling” curves.◦ For high-dimensional datasets, unless data has

good “contrast”, nearest-neighbor may not be well-separated

Page 35: Spatial data requires special data structures, similar to B-trees.

Deletion consists of searching for the entry to be deleted, removing it, and if the node becomes under-full, deleting the node and then re-inserting the remaining entries.

Overall, works quite well for 2 and 3 D datasets. Several variants (notably, R+ and R* trees) have been proposed; widely used.

Can improve search performance by using a convex polygon to approximate query shape (instead of a bounding box) and testing for polygon-box intersection.

Page 36: Spatial data requires special data structures, similar to B-trees.

"Print Gallery," by M.C. Escher. Curious about the blank spot in the middle of Escher’s 1956 lithograph, Hendrik Lenstra set out to learn whether the artist had encountered a mathematical problem he couldn’t solve. ©2002 Cordon Art B.V., Baarn, Holland. All rights reserved.

Page 37: Spatial data requires special data structures, similar to B-trees.

Fixed grid: Stored as a 2D array,

each entry contains a link to a list of points (object) stored in a grid.

a,b

Page 38: Spatial data requires special data structures, similar to B-trees.

Too many points in one grid cell:Solution A –overflow (linked list)Solution B- Split the cell and increase index!

Page 39: Spatial data requires special data structures, similar to B-trees.

Rectangles may share different grid cells Rectangle duplicates are stored Grid cells are of fixed size

Page 40: Spatial data requires special data structures, similar to B-trees.

In a grid file, the index is dynamically increased in size when overflow happens.

The space is split by a vertical or a horizontal line, and then further subdivided when overflow happens!

Index is dynamically growing Boundaries of cells of different sizes are

stores, thus point and stabbing queries are easy

Page 41: Spatial data requires special data structures, similar to B-trees.

Instead of using an array as an index, use tree!

Quadtree decomposition – cells are indexed by using quaternary B-tree.

All cells are squares, not polygons. Search in a tree is faster!

Page 42: Spatial data requires special data structures, similar to B-trees.

First three levels of a quad tree

Page 43: Spatial data requires special data structures, similar to B-trees.

Project #32: PICTURE REPRESENTATION USING QUAD TREES, McGill University:

8 x 8 pixel picture

represented in a quad

tree

Page 44: Spatial data requires special data structures, similar to B-trees.

Example of a grid file

Page 45: Spatial data requires special data structures, similar to B-trees.

B+ index – actual references to rectangles are stored in the leaves, saving more space+ access time

Label nodes according to Z or “pi” order

Page 46: Spatial data requires special data structures, similar to B-trees.

Level of detail increases as the number of quadtree decompositions increases!

Decompositions have indexes of a form:00,01,02,03,10,11,12,13, 2,300301 ,302 ,303 ,31 ,32 ,33 ◦ Stores as Bplus tree

Page 47: Spatial data requires special data structures, similar to B-trees.
Page 48: Spatial data requires special data structures, similar to B-trees.
Page 49: Spatial data requires special data structures, similar to B-trees.

What is spatial data structure? What is the difference between grid and

grid file? Explain how z or p ordering works? Define interval trees Provide example of R-tree List R-tree variants How spatial index structure differs from

regular B+ tree?

Page 50: Spatial data requires special data structures, similar to B-trees.

Text 1 instructor’s resources McGill University web space Wikepedia (z order images) Face recognition research SPARCS lab project on image processing