  • date post

    22-Dec-2015

1

What we have covered?

Indexing and Hashing
Data warehouse and OLAP
Data Mining
Information Retrieval and Web Mining
XML and XQuery
Spatial Databases
Transaction Management

2

Lecture 6: Spatial Data Management

3

Types of Spatial Data

Point Data
  Points in a multidimensional space
  E.g., raster data such as satellite imagery, where each pixel stores a measured value
  E.g., feature vectors extracted from text

Region Data
  Objects have spatial extent with location and boundary
  DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data

4

Applications of Spatial Data

Geographic Information Systems (GIS)
  E.g., ESRI’s ArcInfo; OpenGIS Consortium
  Geospatial information
  All classes of spatial queries and data are common

Computer-Aided Design/Manufacturing
  Store spatial objects such as surface of airplane fuselage
  Range queries and spatial join queries are common

Multimedia Databases
  Images, video, text, etc. stored and retrieved by content
  First converted to feature vector form; high dimensionality
  Nearest-neighbor queries are the most common

5

Types of Spatial Queries

Spatial Range Queries
  Find all cities within 50 miles of Madison
  Query has associated region (location, boundary)
  Answer includes overlapping or contained data regions

Nearest-Neighbor Queries
  Find the 10 cities nearest to Madison
  Results must be ordered by proximity

Spatial Join Queries
  Find all cities near a lake
  Expensive, join condition involves regions and proximity

6

Spatial Indexing

Point Access Methods (PAMs) vs Spatial Access Methods (SAMs)

PAM: index only point data
  Hierarchical (tree-based) structures
  Multidimensional hashing
  Space-filling curves

SAM: index both points and regions
  Transformations
  Overlapping regions
  Clipping methods (non-overlapping)

Data partitioning vs Space partitioning

Single-Dimensional Indexes

B+ trees are fundamentally single-dimensional indexes.
When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space since we sort entries first by age and then by sal.

Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>

[Figure: the four entries plotted with AGE (11 to 13) on one axis and SAL (10 to 80) on the other, connected in B+ tree order]
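The linearization can be illustrated with the four sample entries. This is only a sketch: Python's tuple ordering stands in for the composite-key order of a B+ tree, and the `sal >= 75` condition is a made-up query, not one from the slides.

```python
# Entries are <age, sal> pairs from the slide, in arbitrary arrival order.
entries = [(13, 75), (12, 20), (11, 80), (12, 10)]

# A composite-key index on <age, sal> keeps entries in lexicographic order:
# first by age, then by sal among entries with the same age.
btree_order = sorted(entries)

# A condition on sal alone cannot exploit that order: the qualifying entries
# are scattered through the sequence, so every entry has to be inspected.
high_sal = [e for e in btree_order if e[1] >= 75]
```

Here the two high-salary entries end up at opposite ends of the sorted sequence, which is the weakness a multidimensional index tries to avoid.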

Multidimensional Indexes

A multidimensional index clusters entries so as to exploit “nearness” in multidimensional space.
Keeping track of entries and maintaining a balanced index structure presents a challenge!

Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>

[Figure: the same four entries plotted by age (11 to 13) and sal (10 to 80), contrasting spatial clusters with B+ tree order]

Motivation for Multidimensional Indexes

Spatial queries (GIS, CAD).
  Find all hotels within a radius of 5 miles from the conference venue.
  Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  Find all cities that lie on the Nile in Egypt.
  Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval).
  Given a face, find the five most similar faces.

Multidimensional range queries.
  50 < age < 55 AND 80K < sal < 90K

What’s the difficulty?

An index based on spatial location is needed.
  One-dimensional indexes don’t support multidimensional searching efficiently. (Why?)

Hash indexes only support point queries; want to support range queries as well.

Must support inserts and deletes gracefully.

Ideally, want to support non-point data as well (e.g., lines, shapes).

11

PAMs

Point Access Methods

Hierarchical methods: kd-tree based

Space Filling Curves: Z-ordering

Multidimensional Hashing: Grid File
  Exponential growth of the directory

12

The problem

Given a point set and a rectangular query, find the points enclosed in the query

We allow insertions/deletions online

[Figure: a point set with a rectangular region labeled Query]

13

Tree-based PAMs

Most tree-based PAMs (tb-PAMs) are based on the kd-tree

kd-tree is a main memory binary tree for indexing k-dimensional points
  Needs to be adapted for the disk model

Levels rotate among the dimensions, partitioning the space based on a value for that dimension

kd-tree is not necessarily balanced

14

kd-tree

At each level we use a different dimension

[Figure: a kd-tree over regions A to E with split nodes x=5 (x<5 to one side, x>=5 to the other), y=3, y=6 and x=6]

15

Kd-tree properties

Height of the tree: O(log2 n), when the tree is balanced
Search time for exact match: O(log2 n)
Search time for range query: O(n^(1/2) + k) in two dimensions, where k is the number of points reported
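A minimal main-memory sketch of a 2-d kd-tree with a rectangular range query, assuming median splits and the x<v / x>=v convention shown on the earlier slide. The names `Node`, `build` and `range_query` are illustrative and not taken from the lecture.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Node:
    point: Point                    # the point stored at this node
    axis: int                       # 0 = split on x, 1 = split on y
    left: Optional["Node"] = None   # points with coordinate <  split value
    right: Optional["Node"] = None  # points with coordinate >= split value


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a kd-tree by splitting at (roughly) the median, alternating axes."""
    if not points:
        return None
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    # Shift left over equal values so that ties always go to the right subtree.
    while mid > 0 and pts[mid - 1][axis] == pts[mid][axis]:
        mid -= 1
    return Node(pts[mid], axis,
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))


def range_query(node: Optional[Node], lo: Point, hi: Point,
                out: Optional[List[Point]] = None) -> List[Point]:
    """Collect all points p with lo <= p <= hi in both coordinates."""
    if out is None:
        out = []
    if node is None:
        return out
    x, y = node.point
    if lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]:
        out.append(node.point)
    split = node.point[node.axis]
    if lo[node.axis] < split:       # query reaches into the "< split" side
        range_query(node.left, lo, hi, out)
    if hi[node.axis] >= split:      # query reaches into the ">= split" side
        range_query(node.right, lo, hi, out)
    return out


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
found = sorted(range_query(tree, (3, 1), (8, 5)))
```

Subtrees whose half-plane does not meet the query rectangle are never visited, which is where the sublinear range-query bound comes from.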

16

kd-tree example

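[Figure: kd-tree example with root split x=5, second-level splits y=5 and y=6, and further splits x=3, y=2, x=8 and x=7, shown next to the corresponding partition of the plane by the lines x=3, x=5, x=7, x=8, y=2 and y=6]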

17

External memory kd-trees

Similar to B-tree, tree nodes split many ways instead of two ways
  Insertion becomes quite complex and expensive.
  No storage utilization guarantee: when a higher level node splits, the split has to be propagated all the way to the leaf level, resulting in many empty blocks.

Pack many interior nodes (forming a subtree) into a block.
  It may not be feasible to group nodes at lower levels into a block productively.
  Many interesting papers on how to optimally pack nodes into blocks have recently been published.


20

Z-Curve

What is a Z-curve?
  A space-filling curve
  Generated by interleaving the bits of the x and y coordinates (see Fig. 4.6)
  Alternative generation method (see Fig. 4.5)
  Connecting points in z-order (see Fig. 4.4) looks like Ns or Zs

Implementing file operations

[Figures: Fig 4.4, Fig 4.6]
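Bit interleaving can be sketched in a few lines. The convention assumed here (each x bit placed before the corresponding y bit) is one common choice and may differ from the figures; `z_value` is an illustrative name.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)  # x bit goes to the odd position
        z |= ((y >> i) & 1) << (2 * i)      # y bit goes to the even position
    return z


# Visiting the cells of a 2x2 grid in increasing z-value.
cells = [(x, y) for x in range(2) for y in range(2)]
z_order = sorted(cells, key=lambda c: z_value(*c))
```

Once every point has a z-value, an ordinary one-dimensional index such as a B+ tree can store the points in that order.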

21

Example of Z-values

Figure 4.7
  Left part shows a map with spatial objects A, B, C
  Right part and left bottom part show Z-values within A, B and C
  Note C gets z-values of 2 and 8, which are not close
  Exercise: Compute z-values for B.

[Figure: Fig 4.7]

22

Hilbert Curve

A space-filling curve
  Example: Fig. 4.5

More complex to generate
  due to rotations
  Illustration on next slide!

Implementing file operations

[Figure: Fig 4.5]

23

Calculating Hilbert Values (Optional Topic)

Fig 4.8


25

Grid File

Hashing methods for multidimensional points (extension of Extensible hashing)

Idea: Use a grid to partition the space
  Each cell is associated with one page

Two disk access principle (exact match)

26

Grid File

Start with one bucket for the whole space.
Select dividers along each dimension.
Partition space into cells.
Dividers cut all the way.
Each cell corresponds to 1 disk page.
Many cells can point to the same page.
Cell directory potentially exponential in the number of dimensions.

27

Grid File Implementation

Dynamic structure using a grid directory
  Grid array: a 2-dimensional array with pointers to buckets (this array can be large, disk resident): G(0, …, nx-1, 0, …, ny-1)
  Linear scales: two 1-dimensional arrays that are used to access the grid array (main memory): X(0, …, nx-1), Y(0, …, ny-1)

28

Example

[Figure: linear scale X and linear scale Y index the grid directory, whose cells point to buckets (disk blocks)]

29

Grid File Search

Exact Match Search: at most 2 I/Os, assuming linear scales fit in memory.
  First use linear scales to determine the index into the cell directory
  Access the cell directory to retrieve the bucket address (may cause 1 I/O if cell directory does not fit in memory)
  Access the appropriate bucket (1 I/O)

Range Queries:
  Use linear scales to determine the index into the cell directory.
  Access the cell directory to retrieve the bucket addresses of buckets to visit.
  Access the buckets.
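The two search procedures can be sketched as follows. Everything here lives in memory, so the I/O counts appear only as comments; the class `GridFile`, its fields and the sample data are assumptions made for illustration.

```python
from bisect import bisect_right


class GridFile:
    def __init__(self, x_scale, y_scale, directory, buckets):
        self.x_scale = x_scale      # sorted divider values along x (linear scale)
        self.y_scale = y_scale      # sorted divider values along y (linear scale)
        self.directory = directory  # directory[i][j] -> bucket id (grid array)
        self.buckets = buckets      # bucket id -> list of (x, y) points

    def cell(self, x, y):
        """Use the linear scales to find the directory index of a point."""
        return bisect_right(self.x_scale, x), bisect_right(self.y_scale, y)

    def exact_match(self, x, y):
        i, j = self.cell(x, y)              # no I/O: scales are in memory
        bucket_id = self.directory[i][j]    # at most 1 I/O on a real system
        return (x, y) in self.buckets[bucket_id]  # 1 I/O on a real system

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        i_lo, j_lo = self.cell(x_lo, y_lo)
        i_hi, j_hi = self.cell(x_hi, y_hi)
        # Several cells may share a bucket, so collect distinct bucket ids.
        ids = {self.directory[i][j]
               for i in range(i_lo, i_hi + 1)
               for j in range(j_lo, j_hi + 1)}
        return sorted(p for b in ids for p in self.buckets[b]
                      if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)


# One divider per dimension gives a 2x2 directory; two cells share bucket "A".
gf = GridFile(
    x_scale=[5], y_scale=[5],
    directory=[["A", "A"], ["B", "C"]],
    buckets={"A": [(1, 1), (2, 8)], "B": [(6, 2)], "C": [(7, 9)]},
)
```

For example, `gf.exact_match(2, 8)` looks up one cell and one bucket, while `gf.range_query(0, 6, 0, 5)` visits every cell the query rectangle touches.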

30

Grid File Insertions

Determine the bucket into which insertion must occur.
If space in bucket, insert.
Else, split bucket
  How to choose a good dimension to split?
If bucket split causes a cell directory to split, do so and adjust linear scales.
Insertion of these new entries potentially requires a complete reorganization of the cell directory: expensive!

31

Grid File Deletions

Deletions may decrease the space utilization.
Merge buckets
  We need to decide which cells to merge and a merging threshold
Buddy system and neighbor system
  A bucket can merge with only one buddy in each dimension
  Merge adjacent regions if the result is a rectangle

32

Grid File Example

[Figure: points 1 to 6 in an unpartitioned space (N=6); a single bucket A lists 1 2 3 4 5 6]

33

Grid File Example

[Figure: points 1 to 12 (N=6). Bucket listings shown: A = 1 2 3 4 5 6; then A = 1 3 5 7 and B = 2 4 6. Points 8 to 12 are also plotted.]

34

Grid File Example

[Figure: points 1 to 15 (N=6). Bucket listings shown: A = 1 3 5 7 8 10 and B = 2 4 6 9 11 12; then A = 1 7 8 13, B = 2 4 6 9 11 12 and C = 3 5 10. Points 14 and 15 are also plotted.]

35

Grid File Example

[Figure: points 1 to 16 (N=6), with the bucket listings from the previous steps and the latest one: A = 1 8 13 16, B = 2 4 6 9 11 12, C = 3 5 10, D = 7 14 15]

36

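Grid File Example

[Figure: a grid with linear scales x1 to x4 and y1 to y4 (N=6), buckets A to I, and the grid directory, in which several cells point to the same bucket (e.g., rows A I G F B / E E G F B / C C C C B)]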

The R-Tree

The R-tree is a tree-structured index that remains balanced on inserts and deletes.

Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension.

Example in 2-D:

[Figure: boxes in the X-Y plane, grouped from the leaf level up to the root of the R-tree]

R-Tree Properties

Leaf entry = < n-dimensional box, rid >
  The key value is a box.
  Box is the tightest bounding box for a data object.

Non-leaf entry = < n-dim box, ptr to child node >
  Box covers all boxes in child node (in fact, subtree).

All leaves at same distance from root.

Nodes can be kept 50% full (except root).
  Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of an R-Tree

[Figure: nested bounding boxes R1 to R19, with labels for leaf entries and index entries; a spatial object is shown approximated by bounding box R8]

Example R-Tree (Contd.)

[Figure: the corresponding tree, level by level]
Root: R1 R2
Next level: R3 R4 R5 R6 R7
Leaf level: R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19

Search for Objects Overlapping Box Q

Start at root.
1. If current node is non-leaf, for each entry <E, ptr>, if box E overlaps Q, search subtree identified by ptr.
2. If current node is leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.

Note: May have to search several subtrees at each node! (In contrast, a B-tree equality search goes to just one leaf.)
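The two search steps translate almost directly into a recursive function. This is a sketch under assumptions: boxes are 2-D tuples (xmin, ymin, xmax, ymax), boxes that merely touch count as overlapping, and `RNode`, `overlaps` and `search` are illustrative names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one point (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class RNode:
    is_leaf: bool
    # Leaf entries are (box, rid); non-leaf entries are (box, child RNode).
    entries: List[tuple] = field(default_factory=list)


def search(node: RNode, q: Box) -> list:
    """Return the rids of all leaf entries whose box overlaps q."""
    hits = []
    for box, ref in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                hits.append(ref)              # candidate object; may still miss q
            else:
                hits.extend(search(ref, q))   # possibly several subtrees
    return hits


leaf1 = RNode(True, [((0, 0, 2, 2), "a"), ((3, 3, 4, 4), "b")])
leaf2 = RNode(True, [((10, 10, 12, 12), "c")])
root = RNode(False, [((0, 0, 4, 4), leaf1), ((10, 10, 12, 12), leaf2)])
```

The result is a candidate set: since leaf boxes only approximate the objects, each returned rid still has to be checked against the exact geometry.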

42

Improving Search Using Constraints

It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?
  Will reduce overlap with nodes in tree, and reduce the number of nodes fetched by avoiding some branches altogether.

Cost of overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>

Start at root and go down to “best-fit” leaf L.
  Go to child whose box needs least enlargement to cover B; resolve ties by going to smallest area child.

If best-fit leaf L has space, insert entry and stop. Otherwise, split L into L1 and L2.
  Adjust entry for L in its parent so that the box now covers (only) L1.
  Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)
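The descent rule (least enlargement, ties broken by smallest area) can be sketched for a single node as below. Boxes are assumed to be 2-D tuples (xmin, ymin, xmax, ymax), and `choose_child` is an illustrative name.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(box, new_box):
    """Extra area needed for box to also cover new_box."""
    return area(union(box, new_box)) - area(box)


def choose_child(child_boxes, new_box):
    """Index of the child needing least enlargement; ties go to the smaller box."""
    return min(range(len(child_boxes)),
               key=lambda i: (enlargement(child_boxes[i], new_box),
                              area(child_boxes[i])))
```

For instance, `choose_child([(0, 0, 4, 4), (5, 5, 6, 6)], (1, 1, 2, 2))` picks the first child, since it already covers the new box. Applying the rule at each level from the root down leads to the best-fit leaf.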

Splitting a Node During Insertion

The entries in node L plus the newly inserted entry must be distributed between L1 and L2.

Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries.

Idea: Redistribute so as to minimize area of L1 plus area of L2.

[Figure: two ways of splitting the same entries, one labeled GOOD SPLIT! and the other BAD!]
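For a handful of entries, the area-minimizing redistribution can be found by trying every split. This exhaustive search is only a sketch of the stated goal: its cost grows exponentially with the number of entries, so practical R-trees use cheaper heuristics. The names `best_split` and `min_fill` are assumptions.

```python
from itertools import combinations


def mbr(boxes):
    """Minimum bounding rectangle of a list of (xmin, ymin, xmax, ymax) boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def best_split(boxes, min_fill=1):
    """Try every 2-way split and return (total_area, group1, group2) as index lists."""
    n = len(boxes)
    best = None
    # Box 0 is always placed in group 1, which skips mirror-image splits.
    for k in range(min_fill - 1, n - min_fill):
        for extra in combinations(range(1, n), k):
            g1 = [0, *extra]
            g2 = [i for i in range(n) if i not in g1]
            total = (area(mbr([boxes[i] for i in g1])) +
                     area(mbr([boxes[i] for i in g2])))
            if best is None or total < best[0]:
                best = (total, g1, g2)
    return best


# Two tight clusters: the good split keeps each cluster together.
boxes = [(0, 0, 1, 1), (1, 0, 2, 1), (10, 10, 11, 11), (11, 10, 12, 11)]
total, l1, l2 = best_split(boxes)
```

On this input the clustered split has total area 4, while any split that mixes the clusters needs a box spanning most of the space.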

45

Spatial Data Warehousing

Spatial data warehouse: Integrated, subject-oriented, time-variant, and nonvolatile spatial data repository for data analysis and decision making

Spatial data integration: a big issue
  Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.)
  Vendor-specific formats (ESRI, MapInfo, Intergraph, etc.)

Spatial data cube: multidimensional spatial database
  Both dimensions and measures may contain spatial components

46

Dimensions and Measures in Spatial Data Warehouse

Dimension modeling
  nonspatial
    e.g. temperature: 25-30 degrees generalizes to hot
  spatial-to-nonspatial
    e.g. region “B.C.” generalizes to description “western provinces”
  spatial-to-spatial
    e.g. region “Burnaby” generalizes to region “Lower Mainland”

Measures
  numerical
    distributive (e.g. count, sum)
    algebraic (e.g. average)
    holistic (e.g. median, rank)
  spatial
    collection of spatial pointers (e.g. pointers to all regions with 25-30 degrees in July)
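The three kinds of numerical measures differ in whether they can be rolled up from per-region partial results. The sketch below uses made-up temperature readings for two hypothetical regions to show the difference.

```python
from statistics import median

# Hypothetical readings per region (illustrative values only).
readings = {"north": [25, 27, 30], "south": [10, 12]}
all_values = [v for vals in readings.values() for v in vals]

# Distributive: sum and count roll up directly from per-region partial results.
partial_sum = {r: sum(v) for r, v in readings.items()}
partial_count = {r: len(v) for r, v in readings.items()}
total_sum = sum(partial_sum.values())
total_count = sum(partial_count.values())

# Algebraic: average is derived from a fixed number of distributive values.
average = total_sum / total_count

# Holistic: the median cannot be derived from per-region medians.
median_of_medians = median(median(v) for v in readings.values())
true_median = median(all_values)
```

Here the median of the regional medians is 19, while the median over all readings is 25, so a holistic measure has to go back to the underlying data.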

47

Example: BC weather pattern analysis

Input
  A map with about 3,000 weather probes scattered in B.C.
  Daily data for temperature, precipitation, wind velocity, etc.
  Concept hierarchies for all attributes

Output
  A map that reveals patterns: merged (similar) regions

Goals
  Interactive analysis (drill-down, slice, dice, pivot, roll-up)
  Fast response time
  Minimizing storage space used

Challenge
  A merged region may contain hundreds of “primitive” regions (polygons)

48

Star Schema of the BC Weather Warehouse

Spatial data warehouse
  Dimensions
    region_name
    time
    temperature
    precipitation
  Measurements
    region_map
    area
    count

[Figure: star schema with a fact table and dimension tables]

49

Spatial Merge

Precomputing all: too much storage space

On-line merge: very expensive

50

Methods for Computation of Spatial Data Cube

On-line aggregation: collect and store pointers to spatial objects in a spatial data cube
  expensive and slow, need efficient aggregation techniques

Precompute and store all the possible combinations
  huge space overhead

Precompute and store rough approximations in a spatial data cube
  accuracy trade-off

Selective computation: only materialize those which will be accessed frequently
  a reasonable choice

51

Spatial Association Analysis

Spatial association rule: A ⇒ B [s%, c%]
  A and B are sets of spatial or nonspatial predicates
    Topological relations: intersects, overlaps, disjoint, etc.
    Spatial orientations: left_of, west_of, under, etc.
    Distance information: close_to, within_distance, etc.
  s% is the support and c% is the confidence of the rule

Examples
  is_a(x, large_town) ^ intersect(x, highway) ⇒ adjacent_to(x, water) [7%, 85%]
  is_a(x, large_town) ^ adjacent_to(x, georgia_strait) ⇒ close_to(x, u.s.a.) [1%, 78%]
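Support and confidence can be computed by counting objects. The sketch assumes the spatial predicates have already been evaluated, so each object is just the set of predicates that hold for it; the four objects are made up, and `support_confidence` is an illustrative name.

```python
def support_confidence(objects, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    objects: list of sets, each holding the predicates true for one object.
    """
    n_a = sum(1 for preds in objects if antecedent <= preds)
    n_ab = sum(1 for preds in objects if (antecedent | consequent) <= preds)
    support = n_ab / len(objects)              # fraction satisfying A and B
    confidence = n_ab / n_a if n_a else 0.0    # of those satisfying A, share with B
    return support, confidence


# Hypothetical objects (illustrative data, not from the lecture).
objects = [
    {"is_a_large_town", "intersect_highway", "adjacent_to_water"},
    {"is_a_large_town", "intersect_highway"},
    {"is_a_large_town"},
    {"adjacent_to_water"},
]
A = {"is_a_large_town", "intersect_highway"}
B = {"adjacent_to_water"}
s, c = support_confidence(objects, A, B)
```

With this data, one object in four satisfies both sides (support 25%) and one of the two objects satisfying A also satisfies B (confidence 50%).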

52

Progressive Refinement Mining of Spatial Association Rules

Hierarchy of spatial relationship:
  g_close_to: near_by, touch, intersect, contain, etc.
  First search for rough relationship and then refine it

Two-step mining of spatial association:
  Step 1: Rough spatial computation (as a filter)
    Using MBR or R-tree for rough estimation
  Step 2: Detailed spatial algorithm (as refinement)
    Apply only to those objects which have passed the rough spatial association test (no less than min_support)
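The filter-and-refine idea can be sketched on a toy case. The objects are assumed to be circles (x, y, r) so that the exact test stays short; the MBR test is the rough filter and the center-distance test is the refinement. The min_support check is left out, and all names and data are illustrative.

```python
from itertools import combinations
from math import hypot


def mbr(circle):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a circle (x, y, r)."""
    x, y, r = circle
    return (x - r, y - r, x + r, y + r)


def mbr_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def exact_intersect(c1, c2):
    """Exact test: circles intersect if centers are within the sum of the radii."""
    return hypot(c1[0] - c2[0], c1[1] - c2[1]) <= c1[2] + c2[2]


def intersecting_pairs(circles):
    """circles: dict name -> (x, y, r). Returns (candidates, results)."""
    # Step 1 (filter): cheap MBR test keeps only plausible pairs.
    candidates = [(a, b) for a, b in combinations(sorted(circles), 2)
                  if mbr_overlap(mbr(circles[a]), mbr(circles[b]))]
    # Step 2 (refine): exact geometry only for pairs that passed the filter.
    results = [(a, b) for a, b in candidates
               if exact_intersect(circles[a], circles[b])]
    return candidates, results


circles = {"A": (0, 0, 1), "B": (1.5, 1.5, 1), "C": (1, 0, 0.5), "D": (10, 10, 1)}
candidates, results = intersecting_pairs(circles)
```

The filter never discards a true match, because intersecting objects always have overlapping MBRs; it only lets through some false candidates that the refinement step removes.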

53

Spatial Classification and Spatial Trend Analysis

Spatial classification
  Analyze spatial objects to derive classification schemes, such as decision trees, in relevance to certain spatial properties (district, highway, river, etc.)
  Example: Classify regions in a province into rich vs. poor according to the average family income

Spatial trend analysis
  Detect changes and trends along a spatial dimension
  Study the trend of nonspatial or spatial data changing with space
  Example: Observe the trend of changes of the climate or vegetation with the increasing distance from an ocean

54

LSD-tree

Local Split Decision tree
  Use kd-tree to partition the space. Each partition contains up to B points.
  The kd-tree is stored in main memory.
  If the kd-tree (directory) is large, we store a sub-tree on disk
  Goal: the structure must remain balanced: external balancing property

55

Example: LSD-tree

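[Figure: a space partitioned by split lines x1 to x3 and y1 to y4 into buckets N1 to N8, next to the kd-tree directory with split nodes x:x1, y:y1, y:y2, x:x2, x:x3, y:y3 and y:y4. The directory is labeled with an internal part and an external part, and its leaves point to the buckets.]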

56

LSD-tree: main points

Split strategies:
  Data dependent
  Distribution dependent
Paging algorithm
Two types of splits: bucket splits and internal node splits