Nearest Neighbor Search in Spatial and Spatiotemporal Databases


Dimitris Papadias

Hong Kong University of Science and Technology


Spatial and spatiotemporal databases

• Spatial databases manage large collections of multi-dimensional objects.

• Important query types

– Window query: Retrieve all rivers in CA

– Nearest neighbor: Find my nearest gas station

– Spatial join: Report pairs of (city C, river R) such that R crosses C

• Spatiotemporal databases deal with the same queries, assuming, however, moving objects

– Mobile computing

– Traffic supervision

– Flight control

– Weather forecasting


R-trees [Guttman, SIGMOD 84; Sellis et al., VLDB 87; Beckmann et al., SIGMOD 90]

[Figure: data points a to m on x/y axes (0 to 10), grouped into Minimum Bounding Rectangles (MBRs), together with a window query and the corresponding R-tree. Root: E1, E2. E1: E3, E4, E5. E2: E6, E7. Leaves: E3 = {a, b, c}, E4 = {d, e}, E5 = {f, g, h}, E6 = {i, j, k}, E7 = {l, m}.]

Node capacity depends on the page size.

In practice, it is in the order of 100.

TPR-trees [Saltenis et al., SIGMOD 00, our group VLDB 03]

• Extends the R-tree by introducing the velocity bounding rectangle (VBR) in non-leaf entries.

• Objects are grouped together based on both their location and velocities.

[Figure: two panels on x/y axes (0 to 10), "at time 0" and "at time 1". Moving objects a, b, c, d are shown with their velocity vectors and grouped into nodes N1 and N2; the panel at time 1 also shows a query rectangle qR.]
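A minimal sketch of how a non-leaf entry with a VBR can bound moving objects, assuming 2D points with constant velocities. The names `MovingPoint`, `tp_entry`, `bounds_at` and `intersects`, and the sample coordinates, are hypothetical and not taken from the TPR-tree papers. The sketch only illustrates that each side of the rectangle moves with the extreme velocity stored for that side.

```python
from dataclasses import dataclass

@dataclass
class MovingPoint:
    x: float
    y: float
    vx: float
    vy: float

    def at(self, t):
        """Position of the point at time t."""
        return (self.x + self.vx * t, self.y + self.vy * t)

def tp_entry(points):
    """Return (MBR, VBR) at time 0, each as (xmin, ymin, xmax, ymax)."""
    mbr = (min(p.x for p in points), min(p.y for p in points),
           max(p.x for p in points), max(p.y for p in points))
    vbr = (min(p.vx for p in points), min(p.vy for p in points),
           max(p.vx for p in points), max(p.vy for p in points))
    return mbr, vbr

def bounds_at(mbr, vbr, t):
    """Rectangle that encloses all points at time t >= 0.

    The lower sides move with the minimum velocities and the upper
    sides with the maximum velocities, so the rectangle never shrinks.
    """
    return tuple(m + v * t for m, v in zip(mbr, vbr))

def intersects(r, w):
    """True if rectangles r and w overlap (used for a window query)."""
    return r[0] <= w[2] and w[0] <= r[2] and r[1] <= w[3] and w[1] <= r[3]

# Hypothetical node with two objects moving in different directions.
node = [MovingPoint(2, 3, 1, 1), MovingPoint(4, 4, -1, 1)]
mbr, vbr = tp_entry(node)
print(bounds_at(mbr, vbr, 1))  # (1, 4, 5, 5)
```

Note that the bound is conservative: the rectangle at time 1 may intersect a query window even when no object is inside it, which costs an extra node visit but never a wrong answer.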

Conventional NN search with R-(TPR-) trees

• Depth-first [Roussopoulos et al., SIGMOD 95]

• Best-first [Hjaltason and Samet, TODS 99]: incremental and optimal

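[Figure: data points a to i on x/y axes (0 to 10), grouped into nodes E1 to E9, with the query point and its search region. Distance labels from the query point: a 5, b 13, c 18, d 13, e 13, f 10, g 13, h 2, i 10; E1 1, E2 2, E3 8, E4 5, E5 5, E6 9, E7 13, E8 2, E9 17. Note: some nodes have been omitted for simplicity.]

Best-first trace (each heap entry is followed by its distance label):

Action: Visit Root. Heap: E1 1, E2 2. Result: {empty}

Action: follow E1. Heap: E2 2, E4 5, E5 5, E3 8, E6 9. Result: {empty}

Action: follow E2. Heap: E8 2, E4 5, E5 5, E3 8, E6 9, E7 13, E9 17. Result: {empty}

Action: follow E8. Heap: h 2 and the other points of E8, E4 5, E5 5, E3 8, E6 9, E7 13, E9 17. Result: {(h, 2)}

Action: Report h and terminate.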
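The best-first traversal can be sketched with a priority queue keyed on mindist. This is an illustrative sketch, not the implementation of the cited paper: the tree is a nested dictionary built by the hypothetical helpers `leaf` and `inner`, and the coordinates are made up.

```python
import heapq
import math
from itertools import count

def mindist(q, rect):
    """Minimum distance from point q to rectangle (xmin, ymin, xmax, ymax)."""
    dx = max(rect[0] - q[0], 0.0, q[0] - rect[2])
    dy = max(rect[1] - q[1], 0.0, q[1] - rect[3])
    return math.hypot(dx, dy)

def leaf(points):
    """Leaf node holding (name, (x, y)) pairs, with its MBR."""
    xs = [p[0] for _, p in points]
    ys = [p[1] for _, p in points]
    return {"mbr": (min(xs), min(ys), max(xs), max(ys)), "points": points}

def inner(children):
    """Internal node whose MBR encloses the MBRs of its children."""
    m = [c["mbr"] for c in children]
    return {"mbr": (min(r[0] for r in m), min(r[1] for r in m),
                    max(r[2] for r in m), max(r[3] for r in m)),
            "children": children}

def best_first_nn(root, q):
    """Yield (distance, name) pairs in ascending distance from q."""
    tie = count()  # tie-breaker, so the heap never compares nodes
    heap = [(mindist(q, root["mbr"]), next(tie), root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if isinstance(item, str):
            # A data point at the top of the heap: nothing closer remains.
            yield d, item
        elif "points" in item:
            for name, p in item["points"]:
                heapq.heappush(heap, (math.dist(p, q), next(tie), name))
        else:
            for child in item["children"]:
                heapq.heappush(heap, (mindist(q, child["mbr"]), next(tie), child))

# Hypothetical data set and tree.
l1 = leaf([("a", (1, 1)), ("b", (2, 3))])
l2 = leaf([("c", (8, 8)), ("d", (9, 6))])
l3 = leaf([("e", (6, 5)), ("f", (4, 8))])
root = inner([inner([l1, l3]), inner([l2])])
print(next(best_first_nn(root, (5, 5))))  # (1.0, 'e')
```

Because the function is a generator, asking for further neighbors simply continues the same traversal, which is what makes the method incremental.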

NN search - other approaches

• Several algorithms and theoretical performance bounds have been devised for exact and approximate processing in main memory. Here we care about I/O efficiency (minimization of node (page) accesses) as well as cost models that predict practical performance (suitable for query optimization).

• Several approaches exist for NN in high-dimensional spaces (but the problem is different due to the curse of dimensionality). Here we consider low-dimensional spaces (spatial and spatiotemporal databases).

• Ferhatosmanoglu et al. [SSTD 01] discover the NN in a constrained area of the data space (e.g., find the NN to the south of the query point).

• Korn and Muthukrishnan [SIGMOD 00] discuss reverse nearest neighbor queries, where the goal is to retrieve the data points whose nearest neighbor is a specified query point.

• Korn et al. [VLDB 02] study the same problem in the context of data streams, where the data are not known in advance.


NN search for mobile queries

• [Zheng and Lee, SSTD 01]: return the current NN and the validity time of the result.

• Restrictions: (i) assumes a maximum speed, (ii) applicable only to single NN, (iii) requires Voronoi diagrams.

[Figure: query q inside the Voronoi cell of point o, with a neighboring point a.]

• [Song and Roussopoulos, SSTD 01]: minimize the number of queries for moving clients by returning m>k NNs.

• Problem: how to determine m.

[Figure: query q, its new position q', and points o, a, b, c, with labels dist(2) and dist(4).]

IF $2 \cdot dist(q,q') \le dist(q,b) - dist(q,a)$, THEN the 2 NNs at q' will be among the 4 NNs of the first query.
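The condition above can be checked in a few lines. This is a sketch under the assumption that the two distances in the condition are those of the k-th and m-th NN of the first query (k=2, m=4 in the example). The function names and the coordinates are hypothetical, and a brute-force scan stands in for the index-based search.

```python
import math

def knn(points, q, k):
    """Brute-force k nearest neighbors of q as a list of (distance, name)."""
    return sorted((math.dist(p, q), name) for name, p in points.items())[:k]

def cached_result_covers(points, q, q_new, k, m):
    """True if the k NNs at q_new are guaranteed to be among the m NNs at q.

    Moving by delta = dist(q, q_new) changes every distance by at most
    delta, so the guarantee holds when 2 * delta <= dist(m) - dist(k).
    """
    cached = knn(points, q, m)
    dist_k, dist_m = cached[k - 1][0], cached[m - 1][0]
    return 2 * math.dist(q, q_new) <= dist_m - dist_k

# Hypothetical data: distances from q = (0, 0) are 1, 2, 3, 6, ...
points = {"o": (1, 0), "a": (0, 2), "b": (-3, 0), "c": (0, 6),
          "e": (10, 10), "f": (-9, 0)}
print(cached_result_covers(points, (0, 0), (1, 1), k=2, m=4))  # True
print(cached_result_covers(points, (0, 0), (3, 0), k=2, m=4))  # False
```

When the test fails, the client cannot rely on the cached neighbors and has to issue a new query.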


Time parameterized NN (our group, SIGMOD 02)

• Assuming a constant and known velocity, a TPNN returns:

– The current query result R

– The validity period T of R

– The change C of the result at the end of T

[Figure: data points a to m on x/y axes (0 to 10), with the query (speed 1) and its position at time 2.]

Result: R={i}, T=2, C={j}

TP NN queries: Influence Time

[Figure: data points a to m on x/y axes (0 to 10), with the query (speed 1) and its position at time 2. Influence times are shown next to objects j (2), k (3), l (4.5) and m (10).]

• Some objects have “infinite” influence time.

• The object that will become the next nearest neighbor is the one with the minimum influence time.
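A brute-force sketch of the influence time and of the resulting (R, T, C) triple, assuming a query that moves along a straight line with constant velocity v and static data points. The influence time of a point p is taken here as the time at which the query becomes equidistant to p and to the current NN; the function names and the coordinates are hypothetical.

```python
import math

def sqdist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def influence_time(q, v, nn, p):
    """Time t at which q + v*t is equidistant to nn and p.

    dist^2(q + v*t, p) - dist^2(q + v*t, nn) is linear in t, so the
    crossing time has a closed form. Returns math.inf when the query
    does not move towards p's side of the bisector.
    """
    num = sqdist(q, p) - sqdist(q, nn)  # >= 0 because nn is the current NN
    den = 2 * sum(vi * (pi - ni) for vi, pi, ni in zip(v, p, nn))
    return num / den if den > 0 else math.inf

def tp_nn(points, q, v):
    """Return (R, T, C): current NN, validity period, next NN (or None)."""
    r = min(points, key=lambda n: sqdist(q, points[n]))
    times = {n: influence_time(q, v, points[r], p)
             for n, p in points.items() if n != r}
    c = min(times, key=times.get) if times else None
    t = times[c] if c is not None else math.inf
    return r, t, (c if t != math.inf else None)

# Hypothetical data, query at the origin moving right with speed 1.
points = {"i": (0, 1), "j": (4, 1), "k": (6, 1), "m": (-5, 1)}
print(tp_nn(points, (0, 0), (1, 0)))  # ('i', 2.0, 'j')
```

Here m lies behind the moving query, so its influence time is infinite, while j has the minimum influence time and becomes the next NN.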


Processing TP NN with R- (TPR-) trees

• Influence time of an MBR: the earliest possible time that any object in the MBR can become the new NN.

[Figure: query q (speed 1), its current NN P_NN and an MBR E on x/y axes (0 to 10); the position of q at time 2 and the time at which q has equal distance to P_NN and E are marked.]

• Algorithm: traverse the R-tree depth-first or best-first, using the influence time instead of mindist.

• The cost of TPNN queries is about the same as that of conventional queries, because we have to visit the influencing nodes anyway (to find the NN).


Continuous Nearest Neighbors (CNN) (our group, VLDB 02)

Given a line segment q=[s,e], find the NN of every point on q.

Result representation: {s(.NN=a), s1(.NN=c), s2(.NN=f), s3(.NN=h), e}.

The points (s, s1, s2, s3, e) are the split points.

[Figure: query segment from s to e with split points s1, s2, s3 and data points a to h.]

Main idea

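[Figure: two panels, "After processing a" and "After processing c", showing the segment from s to e and points a, b, c, d; in the second panel the perpendicular bisector of a and c crosses the segment at the new split point s1.]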

• Maintain the set of split points incrementally.
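The incremental maintenance of the split list can be sketched as follows. This is a simplified illustration that processes every data point once (no R-tree, no pruning). The function names and the coordinates are hypothetical, and split points are reported as parameters t in [0, 1] along the segment rather than as coordinates.

```python
def sqdist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def continuous_nn(points, s, e):
    """Return the split list [(t, name), ...] for the segment s -> e.

    `name` is the NN of every point s + t*(e - s) from this t up to
    the next split point (or up to t = 1 for the last entry).
    """
    d = (e[0] - s[0], e[1] - s[1])
    names = iter(points)
    split_list = [(0.0, next(names))]
    for name in names:
        p = points[name]
        updated = []
        for i, (t0, cur) in enumerate(split_list):
            t1 = split_list[i + 1][0] if i + 1 < len(split_list) else 1.0
            o = points[cur]
            # f(t) = dist^2(q(t), p) - dist^2(q(t), o) = a + b*t (linear in t)
            a = sqdist(s, p) - sqdist(s, o)
            b = 2 * (d[0] * (o[0] - p[0]) + d[1] * (o[1] - p[1]))
            f0, f1 = a + b * t0, a + b * t1
            if f0 >= 0 and f1 >= 0:      # o is at least as close on the interval
                pieces = [(t0, cur)]
            elif f0 <= 0 and f1 <= 0:    # p is closer on the whole interval
                pieces = [(t0, name)]
            else:                        # the bisector of o and p crosses the interval
                t_star = -a / b
                pieces = ([(t0, cur), (t_star, name)] if f0 > 0
                          else [(t0, name), (t_star, cur)])
            for piece in pieces:         # merge neighbors that share the same NN
                if not updated or updated[-1][1] != piece[1]:
                    updated.append(piece)
        split_list = updated
    return split_list

# Hypothetical data: segment along the x axis from (0, 0) to (10, 0).
points = {"a": (1, 1), "b": (5, 5), "c": (5, 1), "f": (9, 1)}
for t, name in continuous_nn(points, (0, 0), (10, 0)):
    print(round(t, 3), name)
```

In this example b temporarily becomes the NN of the right part of the segment, and is later replaced when c and f are processed, which shows why the list has to be updated rather than only extended.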


Processing TP NN with an R- (TPR-) tree

• Avoid examination of all points.

• Given an MBR E and query segment q, E must be searched if and only if there exists a split point $s_i \in SL$ such that $dist(s_i, s_i.NN) > mindist(s_i, E)$.

SL={s (.NN=a), s1 (.NN=b), e (.NN=b)}

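[Figure: query segment q from s to e with split point s1, points a and b, and an MBR E; the distances mindist(E,s), mindist(E,s1), mindist(E,e) and mindist(E,q) are marked.]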
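The pruning rule for an MBR can be written as a one-line test over the split list. This is a sketch with hypothetical names and coordinates; the split list is given as (split point, current NN) coordinate pairs.

```python
import math

def mindist(p, rect):
    """Minimum distance from point p to rectangle (xmin, ymin, xmax, ymax)."""
    dx = max(rect[0] - p[0], 0.0, p[0] - rect[2])
    dy = max(rect[1] - p[1], 0.0, p[1] - rect[3])
    return math.hypot(dx, dy)

def must_search(rect, split_list):
    """True if some split point is closer to the MBR than to its current NN."""
    return any(math.dist(si, nn) > mindist(si, rect) for si, nn in split_list)

# Hypothetical split list: (split point, its current NN).
split_list = [((0, 0), (1, 1)), ((3, 0), (5, 1)), ((7, 0), (9, 1))]
print(must_search((4, 6, 6, 8), split_list))    # False: too far from every split point
print(must_search((2, 0.5, 3, 1), split_list))  # True: may contain a closer point
```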

Location Based NN queries (LBNN) (our group, SIGMOD 03)

• A location-based kNN query q returns

– The current k NNs

– A validity region such that the result remains the same as long as q remains in the region.

• For a single NN o, the validity region of q is the Voronoi cell (VC) of o.

Computing the Voronoi Cell on-the-fly

• Step 1 – Find the current NN

• Step 2 – Use TP NN queries to tighten the validity region

[Figure: four snapshots of the validity region around the NN o and the query q. The region starts with vertices v1 to v4 and is tightened as points a, b and c are discovered, adding vertices up to v10.]

NN queries in road networks (our group, VLDB 03)

• Find my nearest gas station in terms of driving distance.

• Answer: b (the Euclidean NN is d)

[Figure: road network with query q and points a, b, c, d; distance labels 15km, 12km and 10km.]

Assumptions:

• We can incrementally compute Euclidean NNs using conventional NN algorithms.

• We can compute the network distance between the query and any point (i.e., the length of the shortest path connecting them) using Dijkstra's algorithm.

Euclidean Restriction Algorithm

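[Figure: two panels. 1st Euclidean NN: the point pE1 is retrieved at Euclidean distance dE(q,pE1), its network distance dN(q,pE1) is computed, and dEmax=dN(q,pE1). 2nd Euclidean NN: the point pE2 is retrieved at Euclidean distance dE(q,pE2), its network distance dN(q,pE2) is computed, and dEmax=dN(q,pE2); a third point pE3 is also shown.]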
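A minimal sketch of the Euclidean restriction idea, assuming that the network distance of any point is at least its Euclidean distance. A sorted list stands in for the incremental Euclidean NN search, and `network_dist` is a caller-supplied function (for instance, a Dijkstra computation). All names, coordinates and distances below are hypothetical.

```python
import math

def euclidean_restriction_nn(q, points, network_dist):
    """Return (best point, its network distance, points examined).

    points: {name: (x, y)}; network_dist(name) gives the network
    distance from q to that point.
    """
    best_name, d_emax = None, math.inf
    examined = []
    # Sorting stands in for incremental Euclidean NN retrieval.
    for name in sorted(points, key=lambda n: math.dist(q, points[n])):
        if math.dist(q, points[name]) > d_emax:
            break  # every remaining point is farther than dEmax in the network
        examined.append(name)
        d_n = network_dist(name)
        if d_n < d_emax:
            best_name, d_emax = name, d_n
    return best_name, d_emax, examined

# Hypothetical example: d is the Euclidean NN, b the network NN.
points = {"d": (3, 0), "b": (0, 4), "a": (0, -6), "c": (20, 0)}
net = {"d": 15, "b": 10, "a": 12, "c": 25}
print(euclidean_restriction_nn((0, 0), points, net.get))
# ('b', 10, ['d', 'b', 'a'])
```

Point c is never examined: its Euclidean distance (20) already exceeds the best network distance found (10), so no network distance computation is needed for it.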

Network Expansion Algorithm

[Figure: road network with nodes n1 to n9, edge lengths, data points p1 to p5 and the query point q.]
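Network expansion can be sketched as Dijkstra's algorithm started at the query, stopping at the first data point reached. As a simplifying assumption, the query and the data points sit on network nodes here (not in the middle of edges); the graph, the names and the edge lengths are hypothetical.

```python
import heapq

def network_expansion_nn(graph, points_at, source):
    """Return (data point, network distance) of the network NN of `source`.

    graph: {node: [(neighbor, edge_length), ...]}
    points_at: {node: data point located at that node}
    """
    dist = {source: 0}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node in points_at:
            # Nodes are popped in ascending network distance, so the
            # first data point reached is the network NN.
            return points_at[node], d
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None, float("inf")

# Hypothetical undirected network.
graph = {
    "q": [("n1", 2), ("n2", 7)],
    "n1": [("q", 2), ("n3", 3)],
    "n2": [("q", 7), ("n3", 1)],
    "n3": [("n1", 3), ("n2", 1), ("n4", 4)],
    "n4": [("n3", 4)],
}
print(network_expansion_nn(graph, {"n2": "p1", "n4": "p2"}, "q"))  # ('p1', 6)
```

The shortest path to p1 goes through n1 and n3 (2 + 3 + 1 = 6), not over the direct edge of length 7, which is why the expansion has to follow network distance rather than hop count.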

NN in the presence of obstacles (not published)

• The NN of q in terms of obstructed distance is b, although the Euclidean NN is a.

[Figure: query q with point a (Euclidean NN) and point b (obstructed NN).]

Visibility graphs

• Visibility graphs have been used widely in computational geometry for shortest path problems (e.g., find the shortest path from pstart to pend that does not cross any obstacle).

[Figure: example with the points pstart and pend.]

• Problem: we cannot maintain the entire visibility graph in memory for real spatial datasets.

• Solution: we only need the obstacles and objects that affect the result of the query.

Obstacle nearest neighbor algorithm

[Figure: data points a to k and obstacles o1 to o4 on x/y axes (0 to 12), with the query q and the obstructed distances dO(a,q) and dO(f,q).]

• Idea: Similar to the Euclidean Restriction algorithm for road networks.

• BUT how do we perform the obstructed distance computations?


Obstructed distance computation

[Figure: points p and q among obstacles o1 to o5, with the provisional obstructed distances d1(p,q) and d2(p,q).]

• Goal: compute the obstructed distance between p and q.

• First retrieve obstacles o1, o2 in the Euclidean range.

• Compute a provisional distance d1(p,q) using only o1, o2.

• d1(p,q) is not enough because the shortest path is obstructed by o3.

• Perform a second Euclidean range query on the obstacle R-tree using d1(p,q) and retrieve o3, o4.

• Compute a new obstructed distance d2(p,q) taking o3, o4 into account.

• Repeat the process until the obstructed distance remains the same for two consecutive iterations.


Other related work

By our group: concepts similar to the ones presented here apply to several other spatial queries, e.g., TP spatial joins and continuous window queries.

• Cost Models for TP and continuous queries [TODS 03].

• Analysis of predictive NN (and other) queries [TODS to appear].

• An Efficient Cost Model for Optimization of Nearest Neighbor Search in Low and Medium Dimensional Spaces [TKDE to appear].

By other groups: there is increasing interest in novel types of NN search in the context of mobile computing and data stream applications.

• Iwerks et al. [VLDB 03] discuss continuous NN in the presence of object updates.

• Shekhar et al [ACM GIS 03] discuss the in-route nearest neighbor query, which, given a trajectory, retrieves the single NN (e.g., gas station) that results in the minimum diversion from the trajectory.

• Jensen et al [ACM GIS 03] discuss NN for objects moving on road networks.


Group NN queries (our group, ICDE 04)

• Input: a set P={p1,…,pN} of static data points in multidimensional space and a group of query points Q={q1,…,qn}.

• Output: the k (≥ 1) data point(s) with the smallest sum of distances to all points in Q. The distance between a data point p and Q is defined as $dist(p,Q)=\sum_{i=1}^{n}|pq_i|$, where $|pq_i|$ is the Euclidean distance between p and query point $q_i$.

• Example: three users at locations q1, q2 and q3 want to find a meeting point (e.g., a restaurant); the corresponding query returns the data point p that minimizes the sum of Euclidean distances $|pq_i|$ for $1 \le i \le 3$.

• Assumption: the data points are indexed by an R-tree. Q may or may not fit in main memory.


Multiple Query Method (MQM)

• Idea: Perform incremental NN queries for each point in Q and combine their results.

[Figure: data points p1 to p12 in R-tree nodes N1 to N6 and query points q1, q2. Marked: the 1st NN of q1, the 1st NN of q2 and the 2nd NN of q1, with distance labels 2, 3, 3 and 5.]

• Problem: MQM may visit the same node and discover the same data point many times (for different query points).

• <p10, 7>, <p11, 6>, T=5 (2+3)

• <p11, 7>, T=6 (3+3): MQM terminates
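A sketch of the threshold idea behind MQM for k=1, under simplifying assumptions: sorted lists stand in for the incremental NN searches (one per query point), and the threshold T is the sum of the distances of the last NN retrieved for each query point. Any point not yet retrieved by any search has a group distance of at least T, so the search can stop once the best distance found is not larger than T. Names and coordinates are hypothetical.

```python
import math

def group_dist(p, queries):
    """dist(p, Q): sum of Euclidean distances from p to all query points."""
    return sum(math.dist(p, q) for q in queries)

def mqm(points, queries):
    """Return (name, dist(p, Q)) of the group nearest neighbor."""
    # One "incremental NN stream" per query point.
    streams = [sorted(points, key=lambda n, q=q: math.dist(points[n], q))
               for q in queries]
    thresholds = [0.0] * len(queries)  # t_i: distance of the last NN of q_i
    best_name, best_dist = None, math.inf
    for rank in range(len(points)):
        for i, q in enumerate(queries):
            name = streams[i][rank]
            thresholds[i] = math.dist(points[name], q)
            d = group_dist(points[name], queries)
            if d < best_dist:
                best_name, best_dist = name, d
            if best_dist <= sum(thresholds):  # T = t_1 + ... + t_n
                return best_name, best_dist
    return best_name, best_dist

# Hypothetical data set with two query points.
points = {"p1": (0, 1), "p2": (3, 0.5), "p3": (9, 9), "p4": (6, 2)}
queries = [(0, 0), (6, 0)]
print(mqm(points, queries)[0])  # p2
```

The sketch also makes the stated problem visible: p2 is retrieved by both streams, and its group distance is computed twice.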

Minimum Bounding Method (MBM)

• Applies the MBR of Q to prune the search space.

• Heuristic 1: Let M be the MBR of Q, and best_dist be the distance of the best GNN found so far. A node N cannot contain qualifying points if:

$mindist(N, M) \ge \frac{best\_dist}{n}$

• Heuristic 2: A node N cannot contain qualifying points if:

$\sum_{q_i \in Q} mindist(N, q_i) \ge best\_dist$

[Figure: example with Q={q1, q2}, its MBR M, nodes N1 and N2, and best_NN at distances 2 and 3 from the query points (best_dist=5). mindist(N1,M)=3; mindist(N2,M)=mindist(N2,q2)=1; mindist(N2,q1)=5.]
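The two heuristics can be sketched as simple predicates. The coordinates below are hypothetical, chosen so that the mindist values match those of the example (3 for N1; 1 and 5 for N2; best_dist=5, n=2); rectangles are (xmin, ymin, xmax, ymax) and a point is treated as a degenerate rectangle.

```python
import math

def mindist_rect(r1, r2):
    """Minimum distance between two rectangles (0 if they overlap)."""
    dx = max(r1[0] - r2[2], 0.0, r2[0] - r1[2])
    dy = max(r1[1] - r2[3], 0.0, r2[1] - r1[3])
    return math.hypot(dx, dy)

def as_rect(p):
    return (p[0], p[1], p[0], p[1])

def prune_h1(node, m, n, best_dist):
    """Heuristic 1: cheap test against the MBR M of the n query points."""
    return mindist_rect(node, m) >= best_dist / n

def prune_h2(node, queries, best_dist):
    """Heuristic 2: tighter test that uses every query point."""
    return sum(mindist_rect(node, as_rect(q)) for q in queries) >= best_dist

q1, q2 = (0, 0), (4, 0)
M = (0, 0, 4, 0)      # MBR of {q1, q2}
N1 = (0, 3, 2, 5)     # mindist(N1, M) = 3
N2 = (5, 0, 6, 1)     # mindist(N2, M) = mindist(N2, q2) = 1, mindist(N2, q1) = 5
print(prune_h1(N1, M, 2, 5))         # True:  3 >= 5/2
print(prune_h1(N2, M, 2, 5))         # False: 1 <  5/2
print(prune_h2(N2, [q1, q2], 5))     # True:  5 + 1 >= 5
```

With these numbers, N1 is discarded by the cheaper first test, while N2 passes it and is discarded only by the second test, which suggests applying Heuristic 2 to the nodes that survive Heuristic 1.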

File Multiple Query Method (F-MQM)

What happens if Q does not fit in memory?

• F-MQM sorts query points according to their Hilbert value and splits Q into blocks {Q1, .., Qm} that fit in memory.

• For each block, it computes the GNN using one of the main memory algorithms.

• It finally combines their results using MQM.

Complication: once a NN of a group has been retrieved, we cannot compute its global distance (i.e., with respect to all query points) immediately.

F-MQM (cont)

Solution: lazy evaluation:

• First, we find the GNN p1 of the first group Q1.

• Then, we load in memory the second group Q2 and retrieve its GNN p2. At the same time, we also compute the distance between p1 and Q2.

• Similarly, when we load Q3, we update the current distances of p1 and p2 taking into account the objects of the third group.

• After the end of the first round, we only have one data point (p1), whose global distance with respect to all query points has been computed.


File Minimum Bounding Method (F-MBM)

• First, the points of Q are sorted by their Hilbert value and are assigned to groups (that fit in memory) according to this order.

• For each group Qi, F-MBM keeps in memory its MBR Mi and cardinality ni (but not its contents).

• F-MBM descends the R-tree of P (in depth-first or best-first traversal), only following nodes that may contain qualifying points.

• Heuristic: Let best_dist be the distance of the best GNN found so far. A node N can be safely pruned if:

$\sum_{Q_i \in Q} n_i \cdot mindist(N, M_i) \ge best\_dist$

[Figure: example with a node N, the MBRs M1 and M2 of two query groups, best_NN, and best_dist=20.]
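The pruning test of F-MBM needs only the MBR and the cardinality of each group, which is why the group contents can stay on disk. A minimal sketch follows; the group rectangles, cardinalities and the node are hypothetical and do not reproduce the numbers of the figure.

```python
import math

def mindist_rect(r1, r2):
    """Minimum distance between two rectangles (xmin, ymin, xmax, ymax)."""
    dx = max(r1[0] - r2[2], 0.0, r2[0] - r1[2])
    dy = max(r1[1] - r2[3], 0.0, r2[1] - r1[3])
    return math.hypot(dx, dy)

def fmbm_lower_bound(node, groups):
    """Lower bound on dist(p, Q) for any point p in `node`.

    groups: list of (M_i, n_i) pairs, i.e. the MBR and cardinality
    of each query group; every query point of group i is at least
    mindist(node, M_i) away from any point in the node.
    """
    return sum(n_i * mindist_rect(node, m_i) for m_i, n_i in groups)

def prune_fmbm(node, groups, best_dist):
    return fmbm_lower_bound(node, groups) >= best_dist

groups = [((0, 0, 2, 2), 3), ((6, 0, 8, 2), 2)]  # (M1, n1), (M2, n2)
N = (3, 5, 4, 6)
print(round(fmbm_lower_bound(N, groups), 3))
print(prune_fmbm(N, groups, best_dist=20), prune_fmbm(N, groups, best_dist=15))
```

For this node the bound is 3·√10 + 2·√13, a little under 17, so the node has to be visited when best_dist is 20 and can be skipped once best_dist drops to 15.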
