
DB group seminar 1

Density Estimation for Spatial Data Streams

Cecilia M. Procopiuc and Octavian Procopiuc

AT&T Shannon Labs

SSTD’05

Presented by: Huiping Cao

DB group seminar 2

Outline

- Background
- Related work
- Problem definition
- Online algorithm
- Experiments
- References

DB group seminar 3

Background

Streaming data
- Large volume
- Continuous arrival

Data stream algorithms
- One pass
- Small amount of space
- Fast updating

DB group seminar 4

Background

Data stream model

Operations on elements
- Insertion
  - Most common case
- Deletion
- Updating
  - Most difficult case

Validity period of elements
- Whole history
  - Landmark window [Geh01], cash register [this paper]
- Partial recent history
  - Sliding window [Geh01], turnstile model [this paper]

DB group seminar 5

Related work

Classified according to operations
- Aggregation
  - avg, max, min, sum [Dob02]
  - count (distinct values) [Gib01]
  - Quantile in 1D data [Gre01]
  - Frequent items (heavy hitters) [Man02]
- Query estimation
  - Join size estimation [Dob02]
- K nearest neighbor searching
  - eKNN [Kou04], RNN aggregation [Kor02]

Techniques
- Histogram [Tha02], sample [Joh05], special synopsis, ...

DB group seminar 6

Problem definition

D: <p1, p2, …>, where $p_i \in \mathbb{R}^d$

Cash register model

$Q_i = \langle i, R_i \rangle$
- $R_i$: a d-dimensional hyper-rectangle
- Selectivity of $Q_i$: $sel(Q_i) = |\{p_j \mid j \le i,\ p_j \in R_i\}|$
  - These points arrive no later than time step i
  - They lie in $R_i$

Problem: estimating $sel(Q_i)$
- Measurement: relative error

\[ Err = \frac{|sel(Q) - estimated\_sel(Q)|}{\max\{sel(Q), 1\}} \]
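The definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the representation of a hyper-rectangle as a list of (low, high) pairs, and the choice of closed intervals are assumptions, not taken from the paper.

```python
from typing import Sequence, Tuple

Point = Sequence[float]
Rect = Sequence[Tuple[float, float]]  # assumed layout: one (low, high) pair per dimension


def in_rect(p: Point, rect: Rect) -> bool:
    """True if p lies in the (closed) hyper-rectangle; closedness is an assumption."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(p, rect))


def selectivity(stream_prefix: Sequence[Point], rect: Rect) -> int:
    """Exact sel(Q_i): number of points seen so far that lie in R_i."""
    return sum(1 for p in stream_prefix if in_rect(p, rect))


def relative_error(true_sel: float, estimated_sel: float) -> float:
    """Err = |sel(Q) - estimated_sel(Q)| / max{sel(Q), 1}."""
    return abs(true_sel - estimated_sel) / max(true_sel, 1)
```

The `max{sel(Q), 1}` denominator keeps the error defined for queries whose true selectivity is 0.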

DB group seminar 7

Online algorithm: rough steps

- Get a random sample S of D
  - Using the reservoir sampling method [vit85]
- Use a kd-tree-like ([kd75]) structure to index these sample points
- Maintain the sample and the kd-tree-like structure online
- Compute the range selectivity estimated_sel(Q) using a kernel density estimator

DB group seminar 8

Random sampling

Theorem 1: Let T be the data stream seen so far, with size |T|, and let $S \subseteq T$ be a random sample chosen via the reservoir sampling technique, such that $|S| = \Theta\left(\frac{d}{\varepsilon^2}\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)$, where $0 < \varepsilon, \delta < 1$ and |S| is the size of S. Then with probability $1 - \delta$, for any axis-parallel hyper-rectangle Q the following is true:

\[ \left| sel(Q) - sel(Q, S)\,\frac{|T|}{|S|} \right| \le \varepsilon |T| \]

- $sel(Q) = |Q \cap T|$ is the selectivity of Q with respect to the data stream seen so far
- $sel(Q, S) = |Q \cap S|$ is the selectivity of Q with respect to the random sample

DB group seminar 9

Sampling

Random sampling

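Estimator based on the sample:

\[ estimated\_sel(Q) = sel(Q, S)\,\frac{|T|}{|S|} \]

Its relative error is bounded by:

\[ Err \le \frac{\varepsilon |T|}{\max\{sel(Q), 1\}} \]

Problem: the smaller sel(Q) is, the larger the relative error
Better selectivity estimator: kernel density estimator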
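A minimal sketch of the sample-based estimator, assuming the sample is a list of points and a query is a list of (low, high) pairs per dimension (both representations, and the function name, are illustrative choices):

```python
def sample_estimate(sample, rect, stream_size):
    """estimated_sel(Q) = sel(Q, S) * |T| / |S| for a random sample S of the stream T."""
    hits = sum(
        1 for s in sample
        if all(lo <= x <= hi for x, (lo, hi) in zip(s, rect))
    )
    return hits * stream_size / len(sample)
```

Every sample hit counts for |T|/|S| stream points, so the estimate can only take values that are multiples of |T|/|S|. A query with low true selectivity is therefore estimated either as 0 or as at least |T|/|S|, which is one way to see why the relative error grows as sel(Q) shrinks.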

DB group seminar 10

Kernel density estimator

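S = {s1, …, sm}: random subset of D

\[ f(x) = \frac{1}{m} \sum_{i=1}^{m} k(x - s_i) \]

\[ k(x_1, \ldots, x_d) = \frac{0.75^d}{B_1 B_2 \cdots B_d} \prod_{j=1}^{d} \left(1 - \left(\frac{x_j}{B_j}\right)^2\right) \]

for $|x_j| \le B_j$ in every dimension j, and 0 otherwise,

where $x = (x_1, \ldots, x_d)$ and $s_i = (s_{i1}, \ldots, s_{id})$ are d-dimensional points

$B_j$: kernel bandwidth along dimension j [Sco92]
- Global parameter

\[ B_j = \sqrt{5}\,\sigma_j\, m^{-1/(d+4)} \]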
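The global estimator above can be sketched as follows. The function names are illustrative, and the code makes one assumption worth stating: the standard deviation along each dimension is estimated from the sample itself, whereas the slides describe it as a property of the data.

```python
import math


def scott_bandwidths(sample):
    """B_j = sqrt(5) * sigma_j * m^(-1/(d+4)); sigma_j is estimated from the
    sample here (an assumption). Requires at least two sample points."""
    m, d = len(sample), len(sample[0])
    bandwidths = []
    for j in range(d):
        column = [s[j] for s in sample]
        mean = sum(column) / m
        variance = sum((x - mean) ** 2 for x in column) / (m - 1)
        bandwidths.append(math.sqrt(5) * math.sqrt(variance) * m ** (-1 / (d + 4)))
    return bandwidths


def kernel(u, bandwidths):
    """Product kernel k(u_1..u_d); zero outside |u_j| < B_j."""
    value = 1.0
    for uj, bj in zip(u, bandwidths):
        t = uj / bj
        if abs(t) >= 1:
            return 0.0
        value *= 0.75 / bj * (1 - t * t)
    return value


def density(x, sample, bandwidths):
    """f(x) = (1/m) * sum_i k(x - s_i)."""
    total = sum(
        kernel([xj - sj for xj, sj in zip(x, s)], bandwidths) for s in sample
    )
    return total / len(sample)
```

The factor 0.75/B per dimension makes each one-dimensional kernel integrate to 1 over its support [-B, B].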

DB group seminar 11

One-dimensional kernels

Figure: (a) kernel function, B = 1; (b) contribution of multiple kernels to the estimate of a range query

DB group seminar 12

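Local kernel density estimator

kd-tree structure T(S): index of the sample data
- Each leaf contains one point $s_i$: leaf($s_i$)
- Any two leaves are disjoint
- The union of all leaves is $\mathbb{R}^d$

Each leaf maintains d+1 values: $\rho_i, \sigma_{i1}, \sigma_{i2}, \ldots, \sigma_{id}$
- $\sigma_{ij}$: approximates the standard deviation of the points in the cell centered at $s_i$ along dimension j

$R = [a_1, b_1] \times \cdots \times [a_d, b_d]$

$T_i$: subset of points in tree leaf leaf($s_i$)

\[ sel(R, T_i) = |T_i| \cdot \frac{0.75^d}{B_{i1} \cdots B_{id}} \int_{[a_1, b_1] \times \cdots \times [a_d, b_d]} \prod_{j=1}^{d} \left(1 - \left(\frac{x_j - s_{ij}}{B_{ij}}\right)^2\right) dx_1 \cdots dx_d \]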
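Because the kernel is a product of one-dimensional quadratic terms, the integral in sel(R, T_i) has a closed form per dimension. The sketch below evaluates it under two assumptions that the slide does not state explicitly: each one-dimensional kernel is zero outside [s_ij - B_ij, s_ij + B_ij], and the estimate for the whole query is the sum of the per-leaf contributions. The function names and the (center, bandwidths, count) tuple layout are illustrative.

```python
def kernel_mass_1d(a, b, center, bandwidth):
    """Integral over [a, b] of (0.75/B) * (1 - ((x - s)/B)^2),
    with [a, b] clipped to the kernel support [s - B, s + B]."""
    lo = max(-1.0, (a - center) / bandwidth)
    hi = min(1.0, (b - center) / bandwidth)
    if hi <= lo:
        return 0.0

    def antiderivative(u):
        # Antiderivative of 0.75 * (1 - u^2) after substituting u = (x - s)/B
        return 0.75 * (u - u ** 3 / 3)

    return antiderivative(hi) - antiderivative(lo)


def leaf_selectivity(rect, center, bandwidths, count):
    """sel(R, T_i): leaf count times the kernel mass that falls inside R."""
    mass = 1.0
    for (a, b), s, bw in zip(rect, center, bandwidths):
        mass *= kernel_mass_1d(a, b, s, bw)
    return count * mass


def estimated_selectivity(rect, leaves):
    """Sum of per-leaf contributions (assumed combination rule).
    leaves: iterable of (center, bandwidths, count) tuples."""
    return sum(leaf_selectivity(rect, c, bw, n) for c, bw, n in leaves)
```

A query that covers a kernel's entire support receives the full count of that leaf, and a query that cuts the support in half along one dimension receives half of it.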

DB group seminar 13

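Update T(S)

Purpose: maintain $\rho_i$, $\sigma_{ij}$ ($1 \le j \le d$)
- $\rho_i$ is the number of stream points contained in leaf($s_i$)

\[ \sigma_{ij} = \sum_{p \in leaf(s_i)} (p_j - s_{ij})^2, \quad 1 \le j \le d \]

- The standard deviation along dimension j is approximated by $\sqrt{\sigma_{ij} / (\rho_i - 1)}$

Assume p is the current point in the data stream

If p is not selected into S by the sampling algorithm
- Find the leaf that contains p, leaf($s_i$)
- Increment $\rho_i$
- Add $(p_j - s_{ij})^2$ to $\sigma_{ij}$

If p is selected into S
- A point q will be deleted from S
- Delete leaf(q)
- Add a new leaf corresponding to p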
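The bookkeeping for a point that is not selected into the sample amounts to a counter and d running sums per leaf. A minimal sketch, assuming the leaf containing p has already been located; the class and attribute names are hypothetical, and starting the count at 1 (the kernel center itself) is an assumption:

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LeafStats:
    center: Tuple[float, ...]          # the sample point s_i that owns the leaf
    count: int = 1                     # number of stream points in the leaf (assumed to include s_i)
    sq_dev: List[float] = field(default_factory=list)  # per-dimension sums of (p_j - s_ij)^2

    def __post_init__(self):
        if not self.sq_dev:
            self.sq_dev = [0.0] * len(self.center)

    def add(self, p):
        """Account for a stream point p that falls in this leaf."""
        self.count += 1
        for j, (pj, sj) in enumerate(zip(p, self.center)):
            self.sq_dev[j] += (pj - sj) ** 2

    def sigma(self, j):
        """Approximate standard deviation along dimension j, measured around the center."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.sq_dev[j] / (self.count - 1))
```

Each update costs O(d) once the leaf is known, so the per-point cost is dominated by the kd-tree descent.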

DB group seminar 14

Delete leaf(q)
- u: parent node of leaf(q)
- v: sibling of leaf(q)
- box(u): axis-parallel hyper-rectangle of node u
- h(u): hyperplane orthogonal to a coordinate axis that divides box(u) into two smaller boxes associated with the children of u
- N(q): neighbors of leaf(q), i.e. the leaves in the subtree of v that have one boundary contained in h(u)

DB group seminar 15

Delete leaf(q)

Redistribute the points in leaf(q) to N(q)
- Extend the bounding box of each neighbor of leaf(q) past h(u), until it hits the left boundary of leaf(q)
- Update the $\rho$ and $\sigma$ values for all leaves in N(q)

Notation:
- leaf(r) $\in$ N(q)
- $box_e(r)$: the expanded box of r

DB group seminar 16

Update $\rho$ of leaf(r)

Update the $\rho$ value for every leaf leaf(r) $\in$ N(q)
- Compute the selectivity $sel(box_e(r))$ of $box_e(r)$ w.r.t. leaf(q)
- $\rho_r = \rho_r + sel(box_e(r))$

DB group seminar 17

Update $\sigma$ of leaf(r)

Let $[\alpha_j, \beta_j]$ be the intersection of $box_e(r)$ and the kernel function of q along dimension j. Discretize it by $\kappa$ equidistant points ($\kappa$ is a large constant):

\[ \alpha_j = v_1, v_2, \ldots, v_\kappa = \beta_j \]

$\sigma_{rj}$ is then updated using weights $wt_i$, where $wt_i$ is the approximate number of points of leaf(q) whose j-th coordinate lies in the interval $[v_i, v_{i+1}]$.

DB group seminar 18

Update $\sigma$ of leaf(r)

Figure: updating $\sigma_{rj}$ by discretizing the intersection of $box_e(r)$ and the kernel of q along dimension j (the gray area represents $wt_2$). All points in an interval are approximated by its midpoint.

DB group seminar 19

Insert a leaf

- p: newly inserted point
- q: existing sample point such that p $\in$ leaf(q)

Split leaf(q) by a hyperplane
- It passes through the midpoint (p+q)/2
- Direction: the alternating rule of the kd-tree
  - If i is the splitting dimension for the parent of q, then the splitting dimension for q is (i+1) mod d

Update the $\rho$ and $\sigma$ values for p and q using a procedure similar to the one used when updating leaf(r)
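The split rule on this slide can be sketched as below. The function names and the "low"/"high" labels are illustrative; dimensions are assumed to be numbered from 0.

```python
def split_leaf(p, q, parent_split_dim, d):
    """Return (splitting dimension, threshold) for the hyperplane that separates
    the new point p from the existing center q.

    The dimension alternates after the parent's, (i + 1) mod d, and the
    hyperplane passes through the midpoint (p + q) / 2.
    """
    dim = (parent_split_dim + 1) % d
    threshold = (p[dim] + q[dim]) / 2
    return dim, threshold


def side(point, dim, threshold):
    """Which child of the split a point falls into."""
    return "low" if point[dim] < threshold else "high"
```

If p and q share the same coordinate along the chosen dimension, this hyperplane does not separate them; how that case is handled is not covered on the slide, so the sketch leaves it open.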

DB group seminar 20

Extension

Allow deletion of a point p from the data stream

If p is not a kernel center
- Compute leaf($s_i$) such that p $\in$ leaf($s_i$)
- $\rho_i = \rho_i - 1$
- $\sigma_{ij} = \sigma_{ij} - (p_j - s_{ij})^2$

If p is a kernel center
- Delete leaf(p)
- Replace p with a newly arriving point p'
- This does not follow the sampling procedure and may make the sample non-uniform w.r.t. the points in D

DB group seminar 21

Experiments

- Different numbers of dimensions
- Different query loads
- Range selectivity
- Measurement:
  - Accuracy
  - Trade-off between accuracy and space usage

DB group seminar 22

Data

Synthetic data, from the generator for projected clusters [Agg99]
- SD2 (2D), SD4 (4D)
- 1 million points; 90% are contained in clusters, 10% are uniformly distributed

Real data
- NM2: 1 million 2D points with real-valued attributes
- Each point: an aggregate of measurements taken in a 15-minute interval, reflecting minimum and maximum delay times between pairs of servers on AT&T's backbone network

DB group seminar 23

Query loads

- Two query workloads for each dataset
- Queries are chosen randomly in the attribute space
- Each workload contains 200 queries
- Each query in a workload has the same selectivity
  - 0.5% for the first workload (low selectivity)
  - 10% for the second (high selectivity)

DB group seminar 24

Accuracy measure

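For a query $Q_i = \langle i, R_i \rangle$, the relative error is $Err_i$:

\[ Err_i = \frac{|sel(Q_i) - estimated\_sel(Q_i)|}{\max\{sel(Q_i), 1\}} \]

Let $\{Q_{i_1}, \ldots, Q_{i_k}\}$ be the query workload; the average relative error of this workload is avg_err:

\[ avg\_err = \frac{1}{k} \sum_{j=1}^{k} Err_{i_j} \]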
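As a small illustration of the accuracy measure, the average relative error over a workload can be computed as follows (the function name and the two parallel lists are illustrative choices):

```python
def average_relative_error(true_sels, estimates):
    """avg_err = (1/k) * sum of |sel - estimated_sel| / max{sel, 1} over the k queries."""
    errors = [abs(t - e) / max(t, 1) for t, e in zip(true_sels, estimates)]
    return sum(errors) / len(errors)
```

For example, a workload with true selectivities 100 and 0 and estimates 90 and 2 has per-query errors 0.1 and 2.0, so the average is 1.05.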

DB group seminar 25

Validating local kernels in an off-line setting (1)

MPLKernels (Multi-Pass Local Kernels)
- Scan the data once to get random sample points
- Compute the kd-tree on them
- Scan the data a second time to compute $\rho$ and $\sigma$
- Only useful in an off-line setting

GKernels (Global Kernels) [Gun00]
- Kernel bandwidth: function of the global standard deviation of the data along each dimension
- One-pass approximation
- Two-pass accurate computation

Sample: random sampling

LKernels: one-pass local kernels

DB group seminar 26

Validating local kernels in an off-line setting(2)

DB group seminar 27

Validating local kernels in an off-line setting(3)

DB group seminar 28

Comparison with histogram methods(1)

Histogram method [Tha02] faster heuristic: EGreedy

DB group seminar 29

General online setting(1)

Queries arrive interleaved with points

Compare
- Sample
- LKernels
- MPLKernels

DB group seminar 33

References

[kd75] J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Communications of the ACM, 18(9), September 1975.

[vit85] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, 1985.

[Sco92] D. W. Scott. Multivariate Density Estimation. Wiley-Interscience, 1992.

[Agg99] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast Algorithms for Projected Clustering. In SIGMOD99, pages 61–72.

[Gun00] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating Multidimensional Aggregate Range Queries over Real Attributes. In SIGMOD00, pages 463–474.

[Geh01] J. Gehrke, F. Korn, and D. Srivastava. On Computing Correlated Aggregates over Continual Data Streams. In SIGMOD01.

[Gib01] P. Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. In VLDB01.

[Gre01] M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD01.

DB group seminar 34

References

[Dob02] A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. In SIGMOD02.

[Kor02] F. Korn, S. Muthukrishnan, and D. Srivastava. Reverse Nearest Neighbor Aggregates over Data Streams. In VLDB02.

[Tha02] N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic Multidimensional Histograms. In SIGMOD02, pages 428–439.

[Man02] G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB02, pages 346–357.

[Kou04] N. Koudas, B. C. Ooi, K.-L. Tan, and R. Zhang. Approximate NN Queries on Streams with Guaranteed Error/Performance Bounds. In VLDB04, pages 804–815.

[Joh05] T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Sampling Algorithms in a Stream Operator. In SIGMOD05.

DB group seminar 35

Appendix –reservoir sampling

This algorithm (called Algorithm X in Vitter’s paper) obtains a random sample of size n during a single pass through the relation.

The number of tuples in the relation does not need to be known beforehand. The algorithm proceeds by inserting the first n tuples into a “reservoir.”

Then a random number of records are skipped, and the next tuple replaces a randomly selected tuple in the reservoir.

Another random number of records are then skipped, and so forth, until the last record has been scanned.
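For intuition, here is a minimal sketch of reservoir sampling. Note that it is the simpler per-record variant (commonly known as Algorithm R), which makes one random draw for every record, not the skip-based Algorithm X described above; both produce a uniform random sample of size n. The function name and the optional `rng` parameter are illustrative.

```python
import random


def reservoir_sample(stream, n, rng=None):
    """Uniform random sample of size n from a stream of unknown length (single pass)."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= n:
            # The first n records fill the reservoir.
            reservoir.append(item)
        else:
            # Record t is kept with probability n/t and replaces a random slot.
            slot = rng.randrange(t)
            if slot < n:
                reservoir[slot] = item
    return reservoir
```

Algorithm X avoids the per-record random draw by generating the number of records to skip directly, which matters when the stream is much larger than the reservoir.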

DB group seminar 36

Appendix: kd-tree

Start from the root-cell and bisect recursively the cells through their longest axis, so that an equal number of particles lie in each sub-volume
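The construction described above can be sketched as a short recursive function. Two assumptions are made for illustration: the "longest axis" of a cell is taken to be the dimension with the largest extent of the points it contains, and recursion stops when a cell holds a single point. The dictionary-based node layout and function names are hypothetical.

```python
def build_kdtree(points):
    """Recursively bisect along the longest axis so each half holds
    (nearly) the same number of points."""
    if len(points) <= 1:
        return {"points": list(points)}
    d = len(points[0])
    extents = [
        max(p[j] for p in points) - min(p[j] for p in points) for j in range(d)
    ]
    axis = extents.index(max(extents))       # longest axis of the bounding box
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                  # median split balances the two halves
    return {
        "axis": axis,
        "split": ordered[mid][axis],
        "left": build_kdtree(ordered[:mid]),
        "right": build_kdtree(ordered[mid:]),
    }


def count_leaves(node):
    """Number of leaf cells in the tree."""
    if "points" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])
```

This static construction differs from the online structure in the talk, where leaves are split at the midpoint between two sample points and the splitting dimension alternates.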
