DB group seminar 1
Density Estimation for Spatial Data Streams
Cecilia M. Procopiuc and Octavian Procopiuc
AT&T Shannon Labs
SSTD’05
Presented by: Huiping Cao
DB group seminar 2
Outline
Background
Related work
Problem definition
Online algorithm
Experiments
References
DB group seminar 3
Background
Streaming data: large volume, continuous arrival
Data stream algorithms: one pass, small amount of space, fast updating
DB group seminar 4
Background
Data stream model
Operations on elements:
  Insertion: most common case
  Deletion/updating: most difficult case
Validity time of elements:
  Whole history: landmark window [Geh01], cash register model [this paper]
  Partial recent history: sliding window [Geh01], turnstile model [this paper]
DB group seminar 5
Related work
Classified according to operations:
  Aggregation: avg, max, min, sum [Dob02]; count of distinct values [Gib01]; quantiles in 1D data [Gre01]; frequent items (heavy hitters) [Man02]
  Query estimation: join size estimation [Dob02]
  k-nearest-neighbor searching: eKNN [Kou04], RNN aggregation [Kor02]
Techniques: histograms [Tha02], sampling [Joh05], special synopses, …
DB group seminar 6
Problem definition
D: <p1, p2, …>, where pi ∈ R^d
Cash register model: Qi = <i, Ri>
  Ri: a d-dimensional hyper-rectangle
Selectivity of Qi: sel(Qi) = |{pj | j ≤ i, pj ∈ Ri}|
  These points arrive before time step i
  They lie in Ri
Problem: estimating sel(Qi)
Measurement: relative error
  Err = |sel(Q) − estimated_sel(Q)| / max{sel(Q), 1}
DB group seminar 7
Online algorithm: rough steps
Get random samples S of D Using reservoir sampling method [vit85]
Using kd-tree([kd75])-like structure to index these sample points Maintenance of the sample and the kd-tree-like
structure online Compute range selectivity: estimated_sel(Q)
using kernel density estimator
DB group seminar 8
Random sampling
Theorem 1:
  Let T be the data stream seen so far, with size |T|. Let S ⊆ T be a random sample chosen via the reservoir sampling technique, such that |S| = O((d/ε²) log(1/ε) + (1/ε²) log(1/δ)), where 0 < ε, δ < 1 and |S| is the size of S.
  Then with probability 1 − δ, for any axis-parallel hyper-rectangle Q the following is true:
    |sel(Q) − sel(Q, S) · |T| / |S|| ≤ ε |T|
  where sel(Q) = |Q ∩ T| is the selectivity of Q with respect to the data stream seen so far, and sel(Q, S) = |Q ∩ S| is the selectivity of Q with respect to the random sample.
DB group seminar 9
Sampling
Random sampling
Problem: when sel(Q) is small, the relative error is large
  estimated_sel(Q) = sel(Q, S) · |T| / |S|
  Err ≤ ε |T| / max{sel(Q), 1}
Better selectivity estimator: kernel density estimator
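The scale-up estimator above can be sketched in Python; this is a minimal illustration with hypothetical names (not from the paper), counting how many sample points fall inside the query rectangle and scaling up by |T| / |S|.

```python
def sample_selectivity_estimate(stream_size, sample, rect):
    """Scale-up estimator: estimated_sel(Q) = sel(Q, S) * |T| / |S|,
    where sel(Q, S) counts the sample points inside the query rectangle
    rect = [(a1, b1), ..., (ad, bd)]."""
    def inside(p):
        return all(a <= x <= b for x, (a, b) in zip(p, rect))
    sel_S = sum(1 for p in sample if inside(p))
    return sel_S * stream_size / len(sample)
```

For example, with |T| = 10 and a 5-point sample of which 2 points fall in Q, the estimate is 2 · 10 / 5 = 4.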
DB group seminar 10
Kernel density estimator
S = {s1, …, sm}: random subset of D
Density estimate:
  f(x) = (1/m) Σi=1..m k(x − si)
where x = (x1, …, xd) and si = (si1, …, sid) are d-dimensional points, and k is the product Epanechnikov kernel
  k(x1, …, xd) = (0.75^d / (B1 B2 … Bd)) Πj=1..d (1 − (xj / Bj)²), zero whenever some |xj| > Bj
Bj: kernel bandwidth along dimension j [Sco92], a global parameter
  Bj = √5 · σj · m^(−1/(d+4))
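As a sketch, the density estimate above can be written out directly; the helper names are ours, and the bandwidth follows the Scott-style rule quoted on the slide.

```python
import math

def scott_bandwidth(sigma, m, d):
    """Global bandwidth rule from the slide: B_j = sqrt(5) * sigma_j * m^(-1/(d+4))."""
    return math.sqrt(5) * sigma * m ** (-1.0 / (d + 4))

def kernel_density(x, samples, B):
    """f(x) = (1/m) * sum_i k(x - s_i), with the product Epanechnikov
    kernel k, which is zero outside |x_j - s_ij| <= B_j."""
    d, m = len(x), len(samples)
    total = 0.0
    for s in samples:
        k = 0.75 ** d / math.prod(B)
        for j in range(d):
            u = (x[j] - s[j]) / B[j]
            if abs(u) > 1.0:       # outside the kernel support
                k = 0.0
                break
            k *= 1.0 - u * u
        total += k
    return total / m
```

With a single 1-D sample at the origin and B = 1, the estimate at 0 is simply the kernel peak, 0.75.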
DB group seminar 11
One-dimensional kernels
(a) Kernel function, B = 1; (b) Contribution of multiple kernels to estimate of
range query
DB group seminar 12
Local kernel density estimator
kd-tree structure T(S): index of the sample data
  Each leaf contains one sample point si and is denoted leaf(si)
  Any two leaves are disjoint
  The union of all leaves is R^d
Each leaf maintains d+1 values: αi, σi1, σi2, …, σid
  σij: approximates the standard deviation of the points in the cell centered at si along dimension j
R = [a1, b1] × … × [ad, bd]
Ti: subset of points in tree leaf leaf(si)
Selectivity of R with respect to Ti:
  sel(R, Ti) = |Ti| · (0.75^d / (Bi1 … Bid)) · Πj=1..d ∫[aj, bj] (1 − ((xj − sij) / Bij)²) dxj
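The per-dimension integral has a closed form, so sel(R, Ti) can be evaluated without numerical integration. A sketch under our own notation (leaf point count, kernel center s, per-dimension bandwidths B):

```python
def local_selectivity(rect, s, B, n_points):
    """sel(R, T_i): estimated number of the n_points points of leaf(s)
    falling in rect = [(a1, b1), ..., (ad, bd)], via the Epanechnikov
    kernel centred at s.  Uses the antiderivative
    F(x) = x - (x - s_j)^3 / (3 * B_j^2) of 1 - ((x - s_j)/B_j)^2."""
    d = len(s)
    est = n_points * 0.75 ** d
    for j in range(d):
        a, b = rect[j]
        lo = max(a, s[j] - B[j])   # clip to the kernel support [s_j - B_j, s_j + B_j]
        hi = min(b, s[j] + B[j])
        if hi <= lo:
            return 0.0
        def F(x):
            return x - (x - s[j]) ** 3 / (3 * B[j] ** 2)
        est *= (F(hi) - F(lo)) / B[j]
    return est
```

When rect covers the whole kernel support, the integral is 4/3 per dimension and the estimate reduces to the full leaf count.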
DB group seminar 13
Update T(S)
Purpose: maintain αi, σij (1 ≤ j ≤ d)
  αi is the number of stream points contained in leaf(si)
Assume p is the current point in the data stream
If p is not selected into S by the sampling algorithm:
  Find the leaf that contains p, leaf(si)
  Increment αi
  Add (pj − sij)² to the squared-deviation sum, so that
    σij² = (Σp ∈ leaf(si) (pj − sij)²) / (αi − 1), 1 ≤ j ≤ d
If p is selected into S:
  A point q will be deleted from S
  Delete leaf(q)
  Add a new leaf corresponding to p
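A minimal sketch of the per-point update for a point that is not chosen into the sample (the dict-based leaf representation is ours, not the paper's):

```python
def update_leaf_stats(leaf, p):
    """When stream point p falls in this leaf and is not sampled:
    increment the point count alpha and add (p_j - s_j)^2 to the
    accumulated squared deviation along each dimension."""
    s = leaf["center"]                 # the sample point s_i of leaf(s_i)
    leaf["alpha"] += 1
    for j in range(len(s)):
        leaf["sq_dev"][j] += (p[j] - s[j]) ** 2

leaf = {"center": [0.0, 0.0], "alpha": 0, "sq_dev": [0.0, 0.0]}
update_leaf_stats(leaf, [1.0, 2.0])
```

The standard-deviation approximation σij is then recovered as sqrt(sq_dev[j] / (alpha − 1)) whenever alpha > 1.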
DB group seminar 14
Delete leaf(q)
  u: parent node of leaf(q)
  v: sibling of leaf(q)
  box(u): axis-parallel hyper-rectangle of node u
  h(u): hyper-plane orthogonal to a coordinate axis that divides box(u) into two smaller boxes associated with the children of u
  N(q), neighbors of leaf(q): leaves in the subtree of v that have one boundary contained in h(u)
DB group seminar 15
Delete leaf(q): redistribute the points in leaf(q) to N(q)
  Extend the bounding box of each neighbor of leaf(q) past h(u), until it hits the left boundary of leaf(q)
  Update the α, σ values for all leaves in N(q)
Notation:
  leaf(r) ∈ N(q)
  boxe(r): the expanded box of r
DB group seminar 16
Update α of leaf(r)
  For every leaf leaf(r) ∈ N(q), compute the selectivity sel(boxe(r)) of boxe(r) w.r.t. leaf(q)
  αr = αr + sel(boxe(r))
DB group seminar 17
Update σ of leaf(r)
  Let I be the intersection of boxe(r) and the kernel function of q along dimension j
  Discretize I by λ equidistant points v1, v2, …, vλ (λ is a large constant)
  Update σrj accordingly, where wti is the approximate number of points of leaf(q) whose j-th coordinate lies in the interval [vi, vi+1]
DB group seminar 18
Update σ of leaf(r)
  Updating σrj by discretizing the intersection of boxe(r) and the kernel of q along dimension j (the gray area represents wt2)
  All points in an interval are approximated by its midpoint
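The per-interval weights wti from this midpoint approximation can be sketched as follows (a 1-D view; the parameter names and the dict-free interface are our assumptions):

```python
def interval_weights(alpha_q, s_j, B_j, lo, hi, lam):
    """Divide [lo, hi] into lam - 1 equal intervals and approximate the
    number of leaf(q) points in each by evaluating the 1-D Epanechnikov
    kernel of q at the interval midpoint (all points in an interval are
    represented by that midpoint)."""
    width = (hi - lo) / (lam - 1)
    weights = []
    for i in range(lam - 1):
        mid = lo + (i + 0.5) * width
        u = (mid - s_j) / B_j
        density = 0.75 / B_j * max(0.0, 1.0 - u * u)   # 1-D kernel value
        weights.append(alpha_q * density * width)
    return weights
```

Over the full kernel support the weights sum to roughly alpha_q, since the kernel integrates to 1.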
DB group seminar 19
Insert a leaf
  p: newly inserted point
  q: existing sample point such that p ∈ leaf(q)
  Split leaf(q) by a hyperplane:
    It passes through the midpoint (p + q) / 2
    Its direction follows the alternating rule of the kd-tree: if i is the splitting dimension for the parent of q, then the splitting dimension for q is (i + 1) mod d
  Update the α and σ values for p and q, using a procedure similar to the one above
DB group seminar 20
Extension
Allow deletion of a point p from the data stream
  If p is not a kernel center:
    Find leaf(si) such that p ∈ leaf(si)
    αi = αi − 1
    σij = σij − (pj − sij)²
  If p is a kernel center:
    Delete leaf(p)
    Replace p with a newly arriving point p'
    This does not follow the sampling procedure, so it may make the sample non-uniform w.r.t. the points in D
DB group seminar 21
Experiments
Different numbers of dimensions
Different query loads
Range selectivity
Measurement:
  Accuracy
  Trade-off between accuracy and space usage
DB group seminar 22
Data
Synthetic data: generator for projected clusters [Agg99]
  SD2 (2D), SD4 (4D): 1 million points; 90% are contained in clusters, 10% uniformly distributed
Real data: NM2
  1 million 2D points with real-valued attributes
  Each point: an aggregate of measurements taken in a 15-minute interval, reflecting minimum and maximum delay times between pairs of servers on AT&T's backbone network
DB group seminar 23
Query loads
2 query workloads for each dataset
  Queries are chosen randomly in the attribute space
  Each workload contains 200 queries
  Each query in a workload has the same selectivity: 0.5% for the first workload (low selectivity), 10% for the second (high selectivity)
DB group seminar 24
Accuracy measure
For query Qi = <i, Ri>, its relative error is
  Erri = |sel(Qi) − estimated_sel(Qi)| / max{sel(Qi), 1}
Let {Qi1, …, Qik} be the query workload; the average relative error of this workload is
  avg_err = (1/k) Σj=1..k Errij
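These two error measures translate directly into code (function names are ours):

```python
def relative_error(true_sel, est_sel):
    """Err = |sel(Q) - estimated_sel(Q)| / max{sel(Q), 1}."""
    return abs(true_sel - est_sel) / max(true_sel, 1)

def avg_error(workload):
    """Average relative error over a workload of
    (true_selectivity, estimated_selectivity) pairs."""
    return sum(relative_error(t, e) for t, e in workload) / len(workload)
```

The max{·, 1} in the denominator keeps the measure defined for empty queries, where sel(Q) = 0.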
DB group seminar 25
Validating local kernels in an off-line setting (1)
MPLKernels (multi-pass local kernels):
  Scan the data once to get random sample points
  Compute the kd-tree on them
  Scan the data a second time to compute α and σ
  Only useful in an off-line setting
GKernels (global kernels) [Gun00]:
  Kernel bandwidth: function of the global standard deviation of the data along each dimension
  One-pass approximation, or two-pass accurate computation
Sample: random sampling
LKernels: one-pass local kernels
DB group seminar 26
Validating local kernels in an off-line setting(2)
DB group seminar 27
Validating local kernels in an off-line setting(3)
DB group seminar 28
Comparison with histogram methods (1)
Histogram method [Tha02], using its faster heuristic EGreedy
DB group seminar 29
General online setting(1)
Queries arrive interleaved with points
Compared methods: Sample, LKernels, MPLKernels
DB group seminar 30
General online setting(2)
DB group seminar 31
General online setting(3)
DB group seminar 32
General online setting(4)
DB group seminar 33
References
[kd75] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), September 1975.
[vit85] J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1): 37-57, 1985.
[Sco92] D. W. Scott. Multivariate Density Estimation. Wiley-Interscience, 1992.
[Agg99] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In SIGMOD99, pages 61-72.
[Gun00] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating multidimensional aggregate range queries over real attributes. In SIGMOD00, pages 463-474.
[Geh01] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In SIGMOD01.
[Gib01] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB01.
[Gre01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD01.
DB group seminar 34
References
[Dob02] A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In SIGMOD02.
[Kor02] F. Korn, S. Muthukrishnan, and D. Srivastava. Reverse nearest neighbor aggregates over data streams. In VLDB02.
[Tha02] N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic multidimensional histograms. In SIGMOD02, pages 428-439.
[Man02] G. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB02, pages 346-357.
[Kou04] N. Koudas, B. C. Ooi, K.-L. Tan, and R. Zhang. Approximate NN queries on streams with guaranteed error/performance bounds. In VLDB04, pages 804-815.
[Joh05] T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Sampling algorithms in a stream operator. In SIGMOD05.
DB group seminar 35
Appendix –reservoir sampling
This algorithm (called Algorithm X in Vitter’s paper) obtains a random sample of size n during a single pass through the relation.
The number of tuples in the relation does not need to be known beforehand. The algorithm proceeds by inserting the first n tuples into a “reservoir.”
Then a random number of records are skipped, and the next tuple replaces a randomly selected tuple in the reservoir.
Another random number of records are then skipped, and so forth, until the last record has been scanned.
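A sketch of the same idea using the simpler Algorithm R, without the skip counting: Vitter's Algorithm X described above skips over records for speed, but both maintain the same uniform sample distribution.

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Maintain a uniform random sample of size n over a stream of
    unknown length: fill the reservoir with the first n items, then
    replace a random slot with item t+1 with probability n/(t+1)."""
    reservoir = []
    for t, item in enumerate(stream):
        if t < n:
            reservoir.append(item)        # fill the reservoir first
        else:
            k = rng.randrange(t + 1)      # uniform in {0, ..., t}
            if k < n:
                reservoir[k] = item       # evict a uniformly chosen victim
    return reservoir
```

Each item of the stream appears in the reservoir at most once, and after the pass every item has probability n / |stream| of being in it.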
DB group seminar 36
Appendix: kd-tree
Start from the root cell and recursively bisect the cells through their longest axis, so that an equal number of points lies in each sub-volume
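A toy version of this construction; note this sketch cycles the split axis with depth (the alternating rule used in the body of the talk) rather than choosing the longest axis, and splits at the median so each side gets an equal number of points.

```python
def build_kdtree(points, depth=0):
    """Recursively bisect the point set so that half the points fall on
    each side of an axis-orthogonal splitting plane."""
    if len(points) <= 1:
        return {"points": list(points)}       # leaf cell
    axis = depth % len(points[0])             # cycle through the axes
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "axis": axis,
        "split": pts[mid][axis],              # splitting coordinate
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid:], depth + 1),
    }

tree = build_kdtree([(0, 0), (1, 1), (2, 2), (3, 3)])
```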