Approximating Data Stream using histogram for Query Evaluation Huiping Cao Jan. 03, 2003.
Outline
- Introduction
- Background
- Histogramming a data stream
  - V-Optimal Histogram
  - Optimal Histogram Construction
  - Agglomerative Histogram Algorithm
  - Fixed-Window Histogram Algorithm
- Experiments
- Conclusion
Introduction
- A data stream is an ordered sequence of data elements that arrive continuously and at a variable rate.
- Many applications generate streaming data, such as network monitoring records and data generated by sensors.
- Algorithms that handle data streams have new requirements: a single pass over the data, high speed (in some settings), limited memory, and online operation over unbounded input.
- Data stream operations: approximate querying, similarity searching, data mining.
- Such operations rely on a good approximation of the data stream; a histogram is a popular way to approximate a data stream.
Background: Histogram
- A histogram approximates the data distribution of a data set or data stream by partitioning the underlying data into subsets called buckets.
- A good histogram construction algorithm approximates the data as accurately and as quickly as possible.
- The accuracy of the approximation depends on:
  (1) the partitioning technique used to group values into buckets, i.e., how to partition the data into subsets while introducing as little error as possible;
  (2) the approximation technique employed within each bucket, i.e., how to summarize the values in one bucket, e.g., by their mean (average).
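As a minimal sketch of point (2), assume each bucket stores only its index range and the mean of its values. The function names and the 0-based, inclusive bucket boundaries below are illustrative assumptions, not part of the talk. A range-sum query can then be estimated from the bucket means:

```python
def build_histogram(values, boundaries):
    """Summarize each bucket (start, end), inclusive 0-based indices, by its mean."""
    histogram = []
    for start, end in boundaries:
        bucket = values[start:end + 1]
        histogram.append((start, end, sum(bucket) / len(bucket)))
    return histogram


def estimate_range_sum(histogram, lo, hi):
    """Estimate sum(values[lo..hi]) by charging each covered index its bucket mean."""
    total = 0.0
    for start, end, mean in histogram:
        overlap = min(end, hi) - max(start, lo) + 1
        if overlap > 0:
            total += overlap * mean
    return total


values = [1, 2, 3, 10, 12]
hist = build_histogram(values, [(0, 2), (3, 4)])
print(hist)                            # two buckets with means 2.0 and 11.0
print(estimate_range_sum(hist, 2, 4))  # estimate 24.0; the exact sum is 25
```

The estimate is exact whenever a query covers whole buckets; the error comes only from partially covered buckets, which is why the choice of boundaries matters.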
Background (cont.): Data stream model
- Agglomerative (landmark) model: takes into account every element seen so far (Figure 1(a)).
- Fixed-window (sliding-window) model: considers only the last n elements seen, or the elements observed within the last t time units before the current time (Figure 1(b)).
[Figure 1: timelines from t0 = 0 to tcurrent. (a) The sketch covers the whole stream seen so far. (b) The sketch covers only the last n elements.]
Background (cont.): Related work
- Approximating specific queries: distinct values ([Gib01]), frequency counts ([MM02]), quantiles ([GK01]), general aggregation ([DGG+02]), joins ([KNV03]).
- Approximation methods: sampling, histograms, wavelets, and other common synopses (Section 6 in [BBD+02]).
- Focus of this talk: query-independent histogram construction methods, specifically the partitioning of buckets.
Histogramming a data stream
- Optimal histogram ([GK02, IP95])
- Optimal histogram construction ([GK02, JKM+98])
- Agglomerative algorithm ([GK02, GKS01])
- Fixed-window algorithm ([GK02])
V-Optimal Histogram
- Optimal histogram problem: given a sequence of length $n$, a number of buckets $B$, and an error function $E_n()$, find a histogram $H_B$ that minimizes $E_n(H_B)$.
- The construction is independent of the queries.
- [IP95] showed that the well-known V-Optimal histogram is the optimal histogram.
- Basic idea: attribute values are grouped into buckets based on the proximity of their frequencies, not of their actual values.
- Error function: $E_n() = \sum_{i=1}^{B} \sum_{v \in b_i} \left( f_v - C_{b_i} / V_{b_i} \right)^2$
  - $B$: maximum number of buckets
  - $b_i$: the $i$-th bucket
  - $f_v$: the frequency of value $v$ in a bucket
  - $C_{b_i}$, $V_{b_i}$: the sum and the number of the frequencies in bucket $b_i$
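To make the error function concrete, here is a small worked example. The frequencies are made up for illustration: five values with frequencies 2, 4, 10, 10, 13, grouped into $B = 2$ buckets by frequency proximity.

```latex
\begin{align*}
b_1 = \{2, 4\}: \quad & C_{b_1}/V_{b_1} = 6/2 = 3,
  & \sum_{v \in b_1} (f_v - 3)^2 &= 1 + 1 = 2 \\
b_2 = \{10, 10, 13\}: \quad & C_{b_2}/V_{b_2} = 33/3 = 11,
  & \sum_{v \in b_2} (f_v - 11)^2 &= 1 + 1 + 4 = 6 \\
& & E_n() &= 2 + 6 = 8
\end{align*}
```

With a single bucket the average frequency is $39/5 = 7.8$ and the error is $389 - 39^2/5 = 84.8$, so separating the low frequencies from the high ones reduces the error by roughly a factor of ten.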
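Optimal Histogram Construction
- Problem: constructing an optimal histogram amounts to partitioning the index set $\{1, \dots, n\}$ into $B$ intervals (buckets) so as to minimize $E()$.
- Main idea [JKM+98]: the algorithm computes $\mathrm{OPT}[n,B]$ and obtains the bucket boundaries at the same time.
  - $\mathrm{OPT}[i,k]$ denotes the minimum error of representing $[1, \dots, i]$ by a histogram with $k$ buckets, where $i \le n$ and $k \le B$.
  - $\mathrm{OPT}[n,B] = \min_{i<n} \{ \mathrm{OPT}[i,B-1] + \mathrm{SSE}[i+1,n] \}$
  - $E() = \sum_{k=1}^{B} \mathrm{SSE}_k$, where $\mathrm{SSE}_k$ is the error of the $k$-th bucket; its minimum over all partitions is $\mathrm{OPT}[n,B]$.
- SSE is the common error metric, the sum squared error:
  $\mathrm{SSE}([a,b]) = \sum_{i=a}^{b} (v_i - \mathrm{avg}(v))^2 = \sum_{i=a}^{b} v_i^2 - \frac{1}{b-a+1} \left( \sum_{i=a}^{b} v_i \right)^2$
  $= \mathrm{SQSUM}[1,b] - \mathrm{SQSUM}[1,a-1] - \frac{1}{b-a+1} \left( \mathrm{SUM}[1,b] - \mathrm{SUM}[1,a-1] \right)^2$
  where $\mathrm{SUM}[1,i] = \sum_{j=1}^{i} v_j$ and $\mathrm{SQSUM}[1,i] = \sum_{j=1}^{i} v_j^2$.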
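The prefix-sum form of SSE is what makes each error lookup constant time. A minimal sketch, assuming 1-based positions as on the slide and illustrative function names:

```python
from itertools import accumulate


def prefix_sums(values):
    """Return (SUM, SQSUM) where SUM[i] = v_1 + ... + v_i and SUM[0] = 0."""
    total = [0] + list(accumulate(values))
    squares = [0] + list(accumulate(v * v for v in values))
    return total, squares


def sse(total, squares, a, b):
    """Sum squared error of v_a..v_b (1-based, inclusive) in O(1) time."""
    s = total[b] - total[a - 1]
    q = squares[b] - squares[a - 1]
    return q - s * s / (b - a + 1)


data = [0, 0, 0, 1, 1, 1, 1, 1]
SUM, SQSUM = prefix_sums(data)
print(sse(SUM, SQSUM, 1, 4))  # 0.75
print(sse(SUM, SQSUM, 4, 8))  # 0.0
print(sse(SUM, SQSUM, 1, 8))  # 1.875
```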
Optimal Histogram Construction (cont.)
Algorithm OptimalHistogram()
  Compute SUM[1,i] and SQSUM[1,i] for all $1 \le i \le n$
  Initialize OPT[j,1] = SSE[1,j] for $1 \le j \le n$
  For j = 1 to n do
    For k = 2 to B do
      For i = 1 to j-1 do
        OPT[j,k] = min(OPT[j,k], OPT[i,k-1] + SSE[i+1,j])
Explanation
- For each newly seen element $v_j$, the algorithm computes OPT[j,k] as the minimum cost over all possible positions of the last bucket boundary.
- E.g., $\mathrm{OPT}[n,B] = \min_{i<n} \{ \mathrm{OPT}[i,B-1] + \mathrm{SSE}[i+1,n] \}$ means that OPT[n,B] is the minimum of:
  - OPT[1,B-1] + SSE[2,n]
  - OPT[2,B-1] + SSE[3,n]
  - ...
  - OPT[n-1,B-1] + SSE[n,n]
Optimal Histogram Construction (cont.)
Example: data sequence $\{x_1, x_2, x_3, \dots, x_{10}\}$, $n = 10$, $B = 3$
- $j=1$: best partition [1,1]
- $j=2$: best partition [1,2]
- ...
- $j=5$, $k=B-1$: best partition [1,2][3,5]
- $j=6$, $k=B-1$: best partition [1,3][4,6]
- ...
- $j=9$, $k=B$: OPT[9,B] = OPT[5,B-1] + SSE[6,9], so the best partition is [1,2][3,5][6,9]
- $j=10$, $k=B$: OPT[10,B] = OPT[6,B-1] + SSE[7,10], so the best partition is [1,3][4,6][7,10]
Time complexity: $O(n^2 B)$; space complexity: $O(n)$
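The dynamic program can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the authors' code: it keeps a full table of cut points so the buckets can be recovered, which costs $O(nB)$ memory rather than the $O(n)$ quoted on the slide, and the name `optimal_histogram` is made up.

```python
import math


def optimal_histogram(values, B):
    """Exact DP: return (minimum SSE, list of 1-based inclusive buckets)."""
    n = len(values)
    total = [0.0] * (n + 1)    # SUM[1,i]
    squares = [0.0] * (n + 1)  # SQSUM[1,i]
    for i, v in enumerate(values, 1):
        total[i] = total[i - 1] + v
        squares[i] = squares[i - 1] + v * v

    def sse(a, b):
        s = total[b] - total[a - 1]
        return squares[b] - squares[a - 1] - s * s / (b - a + 1)

    opt = [[math.inf] * (B + 1) for _ in range(n + 1)]
    cut = [[0] * (B + 1) for _ in range(n + 1)]  # end of the previous bucket
    for j in range(1, n + 1):
        opt[j][1] = sse(1, j)
        for k in range(2, min(B, j) + 1):
            for i in range(k - 1, j):  # i = last index of the first k-1 buckets
                cand = opt[i][k - 1] + sse(i + 1, j)
                if cand < opt[j][k]:
                    opt[j][k], cut[j][k] = cand, i

    # Walk the cut points backwards to recover the bucket boundaries.
    buckets, j, k = [], n, min(B, n)
    best = opt[n][k]
    while k >= 1:
        i = cut[j][k] if k > 1 else 0
        buckets.append((i + 1, j))
        j, k = i, k - 1
    return best, buckets[::-1]


print(optimal_histogram([0, 0, 0, 1, 1, 1, 1, 1], 2))  # (0.0, [(1, 3), (4, 8)])
```

The inner loop over `i` is the $O(n)$ search per table entry that gives the $O(n^2 B)$ running time.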
Agglomerative algorithm
- $\epsilon$-approximation algorithm: given a sequence of length $n$, a number of buckets $B$, an error function $E_n()$, and a precision $\epsilon > 0$, find $H_B$ with $E_n(H_B)$ at most $(1+\epsilon) \min_H E_n(H)$.
- If the data sequence is a data stream, $n$ is the number of data points of the stream that fit in the fixed memory space used to store a portion of it.
- The agglomerative algorithm aims to construct an $\epsilon$-approximate histogram.
- Can the optimal construction algorithm be turned into an $\epsilon$-approximation algorithm in the data stream setting?
- The cost of searching for the minimum approximation error is high [GKS01].
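Agglomerative algorithm (cont.)
- Improvement over the OptimalHistogram algorithm: it reduces the cost of computing OPT[j,k].
  - OptimalHistogram: $\mathrm{OPT}[j,k] = \min_{i<j} (\mathrm{OPT}[i,k-1] + \mathrm{SSE}[i+1,j])$
  - Agglomerative algorithm: $\mathrm{OPT}[j,k] = \min_{b_i} (\mathrm{OPT}[b_i,k-1] + \mathrm{SSE}[b_i+1,j])$, where the $b_i$ are the end points of the intervals for approximating the $j$ data points using $k-1$ buckets.
- E.g., if $\{v_i\} = \{v_1, v_2, v_3, \dots, v_9\}$ and $\{b_i\} = \{v_3, v_5, v_9\}$, the OptimalHistogram algorithm needs to compare 9 values, but the agglomerative algorithm needs to compare only 3 values.
- Reason: for an interval $(a,b)$ with $\mathrm{OPT}[b,k-1] \le (1+\delta)\,\mathrm{OPT}[a,k-1]$ and any $a \le i \le b$,
  $\mathrm{OPT}[b,k-1] + \mathrm{SSE}[b+1,j] \le (1+\delta)(\mathrm{OPT}[i,k-1] + \mathrm{SSE}[i+1,j])$, because
  - $\mathrm{SSE}[i+1,j]$ is a non-negative, non-increasing function of $i$ when $j$ is fixed, so $\mathrm{SSE}[b+1,j] \le \mathrm{SSE}[i+1,j]$;
  - $\mathrm{OPT}[i,k-1]$ is a non-negative, non-decreasing function of $i$, so $\mathrm{OPT}[a,k-1] \le \mathrm{OPT}[i,k-1]$.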
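Agglomerative algorithm (cont.)
- Main idea: for each $1 \le k \le B$, the algorithm maintains intervals $(a_1^k, b_1^k), \dots, (a_l^k, b_l^k)$ such that
  - $a_1^k = 1$, $b_l^k = n$, and $b_j^k + 1 = a_{j+1}^k$ for $j < l$;
  - $\mathrm{OPT}[b_j^k, k] \le (1+\delta)\, \mathrm{OPT}[a_j^k, k]$, where $(1+\delta)^B \le 1+\epsilon$.
- Store $\mathrm{OPT}[a_j^k, k]$ and $\mathrm{OPT}[b_j^k, k]$ for all $j$ and $k$; also store $\mathrm{SUM}[1,r]$ and $\mathrm{SQSUM}[1,r]$ for $r \in \bigcup_{k,j} (\{a_j^k\} \cup \{b_j^k\})$.
- $B-1$ queues store the intervals and the related SUMs and SQSUMs.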
Agglomerative algorithm (cont.)
On seeing the $(n+1)$-st value $v_{n+1}$, the algorithm:
- Computes $\mathrm{OPT}[n+1,k]$ for all $1 \le k \le B$:
  - for $k = 1$, $\mathrm{OPT}[n+1,1] = \mathrm{SSE}[1,n+1]$;
  - for $k \ge 2$, $\mathrm{OPT}[n+1,k] = \min_i (\mathrm{OPT}[b_i^{k-1}, k-1] + \mathrm{SSE}[b_i^{k-1}+1, n+1])$.
- Updates the intervals $(a_1^k, b_1^k), \dots, (a_l^k, b_l^k)$: only the last interval $(a_l^k, b_l^k)$ needs to be updated, either by setting $b_l^k = n+1$ or by creating a new interval $l+1$ with $a_{l+1}^k = b_{l+1}^k = n+1$.
Time complexity: $O((nB^2/\epsilon) \log n)$; space complexity: $O((B^2/\epsilon) \log n)$
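The update step can be sketched as a small streaming class. This is a simplified illustration under explicit assumptions, not the algorithm exactly as published: it assumes $\delta = \epsilon/(2B)$ (one choice satisfying $(1+\delta)^B \le 1+\epsilon$ for $0 < \epsilon \le 1$), it treats OPT[j,k] as the error with at most $k$ buckets, and the class and field names are hypothetical.

```python
class AgglomerativeHistogram:
    """Streaming (1 + eps)-approximate V-Optimal error under the landmark model."""

    def __init__(self, B, eps):
        self.B = B
        self.delta = eps / (2 * B)  # assumption: per-level precision
        self.n = 0
        self.s = 0.0                # SUM[1,n]
        self.q = 0.0                # SQSUM[1,n]
        # queues[k], k = 1..B-1, holds intervals stored as
        # [OPT at a, b, OPT at b, SUM[1,b], SQSUM[1,b]]
        self.queues = [[] for _ in range(B)]

    def add(self, v):
        """Consume one value and return the approximate OPT[n,B]."""
        self.n += 1
        self.s += v
        self.q += v * v
        n, s, q = self.n, self.s, self.q
        opt = [0.0] * (self.B + 1)
        opt[1] = max(0.0, q - s * s / n)  # SSE[1,n], clamped against rounding
        for k in range(2, self.B + 1):
            best = opt[k - 1]             # using fewer buckets is always allowed
            for _, b, opt_b, s_b, q_b in self.queues[k - 1]:
                ds, dq = s - s_b, q - q_b  # sums over positions b+1..n
                best = min(best, opt_b + dq - ds * ds / (n - b))
            opt[k] = best
        # Extend the last interval of each queue, or open a new one at n.
        for k in range(1, self.B):
            queue = self.queues[k]
            if queue and opt[k] <= (1 + self.delta) * queue[-1][0]:
                queue[-1][1:] = [n, opt[k], s, q]
            else:
                queue.append([opt[k], n, opt[k], s, q])
        return opt[self.B]


hist = AgglomerativeHistogram(B=2, eps=1.0)
for value in [0, 0, 0, 1, 1, 1, 1, 1]:
    error = hist.add(value)
print(error)  # 0.0, matching the exact optimum {(1,3),(4,8)}
```

Only interval end points are kept, so each new value is compared against one candidate per stored interval instead of every earlier position.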
Fixed Window Algorithm
- The agglomerative algorithm is not very useful for constructing a fixed-window histogram.
  - Reason: computing a histogram on $[1, \dots, n]$ does not provide any information about the histogram on $[2, \dots, n]$.
- Main idea: maintain $\sum_{l \le j \le i} x_j$ and $\sum_{l \le j \le i} x_j^2$ using two arrays, SUM' and SQSUM', on $[0, n]$, which are circular buffers. Here $\{x_l, \dots, x_i\}$ are the observations of interest.
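One way to realize the two circular buffers is sketched below. It is an assumption-laden illustration: the buffers hold running totals over the whole stream for the last $n+1$ positions, so any range inside the window is a difference of two entries. The class name `WindowSums` and its methods are made up, and a long-running system would also have to bound the growth of the running totals.

```python
class WindowSums:
    """Circular buffers of cumulative sums over the last n stream elements."""

    def __init__(self, n):
        self.n = n
        self.sum = [0] * (n + 1)  # SUM' : cumulative sums, indexed modulo n+1
        self.sq = [0] * (n + 1)   # SQSUM': cumulative sums of squares
        self.t = 0                # number of elements seen so far

    def push(self, x):
        prev = self.t % (self.n + 1)
        self.t += 1
        cur = self.t % (self.n + 1)
        self.sum[cur] = self.sum[prev] + x
        self.sq[cur] = self.sq[prev] + x * x

    def size(self):
        return min(self.t, self.n)

    def _at(self, i):
        """Cumulative sums after window position i (i = 0: just before the window)."""
        idx = (self.t - self.size() + i) % (self.n + 1)
        return self.sum[idx], self.sq[idx]

    def sse(self, a, b):
        """SSE of window positions a..b (1-based, inclusive) in O(1) time."""
        s_b, q_b = self._at(b)
        s_a, q_a = self._at(a - 1)
        s, q = s_b - s_a, q_b - q_a
        return q - s * s / (b - a + 1)


w = WindowSums(3)
for x in [1, 2, 3, 4, 5]:
    w.push(x)
# The window now holds [3, 4, 5].
print(w.sse(1, 3))  # 2.0
print(w.sse(2, 3))  # 0.5
```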
Fixed Window Algorithm (cont.)
FixedWindowHistogram()
  Compute SUM' and SQSUM'
  Assume 1 to be the first point in the circular buffer
  For k = 1 to B-1 {
    Initialize the k-th queue to empty
    CreateList[1,n,k]
      // time complexity: $O((1/\delta)^2 \log^3 n)$, with $\delta$ on the order of $\epsilon/B$
      // creates intervals of [1...n] using k buckets
      // each interval [a,b] satisfies $\mathrm{OPT}[b,k] \le (1+\delta)\,\mathrm{OPT}[a,k]$, with b maximized
  }
  Let $b_{l_1}, b_{l_2}, \dots$ be the end points in Queue B-1
  $\mathrm{OPT}[n,B] = \min_i \{ \mathrm{OPT}[b_{l_i},B-1] + \mathrm{SSE}[b_{l_i}+1,n] \}$
Time complexity: $O((B^3/\epsilon^2) \log^3 n)$; space complexity: $O(n)$
Fixed Window Algorithm (cont.)
Example: data sequence {0,0,0,1,1,1,1,1}, $\delta = 1$, $B = 2$, SUM' = SQSUM' = {0,0,0,1,2,3,4,5}
CreateList[1,8,1] ($a=1$, $b=8$, $k=1$), running steps:
- $a=1$, OPT[1,1] = 0. Find the index $c$ such that $\mathrm{OPT}[c,1] \le 0 = (1+\delta)\,\mathrm{OPT}[1,1]$ and $c$ is maximized: $c=3$, so Queue1 = {3}.
- Call CreateList[4,8,1], i.e., CreateList(c+1,b,k): OPT[4,1] = 0.75. Find the index $c$ such that $\mathrm{OPT}[c,1] \le 1.5 = (1+\delta)\,\mathrm{OPT}[4,1]$ and $c$ is maximized: $c=6$, so Queue1 = {3,6}.
- Call CreateList[7,8,1] to get Queue1 = {3,6,8}.
Fixed Window Algorithm (cont.)
OPT[n=8, B=2] is the minimum of the following 3 values:
- OPT[3,1] + SSE[4,8] = 0 + 0 = 0
- OPT[6,1] + SSE[7,8] = 1.5 + 0 = 1.5
- OPT[8,1] = 15/8
The minimum is 0, so the best partition is {(1,3),(4,8)}.
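The example can be replayed with a short sketch. It rests on several assumptions: the precision in the interval test is 1 (so the factor is 2), the largest end point is found by binary search on the premise that OPT[c,k] does not decrease with c, exact fractions are used so the comparisons at 1.5 are not affected by rounding, and the function names only mirror the slide's pseudocode.

```python
from fractions import Fraction


def create_list(opt, a, b, delta):
    """End points of greedy intervals on [a, b]: each interval starting at a
    ends at the largest c with opt(c) <= (1 + delta) * opt(a)."""
    ends = []
    while a <= b:
        bound = (1 + delta) * opt(a)
        lo, hi = a, b
        while lo < hi:  # binary search; assumes opt is non-decreasing
            mid = (lo + hi + 1) // 2
            if opt(mid) <= bound:
                lo = mid
            else:
                hi = mid - 1
        ends.append(lo)
        a = lo + 1
    return ends


def fixed_window_histogram(window, B, delta):
    """Return (approximate OPT[n,B], end points of queue B-1, or None if B = 1)."""
    n = len(window)
    total, squares = [0], [0]
    for v in window:
        total.append(total[-1] + v)
        squares.append(squares[-1] + v * v)

    def sse(a, b):  # 1-based, inclusive; an empty range costs nothing
        if a > b:
            return Fraction(0)
        s = total[b] - total[a - 1]
        return squares[b] - squares[a - 1] - Fraction(s * s, b - a + 1)

    opt = lambda c: sse(1, c)  # OPT[c,1]
    ends = None
    for _ in range(1, B):
        ends = create_list(opt, 1, n, delta)
        vals = {e: opt(e) for e in ends}

        def next_opt(c, ends=ends, vals=vals, prev=opt):
            # Last bucket starts right after a stored end point e < c;
            # prev(c) covers the end point e = c (empty last bucket).
            return min([prev(c)] + [vals[e] + sse(e + 1, c) for e in ends if e < c])

        opt = next_opt
    return opt(n), ends


error, queue1 = fixed_window_histogram([0, 0, 0, 1, 1, 1, 1, 1], B=2, delta=1)
print(queue1)  # [3, 6, 8]
print(error)   # 0
```

The three candidates evaluated for OPT[8,2] are exactly the ones listed above: end points 3 and 6 from the queue, and end point 8 through the `prev(c)` term.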
Experimental Evaluation
- Tests:
  - construction performance
  - accuracy of the fixed-window algorithm when evaluating range-sum queries
- Measures:
  - construction performance: time
  - accuracy: average results
- Data: real data sets extracted from AT&T data warehouses
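Accuracy test for various $\epsilon$ and $B$
Conclusion:
- For the fixed-window histogram, accuracy improves as $\epsilon$ decreases and $B$ increases.
- The fixed-window histogram outperforms the wavelet-based histogram.
[Figure: accuracy comparison; series: Exact, Histogram, Wavelets]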
Construction time for various $\epsilon$ and $B$
Conclusion:
- The wavelet-based method is much slower than the fixed-window histogram (so it is not shown here).
- Construction time grows as $B$ increases or $\epsilon$ decreases.
Conclusion
- Background knowledge on data streams
- Three algorithms for constructing optimal or $\epsilon$-approximate histograms in different scenarios
- Other related work:
  - new operators over a data stream
  - operations over multiple data streams
  - sketch techniques, query optimization, etc.
References 1
[GK02] Sudipto Guha and Nick Koudas. Approximating a Data Stream for Querying and Estimation: Algorithms and Performance Evaluation. In ICDE'02.
[GKS01] Sudipto Guha, Nick Koudas and Kyuseok Shim. Data-Streams and Histograms. In STOC'01, pages 471-475.
[IP95] Yannis E. Ioannidis and Viswanath Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In SIGMOD'95, pages 233-244.
[JKM+98] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik and Torsten Suel. Optimal Histograms with Quality Guarantees. In VLDB'98, pages 275-286.
References 2
[BBD+02] B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems. In PODS'02, pages 1-16.
[DGG+02] A. Dobra, M. Garofalakis, J. Gehrke and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. In SIGMOD'02, pages 61-72.
[Gib01] P. B. Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. In VLDB'01, pages 541-550.
[GK01] M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD'01, pages 58-66.
[MM02] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB'02, pages 346-357.
[KNV03] Jaewoo Kang, J. F. Naughton and Stratis D. Viglas. Evaluating Window Joins over Unbounded Streams. In ICDE'03.
Thank you!