An Improved Data Stream Summary:
The Count-Min Sketch and its Applications
Graham Cormode, S. Muthukrishnan (2003)
Data Stream Model
We consider the vector $a(t) = (a_1(t), \ldots, a_i(t), \ldots, a_n(t))$, initially $a_i(0) = 0$ for all $i$.
The $t$-th update is $(i_t, c_t)$:
$a_{i_t}(t) = a_{i_t}(t-1) + c_t$
$a_{i'}(t) = a_{i'}(t-1)$ for $i' \neq i_t$
Count-Min Sketch
A Count-Min (CM) sketch with parameters $(\varepsilon, \delta)$ is represented by a two-dimensional array of counts with width $w$ and depth $d$: $count[1,1], \ldots, count[d,w]$.
Given parameters $(\varepsilon, \delta)$, set $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$.
Each entry of the array is initially zero.
$d$ hash functions $h_1, \ldots, h_d : \{1, \ldots, n\} \to \{1, \ldots, w\}$ are chosen uniformly at random from a pairwise independent family.
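The parameter choice above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the names `cm_dimensions` and `make_hashes`, the modulus `P`, and the hash form $((ax + b) \bmod P) \bmod w$ are assumptions (a commonly used pairwise-independent-style family), and indices are 0-based rather than the 1-based indices of the slides.

```python
import math
import random

P = 2**31 - 1  # a Mersenne prime; assumed to exceed the universe size n


def cm_dimensions(eps, delta):
    """Return (w, d) with w = ceil(e / eps) and d = ceil(ln(1 / delta))."""
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1.0 / delta))
    return w, d


def make_hashes(d, w, seed=0):
    """Draw d functions h(x) = ((a*x + b) mod P) mod w, mapping into 0..w-1."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
    # bind a and b as default arguments so each lambda keeps its own pair
    return [lambda x, a=a, b=b: ((a * x + b) % P) % w for a, b in params]


w, d = cm_dimensions(eps=0.01, delta=0.01)
hashes = make_hashes(d, w)
print(w, d)  # 272 5
```

With $\varepsilon = \delta = 0.01$ the sketch needs only $272 \times 5$ counters, independent of $n$.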
Update procedure:
When $(i_t, c_t)$ arrives, set for all $1 \le j \le d$:
$count[j, h_j(i_t)] \leftarrow count[j, h_j(i_t)] + c_t$
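[Figure: the item $i_t$ is mapped by $h_1, \ldots, h_d$ to one cell in each row of the array, and $c_t$ is added to each of those cells.]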
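The update procedure can be sketched as a small Python class. The class name `CountMinSketch`, the seed handling, and the hash family are illustrative assumptions; only the rule "add $c_t$ to one counter per row" comes from the slides.

```python
import random

P = 2**31 - 1  # prime modulus for the illustrative hash family (assumption)


class CountMinSketch:
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        # d rows of w counters, all initially zero
        self.count = [[0] * w for _ in range(d)]
        # one (a, b) pair per row: h_j(i) = ((a*i + b) mod P) mod w
        self.ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]

    def h(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c):
        """Process the update (i_t, c_t): add c to one counter in every row."""
        for j in range(self.d):
            self.count[j][self.h(j, i)] += c


cms = CountMinSketch(w=272, d=5)
for item, c in [(3, 2), (8, 1), (3, 4)]:
    cms.update(item, c)
# every row has absorbed the full update mass 2 + 1 + 4 = 7
print([sum(row) for row in cms.count])  # [7, 7, 7, 7, 7]
```

Each update touches exactly $d$ counters, which is where the $O(\ln(1/\delta))$ update time comes from.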
Approximate Query Answering Using CM Sketches
point query $Q(i)$: approx. $a_i$
range queries $Q(l, r)$: approx. $\sum_{i=l}^{r} a_i$
inner product queries $Q(a, b)$: approx. $a \odot b = \sum_{i=1}^{n} a_i b_i$
Point Query
$Q(i)$: $\hat{a}_i = \min_j count[j, h_j(i)]$
Non-negative case ($a_i(t) \ge 0$)
Theorem 1: $a_i \le \hat{a}_i$, and $P[\hat{a}_i \le a_i + \varepsilon \|a\|_1] \ge 1 - \delta$
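PROOF: We introduce indicator variables
$I_{i,j,k} = 1$ if $(i \neq k) \wedge (h_j(i) = h_j(k))$, and $0$ otherwise.
$E(I_{i,j,k}) = \Pr[h_j(i) = h_j(k)] \le \frac{1}{w} \le \frac{\varepsilon}{e}$
Define the variable $X_{i,j} = \sum_{k=1}^{n} I_{i,j,k}\, a_k$.
By construction, $count[j, h_j(i)] = a_i + X_{i,j}$, so $\min_j count[j, h_j(i)] \ge a_i$.
For the other direction, observe that
$E(X_{i,j}) = E\left(\sum_{k=1}^{n} I_{i,j,k}\, a_k\right) = \sum_{k=1}^{n} a_k\, E(I_{i,j,k}) \le \frac{\varepsilon}{e} \|a\|_1$
$\Pr[\hat{a}_i > a_i + \varepsilon \|a\|_1] = \Pr[\forall j.\ count[j, h_j(i)] > a_i + \varepsilon \|a\|_1]$
$= \Pr[\forall j.\ a_i + X_{i,j} > a_i + \varepsilon \|a\|_1]$
$\le \Pr[\forall j.\ X_{i,j} > e\, E(X_{i,j})] < e^{-d} \le \delta$
Markov inequality: $\Pr[X \ge t] \le \frac{E(X)}{t}$ for all $t > 0$
■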
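Time to produce the estimate: $O\left(\ln \frac{1}{\delta}\right)$
Time for updates: $O\left(\ln \frac{1}{\delta}\right)$
Space used: $O\left(\frac{1}{\varepsilon} \ln \frac{1}{\delta}\right)$
Remark: The constant $e$ is used here to minimize the space used.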
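The point query for the non-negative case can be tried out with a deliberately small sketch, so that collisions really occur. Everything here (the sizes `w = 20`, `d = 4`, the seed, the hash family, the helper names) is an illustrative assumption; the check at the end only relies on the deterministic half of Theorem 1, $a_i \le \hat{a}_i$.

```python
import random

P = 2**31 - 1
rng = random.Random(1)
w, d = 20, 4  # deliberately small so that collisions actually happen
ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
count = [[0] * w for _ in range(d)]


def h(j, i):
    a, b = ab[j]
    return ((a * i + b) % P) % w


def update(i, c):
    for j in range(d):
        count[j][h(j, i)] += c


def point_query(i):
    """Estimate a_i as the minimum over rows of count[j][h_j(i)]."""
    return min(count[j][h(j, i)] for j in range(d))


true = {}
for _ in range(500):
    i = rng.randrange(100)  # items drawn from {0, ..., 99}
    update(i, 1)
    true[i] = true.get(i, 0) + 1

errors = [point_query(i) - true.get(i, 0) for i in range(100)]
print(min(errors) >= 0)  # True: estimates never fall below the true count
```

Taking the minimum works because every counter holds $a_i$ plus non-negative collision mass, so the least-polluted row is the best estimate.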
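General case
$Q(i)$: $\hat{a}_i = \mathrm{median}_j\, count[j, h_j(i)]$
Theorem 2: $\Pr[a_i - 3\varepsilon \|a\|_1 \le \hat{a}_i \le a_i + 3\varepsilon \|a\|_1] \ge 1 - \delta^{1/4}$
PROOF:
$E(|count[j, h_j(i)] - a_i|) = E(|X_{i,j}|) \le \frac{\varepsilon}{e} \|a\|_1$
$\Pr(|count[j, h_j(i)] - a_i| \ge 3\varepsilon \|a\|_1) \le \frac{E(|X_{i,j}|)}{3\varepsilon \|a\|_1} \le \frac{\varepsilon \|a\|_1 / e}{3\varepsilon \|a\|_1} = \frac{1}{3e} < \frac{1}{8}$
Chernoff bounds: $\Pr(|\hat{a}_i - a_i| > 3\varepsilon \|a\|_1) < \delta^{1/4}$
■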
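For the general (turnstile) case, where updates may be negative, the minimum is replaced by the median. A minimal sketch follows, with the same assumed hash family and hypothetical helper names as before. The tiny stream is chosen so the answer is exact: item 9 is inserted and then fully deleted, so it adds nothing to any counter.

```python
import random
import statistics

P = 2**31 - 1
rng = random.Random(2)
w, d = 64, 5
ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
count = [[0] * w for _ in range(d)]


def h(j, i):
    a, b = ab[j]
    return ((a * i + b) % P) % w


def update(i, c):
    for j in range(d):
        count[j][h(j, i)] += c


def point_query_median(i):
    """Estimate a_i as the median over rows of count[j][h_j(i)]."""
    return statistics.median(count[j][h(j, i)] for j in range(d))


# inserts and deletions: a_7 ends at 5 - 2 = 3, a_9 ends at 4 - 4 = 0
for i, c in [(7, 5), (9, 4), (9, -4), (7, -2)]:
    update(i, c)
print(point_query_median(7))  # 3
```

The minimum is no longer safe here because colliding items with negative counts can push a counter below $a_i$; the median tolerates errors in either direction.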
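Theorem 3: $a \odot b \le \widehat{a \odot b}$, and $\Pr[\widehat{a \odot b} \le a \odot b + \varepsilon \|a\|_1 \|b\|_1] \ge 1 - \delta$
Here $\widehat{a \odot b} = \min_j \widehat{(a \odot b)}_j$, where $\widehat{(a \odot b)}_j = \sum_{k=1}^{w} count_a[j,k] \cdot count_b[j,k]$ is computed from sketches of $a$ and $b$ built with the same hash functions.
PROOF:
$\widehat{(a \odot b)}_j = \sum_{i=1}^{n} a_i b_i + \sum_{p \neq q,\ h_j(p) = h_j(q)} a_p b_q$
so $a \odot b \le \widehat{(a \odot b)}_j$ for non-negative vectors.
$E(\widehat{(a \odot b)}_j - a \odot b) = \sum_{p \neq q} \Pr[h_j(p) = h_j(q)]\, a_p b_q \le \sum_{p \neq q} \frac{\varepsilon\, a_p b_q}{e} \le \frac{\varepsilon \|a\|_1 \|b\|_1}{e}$
Markov inequality ($\Pr[X \ge t] \le \frac{E(X)}{t}$ for all $t > 0$): $\Pr[\widehat{a \odot b} - a \odot b > \varepsilon \|a\|_1 \|b\|_1] \le \delta$
■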
The application of inner-product computation to join size estimation (where the vectors generated have non-negative entries)
Join size of 2 database relations on a particular attribute:
$a \odot b$ = the number of items in the cartesian product of the 2 relations which agree on the value of that attribute
$a_i, b_i$: the number of tuples which have value $i$, for $i \in \{1, \ldots, n\}$
Corollary 1: The join size of two relations on a particular attribute can be approximated up to $\varepsilon \|a\|_1 \|b\|_1$ with probability $1 - \delta$ by keeping space $O\left(\frac{1}{\varepsilon} \log \frac{1}{\delta}\right)$.
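The join size estimate can be illustrated with two small relations summarized by sketches that share the same hash functions. The helper names `sketch` and `inner_product`, the sizes, and the toy frequency vectors are assumptions for illustration; the final check uses only the one-sided guarantee for non-negative vectors.

```python
import random

P = 2**31 - 1
rng = random.Random(3)
w, d = 32, 4
ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]


def h(j, i):
    a, b = ab[j]
    return ((a * i + b) % P) % w


def sketch(vector):
    """Build a d x w sketch of {attribute value: tuple count} with shared hashes."""
    count = [[0] * w for _ in range(d)]
    for i, c in vector.items():
        for j in range(d):
            count[j][h(j, i)] += c
    return count


def inner_product(count_a, count_b):
    """Minimum over rows of the row-wise dot product of the two sketches."""
    return min(
        sum(x * y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(count_a, count_b)
    )


a = {1: 3, 2: 1, 5: 2}  # relation A: attribute value -> number of tuples
b = {1: 2, 5: 4, 9: 7}  # relation B
exact = sum(c * b.get(i, 0) for i, c in a.items())  # 3*2 + 2*4 = 14
est = inner_product(sketch(a), sketch(b))
print(exact, est >= exact)  # 14 True
```

Both relations must be sketched with identical $h_j$; otherwise matching attribute values would not land in the same cells and the row products would be meaningless.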
Range Query
Dyadic range: $[x 2^y + 1 \ldots (x+1) 2^y]$ for parameters $x, y$
A range query is reduced to (at most) $2 \log_2 n$ dyadic range queries, each answered by a single point query
For each set of dyadic ranges of length $2^y$, $y = 0, \ldots, \log_2 n - 1$, a sketch is kept: $\log_2 n$ CM sketches
$Q(l, r)$:
Compute the dyadic ranges (at most $2 \log_2 n$) which canonically cover the range
Pose that many point queries to the sketches
Sum of queries = $\hat{a}[l, r]$
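The canonical dyadic cover can be computed greedily: starting from $l$, repeatedly take the largest aligned dyadic block that still fits inside the range. The function name `dyadic_cover` and the `(y, x)` return convention are assumptions for illustration. Each returned piece would then be answered by one point query for item $x$ in the sketch kept for level $y$.

```python
def dyadic_cover(l, r):
    """Split [l, r] (1-based, inclusive) into disjoint dyadic ranges.

    Each piece is returned as (y, x), denoting [x * 2**y + 1, (x + 1) * 2**y].
    """
    pieces = []
    while l <= r:
        y = 0
        # grow the block while it stays aligned at l and inside [l, r]
        while (l - 1) % (2 ** (y + 1)) == 0 and l + 2 ** (y + 1) - 1 <= r:
            y += 1
        pieces.append((y, (l - 1) >> y))
        l += 2 ** y
    return pieces


# [3, 12] = [3, 4] + [5, 8] + [9, 12]
print(dyadic_cover(3, 12))  # [(1, 1), (2, 1), (2, 2)]
```

A range of length 10 is covered by three dyadic pieces here, so the range sum costs three point queries instead of ten.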
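Theorem 4: $a[l, r] \le \hat{a}[l, r]$, and $\Pr[\hat{a}[l, r] \le a[l, r] + 2\varepsilon \log n \|a\|_1] \ge 1 - \delta$
Proof: Theorem 1 ($a_i \le \hat{a}_i$) implies $a[l, r] \le \hat{a}[l, r]$.
E(Σ error for each estimator) $\le 2 \log n \cdot$ E(error for each estimator) $\le 2 \log n \cdot \frac{\varepsilon}{e} \|a\|_1$
$\Pr[\hat{a}[l, r] - a[l, r] > 2\varepsilon \log n \|a\|_1] < e^{-d} \le \delta$
■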
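Time to produce the estimate: $O\left(\log(n) \log \frac{1}{\delta}\right)$
Time for updates: $O\left(\log(n) \log \frac{1}{\delta}\right)$
Space used: $O\left(\frac{\log(n)}{\varepsilon} \log \frac{1}{\delta}\right)$
Remark: the guarantee is more useful when stated without the $\log n$ term in the approximation bound.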
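Quantiles in the Turnstile Model
Quantiles: items with rank $k\phi \|a\|_1$ for $k = 0, \ldots, \frac{1}{\phi}$
(approx.: rank $\ge (k\phi - \varepsilon) \|a\|_1$ and rank $\le (k\phi + \varepsilon) \|a\|_1$)
Do binary searches for ranges $1 \ldots r$ whose range sum $a[1, r]$ is $k\phi \|a\|_1$, for $1 \le k \le \frac{1}{\phi} - 1$
Theorem 5: $\varepsilon$-approximate $\phi$-quantiles can be found with probability at least $1 - \delta$ by keeping a data structure with space $O\left(\frac{1}{\varepsilon} \log^2(n) \log\left(\frac{\log n}{\phi\delta}\right)\right)$.
The time for each insert or delete operation is $O\left(\log(n) \log\left(\frac{\log n}{\phi\delta}\right)\right)$, and the time
to find each quantile on demand is $O\left(\log(n) \log\left(\frac{\log n}{\phi\delta}\right)\right)$.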
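The binary search over prefix range sums can be sketched as below. To keep the example deterministic, an exact range-sum function stands in for the sketch-based range query; that stand-in, the name `find_rank`, and the toy vector are all assumptions. With approximate range sums the same search applies, but the returned rank is only accurate up to the error bound of the range query.

```python
def find_rank(range_sum, n, target):
    """Smallest r in 1..n with range_sum(1, r) >= target, by binary search."""
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if range_sum(1, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


# a[i] = multiplicity of value i for i = 1..8 (index 0 is unused padding)
a = [0, 4, 0, 1, 3, 0, 2, 0, 0]


def exact_range_sum(l, r):
    # stand-in for the CM-sketch range query Q(l, r)
    return sum(a[l:r + 1])


norm = sum(a)  # ||a||_1 = 10
phi = 0.25
quantiles = [find_rank(exact_range_sum, 8, k * phi * norm)
             for k in range(1, int(1 / phi))]
print(quantiles)  # [1, 3, 4]
```

Each search issues about $\log_2 n$ range queries, which matches the per-quantile query cost having a $\log(n)$ factor on top of the range-query cost.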
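Heavy Hitters (cash register case)
Heavy hitters: items whose multiplicity exceeds the fraction $\phi$: $a_i \ge \phi \|a\|_1$
(approx.: $a_i \ge (\phi - \varepsilon) \|a\|_1$)
When $(i_t, c_t)$ arrives, pose $Q(i_t)$; if $\hat{a}_{i_t} \ge \phi \|a(t)\|_1$, then $i_t$ is added to a heap
Theorem 6: The heavy hitters can be found from an inserts-only sequence of
length $\|a\|_1$ by using CM sketches with space $O\left(\frac{1}{\varepsilon} \log \frac{\|a\|_1}{\delta}\right)$, and time
$O\left(\log \frac{\|a\|_1}{\delta}\right)$ per item. Every item which occurs with count more than
$\phi \|a\|_1$ times is output, and with probability $1 - \delta$, no item whose count is less than
$(\phi - \varepsilon) \|a\|_1$ is output.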
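The heap-based procedure can be sketched as follows for unit insertions. This is a simplified illustration under assumptions: the heap is ordered by the estimate at insertion time, entries are evicted lazily once their current estimate falls below the threshold, and the names, sizes, and stream are made up for the example.

```python
import heapq
import random

P = 2**31 - 1
rng = random.Random(4)
w, d = 64, 4
ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
count = [[0] * w for _ in range(d)]


def h(j, i):
    a, b = ab[j]
    return ((a * i + b) % P) % w


def update(i, c):
    for j in range(d):
        count[j][h(j, i)] += c


def point_query(i):
    return min(count[j][h(j, i)] for j in range(d))


def heavy_hitters(stream, phi):
    total = 0        # ||a(t)||_1 for a stream of unit insertions
    heap = []        # min-heap of (estimate when pushed, item)
    members = set()  # items currently in the heap
    for i in stream:
        update(i, 1)
        total += 1
        est = point_query(i)
        if est >= phi * total and i not in members:
            heapq.heappush(heap, (est, i))
            members.add(i)
        # evict heap roots whose current estimate is below the threshold
        while heap and point_query(heap[0][1]) < phi * total:
            _, gone = heapq.heappop(heap)
            members.discard(gone)
    # final scan: keep only items still above the threshold
    return sorted(i for _, i in heap if point_query(i) >= phi * total)


stream = [1] * 50 + list(range(2, 52))  # item 1 is half of the stream
rng.shuffle(stream)
result = heavy_hitters(stream, phi=0.3)
print(1 in result)  # True
```

Because estimates never underestimate in the inserts-only case, a true heavy hitter always passes the threshold test at its last occurrence and survives the final scan; only false positives are subject to the probabilistic guarantee.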
Sketching techniques
tug-of-war Alon, Matias and Szegedy (1996)
Count sketch Charikar, Chen and Farach-Colton (2002)
Random subset sums Gilbert, Kotidis, Muthukrishnan and Strauss (2002)
Count-min sketch Cormode and Muthukrishnan (2003)
- Linear projections of the vector $a$ with appropriately chosen random vectors
Computation:
Sketch: array of size $w \times d$
$h_1, \ldots, h_d : \{1, \ldots, n\} \to \{1, \ldots, w\}$: pairwise independent hash functions
$g_1, \ldots, g_d$: hash functions whose range and randomness vary
The $(j, k)$-th entry of the sketch: $\sum_{i : h_j(i) = k} a_i\, g_j(i)$
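tug-of-war: $w = 1$, $d = O\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$, $g_j(i)$ is $\{-1, +1\}$ with 4-wise independence
Count sketch: $w = O\left(\frac{1}{\varepsilon^2}\right)$, $d = O\left(\log \frac{1}{\delta}\right)$, $g_j(i)$ is $\{-1, +1\}$ with 2-wise independence
Random subset sums: $w = 2$, $d = \frac{24}{\varepsilon^2} \log \frac{1}{\delta}$, $g_j(i)$ is $1$
Count-min sketch: $w = \frac{e}{\varepsilon}$, $d = \ln \frac{1}{\delta}$, $g_j(i)$ is $1$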
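Method              Query          Space                       Update Time                 Query Time                  Randomness Needed
Tug-of-war          Inner-product  $1/\varepsilon^2$           $1/\varepsilon^2$           $1/\varepsilon^2$           4-wise
Tug-of-war          Point          $\log(n)/\varepsilon^2$     $\log(n)/\varepsilon^2$     $\log(n)/\varepsilon^2$     4-wise
                    Range          $\log(n)/\varepsilon^2$     $\log(n)/\varepsilon^2$     $\log(n)/\varepsilon^2$     4-wise
Random subset-sums  Range          $\log^2(n)/\varepsilon^2$   $\log^2(n)/\varepsilon^2$   $\log^2(n)/\varepsilon^2$   Pairwise
Count sketches      Point          $1/\varepsilon^2$           $1$                         $1$                         Pairwise
Count-Min sketches  Point          $1/\varepsilon$             $1$                         $1$                         Pairwise
                    Inner-product  $1/\varepsilon$             $1$                         $1/\varepsilon$             Pairwise
                    Range          $\log(n)/\varepsilon$       $\log(n)$                   $\log(n)$                   Pairwise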