An Improved Data Stream Summary:

The Count-Min Sketch and its Applications

Graham Cormode, S. Muthukrishnan (2003)

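Data Stream Model

We consider the vector $a(t) = (a_1(t), \ldots, a_i(t), \ldots, a_n(t))$, initially $a_i(0) = 0$ for all $i$.

The $t$-th update is $(i_t, c_t)$:

\[ a_{i_t}(t) = a_{i_t}(t-1) + c_t \]

\[ a_{i'}(t) = a_{i'}(t-1) \quad \text{for all } i' \neq i_t \]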
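The update rule of the data stream model can be illustrated with a tiny exact (non-sketched) simulation. This is only a sketch under stated assumptions: the function name `apply_updates` and the sample stream are hypothetical, and items are taken to be the integers 1..n.

```python
def apply_updates(n, updates):
    """Apply a stream of (i_t, c_t) updates to a vector a of length n."""
    a = [0] * (n + 1)          # a_i(0) = 0; index 0 is unused so items are 1..n
    for i_t, c_t in updates:
        a[i_t] += c_t          # only coordinate i_t changes at time t
    return a[1:]


# Hypothetical stream: negative c_t values are allowed in this model.
stream = [(1, 5), (3, 2), (1, -1), (4, 7)]
print(apply_updates(4, stream))  # [4, 0, 2, 7]
```

Storing the vector exactly costs space linear in n, which is what the sketch is meant to avoid.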

Count-Min Sketch

A Count-Min (CM) Sketch with parameters $(\varepsilon, \delta)$ is represented by a two-dimensional array of counts with width $w$ and depth $d$: $count[1,1], \ldots, count[d,w]$.

Given parameters $(\varepsilon, \delta)$, set $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$.

Each entry of the array is initially zero.

$d$ hash functions $h_1, \ldots, h_d : \{1, \ldots, n\} \to \{1, \ldots, w\}$ are chosen uniformly at random from a pairwise independent family.
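As a quick worked example of the parameter choice, the following snippet computes the array dimensions from the accuracy and failure-probability parameters. The helper name `cm_dimensions` is hypothetical.

```python
import math


def cm_dimensions(eps, delta):
    """Return (w, d) with w = ceil(e / eps) and d = ceil(ln(1 / delta))."""
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1 / delta))
    return w, d


print(cm_dimensions(0.01, 0.01))  # (272, 5)
print(cm_dimensions(0.1, 0.05))   # (28, 3)
```

With 1% error and 1% failure probability the sketch therefore holds 272 x 5 = 1360 counters, independent of n.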

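Update procedure:

When $(i_t, c_t)$ arrives, set, for all $1 \le j \le d$,

\[ count[j, h_j(i_t)] \leftarrow count[j, h_j(i_t)] + c_t \]

[Figure: the item $i_t$ is mapped by $h_1, \ldots, h_d$ to one cell in each of the $d$ rows, and $c_t$ is added to each of those cells.]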
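The update procedure might be sketched in Python as follows. The class name `CountMinSketch`, the prime `P`, and the hash family $h_j(i) = ((a_j i + b_j) \bmod P) \bmod w$ are assumptions made for illustration: this family is a common stand-in for a pairwise independent one, and rows and columns are 0-based here rather than 1-based.

```python
import math
import random

P = 2**31 - 1  # a Mersenne prime; item ids are assumed to be smaller than P


class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        rng = random.Random(seed)
        # One (a, b) pair per row defines h_j(i) = ((a*i + b) mod P) mod w.
        self.ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.d)]
        self.count = [[0] * self.w for _ in range(self.d)]

    def h(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c=1):
        # Add c to exactly one cell in each of the d rows.
        for j in range(self.d):
            self.count[j][self.h(j, i)] += c


cms = CountMinSketch(eps=0.1, delta=0.05)
for i_t, c_t in [(3, 2), (7, 1), (3, 4)]:
    cms.update(i_t, c_t)
print(cms.w, cms.d)                     # 28 3
print([sum(row) for row in cms.count])  # [7, 7, 7]
```

Because every update touches one cell per row, each row always sums to the total of all $c_t$ seen so far.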

Approximate Query Answering Using CM Sketches

point query $Q(i)$: approx. $a_i$

range queries $Q(l, r)$: approx. $\sum_{i=l}^{r} a_i$

inner product queries $Q(a, b)$: approx. $a \odot b = \sum_{i=1}^{n} a_i b_i$

Point Query

Non-negative case ($a_i(t) \ge 0$)

$Q(i)$: $\hat{a}_i = \min_j count[j, h_j(i)]$

Theorem 1: $a_i \le \hat{a}_i$, and $\Pr[\hat{a}_i \le a_i + \varepsilon \|a\|_1] \ge 1 - \delta$.

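PROOF: We introduce indicator variables

\[ I_{i,j,k} = \begin{cases} 1 & \text{if } (i \neq k) \wedge (h_j(i) = h_j(k)) \\ 0 & \text{otherwise} \end{cases} \]

\[ E(I_{i,j,k}) = \Pr[h_j(i) = h_j(k)] \le \frac{1}{w} \le \frac{\varepsilon}{e} \]

Define the variable

\[ X_{i,j} = \sum_{k=1}^{n} I_{i,j,k}\, a_k \]

By construction, $count[j, h_j(i)] = a_i + X_{i,j}$, so $\min_j count[j, h_j(i)] \ge a_i$.

For the other direction, observe that

\[ E(X_{i,j}) = E\Big(\sum_{k=1}^{n} I_{i,j,k}\, a_k\Big) \le \sum_{k=1}^{n} a_k\, E(I_{i,j,k}) \le \frac{\varepsilon}{e} \|a\|_1 \]

Markov inequality: $\Pr[X \ge t] \le E(X)/t$ for $t > 0$. Hence

\[ \Pr[\hat{a}_i > a_i + \varepsilon \|a\|_1] = \Pr[\forall j.\ count[j, h_j(i)] > a_i + \varepsilon \|a\|_1] \]

\[ = \Pr[\forall j.\ a_i + X_{i,j} > a_i + \varepsilon \|a\|_1] \]

\[ \le \Pr[\forall j.\ X_{i,j} > e\, E(X_{i,j})] < e^{-d} \le \delta \]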

Time to produce the estimate: $O(\ln \frac{1}{\delta})$

Time for updates: $O(\ln \frac{1}{\delta})$

Space used: $O(\frac{1}{\varepsilon} \ln \frac{1}{\delta})$

Remark: The constant $e$ is used here to minimize the space used.
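The point query for the non-negative case can be tried out with the snippet below. It is a sketch under assumptions: the class name, the prime `P`, the hash family, and the random test stream are all illustrative choices, not taken from the slides.

```python
import math
import random
from collections import Counter

P = 2**31 - 1  # a Mersenne prime; item ids are assumed to be smaller than P


class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        rng = random.Random(seed)
        self.ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.d)]
        self.count = [[0] * self.w for _ in range(self.d)]

    def h(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c=1):
        for j in range(self.d):
            self.count[j][self.h(j, i)] += c

    def query(self, i):
        # Collisions only add non-negative mass, so the smallest cell is the best guess.
        return min(self.count[j][self.h(j, i)] for j in range(self.d))


rng = random.Random(1)
stream = [rng.randint(1, 1000) for _ in range(5000)]  # hypothetical insert-only stream
exact = Counter(stream)

cms = CountMinSketch(eps=0.01, delta=0.01)
for i in stream:
    cms.update(i)

worst = max(cms.query(i) - exact[i] for i in exact)
print("largest overestimate:", worst, "| eps * ||a||_1 =", 0.01 * len(stream))
```

The estimate never falls below the true count; the upper bound of Theorem 1 holds only with probability at least $1 - \delta$ per query, so the printed overestimate is expected, not guaranteed, to stay below $\varepsilon \|a\|_1$.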

General case

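$Q(i)$: $\hat{a}_i = \operatorname{median}_j\, count[j, h_j(i)]$

Theorem 2: $\Pr[a_i - 3\varepsilon \|a\|_1 \le \hat{a}_i \le a_i + 3\varepsilon \|a\|_1] \ge 1 - \delta^{1/4}$.

PROOF:

\[ E(|count[j, h_j(i)] - a_i|) = E(|X_{i,j}|) \le \frac{\varepsilon}{e} \|a\|_1 \]

\[ \Pr(|count[j, h_j(i)] - a_i| \ge 3\varepsilon \|a\|_1) \le \frac{E(|X_{i,j}|)}{3\varepsilon \|a\|_1} \le \frac{1}{3e} < \frac{1}{8} \]

Applying Chernoff bounds over the $d$ rows,

\[ \Pr(|\hat{a}_i - a_i| > 3\varepsilon \|a\|_1) < \delta^{1/4} \]

Time to produce the estimate: $O(\ln \frac{1}{\delta})$

Time for updates: $O(\ln \frac{1}{\delta})$

Space used: $O(\frac{1}{\varepsilon} \ln \frac{1}{\delta})$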
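When updates can make entries negative, the minimum is no longer a safe estimator, and the median over rows is used instead. The snippet below is an illustrative sketch: the class name, the method name `query_median`, the prime `P`, and the hash family are assumptions.

```python
import math
import random
import statistics

P = 2**31 - 1  # a Mersenne prime; item ids are assumed to be smaller than P


class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        rng = random.Random(seed)
        self.ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.d)]
        self.count = [[0] * self.w for _ in range(self.d)]

    def h(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c=1):
        for j in range(self.d):
            self.count[j][self.h(j, i)] += c

    def query_median(self, i):
        # Collisions may push a row up or down, so take the median of the d rows.
        return statistics.median(self.count[j][self.h(j, i)] for j in range(self.d))


cms = CountMinSketch(eps=0.01, delta=0.01)
for i_t, c_t in [(7, 5), (9, -4), (7, -2), (12, 6)]:  # hypothetical mixed-sign updates
    cms.update(i_t, c_t)
print(cms.query_median(7), cms.query_median(9), cms.query_median(12))
```

Note that `statistics.median` averages the two middle values when `d` is even, so choosing $\delta$ to give an odd depth keeps the estimate an actual cell value.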

Inner Product Query

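$Q(a, b)$: Set

\[ (a \odot b)_j = \sum_{k=1}^{w} count_a[j, k] \cdot count_b[j, k] \]

\[ \widehat{a \odot b} = \min_j\, (a \odot b)_j \]

Theorem 3: $a \odot b \le \widehat{a \odot b}$, and $\Pr[\widehat{a \odot b} \le a \odot b + \varepsilon \|a\|_1 \|b\|_1] \ge 1 - \delta$.

PROOF:

\[ (a \odot b)_j = \sum_{i=1}^{n} a_i b_i + \sum_{p \neq q,\ h_j(p) = h_j(q)} a_p b_q \]

so $a \odot b \le (a \odot b)_j$ for non-negative vectors, and therefore $a \odot b \le \widehat{a \odot b}$.

\[ E((a \odot b)_j - a \odot b) = \sum_{p \neq q} \Pr[h_j(p) = h_j(q)]\, a_p b_q \le \sum_{p \neq q} \frac{\varepsilon\, a_p b_q}{e} \le \frac{\varepsilon \|a\|_1 \|b\|_1}{e} \]

Markov inequality: $\Pr[X \ge t] \le E(X)/t$ for $t > 0$. Hence

\[ \Pr[\widehat{a \odot b} - a \odot b > \varepsilon \|a\|_1 \|b\|_1] \le \delta \]

Time to produce the estimate: $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$

Time for updates: $O(\log \frac{1}{\delta})$

Space used: $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$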
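The inner product estimate might look like this in code. Both sketches must use the same hash functions, which is arranged here by giving them the same seed. The class name, the function `inner_product`, the prime `P`, the hash family, and the sample vectors are all assumptions for illustration.

```python
import math
import random

P = 2**31 - 1  # a Mersenne prime; item ids are assumed to be smaller than P


class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        rng = random.Random(seed)
        self.ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.d)]
        self.count = [[0] * self.w for _ in range(self.d)]

    def h(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c=1):
        for j in range(self.d):
            self.count[j][self.h(j, i)] += c


def inner_product(sk_a, sk_b):
    """min over rows j of sum_k count_a[j][k] * count_b[j][k]."""
    assert sk_a.ab == sk_b.ab and sk_a.w == sk_b.w, "sketches must share hashes"
    return min(
        sum(x * y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(sk_a.count, sk_b.count)
    )


a = {1: 4, 2: 1, 5: 3}   # hypothetical non-negative vectors, stored sparsely
b = {1: 2, 5: 5, 9: 7}
sk_a = CountMinSketch(0.05, 0.05, seed=42)
sk_b = CountMinSketch(0.05, 0.05, seed=42)
for i, c in a.items():
    sk_a.update(i, c)
for i, c in b.items():
    sk_b.update(i, c)

exact = sum(a[i] * b.get(i, 0) for i in a)  # 4*2 + 3*5 = 23
print(exact, inner_product(sk_a, sk_b))
```

For non-negative vectors the estimate can only exceed the exact value, matching the one-sided guarantee of Theorem 3.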

Application of inner-product computation to join size estimation (where the vectors generated have non-negative entries)

Join size of 2 database relations on a particular attribute = the number of items in the Cartesian product of the 2 relations which agree on the value of that attribute.

$a_i, b_i$: the number of tuples which have value $i$ in the first and second relation respectively, for $i \in \{1, \ldots, n\}$

Join size $= a \odot b$

Corollary 1: The join size of two relations on a particular attribute can be approximated up to $\varepsilon \|a\|_1 \|b\|_1$ with probability $1 - \delta$ by keeping space $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$.
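The identity behind this application, that the join size equals the inner product of the two frequency vectors, can be checked exactly on a toy example. The relations `R` and `S` below are made up for illustration; no sketch is involved.

```python
from collections import Counter

# Hypothetical relations: R(name, dept) and S(dept, project), joined on dept.
R = [("alice", 1), ("bob", 2), ("carol", 2), ("dave", 3)]
S = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]

a = Counter(dept for _, dept in R)   # a_i = number of tuples of R with value i
b = Counter(dept for dept, _ in S)   # b_i = number of tuples of S with value i

join_size = sum(a[i] * b[i] for i in a)                    # inner product a . b
brute_force = sum(1 for _, u in R for v, _ in S if u == v)  # matching pairs

print(join_size, brute_force)  # 5 5
```

Replacing the two exact counters with CM sketches that share hash functions gives the approximation stated in Corollary 1.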

Range Query

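Dyadic range: $[x 2^y + 1 \ldots (x+1) 2^y]$ for parameters $x, y$

range query → (at most) $2 \log_2 n$ dyadic range queries → each a single point query

For each set of dyadic ranges of length $2^y$, $0 \le y \le \log_2 n - 1$, a sketch is kept: $\log_2 n$ CM Sketches in total.

$Q(l, r)$:

Compute the dyadic ranges (at most $2 \log_2 n$) which canonically cover the range.

Pose that many point queries to the sketches.

Sum of queries $= \hat{a}[l, r]$

Theorem 4: $a[l, r] \le \hat{a}[l, r]$, and $\Pr[\hat{a}[l, r] \le a[l, r] + 2\varepsilon \log n\, \|a\|_1] \ge 1 - \delta$.

Proof: By Theorem 1, $a_i \le \hat{a}_i$, so $a[l, r] \le \hat{a}[l, r]$.

\[ E(\textstyle\sum \text{error for each estimator}) = 2 \log n \cdot E(\text{error for each estimator}) \le 2 \log n\, \frac{\varepsilon}{e} \|a\|_1 \]

\[ \Pr[\hat{a}[l, r] - a[l, r] > 2\varepsilon \log n\, \|a\|_1] < e^{-d} \le \delta \]

Time to produce the estimate: $O(\log(n) \log \frac{1}{\delta})$

Time for updates: $O(\log(n) \log \frac{1}{\delta})$

Space used: $O(\frac{\log(n)}{\varepsilon} \log \frac{1}{\delta})$

Remark: the guarantee will be more useful when stated without terms of $\log n$ in the approximation bound.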
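The dyadic decomposition and the per-level sketches could be put together as follows. This is a sketch under assumptions: n is a power of two, the names `dyadic_cover` and `RangeSketch` are hypothetical, each level gets its own seed, and the hash family is the same illustrative one used for point queries.

```python
import math
import random

P = 2**31 - 1  # a Mersenne prime; keys are assumed to be smaller than P


class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        rng = random.Random(seed)
        self.ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.d)]
        self.count = [[0] * self.w for _ in range(self.d)]

    def h(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c=1):
        for j in range(self.d):
            self.count[j][self.h(j, i)] += c

    def query(self, i):
        return min(self.count[j][self.h(j, i)] for j in range(self.d))


def dyadic_cover(l, r, levels):
    """Cover [l, r] (1-based, inclusive) by pairs (y, x) = [x*2^y + 1, (x+1)*2^y]."""
    out, lo, hi = [], l - 1, r           # work on the half-open 0-based range [lo, hi)
    while lo < hi:
        y = 0
        # Grow the block while it stays aligned at lo and fits inside the range.
        while y + 1 < levels and lo % (1 << (y + 1)) == 0 and lo + (1 << (y + 1)) <= hi:
            y += 1
        out.append((y, lo >> y))
        lo += 1 << y
    return out


class RangeSketch:
    def __init__(self, log_n, eps, delta):
        self.levels = log_n              # one CM sketch per length 2^y, y = 0..log_n - 1
        self.sketches = [CountMinSketch(eps, delta, seed=y) for y in range(log_n)]

    def update(self, i, c=1):
        for y, sk in enumerate(self.sketches):
            sk.update((i - 1) >> y, c)   # index of the length-2^y range containing i

    def range_sum(self, l, r):
        return sum(self.sketches[y].query(x) for y, x in dyadic_cover(l, r, self.levels))


a = [3, 0, 1, 4, 0, 2, 0, 5, 1, 1, 0, 0, 6, 0, 2, 0]  # hypothetical vector, n = 16
rs = RangeSketch(log_n=4, eps=0.05, delta=0.05)
for i, c in enumerate(a, start=1):
    rs.update(i, c)

print(dyadic_cover(3, 6, 3))              # [(1, 1), (1, 2)]
print(sum(a[2:13]), rs.range_sum(3, 13))  # exact value, then the sketch estimate
```

Each update touches one range per level, which is where the extra $\log n$ factor in update time and space comes from.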

Applications of Count-Min Sketches

Quantiles, Heavy Hitters

Quantiles in the Turnstile Model

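$\phi$-quantiles: items with rank $k\phi \|a\|_1$ for $k = 0, \ldots, 1/\phi$

($\varepsilon$-approx.: rank $\ge (k - \varepsilon)\phi \|a\|_1$ and rank $\le (k + \varepsilon)\phi \|a\|_1$)

Do binary searches for ranges $1 \ldots r$ whose range sum $a[1, r]$ is $k\phi \|a\|_1$, for $1 \le k \le \frac{1}{\phi} - 1$.

Theorem 5: $\varepsilon$-approximate $\phi$-quantiles can be found with probability at least $1 - \delta$ by keeping a data structure with space $O(\frac{1}{\varepsilon} \log^2(n) \log \frac{\log n}{\phi\delta})$. The time for each insert or delete operation is $O(\log(n) \log \frac{\log n}{\phi\delta})$, and the time to find each quantile on demand is $O(\log(n) \log \frac{\log n}{\phi\delta})$.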
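The binary search over prefix range sums can be illustrated independently of the sketch. In the snippet below the range sum oracle is computed exactly from a small made-up vector; in the setting of the slides it would be the approximate range query $\hat{a}[1, r]$. The function name `find_quantile` and the data are assumptions.

```python
from itertools import accumulate


def find_quantile(prefix_sum, n, target_rank):
    """Smallest r in 1..n with prefix_sum(r) >= target_rank, by binary search."""
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_sum(mid) >= target_rank:
            hi = mid          # a[1, mid] already reaches the rank: search left
        else:
            lo = mid + 1      # not enough mass yet: search right
    return lo


a = [2, 0, 3, 1, 4]                    # hypothetical counts for items 1..5
prefix = list(accumulate(a))           # [2, 2, 5, 6, 10]
total = prefix[-1]                     # ||a||_1 = 10
exact_range_sum = lambda r: prefix[r - 1]   # stand-in for the sketch's a[1, r]

phi = 0.25
quantiles = [find_quantile(exact_range_sum, len(a), k * phi * total) for k in (1, 2, 3)]
print(quantiles)  # [3, 3, 5]
```

Each search issues about $\log_2 n$ range queries, and each range query in turn costs $O(\log n)$ point queries when answered from dyadic sketches.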

Heavy Hitters (cash register case)

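$\phi$-heavy hitters: items whose multiplicity exceeds the fraction $\phi$, i.e. $a_i \ge \phi \|a\|_1$

(approx.: $a_i \ge (\phi - \varepsilon) \|a\|_1$)

When $(i_t, c_t)$ arrives: pose the point query $Q(i_t)$; if $\hat{a}_{i_t} \ge \phi \|a(t)\|_1$, then $i_t$ is added to a heap.

Theorem 6: The heavy hitters can be found from an inserts-only sequence of length $\|a\|_1$ by using CM sketches with space $O(\frac{1}{\varepsilon} \log \frac{\|a\|_1}{\delta})$, and time $O(\log \frac{\|a\|_1}{\delta})$ per item. Every item which occurs with count more than $\phi \|a\|_1$ times is output, and with probability $1 - \delta$, no item whose count is less than $(\phi - \varepsilon) \|a\|_1$ is output.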
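One way to realize the heap-based procedure for the cash register case is sketched below. The eviction policy (pop the smallest recorded estimate, re-check it against the current threshold) is an assumption about details the slides do not spell out, as are the names `heavy_hitters` and `CountMinSketch`, the prime `P`, and the hash family.

```python
import heapq
import math
import random

P = 2**31 - 1  # a Mersenne prime; item ids are assumed to be smaller than P


class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        rng = random.Random(seed)
        self.ab = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.d)]
        self.count = [[0] * self.w for _ in range(self.d)]

    def h(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % P) % self.w

    def update(self, i, c=1):
        for j in range(self.d):
            self.count[j][self.h(j, i)] += c

    def query(self, i):
        return min(self.count[j][self.h(j, i)] for j in range(self.d))


def heavy_hitters(stream, phi, eps, delta):
    cms = CountMinSketch(eps, delta)
    total = 0                      # running ||a(t)||_1 (inserts only, so c > 0)
    heap, members = [], set()      # min-heap of (recorded estimate, item)
    for i, c in stream:
        cms.update(i, c)
        total += c
        threshold = phi * total
        if i not in members and cms.query(i) >= threshold:
            heapq.heappush(heap, (cms.query(i), i))
            members.add(i)
        # Evict entries whose recorded estimate fell below the growing threshold,
        # unless a fresh point query shows they still qualify.
        while heap and heap[0][0] < threshold:
            _, old = heapq.heappop(heap)
            fresh = cms.query(old)
            if fresh >= threshold:
                heapq.heappush(heap, (fresh, old))
            else:
                members.discard(old)
    return sorted(i for i in members if cms.query(i) >= phi * total)


# Hypothetical stream: item 1 occurs 50 times, item 2 occurs 30 times, 20 singletons.
stream = [(1, 1)] * 50 + [(2, 1)] * 30 + [(k, 1) for k in range(3, 23)]
random.Random(7).shuffle(stream)
result = heavy_hitters(stream, phi=0.2, eps=0.01, delta=0.01)
print(result)
```

Because estimates never underestimate in the insert-only case, a true heavy hitter always passes the threshold test at its last occurrence and is never evicted afterwards; false positives are possible only through hash collisions.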

Sketching techniques

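Tug-of-war: Alon, Matias and Szegedy (1996)

Count sketch: Charikar, Chen and Farach-Colton (2002)

Random subset sums: Gilbert, Kotidis, Muthukrishnan and Strauss (2002)

Count-min sketch: Cormode and Muthukrishnan (2003)

- Linear projections of the vector $a$ with appropriately chosen random vectors

Computation:

Sketch: an array of size $w \times d$

$h_1, \ldots, h_d : \{1, \ldots, n\} \to \{1, \ldots, w\}$: pairwise independent hash functions

$g_1, \ldots, g_d$: hash functions whose range and randomness vary from one construction to another

The $(j, k)$-th entry of the sketch:

\[ \sum_{i :\, h_j(i) = k} a_i\, g_j(i) \]

Tug-of-war: $w = 1$, $d = O(\frac{1}{\varepsilon^2} \log \frac{1}{\delta})$, $g_j(i)$ is $\{+1, -1\}$ with 4-wise independence

Count sketch: $w = O(\frac{1}{\varepsilon^2})$, $d = O(\log \frac{1}{\delta})$, $g_j(i)$ is $\{+1, -1\}$ with 2-wise independence

Random subset sums: $w = 2$, $d = \frac{24}{\varepsilon^2} \log \frac{1}{\delta}$, $g_j(i)$ is 1

Count-min sketch: $w = \frac{e}{\varepsilon}$, $d = \ln \frac{1}{\delta}$, $g_j(i) = 1$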
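The common template, where entry $(j, k)$ is the sum of $a_i g_j(i)$ over items with $h_j(i) = k$, can be written once and instantiated with different $g$. In this sketch the hash and sign functions are plain random lookup tables (an assumption made to keep the example short, not a construction from the slides), and the function names are hypothetical.

```python
import random


def make_tables(n, w, d, seed=0):
    """Random lookup tables standing in for the hash functions h_j and sign functions g_j."""
    rng = random.Random(seed)
    h = [[rng.randrange(w) for _ in range(n + 1)] for _ in range(d)]
    sign = [[rng.choice((-1, 1)) for _ in range(n + 1)] for _ in range(d)]
    return h, sign


def build_sketch(a, w, d, h, g):
    """Entry (j, k) is the sum of a_i * g(j, i) over items i with h[j][i] == k."""
    sk = [[0] * w for _ in range(d)]
    for i, a_i in enumerate(a, start=1):   # items are 1..n, stored at a[i - 1]
        for j in range(d):
            sk[j][h[j][i]] += a_i * g(j, i)
    return sk


n, w, d = 8, 4, 3
h, sign = make_tables(n, w, d)
a = [3, 0, 1, 4, 0, 2, 0, 5]   # hypothetical vectors
b = [1, 1, 0, 2, 0, 0, 7, 0]


def count_min(v):
    return build_sketch(v, w, d, h, lambda j, i: 1)           # g_j(i) = 1


def count_sketch(v):
    return build_sketch(v, w, d, h, lambda j, i: sign[j][i])  # g_j(i) in {+1, -1}


def add(s, t):
    return [[x + y for x, y in zip(rs, rt)] for rs, rt in zip(s, t)]


a_plus_b = [x + y for x, y in zip(a, b)]
print(count_min(a_plus_b) == add(count_min(a), count_min(b)))           # True
print(count_sketch(a_plus_b) == add(count_sketch(a), count_sketch(b)))  # True
```

Both instances are linear in the input vector, which is why sketches of two streams can be combined by adding their arrays cell by cell.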

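Method | Query | Space | Update time | Query time | Randomness needed
Tug-of-war | Inner-product | $1/\varepsilon^2$ | $1/\varepsilon^2$ | $1/\varepsilon^2$ | 4-wise
Tug-of-war | Point | $\log(n)/\varepsilon^2$ | $\log(n)/\varepsilon^2$ | $\log(n)/\varepsilon^2$ | 4-wise
Tug-of-war | Range | $\log(n)/\varepsilon^2$ | $\log(n)/\varepsilon^2$ | $\log(n)/\varepsilon^2$ | 4-wise
Random subset-sums | Range | $\log^2(n)/\varepsilon^2$ | $\log^2(n)/\varepsilon^2$ | $\log^2(n)/\varepsilon^2$ | Pairwise
Count sketches | Point | $1/\varepsilon^2$ | 1 | 1 | Pairwise
Count-Min sketches | Point | $1/\varepsilon$ | 1 | 1 | Pairwise
Count-Min sketches | Inner-product | $1/\varepsilon$ | 1 | $1/\varepsilon$ | Pairwise
Count-Min sketches | Range | $\log(n)/\varepsilon$ | $\log(n)$ | $\log(n)$ | Pairwise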