Finding Frequent Items in Data Streams. Moses Charikar (Princeton Un., Google Inc.), Kevin Chen (UC Berkeley, Google Inc.), Martin Farach-Colton (Rutgers Un., Google Inc.). Presented by Amir Rothschild.

Page 1:

Finding Frequent Items in Data Streams

Moses Charikar, Princeton Un., Google Inc.
Kevin Chen, UC Berkeley, Google Inc.
Martin Farach-Colton, Rutgers Un., Google Inc.

Presented by Amir Rothschild

Page 2:

Presenting:

1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space.

The algorithm achieves especially good space bounds for Zipfian distributions.

2-pass algorithm for estimating the items with the largest change in frequency between two data streams.

Page 3:

Definitions:

Data stream: $S = q_1, q_2, \ldots, q_n$, where each $q_i \in O = \{o_1, o_2, \ldots, o_m\}$.

Object $o_i$ appears $n_i$ times in $S$, with relative frequency $f_i = n_i / n$.

Order the $o_i$ so that $n_1 \ge n_2 \ge \cdots \ge n_m$.

Page 4:

The first problem:

FindApproxTop(S,k,ε)
Input: stream S, int k, real ε.
Output: k elements from S such that:

for every element $o_i$ in the output: $n_i > (1-\varepsilon)\, n_k$

the output contains every item with: $n_i > (1+\varepsilon)\, n_k$

[Figure: number line of counts marking $n_1$, $n_2$, $n_k$ and the thresholds $(1+\varepsilon)\, n_k$ and $(1-\varepsilon)\, n_k$]

Page 5:

Clarifications:

This is not the problem discussed last week!

Sampling algorithm does not give any bounds for this version of the problem.

Page 6:

Hash functions

We say that h is a pairwise independent hash function if h is chosen uniformly at random from a family $H$ of functions $A \to B$, so that:

$\forall a \in A,\ b \in B:\ \Pr_{h \in H}(h(a) = b) = \dfrac{1}{|B|}$

$\forall a_1 \ne a_2 \in A,\ b_1, b_2 \in B:\ \Pr_{h \in H}(h(a_1) = b_1 \mid h(a_2) = b_2) = \dfrac{1}{|B|}$
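As a small sanity check of the definition, the sketch below enumerates one standard family that satisfies it: $h_{a,b}(x) = (ax + b) \bmod p$ with $A = B = \{0, \ldots, p-1\}$ for a prime $p$. This family, the prime 5, and the helper names are illustrative assumptions; the slides do not say which family is used.

```python
from collections import Counter
from itertools import product

P = 5  # small prime (assumed for illustration) so the whole family can be enumerated


def make_hash(a, b, p=P):
    """One member of the family h_{a,b}(x) = (a*x + b) mod p."""
    return lambda x: (a * x + b) % p


def joint_distribution(x1, x2, p=P):
    """Count how often each pair (h(x1), h(x2)) occurs over all p*p choices of (a, b)."""
    counts = Counter()
    for a, b in product(range(p), repeat=2):
        h = make_hash(a, b, p)
        counts[(h(x1), h(x2))] += 1
    return counts


dist = joint_distribution(1, 3)
# All 25 output pairs occur, each for exactly one (a, b), so
# Pr(h(1) = b1 | h(3) = b2) = 1/5 for every b1, b2.
print(len(dist), set(dist.values()))  # prints: 25 {1}
```

Because every pair of outputs is hit equally often, conditioning on the value of h at one point says nothing about its value at another, which is exactly the second condition above.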

Page 7:

Let’s start with some intuition…

Idea: Let s be a hash function from objects to {+1,-1}, and let c be a counter.

For each qi in the stream, update c += s(qi)

Estimate ni = c*s(oi) (since $E_s[c \cdot s(o_i)] = n_i$)

[Figure: a single counter C updated through the sign hash function S]

Page 8:

Realization: $E_s[c \cdot s(o_i)] = n_i$

$c \cdot s(o_2) = n_1 s(o_1) s(o_2) + n_2 s(o_2)^2 + n_3 s(o_3) s(o_2)$

$\phantom{c \cdot s(o_2)} = n_1 s(o_1) s(o_2) + n_2 + n_3 s(o_3) s(o_2)$

      s(O1)s(O2)   s(O2)s(O2)   s(O3)s(O2)
s1        -1           +1           -1
s2        -1           +1           +1
s3        +1           +1           -1
s4        +1           +1           +1
E          0           +1            0

Page 9:

Claim: $E_s[c \cdot s(o_i)] = n_i$

Proof: $c = n_1 s(o_1) + n_2 s(o_2) + \cdots + n_m s(o_m)$

For each element Oj other than Oi, s(Oj)*s(Oi) = -1 w.p. 1/2 and s(Oj)*s(Oi) = +1 w.p. 1/2. So Oj contributes +nj to c*s(Oi) w.p. 1/2 and -nj w.p. 1/2, and so has no influence on the expectation.

Oi, on the other hand, contributes +ni w.p. 1 (since s(Oi)*s(Oi) = +1).

So the expectation (average) is +ni.
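The claim can be checked by brute force on a tiny example. The sketch below averages c*s(oi) over every possible sign assignment, which corresponds to a fully random sign function (a stronger assumption than the argument needs); the counts 7, 4, 2 are made up for illustration.

```python
from itertools import product


def expected_estimate(counts, i):
    """Average of c * s(o_i) over all 2^m sign assignments s."""
    m = len(counts)
    total = 0
    for signs in product((+1, -1), repeat=m):
        c = sum(n * s for n, s in zip(counts, signs))  # final value of the counter
        total += c * signs[i]
    return total / 2 ** m


counts = [7, 4, 2]  # hypothetical n_1, n_2, n_3
print([expected_estimate(counts, i) for i in range(3)])  # prints: [7.0, 4.0, 2.0]
```

Every individual outcome can be far from ni, but the contributions of the other objects cancel exactly in the average.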

Page 10:

That’s not enough:

The variance is very high. O(m) objects have estimates that are wrong by more than the variance.
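To see how large the variance is, here is a short worked derivation. It assumes that s is pairwise independent with uniform signs, so that $E_s[s(o_j)\, s(o_l)] = 0$ for $j \ne l$, and uses $s(o_i)^2 = 1$:

```latex
\begin{align*}
c \cdot s(o_i) - n_i &= \sum_{j \ne i} n_j\, s(o_j)\, s(o_i) \\
\mathrm{Var}_s[c \cdot s(o_i)]
  &= E_s\Big[\Big(\sum_{j \ne i} n_j\, s(o_j)\, s(o_i)\Big)^2\Big]
   = \sum_{j \ne i} \sum_{l \ne i} n_j n_l\, E_s[s(o_j)\, s(o_l)]
   = \sum_{j \ne i} n_j^2
\end{align*}
```

So the error in the estimate of a rare object is driven by the squared counts of all the other objects, including the most frequent ones.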

Page 11:

First attempt to fix the algorithm…

t independent hash functions Sj

t different counters Cj

For each element qi in the stream:
    For each j in {1,2,…,t} do Cj += Sj(qi)

Take the mean or the median of the estimates Cj*Sj(oi) to estimate ni.

[Figure: counters C1, ..., C6, each paired with its own hash function S1, ..., S6]

Page 12:

Still not enough

Collisions with high-frequency elements such as O1 can spoil most estimates of lower-frequency elements such as Ok.

Page 13:

The solution !!!

Divide & Conquer: Don’t let each element update every counter. More precisely: replace each counter Ci with a hash table Ti of b counters, and have each item update only one counter per hash table.

[Figure: counter Ci replaced by hash table Ti, indexed by hash function hi and updated with sign function Si]

Page 14:

Presenting the CountSketch algorithm…

Let’s start working…

Page 15:

CountSketch data structure

[Figure: t hash tables T1, T2, ..., Tt of b buckets each; table Ti is indexed by hash function hi and updated with sign function Si]

Page 16:

The CountSketch data structure

Define the CountSketch d.s. as follows:

Let t and b be parameters with values determined later.
h1,…,ht – hash functions O -> {1,2,…,b}.
T1,…,Tt – arrays of b counters.
S1,…,St – hash functions from objects O to {+1,-1}.

From now on, define: hi[oj] := Ti[hi(oj)]

Page 17:

The d.s. supports 2 operations:

Add(q): For i = 1,…,t do hi[q] += si(q)

Estimate(q): return median_i {hi[q]*si(q)}

Why median and not mean? In order to show the median is close to reality it’s enough to show that more than ½ of the estimates are good. The mean, on the other hand, is very sensitive to outliers.
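The two operations can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes integer keys, builds hi and si from random linear functions modulo a large prime (so the signs are only approximately unbiased), and numbers buckets 0..b-1 instead of 1..b. The class and attribute names are hypothetical.

```python
import random
from statistics import median

PRIME = 2**61 - 1  # a Mersenne prime; keys are assumed to be ints in [0, PRIME)


class CountSketch:
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.tables = [[0] * b for _ in range(t)]  # T_1..T_t, b counters each
        # random (a, c) coefficients for each bucket hash h_i and sign hash s_i
        self.h_coef = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(t)]
        self.s_coef = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(t)]

    def _h(self, i, q):
        a, c = self.h_coef[i]
        return ((a * q + c) % PRIME) % self.b  # bucket index in 0..b-1

    def _s(self, i, q):
        a, c = self.s_coef[i]
        return 1 if ((a * q + c) % PRIME) % 2 == 0 else -1

    def add(self, q):
        # Add(q): h_i[q] += s_i(q) for every table i
        for i in range(self.t):
            self.tables[i][self._h(i, q)] += self._s(i, q)

    def estimate(self, q):
        # Estimate(q): median over i of h_i[q] * s_i(q)
        return median(
            self.tables[i][self._h(i, q)] * self._s(i, q) for i in range(self.t)
        )


sketch = CountSketch(t=5, b=64)
for _ in range(1000):
    sketch.add(1)
for q in (2, 3, 4, 5):
    sketch.add(q)
print(sketch.estimate(1))
```

In this usage example the other four items occur once each, so whatever the hash functions turn out to be, every per-table estimate of item 1 (and hence the median) lies within 4 of the true count 1000.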

Page 18:

Finally, the algorithm:

Keep a CountSketch d.s. C, and a heap of the top k elements.

Given a data stream q1,…,qn:
For each j=1,…,n:
    C.Add(qj);
    If qj is in the heap, increment its count.
    Else, if C.Estimate(qj) > smallest estimated count in the heap, add qj to the heap. (If the heap is full, evict the object with the smallest estimated count from it.)
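The loop above can be sketched as follows. To stay short, the sketch assumes integer keys, uses a plain dictionary in place of the heap (fine for small k), and admits items unconditionally while fewer than k are tracked; these choices, and all names, are assumptions made for illustration.

```python
import random
from statistics import median

PRIME = 2**61 - 1  # a Mersenne prime; keys are assumed to be ints in [0, PRIME)


class CountSketch:
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.b = b
        self.tables = [[0] * b for _ in range(t)]
        # per table: coefficients (a1, c1) for the bucket hash, (a2, c2) for the sign hash
        self.params = [tuple(rng.randrange(1, PRIME) for _ in range(4)) for _ in range(t)]

    def _cell(self, i, q):
        a1, c1, a2, c2 = self.params[i]
        bucket = ((a1 * q + c1) % PRIME) % self.b
        sign = 1 if ((a2 * q + c2) % PRIME) % 2 == 0 else -1
        return bucket, sign

    def add(self, q):
        for i, table in enumerate(self.tables):
            bucket, sign = self._cell(i, q)
            table[bucket] += sign

    def estimate(self, q):
        values = []
        for i, table in enumerate(self.tables):
            bucket, sign = self._cell(i, q)
            values.append(table[bucket] * sign)
        return median(values)


def approx_top_k(stream, k, t=5, b=64, seed=0):
    sketch = CountSketch(t, b, seed)
    top = {}  # item -> tracked count; plays the role of the heap
    for q in stream:
        sketch.add(q)
        if q in top:
            top[q] += 1  # already tracked: increment its count
            continue
        est = sketch.estimate(q)
        if len(top) < k:
            top[q] = est  # room left: admit unconditionally (assumption)
        else:
            smallest = min(top, key=top.get)
            if est > top[smallest]:
                del top[smallest]  # evict the smallest tracked count
                top[q] = est
    return sorted(top.items(), key=lambda kv: -kv[1])


print(approx_top_k([2, 3, 4, 5] + [1] * 1000, k=1))
```

With this stream, item 1 ends up as the tracked element and its reported count is within 4 of 1000, because the four other items can perturb any single estimate by at most 4.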

Page 19:

And now for the hard part:

Algorithm analysis

Page 20:

Definitions

$A_i(q) = \{o : o \ne q \text{ and } h_i(q) = h_i(o)\}$

$A_i^{*}(q) = A_i(q) \cup \{q\}$

$A_i^{>k}(q) = A_i(q) \setminus \{o_1, \ldots, o_k\}$

$v_i(q) = \sum_{o \in A_i(q)} n_o^2$

$v_i^{>k}(q) = \sum_{o \in A_i^{>k}(q)} n_o^2$

Page 21:

Claims & Proofs

Lemma 0: for every $j$ and every fixed $h_j$: $E_{s_j}[h_j[q] \cdot s_j(q)] = n_q$

Lemma 1: $\mathrm{Var}_{s_i}(h_i[q] \cdot s_i(q)) = v_i(q)$

Lemma 2: $E_{h_i}[v_i^{>k}(q)] = \dfrac{\sum_{j=k+1}^{m} n_j^2}{b}$

Page 22:

Events:

$\text{SmallVariance}_i(q)$: $v_i^{>k}(q) \le 8\, E_{h_i}[v_i^{>k}(q)]$

$\text{NoCollision}_i(q)$: $\{o_1, \ldots, o_k\} \cap A_i(q) = \emptyset$ (so $v_i(q) = v_i^{>k}(q)$)

$\text{SmallDeviation}_i(q)$: $|h_i[q] \cdot s_i(q) - n_q|^2 \le 8\, \mathrm{Var}(h_i[q] \cdot s_i(q))$

Page 23:

The CountSketch algorithm space complexity:

$O\!\left(k \log\dfrac{n}{\delta} + \dfrac{\sum_{i=k+1}^{m} n_i^2}{(\varepsilon\, n_k)^2} \log\dfrac{n}{\delta}\right)$

where δ is the allowed failure probability.

Page 24:

Zipfian distribution

Analysis of the CountSketch algorithm for Zipfian distribution

Page 25:

Zipfian distribution

Zipfian(z): $n_j = \dfrac{c}{j^z}$ for some constant c.

This distribution is very common in human languages (useful in search engines).

Page 26:

Zipfian distribution

[Figure: $\Pr_q(o_i = q)$ for the ten most frequent items $n_1, \ldots, n_{10}$ under Zipfian distributions with z = 0.25, 0.5, 0.75, 1 and 2; vertical axis from 0 to 0.7]

Page 27:

Observations

The k most frequent elements can only be preceded by elements j with $n_j > (1-\varepsilon)\, n_k$.

=> Choosing l instead of k so that $n_{l+1} < (1-\varepsilon)\, n_k$ will ensure that our list includes the k most frequent elements.

[Figure: number line of counts marking $n_1$, $n_2$, $n_k$, $n_{l+1}$ and the thresholds $(1+\varepsilon)\, n_k$ and $(1-\varepsilon)\, n_k$]
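To make the choice of l concrete, the sketch below generates Zipfian counts $n_j = c / j^z$ and finds the smallest l with $n_{l+1} < (1-\varepsilon)\, n_k$. The function names and the parameter values (m = 1000, z = 1, k = 10, ε = 0.4) are hypothetical and chosen only for illustration.

```python
def zipf_counts(m, z, c=1.0):
    """n_j = c / j**z for j = 1..m (list index 0 holds n_1)."""
    return [c / j ** z for j in range(1, m + 1)]


def choose_l(counts, k, eps):
    """Smallest l >= k with n_{l+1} < (1 - eps) * n_k, or len(counts) if there is none."""
    threshold = (1 - eps) * counts[k - 1]
    for l in range(k, len(counts)):
        if counts[l] < threshold:  # counts[l] is n_{l+1} because the list is 0-indexed
            return l
    return len(counts)


counts = zipf_counts(m=1000, z=1.0)
print(choose_l(counts, k=10, eps=0.4))
```

Here the threshold is 0.6 * n_10 = 0.06, and n_17 = 1/17 is the first count below it, so l = 16, which is a constant multiple of k.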

Page 28:

Analysis for Zipfian distribution

For this distribution the space complexity of the algorithm is $O(b \log(n/\delta))$, where:

Case 1: $z < \tfrac{1}{2}$: $b = m^{1-2z}\, k^{2z}$

Case 2: $z = \tfrac{1}{2}$: $b = k \log m$

Case 3: $z > \tfrac{1}{2}$: $b = k$

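Page 29:

Proof of the space bounds: Part 1, l=O(k)

$n_k = \dfrac{c}{k^z}$; $n_l = \dfrac{c}{l^z}$, so $\dfrac{n_k}{n_l} = \left(\dfrac{l}{k}\right)^{z}$ and $\dfrac{l}{k} = \left(\dfrac{n_k}{n_l}\right)^{1/z}$.

Since $n_l \ge (1-\varepsilon)\, n_k$ (l is the smallest index with $n_{l+1} < (1-\varepsilon)\, n_k$), we have $\dfrac{n_k}{n_l} \le \dfrac{1}{1-\varepsilon}$, and

$\dfrac{l}{k} = \left(\dfrac{n_k}{n_l}\right)^{1/z} \le \dfrac{1}{(1-\varepsilon)^{1/z}}$

$l \le \dfrac{k}{(1-\varepsilon)^{1/z}}$

$l = O(k)$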

Page 30:

Proof of the space bounds: Part 2

$\sum_{j=k+1}^{m} n_j^2 = c^2 \sum_{j=k+1}^{m} \dfrac{1}{j^{2z}}$

$\sum_{j=k+1}^{m} \dfrac{1}{j^{2z}} \approx \int_k^m x^{-2z}\, dx = \begin{cases} \dfrac{m^{1-2z} - k^{1-2z}}{1-2z}, & z \ne 0.5 \\ \log m - \log k, & z = 0.5 \end{cases} = \begin{cases} O(m^{1-2z}), & z < 0.5 \\ O(\log m), & z = 0.5 \\ O(k^{1-2z}), & z > 0.5 \end{cases}$

By Lemma 5 (treating ε as a constant): $b = \Omega(k)$ and $b = \Omega\!\left(\dfrac{\sum_{j=k+1}^{m} n_j^2}{n_k^2}\right)$, where $n_k^2 = \dfrac{c^2}{k^{2z}}$,

so $b = \begin{cases} m^{1-2z}\, k^{2z}, & z < 0.5 \\ k \log m, & z = 0.5 \\ k, & z > 0.5 \end{cases}$

QED

Page 31:

Comparison of space requirements for random sampling vs. our algorithm

Page 32:

Yet another algorithm which uses CountSketch d.s.

Finding items with largest frequency change

Page 33:

The problem

Let $n_o^{S}$ be the number of occurrences of o in S.

Given 2 streams S1, S2, find the items o such that $|n_o^{S_2} - n_o^{S_1}|$ is maximal.

2-pass algorithm.

Page 34:

The algorithm – first pass

First pass – only update the counters:

For each q in S1:
    For j = 1,…,t: hj[q] -= sj(q)

For each q in S2:
    For j = 1,…,t: hj[q] += sj(q)

Page 35:

The algorithm – second pass

Pass over S1 and S2 and:

1. Maintain a set A of the l objects encountered with the largest values of $|\hat{n}_q|$.

2. For each item in A maintain an exact count of the number of occurrences in S1 and S2.

3. For each q, compute $\hat{n}_q := \mathrm{median}_i\, \{h_i[q] \cdot s_i(q)\}$ and change A accordingly.

4. Report the k items with the largest values of $|\hat{n}_q|$ amongst the items in A.
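Both passes can be put together in a short sketch. It assumes integer keys, the convention that S1 is subtracted and S2 is added in the first pass, l = k by default, and hash functions built from random linear functions modulo a prime; the names SignedSketch and max_change are hypothetical. The exact change it reports for each selected item comes from the counts kept for members of A.

```python
import random
from statistics import median

PRIME = 2**61 - 1  # a Mersenne prime; keys are assumed to be ints in [0, PRIME)


class SignedSketch:
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.b = b
        self.tables = [[0] * b for _ in range(t)]
        self.params = [tuple(rng.randrange(1, PRIME) for _ in range(4)) for _ in range(t)]

    def _cell(self, i, q):
        a1, c1, a2, c2 = self.params[i]
        bucket = ((a1 * q + c1) % PRIME) % self.b
        sign = 1 if ((a2 * q + c2) % PRIME) % 2 == 0 else -1
        return bucket, sign

    def update(self, q, delta):
        for i, table in enumerate(self.tables):
            bucket, sign = self._cell(i, q)
            table[bucket] += delta * sign

    def estimate(self, q):
        values = []
        for i, table in enumerate(self.tables):
            bucket, sign = self._cell(i, q)
            values.append(table[bucket] * sign)
        return median(values)


def max_change(s1, s2, k, l=None, t=5, b=64, seed=0):
    l = k if l is None else l
    sketch = SignedSketch(t, b, seed)
    # First pass: only update the counters.
    for q in s1:
        sketch.update(q, -1)
    for q in s2:
        sketch.update(q, +1)
    # Second pass: maintain A (here the keys of `score`) with exact counts.
    score = {}  # item in A -> |estimated change|
    exact = {}  # item in A -> [occurrences in S1, occurrences in S2]
    for which, stream in enumerate((s1, s2)):
        for q in stream:
            if q not in score:
                est = abs(sketch.estimate(q))
                if len(score) >= l:
                    weakest = min(score, key=score.get)
                    if est <= score[weakest]:
                        continue  # q is not admitted to A
                    del score[weakest], exact[weakest]
                score[q] = est
                exact[q] = [0, 0]
            exact[q][which] += 1
    ranked = sorted(score, key=score.get, reverse=True)[:k]
    return [(q, exact[q][1] - exact[q][0]) for q in ranked]


s1 = [1] * 1000 + [2, 3, 4, 5]
s2 = [2, 3, 4, 5]
print(max_change(s1, s2, k=1))  # prints: [(1, -1000)]
```

In the usage example items 2 to 5 cancel exactly in the sketch, so item 1 has estimate -1000 in every table and no other item can displace it from A.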

Page 36:

Explanation

Though A can change, items once removed are never added back.

Thus accurate exact counts can be maintained for all objects currently in A.

Space bounds for this algorithm are similar to those of the former, with $n_q$ replaced by $\Delta_q = |n_q^{S_2} - n_q^{S_1}|$.
