Finding Frequent Items in Data Streams
Moses Charikar (Princeton Un., Google Inc.), Kevin Chen (UC Berkeley, Google Inc.), Martin Farach-Colton (Rutgers Un., Google Inc.)
Presented by Amir Rothschild
Presenting:
A 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. The algorithm achieves especially good space bounds for Zipfian distributions.
A 2-pass algorithm for estimating the items with the largest change in frequency between two data streams.
Definitions:
Data stream: S = q1, q2, …, qn, where each qi ∈ O = {o1, o2, …, om}.
Object oi appears ni times in S; define fi = ni/n.
Order the oi so that n1 ≥ n2 ≥ … ≥ nm.
The first problem:
FindApproxTop(S, k, ε)
Input: stream S, int k, real ε.
Output: k elements from S such that:
for every element oi in the output: ni > (1-ε)·nk
and the output contains every item with: ni > (1+ε)·nk
[Figure: frequencies n1 ≥ n2 ≥ … ≥ nk with the thresholds (1-ε)·nk and (1+ε)·nk around nk.]
Clarifications:
This is not the problem discussed last week!
The sampling algorithm does not give any bounds for this version of the problem.
Hash functions
We say that h : A -> B is a pairwise independent hash function if h is chosen randomly from a family H so that:
For all a ∈ A, b ∈ B:  Pr_{h∈H}[ h(a) = b ] = 1/|B|
For all a1 ≠ a2 ∈ A and b1, b2 ∈ B:  Pr_{h∈H}[ h(a1) = b1 | h(a2) = b2 ] = 1/|B|
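As a concrete illustration (hypothetical, not from the slides), the classic construction h(x) = ((a·x + b) mod p) mod |B| with a large prime p and random a, b gives a pairwise independent family over integers (the final reduction mod the bucket count makes it only approximately uniform):

```python
import random

# A classic pairwise independent hash family (illustrative sketch).
P = 2_147_483_647  # a large prime (2^31 - 1)

def make_hash(num_buckets, rng=random):
    """Draw h from H: h(x) = ((a*x + b) mod P) mod num_buckets."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % num_buckets

h = make_hash(8)
values = [h(x) for x in range(5)]  # deterministic once h is drawn
```

A fresh call to make_hash draws a new member of the family; the randomness is over the draw of (a, b), not over the inputs.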
Let’s start with some intuition…
Idea: Let s be a hash function from objects to {+1, -1}, and let c be a counter.
For each qi in the stream, update c += s(qi).
Estimate ni = c·s(oi), since E_s[c·s(oi)] = ni.
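A tiny simulation of this single-counter idea (an illustrative sketch, not from the slides; the stream counts are invented). Averaging c·s(oi) over many random draws of s approaches ni:

```python
import random

# One counter c and a random sign function s: objects -> {+1, -1}.
# Averaged over draws of s, c * s(target) estimates the target's count.
def estimate(stream, target, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sign = {}  # a fresh random sign function s for this trial
        c = 0
        for q in stream:
            if q not in sign:
                sign[q] = rng.choice((+1, -1))
            c += sign[q]
        total += c * sign[target]
    return total / trials

stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
random.Random(1).shuffle(stream)
est = estimate(stream, "a")  # approaches n_a = 50 as trials grow
```

Each single trial is very noisy (its value is 50 ± 30 ± 20 here); only the average is unbiased, which is exactly the weakness the next slides address.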
Realization that E_s[c·s(oi)] = ni, e.g. for i = 2:
c·s(o2) = n1·s(o1)s(o2) + n2·s(o2)s(o2) + n3·s(o3)s(o2)
        = n1·s(o1)s(o2) + n2 + n3·s(o3)s(o2)

     s(O1)s(O2)   s(O2)s(O2)   s(O3)s(O2)
s1       -1           +1           -1
s2       -1           +1           +1
s3       +1           +1           -1
s4       +1           +1           +1
E         0           +1            0
Claim: E_s[c·s(oi)] = ni.
Proof: c = n1·s(o1) + n2·s(o2) + … + nm·s(om).
For each element oj other than oi, s(oj)·s(oi) = -1 w.p. 1/2 and s(oj)·s(oi) = +1 w.p. 1/2. So oj adds +nj to the counter w.p. 1/2 and -nj w.p. 1/2, and so has no influence on the expectation.
oi, on the other hand, adds +ni to the counter w.p. 1 (since s(oi)·s(oi) = +1).
So the expectation (average) is +ni.
That’s not enough:
The variance is very high; O(m) objects have estimates that are wrong by more than the variance.
First attempt to fix the algorithm…
t independent hash functions S1,…,St and t different counters C1,…,Ct.
For each element qi in the stream: for each j in {1,2,…,t}, do Cj += Sj(qi).
Take the mean or the median of the estimates Cj·Sj(oi) to estimate ni.
[Figure: counters C1,…,C6 paired with sign functions S1,…,S6.]
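A minimal sketch of this first attempt (illustrative; the choice of t and the stream contents are invented):

```python
import random
import statistics

# t independent sign functions S_j and t counters C_j; estimate n_i as the
# median of C_j * S_j(o_i) over j.
t = 5
rng = random.Random(0)
signs = [dict() for _ in range(t)]

def s(j, o):
    """Lazily drawn random sign S_j(o) in {+1, -1}."""
    if o not in signs[j]:
        signs[j][o] = rng.choice((+1, -1))
    return signs[j][o]

counters = [0] * t
stream = ["a"] * 60 + ["b"] * 25 + ["c"] * 15
for q in stream:
    for j in range(t):
        counters[j] += s(j, q)

def estimate(o):
    return statistics.median(counters[j] * s(j, o) for j in range(t))

est_a = estimate("a")
```

Every element still updates every counter, so a single heavy item contaminates all t estimates of the light items, motivating the next fix.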
Still not enough
Collisions with high-frequency elements like O1 can spoil most estimates of lower-frequency elements such as Ok.
The solution!!!
Divide & conquer: don’t let each element update every counter. More precisely: replace each counter with a hash table of b counters, and have each item update one counter per hash table.
Presenting the CountSketch algorithm…
Let’s start working…
[Figure: the CountSketch data structure — t hash tables T1,…,Tt of b buckets each, with hash functions h1,…,ht and sign functions S1,…,St.]
The CountSketch data structure
Define the CountSketch d.s. as follows. Let t and b be parameters with values determined later.
h1,…,ht – hash functions O -> {1,2,…,b}.
T1,…,Tt – arrays of b counters each.
s1,…,st – hash functions from objects O to {+1,-1}.
From now on, define: hi[oj] := Ti[hi(oj)].
The d.s. supports 2 operations:
Add(q): For i = 1…t, do hi[q] += si(q).
Estimate(q): return median_i { hi[q]·si(q) }.
Why median and not mean? To show the median is close to reality, it’s enough to show that ½ of the estimates are good. The mean, on the other hand, is very sensitive to outliers.
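The data structure can be sketched in Python as follows (a minimal illustration: lazily stored random values stand in for the pairwise independent hash families hi and si, and the parameters t, b in the usage are invented):

```python
import random
import statistics

class CountSketch:
    """Minimal CountSketch: t tables of b counters, Add and Estimate."""

    def __init__(self, t, b, seed=0):
        self.rng, self.t, self.b = random.Random(seed), t, b
        self.tables = [[0] * b for _ in range(t)]
        self.h = [dict() for _ in range(t)]  # h_i: O -> {0..b-1}
        self.s = [dict() for _ in range(t)]  # s_i: O -> {+1, -1}

    def _hs(self, i, q):
        if q not in self.h[i]:
            self.h[i][q] = self.rng.randrange(self.b)
            self.s[i][q] = self.rng.choice((+1, -1))
        return self.h[i][q], self.s[i][q]

    def add(self, q):
        for i in range(self.t):
            h, s = self._hs(i, q)
            self.tables[i][h] += s

    def estimate(self, q):
        vals = []
        for i in range(self.t):
            h, s = self._hs(i, q)
            vals.append(self.tables[i][h] * s)
        return statistics.median(vals)

cs = CountSketch(t=7, b=64)
for q in ["a"] * 100 + ["b"] * 10:
    cs.add(q)
est = cs.estimate("a")
```

With t = 7 and b = 64 the estimate of a heavy item is exact in every table where it does not collide, and the median discards the rare contaminated tables.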
Finally, the algorithm:
Keep a CountSketch d.s. C and a heap of the top k elements.
Given a data stream q1,…,qn, for each j = 1,…,n:
C.Add(qj);
If qj is in the heap, increment its count.
Else, if C.Estimate(qj) > the smallest estimated count in the heap, add qj to the heap (if the heap is full, evict the object with the smallest estimated count from it).
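Putting the pieces together, a hypothetical end-to-end sketch of the top-k algorithm (for brevity a plain dict of estimated counts stands in for the heap; all parameters and the example stream are invented):

```python
import random
import statistics

class CountSketch:
    """Compact CountSketch with lazily drawn random h_i, s_i."""

    def __init__(self, t, b, seed=0):
        self.rng, self.t, self.b = random.Random(seed), t, b
        self.tables = [[0] * b for _ in range(t)]
        self.hs = [dict() for _ in range(t)]

    def _hs(self, i, q):
        if q not in self.hs[i]:
            self.hs[i][q] = (self.rng.randrange(self.b),
                             self.rng.choice((+1, -1)))
        return self.hs[i][q]

    def add(self, q):
        for i in range(self.t):
            h, s = self._hs(i, q)
            self.tables[i][h] += s

    def estimate(self, q):
        return statistics.median(self.tables[i][self._hs(i, q)[0]] *
                                 self._hs(i, q)[1] for i in range(self.t))

def approx_top_k(stream, k, t=7, b=64):
    cs = CountSketch(t, b)
    top = {}  # item -> estimated count (stands in for the heap)
    for q in stream:
        cs.add(q)
        if q in top:
            top[q] += 1  # item tracked: increment its count
        else:
            est = cs.estimate(q)
            if len(top) < k:
                top[q] = est
            elif est > min(top.values()):
                del top[min(top, key=top.get)]  # evict smallest estimate
                top[q] = est
    return sorted(top, key=top.get, reverse=True)

stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 20 + ["d"] * 2
random.Random(1).shuffle(stream)
result = approx_top_k(stream, k=2)
```

A real implementation would keep the tracked items in an actual min-heap so the smallest estimate is found in O(1) rather than by scanning the dict.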
And now for the hard part:
Algorithms analysis
Definitions
Ai(q) = { o : o ≠ q and hi(o) = hi(q) }   (the elements colliding with q in the i-th table)
Ai^*(q) = Ai(q) ∪ {q}
Ai^{>k}(q) = Ai(q) \ {o1,…,ok}
vi(q) = Σ_{o ∈ Ai(q)} no²
vi^{>k}(q) = Σ_{o ∈ Ai^{>k}(q)} no²
Claims & Proofs
Lemma 0: For all j: E_s[ hj[q]·sj(q) ] = nq.
Lemma 1: Var_s( hi[q]·si(q) ) = vi(q).
Lemma 2: E_h[ vi^{>k}(q) ] ≤ (Σ_{j=k+1..m} nj²) / b.
Define three events:
SmallVariance_i(q): vi^{>k}(q) ≤ 8·E_h[ vi^{>k}(q) ]
NoCollision_i(q): {o1,…,ok} ∩ Ai(q) = ∅  (so that vi(q) = vi^{>k}(q))
SmallDeviation_i(q): | hi[q]·si(q) - nq | ≤ sqrt( 8·Var( hi[q]·si(q) ) )
The CountSketch algorithm space complexity:
O( ( k + (Σ_{i=k+1..m} ni²) / (ε·nk)² ) · log(n/δ) )
Zipfian distribution
Analysis of the CountSketch algorithm for Zipfian distributions
Zipfian(z): nj = c / j^z for some constant c.
This distribution is very common in human languages (useful in search engines).
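A quick illustrative computation of the Zipfian(z) frequencies, with the constant c chosen so the relative frequencies fj sum to 1 (the parameters m and z here are arbitrary examples):

```python
# Normalized Zipfian(z) frequencies: f_j = (1/j^z) / sum_i (1/i^z).
def zipf_probs(m, z):
    weights = [1.0 / j ** z for j in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]

probs = zipf_probs(10, z=1.0)
# For z = 1 the most frequent item is exactly twice as likely as the second.
ratio = probs[0] / probs[1]
```

Larger z concentrates more of the mass on the first few items, which is why the space bounds below improve as z grows.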
Zipfian distribution
[Figure: Pr_q(oi = q) for the 10 most frequent items n1,…,n10, plotted for z = 0.25, 0.5, 0.75, 1, 2; larger z concentrates the probability mass on the most frequent items.]
Observations
The k most frequent elements can only be preceded in the output by elements j with nj > (1-ε)·nk.
=> Choosing l instead of k so that n_{l+1} < (1-ε)·nk will ensure that our list includes the k most frequent elements.
[Figure: frequencies n1 ≥ n2 ≥ … ≥ nk ≥ … ≥ n_{l+1} against the thresholds (1-ε)·nk and (1+ε)·nk.]
Analysis for Zipfian distribution
For this distribution the space complexity of the algorithm is O(b·log(n/δ)), where:
Case 1: z < 1/2:  b = m^{1-2z}·k^{2z}
Case 2: z = 1/2:  b = k·log(m/k)
Case 3: z > 1/2:  b = k
Proof of the space bounds: Part 1, l = O(k)
nk = c/k^z and nl = c/l^z, so nl/nk = (k/l)^z, i.e. k/l = (nl/nk)^{1/z}.
Since nl ≥ (1-ε)·nk, we have nl/nk ≥ 1-ε, hence
k/l ≥ (1-ε)^{1/z}, i.e. l ≤ (1-ε)^{-1/z}·k,
and therefore l = O(k).
Proof of the space bounds: Part 2
Σ_{j=l+1..m} nj² = c²·Σ_{j=l+1..m} 1/j^{2z}, where
Σ_{j=l+1..m} 1/j^{2z} = O(m^{1-2z}) for z < 0.5, O(log(m/l)) for z = 0.5, O(l^{1-2z}) for z > 0.5.
By lemma 5, it suffices to take b = O(k) and b = O( (Σ_{j=l+1..m} nj²) / (ε·nk)² ).
Since nk = c/k^z and l = O(k), this gives
b = m^{1-2z}·k^{2z} for z < 0.5, k·log(m/k) for z = 0.5, k for z > 0.5.
QED
Comparison of space requirements for random sampling vs. our algorithm
Yet another algorithm which uses the CountSketch d.s.
Finding items with the largest frequency change
The problem
Let n_o^S be the number of occurrences of o in S.
Given 2 streams S1, S2, find the items o such that |n_o^{S2} - n_o^{S1}| is maximal.
This is a 2-pass algorithm.
The algorithm – first pass
First pass – only update the counters:
For each q in S1: for j = 1,…,t, do hj[q] -= sj(q).
For each q in S2: for j = 1,…,t, do hj[q] += sj(q).
(So the sketch accumulates n_q^{S2} - n_q^{S1}.)
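A hypothetical illustration of this first pass: one CountSketch updated with -1 per occurrence in S1 and +1 per occurrence in S2 (the sign convention and all parameters here are assumptions), so that Estimate(q) approximates n_q^{S2} - n_q^{S1}:

```python
import random
import statistics

class DeltaSketch:
    """Minimal CountSketch with signed updates (illustrative)."""

    def __init__(self, t, b, seed=0):
        self.rng, self.t, self.b = random.Random(seed), t, b
        self.tables = [[0] * b for _ in range(t)]
        self.hs = [dict() for _ in range(t)]

    def _hs(self, j, q):
        if q not in self.hs[j]:
            self.hs[j][q] = (self.rng.randrange(self.b),
                             self.rng.choice((+1, -1)))
        return self.hs[j][q]

    def add(self, q, delta):
        for j in range(self.t):
            h, s = self._hs(j, q)
            self.tables[j][h] += s * delta

    def estimate(self, q):
        return statistics.median(self.tables[j][self._hs(j, q)[0]] *
                                 self._hs(j, q)[1] for j in range(self.t))

s1 = ["a"] * 40 + ["b"] * 10
s2 = ["a"] * 15 + ["b"] * 35
ds = DeltaSketch(t=7, b=64)
for q in s1:
    ds.add(q, -1)  # first pass over S1: subtract
for q in s2:
    ds.add(q, +1)  # first pass over S2: add
# ds.estimate("b") approximates n_b^{S2} - n_b^{S1} = 25, up to collisions
delta_b = ds.estimate("b")
```

Since the counters are linear in the updates, a single sketch of the difference behaves exactly like a CountSketch of a stream whose "counts" are the per-item changes.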
The algorithm – second pass
Pass over S1 and S2 and:
1. Maintain a set A of the objects encountered with the largest estimated values n̂_q.
2. For each item in A, maintain an exact count of the number of occurrences in S1 and S2.
3. For each q, estimate n̂_q := median_j { hj[q]·sj(q) }, and change A accordingly.
4. Report the k items with the largest values of n̂_q amongst the items in A.
Explanation
Though A can change, items once removed are never added back.
Thus accurate exact counts can be maintained for all objects currently in A.
Space bounds for this algorithm are similar to those of the former, with n_q replaced by n_q^{S1} + n_q^{S2}.