Algorithms for massive data sets
Lecture 2 (Mar 14, 2004)
Yossi Matias & Ely Porat
(partially based on various presentations & notes)
CS 361A 2
Negative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]

Theorem: Let E be an estimator for D(X) that examines r < n values of X, possibly in an adaptive and randomized order. Then, for any γ > e^(−r), E must have ratio error at least
  sqrt( ((n−r)/(2r)) · ln(1/γ) )
with probability at least γ.

• Example: say r = n/5. Then (n−r)/(2r) = 2, and for γ = 1/2 the error is at least sqrt(2 ln 2) ≈ 1.18, i.e., roughly 20% error with probability 1/2.
Scenario Analysis
Scenario A: – all values in X are identical (say V)
– D(X) = 1
Scenario B: – distinct values in X are {V, W1, …, Wk},
– V appears n-k times
– each Wi appears once
– Wi’s are randomly distributed
– D(X) = k+1
Proof

• Little Birdie: the input is one of Scenarios A or B only
• Suppose E examines elements X(1), X(2), …, X(r) in that order
– the choice of X(i) may be randomized and depend arbitrarily on the values of X(1), …, X(i−1)
• Lemma: P[ X(i)=V | X(1)=X(2)=…=X(i−1)=V ] ≥ (n−k−i+1)/(n−i+1)
• Why?
– having seen only V's, E has no information on whether the scenario is A or B
– in Scenario B the Wi values are randomly distributed, so at most k of the n−i+1 unexamined positions hold a Wi
Proof (continued)

• Define EV as the event {X(1)=X(2)=…=X(r)=V}
• Then
  P[EV] = ∏_{i=1}^{r} P[ X(i)=V | X(1)=…=X(i−1)=V ]
        ≥ ∏_{i=1}^{r} (n−k−i+1)/(n−i+1)
        ≥ ( 1 − k/(n−r) )^r
        ≥ exp( −2kr/(n−r) )
• The last inequality holds because 1−Z ≥ exp(−2Z) for 0 ≤ Z ≤ 1/2
Proof (conclusion)

• Choose k = ((n−r)/(2r)) · ln(1/γ) to obtain P[EV] ≥ γ
• Thus:
– Scenario A: P[EV] = 1
– Scenario B: P[EV] ≥ γ
• Suppose E returns the estimate Z when EV happens
– in Scenario A, D(X) = 1
– in Scenario B, D(X) = k+1
– since the same Z must serve both scenarios, Z has worst-case ratio error at least sqrt(k+1) > sqrt( ((n−r)/(2r)) · ln(1/γ) ) with probability at least γ
Randomized Approximation (based on [Flajolet-Martin 1983, Alon-Matias-Szegedy 1996])

Theorem: For every c > 2 there exists an algorithm that, given a sequence A of n members of U={1,2,…,u}, computes a number d' using O(log u) memory bits, such that the probability that max(d'/d, d/d') > c is at most 2/c.

• A bit vector BV represents the set:
– let b be the smallest integer s.t. 2^b > u; let F = GF(2^b); let r, s be random elements of F
– for each a in A, compute h(a) = r·a + s (a bit pattern like 101****10…0); if h(a) ends in exactly k zeros, set the k'th bit of BV
– for a random a, Pr[h(a) ends in exactly k zeros] = 2^(−(k+1))
– the estimate is 2^(max bit set)
[Figure: the hash range 0 … u−1 and an example bit vector 0000101010001001111]
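The bit-vector scheme above can be sketched in Python. This is a minimal illustration, not the slides' exact construction: it does arithmetic modulo a large prime rather than in GF(2^b), tracks the maximum set bit directly instead of a bit vector, and takes a median over several independent hash functions for robustness (the single-hash version on the slide has high variance).

```python
import random

def fm_estimate(stream, trials=32, seed=0):
    """Flajolet-Martin-style distinct-count estimate (sketch).

    For each trial: pick a random affine hash h(a) = (r*a + s) mod p,
    record the largest k such that some h(a) ends in exactly k zeros,
    and estimate 2^k. Return the median over all trials.
    """
    rng = random.Random(seed)
    p = 2**61 - 1                     # large prime standing in for GF(2^b)
    estimates = []
    for _ in range(trials):
        r, s = rng.randrange(1, p), rng.randrange(p)
        max_bit = 0
        for a in stream:
            h = (r * a + s) % p
            k = (h & -h).bit_length() - 1 if h else 0  # trailing zeros of h
            max_bit = max(max_bit, k)
        estimates.append(2 ** max_bit)
    return sorted(estimates)[trials // 2]
```

On a stream with d distinct values, each trial's answer concentrates around d up to a small factor, and the median tightens that.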
Randomized Approximation (2) (based on [Indyk-Motwani 1998])

• Algorithm SM: for a fixed t, is D(X) >> t?
– choose a random hash function h: U → [1..t]
– initialize answer to NO
– for each x_i, if h(x_i) = t, set answer to YES
• Theorem:
– if D(X) < t, then P[SM outputs NO] > 0.25
– if D(X) > 2t, then P[SM outputs NO] < 1/e² ≈ 0.135
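Algorithm SM above can be sketched in a few lines. An assumption in this sketch: the random hash is an affine function modulo a large prime, standing in for the truly random hash function the analysis assumes.

```python
import random

def sm(stream, t, seed=0):
    """One 1-bit instance of algorithm SM (sketch): pick a random hash
    h: U -> {1..t} and answer YES (True) iff some element hashes to t."""
    p = 2**31 - 1
    rng = random.Random(seed)
    a, b = rng.randrange(1, p), rng.randrange(p)
    return any(((a * x + b) % p) % t + 1 == t for x in stream)
```

Note the 1-bit memory claim: the only state carried across elements is the YES/NO answer (the hash coefficients are fixed randomness, not data-dependent state).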
Analysis

• Let Y be the set of distinct elements of X
• SM(X) = NO iff no element of Y hashes to t
• P[a given element hashes to t] = 1/t
• Thus P[SM(X) = NO] = (1 − 1/t)^|Y|
• Since |Y| = D(X):
– if D(X) < t, P[SM(X) = NO] > (1 − 1/t)^t > 0.25
– if D(X) > 2t, P[SM(X) = NO] < (1 − 1/t)^(2t) < 1/e²
• Observe: only 1 bit of memory is needed!
Boosting Accuracy

• With 1 bit we can probabilistically distinguish D(X) < t from D(X) > 2t
• Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0
• Running O(log n) such groups in parallel, for t = 1, 2, 4, 8, …, n, estimates D(X) within a factor of 2
• The choice of factor 2 is arbitrary: using factor (1+ε) reduces the error to ε
• EXERCISE: verify that we can estimate D(X) within factor (1±ε) with probability (1−δ) using space O( (1/ε²) · log(1/δ) · log n )
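The doubling scheme above can be sketched end-to-end. Assumptions in this sketch: the per-threshold hash is affine modulo a prime (standing in for a truly random hash), and the vote rule "at least a quarter of the repetitions answered NO" is a threshold I chose to sit between the 0.25 and 1/e² bounds of the theorem, so the answer is only good to a small constant factor rather than exactly 2.

```python
import random

def estimate_distinct(stream, n, reps=48, seed=0):
    """Estimate D(X) within a small constant factor (sketch): for each
    t = 1, 2, 4, ... run `reps` independent 1-bit tests ("did any
    element hash to t?") and report the smallest t at which at least a
    quarter of the tests answered NO."""
    p = 2**31 - 1
    t = 1
    while t <= 2 * n:
        rng = random.Random(seed * 1000003 + t)
        no = 0
        for _ in range(reps):
            a, b = rng.randrange(1, p), rng.randrange(p)
            hit = any(((a * x + b) % p) % t + 1 == t for x in stream)
            no += (not hit)
        if no >= reps / 4:
            return t
        t *= 2
    return n
```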
Sampling: Basics

• Idea: a small random sample S of the data often represents all the data well
– for a fast approximate answer, apply the query to S and "scale" the result
– e.g., R.a is {0,1} and S is a 20% sample:
  select count(*) from R where R.a = 0
becomes
  select 5 * count(*) from S where S.a = 0
[Figure: the 0/1 column R.a with the sampled rows marked in red; Est. count = 5*2 = 10, exact count = 10]
• Leverage the extensive literature on confidence intervals for sampling:
– the actual answer is within the interval [a,b] with a given probability
– e.g., 54,000 ± 600 with probability 90%
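The scale-up idea can be sketched in Python (a minimal illustration of the slide's example; the 20% rate and the 0/1 column are from the slide, the function name is mine):

```python
import random

def scaled_count(R, predicate, frac=0.2, seed=0):
    """Approximate `select count(*) from R where predicate` by
    applying the predicate to a frac-sample S and scaling by 1/frac."""
    rng = random.Random(seed)
    S = [x for x in R if rng.random() < frac]          # 20% sample
    return round((1 / frac) * sum(1 for x in S if predicate(x)))
```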
Sampling versus Counting

• Observe
– the count is merely an abstraction; subsequent analytics are needed
– data arrives as tuples, and X is merely one of many attributes
– databases: selection predicates, join results, …
– networking: need to combine distributed streams
• Single-pass approaches
– good accuracy
– but give only a count and cannot handle these extensions
• Sampling-based approaches
– keep actual data, so they can address the extensions
– but face the strong negative result for sampling
Distinct Sampling for Streams [Gibbons 2001]

• Best of both worlds
– good accuracy
– maintains a "distinct sample" over the stream
– handles the distributed setting
• Basic idea
– hashing assigns a random "priority" to each domain value
– track the highest-priority values seen so far
– keep a random sample of the tuples for each such value
– relative error ε with probability 1−δ using O( (1/ε²) · log(1/δ) ) memory
Hash Function

• Domain U = [0..m−1]
• Hashing
– random A, B from U, with A > 0
– g(x) = Ax + B (mod m)
– h(x) = # leading 0s in the binary representation of g(x)
• Clearly: 0 ≤ h(x) ≤ log m
• Fact: P[ h(x) = l ] = 2^(−(l+1))
Overall Idea

• The hash gives a random "level" for each domain value
• Compute the level of each stream element
• Invariant
– current level: cur_lev
– sample S: all distinct values scanned so far whose level is at least cur_lev
• Observe
– a random hash gives a random sample of the distinct values
– for each sampled value, we can keep a sample of its tuples
Algorithm DS (Distinct Sample)

• Parameter: memory size M = O( (1/ε²) · log(1/δ) )
• Initialize: cur_lev ← 0; S ← empty
• For each input x
– L ← h(x)
– if L ≥ cur_lev, add x to S
– if |S| > M:
  • delete from S all values of level cur_lev
  • cur_lev ← cur_lev + 1
• Return 2^cur_lev · |S|
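Algorithm DS can be sketched as a small class. Assumptions here: the domain size m is taken to be the prime 2^31 − 1, and h(x) counts leading zeros in the 31-bit representation of g(x), following the Hash Function slide.

```python
import random

class DistinctSample:
    """Sketch of algorithm DS: keep every scanned distinct value whose
    level h(x) is at least cur_lev; when the sample overflows, raise
    cur_lev (which roughly halves the sample). Estimate D(X) as
    2^cur_lev * |S|."""

    def __init__(self, M, seed=0):
        rng = random.Random(seed)
        self.m = 2**31 - 1                 # domain size (a prime, assumed)
        self.A = rng.randrange(1, self.m)
        self.B = rng.randrange(self.m)
        self.M = M
        self.cur_lev = 0
        self.S = {}                        # value -> level

    def level(self, x):
        g = (self.A * x + self.B) % self.m
        # leading zeros of g in a (bit-length of m)-bit representation
        return self.m.bit_length() - g.bit_length()

    def add(self, x):
        L = self.level(x)
        if L >= self.cur_lev:
            self.S[x] = L
            if len(self.S) > self.M:
                self.cur_lev += 1
                self.S = {v: l for v, l in self.S.items() if l >= self.cur_lev}

    def estimate(self):
        return (2 ** self.cur_lev) * len(self.S)
```

Duplicates in the stream are harmless: re-adding a value just overwrites its entry, so S stays a set of distinct values.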
Analysis

• Invariant: S contains all scanned values x such that h(x) ≥ cur_lev
• By construction, P[ h(x) ≥ cur_lev ] = 2^(−cur_lev)
• Thus E[|S|] = 2^(−cur_lev) · D(X), so 2^cur_lev · |S| is an unbiased estimate of D(X)
• EXERCISE: verify the deviation bound
Hot list queries

• Why is it interesting?
– top ten / best-seller lists
– load balancing
– caching policies
Hot list queries

• Let's use sampling
[Figure: a long stream of characters, e.g. edoejddkaklsadkjdkdkpryekfvcuszldfoasddjkkdkvza, with a small sample drawn from it]
Hot list queries

• The question is: how do we sample when we don't know the sample size in advance?
Gibbons & Matias' algorithm

[Slides 21-24 walk through an example. Produced values: c a a b d b a d d. The hotlist stores a count per tracked value (shown as 5 3 1 3 in the example) together with a sampling probability, initially p = 1.0.]

• When a new value e arrives and the hotlist is full, one tracked value must be replaced:
– multiply p by some amount f (here f = 0.75), giving p = 0.75
– for each stored count, throw that many biased coins with heads probability f, and replace the count by the number of heads seen (in the example, counts 5 3 1 3 become 4 3 0 2)
– replace a value whose count has dropped to zero with the new value e (the counts become 4 3 1 2)
• count/p is an estimate of the number of times a value has been seen; e.g., the value 'a' has been seen 4/p ≈ 5.33 times
Counters

• How many bits do we need to count?
– prefix codes
– approximate counters
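As an illustration of approximate counters, here is a sketch of Morris's probabilistic counter (an assumption on my part: the slide does not name a specific scheme). The state c stays near log₂ of the true count, so roughly log log n bits suffice.

```python
import random

def morris_count(n_events, seed=0):
    """Morris's approximate counter (sketch): keep only c, and on each
    event increment c with probability 2^-c. The estimate 2^c - 1 has
    expectation equal to the true event count."""
    rng = random.Random(seed)
    c = 0
    for _ in range(n_events):
        if rng.random() < 2.0 ** -c:
            c += 1
    return 2 ** c - 1
```

A single counter has high variance; averaging several independent counters gives a usable estimate.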
Rarity

• Paul goes fishing
• There are many different fish species: U = {1,…,u}
• Paul catches one fish at a time: a_t ∈ U
• C_t[j] = |{ i ≤ t : a_i = j }| is the number of times species j has been caught up to time t
• Species j is rare at time t if it has been caught exactly once
• ρ[t] = |{ j : C_t[j] = 1 }| / u
Rarity

• Why is it interesting?
Again, let's use sampling

• U = {1,2,3,4,5,6,7,8,9,10,11,12,…,u}
• Choose a random sample of k species in advance, e.g. U' = {4, 9, 13, 18, 24}
• X_t[i] = |{ j ≤ t : a_j = U'[i] }| counts the catches of the i-th sampled species
Again, let's use sampling

• X_t[i] = |{ j ≤ t : a_j = U'[i] }|
• Reminder: ρ[t] = |{ j : C_t[j] = 1 }| / u
• Estimator: ρ'[t] = |{ i : X_t[i] = 1 }| / k
Rarity

• But ρ[t] needs to be at least about 1/k for ρ'[t] to be a good estimator
• Here the denominator counts only the species seen so far:
  ρ[t] = |{ j : C_t[j] = 1 }| / |{ j : C_t[j] ≠ 0 }|
Min-wise independent hash functions

• A family H of hash functions [n] → [n] is called min-wise independent if for any X ⊆ [n] and x ∈ X:
  Pr_{h ∈ H} [ h(x) = min h(X) ] = 1/|X|
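Combining the last two slides: a min-wise hash picks a uniformly random species among those actually seen, so tracking k independent minima, and how often each minimum's species has appeared, estimates ρ[t] = |{j : C_t[j]=1}| / |{j : C_t[j]≠0}|. A sketch under assumptions: the affine hashes used here are only approximately min-wise, and the function and variable names are mine.

```python
import random

def estimate_rarity(stream, k=50, seed=0):
    """Sketch of rarity estimation via (approximately) min-wise hashing:
    for each of k hash functions, track the element with the minimum
    hash value and its number of occurrences. The fraction of minima
    seen exactly once estimates the rarity ratio."""
    rng = random.Random(seed)
    p = 2**61 - 1
    hashes = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    minval = [None] * k   # current minimum hash value per function
    minelt = [None] * k   # element achieving that minimum
    count = [0] * k       # occurrences of that element so far
    for x in stream:
        for i, (a, b) in enumerate(hashes):
            h = (a * x + b) % p
            if minval[i] is None or h < minval[i]:
                minval[i], minelt[i], count[i] = h, x, 1
            elif x == minelt[i]:
                count[i] += 1
    return sum(1 for c in count if c == 1) / k
```

Note that since an element's hash is fixed, the global minimum is claimed at that element's first occurrence, so its count correctly tallies all of its occurrences.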