Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 13
June 25, 2006
http://www.ee.technion.ac.il/courses/049011
4
Distinct Elements
[Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]
Input: a vector x ∈ [1,m]^n
Goal: find D = number of distinct elements of x
  Exact algorithms: need Ω(m) bits of space
  Deterministic algorithms: need Ω(m) bits of space
  Approximate randomized algorithms: O(log m) bits of space
5
Distinct Elements, 1st Attempt
Let M >> m^2
Pick a “random hash function” h: [1,m] → [1,M]
  h(1),…,h(m) are chosen uniformly and independently from [1,M]
  Since M >> m^2, probability of collisions is tiny

min ← M
for i = 1 to n do
  read x_i from stream
  if h(x_i) < min, min ← h(x_i)
output M/min
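The first attempt can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the names `make_random_hash` and `min_hash_estimate` are made up here, and the "random hash function" is simulated by a lookup table that draws a fresh uniform value for every new element.

```python
import random

def make_random_hash(M, seed=None):
    """Return a lazily sampled 'truly random' function into [1, M].

    Each new key gets an independent uniform value, which is remembered,
    so the table grows with the number of distinct keys seen.
    """
    rng = random.Random(seed)
    table = {}

    def h(x):
        if x not in table:
            table[x] = rng.randint(1, M)  # uniform on {1, ..., M}
        return table[x]

    return h

def min_hash_estimate(stream, h, M):
    """One pass over the stream, tracking only the minimum hash value."""
    smallest = M
    for x in stream:
        smallest = min(smallest, h(x))
    return M / smallest
```

The estimator itself keeps a single number; the lookup table inside `h` is what makes the space usage grow with the number of distinct elements.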
6
Distinct Elements: Analysis
Space:
  O(log M) = O(log m) for min
  O(m log M) = O(m log m) for h
Too much! Worse than the naïve O(m) space algorithm
Next: show how to use more “space-efficient” hash functions
7
Small Families of Hash Functions
H = { h | h: [1,m] → [1,M] }: a family of hash functions
|H| = O(m^c) for some constant c
  Therefore, each h ∈ H can be represented in O(log m) bits
Need H to be “explicit”: given representation of h, can compute h(x), for any x, efficiently.
How do we make sure H has the “random-like” properties of random hash functions?
8
Universal Hash Functions
[Carter, Wegman 79]
H is a 2-universal family of hash functions if: for all x ≠ y ∈ [1,m] and for all z,w ∈ [1,M], when choosing h from H randomly,
  Pr[h(x) = z and h(y) = w] = 1/M^2
Conclusions:
  For each x, h(x) is uniform in [1,M]
  For all x ≠ y, h(x) and h(y) are independent
  h(1),…,h(m) is a sequence of uniform pairwise-independent random variables
k-universal families: straightforward generalization
9
Construction of a Universal Family
Suppose M is a prime power
  [1,M] can be viewed as a finite field F_M
  [1,m] can be viewed as elements of F_M
H = { h_{a,b} | a,b ∈ F_M } is defined as:
  h_{a,b}(x) = ax + b
Note:
  |H| = M^2
  If x ≠ y ∈ F_M and z,w ∈ F_M, then h_{a,b}(x) = z and h_{a,b}(y) = w iff
    ax + b = z
    ay + b = w
  Since x ≠ y, the above system has a unique solution: a = (z - w)/(x - y), b = z - ax
  Hence, Pr_{a,b}[h_{a,b}(x) = z and h_{a,b}(y) = w] = 1/M^2.
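The counting argument can be checked by brute force for a small field. The sketch below assumes M = p is prime, so that field arithmetic is ordinary arithmetic mod p (a general prime power would need a proper finite-field implementation), and it uses elements 0..p-1 rather than 1..M. The function names are illustrative only.

```python
from itertools import product

def make_hash(a, b, p):
    """h_{a,b}(x) = a*x + b over the prime field F_p (elements 0..p-1)."""
    return lambda x: (a * x + b) % p

def is_pairwise_independent(p):
    """Check that, for every x != y, each target (z, w) is hit by exactly one (a, b)."""
    for x, y in product(range(p), repeat=2):
        if x == y:
            continue
        counts = {}
        for a, b in product(range(p), repeat=2):
            h = make_hash(a, b, p)
            key = (h(x), h(y))
            counts[key] = counts.get(key, 0) + 1
        # p^2 functions and p^2 targets, each hit exactly once => Pr = 1/p^2
        if len(counts) != p * p or set(counts.values()) != {1}:
            return False
    return True
```

Running the same check with a composite modulus such as 4 fails, which shows why the field structure matters: x - y must be invertible for the system to have a unique solution.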
10
Distinct Elements, 2nd Attempt
Use 2-universal hash functions rather than a random hash function
Space:
  O(log m) for tracking the minimum
  O(log m) for storing the hash function
Correctness:
  Part 1:
    h(a_1),…,h(a_D) are still uniform in [1,M]
    Linearity of expectation holds regardless of whether Z_1,…,Z_k are independent or not.
  Part 2:
    h(a_1),…,h(a_D) are still uniform in [1,M]
    Main point: variance of pairwise independent variables is additive:
      Var[Z_1 + … + Z_k] = Var[Z_1] + … + Var[Z_k]
11
Distinct Elements, Better Approximation
So far we had a factor 6 approximation. How do we get a better one?
1 + ε approximation algorithm:
  Find the t = O(1/ε^2) smallest hash values, rather than just the smallest one.
  If v is the largest among these, output tM/v
Space: O(1/ε^2 log m)
  Better algorithm: O(1/ε^2 + log m)
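A minimal Python sketch of the "t smallest values" estimator is given below. It is an illustration under assumptions, not the lecture's code: the hash function `h` is passed in by the caller (any map into [1, M]), the name `estimate_distinct` is made up, and the case of fewer than t distinct hash values, which the slide does not discuss, is handled by returning the exact count.

```python
import heapq

def estimate_distinct(stream, h, M, t):
    """Estimate the number of distinct elements from the t smallest hash values."""
    heap = []      # max-heap (negated values) holding the t smallest hashes
    kept = set()   # the same values, for duplicate detection
    for x in stream:
        v = h(x)
        if v in kept:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            # replace the current largest of the kept values by v
            evicted = -heapq.heappushpop(heap, -v)
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < t:
        return float(len(heap))   # fewer than t distinct hashes: count is exact
    return t * M / -heap[0]       # v = largest of the t smallest values
```

With the identity map as a stand-in hash on the stream 1,…,1000 and M = 1000, t = 10, the kept values are 1,…,10, so v = 10 and the output is 10 · 1000 / 10 = 1000.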
12
Lp Norms
Input: an integer vector x ∈ [-m,+m]^n
Goal: find ||x||_p = Lp norm of x
Popular instantiations:
  L2: Euclidean distance
  L1: Manhattan distance
  L∞: max
  L0: # of non-zeros (assuming 1/0 = 1, 0^0 = 0)
    Not a norm
Data stream algorithm: Can be done trivially in O(log m) space
13
Lp Norms: The “Cash Register” Model
Input: a sequence X of N pairs (i_1,a_1),…,(i_N,a_N)
  For each j, i_j ∈ {1,…,n}
  For each j, a_j ∈ [-m,m]
  Ex: X = (1,3), (3,-2), (1,-5), (2,4), (2,1)
For each i = 1,…,n, let S_i = { j | i_j = i }
  Ex: S_1 = {1,3}, S_2 = {4,5}, S_3 = {2}
Define: x_i = Σ_{j ∈ S_i} a_j
  Ex: x_1 = -2, x_2 = 5, x_3 = -2
Goal: find ||x||_p = Lp norm of x
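The definition of x can be made concrete with a short Python reference computation on the slide's example stream. This is not a streaming algorithm: it stores the whole vector x, which is exactly what the small-space algorithms avoid. The helper names `aggregate` and `lp_norm` are illustrative.

```python
from collections import defaultdict

def aggregate(updates):
    """Collapse a stream of (i, a) updates into the vector x (as a dict)."""
    x = defaultdict(int)
    for i, a in updates:
        x[i] += a
    return dict(x)

def lp_norm(x, p):
    """||x||_p for finite p > 0, computed from the aggregated vector."""
    return sum(abs(v) ** p for v in x.values()) ** (1.0 / p)

X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]
x = aggregate(X)          # x_1 = -2, x_2 = 5, x_3 = -2
l1 = lp_norm(x, 1)        # |-2| + |5| + |-2| = 9
l2 = lp_norm(x, 2)        # sqrt(4 + 25 + 4)
```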
14
Lp Norms in the “Cash Register” Model: Applications
Standard Lp norms
Lp distances
  Input: two vectors x,y ∈ [-m,+m]^n (interleaved arbitrarily)
  Goal: find ||x - y||_p
Frequency moments:
  Input: a vector X ∈ [1,n]^N
    Ex: X = (1 2 3 1 1 2)
  For each i = 1,…,n, define: x_i = frequency of i in X
    Ex: x_1 = 3, x_2 = 2, x_3 = 1
  Goal: output ||x||_p
  Special cases:
    p = ∞: Most frequent element
    p = 0: Distinct elements
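The frequency-vector example can be reproduced with a few lines of Python. As before, this is an exact reference computation that stores all frequencies, shown only to make the two special cases concrete; the variable names are illustrative.

```python
from collections import Counter

X = (1, 2, 3, 1, 1, 2)
x = Counter(X)                                      # x_i = frequency of i in X

distinct = sum(1 for v in x.values() if v != 0)     # p = 0: number of non-zeros
max_frequency = max(x.values())                     # p = infinity: largest entry
```

For p = ∞ the norm is the frequency of the most frequent element, and for p = 0 it is the number of distinct elements.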
15
Lp Norms: State of the Art Results
0 < p ≤ 2:
  O(log n log m) space algorithm [Indyk 00]
2 < p < ∞:
  O(n^{1-2/p} log m) space algorithm [Indyk, Woodruff 05]
  Ω(n^{1-2/p-o(1)}) space lower bound [Saks, Sun 02], [Bar-Yossef, Jayram, Kumar, Sivakumar 02], [Chakrabarti, Khot, Sun 03]
p = ∞:
  O(n) space algorithm [Alon, Matias, Szegedy 96]
  Ω(n) space lower bound [Alon, Matias, Szegedy 96]
p = 0 (distinct elements):
  O(log n + 1/ε^2) space algorithm [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]
  Ω(log n + 1/ε^2) space lower bound [Alon, Matias, Szegedy 96], [Indyk, Woodruff 03]
16
Stable Distributions
D: distribution on R, x ∈ R^n, p ∈ (0,2]
The distribution D_x:
  Z_1,…,Z_n: i.i.d. random variables with distribution D
  D_x = distribution of Σ_i x_i Z_i
The distribution D_{p,x}:
  Z: random variable with distribution D
  D_{p,x} = distribution of ||x||_p Z
Definition: D is p-stable, if for every x, D_x = D_{p,x}.
Examples:
  p = 2: Standard normal distribution.
  p = 1: Cauchy distribution.
  Other p’s: no closed form pdf.
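The 1-stability of the Cauchy distribution can be checked empirically. The sketch below draws Cauchy samples with the inverse-cdf formula tan(π(U - 1/2)) and uses the fact that the median of |Z| is 1 for a standard Cauchy Z, so the median of |Σ_i x_i Z_i| should be close to ||x||_1. The vector x, the sample count and the function names are arbitrary choices for illustration.

```python
import math
import random
import statistics

def cauchy(rng):
    """Standard Cauchy sample via the inverse cdf: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def median_abs_combination(x, samples, rng):
    """Median of |sum_i x_i Z_i| over independent draws of Z_1..Z_n."""
    draws = [abs(sum(xi * cauchy(rng) for xi in x)) for _ in range(samples)]
    return statistics.median(draws)

rng = random.Random(0)
x = [-2, 5, -2]
l1 = sum(abs(v) for v in x)                       # ||x||_1 = 9
observed = median_abs_combination(x, 20001, rng)  # should be close to l1
```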
17
Indyk’s Algorithm
For simplicity, assume p = 1.
Input: a sequence X = (i_1,a_1),…,(i_N,a_N)
Output: a value z s.t.
  Pr[(1 - ε) ||x||_1 ≤ z ≤ (1 + ε) ||x||_1] ≥ 1 - δ
“Cauchy hash function”: h: [1,n] → R
  h(1),…,h(n) are i.i.d. with Cauchy distribution
  In practice, use bounded precision
18
Indyk’s Algorithm, 1st Attempt
k ← O(1/ε^2 log(1/δ))
generate k Cauchy hash functions h_1,…,h_k
for t = 1,…,k do
  A_t ← 0
for j = 1,…,N do
  read (i_j,a_j) from data stream
  for t = 1,…,k do
    A_t ← A_t + a_j h_t(i_j)
output median(|A_1|,…,|A_k|)
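A direct Python rendering of this first attempt is sketched below, with several assumptions: the Cauchy hash functions are simulated by dictionaries filled lazily with independent samples (so the space is far from optimal), k is passed in directly instead of being derived from ε and δ, full floating-point precision is used, and the output is the median of the absolute values |A_t|, since the Cauchy distribution is symmetric around 0. The name `indyk_l1_estimate` is illustrative.

```python
import math
import random
import statistics

def indyk_l1_estimate(stream, k, seed=0):
    """Median-of-|A_t| estimate of ||x||_1 from a stream of (i, a) updates."""
    rng = random.Random(seed)
    hashes = [{} for _ in range(k)]   # h_t stored explicitly, one table per t
    A = [0.0] * k
    for i, a in stream:
        for t in range(k):
            if i not in hashes[t]:
                # lazily draw h_t(i) ~ Cauchy via the inverse cdf
                hashes[t][i] = math.tan(math.pi * (rng.random() - 0.5))
            A[t] += a * hashes[t][i]
    return statistics.median(abs(v) for v in A)

X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]
estimate = indyk_l1_estimate(X, k=4001)   # true value: ||x||_1 = 9
```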
19
Correctness Analysis
Fix some t ∈ [1,k]
What value does A_t have at the end of the execution?
  A_t = Σ_{j=1..N} a_j h_t(i_j) = Σ_{i=1..n} x_i h_t(i)
Recall: h_t(1),…,h_t(n) are i.i.d. with 1-stable distribution
Therefore, A_t is distributed the same as: ||x||_1 Z
  Z: random variable with Cauchy distribution
20
Correctness Analysis (cont.)
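Z_1,…,Z_k: i.i.d. random variables with Cauchy distribution
Output of algorithm: median(|A_1|,…,|A_k|)
  Absolute values are needed: the Cauchy distribution is symmetric around 0, so the median of the signed values A_t concentrates around 0, not around ||x||_1
Same as: median(||x||_1 |Z_1|,…,||x||_1 |Z_k|) = ||x||_1 median(|Z_1|,…,|Z_k|)
Conclusion: enough to show:
  Pr[median(|Z_1|,…,|Z_k|) ∈ [1 - ε, 1 + ε]] ≥ 1 - δ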
21
Correctness Analysis (cont.)
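Claim: Let Z be distributed Cauchy. Then,
  Pr[|Z| ≤ 1] = 1/2
Proof: The cdf of the Cauchy distribution is:
  F(z) = 1/2 + arctan(z)/π
Therefore,
  Pr[|Z| ≤ 1] = F(1) - F(-1) = 3/4 - 1/4 = 1/2
Claim: Let Z be distributed Cauchy. For any sufficiently small ε > 0,
  Pr[|Z| < 1 - ε] ≤ 1/2 - ε/4 and Pr[|Z| > 1 + ε] ≤ 1/2 - ε/4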
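Both claims can be checked numerically from the Cauchy cdf F(z) = 1/2 + arctan(z)/π. The sketch below assumes the claims concern |Z| and that the bound in the second claim is 1/2 - ε/4, the value consistent with the expectation k/2 - εk/4 used in the proof that follows; the sample values of ε are arbitrary.

```python
import math

def cauchy_cdf(z):
    """cdf of the standard Cauchy distribution."""
    return 0.5 + math.atan(z) / math.pi

def prob_abs_below(t):
    """Pr[|Z| < t] for a standard Cauchy Z and t >= 0."""
    return cauchy_cdf(t) - cauchy_cdf(-t)

for eps in (0.01, 0.1, 0.3):
    low = prob_abs_below(1 - eps)        # Pr[|Z| < 1 - eps]
    high = 1 - prob_abs_below(1 + eps)   # Pr[|Z| > 1 + eps]
    print(eps, low, high, 0.5 - eps / 4)
```

To first order both probabilities equal 1/2 - ε/π, and since 1/π > 1/4 this sits below 1/2 - ε/4 once ε is small enough.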
22
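Correctness Analysis (cont.)
Claim: Let Z_1,…,Z_k be k = O(1/ε^2 log(1/δ)) i.i.d. Cauchy random variables. Then,
  Pr[median(|Z_1|,…,|Z_k|) ∈ [1 - ε, 1 + ε]] ≥ 1 - δ
Proof: For j = 1,…,k, let Y_j = 1 if |Z_j| < 1 - ε, and Y_j = 0 otherwise. Then,
  median(|Z_1|,…,|Z_k|) < 1 - ε iff Σ_j Y_j ≥ k/2
  E[Σ_j Y_j] ≤ k/2 - εk/4
  By the Chernoff-Hoeffding bound,
    Pr[Σ_j Y_j ≥ k/2] ≤ exp(-ε^2 k/8) < δ/2 for k > (8/ε^2) ln(2/δ)
Similar analysis shows:
  Pr[median(|Z_1|,…,|Z_k|) > 1 + ε] < δ/2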
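To get a feel for the constants hidden in k = O(1/ε^2 log(1/δ)), the following sketch evaluates the Hoeffding bound for 0/1 variables, Pr[S - E[S] ≥ s] ≤ exp(-2s^2/k), with deviation s = εk/4. The constant 8 comes from this particular form of the bound and is an assumption of the sketch, not a figure from the lecture; the function names are illustrative.

```python
import math

def hoeffding_tail(eps, k):
    """Bound on Pr[sum_j Y_j >= k/2] when E[sum_j Y_j] <= k/2 - eps*k/4."""
    return math.exp(-eps ** 2 * k / 8.0)

def required_k(eps, delta):
    """Smallest integer k with exp(-eps^2 * k / 8) <= delta / 2."""
    return math.ceil(8.0 / eps ** 2 * math.log(2.0 / delta))

k = required_k(0.1, 0.05)   # eps = 10%, delta = 5%
```

For ε = 0.1 and δ = 0.05 this gives k = 2952 counters, which illustrates why the 1/ε^2 factor dominates the space bound.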
23
Space Analysis
Space used: k = O(1/ε^2 log(1/δ)) times:
  A_t: O(log m) bits
  h_t: O(n log m) bits
Too much!
This time we really need h_t(1),…,h_t(n) to be totally independent
  Otherwise, resulting distribution is not stable
  Cannot use universal hashing
What can we do?
24
Pseudo-Random Generators for Space-Bounded Computations [Nisan 90]
Notation: U_k = a random sequence of k bits
An S-space R-random bits randomized algorithm A:
  Uses at most S bits of space
  Uses at most R random bits
  Accesses random bits sequentially
  A(x,U_R): (random) output of A on input x
Nisan’s pseudo-random generator: G: {0,1}^{S log R} → {0,1}^R s.t.
  For every S-space R-random bits randomized algorithm A, for every input x, A(x,U_R) has almost the same distribution as A(x,G(U_{S log R}))
25
Space Analysis
Suppose input stream is guaranteed to come in the following order:
  First all pairs of the form (1,*)
  Then, all pairs of the form (2,*)
  …
  Finally, all pairs of the form (n,*)
Then, we can generate the values h_t(1),…,h_t(n) on the fly, and no need to store them
  O(log m) bits will suffice to store the hash function
Therefore, for such input streams, Indyk’s algorithm uses:
  O(log m) bits of space
  O(n log m) random bits
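The on-the-fly generation for ordered streams can be sketched as follows for a single counter A_t. The random source is passed in as a callable `draw` so that it can be replaced (for example by a pseudo-random generator); both `cauchy_source` and `ordered_counter` are illustrative names, and the code assumes the stream really is sorted by index.

```python
import math
import random

def cauchy_source(seed=0):
    """Return a function producing independent standard Cauchy samples."""
    rng = random.Random(seed)
    return lambda: math.tan(math.pi * (rng.random() - 0.5))

def ordered_counter(stream, draw):
    """Compute A_t = sum_j a_j * h_t(i_j) for a stream sorted by index.

    Only the hash value of the current index is kept in memory.
    """
    A = 0.0
    current_index = None
    current_hash = None
    for i, a in stream:
        if i != current_index:      # first update for a new index
            current_index = i
            current_hash = draw()   # h_t(i), generated on the fly
        A += a * current_hash
    return A
```

Each index triggers exactly one draw, so the random bits are read sequentially and never revisited, which is the access pattern required of an S-space R-random bits algorithm.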
26
Space Analysis (cont.)
Conclusion: For “ordered” input streams, Indyk’s algorithm is an O(log m)-space O(n log m)-random bits randomized algorithm.
Can use Nisan’s generator
  h_t can now be generated from only O(log m log n) random bits
  Space needed: O(log n log m) bits
Crucial observation: each counter A_t is a sum of the terms a_j h_t(i_j), so the output of Indyk’s algorithm does not depend on the order of the input stream.
Conclusion: If we generate the Cauchy hash functions using Nisan’s generator, then Indyk’s algorithm will work even for “unordered” streams.
27
Wrapping Up
Space used: k = O(1/ε^2 log(1/δ)) times:
  A_t: O(log m) bits
  h_t: O(log n log m) bits (using Nisan’s generator)
Total: O(1/ε^2 log(1/δ) log n log m) bits