Sublinear Algorithms for Big Data
Transcript of Sublinear Algorithms for Big Data
Part 0: Introduction
• Disclaimers
• Logistics
• Materials
• …
Name
Correct:
• Grigory
• Gregory (easiest and highly recommended!)
Also correct:
• Dr. Yaroslavtsev (I bet it’s difficult to pronounce)
Wrong:
• Prof. Yaroslavtsev (Not any easier)
Disclaimers
• A lot of Math!
Disclaimers
• No programming!
Disclaimers
• 10-15 times longer than “Fuerza Bruta”, soccer game, milonga…
Big Data
• Data
• Programming and Systems
• Algorithms
• Probability and Statistics
Sublinear Algorithms
n = size of the data; we want o(n), not Θ(n)
• Sublinear Time
– Queries
– Samples
• Sublinear Space
– Data Streams
– Sketching
• Distributed Algorithms
– Local and distributed computations
– MapReduce-style algorithms
Why is it useful?
• Algorithms for big data used by big companies (ultra-fast randomized algorithms for approximate decision making)
– Networking applications (counting and detecting patterns in small space)
– Distributed computations (small sketches to reduce communication overheads)
• Aggregate Knowledge: startup doing streaming algorithms, acquired for $150M
• Today: Applications to soccer
Course Materials
• Will be posted at the class homepage: http://grigory.us/big-data.html
• Related and further reading:
– Sublinear Algorithms (MIT) by Indyk, Rubinfeld
– Algorithms for Big Data (Harvard) by Nelson
– Data Stream Algorithms (University of Massachusetts) by McGregor
– Sublinear Algorithms (Penn State) by Raskhodnikova
Course Overview
• Lecture 1
• Lecture 2
• Lecture 3
• Lecture 4
• Lecture 5
3 hours = 3 x (45-50 min lecture + 10-15 min break).
Puzzles
You see a sequence of n values x₁, …, xₙ, arriving one by one:
• (Easy, “Find a missing player”)
– If all xᵢ are different and have values between 1 and n + 1, which value is missing?
– You have O(log n) space
• Example:
– There are 11 soccer players with numbers 1, …, 11.
– You see 10 of them one by one; which one is missing? You can only remember a single number.
The players appear one by one: 1, 8, 5, 11, 3, 9, 2, 6, 7, 4.
Which number was missing?
Puzzle #1
You see a sequence of n values x₁, …, xₙ, arriving one by one:
• (Easy, “Find a missing player”)
– If all xᵢ are different and have values between 1 and n + 1, which value is missing?
– You have O(log n) space
• Example:
– There are 11 soccer players with numbers 1, …, 11.
– You see 10 of them one by one; which one is missing? You can only remember a single number.
Puzzle #2
You see a sequence of n values x₁, …, xₙ, arriving one by one:
• (Harder, “Keep a random team”)
– How can you maintain a uniformly random sample of k values out of those you have seen so far?
– You can store exactly k items at any time
• Example:
– You want to have a team of 11 players randomly chosen from the set you have seen.
– Players arrive one at a time and you have to decide whether to keep them or not.
Puzzle #3
You see a sequence of n values x₁, …, xₙ, arriving one by one:
• (Very hard, “Count the number of players”)
– What is the total number of values, up to a multiplicative error (1 ± ε)?
– You have O(log log n) space and can be completely wrong with some small probability
Puzzles
You see a sequence of n values x₁, …, xₙ, arriving one by one:
• (Easy, “Find a missing player”)
– If all xᵢ are different and have values between 1 and n + 1, which value is missing?
– You have O(log n) space
• (Harder, “Keep a random team”)
– How can you maintain a uniformly random sample of k values out of those you have seen so far?
– You can store exactly k items at any time
• (Very hard, “Count the number of players”)
– What is the total number of values, up to a multiplicative error (1 ± ε)?
– You have O(log log n) space and can be completely wrong with some small probability
Part 1: Probability 101
“The bigger the data the better you should know your Probability”
• Basic Spanish: Hola, Gracias, Bueno, Por favor, Bebida, Comida, Jamon, Queso, Gringo, Chica, Amigo, …
• Basic Probability:
– Probability, events, random variables
– Expectation, variance / standard deviation
– Conditional probability, independence, pairwise independence, mutual independence
Expectation
• X = random variable with values x₁, …, xₙ, taken with probabilities p₁, …, pₙ
• Expectation 𝔼[X] = Σᵢ pᵢ·xᵢ
• Properties (linearity): 𝔼[X + Y] = 𝔼[X] + 𝔼[Y], 𝔼[c·X] = c·𝔼[X]
• Useful fact: if all xᵢ ≥ 0 and integer, then 𝔼[X] = Σ_{i ≥ 1} Pr[X ≥ i]
Expectation
• Example: a fair dice has values 1, …, 6, each with probability 1/6:
𝔼[Value] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5
Variance
• Variance Var[X] = 𝔼[(X − 𝔼[X])²] = 𝔼[X²] − (𝔼[X])²
• 𝔼[X] is some fixed value (a constant)
• Corollary: Var[c·X] = c²·Var[X]
Variance
• Example (Variance of a fair dice):
Var[Value] = 𝔼[Value²] − (𝔼[Value])²
𝔼[Value²] = (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6
Var[Value] = 91/6 − 3.5² = 35/12 ≈ 2.92
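The slide’s arithmetic can be checked exactly with Python’s `fractions` module (a small sketch; the slides themselves contain no code):

```python
from fractions import Fraction

# Exact expectation and variance of a fair six-sided dice.
p = Fraction(1, 6)
mean = sum(p * v for v in range(1, 7))               # E[Value] = 7/2 = 3.5
second_moment = sum(p * v * v for v in range(1, 7))  # E[Value^2] = 91/6
variance = second_moment - mean**2                   # Var = 35/12 ≈ 2.92

print(mean, variance)
```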
Independence
• Two random variables X and Y are independent if and only if (iff) for every x, y:
Pr[X = x, Y = y] = Pr[X = x]·Pr[Y = y]
• Variables X₁, …, Xₙ are mutually independent iff
Pr[X₁ = x₁, …, Xₙ = xₙ] = Pr[X₁ = x₁] ⋯ Pr[Xₙ = xₙ]
• Variables X₁, …, Xₙ are pairwise independent iff for all pairs i, j:
Pr[Xᵢ = xᵢ, Xⱼ = xⱼ] = Pr[Xᵢ = xᵢ]·Pr[Xⱼ = xⱼ]
Independence: Example
Independent or not?
• Event: Argentina wins the World Cup
• Event: Messi becomes the best striker
Independent or not?
• Event: Argentina wins against Netherlands in the semifinals
• Event: Germany wins against Brazil in the semifinals
Independence: Example
• Ratings of mortgage securities
– AAA = 1% probability of default (over X years)
– AA = 2% probability of default
– A = 5% probability of default
– B = 10% probability of default
– C = 50% probability of default
– D = 100% probability of default
• You are a portfolio holder with 1000 AAA securities
– Are they all independent?
– Is the probability that they all default 0.01^1000?
Conditional Probabilities
• For two events A and B:
Pr[A | B] = Pr[A ∩ B] / Pr[B]
• If two events A and B are independent:
Pr[A | B] = Pr[A ∩ B] / Pr[B] (by definition)
= Pr[A]·Pr[B] / Pr[B] (by independence)
= Pr[A]
Union Bound
For any events E₁, …, Eₙ:
Pr[E₁ ∪ E₂ ∪ ⋯ ∪ Eₙ] ≤ Pr[E₁] + Pr[E₂] + ⋯ + Pr[Eₙ]
• Pro: Works even for dependent events!
• Con: Sometimes very loose, especially for mutually independent events
Union Bound: Example
Events “Argentina wins the World Cup” and “Messi becomes the best striker” are not independent, but:
Pr[“Argentina wins the World Cup” or “Messi becomes the best striker”]
≤ Pr[“Argentina wins the World Cup”] + Pr[“Messi becomes the best striker”]
Independence and Linearity of Expectation/Variance
• Linearity of expectation (even for dependent variables!):
𝔼[X₁ + ⋯ + Xₙ] = 𝔼[X₁] + ⋯ + 𝔼[Xₙ]
• Linearity of variance (only for pairwise independent variables!):
Var[X₁ + ⋯ + Xₙ] = Var[X₁] + ⋯ + Var[Xₙ]
Part 2: Inequalities
• Markov inequality
• Chebyshev inequality
• Chernoff bound
Markov’s Inequality
• For every c > 0 and non-negative X: Pr[X ≥ c·𝔼[X]] ≤ 1/c
• Proof (by contradiction). Suppose Pr[X ≥ c·𝔼[X]] > 1/c. Then
𝔼[X] = Σᵢ xᵢ·Pr[X = xᵢ] (by definition)
≥ Σ_{i: xᵢ ≥ c·𝔼[X]} xᵢ·Pr[X = xᵢ] (pick only some i’s)
≥ c·𝔼[X]·Pr[X ≥ c·𝔼[X]] > 𝔼[X] (by assumption), a contradiction.
Markov’s Inequality
• For every c > 0: Pr[X ≥ c·𝔼[X]] ≤ 1/c
• Corollary: For every c > 0: Pr[X ≥ c] ≤ 𝔼[X]/c
• Pro: always works (for non-negative X)!
• Cons:
– Not very precise
– Doesn’t work for the lower tail: Pr[X ≤ c]
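The corollary can be verified exhaustively on the fair-dice example used throughout the deck (a quick sketch of my own, not from the slides):

```python
from fractions import Fraction

# Check Pr[Value >= c] <= E[Value]/c for a fair dice and every
# integer threshold c; Value is non-negative, so Markov applies.
mean = Fraction(7, 2)  # E[Value] = 3.5
for c in range(1, 13):
    tail = Fraction(sum(1 for v in range(1, 7) if v >= c), 6)  # Pr[Value >= c]
    assert tail <= mean / c

print("Markov's corollary holds for all thresholds c = 1..12")
```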
Markov Inequality: Example
Markov 1: For every c > 0: Pr[X ≥ c·𝔼[X]] ≤ 1/c
• Example (fair dice): Pr[Value ≥ 2·𝔼[Value]] = Pr[Value ≥ 7] ≤ 1/2
Markov Inequality: Example
Markov 2: For every c > 0: Pr[X ≥ c] ≤ 𝔼[X]/c
• Example (fair dice): Pr[Value ≥ 5] ≤ 3.5/5 = 0.7 (the true probability is 1/3)
Markov Inequality: Example
Markov 2: For every c > 0: Pr[X ≥ c] ≤ 𝔼[X]/c
• Example (fair dice): Pr[Value ≥ 2] ≤ 3.5/2 = 1.75, a trivial bound since every probability is at most 1
Markov + Union Bound: Example
Markov 2: For every c > 0: Pr[X ≥ c] ≤ 𝔼[X]/c
• Example (two fair dice): Pr[Value₁ ≥ 4 or Value₂ ≥ 4] ≤ 3.5/4 + 3.5/4 = 1.75
Chebyshev’s Inequality
• For every c > 0:
Pr[|X − 𝔼[X]| ≥ c·√Var[X]] ≤ 1/c²
• Proof:
Pr[|X − 𝔼[X]| ≥ c·√Var[X]]
= Pr[(X − 𝔼[X])² ≥ c²·Var[X]] (by squaring)
≤ 𝔼[(X − 𝔼[X])²] / (c²·Var[X]) (by Markov’s inequality)
= 1/c²
Chebyshev’s Inequality
• For every c > 0:
Pr[|X − 𝔼[X]| ≥ c·√Var[X]] ≤ 1/c²
• Corollary (substitute c → c/√Var[X]): For every c > 0:
Pr[|X − 𝔼[X]| ≥ c] ≤ Var[X]/c²
Chebyshev: Example
• Roll a dice 10 times: Z = Average value over 10 rolls
• Xᵢ = value of the i-th roll, Z = (X₁ + ⋯ + X₁₀)/10
• Variance (by linearity for independent r.vs):
Var[Z] = (1/100)·(Var[X₁] + ⋯ + Var[X₁₀]) = 10·(35/12)/100 ≈ 0.29
Chebyshev: Example
• Roll a dice 10 times: Z = Average value over 10 rolls
• By Chebyshev: Pr[|Z − 3.5| ≥ c] ≤ Var[Z]/c² ≈ 0.29/c² (for n rolls, Var[Z] ≈ 2.92/n)
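A short simulation (my own sketch; the deviation threshold c = 1 is an arbitrary choice) compares the empirical deviation probability of the 10-roll average with the Chebyshev bound:

```python
import random

# Empirically check Pr[|Z - 3.5| >= c] <= Var[Z]/c^2 for the
# average Z of 10 fair dice rolls, where Var[Z] = (35/12)/10.
random.seed(0)
n_rolls, n_trials = 10, 100_000
var_z = (35 / 12) / n_rolls

c = 1.0
deviations = sum(
    1
    for _ in range(n_trials)
    if abs(sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls - 3.5) >= c
)
empirical = deviations / n_trials
chebyshev_bound = var_z / c**2  # ≈ 0.29

print(empirical, chebyshev_bound)
```

The empirical frequency comes out well below the bound, illustrating that Chebyshev is valid but loose here.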
Chernoff bound
• Let X₁, …, Xₜ be independent and identically distributed r.vs with range [0, 1] and expectation μ.
• Then if X = (1/t)·(X₁ + ⋯ + Xₜ) and 0 < ε < 1:
Pr[|X − μ| ≥ ε·μ] ≤ 2·exp(−t·μ·ε²/3)
Chernoff bound (corollary)
• Let X₁, …, Xₜ be independent and identically distributed r.vs with range [0, c] and expectation μ.
• Then if X = (1/t)·(X₁ + ⋯ + Xₜ) and 0 < ε < 1:
Pr[|X − μ| ≥ ε·μ] ≤ 2·exp(−t·μ·ε²/(3c))
Chernoff: Example
• Roll a dice 10 times: Z = Average value over 10 rolls
• By the corollary (μ = 3.5, range [0, 6], t = 10):
Pr[|Z − 3.5| ≥ 3.5·ε] ≤ 2·exp(−10·3.5·ε²/18) ≈ 2·exp(−1.94·ε²)
Chernoff: Example
• Roll a dice 1000 times: Z = Average value over 1000 rolls
• By the corollary (μ = 3.5, range [0, 6], t = 1000):
Pr[|Z − 3.5| ≥ 3.5·ε] ≤ 2·exp(−1000·3.5·ε²/18) ≈ 2·exp(−194·ε²)
Chernoff vs. Chebyshev: Example
Take the average of t samples (with ε and μ fixed constants):
• Chebyshev: the failure probability decays as O(1/t)
• Chernoff: the failure probability decays as exp(−Ω(t))
If t is very big:
– Chebyshev: 1/t is only polynomially small
– Chernoff: exp(−Ω(t)) is exponentially small
Chernoff vs. Chebyshev: Example
Large values of t are exactly what we need!
• Chebyshev: O(1/t)
• Chernoff: exp(−Ω(t))
So is Chernoff always better for us?
• Yes, if we have i.i.d. variables.
• No, if we have dependent or only pairwise independent random variables.
• If the variables are not identical, Chernoff-type bounds still exist.
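To make the comparison concrete, the following sketch (mine; the relative error ε = 0.1 is an assumed parameter) evaluates both tail bounds for the average of t dice rolls:

```python
import math

# Both bounds for Pr[|Z - mu| >= eps*mu], Z = average of t dice rolls.
# Dice: mu = 3.5, range [0, 6] (so c = 6), Var of one roll = 35/12.
mu, c_range, eps = 3.5, 6.0, 0.1
var_x = 35 / 12

for t in (100, 1_000, 10_000):
    chebyshev = var_x / (t * (eps * mu) ** 2)                  # decays like 1/t
    chernoff = 2 * math.exp(-t * mu * eps**2 / (3 * c_range))  # decays exponentially
    print(t, chebyshev, chernoff)
```

For small t Chebyshev can win, but once t is large the exponential decay of the Chernoff bound dominates.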
Answers to the puzzles
You see a sequence of n values x₁, …, xₙ, arriving one by one:
• (Easy)
– If all are different and have values between 1 and n + 1, which value is missing?
– You have O(log n) space
– Answer: missing value = (n + 1)(n + 2)/2 − Σᵢ xᵢ
• (Harder)
– How can you maintain a uniformly random sample of k values out of those you have seen so far?
– You can store exactly k values at any time
– Answer: Store the first k. When you see xᵢ for i > k, with probability k/i replace a uniformly random value from your storage with xᵢ.
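Both answers can be sketched in a few lines of Python (function names are mine; the logic is exactly the two answers above):

```python
import random

def missing_value(stream, n):
    """Puzzle 1: n distinct values from {1, ..., n+1}; return the missing one.
    Only the running sum is stored -- O(log n) bits."""
    return (n + 1) * (n + 2) // 2 - sum(stream)

def reservoir_sample(stream, k, rng):
    """Puzzle 2 (reservoir sampling): uniform sample of k items using k slots."""
    sample = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            sample.append(x)              # keep the first k items
        elif rng.random() < k / i:        # keep item i with probability k/i
            sample[rng.randrange(k)] = x  # evict a uniformly random slot
    return sample

print(missing_value([1, 8, 5, 11, 3, 9, 2, 6, 7, 4], 10))  # -> 10
team = reservoir_sample(range(1, 101), 11, random.Random(0))
print(sorted(team))
```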
Part 3: Morris’s Algorithm
• (Very hard, “Count the number of players”)
– What is the total number of values, up to a multiplicative error (1 ± ε)?
– You have O(log log n) space and can be completely wrong with some small probability
Morris’s Algorithm: Alpha-version
Maintain a counter X using O(log log n) bits
• Initialize X to 0
• When an item arrives, increase X by 1 with probability 2^(−X)
• When the stream is over, output 2^X − 1

Claim: 𝔼[2^X] = n + 1
Morris’s Algorithm: Alpha-version
Maintain a counter X using O(log log n) bits
• Initialize X to 0; when an item arrives, increase X by 1 with probability 2^(−X)
Claim: 𝔼[2^X] = n + 1
• Let Xₙ be the value of the counter after seeing n items:
𝔼[2^(Xₙ₊₁)] = Σⱼ Pr[Xₙ = j]·(2^(−j)·2^(j+1) + (1 − 2^(−j))·2^j)
= Σⱼ Pr[Xₙ = j]·(2^j + 1) = 1 + 𝔼[2^(Xₙ)]
Morris’s Algorithm: Alpha-version
Maintain a counter X using O(log log n) bits
• Initialize X to 0; when an item arrives, increase X by 1 with probability 2^(−X)
Claim: 𝔼[2^X] = n + 1
• Base case: 𝔼[2^(X₀)] = 1; by induction, 𝔼[2^(Xₙ)] = 1 + 𝔼[2^(Xₙ₋₁)] = n + 1
Morris’s Algorithm: Alpha-version
Maintain a counter X using O(log log n) bits
• Initialize X to 0; when an item arrives, increase X by 1 with probability 2^(−X)
• 𝔼[2^X] = n + 1, Var[2^X] = n(n − 1)/2
• Is this good? The standard deviation is about n/√2, so a single counter can be far off.
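A quick simulation of the alpha-version counter (my own sketch) illustrates that the output 2^X − 1 is correct in expectation but has large variance:

```python
import random

def morris_counter(n, rng):
    """Alpha-version Morris counter: increment X with probability 2^-X."""
    x = 0
    for _ in range(n):
        if rng.random() < 2.0 ** -x:
            x += 1
    return 2 ** x - 1  # the estimate of n

rng = random.Random(42)
n, trials = 1000, 2000
avg = sum(morris_counter(n, rng) for _ in range(trials)) / trials
print(avg)  # close to n = 1000, since E[2^X] = n + 1
```

Individual estimates jump between powers of two minus one (511, 1023, 2047, …), which is exactly the large-variance behavior the next slides fix by averaging.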
Morris’s Algorithm: Beta-version
Maintain t counters X₁, …, Xₜ using O(log log n) bits for each
• Initialize each to 0; when an item arrives, increase each Xᵢ by 1 independently with probability 2^(−Xᵢ)
• Output Z = (1/t)·Σᵢ (2^(Xᵢ) − 1)
• 𝔼[2^(Xᵢ)] = n + 1, Var[Z] ≤ n²/(2t)
• Claim: If t ≥ 3/(2ε²) then Pr[|Z − n| ≥ ε·n] ≤ 1/3
Morris’s Algorithm: Beta-version
Maintain t counters X₁, …, Xₜ using O(log log n) bits for each
• Output Z = (1/t)·Σᵢ (2^(Xᵢ) − 1)
• Claim: If t ≥ 3/(2ε²) then Pr[|Z − n| ≥ ε·n] ≤ 1/3
– By Chebyshev: Pr[|Z − n| ≥ ε·n] ≤ Var[Z]/(ε·n)² ≤ 1/(2t·ε²)
– If t ≥ 3/(2ε²) we can make this at most 1/3
Morris’s Algorithm: Final
• What if I want the probability of error to be really small, i.e. Pr[|Z − n| ≥ ε·n] ≤ δ?
• Same Chebyshev-based analysis: t = O(1/(ε²·δ))
• Better: do these steps m = O(log(1/δ)) times independently in parallel and output the median answer.
• Total space: O(ε⁻²·log(1/δ)·log log n) bits
Morris’s Algorithm: Final
• Do these steps m = O(log(1/δ)) times independently in parallel and output the median answer Z*.
Maintain t counters X₁, …, Xₜ using O(log log n) bits for each
• Initialize each to 0; when an item arrives, increase each Xᵢ by 1 independently with probability 2^(−Xᵢ)
• Output Z = (1/t)·Σᵢ (2^(Xᵢ) − 1)
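The full construction (t averaged counters, repeated m times with a median) can be sketched as follows; the parameter values n = 1000, t = 50, m = 9 are illustrative only, not from the slides:

```python
import random
import statistics

def morris_counter(n, rng):
    """Alpha-version Morris counter: increment X with probability 2^-X."""
    x = 0
    for _ in range(n):
        if rng.random() < 2.0 ** -x:
            x += 1
    return 2 ** x - 1

def morris_final(n, t, m, rng):
    """Median of m independent averages of t Morris counters each."""
    averages = [
        sum(morris_counter(n, rng) for _ in range(t)) / t for _ in range(m)
    ]
    return statistics.median(averages)

rng = random.Random(7)
estimate = morris_final(n=1000, t=50, m=9, rng=rng)
print(estimate)  # close to n = 1000
```

Averaging shrinks the variance by a factor of t, and taking the median boosts the constant success probability to 1 − δ.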
Morris’s Algorithm: Final Analysis
Claim: Pr[|Z* − n| ≥ ε·n] ≤ δ
• Let Yᵢ be an indicator r.v. for the event that |Zᵢ − n| ≤ ε·n, where Zᵢ is the i-th trial.
• Let Y = Y₁ + ⋯ + Yₘ. Each 𝔼[Yᵢ] ≥ 2/3, so the median Z* fails only if Y < m/2; by a Chernoff bound this happens with probability at most exp(−Ω(m)) ≤ δ for m = O(log(1/δ)).
Thank you!
• Questions?
• Next time:
– More streaming algorithms
– Testing distributions