Advanced Algorithms for Massive Datasets
The power of “failing”
Not perfectly true but...
[Figure: false positive rate (y-axis, 0 to 0.1) as a function of the number of hash functions k (x-axis, 0 to 10)]
m/n = 8, Opt k = 8 ln 2 ≈ 5.545
We do have an explicit formula for the optimal k: k = (m/n) ln 2
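As a rough numerical check of the plot and of the optimal k, the sketch below evaluates the standard approximation (1 - e^(-kn/m))^k of the false positive rate for m/n = 8. The function names are hypothetical and chosen only for this illustration.

```python
import math


def false_positive_rate(bits_per_key: float, k: int) -> float:
    """Approximate false positive rate (1 - e^(-k*n/m))^k, with bits_per_key = m/n."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


def optimal_k(bits_per_key: float) -> float:
    """Real-valued k that minimizes the approximation above: (m/n) * ln 2."""
    return bits_per_key * math.log(2)


def best_integer_k(bits_per_key: float, k_max: int = 10) -> int:
    """Integer k in 1..k_max with the smallest approximate false positive rate."""
    return min(range(1, k_max + 1), key=lambda k: false_positive_rate(bits_per_key, k))


if __name__ == "__main__":
    ratio = 8  # m/n, as on the slide
    for k in range(1, 11):
        print(f"k = {k:2d}   rate = {false_positive_rate(ratio, k):.4f}")
    print(f"optimal k (real-valued) = {optimal_k(ratio):.3f}")
    print(f"best integer k          = {best_integer_k(ratio)}")
```

Since k must be an integer in practice, one would typically round the real-valued optimum and compare the two neighbouring integers.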
Other advantage: no key storage
Crawling
What data structure should we use to keep track of the URLs already visited by a crawler?
URLs are long
Check should be very fast
Small errors are tolerable (a false positive ≈ a page not crawled)
Bloom Filter over crawled URLs
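A minimal sketch of such a filter over crawled URLs is given below. It is not the implementation from the lecture: the class name `BloomFilter`, the choice of SHA-256 and the double-hashing trick used to derive the k positions are all assumptions made for this example.

```python
import hashlib
import math


class BloomFilter:
    """Bit-array Bloom filter over strings (illustrative sketch, not tuned)."""

    def __init__(self, n_expected: int, bits_per_key: int = 8):
        self.m = max(1, n_expected * bits_per_key)            # number of bits
        self.k = max(1, round(bits_per_key * math.log(2)))    # number of hash functions
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        # Assumption: derive k positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


if __name__ == "__main__":
    visited = BloomFilter(n_expected=1000)
    url = "http://example.com/a"
    if url not in visited:      # a false positive here means the page is skipped
        visited.add(url)        # only bits are set; the URL itself is not stored
    print(url in visited)
```

The filter never stores the URLs themselves, which matters here because URLs are long; the price is that a small fraction of unseen URLs may be reported as already visited.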
Anti-virus detection
D is a dictionary of virus checksums of some given length z. For each position i, check…
Brute-force check: O( |D| * |F| ) time
Trie check: O( z * |F| ) time
Better solution?
Build a BF on D.
For each position i, query the BF on F[i,i+z-1]; if the BF answers YES,
then “warn the user” or explicitly scan D
O(k*|F|)
or even better...
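The scan described above can be sketched as follows, under several assumptions: the function names are hypothetical, the filter uses one byte per bit for simplicity, and each window is hashed from scratch with SHA-256, which costs O(z) per position. Reaching O(k * |F|) would additionally require a rolling hash, which this sketch does not implement.

```python
import hashlib


def bloom_positions(window: bytes, m: int, k: int):
    """k positions in [0, m) derived from one SHA-256 digest (assumed scheme)."""
    digest = hashlib.sha256(window).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + j * h2) % m for j in range(k)]


def build_filter(D, m: int, k: int) -> bytearray:
    """Build the BF on the dictionary D (one byte per bit, for readability)."""
    bits = bytearray(m)
    for checksum in D:
        for p in bloom_positions(checksum, m, k):
            bits[p] = 1
    return bits


def scan(F: bytes, D: set, z: int, m: int = 1 << 16, k: int = 6):
    """Return the positions i such that F[i:i+z] is in D."""
    bits = build_filter(D, m, k)
    hits = []
    for i in range(len(F) - z + 1):
        window = F[i:i + z]
        # The BF answers NO for almost every clean window, so D is rarely consulted.
        if all(bits[p] for p in bloom_positions(window, m, k)):
            if window in D:          # explicit check removes false positives
                hits.append(i)
    return hits


if __name__ == "__main__":
    D = {b"EVIL", b"WORM"}           # toy "checksums" of length z = 4
    print(scan(b"xxEVILyyyWORMzz", D, z=4))
```

Dropping the explicit `window in D` test corresponds to the "warn the user" option: faster, but a warning may occasionally be a false positive.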
Upper bounds
Recurring minimum for improving the estimate
+ 2 SBF