April 2013
![Page 1: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/1.jpg)
April 2013
Bloom filters as a space-efficient dictionary structure
Szymon [email protected]
Institute of Applied Computer Science, Lodz University of Technology (Politechnika Łódzka)
![Page 2: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/2.jpg)
The idea of hashing
Hash table: stores keys (and possibly satellite data); the location of a key is found via a hash function, and some collision resolution method is needed.
[Figure: chained hashing vs. open addressing]
![Page 3: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/3.jpg)
3
Hash structures are typically fast…
…but have a few drawbacks: they are randomized, don’t allow for iteration in sorted order, and may require quite a lot of space.
Now, if we require as little space as possible, what can we do?
![Page 4: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/4.jpg)
4
Bloom Filter (Bloom, 1970)
Don’t store the keys themselves!
Just be able to answer whether a given key is in the structure.
If the answer is “no”, it is correct.
But if the answer is “yes”, it may be wrong!
So, it’s a probabilistic data structure.
There’s a tradeoff between its space (avg space per inserted key) and “truthfulness”.
![Page 5: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/5.jpg)
5
Bloom Filter features
• little space, so it saves RAM (mostly relevant for old apps);
• little space, so the structure is also fast to transfer over a network;
• little space, sometimes small enough to fit in the L2 (or L1!) CPU cache (Li & Zhong, 2006: a BF makes a Bayesian spam filter work much faster thanks to fitting in the L2 cache);
• extremely simple / easy to implement;
• major application domains: databases, networking;
• …but also a few drawbacks / issues, hence a significant interest in devising novel BF variants.
![Page 6: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/6.jpg)
6
Bloom Filter idea
The idea:
• keep a bit-vector of some size m, initially all zeros;
• use k independent hash functions (h.f.) for each added key (instead of one, as in a standard HT);
• write 1 in the k locations pointed to by the k h.f.;
• testing for a key: if all k calculated locations contain 1, return “yes” (= the key exists), which may be wrong; if at least one of the k locations contains 0, return “no”, which is always correct.
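A minimal Python sketch of the idea on this slide. The class name `BloomFilter`, the use of SHA-256, and the double-hashing trick that derives the k positions from a single digest are assumptions of this sketch, not something the slides prescribe.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: an m-bit vector and k hash positions per key."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all zeros initially

    def _positions(self, key: str) -> list[int]:
        # Assumption: simulate k hash functions by double hashing,
        # (h1 + i * h2) mod m, with h1 and h2 taken from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        # Write 1 in the k locations selected for this key.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def exists(self, key: str) -> bool:
        # "no" is always correct; "yes" may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter(m=1024, k=5)
for word in ("alpha", "beta", "gamma"):
    bf.insert(word)
print(bf.exists("alpha"))  # inserted keys are never reported as missing
print(bf.exists("delta"))  # most likely False, but a false positive is possible
```

Note that the keys themselves are never stored, so the filter cannot list its contents.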
![Page 7: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/7.jpg)
7
BF, basic API
insert(k)
exists(k)
No delete(k)!
And of course: no iteration over the keys added to the BF (no content listing).
http://www.cl.cam.ac.uk/research/srg/opera/meetings/attachments/2008-10-14-BloomFiltersSurvey.pdf
![Page 8: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/8.jpg)
8
Early applications – spellchecking (1982, 1990), hyphenation
If a spellchecker occasionally ignores (fails to flag) a word that is not in its dictionary, that’s not a big problem.
This is exactly the case with BF in this app.
Quite a good app: the dictionary is static (or almost static), so once we set the BF size, we can estimate the error, which practically doesn’t change.
App from Bloom’s paper: a program for automatic hyphenation in which 90% of words can be hyphenated using simple rules, but 10% require a dictionary lookup.
![Page 9: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/9.jpg)
9
Bloom speaking…
http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=7FB6933B782FBC9C98BBCDA0EB420935?doi=10.1.1.20.2080&rep=rep1&type=pdf
![Page 10: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/10.jpg)
10
BF tradeoffs
The error grows with load (i.e. with growing n / m, where n is the # of added items).
When the BF is almost empty, the error is very small, but then we also waste lots of space.
Another factor: k. How to choose it?
For any specified load (m chosen for the ‘expected’ n in a given scenario) there is an optimal value of k (one that minimizes the error).
k too small – too many collisions; k too large – the bit vector gets too ‘dense’ quickly (and too many collisions, too!)
![Page 11: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/11.jpg)
11
Finding the best k
We assume the hash functions choose each bit-vector slot with equal prob.
Pr(a given bit NOT set by a given h.f.) = $1 - 1/m$
Pr(a given bit NOT set) = $(1 - 1/m)^{kn}$
m and kn are typically large, so Pr(a given bit NOT set) $\approx e^{-kn/m}$
Pr(a given bit is set) = $1 - (1 - 1/m)^{kn}$
Consider an element not added to the BF: the filter will lie if all the corresponding k bits are set.
This is: Pr(a given bit is set)$^k$ = $(1 - (1 - 1/m)^{kn})^k \approx (1 - e^{-kn/m})^k$
![Page 12: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/12.jpg)
12
Finding the best k, cont’d
Again, Pr(the Bloom filter lies) $\approx (1 - e^{-kn/m})^k$.
Clearly, the error grows with growing n (for fixed k, m) and decreases with growing m (for fixed k, n).
What is the optimal k?
Differentiation (= calculating a derivative) helps.
The error is minimized for $k = \ln 2 \cdot m/n \approx 0.693\, m/n$. (Then the numbers of 1s and 0s in the bit-vector are approx. equal. Of course, k must be an integer!)
And the error (false positive rate, FPR) = $(1/2)^k \approx (0.6185)^{m/n}$.
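The differentiation step on this slide can be spelled out as follows. This is one possible route (substituting p for the probability that a bit is still 0), offered as a sketch rather than as the derivation used in the talk.

```latex
\begin{align*}
f(k) &= \left(1 - e^{-kn/m}\right)^{k}, \qquad
p := e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\,\ln p,\\
\ln f &= k \ln(1 - p) = -\frac{m}{n}\,\ln p\,\ln(1 - p),\\
\frac{d}{dp}\bigl[\ln p\,\ln(1 - p)\bigr]
  &= \frac{\ln(1 - p)}{p} - \frac{\ln p}{1 - p} = 0
  \quad\text{at } p = \tfrac{1}{2},\\
k_{\mathrm{opt}} &= \frac{m}{n}\,\ln 2, \qquad
f(k_{\mathrm{opt}}) = \left(\tfrac{1}{2}\right)^{k_{\mathrm{opt}}}
  = \left(2^{-\ln 2}\right)^{m/n} \approx 0.6185^{\,m/n}.
\end{align*}
```

Since p is the probability that a bit is still 0, p = 1/2 is exactly the "half zeros, half ones" condition.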
![Page 13: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/13.jpg)
13
Minimizing the error in practice
m = 8n: error ≈ 0.0214
m = 12n: error ≈ 0.0031
m = 16n: error ≈ 0.0005
http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html#SECTION00053000000000000000
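The figures on this slide can be checked with a few lines of Python. The helper names `optimal_k` and `fpr` are made up for this sketch; the formulas are the approximations from the preceding slides. The slide's values correspond to the real-valued optimum of k; rounding k to an integer changes the error slightly.

```python
import math


def optimal_k(bits_per_key: float) -> float:
    """Real-valued optimum k = ln 2 * m / n."""
    return math.log(2) * bits_per_key


def fpr(bits_per_key: float, k: float) -> float:
    """Approximate false positive rate (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


for c in (8, 12, 16):
    k_real = optimal_k(c)
    # k must be an integer in practice: take the better of floor and ceil.
    k_int = min((math.floor(k_real), math.ceil(k_real)),
                key=lambda k: fpr(c, k))
    print(f"m = {c}n: real k = {k_real:.2f}, FPR = {fpr(c, k_real):.4f}; "
          f"integer k = {k_int}, FPR = {fpr(c, k_int):.4f}")
```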
![Page 14: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/14.jpg)
14
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt
FPR example, m / n = 8
![Page 15: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/15.jpg)
15
Funny tricks with BF
Given two BFs, representing sets S1 and S2, with the same # of bits and using the same hash functions, we can represent the union of those sets by taking the OR of the two bit-vectors of the original BFs.
Say you want to halve the memory use after some time, and assume the filter size is a power of 2.
Just OR the halves of the filter. When hashing for a lookup, fold each index the same way: drop its top bit (i.e. reduce it modulo the new size).
Intersection of two BFs (of the same size), i.e. AND operation, can be used to approximate
the intersection of two sets.
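A small Python sketch of the union and halving tricks. The filter is kept as a Python integer used as a bitset; the function names and the SHA-256-based double hashing are assumptions of this sketch. Halving works here because every index is computed modulo m and m is a power of 2, so an index in the halved filter is simply the old index modulo m / 2.

```python
import hashlib


def positions(key: str, k: int, m: int) -> list[int]:
    # Assumption: k positions via double hashing of one SHA-256 digest.
    d = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]


def make_filter(keys, k: int, m: int) -> int:
    bits = 0
    for key in keys:
        for p in positions(key, k, m):
            bits |= 1 << p
    return bits


def exists(bits: int, key: str, k: int, m: int) -> bool:
    return all((bits >> p) & 1 for p in positions(key, k, m))


def union(a: int, b: int) -> int:
    # Same m and same hash functions: OR the bit-vectors.
    return a | b


def halve(bits: int, m: int) -> tuple[int, int]:
    # OR the lower and upper halves; lookups then use the new size m // 2.
    half = m // 2
    return (bits & ((1 << half) - 1)) | (bits >> half), half


K, M = 3, 64
s1, s2 = {"a", "b", "c"}, {"c", "d"}
u = union(make_filter(s1, K, M), make_filter(s2, K, M))
small, m_small = halve(u, M)
print(all(exists(small, key, K, m_small) for key in s1 | s2))  # True
```

The halved filter is bit-for-bit the filter one would get by inserting the same keys into an m / 2-bit vector, so its false positive rate is that of the smaller filter.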
![Page 16: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/16.jpg)
16
Scalable BF (Almeida et al., 2007)
We can find the optimal k knowing n / m in advance. As m is fixed once, we must know (roughly) n, the number of items to add.
What if we have only a vague idea of the size of n..?
If the initial m is too large, we may halve it easily (see prev. slide). Crude, but possible.
What about m being too small?
Solution: when the filter gets ‘full’ (reaches the limit on the fill ratio), a new one is added, with a tighter max FPR, and querying translates to testing at most all of those filters…
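A simplified Python sketch of the scalable scheme from this slide. The parameter names (`initial_capacity`, `growth`, `tightening`), their default values, and the use of an item count as a stand-in for the fill-ratio limit are all assumptions of this sketch; they are not taken from the paper.

```python
import hashlib
import math


class _Stage:
    """One plain Bloom filter sized for `capacity` keys at error rate `fpr`."""

    def __init__(self, capacity: int, fpr: float):
        self.capacity = capacity
        self.fpr = fpr
        self.count = 0
        # Standard sizing: m = -n ln(fpr) / (ln 2)^2, k = (m / n) ln 2.
        self.m = math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str) -> list[int]:
        d = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def exists(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


class ScalableBloomFilter:
    def __init__(self, initial_capacity: int = 64, max_fpr: float = 0.01,
                 growth: int = 2, tightening: float = 0.5):
        self.growth = growth
        self.tightening = tightening
        # The per-stage error rates form a geometric series; starting at
        # max_fpr * (1 - tightening) keeps their sum at or below max_fpr.
        self.stages = [_Stage(initial_capacity, max_fpr * (1 - tightening))]

    def insert(self, key: str) -> None:
        last = self.stages[-1]
        if last.count >= last.capacity:  # the current filter is 'full'
            last = _Stage(last.capacity * self.growth,
                          last.fpr * self.tightening)
            self.stages.append(last)
        last.insert(key)

    def exists(self, key: str) -> bool:
        # A query may have to test every filter added so far.
        return any(stage.exists(key) for stage in self.stages)


sbf = ScalableBloomFilter()
for i in range(1000):
    sbf.insert(f"key{i}")
print(len(sbf.stages), sbf.exists("key999"))
```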
![Page 17: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/17.jpg)
17
How to approximate a set without knowing its size in advance
ε – max allowed false positive rate
Classic result: BF (and some other related structures) offers an O(n log(1/ε))-bit solution, when n is known in advance.
Pagh, Segev, Wieder (2013):
![Page 18: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/18.jpg)
18
Semi-join operation in a distributed database
| Empl | Salary | Addr | City |
|---|---|---|---|
| John | 60K | … | New York |
| George | 30K | … | New York |
| Moe | 25K | … | Topeka |
| Alice | 70K | … | Chicago |
| Raul | 30K | | Chicago |

| City | Cost of living |
|---|---|
| New York | 60K |
| Chicago | 55K |
| Topeka | 30K |

database A database B

Task: Create a table of all employees that make < 40K and live in a city where COL > 50K.

Result columns: Empl, Salary, Addr, City, COL

Semi-join: send (from A to B) just (City)

Anything better?
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt
![Page 19: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/19.jpg)
19
Bloom-join
• BF-based solution: A sends a Bloom filter instead of actual city names,
• …then B sends back its answers…
• …from which A filters out the false positives.
This is to minimize transfer (over a network) between the database sites! The CPU work is increased: B needs to filter its city list using the received filter, and A needs to filter its received list of persons.
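The Bloom-join can be sketched in Python on the example tables from the previous slide. This sketch assumes that site A holds the cost-of-living table and site B the employee table (the slides do not spell out the assignment); the filter parameters `M` and `K`, the helper names, and the omission of the Addr column are likewise choices made here.

```python
import hashlib

M, K = 256, 3  # filter size in bits and hash positions per key (illustrative)


def positions(key: str) -> list[int]:
    # Assumption: K positions via double hashing of one SHA-256 digest.
    d = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % M for i in range(K)]


def make_filter(keys) -> int:
    bits = 0
    for key in keys:
        for p in positions(key):
            bits |= 1 << p
    return bits


def maybe_contains(bits: int, key: str) -> bool:
    return all((bits >> p) & 1 for p in positions(key))


# Salaries and cost of living in thousands.
employees = [("John", 60, "New York"), ("George", 30, "New York"),
             ("Moe", 25, "Topeka"), ("Alice", 70, "Chicago"),
             ("Raul", 30, "Chicago")]
cost_of_living = {"New York": 60, "Chicago": 55, "Topeka": 30}

# Site A: send a Bloom filter of the qualifying cities instead of their names.
city_filter = make_filter(c for c, col in cost_of_living.items() if col > 50)

# Site B: keep employees earning < 40K whose city passes the filter
# (this list may contain false positives).
candidates = [e for e in employees
              if e[1] < 40 and maybe_contains(city_filter, e[2])]

# Site A: an exact check on the received persons removes the false positives.
result = [(name, salary, city, cost_of_living[city])
          for name, salary, city in candidates
          if cost_of_living.get(city, 0) > 50]
print(result)
```

Only the M-bit filter and the candidate rows cross the network; the price is the extra filtering work at both sites.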
![Page 20: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/20.jpg)
20
P2P keyword search (Reynolds & Vahdat, 2003)
• distributed inverted index on words, multi-word queries,
• Peer A holds the list of document IDs containing Word1, Peer B holds the list for Word2,
• intersection needed, but minimize communication,
• A sends B a Bloom filter of its document list,
• B sends back possible intersections to A,
• A verifies and sends the true result to the user,
• i.e. equivalent to Bloom-join
![Page 21: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/21.jpg)
21
Distributed Web caches (Fan et al., 2000)
[Figure: Web Cache 1, Web Cache 2 and Web Cache 3, and the Web]
www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt
![Page 22: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/22.jpg)
22
k-mer counting in bioinformatics
http://www.homolog.us/blogs/wp-content/uploads/2011/07/i6.png
k-mers: substrings of length k in a DNA sequence.
Counting them is important: for genome de novo assemblers (based on a de Bruijn graph), for detection of repeated sequences, to study the mechanisms of sequence duplication in genomes, etc.
![Page 23: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/23.jpg)
23
BFCounter algorithm (Melsted & Pritchard, 2011)
The considered problem variant: find all non-unique k-mers in a collection of reads, with their counts.
I.e. ignore k-mers with occ = 1 (almost certainly noise).
Input data: 2.66G 36bp Illumina reads (40-fold coverage).
(Output) statistics: 12.18G k-mers (k = 25) present in the sequencing reads, of which 9.35G are unique and 2.83G have coverage of two or greater.
![Page 24: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/24.jpg)
24
BFCounter idea (Melsted & Pritchard, 2011)
Both a Bloom filter and a plain hash table are used.
The Bloom filter B is used to store implicitly all k-mers seen so far, while only non-unique k-mers are inserted into the hash table T.
For each k-mer x, we check if x is in B. If not, we update the appropriate bits in B, to indicate that it has now been observed.
If x is in B, then we check if it is in T, and if not, we add it to T (with freq = 2).
What about false positives?
![Page 25: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/25.jpg)
25
BFCounter idea, cont’d (Melsted & Pritchard, 2011)
After the first pass through the sequence data, one can re-iterate over the sequence data to obtain exact k-mer counts in T (and then delete all unique k-mers).
Extra time for this second round: at most 50% of the total time, and it tends to be less, since hash table lookups are generally faster than insertions.
Approximate version possible: no re-iteration (i.e. coverage counts for some k-mers will be higher by 1 than their true value).
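The two passes described on these slides can be sketched in Python as below. This is a toy illustration, not the BFCounter implementation: the function names, the default filter size, the integer-as-bitset filter, and the choice to increment the count of a k-mer already in T during the first pass are assumptions of this sketch.

```python
import hashlib


def positions(kmer: str, num_hashes: int, m: int) -> list[int]:
    # Assumption: hash positions via double hashing of one SHA-256 digest.
    d = hashlib.sha256(kmer.encode("ascii")).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(num_hashes)]


def kmers(read: str, k: int):
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def bfcounter(reads, k: int, m: int = 1 << 16, num_hashes: int = 4,
              exact: bool = True) -> dict[str, int]:
    bloom = 0                      # Bloom filter B as an integer bitset
    table: dict[str, int] = {}     # hash table T: k-mer -> count

    # First pass: B remembers every k-mer seen; T gets only repeated ones.
    for read in reads:
        for x in kmers(read, k):
            pos = positions(x, num_hashes, m)
            if all((bloom >> p) & 1 for p in pos):
                # x is (probably) in B: add to T with freq = 2, or bump it.
                table[x] = table.get(x, 1) + 1
            else:
                for p in pos:
                    bloom |= 1 << p

    if exact:
        # Second pass: recount the candidates exactly, then drop the unique
        # k-mers that entered T only because of Bloom filter false positives.
        table = dict.fromkeys(table, 0)
        for read in reads:
            for x in kmers(read, k):
                if x in table:
                    table[x] += 1
        table = {x: c for x, c in table.items() if c > 1}
    return table


print(bfcounter(["ACGTACGT", "TTTTA"], k=4))  # {'ACGT': 2}
```

With `exact=False` the second pass is skipped, which corresponds to the approximate version: a false positive in B can make a count one higher than the true value.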
![Page 26: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/26.jpg)
26
Memory usage for chr21 (Melsted & Pritchard, 2011)
![Page 27: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/27.jpg)
27
BF, cache access
Negative answer: ½ chance that the first probed bit is 0, then we terminate (i.e., 1 cache miss – in rare cases 0).
On avg with a negative answer: (almost) 2 cache misses. Good (and hard to improve).
Positive answer: (almost) k misses on avg.
A problem not really addressed until quite recently…
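The "almost 2" figure can be made explicit with a short calculation. It assumes that about half of the bits are set (the optimal-k regime), that the probed bits are independent, and that each probe lands in a different cache line; probing stops at the first 0 bit.

```latex
\begin{align*}
\Pr(\text{probe } i \text{ is made}) &= \left(\tfrac{1}{2}\right)^{i-1},
  \qquad i = 1, \dots, k,\\
\mathbb{E}[\text{probes for a key not in the set}]
  &= \sum_{i=1}^{k} \left(\tfrac{1}{2}\right)^{i-1} = 2 - 2^{1-k} < 2,\\
\mathbb{E}[\text{probes for a key in the set}] &= k.
\end{align*}
```

For example, with k = 6 a key that is not in the set needs about 1.97 probes on average, while a key that is in the set always needs all 6.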
![Page 28: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/28.jpg)
28
Blocked Bloom filters (Putze et al., 2007, 2009)
The idea:
first h.f. determines the cache line (of typical size 64B = 512 bits nowadays),
the next k–1 h.f. are used to set or test bits (as usual) but only inside this one block.
I.e. (up to) one cache miss always!
Drawback: FPR slightly larger than with plain BF for the same c := m / n and k.
And the loss grows with growing c… (even if a smaller k is chosen for large c, which helps somewhat).
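A Python sketch of the blocking idea. Python gives no control over cache lines, so this only illustrates the indexing scheme; the class name, the SHA-256-based hashing and the parameter names are assumptions of this sketch.

```python
import hashlib

BLOCK_BITS = 512  # one 64-byte cache line


class BlockedBloomFilter:
    def __init__(self, num_blocks: int, k: int):
        self.num_blocks = num_blocks
        self.k = k  # total number of hash functions, including the block selector
        self.bits = bytearray(num_blocks * BLOCK_BITS // 8)

    def _positions(self, key: str) -> list[int]:
        d = hashlib.sha256(key.encode("utf-8")).digest()
        # The first hash function selects the block...
        block = int.from_bytes(d[:8], "big") % self.num_blocks
        # ...and the remaining k - 1 select bits inside that block only.
        h1 = int.from_bytes(d[8:16], "big")
        h2 = int.from_bytes(d[16:24], "big") | 1
        base = block * BLOCK_BITS
        return [base + (h1 + i * h2) % BLOCK_BITS for i in range(self.k - 1)]

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def exists(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


bbf = BlockedBloomFilter(num_blocks=8, k=5)
for i in range(100):
    bbf.insert(f"key{i}")
print(bbf.exists("key42"))
# All bits of one key fall into a single 512-bit block:
print({p // BLOCK_BITS for p in bbf._positions("key42")})
```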
![Page 29: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/29.jpg)
29
Blocked Bloom filters, cont’d (Putze et al., 2007, 2009)
I.e. if c < 20 (the top row), then the space grows usually by <20%
compared to the plain BF, with comparable FPR.
Unfortunately, for large c (rarely needed?) the loss is very significant.
The idea of blocking for BF was first suggested in (Manber & Wu, 1994), for storing the filter on disk.
![Page 30: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/30.jpg)
30
Counting Bloom filter (Fan et al., 1998, 2000)
BF with delete: use small counters instead of single bits.
BF[pos]++ at insert, BF[pos]-- at del.
E.g. 4 bits: up to count 15.
Problem: counter overflow (plain solution: freeze the given counter).
Another (obvious) problem: more space, e.g. 4 times.
4-bit counters and k < ln 2 · (m / n) ⇒ probability of overflow ≤ 1.37e–15 · m
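A Python sketch of a counting Bloom filter with 4-bit counters and the "freeze on overflow" rule. For readability each counter occupies a whole byte and is merely capped at 15; the class name and the hashing scheme are assumptions of this sketch.

```python
import hashlib


class CountingBloomFilter:
    MAX = 15  # largest value of a 4-bit counter

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        # One counter per byte here; a real 4-bit layout would pack two per byte.
        self.counters = bytearray(m)

    def _positions(self, key: str) -> list[int]:
        d = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            if self.counters[p] < self.MAX:
                self.counters[p] += 1  # a counter at MAX stays frozen

    def delete(self, key: str) -> None:
        # Only call this for keys that were really inserted.
        for p in self._positions(key):
            if 0 < self.counters[p] < self.MAX:  # frozen counters never decrease
                self.counters[p] -= 1

    def exists(self, key: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(key))


cbf = CountingBloomFilter(m=1024, k=4)
cbf.insert("a")
cbf.insert("b")
cbf.delete("a")
print(cbf.exists("b"))  # True: deleting "a" cannot hide "b"
```

A frozen counter can no longer be decremented, so the price of avoiding overflow errors is that such positions stay set forever.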
![Page 31: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/31.jpg)
31
CBF, another problem…
A deletion instruction for a false positive item(a.k.a. incorrect deletion of a false positive item)
may produce false negative items!
Problem widely discussed and analyzed in (Guo et al., 2010)
![Page 32: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/32.jpg)
32
Deletable Bloom filter (DlBF) (Rothenberg et al., 2010)
Cute observation: those of the k bits for an item x which don’t have a collision may be safely unset. If at least one of those k bits is such, then we’ve managed to delete x!
How to distinguish colliding (overlapping) set bits from non-colliding ones?
One extra bit per location? Quite costly…
![Page 33: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/33.jpg)
33
Deletable Bloom filter, cont’d (Rothenberg et al., 2010)
Compromise solution: divide the bit-vector into small areas; an area is marked as collision-free iff no collision has happened in it.
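A Python sketch of the compromise solution. It keeps the collision flags in a separate array (one flag per area) and stores one bit per byte for readability; the class name, the `regions` parameter and the hashing scheme are assumptions of this sketch, and m is taken to be a power of 2 divisible by the number of areas.

```python
import hashlib


class DeletableBloomFilter:
    def __init__(self, m: int, k: int, regions: int):
        self.m = m
        self.k = k
        self.region_size = m // regions     # assumes regions divides m
        self.bits = bytearray(m)            # one bit per byte, for readability
        self.collided = bytearray(regions)  # 1 = a collision happened in this area

    def _positions(self, key: str) -> list[int]:
        d = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            if self.bits[p]:
                # The bit was already set: this area is no longer collision-free.
                self.collided[p // self.region_size] = 1
            else:
                self.bits[p] = 1

    def exists(self, key: str) -> bool:
        return all(self.bits[p] for p in self._positions(key))

    def delete(self, key: str) -> bool:
        """Unset the key's bits in collision-free areas; True if any was unset."""
        if not self.exists(key):
            return False
        deletable = [p for p in self._positions(key)
                     if not self.collided[p // self.region_size]]
        for p in deletable:
            self.bits[p] = 0
        return bool(deletable)


dlbf = DeletableBloomFilter(m=1024, k=4, regions=32)
dlbf.insert("a")
print(dlbf.delete("a"), dlbf.exists("a"))  # True False
```

When every bit of a key lies in an area that has seen a collision, `delete` returns False and the key stays in the filter.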
![Page 34: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/34.jpg)
34
DlBF, deletability prob. as a function of filter density
![Page 35: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/35.jpg)
35
Compressed Bloom filter (Mitzenmacher, 2002)
If RAM is not an issue, but we want to transmit the filter over a network…
Mitzenmacher noticed it pays to use more space, incl. more 0 bits (i.e. the structure is more sparse), as then the bit-vector becomes compressible. (In a plain BF the numbers of 0s and 1s are approx. equal, so it is practically incompressible.)
m / n increased from 16 to 48: after compression approx. the same size, but the FPR is halved.
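The 16-versus-48 comparison can be checked numerically. The sketch below uses the binary entropy of the bit-vector as an estimate of the size after ideal compression, and assumes k = 11 for m / n = 16 and k = 3 for m / n = 48; the helper names and these k values are assumptions of this sketch.

```python
import math


def zero_fraction(bits_per_key: float, k: int) -> float:
    """Approximate fraction of 0 bits: e^(-kn/m)."""
    return math.exp(-k / bits_per_key)


def fpr(bits_per_key: float, k: int) -> float:
    return (1.0 - zero_fraction(bits_per_key, k)) ** k


def compressed_bits_per_key(bits_per_key: float, k: int) -> float:
    # Entropy bound: an ideal compressor needs about H(p) bits per filter bit.
    p = zero_fraction(bits_per_key, k)
    entropy = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    return bits_per_key * entropy


for c, k in ((16, 11), (48, 3)):
    print(f"m/n = {c}, k = {k}: "
          f"compressed size = {compressed_bits_per_key(c, k):.2f} bits/key, "
          f"FPR = {fpr(c, k):.6f}")
```

Under these assumptions both filters compress to roughly 16 bits per key, while the sparser one has about half the false positive rate.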
![Page 36: kwiecień 2013](https://reader034.fdocuments.net/reader034/viewer/2022051316/5681473b550346895db47a41/html5/thumbnails/36.jpg)
36
Conclusions
Bloom Filter is alive and kicking!
Lots of applications and lots of new variants.
In theory: constant FP rate and constant number of bits per key.
In practice: always think what FP rate you can allow. Also: what the errors mean (erroneous results or “only” increased processing time for false positives?).
Bottom line: succinct data structure for Big Data.