April 2013

Bloom filters as a space-efficient dictionary structure

Szymon [email protected]

Institute of Applied Computer Science, Lodz University of Technology

The idea of hashing

Hash table: stores keys (and possibly satellite data); the location of a key is found via a hash function, and some collision resolution method is needed.

Two common collision resolution methods: chained hashing and open addressing.

Hash structures are typically fast…

…but have a few drawbacks: they are randomized, don’t allow iteration in sorted order, and may require quite a lot of space.

Now, if we need the space to be as small as possible, what can we do?

Bloom Filter (Bloom, 1970)

Don’t store the keys themselves!

Just be able to answer whether a given key is in the structure. If the answer is “no”, it is correct.

But if the answer is “yes”, it may be wrong!

So, it’s a probabilistic data structure.

There’s a tradeoff between its space (average space per inserted key) and its “truthfulness”.

Bloom Filter features

• little space, so it saves RAM (relevant mostly for old applications);

• little space, so the structure is also fast to transfer over a network;

• little space, sometimes small enough to fit in the L2 (or even L1!) CPU cache (Li & Zhong, 2006: a BF makes a Bayesian spam filter work much faster thanks to fitting in the L2 cache);

• extremely simple / easy to implement;

• major application domains: databases, networking;

• …but also a few drawbacks / issues, hence significant interest in devising novel BF variants.

Bloom Filter idea

• keep a bit-vector of some size m, initially all zeros;

• use k independent hash functions (h.f.) for each added key (instead of one, as in a standard hash table);

• to insert a key, write 1 in the k locations pointed to by the k h.f.;

• to test for a key: if all k calculated locations contain 1, return “yes” (= the key exists), which may be wrong; if at least one of the k locations contains 0, return “no”, which is always correct.
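The steps above can be sketched in a few lines of Python. This is only a minimal illustration: the class name, the use of SHA-256 and the double-hashing trick that simulates k hash functions are assumptions made for the example, not part of the original slides.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: m bits, k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # bit-vector, initially all zeros

    def _positions(self, key: str):
        # Assumption: the k hash functions are simulated by double hashing,
        # g_i(x) = (h1(x) + i * h2(x)) mod m, using one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        # Write 1 in the k locations pointed to by the k hash functions.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def exists(self, key: str) -> bool:
        # False is always correct; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter(m=1024, k=7)
bf.insert("apple")
bf.insert("banana")
```

A key that was inserted is always reported as present; a key that was not inserted is usually, but not always, reported as absent.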

BF, basic API

insert(key)

exists(key)

No delete(key)!

And of course: no iteration over the keys added to the BF (no content listing).

http://www.cl.cam.ac.uk/research/srg/opera/meetings/attachments/2008-10-14-BloomFiltersSurvey.pdf

Early applications – spellchecking (1982, 1990), hyphenation

If a spellchecker occasionally accepts a word that is not in its dictionary, it is not a big problem.

This is exactly what happens with a BF in this application.

Quite a good application: the dictionary is static (or almost static), so once we set the BF size, we can estimate the error, which practically doesn’t change.

Application from Bloom’s paper: a program for automatic hyphenation in which 90% of words can be hyphenated using simple rules, but 10% require a dictionary lookup.

Bloom speaking…

http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=7FB6933B782FBC9C98BBCDA0EB420935?doi=10.1.1.20.2080&rep=rep1&type=pdf

BF tradeoffs

The error grows with load (i.e. with growing n / m, where n is the number of added items).

When the BF is almost empty, the error is very small, but then we also waste lots of space.

Another factor: k. How to choose it?

For any specified load (m fixed for the ‘expected’ n in a given scenario) there is an optimal value of k (one that minimizes the error).

k too small – too many collisions; k too large – the bit vector gets too ‘dense’ quickly (and then there are too many collisions, too!)

Finding the best k

We assume the hash functions choose each bit-vector slot with equal probability.

$\Pr(\text{a given bit NOT set by a given h.f.}) = 1 - 1/m$

$\Pr(\text{a given bit NOT set}) = (1 - 1/m)^{kn}$

m and kn are typically large, so $\Pr(\text{a given bit NOT set}) \approx e^{-kn/m}$

$\Pr(\text{a given bit is set}) = 1 - (1 - 1/m)^{kn}$

Consider an element not added to the BF: the filter will lie if all the corresponding k bits are set.

This is: $\Pr(\text{a given bit is set})^k = (1 - (1 - 1/m)^{kn})^k \approx (1 - e^{-kn/m})^k$

Finding the best k, cont’d

Again, $\Pr(\text{the Bloom filter lies}) \approx (1 - e^{-kn/m})^k$.

Clearly, the error grows with growing n (for fixed k, m) and decreases with growing m (for fixed k, n).

What is the optimal k?

Differentiation (= calculating a derivative) helps.

The error is minimized for $k = \ln 2 \cdot m / n \approx 0.693\, m / n$. (Then the numbers of 1s and 0s in the bit-vector are approx. equal. Of course, k must be an integer!)

And the error (false positive rate, FPR) is then $(1/2)^k \approx (0.6185)^{m/n}$.
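The differentiation step is not spelled out on the slide. One standard way to carry it out is sketched below; the auxiliary variable p (the probability that a given bit is still 0) and the function name f are introduced only for this derivation.

```latex
\begin{align*}
f(k) &= \left(1 - e^{-kn/m}\right)^{k}, \qquad
p := e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\ln p,\\
\ln f &= k \ln(1-p) = -\frac{m}{n}\,\ln p\,\ln(1-p),\\
\frac{d}{dp}\bigl[\ln p\,\ln(1-p)\bigr]
  &= \frac{\ln(1-p)}{p} - \frac{\ln p}{1-p} = 0
  \quad\text{at } p = \tfrac12,\\
k_{\mathrm{opt}} &= \frac{m}{n}\ln 2, \qquad
f(k_{\mathrm{opt}}) = \left(\tfrac12\right)^{k_{\mathrm{opt}}}
  = \left(2^{-\ln 2}\right)^{m/n} \approx (0.6185)^{m/n}.
\end{align*}
```

The product ln p · ln(1 − p) is positive and symmetric around p = 1/2, where it reaches its maximum on (0, 1), so ln f (and hence the error) is smallest there. This matches the remark that the numbers of 0s and 1s are then approximately equal.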

Minimizing the error in practice

m = 8n ⇒ error ≈ 0.0214
m = 12n ⇒ error ≈ 0.0031
m = 16n ⇒ error ≈ 0.0005

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html#SECTION00053000000000000000
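The figures for m = 8n, 12n and 16n can be checked numerically. The sketch below uses the approximate formulas from the previous slides; the function names are made up for this example. Note that with k rounded to an integer the error comes out slightly above the value for the real-valued optimum.

```python
import math


def optimal_k(bits_per_key: float) -> int:
    """Integer closest to the real-valued optimum ln(2) * m / n (at least 1)."""
    return max(1, round(math.log(2) * bits_per_key))


def fpr(bits_per_key: float, k: int) -> float:
    """Approximate false positive rate (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


def fpr_real_optimum(bits_per_key: float) -> float:
    """Error at the real-valued optimum: (1/2)^(ln 2 * m / n)."""
    return 0.5 ** (math.log(2) * bits_per_key)


for c in (8, 12, 16):
    k = optimal_k(c)
    print(f"m = {c}n: real-k error {fpr_real_optimum(c):.4f}, "
          f"k = {k} error {fpr(c, k):.4f}")
```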

www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt

FPR example, m / n = 8

Funny tricks with BF

Given two BFs representing sets S1 and S2, with the same number of bits and using the same hash functions, we can represent the union of those sets by taking the OR of the two bit-vectors of the original BFs.

Say you want to halve the memory use after some time, and assume the filter size is a power of 2. Just OR the two halves of the filter. When hashing for a lookup, drop the top bit of each computed position (i.e. reduce it modulo the new size).

The intersection of two BFs (of the same size, with the same hash functions), i.e. the AND operation, can be used to approximate the intersection of two sets.
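A small sketch of the union and halving tricks, with the bit-vector held in a Python integer. The helper names and the SHA-256 based double hashing are assumptions for this example only.

```python
import hashlib


def positions(key: str, m: int, k: int):
    """k bit positions in [0, m); double hashing over SHA-256 is an assumption."""
    d = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]


def build(keys, m: int, k: int) -> int:
    """Return the bit-vector as a Python int (bit i = slot i)."""
    bits = 0
    for key in keys:
        for pos in positions(key, m, k):
            bits |= 1 << pos
    return bits


def contains(bits: int, key: str, m: int, k: int, size: int = 0) -> bool:
    """Test a key; size < m means the filter was folded down to `size` bits."""
    size = size or m
    return all((bits >> (pos % size)) & 1 for pos in positions(key, m, k))


def halve(bits: int, size: int) -> int:
    """OR the upper half of a `size`-bit vector onto its lower half."""
    half = size // 2
    return (bits | (bits >> half)) & ((1 << half) - 1)


M, K = 1024, 4                      # M is a power of 2
s1 = build(["ann", "bob"], M, K)
s2 = build(["cid", "dan"], M, K)
union = s1 | s2                     # represents S1 union S2
folded = halve(union, M)            # same keys, half the bits, higher FPR
```

After folding, lookups pass `size=M // 2`, so every position is reduced modulo the new size and all previously inserted keys are still found.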

Scalable BF (Almeida et al., 2007)

We can find the optimal k if we know n / m in advance. As m is fixed once, we must know (roughly) n, the number of items to add. What if we have only a vague idea of n?

If the initial m is too large, we may halve it easily (see the previous slide). Crude, but possible.

What about m being too small?

Solution: when the filter gets ‘full’ (reaches the limit on the fill ratio), a new one is added, with a tighter max FPR, and a query may have to test all of those filters…
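A rough sketch of this scheme follows. The growth factor, the tightening ratio, the fill-ratio limit of 1/2 and all names are illustrative assumptions, not the parameters of Almeida et al. The first filter gets the rate fpr · (1 − tightening), so that the geometric series of per-filter rates sums to at most fpr.

```python
import hashlib
import math


class _Filter:
    """Plain Bloom filter sized for `capacity` keys at false positive rate `fpr`."""

    def __init__(self, capacity: int, fpr: float) -> None:
        self.capacity = capacity
        self.fpr = fpr
        # Standard sizing: m = -n ln(fpr) / (ln 2)^2, k = (m / n) ln 2.
        self.m = math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)
        self.ones = 0  # number of set bits, used for the fill ratio

    def _positions(self, key: str):
        d = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                self.bits[byte] |= 1 << bit
                self.ones += 1

    def exists(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    def is_full(self) -> bool:
        return 2 * self.ones >= self.m  # assumed fill-ratio limit: 1/2


class ScalableBloomFilter:
    def __init__(self, initial_capacity: int = 64, fpr: float = 0.01,
                 growth: int = 2, tightening: float = 0.5) -> None:
        self.growth = growth
        self.tightening = tightening
        self.filters = [_Filter(initial_capacity, fpr * (1 - tightening))]

    def insert(self, key: str) -> None:
        last = self.filters[-1]
        if last.is_full():
            # Add a larger filter with a tighter maximum FPR.
            last = _Filter(last.capacity * self.growth, last.fpr * self.tightening)
            self.filters.append(last)
        last.insert(key)

    def exists(self, key: str) -> bool:
        # A query may have to test every filter.
        return any(f.exists(key) for f in self.filters)
```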

How to approximate a set without knowing its size in advance (Pagh, Segev, Wieder, 2013)

ε – max allowed false positive rate

Classic result: BF (and some other related structures) offers an O(n log(1/ε))-bit solution, when n is known in advance.

Semi-join operation in a distributed database

Database A:

Empl    Salary  Addr  City
John    60K     …     New York
George  30K     …     New York
Moe     25K     …     Topeka
Alice   70K     …     Chicago
Raul    30K     …     Chicago

Database B:

City      Cost of living
New York  60K
Chicago   55K
Topeka    30K

Task: create a table (Empl, Salary, Addr, City, COL) of all employees that make < 40K and live in a city where COL > 50K.

Semi-join: send (from A to B) just the (City) column.

Anything better?

Source: www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt

Bloom-join

• BF-based solution: A sends a Bloom filter instead of the actual city names,

• …then B sends back its answers…

• …from which A filters out the false positives.

This is to minimize the transfer (over a network) between the database sites! The CPU work is increased: B needs to filter its city list using the received filter, and A needs to filter the list it receives.
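The exchange can be simulated on the toy tables from the semi-join slide. This is only a sketch: both "sites" live in one process, salaries and COL are written in thousands, and the compact BloomFilter class (with its SHA-256 double hashing) is an assumption of the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key: str):
        d = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Database A: (Empl, Salary in K, City); the Addr column is omitted here.
EMPLOYEES = [("John", 60, "New York"), ("George", 30, "New York"),
             ("Moe", 25, "Topeka"), ("Alice", 70, "Chicago"),
             ("Raul", 30, "Chicago")]
# Database B: City -> cost of living in K.
COL = {"New York": 60, "Chicago": 55, "Topeka": 30}

# Site A: summarize the cities of employees earning < 40K in a Bloom filter.
candidates = [e for e in EMPLOYEES if e[1] < 40]
city_filter = BloomFilter(m=256, k=4)
for _, _, city in candidates:
    city_filter.add(city)

# Site B: sees only the filter; returns rows with COL > 50K that pass it.
reply = {city: col for city, col in COL.items()
         if col > 50 and city in city_filter}

# Site A: joining with its own rows discards any false positives sent by B.
result = [(name, salary, city, reply[city])
          for name, salary, city in candidates if city in reply]
```

With these tables the final result contains George and Raul; any city that B sent only because of a false positive would simply find no matching row at A.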

P2P keyword search (Reynolds & Vahdat, 2003)

• distributed inverted index on words, multi-word queries,

• Peer A holds list of document IDs containing Word1, Peer B holds list for Word2,

• intersection needed, but minimize communication,

• A sends B a Bloom filter of document list,

• B sends back possible intersections to A,

• A verifies and sends the true result to user,

• i.e. equivalent to Bloom-join

Distributed Web caches (Fan et al., 2000)

[Figure: Web Cache 1, Web Cache 2, Web Cache 3; the Web]

Source: www.eecs.harvard.edu/~michaelm/TALKS/NewZealandBF.ppt

k-mer counting in bioinformatics

http://www.homolog.us/blogs/wp-content/uploads/2011/07/i6.png

k-mers: substrings of length k in a DNA sequence. Counting them is important: for de novo genome assemblers (based on a de Bruijn graph), for detection of repeated sequences, to study the mechanisms of sequence duplication in genomes, etc.

BFCounter algorithm (Melsted & Pritchard, 2011)

The considered problem variant: find all non-unique k-mers in a collection of reads, with their counts. I.e. ignore k-mers with occ = 1 (almost certainly noise).

Input data: 2.66G 36bp Illumina reads (40-fold coverage).

(Output) statistics: 12.18G k-mers (k = 25) present in the sequencing reads, of which 9.35G are unique and 2.83G have coverage of two or greater.

BFCounter idea (Melsted & Pritchard, 2011)

Both a Bloom filter and a plain hash table are used.

The Bloom filter B implicitly stores all k-mers seen so far, while only non-unique k-mers are inserted into the hash table T.

For each k-mer x, we check if x is in B. If not, we update the appropriate bits in B, to indicate that it has now been observed.

If x is in B, then we check if it is in T, and if not, we add it to T (with freq = 2).

What about false positives?

BFCounter idea, cont’d (Melsted & Pritchard, 2011)

After the first pass through the sequence data, one can re-iterate over the sequence data to obtain exact k-mer counts in T (and then delete all unique k-mers).

Extra time for this second round: at most 50% of the total time, and it tends to be less since hash table lookups are generally faster than insertions.

Approximate version possible: no re-iteration (i.e. coverage counts for some k-mers will be higher by 1 than their true value).
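The two-pass idea can be sketched as below. It is a simplified illustration, not the BFCounter implementation: the function names, the filter parameters and the compact BloomFilter class are assumptions, and the first pass stores a placeholder count of 0 in T (instead of freq = 2) because the second pass recounts exactly anyway.

```python
import hashlib
from collections import Counter


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key: str):
        d = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def kmers(read: str, k: int):
    return (read[i:i + k] for i in range(len(read) - k + 1))


def count_non_unique(reads, k: int, m: int = 1 << 16, num_hashes: int = 4):
    seen = BloomFilter(m, num_hashes)   # B: all k-mers seen so far (implicitly)
    table = {}                          # T: candidate non-unique k-mers
    # Pass 1: a k-mer already "in" B is (probably) a repeat, so it goes to T.
    for read in reads:
        for x in kmers(read, k):
            if x in seen:
                table.setdefault(x, 0)
            else:
                seen.add(x)
    # Pass 2: exact counts, but only for the candidates kept in T.
    for read in reads:
        for x in kmers(read, k):
            if x in table:
                table[x] += 1
    # Unique k-mers that entered T through false positives are dropped here.
    return {x: c for x, c in table.items() if c > 1}


reads = ["ACGTACGT", "CGTAC", "TTTTT"]
counts = count_non_unique(reads, k=3)
```

False positives of B can only add unique k-mers to T, and these are removed after the exact recount, so the final table does not depend on the filter's errors.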

Memory usage for chr21 (Melsted & Pritchard, 2011)

BF, cache access

Negative answer: ½ chance that the first probed bit is 0, then we terminate (i.e., 1 cache miss – in rare cases 0).

On avg with a negative answer: (almost) 2 cache misses. Good (and hard to improve).

Positive answer: (almost) k misses on avg.

A problem not really addressed until quite recently…
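The “(almost) 2” figure for negative answers follows from a short calculation. It assumes, as for an optimally loaded filter, that each probed bit is 1 with probability 1/2 independently of the others, that probing stops at the first 0 bit, and that every probe costs one cache miss.

```latex
\begin{align*}
E[\text{probes}]
  &= \sum_{j=0}^{k-1} \Pr(\text{the first } j \text{ probed bits are all } 1)
   = \sum_{j=0}^{k-1} 2^{-j}
   = 2 - 2^{1-k}.
\end{align*}
```

For example, for k = 6 this gives 2 − 1/32 ≈ 1.97 probes on average.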

Blocked Bloom filters (Putze et al., 2007, 2009)

The idea:

the first h.f. determines the cache line (of typical size 64B = 512 bits nowadays),

the next k–1 h.f. are used to set or test bits (as usual), but only inside this one block.

I.e. (up to) one cache miss always!

Drawback: FPR slightly larger than with a plain BF for the same c := m / n and k.

And the loss grows with growing c… (even if a smaller k is chosen for large c, which helps somewhat).
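A minimal sketch of the blocking idea. Python integers do not model real cache lines, so this only shows the addressing scheme; the class name and the way the hash functions are derived from one SHA-256 digest are assumptions of the example.

```python
import hashlib

BLOCK_BITS = 512  # one 64-byte cache line


class BlockedBloomFilter:
    def __init__(self, num_blocks: int, k: int) -> None:
        if k < 2:
            raise ValueError("need one h.f. for the block and at least one for a bit")
        self.num_blocks = num_blocks
        self.k = k
        self.blocks = [0] * num_blocks  # each block is a 512-bit integer

    def _locate(self, key: str):
        d = hashlib.sha256(key.encode("utf-8")).digest()
        # First hash function: choose the block (cache line).
        block = int.from_bytes(d[:8], "big") % self.num_blocks
        # Next k-1 hash functions: bit positions inside that block only.
        h1 = int.from_bytes(d[8:16], "big")
        h2 = int.from_bytes(d[16:24], "big") | 1
        mask = 0
        for i in range(self.k - 1):
            mask |= 1 << ((h1 + i * h2) % BLOCK_BITS)
        return block, mask

    def insert(self, key: str) -> None:
        block, mask = self._locate(key)
        self.blocks[block] |= mask

    def exists(self, key: str) -> bool:
        block, mask = self._locate(key)
        return self.blocks[block] & mask == mask


bbf = BlockedBloomFilter(num_blocks=16, k=5)
bbf.insert("apple")
```

Both insert and exists touch exactly one block, which is what bounds the number of cache misses by one.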

Blocked Bloom filters, cont’d (Putze et al., 2007, 2009)

I.e. if c < 20 (the top row of the table), then the space usually grows by < 20% compared to the plain BF, with a comparable FPR.

Unfortunately, for large c (rarely needed?) the loss is very significant.

The idea of blocking for BF was first suggested in (Manber & Wu, 1994), for storing the filter on disk.

Counting Bloom filter (Fan et al., 1998, 2000)

BF with delete: use small counters instead of single bits.

BF[pos]++ at insert, BF[pos]-- at delete.

E.g. 4 bits: up to count 15.

Problem: counter overflow (plain solution: freeze the given counter).

Another (obvious) problem: more space, e.g. 4 times as much.

4-bit counters and $k < \ln 2 \cdot (m / n)$ $\Rightarrow$ probability of overflow $\le 1.37 \times 10^{-15} \cdot m$

CBF, another problem…

Deleting a false positive item (i.e. an item that was never inserted but is reported as present) may produce false negatives!

Problem widely discussed and analyzed in (Guo et al., 2010)
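The effect is easy to reproduce with a tiny counting filter. In the sketch below the class name is made up, and the slot assignment TOY_SLOTS is hand-picked (instead of using real hash functions) so that "c" is a false positive; saturated counters are frozen, as on the previous slide.

```python
class CountingBloomFilter:
    def __init__(self, m: int, positions, max_count: int = 15) -> None:
        self.counters = [0] * m
        self.positions = positions      # callable: key -> list of slots
        self.max_count = max_count      # 15 for 4-bit counters

    def insert(self, key: str) -> None:
        for pos in self.positions(key):
            if self.counters[pos] < self.max_count:  # saturated: stay frozen
                self.counters[pos] += 1

    def exists(self, key: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self.positions(key))

    def delete(self, key: str) -> bool:
        if not self.exists(key):
            return False
        for pos in self.positions(key):
            if 0 < self.counters[pos] < self.max_count:
                self.counters[pos] -= 1
        return True


# Hand-picked slots (an assumption for the demo): "c" overlaps "a" and "b".
TOY_SLOTS = {"a": [0, 1], "b": [2, 3], "c": [1, 2]}

cbf = CountingBloomFilter(4, TOY_SLOTS.__getitem__)
cbf.insert("a")
cbf.insert("b")
false_positive = cbf.exists("c")      # "c" was never inserted
cbf.delete("c")                       # incorrect deletion of a false positive
false_negative = not cbf.exists("a")  # "a" was inserted, but is now missing
```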

Deletable Bloom filter (DlBF) (Rothenberg et al., 2010)

Cute observation: those of the k bits for an item x which don’t have a collision may be safely unset. If at least one of those k bits is such, then we’ve managed to delete x!

How to distinguish colliding (overlapping) set bits from non-colliding ones?

One extra bit per location? Quite costly…

Deletable Bloom filter, cont’d (Rothenberg et al., 2010)

Compromise solution: divide the bit-vector into small areas; an area is marked as collision-free iff no collision has happened in it.
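A toy sketch of the area-based scheme. For simplicity the collision flags are kept in a separate list rather than inside the filter's own bit budget, and the slot assignment TOY_SLOTS is hand-picked; both, like the class name, are assumptions of this example.

```python
class DeletableBloomFilter:
    def __init__(self, m: int, area_size: int, positions) -> None:
        self.bits = [0] * m
        self.area_size = area_size
        # One flag per area: True once any collision has happened in it.
        self.collided = [False] * ((m + area_size - 1) // area_size)
        self.positions = positions  # callable: key -> list of slots

    def insert(self, key: str) -> None:
        for pos in self.positions(key):
            if self.bits[pos]:
                self.collided[pos // self.area_size] = True
            else:
                self.bits[pos] = 1

    def exists(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self.positions(key))

    def delete(self, key: str) -> bool:
        """Unset the key's bits lying in collision-free areas; True if any was unset."""
        if not self.exists(key):
            return False
        deleted = False
        for pos in self.positions(key):
            if not self.collided[pos // self.area_size]:
                self.bits[pos] = 0
                deleted = True
        return deleted


# Hand-picked slots: "a" and "b" collide at slot 4, which lies in area 1.
TOY_SLOTS = {"a": [0, 4], "b": [4, 5]}

dlbf = DeletableBloomFilter(m=8, area_size=4, positions=TOY_SLOTS.__getitem__)
dlbf.insert("a")
dlbf.insert("b")
a_deleted = dlbf.delete("a")   # slot 0 is in a collision-free area
b_deleted = dlbf.delete("b")   # both slots of "b" are in the collided area
```

Here "a" can be deleted because one of its bits lies in a collision-free area, while "b" is not deletable and stays in the filter.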

DlBF, deletability prob. as a function of filter density

Compressed Bloom filter (Mitzenmacher, 2002)

If RAM is not an issue, but we want to transmit the filter over a network…

Mitzenmacher noticed it pays to use more space, incl. more 0 bits (i.e. the structure is more sparse), as then the bit-vector becomes compressible. (In a plain BF the numbers of 0s and 1s are approx. equal ⇒ practically incompressible.)

m / n increased from 16 to 48: after compression approx. the same size, but the FPR is halved.
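The 16 versus 48 comparison can be checked with a back-of-the-envelope calculation. The sketch assumes an ideal compressor that reaches the entropy of the bit-vector; the choices k = 11 (near-optimal for m / n = 16) and k = 3 (for m / n = 48) are illustrative assumptions, not values given on the slide.

```python
import math


def fpr(bits_per_key: float, k: int) -> float:
    """Approximate false positive rate (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


def compressed_bits_per_key(bits_per_key: float, k: int) -> float:
    """Entropy bound on the compressed size: (m / n) * H(p), p = fraction of 0 bits."""
    p = math.exp(-k / bits_per_key)
    entropy = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    return bits_per_key * entropy


plain_fpr = fpr(16, 11)                        # plain BF, 16 bits per key
sparse_fpr = fpr(48, 3)                        # sparse BF, 48 bits per key
sparse_size = compressed_bits_per_key(48, 3)   # bits per key after compression

print(f"plain:  16 bits/key, FPR {plain_fpr:.6f}")
print(f"sparse: {sparse_size:.2f} bits/key compressed, FPR {sparse_fpr:.6f}")
```

Under these assumptions the sparse filter compresses to slightly under 16 bits per key while its false positive rate is roughly half that of the plain filter.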

Conclusions

Bloom Filter is alive and kicking!

Lots of applications and lots of new variants.

In theory: constant FP rate and constant number of bits per key.

In practice: always think about what FP rate you can allow. Also: what do the errors mean (erroneous results, or “only” increased processing time for false positives)?

Bottom line: succinct data structure for Big Data.
