Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x...

11
1 Bloom Filters n Given a set S = {x 1 ,x 2 ,x 3 ,…,x n } on a universe U, want to answer queries of the form: Is y ÎS ? n Bloom filter provides an answer in n Constant time (to hash). n Very small amount of space. n But with small probability of a false positive n Particularly useful when the answer is usually NO n When care a lot about space.

Transcript of Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x...

Page 1: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

1

Bloom Filters

n Given a set S = {x1,x2,x3,…,xn} on a universe U, want to answer queries of the form:

Is yÎS ?

n Bloom filter provides an answer inn �Constant� time (to hash).n Very small amount of space.n But with small probability of a false positive

n Particularly useful when the answer is usually NO

n When care a lot about space.

Page 2: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

2

Bloom Filters

Start with an m bit array, filled with 0s.

Insertion: Hash each item xj in S k times. If hi (xj) = a, set B[a] = 1.

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0B

0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B

n items m = cn bits k hash functions

h U 0 im I

i hyxiI

C E 8hi ha ghk

Page 3: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

3

Bloom Filters

Lookup: To check if y is in S, check B at hi(y). All k values must be 1.

0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B

0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B

Possible to have false positive; all k values are 1, but y is not in S.

n items m = cn bits k hash functions

h.ly holy hsls

E tif any 0 say gets

o w YES

TIFFS

Page 4: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

4

Estimate False Positive Probability

n items m = cn bits k hash functions

Purposesof analysis sizeof table c elts insets storing

Assume hash functions completely random is yesInserted elements S x xa xn into table falseposture

is when we

Lookuply yC say thatPrffalseYpositive Prf h ly hab

i hidy aneltismy Kh Kh Kmt in the set

Pr specific bit in table is 0 a I th e when infact itmelts Lm X

KtdartsI Lm ke l X was never

kn randomdartsveryclose

insertedto 0

Let 13 denote the fraction of bits Expected dbthat are set to 0 Ed prespecificsbot that are 0 m Proffer

Pr false positive ftp.ykzgp ekfnykisbts s

Page 5: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

5

Estimate False Positive Probability

n Pr(specific bit of filter is 0) isp� ≡ (1-1/m)kn ≈ e-kn/m ≡ p (p�≤p)

n If β is fraction of 0 bits in the filter then false positive probability for a new element is

(1- β)k ≈ (1- p�)k ≈ (1- p�)k= (1-e-kn/m)k

n Find optimal at k = (ln 2) m/n by calculus.n So optimal false positive prob is about (0.6185)m/n

n items m = cn bits k hash functions

And k that minimizes this fin

OF

Page 6: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

6

Graph of (1-e-k/c)k for c=8

00.010.020.030.040.050.060.070.080.090.1

0 1 2 3 4 5 6 7 8 9 10

Hash functions

Fals

e po

sitiv

e ra

te m/n = 8Opt k = 8 ln 2 = 5.45...

n items m = cn bits k hash functions

Page 7: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

7

Applications

n Original (40 years ago) use:n Spellcheckers (false positive = misspelled

word)n Forbidden passwords (false positive = no

biggie)

Page 8: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

8

Applications – More modern

n In databases:n Join: combine two tables with a common

domain into a single tablen Semi-join: a join in distributed databases in

which only the joining attribute from one site is transmitted to other and used for selection. The selected records sent back

n Bloom-join: a semi-join where we send only a Bloom filter of the joining attribute

Page 9: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

9

Applications- modern

n Monitor all traffic going through a router, checking for signatures of bad behavior (e.g. strings associated to worms, viruses)

n Must be fast and simple - operate at hardware/line speed.

n Use a Bloom filter for signatures.n On positive: send off to analyzer for

actionn False positive = extra work on slow path.

Page 10: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

10

Handling Deletions

n Bloom filters can handle insertions, but not deletions.

n If deleting xi means resetting 1�s to 0�s, then deleting xi will �delete� xj.

0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B

xi xj

Page 11: Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x 2,x 3,…,x n}on a universe U, want to answer queries of the form: IsyÎS? nBloom

11

Bloom filter numerous variations and applications

n See papers on website.

n ``The Bloom Filter principle: wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.’’