Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x...
Transcript of Bloom Filters - University of Washington · 2019-01-26 · 1 Bloom Filters nGiven a set S= {x 1,x...
1
Bloom Filters
n Given a set S = {x1,x2,x3,…,xn} on a universe U, want to answer queries of the form:
Is yÎS ?
n Bloom filter provides an answer inn �Constant� time (to hash).n Very small amount of space.n But with small probability of a false positive
n Particularly useful when the answer is usually NO
n When care a lot about space.
2
Bloom Filters
Start with an m bit array, filled with 0s.
Insertion: Hash each item xj in S k times. If hi (xj) = a, set B[a] = 1.
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0B
0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B
n items m = cn bits k hash functions
h U 0 im I
i hyxiI
C E 8hi ha ghk
3
Bloom Filters
Lookup: To check if y is in S, check B at hi(y). All k values must be 1.
0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B
0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B
Possible to have false positive; all k values are 1, but y is not in S.
n items m = cn bits k hash functions
h.ly holy hsls
E tif any 0 say gets
o w YES
TIFFS
4
Estimate False Positive Probability
n items m = cn bits k hash functions
Purposesof analysis sizeof table c elts insets storing
Assume hash functions completely random is yesInserted elements S x xa xn into table falseposture
is when we
Lookuply yC say thatPrffalseYpositive Prf h ly hab
i hidy aneltismy Kh Kh Kmt in the set
Pr specific bit in table is 0 a I th e when infact itmelts Lm X
KtdartsI Lm ke l X was never
kn randomdartsveryclose
insertedto 0
Let 13 denote the fraction of bits Expected dbthat are set to 0 Ed prespecificsbot that are 0 m Proffer
Pr false positive ftp.ykzgp ekfnykisbts s
5
Estimate False Positive Probability
n Pr(specific bit of filter is 0) isp� ≡ (1-1/m)kn ≈ e-kn/m ≡ p (p�≤p)
n If β is fraction of 0 bits in the filter then false positive probability for a new element is
(1- β)k ≈ (1- p�)k ≈ (1- p�)k= (1-e-kn/m)k
n Find optimal at k = (ln 2) m/n by calculus.n So optimal false positive prob is about (0.6185)m/n
n items m = cn bits k hash functions
And k that minimizes this fin
OF
6
Graph of (1-e-k/c)k for c=8
00.010.020.030.040.050.060.070.080.090.1
0 1 2 3 4 5 6 7 8 9 10
Hash functions
Fals
e po
sitiv
e ra
te m/n = 8Opt k = 8 ln 2 = 5.45...
n items m = cn bits k hash functions
7
Applications
n Original (40 years ago) use:n Spellcheckers (false positive = misspelled
word)n Forbidden passwords (false positive = no
biggie)
8
Applications – More modern
n In databases:n Join: combine two tables with a common
domain into a single tablen Semi-join: a join in distributed databases in
which only the joining attribute from one site is transmitted to other and used for selection. The selected records sent back
n Bloom-join: a semi-join where we send only a Bloom filter of the joining attribute
9
Applications- modern
n Monitor all traffic going through a router, checking for signatures of bad behavior (e.g. strings associated to worms, viruses)
n Must be fast and simple - operate at hardware/line speed.
n Use a Bloom filter for signatures.n On positive: send off to analyzer for
actionn False positive = extra work on slow path.
10
Handling Deletions
n Bloom filters can handle insertions, but not deletions.
n If deleting xi means resetting 1�s to 0�s, then deleting xi will �delete� xj.
0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B
xi xj
11
Bloom filter numerous variations and applications
n See papers on website.
n ``The Bloom Filter principle: wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.’’