New Streaming Algorithms for Fast Detection of Superspreaders
Shobha Venkataraman* ([email protected])
Joint work with: Dawn Song*, Phillip Gibbons¶, Avrim Blum*
*Carnegie Mellon University, ¶Intel Research Pittsburgh
Slide 2: Superspreaders
A k-superspreader is a host that contacts at least k distinct destinations in a short time period.
Goal: given a stream of packets, find the k-superspreaders.
Why care about superspreaders?
- They are indicators of possible network attacks: e.g., a compromised host in worm propagation contacts many distinct destinations. The Slammer worm contacted up to 26,000 hosts per second!
- Automatic identification is useful in logging and throttling attack traffic.
Slide 3: Heavy Distinct-Hitters
General problem: given a stream of (x, y) pairs, find all x paired with at least k distinct y. This is the heavy distinct-hitter problem.
Applications:
- Find destinations contacted by many distinct sources
- Find ports contacted by many distinct sources/destinations, or with high ICMP traffic
- Find potential spammers without per-source information
- Find nodes that contact many other nodes in peer-to-peer networks
Slide 4: Challenges
- Need very efficient algorithms for high-speed links
- Superspreaders are often a tiny fraction of network traffic: e.g., in our traces, < 0.004% of total traffic
- Need algorithms in the streaming model: only one pass over the data is allowed, using much less storage than the data itself
- Distributed monitoring is desirable, but must require little communication between monitors
Slide 5: Strawman Approaches
- Approach 1: track every source with a list of the distinct destinations it contacted (e.g., Snort). Too much storage!
- Approach 2: track every source with a distinct counter per source [Estan et al. 03]. Also too much storage!
- Approach 3: use the multiple-cache data structure of Weaver et al. 04. Designed for a different problem; it does not scale for finding superspreaders.
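To make the storage problem of Approach 1 concrete, here is a minimal Python sketch (hypothetical, not from the talk) of exact per-source tracking; its memory grows with the total number of distinct flows, which is what the streaming algorithms below avoid:

```python
from collections import defaultdict

def exact_superspreaders(stream, k):
    """Approach 1: keep the full set of distinct destinations per source.
    Exact, but storage is proportional to the number of distinct flows."""
    dests = defaultdict(set)  # src -> set of distinct destinations seen
    for src, dst in stream:
        dests[src].add(dst)
    return {src for src, ds in dests.items() if len(ds) >= k}

# Usage: s1 contacts 3 distinct dests, s2 contacts 1.
stream = [("s1", "d1"), ("s1", "d2"), ("s1", "d3"), ("s2", "d2"), ("s1", "d2")]
print(exact_superspreaders(stream, k=3))  # {'s1'}
```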
Slide 6: Outline
Introduction, Problem Definition, Algorithms, Extensions, Experiments, Conclusions
Slide 7: Formal Problem Definition
Given k, b > 1, and probability of failure δ:
- any k-superspreader is output with probability at least 1 - δ
- any source that contacts < k/b distinct destinations is output with probability < δ
- sources in between may or may not be output
Thus, we expect to identify a source as a superspreader after it contacts more than k/b and fewer than k distinct destinations.
Slide 8: Example
Example: k = 1000, b = 2, δ = 0.05. Then:
- Pr[src output | contacts ≥ 1000 dests] > 0.95
- Pr[src output | contacts < 500 dests] < 0.05
We expect a gap between normal behaviour and superspreaders.
[Figure: number of distinct destinations contacted per source, e.g. s1 with d1 = 1000, s2 with d2 = 750, s3 with d3 = 500]
Slide 9: Theoretical Guarantees
Given k, b > 1, and δ, we can set parameters so that, for N distinct flows:
- Pr[k-superspreader output] > 1 - δ
- Pr[false positive output] < δ
- Expected memory (fixed b): O(N/k · log 1/δ). Note: as many as N/k k-superspreaders are possible, so this is within O(log 1/δ) of the lower bound.
- Per-packet processing time: constant. At most 2 hashes and 2 memory accesses per packet; most packets need 1 hash, or 1 hash and 1 memory access.
Slide 10: Outline
Introduction, Problem Definition, Algorithms (One-Level Filtering Algorithm, Two-Level Filtering Algorithm), Extensions, Experiments, Conclusions
Slide 11: One-Level Filtering Algorithm
For each packet (s, d):
- Step 1: Compute h(s, d)
- Step 2: If h(s, d) > c, discard the packet
- Step 3: If h(s, d) < c, insert (s, d) into a hash table keyed by source (each source s_i keeps its sampled destinations d_i,1, d_i,2, ...)
- Step 4: Report all sources with more than r destinations in the hash table
(We're effectively sampling distinct flows at rate c.)
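The four steps above can be sketched in Python as follows. This is a minimal illustration, not the talk's implementation: the SHA1-based hash and the choice of c and r stand in for the paper's parameter settings.

```python
import hashlib
from collections import defaultdict

def flow_hash(src, dst):
    """Hash the flow (src, dst) to a value in [0, 1)."""
    digest = hashlib.sha1(f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def one_level_filter(stream, c, r):
    """One-level filtering: sample distinct flows at rate c, then
    report sources with more than r sampled destinations."""
    table = defaultdict(set)  # src -> sampled distinct destinations
    for src, dst in stream:
        # Steps 1-3: keep the flow iff its hash falls below c;
        # otherwise the packet is discarded.
        if flow_hash(src, dst) < c:
            table[src].add(dst)
    # Step 4: report sources with more than r sampled destinations.
    return {src for src, ds in table.items() if len(ds) > r}
```

Because the sampling decision hashes the flow rather than the packet, repeated packets of the same (src, dst) pair always make the same decision, which is why this samples distinct flows (not packets) at rate c.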
Slide 12: Example: One-Level Filtering
Example: k = 1000, b = 2, δ = 0.05. We compute c = 0.052, r = 39. In expectation, 94.8% of packets require one hash computation; the remaining 5.2% require more processing and storage.
Slide 13: Two-Level Filtering: Intuition (I)
- One-level filtering stores many sources that contact few destinations: the threshold sampling rate is needed to distinguish sources contacting k destinations from those contacting k/b, yet every source is sampled at that rate, even though in the expected distribution most sources contact few destinations.
- Use two-level filtering to reduce memory usage on such traffic distributions:
  - Coarse rate: decide whether to sample a source at the fine rate
  - Fine rate: distinguish between sources sending to k and k/b destinations
Slide 14: Two-Level Filtering: Intuition (II)
Example: k = 1000, b = 2. Suppose the coarse rate is 1/100.
- We expect a 1000-superspreader to show up once within its first 100 destinations, and w.h.p. within, say, its first 200.
- Use the remaining 800 destinations to distinguish it, w.h.p., from a source that sends to only 500 destinations.
- Only about 1% of the sources that send to few destinations are stored.
Similar worst-case guarantees, but significantly better under some natural distributions.
Slide 15: Two-Level Filtering Algorithm
Data structures: a coarse filter C (storing sampled sources s'1,1, s'1,2, ...) and m fine-filter hash tables F1, ..., Fm (each Fi storing sources si,1, si,2, ...).
For each packet (s, d):
- Step 1 (fine sampling): compute h1(s, d); if h1(s, d) < r1 and s is present in C, use h1(s, d) to pick an index i (splitting the range [0, r1) into m parts of size r1/m) and insert s into hash table Fi.
- Step 2 (coarse sampling): compute h2(s, d); if h2(s, d) < r2, store s in C.
Return all sources that appear in at least r of the hash tables Fi.
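The two-level scheme above can be sketched in Python as follows. This is a hypothetical illustration: the seeded SHA1 hashes, the bucket-index computation, and the parameter values used in the usage note are simplified stand-ins for the talk's settings.

```python
import hashlib
from collections import defaultdict

def h(seed, src, dst):
    """Hash the flow (src, dst) to [0, 1), with a per-level seed."""
    digest = hashlib.sha1(f"{seed}|{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def two_level_filter(stream, r1, r2, m, r):
    coarse = set()                       # coarse filter C: sampled sources
    fine = [set() for _ in range(m)]     # fine filters F_1..F_m
    for src, dst in stream:
        v1 = h("fine", src, dst)
        # Step 1: fine sampling, only for sources already in C.
        if v1 < r1 and src in coarse:
            i = int(v1 / (r1 / m))       # split [0, r1) into m equal parts
            fine[min(i, m - 1)].add(src)
        # Step 2: coarse sampling decides who gets fine-sampled later.
        if h("coarse", src, dst) < r2:
            coarse.add(src)
    # Report sources appearing in at least r of the fine filters.
    counts = defaultdict(int)
    for table in fine:
        for src in table:
            counts[src] += 1
    return {src for src, c in counts.items() if c >= r}
```

A superspreader enters the coarse filter early (after roughly 1/r2 destinations) and its many remaining flows then populate many distinct fine filters, while a source with few destinations rarely enters C at all, which is where the memory savings come from.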
Slide 16: Example: Two-Level Filtering
Example: k = 1000, b = 2, δ = 0.05. We compute r1 = 0.15, r2 = 0.006, m = 100.
- Case 1: sources that contact 1 distinct destination each: 85% of flows are discarded; 0.6% are entered into the coarse filter; 15% are examined only if the source is present in the coarse filter.
- Case 2: sources that are superspreaders: 85% of flows are discarded per superspreader; 15% of flows require entry into the fine filter.
Slide 17: Outline
Introduction, Problem Definition, Algorithms, Extensions, Experiments, Conclusions
Slide 18: Extension: Deletions in Stream
Goal: find superspreaders when deletions are allowed in the stream.
Application: find sources with many distinct connection failures. When a connection is initiated, the (src, dst) pair appears in the stream; when a response is received, that (src, dst) pair is deleted.
Example stream: (s1,d1,+1), (s1,d2,+1), (s1,d3,+1), (s2,d2,+1), (s1,d4,+1), (s2,d2,-1), (s1,d2,-1), ...
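One way to see why hash-based flow sampling composes with deletions: the sampling decision is a deterministic function of the flow, so a deletion touches exactly the state its insertion created. The sketch below is a hypothetical adaptation of the one-level filter, not the paper's deletion algorithm, and it assumes insertions and deletions of a flow alternate, as in the connection-failure application.

```python
import hashlib
from collections import defaultdict

def flow_hash(src, dst):
    """Hash the flow (src, dst) to a value in [0, 1)."""
    digest = hashlib.sha1(f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def filter_with_deletions(stream, c, r):
    """Each stream element is (src, dst, sign) with sign = +1 or -1.
    A deleted flow hashes to the same value as its insertion, so it
    is removed iff it was sampled in the first place."""
    table = defaultdict(set)
    for src, dst, sign in stream:
        if flow_hash(src, dst) < c:
            if sign > 0:
                table[src].add(dst)
            else:
                table[src].discard(dst)
    return {src for src, ds in table.items() if len(ds) > r}
```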
Slide 19: Extension: Sliding Windows
Goal: find superspreaders over sliding windows of packets, e.g. in only the most recent t packets, or in the last 1 hour.
Example stream (the window slides forward as new packets arrive): ... (s1,d1), (s1,d2), (s1,d3), (s2,d2), (s2,d4), (s1,d5), (s3,d4), ...
Slide 20: Extension: Distributed Monitoring
Given: a set of monitoring points, each of which sees a stream of packets.
Goal: find superspreaders in the union of the streams.
The one-level filtering algorithm needs very little communication.
Example: monitor A sees (s1,d1), (s1,d2), (s2,d3), (s1,d1), ...; monitor B sees (s1,d1), (s1,d3), (s2,d4), (s2,d5), ...; monitor C sees (s1,d1), (s2,d2), (s3,d3), (s4,d4), ...
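One way to see why one-level filtering distributes cheaply (a hypothetical sketch, not necessarily the talk's exact protocol): if every monitor uses the same hash function and threshold, each forwards only its sampled flows, and duplicates across monitors collapse in the coordinator's union.

```python
import hashlib
from collections import defaultdict

def flow_hash(src, dst):
    """Shared hash: the same flow hashes identically at every monitor."""
    digest = hashlib.sha1(f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def monitor_sample(stream, c):
    """Run locally at each monitoring point: keep a flow iff its
    hash falls below c. Only these sampled flows are communicated."""
    return {(s, d) for s, d in stream if flow_hash(s, d) < c}

def merge_and_report(sampled_sets, r):
    """Coordinator: union the monitors' samples (a flow seen at several
    monitors appears once) and report sources with > r sampled dests."""
    table = defaultdict(set)
    for sampled in sampled_sets:
        for s, d in sampled:
            table[s].add(d)
    return {s for s, ds in table.items() if len(ds) > r}
```

Communication is proportional to the number of sampled flows, a c-fraction of distinct flows, rather than to the full traffic volume.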
Slide 21: Outline
Introduction, Problem Definition, Algorithms, Extensions, Experiments, Conclusions
Slide 22: Experimental Setup
- Experiments run on a Pentium IV, 1.8 GHz, with 1 GB RAM
- Traces taken from the NLANR archive, ranging from 2.8 million packets (65 sec) to 4.5 million packets (4.5 min)
- Added 100 sources that contact k distinct destinations and 100 sources that contact k/b distinct destinations
- Used a randomly generated SHA1-based hash function for each run
- For all experiments, δ = 0.05
Slide 23: Experimental Results (I): Accuracy
- Both algorithms have the desired accuracy
- The false positive rate is much less than 0.05, since most (eligible) sources send to many fewer than k/b destinations
- The observed false positives come only from sources close to the boundary
Slide 24: Experimental Results (II)
Legend: 1LF = one-level filtering; 2LF-T = two-level filtering, hash-table implementation; 2LF-B = two-level filtering, Bloom-filter implementation.
[Charts: memory usage for k = 200 with b = 2, b = 5, b = 10]
As expected, when b increases, the sampling rates decrease, and total memory usage decreases. 2LF-B has the least memory usage.
Slide 25: Experimental Results (III)
Legend: 1LF = one-level filtering; 2LF-T = two-level filtering, hash-table implementation; 2LF-B = two-level filtering, Bloom-filter implementation.
[Charts: memory usage for b = 2 with k = 500, k = 1000, k = 5000]
As expected, when k increases, the sampling rates decrease, and total memory usage decreases. 2LF-B has the least memory usage.
Slide 26: Related Work
Networking: related problems include finding heavy-hitters [Estan-Varghese 02], multidimensional traffic clusters [Estan+ 03], the distribution of flow lengths [Duffield+ 03], and large changes in network traffic [Cormode-Muthukrishnan 03].
Streaming algorithms: most closely related is counting the number of distinct values in a stream [Flajolet-Martin 85, Alon-Matias-Szegedy 99, Cohen 97, Gibbons-Tirthapura 02, Bar-Yossef+ 02, Cormode+ 02].
Slide 27: Summary
- Defined the superspreader (and heavy distinct-hitter) problem
- One-pass streaming algorithms with theoretical guarantees on accuracy and overhead
- Experimental analysis validates the theoretical results
- Extensions to a model with deletions, sliding windows, and distributed monitoring
- The novel two-level filtering scheme may be of independent interest
Slide 28: Thank you!
Slide 29: Motivation (II)
Superspreaders are different from heavy-hitters! We care about many distinct destinations:
- A few large file transfers make a heavy-hitter, but not a superspreader
- Superspreaders are not necessarily heavy-hitters: in the test traces, superspreaders were < 0.004% of the total traffic analyzed
Slide 30: Theoretical Guarantees
Given k, b > 1, and δ, we can set parameters for both algorithms so that:
- Pr[k-superspreader output] > 1 - δ
- Pr[false positive output] < δ
- Expected memory (fixed b): O(N/k · log 1/δ)
- Per-packet processing time: constant. At most 2 hashes and 2 memory accesses per packet; most packets need 1 hash, or 1 hash and 1 memory access.
Optimization: implement two-level filtering with Bloom filters, which decreases memory usage but increases computational cost.