New Streaming Algorithms for Fast Detection of Superspreaders
Shobha Venkataraman* ([email protected])
Joint work with: Dawn Song*, Phillip Gibbons¶, Avrim Blum*
*Carnegie Mellon University, ¶Intel Research Pittsburgh
Slide 2: Superspreaders
A k-superspreader is a host that contacts at least k distinct destinations in a short time period.
Goal: given a stream of packets, find the k-superspreaders.
Why care about superspreaders?
- They are indicators of possible network attacks: e.g., a compromised host in worm propagation contacts many distinct destinations. The Slammer worm contacted up to 26,000 hosts per second!
- Automatic identification is useful in logging and throttling attack traffic.
Slide 3: Heavy Distinct-Hitters
General problem: given a stream of (x, y) pairs, find all x paired with at least k distinct y. This is the heavy distinct-hitter problem.
Applications:
- Find destinations contacted by many distinct sources
- Find ports contacted by many distinct sources/destinations, or with high ICMP traffic
- Find potential spammers without per-source information
- Find nodes that contact many other nodes in peer-to-peer networks
Slide 4: Challenges
- Need very efficient algorithms for high-speed links
- Superspreaders are often a tiny fraction of network traffic: e.g., in our traces, < 0.004% of total traffic
- Need algorithms in the streaming model: only one pass over the data is allowed, using much less storage than the data itself
- Distributed monitoring is desirable, but must require little communication between monitors
Slide 5: Strawman Approaches
- Approach 1: track every source with a list of the distinct destinations it contacted (e.g., Snort). Too much storage!
- Approach 2: track every source with a distinct counter per source [Estan et al. 03]. Also too much storage!
- Approach 3: use the multiple-cache data structure of Weaver et al. 04. Designed for a different problem; it does not scale for finding superspreaders.
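To make the storage problem of Approach 1 concrete, here is a minimal Python sketch (hypothetical, not from the talk) of exact per-source tracking; its memory grows with the total number of distinct flows, which is what the streaming algorithms below avoid:

```python
from collections import defaultdict

def exact_superspreaders(stream, k):
    """Approach 1: keep the full set of distinct destinations per source.
    Exact, but storage is proportional to the number of distinct flows."""
    dests = defaultdict(set)  # src -> set of distinct destinations seen
    for src, dst in stream:
        dests[src].add(dst)
    return {src for src, ds in dests.items() if len(ds) >= k}

# Usage: s1 contacts 3 distinct dests, s2 contacts 1.
stream = [("s1", "d1"), ("s1", "d2"), ("s1", "d3"), ("s2", "d2"), ("s1", "d2")]
print(exact_superspreaders(stream, k=3))  # {'s1'}
```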
Slide 6: Outline
Introduction, Problem Definition, Algorithms, Extensions, Experiments, Conclusions
Slide 7: Formal Problem Definition
Given k, b > 1, and probability of failure δ:
- any k-superspreader is output with probability at least 1 - δ
- any source that contacts < k/b distinct destinations is output with probability < δ
- sources in between may or may not be output
Thus, we expect to identify a source as a superspreader after it contacts more than k/b and fewer than k distinct destinations.
Slide 8: Example
Example: k = 1000, b = 2, δ = 0.05. Then:
- Pr[src output | contacts ≥ 1000 dests] > 0.95
- Pr[src output | contacts < 500 dests] < 0.05
We expect a gap between normal behaviour and superspreaders.
[Figure: number of distinct destinations contacted per source, e.g. s1 with d1 = 1000, s2 with d2 = 750, s3 with d3 = 500]
Slide 9: Theoretical Guarantees
Given k, b > 1, and δ, we can set parameters so that, for N distinct flows:
- Pr[k-superspreader output] > 1 - δ
- Pr[false positive output] < δ
- Expected memory (fixed b): O(N/k · log 1/δ). Note: as many as N/k k-superspreaders are possible, so this is within O(log 1/δ) of the lower bound.
- Per-packet processing time: constant. At most 2 hashes and 2 memory accesses per packet; most packets need 1 hash, or 1 hash and 1 memory access.
Slide 10: Outline
Introduction, Problem Definition, Algorithms (One-Level Filtering Algorithm, Two-Level Filtering Algorithm), Extensions, Experiments, Conclusions
Slide 11: One-Level Filtering Algorithm
For each packet (s, d):
- Step 1: Compute h(s, d)
- Step 2: If h(s, d) > c, discard the packet
- Step 3: If h(s, d) < c, insert (s, d) into a hash table keyed by source (each source s_i keeps its sampled destinations d_i,1, d_i,2, ...)
- Step 4: Report all sources with more than r destinations in the hash table
(We're effectively sampling distinct flows at rate c.)
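The four steps above can be sketched in Python as follows. This is a minimal illustration, not the talk's implementation: the SHA1-based hash and the choice of c and r stand in for the paper's parameter settings.

```python
import hashlib
from collections import defaultdict

def flow_hash(src, dst):
    """Hash the flow (src, dst) to a value in [0, 1)."""
    digest = hashlib.sha1(f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def one_level_filter(stream, c, r):
    """One-level filtering: sample distinct flows at rate c, then
    report sources with more than r sampled destinations."""
    table = defaultdict(set)  # src -> sampled distinct destinations
    for src, dst in stream:
        # Steps 1-3: keep the flow iff its hash falls below c;
        # otherwise the packet is discarded.
        if flow_hash(src, dst) < c:
            table[src].add(dst)
    # Step 4: report sources with more than r sampled destinations.
    return {src for src, ds in table.items() if len(ds) > r}
```

Because the sampling decision hashes the flow rather than the packet, repeated packets of the same (src, dst) pair always make the same decision, which is why this samples distinct flows (not packets) at rate c.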
Slide 12: Example: One-Level Filtering
Example: k = 1000, b = 2, δ = 0.05. We compute c = 0.052, r = 39. In expectation, 94.8% of packets require one hash computation; the remaining 5.2% require more processing and storage.
Slide 13: Two-Level Filtering: Intuition (I)
- One-level filtering stores many sources that contact few destinations: the threshold sampling rate is needed to distinguish sources contacting k destinations from those contacting k/b, yet every source is sampled at that rate, even though in the expected distribution most sources contact few destinations.
- Use two-level filtering to reduce memory usage on such traffic distributions:
  - Coarse rate: decide whether to sample a source at the fine rate
  - Fine rate: distinguish between sources sending to k and k/b destinations
Slide 14: Two-Level Filtering: Intuition (II)
Example: k = 1000, b = 2. Suppose the coarse rate is 1/100.
- We expect a 1000-superspreader to show up once within its first 100 destinations, and w.h.p. within, say, its first 200.
- Use the remaining 800 destinations to distinguish it, w.h.p., from a source that sends to only 500 destinations.
- Only about 1% of the sources that send to few destinations are stored.
Similar worst-case guarantees, but significantly better under some natural distributions.
Slide 15: Two-Level Filtering Algorithm
Data structures: a coarse filter C (storing sampled sources s'1,1, s'1,2, ...) and m fine-filter hash tables F1, ..., Fm (each Fi storing sources si,1, si,2, ...).
For each packet (s, d):
- Step 1 (fine sampling): compute h1(s, d); if h1(s, d) < r1 and s is present in C, use h1(s, d) to pick an index i (splitting the range [0, r1) into m parts of size r1/m) and insert s into hash table Fi.
- Step 2 (coarse sampling): compute h2(s, d); if h2(s, d) < r2, store s in C.
Return all sources that appear in at least r of the hash tables Fi.
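The two-level scheme above can be sketched in Python as follows. This is a hypothetical illustration: the seeded SHA1 hashes, the bucket-index computation, and the parameter values used in the usage note are simplified stand-ins for the talk's settings.

```python
import hashlib
from collections import defaultdict

def h(seed, src, dst):
    """Hash the flow (src, dst) to [0, 1), with a per-level seed."""
    digest = hashlib.sha1(f"{seed}|{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def two_level_filter(stream, r1, r2, m, r):
    coarse = set()                       # coarse filter C: sampled sources
    fine = [set() for _ in range(m)]     # fine filters F_1..F_m
    for src, dst in stream:
        v1 = h("fine", src, dst)
        # Step 1: fine sampling, only for sources already in C.
        if v1 < r1 and src in coarse:
            i = int(v1 / (r1 / m))       # split [0, r1) into m equal parts
            fine[min(i, m - 1)].add(src)
        # Step 2: coarse sampling decides who gets fine-sampled later.
        if h("coarse", src, dst) < r2:
            coarse.add(src)
    # Report sources appearing in at least r of the fine filters.
    counts = defaultdict(int)
    for table in fine:
        for src in table:
            counts[src] += 1
    return {src for src, c in counts.items() if c >= r}
```

A superspreader enters the coarse filter early (after roughly 1/r2 destinations) and its many remaining flows then populate many distinct fine filters, while a source with few destinations rarely enters C at all, which is where the memory savings come from.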
Slide 16: Example: Two-Level Filtering
Example: k = 1000, b = 2, δ = 0.05. We compute r1 = 0.15, r2 = 0.006, m = 100.
- Case 1: sources that contact 1 distinct destination each: 85% of flows are discarded; 0.6% are entered into the coarse filter; 15% are examined only if the source is present in the coarse filter.
- Case 2: sources that are superspreaders: 85% of flows are discarded per superspreader; 15% of flows require entry into the fine filter.
Slide 17: Outline
Introduction, Problem Definition, Algorithms, Extensions, Experiments, Conclusions
Slide 18: Extension: Deletions in Stream
Goal: find superspreaders when deletions are allowed in the stream.
Application: find sources with many distinct connection failures. When a connection is initiated, the (src, dst) pair appears in the stream; when a response is received, that (src, dst) pair is deleted.
Example stream: (s1,d1,+1), (s1,d2,+1), (s1,d3,+1), (s2,d2,+1), (s1,d4,+1), (s2,d2,-1), (s1,d2,-1), ...
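One way to see why hash-based flow sampling composes with deletions: the sampling decision is a deterministic function of the flow, so a deletion touches exactly the state its insertion created. The sketch below is a hypothetical adaptation of the one-level filter, not the paper's deletion algorithm, and it assumes insertions and deletions of a flow alternate, as in the connection-failure application.

```python
import hashlib
from collections import defaultdict

def flow_hash(src, dst):
    """Hash the flow (src, dst) to a value in [0, 1)."""
    digest = hashlib.sha1(f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def filter_with_deletions(stream, c, r):
    """Each stream element is (src, dst, sign) with sign = +1 or -1.
    A deleted flow hashes to the same value as its insertion, so it
    is removed iff it was sampled in the first place."""
    table = defaultdict(set)
    for src, dst, sign in stream:
        if flow_hash(src, dst) < c:
            if sign > 0:
                table[src].add(dst)
            else:
                table[src].discard(dst)
    return {src for src, ds in table.items() if len(ds) > r}
```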
Slide 19: Extension: Sliding Windows
Goal: find superspreaders over sliding windows of packets, e.g. in only the most recent t packets, or in the last 1 hour.
Example stream (the window slides forward as new packets arrive): ... (s1,d1), (s1,d2), (s1,d3), (s2,d2), (s2,d4), (s1,d5), (s3,d4), ...
Slide 20: Extension: Distributed Monitoring
Given: a set of monitoring points, each of which sees a stream of packets.
Goal: find superspreaders in the union of the streams.
The one-level filtering algorithm needs very little communication.
Example: monitor A sees (s1,d1), (s1,d2), (s2,d3), (s1,d1), ...; monitor B sees (s1,d1), (s1,d3), (s2,d4), (s2,d5), ...; monitor C sees (s1,d1), (s2,d2), (s3,d3), (s4,d4), ...
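One way to see why one-level filtering distributes cheaply (a hypothetical sketch, not necessarily the talk's exact protocol): if every monitor uses the same hash function and threshold, each forwards only its sampled flows, and duplicates across monitors collapse in the coordinator's union.

```python
import hashlib
from collections import defaultdict

def flow_hash(src, dst):
    """Shared hash: the same flow hashes identically at every monitor."""
    digest = hashlib.sha1(f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def monitor_sample(stream, c):
    """Run locally at each monitoring point: keep a flow iff its
    hash falls below c. Only these sampled flows are communicated."""
    return {(s, d) for s, d in stream if flow_hash(s, d) < c}

def merge_and_report(sampled_sets, r):
    """Coordinator: union the monitors' samples (a flow seen at several
    monitors appears once) and report sources with > r sampled dests."""
    table = defaultdict(set)
    for sampled in sampled_sets:
        for s, d in sampled:
            table[s].add(d)
    return {s for s, ds in table.items() if len(ds) > r}
```

Communication is proportional to the number of sampled flows, a c-fraction of distinct flows, rather than to the full traffic volume.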
Slide 21: Outline
Introduction, Problem Definition, Algorithms, Extensions, Experiments, Conclusions
Slide 22: Experimental Setup
- Experiments run on a Pentium IV, 1.8 GHz, with 1 GB RAM
- Traces taken from the NLANR archive, ranging from 2.8 million packets (65 sec) to 4.5 million packets (4.5 min)
- Added 100 sources that contact k distinct destinations and 100 sources that contact k/b distinct destinations
- Used a randomly generated SHA1-based hash function for each run
- For all experiments, δ = 0.05
Slide 23: Experimental Results (I): Accuracy
- Both algorithms have the desired accuracy
- The false positive rate is much less than 0.05, since most (eligible) sources send to many fewer than k/b destinations
- The observed false positives come only from sources close to the boundary
Slide 24: Experimental Results (II)
Legend: 1LF = one-level filtering; 2LF-T = two-level filtering, hash-table implementation; 2LF-B = two-level filtering, Bloom-filter implementation.
[Charts: memory usage for k = 200 with b = 2, b = 5, b = 10]
As expected, when b increases, the sampling rates decrease, and total memory usage decreases. 2LF-B has the least memory usage.
Slide 25: Experimental Results (III)
Legend: 1LF = one-level filtering; 2LF-T = two-level filtering, hash-table implementation; 2LF-B = two-level filtering, Bloom-filter implementation.
[Charts: memory usage for b = 2 with k = 500, k = 1000, k = 5000]
As expected, when k increases, the sampling rates decrease, and total memory usage decreases. 2LF-B has the least memory usage.
Slide 26: Related Work
Networking: related problems include finding heavy-hitters [Estan-Varghese 02], multidimensional traffic clusters [Estan+ 03], the distribution of flow lengths [Duffield+ 03], and large changes in network traffic [Cormode-Muthukrishnan 03].
Streaming algorithms: most closely related is counting the number of distinct values in a stream [Flajolet-Martin 85, Alon-Matias-Szegedy 99, Cohen 97, Gibbons-Tirthapura 02, Bar-Yossef+ 02, Cormode+ 02].
Slide 27: Summary
- Defined the superspreader (and heavy distinct-hitter) problem
- One-pass streaming algorithms with theoretical guarantees on accuracy and overhead
- Experimental analysis validates the theoretical results
- Extensions to a model with deletions, sliding windows, and distributed monitoring
- The novel two-level filtering scheme may be of independent interest
Slide 28: Thank you!
Slide 29: Motivation (II)
Superspreaders are different from heavy-hitters! We care about many distinct destinations:
- A few large file transfers make a heavy-hitter, but not a superspreader
- Superspreaders are not necessarily heavy-hitters: in the test traces, superspreaders were < 0.004% of the total traffic analyzed
Slide 30: Theoretical Guarantees
Given k, b > 1, and δ, we can set parameters for both algorithms so that:
- Pr[k-superspreader output] > 1 - δ
- Pr[false positive output] < δ
- Expected memory (fixed b): O(N/k · log 1/δ)
- Per-packet processing time: constant. At most 2 hashes and 2 memory accesses per packet; most packets need 1 hash, or 1 hash and 1 memory access.
Optimization: implement two-level filtering with Bloom filters, which decreases memory usage but increases computational cost.