New Streaming Algorithms for Fast Detection of Superspreaders

Shobha Venkataraman* (shobha@cs.cmu.edu)

Joint work with: Dawn Song*, Phillip Gibbons¶, Avrim Blum*

*Carnegie Mellon University, ¶Intel Research Pittsburgh


Superspreaders

k-superspreader: host that contacts at least k distinct destinations in a short time period.
Goal: given a stream of packets, find the k-superspreaders.
Why care about superspreaders?
  Indicators of possible network attacks
  E.g., a compromised host in worm propagation contacts many distinct destinations
  Slammer worm contacted up to 26,000 hosts per second!
Automatic identification is useful in logging and throttling attack traffic


Heavy Distinct-Hitters

General problem: given a stream of (x, y) pairs, find all x paired with at least k distinct y: the heavy distinct-hitter problem.
Applications:
  Find dests contacted by many distinct srcs
  Find ports contacted by many distinct srcs/dests, or with high ICMP traffic
  Find potential spammers without per-src information
  Find nodes that contact many other nodes in peer-to-peer networks


Challenges

Need very efficient algorithms for high-speed links
Superspreaders are often a tiny fraction of network traffic: e.g., in traces, < 0.004% of total traffic
Need algorithms in the streaming model:
  Allow only one pass over data
  Much less storage than data
Distributed monitoring desirable; must have little communication between monitors


Strawman Approaches

Approach 1: track every src with a list of distinct destinations contacted, e.g. Snort
  Too much storage!
Approach 2: track every src with a distinct counter per src [Estan et al 03]
  Also too much storage!
Approach 3: use the multiple-cache data structure of Weaver et al 04
  Designed for a different problem; does not scale for finding superspreaders
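To make the storage cost of Approach 1 concrete, here is a minimal Python sketch of exact per-source tracking. It is only an illustration of the strawman, not code from the talk; the function name and the toy stream are assumptions.

```python
from collections import defaultdict

def exact_superspreaders(packets, k):
    """Strawman Approach 1: remember every distinct destination per source."""
    dests = defaultdict(set)              # src -> set of distinct dests seen
    for src, dst in packets:
        dests[src].add(dst)
    # Storage grows with the number of distinct (src, dst) flows in the stream.
    return {src for src, seen in dests.items() if len(seen) >= k}

# Hypothetical toy stream: s1 contacts 3 distinct dests, s2 contacts 1.
packets = [("s1", "d1"), ("s1", "d2"), ("s1", "d1"), ("s2", "d1"), ("s1", "d3")]
result = exact_superspreaders(packets, k=3)
```

Every distinct flow ends up in memory, which is the cost the streaming algorithms in this talk aim to avoid.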


Outline

Introduction
Problem Definition
Algorithms
Extensions
Experiments
Conclusions


Formal Problem Definition

Given k, b > 1, and probability of failure δ:
  any k-superspreader is output with probability at least 1 - δ
  any src that contacts < k/b distinct dests is output with probability < δ
  srcs in between may or may not be output
Thus, expect to identify a src as a superspreader after it contacts more than k/b and fewer than k distinct dests


Example

Example: k = 1000, b = 2, δ = 0.05. Then,
  Pr[src output | contacts ≥ 1000 dests] > 0.95
  Pr[src output | contacts < 500 dests] < 0.05
Expect gap between normal behaviour and superspreaders.
[Figure: number of distinct destinations contacted by srcs s1, s2, s3, with levels d1 = 1000, d2 = 750, d3 = 500]
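The three regions of the definition can be written down as a small Python helper. This is a sketch of the problem statement only; the function name and the labels it returns are assumptions chosen for illustration.

```python
def required_outcome(distinct_dests, k, b):
    """Classify a source by what the (k, b, delta) definition demands of it."""
    if distinct_dests >= k:
        return "report"      # must be output with probability at least 1 - delta
    if distinct_dests < k / b:
        return "suppress"    # may be output only with probability below delta
    return "either"          # gap region: no guarantee in either direction

# Values from the example (k = 1000, b = 2), plus one just under k/b = 500.
outcomes = {d: required_outcome(d, k=1000, b=2) for d in (1000, 750, 499)}
```

A source with exactly k/b = 500 distinct dests falls in the "either" region, since the definition only constrains sources strictly below k/b.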


Theoretical Guarantees

Given k, b > 1, and δ, can set parameters so that, for N distinct flows:
  Pr[k-superspreader output] > 1 - δ
  Pr[false positive output] < δ
  Expected memory (fixed b): O((N/k) log(1/δ))
    Note: as many as N/k k-superspreaders possible, so within O(log(1/δ)) of lower bound
  Per-packet processing time: constant
    At most 2 hashes and 2 memory accesses per packet
    Most packets get 1 hash, or 1 hash and 1 memory access


Outline

Introduction
Problem Definition
Algorithms
  One-Level Filtering Algorithm
  Two-Level Filtering Algorithm
Extensions
Experiments
Conclusions


One-Level Filtering Algorithm

For each packet (s, d):
  Step 1: Compute h(s, d)
  Step 2: If h(s, d) > c, discard packet
  Step 3: If h(s, d) < c, insert into hash table
Step 4: Report all srcs with more than r destinations in hash table
(We're effectively sampling distinct flows at rate c.)
[Figure: hash table with one row per stored src s1, s2, ..., sm, each row holding that src's sampled dests, e.g. d1,1 d1,2 ... d1,z]
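The four steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the salted SHA-1 mapping to [0, 1), the function names, and the use of Python sets as the hash table are all assumptions.

```python
import hashlib
from collections import defaultdict

def unit_hash(src, dst, salt=b""):
    """Map a (src, dst) pair to a pseudo-uniform value in [0, 1) via SHA-1."""
    digest = hashlib.sha1(salt + f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:6], "big") / 2**48

def one_level_filter(packets, c, r, salt=b""):
    table = defaultdict(set)                  # src -> sampled distinct dests
    for src, dst in packets:
        if unit_hash(src, dst, salt) < c:     # Steps 1-3: hash, then keep or discard
            table[src].add(dst)               # a repeated flow hashes identically
    # Step 4: report srcs with more than r sampled destinations.
    return {src for src, dests in table.items() if len(dests) > r}
```

Because the decision depends only on h(s, d), repeated packets of the same flow are either always kept or always dropped, which is why the scheme samples distinct flows rather than packets. Changing `salt` plays the role of choosing a fresh hash function for each run.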


Example: One-Level Filtering

Example: k = 1000, b = 2, δ = 0.05.
Compute that c = 0.052, r = 39
In expectation:
  94.8% of packets require one computation
  Remaining 5.2% require more processing & storage
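One way to sanity-check the example parameters is to compute exact binomial tail probabilities, under the assumption that each distinct destination is sampled independently with probability c (that is, the hash behaves like a uniform random function). The sketch below is such a check with hypothetical function names; it is not how the talk derives c and r.

```python
from math import comb

def binom_pmf(n, p, i):
    return comb(n, i) * p**i * (1 - p)**(n - i)

def miss_probability(k, c, r):
    """Pr[a src with exactly k distinct dests has at most r of them sampled]."""
    return sum(binom_pmf(k, c, i) for i in range(r + 1))

def false_positive_probability(n_dests, c, r):
    """Pr[a src with n_dests distinct dests has more than r of them sampled]."""
    return 1.0 - sum(binom_pmf(n_dests, c, i) for i in range(r + 1))

k, b, c, r = 1000, 2, 0.052, 39
miss = miss_probability(k, c, r)
false_pos = false_positive_probability(k // b - 1, c, r)   # 499 dests, just under k/b
expected_discard_fraction = 1 - c                           # 94.8% of distinct flows
```

Under this independence assumption both `miss` and `false_pos` come out below δ = 0.05, consistent with the guarantees stated earlier.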


Two-Level Filtering: Intuition (I)

One-level filtering stores many small-dest srcs
  Need threshold sampling rate to distinguish between srcs that contact k and k/b dests
  Expected distribution: most srcs contact few dests. But all srcs are sampled at the threshold rate.
Use two-level filtering to reduce memory usage on such traffic distributions
  Coarse rate: decide whether to sample at fine rate
  Fine rate: distinguish between srcs sending to k and k/b dests


Two-Level Filtering: Intuition (II)

Example: k = 1000, b = 2

Suppose coarse rate is 1/100

Expect that a 1000-superspreader will show up once in first 100 dests; w.h.p. in, say, first 200 dests

Use the remaining 800 dests to distinguish from a source that sends to only 500 dests w.h.p.

Only store 1% of the sources that send to few dests

Similar worst-case guarantees, but significantly better under some natural distributions
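The arithmetic behind this intuition can be sketched as below, assuming each distinct flow enters the coarse filter independently with probability equal to the coarse rate. The function name is hypothetical.

```python
def prob_in_coarse_filter(num_dests, coarse_rate):
    """Pr[at least one of num_dests distinct flows is sampled at coarse_rate]."""
    return 1.0 - (1.0 - coarse_rate) ** num_dests

coarse_rate = 1 / 100
p_single = prob_in_coarse_filter(1, coarse_rate)       # src with a single dest
p_first_200 = prob_in_coarse_filter(200, coarse_rate)  # superspreader's first 200 dests
expected_first_hit = 1 / coarse_rate                   # mean of a geometric distribution
```

With these numbers a single-destination source is stored with probability 0.01, the first coarse sample of a busy source is expected after about 100 distinct dests, and the chance of at least one sample within the first 200 dests is roughly 0.87 under the independence assumption.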


Two-Level Filtering Algorithm

For each packet (s, d):
  Step 1: Compute h1(s, d). Sample: if h1(s, d) < r1 and s is present in C, compute k = r1/m and insert s into hash table Fk
  Step 2: Compute h2(s, d). Sample: if h2(s, d) < r2, store s in C
Return all the sources that appear in at least r of the hash tables Fi
[Figure: fine-filter hash tables F1, F2, ..., Fm, each holding sampled srcs (e.g. s1,1 s1,2 ... s1,z), and coarse filter C holding sampled srcs (s'1,1 s'1,2 ... s'1,w)]
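A rough Python sketch of the two steps follows. The slide does not spell out how the fine-table index is derived, so the bucket rule used here (splitting [0, r1) into m equal ranges of width r1/m) is an assumption, as are the function names, the salts that stand in for two hash functions, and the use of sets for C and the Fi.

```python
import hashlib
from collections import defaultdict

def unit_hash(src, dst, salt):
    """Map a (src, dst) pair to a pseudo-uniform value in [0, 1) via SHA-1."""
    digest = hashlib.sha1(salt + f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:6], "big") / 2**48

def two_level_filter(packets, r1, r2, m, r):
    coarse = set()                           # C: srcs picked up by the coarse sample
    fine = [set() for _ in range(m)]         # F1..Fm: fine-filter tables
    for src, dst in packets:
        h1 = unit_hash(src, dst, b"fine")
        if h1 < r1 and src in coarse:        # Step 1
            index = min(int(h1 / (r1 / m)), m - 1)   # ASSUMED bucket rule
            fine[index].add(src)
        if unit_hash(src, dst, b"coarse") < r2:      # Step 2
            coarse.add(src)
    counts = defaultdict(int)
    for table in fine:
        for src in table:
            counts[src] += 1
    # Report srcs that appear in at least r of the fine tables.
    return {src for src, n in counts.items() if n >= r}
```

Each fine table stores a source at most once, so the per-source state is bounded by m entries no matter how many destinations that source contacts.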


Example: Two-Level Filtering

Example: k = 1000, b = 2, δ = 0.05.
Compute r1 = 0.15, r2 = 0.006, m = 100
Case 1: srcs that contact 1 distinct dest each
  85% of flows discarded
  0.6% entered into coarse filter
  15% examined if present in coarse filter
Case 2: srcs that are superspreaders
  85% of flows discarded per superspreader
  15% of flows require entry into fine filter


Outline

Introduction
Problem Definition
Algorithms
Extensions
Experiments
Conclusions


Extension: Deletions in Stream

Goal: find superspreaders when deletions are allowed in the stream
Application: find srcs with many distinct connection failures
  Connection initiated: (src, dst) pair appears in stream
  Response received: that (src, dst) pair gets deleted

(s1,d1,1), (s1,d2,1), (s1,d3,1), (s2,d2,1), (s1,d4,1), (s2,d2,-1) ...

(s1,d1,1), (s1,d2,1), (s1,d3,1), (s2,d2,1), (s1,d4,1), (s2,d2,-1), (s1,d2,-1)...
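To pin down what the two example streams mean, here is an exact (non-streaming) reference in Python that counts, per source, the distinct destinations inserted and not yet deleted. It only illustrates the stream semantics; it is not the deletion-aware algorithm from the talk, and the function name is an assumption.

```python
from collections import defaultdict

def outstanding_dests(stream):
    """Exact reference: distinct dests per src that were inserted and not deleted."""
    live = defaultdict(set)
    for src, dst, op in stream:
        if op == 1:
            live[src].add(dst)        # connection initiated
        elif op == -1:
            live[src].discard(dst)    # response received
    return {src: len(dsts) for src, dsts in live.items()}

stream = [("s1", "d1", 1), ("s1", "d2", 1), ("s1", "d3", 1), ("s2", "d2", 1),
          ("s1", "d4", 1), ("s2", "d2", -1)]
before = outstanding_dests(stream)
after = outstanding_dests(stream + [("s1", "d2", -1)])
```

In the first stream s1 has four unanswered connections and s2 has none; once (s1,d2,-1) arrives, s1 drops to three.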


Extension: Sliding Windows

Goal: Find superspreaders over sliding windows of packets, e.g. in only the most recent t packets, or the last 1 hour.

… (s1,d1), (s1,d2), (s1,d3), (s2,d2), (s2,d4) ...

… (s1,d1), (s1,d2), (s1,d3), (s2,d2), (s2,d4), (s1,d5) ...

… (s1,d1), (s1,d2), (s1,d3), (s2,d2), (s2,d4), (s1,d5), (s3,d4) ...
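The sliding-window semantics can be illustrated with an exact reference that recomputes the counts over the most recent t packets. The window size t = 5 is an assumption picked to match the three snapshots above, and the function is a brute-force illustration rather than the streaming algorithm.

```python
from collections import defaultdict

def distinct_dests_in_window(packets, t):
    """Exact reference: distinct dests per src among the most recent t packets."""
    if t <= 0:
        return {}
    dests = defaultdict(set)
    for src, dst in packets[-t:]:
        dests[src].add(dst)
    return {src: len(d) for src, d in dests.items()}

packets = [("s1", "d1"), ("s1", "d2"), ("s1", "d3"), ("s2", "d2"), ("s2", "d4"),
           ("s1", "d5"), ("s3", "d4")]
# Counts after the 5th, 6th and 7th packet, with an assumed window of t = 5.
counts_t5 = [distinct_dests_in_window(packets[:n], t=5) for n in (5, 6, 7)]
```

As the window slides, old pairs such as (s1,d1) and (s1,d2) expire, so the count for s1 can fall even while s1 keeps sending.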


Extension: Distributed Monitoring

Given: set of monitoring points, each point sees a stream of packets
Goal: Find superspreaders in union of streams
One-level filtering algorithm needs very little communication
A: (s1,d1), (s1,d2), (s2,d3), (s1,d1) ...
B: (s1,d1), (s1,d3), (s2,d4), (s2,d5)...
C: (s1,d1), (s2,d2), (s3,d3), (s4,d4)...
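One plausible reading of why one-level filtering needs little communication is that every monitor can apply the same hash function and ship only its sampled flows, which are then merged by set union. The Python sketch below illustrates that reading; the coordinator step, the function names, and the shared salt are assumptions rather than details given in the talk.

```python
import hashlib

def unit_hash(src, dst, salt=b"shared"):
    """Same hash at every monitor, mapping (src, dst) to [0, 1)."""
    digest = hashlib.sha1(salt + f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:6], "big") / 2**48

def local_sample(packets, c):
    """Run at each monitor: keep the distinct flows whose hash falls below c."""
    return {(s, d) for s, d in packets if unit_hash(s, d) < c}

def merge_and_report(samples, r):
    merged = set().union(*samples)    # identical flows from different monitors collapse
    counts = {}
    for src, _ in merged:
        counts[src] = counts.get(src, 0) + 1
    return {src for src, n in counts.items() if n > r}

A = [("s1", "d1"), ("s1", "d2"), ("s2", "d3"), ("s1", "d1")]
B = [("s1", "d1"), ("s1", "d3"), ("s2", "d4"), ("s2", "d5")]
C = [("s1", "d1"), ("s2", "d2"), ("s3", "d3"), ("s4", "d4")]
# c = 1.0 keeps every flow so the toy result is easy to follow.
reported = merge_and_report([local_sample(x, c=1.0) for x in (A, B, C)], r=2)
```

Because a flow such as (s1,d1) hashes to the same value at A, B and C, it is counted once in the union even though all three monitors see it.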


Outline

Introduction
Problem Definition
Algorithms
Extensions
Experiments
Conclusions


Experimental Setup

Experiments run on Pentium IV, 1.8 GHz with 1 GB RAM
Traces taken from NLANR archive, ranging from 2.8 million packets (65 sec) to 4.5 million packets (4.5 min)
Added 100 srcs that contact k distinct dests and 100 srcs that contact k/b distinct dests
Use randomly generated SHA1 hash function for each run
For all experiments, δ = 0.05


Experimental Results (I)

Accuracy Discussion:
  Both algorithms have desired accuracy
  False positive rate much less than 0.05, since most (eligible) srcs send to many fewer than k/b dests
  Observed false positives only come from srcs close to the boundary


Experimental Results (II)

1LF = 1-Level Filtering
2LF-T = 2-Level Filtering, hash-table implementation
2LF-B = 2-Level Filtering, Bloom-filter implementation
As expected, when b increases, sampling rates decrease, and total memory usage decreases
2LF-B has least memory usage
[Figure panels: k = 200, b = 2; k = 200, b = 5; k = 200, b = 10]


Experimental Results (III)

1LF = 1-Level Filtering
2LF-T = 2-Level Filtering, hash-table implementation
2LF-B = 2-Level Filtering, Bloom-filter implementation
As expected, when k increases, sampling rates decrease, and total memory usage decreases
2LF-B has least memory usage
[Figure panels: k = 500, b = 2; k = 1000, b = 2; k = 5000, b = 2]


Related Work

Networking:
  Related problems: finding heavy-hitters [Estan-Varghese 02], multidimensional traffic clusters [Estan+ 03], distribution of flow lengths [Duffield+ 03], large changes in network traffic [Cormode-Muthukrishnan 03]
Streaming Algorithms:
  Most closely related: counting number of distinct values in a stream [Flajolet-Martin 85, Alon-Matias-Szegedy 99, Cohen 97, Gibbons-Tirthapura 02, Bar-Yossef+ 02, Cormode+ 02]


Summary

Defined superspreader (and heavy distinct-hitter) problem
One-pass streaming algorithms:
  Theoretical guarantees on accuracy and overhead
  Experimental analysis validates theoretical results
Extensions to model with deletions, sliding windows and distributed monitoring
Novel two-level filtering scheme may be of independent interest


Thank you!


Motivation (II)

Superspreaders different from heavy-hitters!
  Care about many distinct destinations
  Few large file transfers => heavy-hitter, but not superspreader
  Superspreaders not necessarily heavy-hitters
  In test traces, superspreaders < 0.004% of total traffic analyzed


Theoretical Guarantees

Given k, b > 1, and δ, can set parameters for both algorithms so that:
  Pr[k-superspreader output] > 1 - δ
  Pr[false positive output] < δ
  Expected memory (fixed b): O((N/k) log(1/δ))
  Per-packet processing time: constant
    At most 2 hashes and 2 memory accesses per packet
    Most packets get 1 hash, or 1 hash + 1 memory access

Optimization: implement Two-Level Filtering with Bloom filters – decreases memory usage, increases computational cost.
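As an illustration of the trade-off, here is a minimal Bloom filter in Python that could stand in for a set-valued table such as the coarse filter. The bit-array size, the number of hash functions, and the way positions are derived from SHA-1 are all assumptions for the sketch; the talk does not give these details.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)   # fixed memory, set up front

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-1 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

coarse = BloomFilter(num_bits=1024, num_hashes=3)    # hypothetical sizes
coarse.add("s1")
is_present = "s1" in coarse
```

Membership tests never miss an inserted source but can return false positives, and each operation costs several hash evaluations, which matches the slide's note that memory goes down while computation goes up.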