Distributed Top-K Monitoring


Page 1: Distributed Top-K Monitoring

1

Distributed Top-K Monitoring

Brian Babcock & Chris Olston

Presented by Yuval Altman

To be presented at ACM SIGMOD 2003 International Conference on Management of Data

Page 2: Distributed Top-K Monitoring

2

The problem

Continuously report the k largest values obtained from distributed data streams.

Page 3: Distributed Top-K Monitoring

3

Motivation

Google is the most popular search engine in the world.

Servers in multiple sites in the world handle millions of queries an hour.

What are the top 20 search terms?

Page 4: Distributed Top-K Monitoring

4

The problem

Continuously report the k largest values obtained from distributed data streams.

Multiple sources, physically far apart: communication is expensive, so transmitting large amounts of data is inefficient.

Streaming model: values change over time.

Approximation may be sufficient

Page 5: Distributed Top-K Monitoring

5

Motivation – Detecting DDoS attacks

Page 6: Distributed Top-K Monitoring

6

Formal problem definition

m+1 nodes:

Monitor nodes: N1, N2, … , Nm

Coordinator node: N0

Set of n data objects U = {O1, O2, … , On}, e.g. search terms or IP addresses.

Objects are associated with real values V1, V2, … , Vn, e.g. the number of requests (DNS queries) to an IP address in the last 15 minutes.

Page 7: Distributed Top-K Monitoring

7

Distributed streaming model

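Updates to values arrive as a sequence of tuples $\langle O_i, N_j, \Delta \rangle$, where:

Nj detects a change $\Delta$ in the value Vi of Oi.

The change is not seen by the other nodes Nk ($k \ne j$).

For each node Nj, define partial values $V_{1,j}, V_{2,j}, \ldots, V_{n,j}$, where $V_{i,j} = \sum_{\langle O_i, N_j, \Delta \rangle} \Delta$ is the sum of all changes to Oi seen at Nj.

The value Vi for an object Oi: $V_i = \sum_{1 \le j \le m} V_{i,j}$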

Page 8: Distributed Top-K Monitoring

8

Model example

U = {O1, O2, O3, O4}

Stream at N1: <O1, N1, 2> <O2, N1, 3> <O4, N1, 4> <O3, N1, 2> <O1, N1, 1>

Stream at N2: <O2, N2, 3> <O4, N2, 5> <O4, N2, -2> <O3, N2, 4> <O3, N2, 5>

Stream at N3: <O2, N3, -1> <O3, N3, 4> <O2, N3, 2> <O3, N3, 3> <O2, N3, 5>

Partial values at N1: V1,1 = 3, V2,1 = 3, V3,1 = 2, V4,1 = 4

Partial values at N2: V1,2 = 0, V2,2 = 3, V3,2 = 9, V4,2 = 3

Partial values at N3: V1,3 = 0, V2,3 = 6, V3,3 = 7, V4,3 = 0

Totals: V1 = 3, V2 = 12, V3 = 18, V4 = 7
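The model example can be replayed in a few lines of Python. This is only a sketch for checking the arithmetic of the slide; the function names `partial_values` and `total_values` are illustrative and do not come from the presentation.

```python
from collections import defaultdict

def partial_values(tuples):
    """Sum the changes per (node, object): V[i,j] is the sum of all deltas for Oi seen at Nj."""
    partial = defaultdict(lambda: defaultdict(int))
    for obj, node, change in tuples:
        partial[node][obj] += change
    return partial

def total_values(partial):
    """Sum the partial values over all nodes: V[i] = sum over j of V[i,j]."""
    totals = defaultdict(int)
    for per_object in partial.values():
        for obj, value in per_object.items():
            totals[obj] += value
    return dict(totals)

# The three streams of the model example, as (object, node, change) tuples.
stream = [
    ("O1", "N1", 2), ("O2", "N1", 3), ("O4", "N1", 4), ("O3", "N1", 2), ("O1", "N1", 1),
    ("O2", "N2", 3), ("O4", "N2", 5), ("O4", "N2", -2), ("O3", "N2", 4), ("O3", "N2", 5),
    ("O2", "N3", -1), ("O3", "N3", 4), ("O2", "N3", 2), ("O3", "N3", 3), ("O2", "N3", 5),
]

partial = partial_values(stream)
totals = total_values(partial)
print(totals)
```

The totals agree with the slide: V1 = 3, V2 = 12, V3 = 18, V4 = 7.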

Page 9: Distributed Top-K Monitoring

9

Using the model

Top-k IP addresses in the last 15 minutes:

Emit <IPAddr, Router, 1> when receiving a request for an IP address, and a cancelling <IPAddr, Router, -1> 15 minutes afterwards.

A different strategy can be adopted: emit <IPAddr, Router, 15> when receiving a request, then <IPAddr, Router, -1> 15 times, once on each following minute.
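A minimal sketch of the first strategy (a +1 tuple per request, cancelled by a -1 tuple 15 minutes later). The helper names `window_tuples` and `count_at`, the router name, and the example IP addresses are assumptions made for illustration; time is measured in whole minutes.

```python
def window_tuples(requests, router, window=15):
    """Turn (minute, ip) requests into timestamped <ip, router, change> tuples.

    Each request contributes +1 when it arrives and a cancelling -1
    `window` minutes afterwards.
    """
    tuples = []
    for minute, ip in requests:
        tuples.append((minute, (ip, router, +1)))
        tuples.append((minute + window, (ip, router, -1)))
    tuples.sort(key=lambda entry: entry[0])  # replay in time order
    return tuples

def count_at(tuples, ip, now):
    """Partial value for `ip` after applying every tuple emitted up to minute `now`."""
    return sum(change for minute, (obj, _, change) in tuples
               if obj == ip and minute <= now)

requests = [(0, "10.0.0.1"), (5, "10.0.0.1"), (7, "10.0.0.2")]
tuples = window_tuples(requests, router="R1")
```

With these requests, the partial value for 10.0.0.1 is 2 at minute 10, drops to 1 at minute 15 when the first request is cancelled, and reaches 0 at minute 20.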

Page 10: Distributed Top-K Monitoring

10

The problem

The coordinator node N0 must report a set $T \subseteq U$, $|T| = k$, that represents the top-k data objects.

T must be correct within $\epsilon$.

Formally, if $O_t \in T$ and $O_s \in U - T$:

$V_t + \epsilon \ge V_s$

Example

$\epsilon = 5$

Values: 100, 97, 95, 92, 90, 88, 87, 83, 80, 75
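The validity condition can be checked directly. The sketch below uses the ten values of the example with $\epsilon = 5$; the object names A to J and the choice k = 3 are assumptions added for illustration, since the slide lists only the values.

```python
def is_valid_top_k(values, top_k, eps):
    """True if every object in top_k satisfies V_t + eps >= V_s for every object outside it."""
    inside = [values[obj] for obj in top_k]
    outside = [value for obj, value in values.items() if obj not in top_k]
    if not inside or not outside:
        return True  # nothing to compare against
    return min(inside) + eps >= max(outside)

# Hypothetical object names paired with the values from the example.
values = dict(zip("ABCDEFGHIJ", [100, 97, 95, 92, 90, 88, 87, 83, 80, 75]))

exact = is_valid_top_k(values, {"A", "B", "C"}, eps=5)      # the true top 3
approx = is_valid_top_k(values, {"A", "B", "D"}, eps=5)     # 92 + 5 >= 95
too_far = is_valid_top_k(values, {"A", "B", "F"}, eps=5)    # 88 + 5 < 95
```

Under these assumptions the set {A, B, D} is an acceptable answer even though D is not in the exact top 3, while {A, B, F} is not.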

Page 11: Distributed Top-K Monitoring

11

Related work

One-time distributed top-k calculation: Bruno, Gravano, Marian 2002; Fagin, Lotem, Naor 2001.

Much better than transmitting all the values to the coordinator node.

Not streaming: there is no means to detect changes to the data, and running the algorithm continuously is very expensive.

Monitor nodes have limited query capabilities: sorted access (GetNext) and random access (GetValue).

Page 12: Distributed Top-K Monitoring

12

Related work

Streaming top-k monitoring from a single source: Charikar, Chen, Farach-Colton 2002; Manku, Motwani 2002; Gibbons, Matias 1998.

Randomized algorithms that focus on minimizing space.

Reminder: The objective is to minimize communication costs

Page 13: Distributed Top-K Monitoring

13

Overview of algorithm

Initialize a top-k set at the coordinator node

Set arithmetic constraints at the monitor nodes; these depend on the current top-k set.

While the constraints are valid: no communication.

When a constraint is invalidated: resolution, possibly a new top-k set, and reallocation of the constraints.

Page 14: Distributed Top-K Monitoring

14

Choosing the constraints

Ideally, data is distributed evenly across the monitor nodes, so that the local top-k sets are all the same. In this case the global top-k set matches the local top-k sets, and it suffices that the local constraints remain valid.

N1 (US): Money=100, Sex=98, Health=94, Mail=92

N2 (Germany): Sex=30, Money=20, Mail=5, Health=3

N3 (Japan): Money=50, Sex=5, Mail=4, Health=1

Global list: Money=170, Sex=133, Mail=101, Health=98

Page 15: Distributed Top-K Monitoring

15

Adjustment factors

In real life, data is not distributed evenly

Updates: <Sex, N1, -8> and <Health, N3, 5>

N1 (US): Money=100, Health=94, Mail=92, Sex=90

N2 (Germany): Sex=30, Money=20, Mail=5, Health=3

N3 (Japan): Money=50, Health=6, Sex=5, Mail=4

Global list: Money=170, Sex=125, Health=103, Mail=101

Local constraints are invalidated, but the global top-k is still valid.

Page 16: Distributed Top-K Monitoring

16

Adjustment factors

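For each node Nj and object Oi, associate an adjustment factor $\delta_{i,j}$.

Constraints are evaluated after adding the adjustment factors. If $O_t \in T$ and $O_s \in U - T$:

$V_{t,j} + \delta_{t,j} \ge V_{s,j} + \delta_{s,j}$

Adjustment factors for each object sum to zero: $\sum_j \delta_{i,j} = 0$. This ensures that the adjusted values of an object still sum to its true value $V_i$.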

Page 17: Distributed Top-K Monitoring

17

Adjustment factors example

Before adjustment:

N1 (US): Money=100, Health=94, Mail=92, Sex=90

N2 (Germany): Sex=30, Money=20, Mail=5, Health=3

N3 (Japan): Money=50, Health=6, Sex=5, Mail=4

Global list: Money=170, Sex=125, Health=103, Mail=101

Adjustment factors: $\delta_{\text{Sex},1} = 10$, $\delta_{\text{Sex},2} = -15$, $\delta_{\text{Sex},3} = 5$

After adjustment:

N1 (US): Money=100, Sex=100, Health=94, Mail=92

N2 (Germany): Money=20, Sex=15, Mail=5, Health=3

N3 (Japan): Money=50, Sex=10, Health=6, Mail=4

Global list: Money=170, Sex=125, Health=103, Mail=101
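The effect of the adjustment factors in this example can be verified with a short script. It is a sketch only: the top-k set {Money, Sex} (k = 2) is inferred from the global list, and the helper names are illustrative.

```python
def adjusted(values, deltas):
    """Partial values plus adjustment factors (missing factors count as 0)."""
    return {obj: value + deltas.get(obj, 0) for obj, value in values.items()}

def local_constraints_hold(values, deltas, top_k):
    """Every adjusted top-k value must be >= every adjusted non-top-k value."""
    adj = adjusted(values, deltas)
    lowest_inside = min(adj[obj] for obj in top_k)
    highest_outside = max(value for obj, value in adj.items() if obj not in top_k)
    return lowest_inside >= highest_outside

top_k = {"Money", "Sex"}  # assumed: k = 2, taken from the global list
nodes = {
    "N1": {"Money": 100, "Health": 94, "Mail": 92, "Sex": 90},
    "N2": {"Sex": 30, "Money": 20, "Mail": 5, "Health": 3},
    "N3": {"Money": 50, "Health": 6, "Sex": 5, "Mail": 4},
}
deltas = {"N1": {"Sex": 10}, "N2": {"Sex": -15}, "N3": {"Sex": 5}}

before = {name: local_constraints_hold(vals, {}, top_k) for name, vals in nodes.items()}
after = {name: local_constraints_hold(vals, deltas[name], top_k) for name, vals in nodes.items()}
factor_sum = sum(d["Sex"] for d in deltas.values())
```

Without adjustment, N1 and N3 violate their local constraints; with the factors 10, -15 and 5 (which sum to zero) all three nodes satisfy them.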

Page 18: Distributed Top-K Monitoring

18

Coordinator adjustment factor

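For each object Oi, add an adjustment factor $\delta_{i,0}$ at the coordinator node. The factors for each object must still sum to 0: $\sum_{0 \le j \le m} \delta_{i,j} = 0$.

To allow error, if $O_t \in T$ and $O_s \in U - T$: give the $O_t$ values a “bonus” of $\epsilon$. Let $V_{t,0} = V_{s,0} = 0$.

The constraint: $\delta_{t,0} + \epsilon \ge \delta_{s,0}$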

Page 19: Distributed Top-K Monitoring

19

Allowing error – example

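Update: <Health, N3, 40>, with $\epsilon = 5$

N1 (US): Money=100, Sex=98, Health=94, Mail=92

N2 (Germany): Sex=30, Money=20, Mail=5, Health=3

N3 (Japan): Money=50, Health=41, Sex=5, Mail=4

Global list: Money=170, Health=138, Sex=133, Mail=101

$\delta_{\text{Sex},1} = -4$, $\delta_{\text{Sex},2} = -25$, $\delta_{\text{Sex},3} = 29$

$\delta_{\text{Health},2} = 2$, $\delta_{\text{Health},3} = -7$

The trick: $\delta_{\text{Health},0} = 5$, so that $\delta_{\text{Sex},0} + 5 \ge \delta_{\text{Health},0}$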

Page 20: Distributed Top-K Monitoring

20

Why do adjustment factors work?

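For $O_t \in T$ and $O_s \in U - T$:

As long as the adjusted constraint at each monitor node Nj and the coordinator constraint are valid:

$V_{t,j} + \delta_{t,j} \ge V_{s,j} + \delta_{s,j}$

$\delta_{t,0} + \epsilon \ge \delta_{s,0}$

we can sum the node constraints and the coordinator constraint and get: $V_t + \epsilon \ge V_s$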
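The summation step can be written out as follows. It uses only the definitions already given: $V_i = \sum_{j=1}^{m} V_{i,j}$ and $\sum_{j=0}^{m} \delta_{i,j} = 0$ for every object.

```latex
\begin{align*}
\sum_{j=1}^{m} \left( V_{t,j} + \delta_{t,j} \right) + \delta_{t,0} + \epsilon
  &\ge \sum_{j=1}^{m} \left( V_{s,j} + \delta_{s,j} \right) + \delta_{s,0} \\
V_t + \sum_{j=0}^{m} \delta_{t,j} + \epsilon
  &\ge V_s + \sum_{j=0}^{m} \delta_{s,j} \\
V_t + \epsilon &\ge V_s
\end{align*}
```

The first line adds the m monitor constraints to the coordinator constraint; the last line follows because the adjustment factors of each object cancel.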

Page 21: Distributed Top-K Monitoring

21

Algorithm details

Coordinator node N0 maintains: the current approximate top-k set, and all adjustment factors $\delta_{i,j}$.

Each monitor node Nj maintains: the current approximate top-k set and, for each object Oi, the partial value $V_{i,j}$ and the relevant adjustment factor $\delta_{i,j}$.

Page 22: Distributed Top-K Monitoring

22

Algorithm details

Initialization. The coordinator computes the approximate top-k set once, chooses adjustment factors, and sends the adjustment factors and the top-k set to the monitors.

Monitor node constraints: for $O_t \in T$ and $O_s \in U - T$: $V_{t,j} + \delta_{t,j} \ge V_{s,j} + \delta_{s,j}$

Adjustment factor constraints: for each object Oi: $\sum_j \delta_{i,j} = 0$

For objects $O_t \in T$ and $O_s \in U - T$: $\delta_{t,0} + \epsilon \ge \delta_{s,0}$

Page 23: Distributed Top-K Monitoring

23

Algorithm for monitor node Nj

While (1)

Read tuple $\langle O_i, N_j, \Delta \rangle$

$V_{i,j} = V_{i,j} + \Delta$

Check constraints: for $O_t \in T$ and $O_s \in U - T$: $V_{t,j} + \delta_{t,j} \ge V_{s,j} + \delta_{s,j}$

If invalid, initiate resolution.

End

To check constraints: use two heaps (or Fibonacci heaps)
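A minimal sketch of one iteration of the monitor loop. For brevity it compares every pair directly instead of maintaining the two heaps mentioned on the slide, so it is not the efficient version. The function names, the top-k set {Money, Sex} and the update of +6 to Mail are assumptions chosen for illustration.

```python
def violated_objects(values, deltas, top_k):
    """Return the set F of objects involved in violated local constraints."""
    adj = {obj: values[obj] + deltas.get(obj, 0) for obj in values}
    violated = set()
    for t in top_k:
        for s in adj:
            # Constraint: adjusted top-k value >= adjusted non-top-k value.
            if s not in top_k and adj[t] < adj[s]:
                violated.update((t, s))
    return violated

def apply_update(values, deltas, top_k, obj, change):
    """Process one tuple <obj, Nj, change>; a non-empty result means resolution is needed."""
    values[obj] = values.get(obj, 0) + change
    return violated_objects(values, deltas, top_k)

top_k = {"Money", "Sex"}   # assumed current top-k set
deltas = {}                # all adjustment factors taken as 0
values = {"Money": 50, "Sex": 5, "Mail": 4, "Health": 1, "Love": 0}

before = violated_objects(values, deltas, top_k)
after = apply_update(values, deltas, top_k, "Mail", 6)  # hypothetical update
```

Before the update no constraint is violated; afterwards Mail (10) exceeds Sex (5), so the node would initiate resolution with F = {Mail, Sex}.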

Page 24: Distributed Top-K Monitoring

24

Resolution – phase 1

First, Nj sends a message to N0 with:

F: the set of objects involved in violated constraints.

All partial values for objects in the resolution set $R = F \cup T$.

The border value Bj: the maximum adjusted value among the objects not in the resolution set.

Example at N3 (Japan): Money=50, Mail=10, Sex=5, Health=1, Love=0

F3 = {Mail, Sex}, R3 = {Money, Mail, Sex}

Vmoney,3 = 50, Vmail,3 = 10, Vsex,3 = 5, B3 = 1
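The contents of the phase 1 message can be assembled as sketched below, using the N3 example. The function name `resolution_message`, the top-k set {Money, Sex}, zero adjustment factors, and the use of negative infinity when no object lies outside R are all assumptions for illustration.

```python
def resolution_message(values, deltas, top_k, violated):
    """Build (R, partial values for R, border value) for the message to the coordinator."""
    resolution_set = set(violated) | set(top_k)           # R = F union T
    partials = {obj: values[obj] for obj in resolution_set}
    outside = [values[obj] + deltas.get(obj, 0)
               for obj in values if obj not in resolution_set]
    # Border value: largest adjusted value among objects not in R.
    border = max(outside) if outside else float("-inf")
    return resolution_set, partials, border

values = {"Money": 50, "Mail": 10, "Sex": 5, "Health": 1, "Love": 0}
resolution_set, partials, border = resolution_message(
    values, deltas={}, top_k={"Money", "Sex"}, violated={"Mail", "Sex"})
```

This reproduces the slide: R3 = {Money, Mail, Sex}, partial values 50, 10 and 5, and border value B3 = 1 (Health).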

Page 25: Distributed Top-K Monitoring

25

Resolution – phase 2

The coordinator N0 attempts to resolve the constraints using the $\delta_{*,0}$ slack.

For each violated constraint, N0 tests:

$V_{t,j} + \delta_{t,j} + \delta_{t,0} + \epsilon \ge V_{s,j} + \delta_{s,j} + \delta_{s,0}$

If all tests succeed, the top-k set is still valid and there is no need to communicate with other nodes. N0 reallocates adjustment factors, and resolution is over.

If at least one test fails, proceed to phase 3.
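A sketch of the phase 2 test. The function name and argument layout are illustrative; the two calls use the numbers of the success and failure examples on the next two slides ($\epsilon = 5$, with $\delta_{\text{Sex},0} = -2$ in the second case).

```python
def phase2_resolves(violations, values, deltas, coord_deltas, eps):
    """True if every violated constraint (t, s) at node Nj passes the coordinator's test.

    values / deltas: partial values and adjustment factors at the reporting node Nj.
    coord_deltas:    the coordinator's own factors, delta[i,0].
    """
    for t, s in violations:
        lhs = values[t] + deltas.get(t, 0) + coord_deltas.get(t, 0) + eps
        rhs = values[s] + deltas.get(s, 0) + coord_deltas.get(s, 0)
        if lhs < rhs:
            return False
    return True

# N2 after Mail rises to 22: Sex (20) is overtaken, but the slack of 5 covers it.
success = phase2_resolves([("Sex", "Mail")],
                          values={"Money": 35, "Mail": 22, "Sex": 20, "Health": 3},
                          deltas={}, coord_deltas={}, eps=5)

# N3 after Mail rises to 9: only 5 - 2 = 3 of slack is left, but 4 is needed.
failure = phase2_resolves([("Sex", "Mail")],
                          values={"Money": 50, "Mail": 9, "Sex": 5, "Health": 1},
                          deltas={}, coord_deltas={"Sex": -2}, eps=5)
```

The first call succeeds (20 + 5 ≥ 22), so resolution ends in phase 2; the second fails (5 - 2 + 5 < 9), so phase 3 is required.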

Page 26: Distributed Top-K Monitoring

26

Phase 2 resolution example

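$\epsilon = 5$, all adjustment factors $\delta_{*,*} = 0$

Before the update:

N1: Money=100, Sex=98, Mail=96, Health=92

N2: Money=35, Sex=20, Mail=5, Health=3

N3: Money=50, Sex=5, Mail=4, Health=1

Global list: Money=185, Sex=123, Mail=105, Health=96

Update: <Mail, N2, 17>

After the update:

N1: Money=100, Sex=98, Mail=96, Health=92

N2: Money=35, Mail=22, Sex=20, Health=3

N3: Money=50, Sex=5, Mail=4, Health=1

Global list: Money=185, Sex=123, Mail=122, Health=96

To fix: $\delta_{\text{Sex},0} = -2$, $\delta_{\text{Sex},2} = 2$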

Page 27: Distributed Top-K Monitoring

27

Phase 2 resolution failure

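Adjustment factors in effect: $\delta_{\text{Sex},0} = -2$, $\delta_{\text{Sex},2} = 2$

Update: <Sex, N2, 5>

N1: Money=100, Sex=98, Mail=96, Health=92

N2: Money=35, Sex=27 (adjusted value: 25 + 2), Mail=22, Health=3

N3: Money=50, Sex=5, Mail=4, Health=1

Global list: Money=185, Sex=128, Mail=122, Health=96

Update: <Mail, N3, 5>

N1: Money=100, Sex=98, Mail=96, Health=92

N2: Money=35, Sex=27, Mail=22, Health=3

N3: Money=50, Mail=9, Sex=5, Health=1

Global list: Money=185, Sex=128, Mail=127, Health=96

Can’t “loan” 4 from $\delta_{\text{Sex},0}$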

Page 28: Distributed Top-K Monitoring

28

Resolution – phase 3

The coordinator N0 contacts all the nodes Ni other than Nj, requesting the partial values for objects in $R = F \cup T$ and the border values Bi.

N0 sums the partial values and sorts them to compute the new top-k list T’.

N0 reallocates new adjustment factors for T’.

N0 sends T’ and the adjustment factors to all nodes.

Page 29: Distributed Top-K Monitoring

29

Resolution – summary

Phase 1 – Nj detects failed constraints and notifies N0, initiating resolution for $R = F \cup T$.

Phase 2 – N0 attempts to resolve the constraints using $\delta_{*,0}$, the “bank”. On success, it reallocates adjustment factors and stops.

Phase 3 – N0 requests all updated partial values for R, sorts them, computes the new top-k list, and reallocates adjustment factors.

Page 30: Distributed Top-K Monitoring

30

Resolution Performance

The number of messages is the means to measure algorithm performance.

Messages are usually small: only the resolution set $R = F \cup T$ is involved.

Two-phase resolution: initiation + reallocation, only two messages.

Three-phase resolution: initiation + query + reallocation, 1 + 2(m-1) + m = 3m - 1 messages.
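The message counts above can be expressed as a small helper. The function name and the boolean flag are illustrative; the counts themselves are taken from the slide.

```python
def resolution_messages(m, needs_phase3):
    """Messages exchanged in one resolution with m monitor nodes."""
    if not needs_phase3:
        return 2                  # initiation + reallocation to the reporting node
    initiation = 1                # reporting node -> coordinator
    queries = 2 * (m - 1)         # request and reply for each of the other nodes
    reallocation = m              # new factors sent to every monitor node
    return initiation + queries + reallocation

two_phase = resolution_messages(4, needs_phase3=False)
three_phase = resolution_messages(4, needs_phase3=True)
```

For the four monitor nodes used in two of the experiments, a full three-phase resolution costs 3 · 4 - 1 = 11 messages, compared with 2 when phase 2 succeeds.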

Page 31: Distributed Top-K Monitoring

31

Adjustment factor reallocation

Input: the top-k list T’, the partial values in the resolution set R, and the border values.

Output: new adjustment factors $\delta_{i,j}$.

Method, for each object: meet the border value constraints, calculate the leeway, and distribute the leeway evenly.

Example node: Money=50, Mail=10, Sex=5, Health=1, Love=0

F = {Mail, Sex}, R = {Money, Mail, Sex}

Vmoney = 50, Vmail = 10, Vsex = 5, B = 1

Page 32: Distributed Top-K Monitoring

32

Leeway computation

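For each object Oi in R, compute the leeway $\lambda_i$: the slack above the sum of the border values.

Define:

Sum of border values: $B = \sum_{0 \le j \le m} B_j$

Computed values: $V_i = \sum_{0 \le j \le m} V_{i,j}$

$V_{i,0} = 0$; $B_0 = \max \delta_{i,0}$ over the objects Oi not in R

If $O_i \in T'$: $\lambda_i = V_i - B + \epsilon$

Otherwise: $\lambda_i = V_i - B$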

Page 33: Distributed Top-K Monitoring

33

Leeway computation example

N1 (US): Money=100, Sex=98, Health=94, Mail=92, Love=85

N2 (Germany): Sex=30, Money=20, Mail=5, Love=5, Health=3

N3 (Japan): Money=50, Mail=10, Sex=5, Health=1, Love=0

Global list: Money=170, Sex=133, Mail=107, Health=98, Love=90

$\epsilon = 0$

B = 94 + 5 + 1 = 100

$\lambda_{\text{Money}} = 170 - B = 70$

$\lambda_{\text{Sex}} = 133 - B = 33$

$\lambda_{\text{Mail}} = 107 - B = 7$
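The leeway computation of this example can be reproduced as follows. This is a sketch: the function name is illustrative, the new top-k set {Money, Sex} is inferred from the global list, and the coordinator border value $B_0$ is taken as 0, which is consistent with B = 100 on the slide.

```python
def compute_leeway(partials, borders, new_top_k, eps, coordinator_border=0):
    """Leeway per object in R: V_i - B, plus eps for objects in the new top-k set.

    partials: one dict per monitor node, restricted to the resolution set R.
    borders:  one border value per monitor node.
    """
    total_border = coordinator_border + sum(borders)
    leeway = {}
    for obj in partials[0]:
        total = sum(node[obj] for node in partials)
        bonus = eps if obj in new_top_k else 0
        leeway[obj] = total - total_border + bonus
    return leeway

partials = [
    {"Money": 100, "Sex": 98, "Mail": 92},   # N1 (US)
    {"Money": 20, "Sex": 30, "Mail": 5},     # N2 (Germany)
    {"Money": 50, "Sex": 5, "Mail": 10},     # N3 (Japan)
]
borders = [94, 5, 1]                          # Health at N1, Love at N2, Health at N3

leeway = compute_leeway(partials, borders, new_top_k={"Money", "Sex"}, eps=0)
```

The result matches the slide: 70 for Money, 33 for Sex and 7 for Mail.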

Page 34: Distributed Top-K Monitoring

34

Leeway distribution

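Initialization, to meet the constraints: $\delta_{i,j} = B_j - V_{i,j}$

For $O_i \in T'$ and $j = 0$: $\delta_{i,0} = B_0 - \epsilon$

Leeway distribution: $\delta_{i,j} = \delta_{i,j} + \lambda_i / m$

Correctness: $V_{t,j} + \delta_{t,j} \ge V_{s,j} + \delta_{s,j}$

If $O_s \notin R$: follows from $V_{t,j} + \delta_{t,j} \ge B_j$

If $O_s \in R$: follows from $\lambda_t \ge \lambda_s$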

Page 35: Distributed Top-K Monitoring

35

Leeway distribution example

N1 (US): Money=100, Sex=98, Health=94, Mail=92, Love=85

N2 (Germany): Sex=30, Money=20, Mail=5, Love=5, Health=3

N3 (Japan): Money=50, Mail=10, Sex=5, Health=1, Love=0

Global list: Money=170, Sex=133, Mail=107, Health=98, Love=90

$\lambda_{\text{Sex}} = 33$

$\delta_{\text{Sex},1} = B_1 - V_{\text{Sex},1} + 33/3 = 94 - 98 + 11 = 7$

$\delta_{\text{Sex},2} = B_2 - V_{\text{Sex},2} + 33/3 = 5 - 30 + 11 = -14$

$\delta_{\text{Sex},3} = B_3 - V_{\text{Sex},3} + 33/3 = 1 - 5 + 11 = 7$

Page 36: Distributed Top-K Monitoring

36

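Leeway distribution example

$\lambda_{\text{Money}} = 70$ (the shares of 70/3 are rounded to 24, 23 and 23)

$\delta_{\text{Money},1} = B_1 - V_{\text{Money},1} + 70/3 = 94 - 100 + 24 = 18$

$\delta_{\text{Money},2} = B_2 - V_{\text{Money},2} + 70/3 = 5 - 20 + 23 = 8$

$\delta_{\text{Money},3} = B_3 - V_{\text{Money},3} + 70/3 = 1 - 50 + 23 = -26$

$\lambda_{\text{Mail}} = 7$ (the shares of 7/3 are rounded to 3, 2 and 2)

$\delta_{\text{Mail},1} = B_1 - V_{\text{Mail},1} + 7/3 = 94 - 92 + 3 = 5$

$\delta_{\text{Mail},2} = B_2 - V_{\text{Mail},2} + 7/3 = 5 - 5 + 2 = 2$

$\delta_{\text{Mail},3} = B_3 - V_{\text{Mail},3} + 7/3 = 1 - 10 + 2 = -7$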

Page 37: Distributed Top-K Monitoring

37

Reallocation Results

Before reallocation (partial values):

N1 (US): Money=100, Sex=98, Health=94, Mail=92, Love=85

N2 (Germany): Sex=30, Money=20, Mail=5, Love=5, Health=3

N3 (Japan): Money=50, Mail=10, Sex=5, Health=1, Love=0

Global list: Money=170, Sex=133, Mail=107, Health=98, Love=90

After reallocation (adjusted values):

N1 (US): Money=118, Sex=105, Mail=97, Health=94, Love=85

N2 (Germany): Money=28, Sex=16, Mail=7, Love=5, Health=3

N3 (Japan): Money=24, Sex=12, Mail=3, Health=1, Love=0

Global list: Money=170, Sex=133, Mail=107, Health=98, Love=90
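The reallocation results can be reproduced with a sketch of even leeway distribution among the monitor nodes. Several things here are assumptions made to mirror the example: the function name, the integer rounding rule (any remainder goes to the lowest-numbered nodes), $\epsilon = 0$, and no leeway given to the coordinator.

```python
def reallocate(partials, borders, leeway):
    """New adjustment factors: delta[i,j] = B_j - V[i,j] + (share of leeway for Oi).

    partials: one dict per monitor node, restricted to the resolution set R.
    borders:  one border value per monitor node.
    leeway:   leeway per object in R.
    """
    m = len(partials)
    deltas = [dict() for _ in range(m)]
    for obj, lam in leeway.items():
        base, remainder = divmod(lam, m)
        for j in range(m):
            share = base + (1 if j < remainder else 0)  # integer shares that sum to lam
            deltas[j][obj] = borders[j] - partials[j][obj] + share
    return deltas

partials = [
    {"Money": 100, "Sex": 98, "Mail": 92},   # N1 (US)
    {"Money": 20, "Sex": 30, "Mail": 5},     # N2 (Germany)
    {"Money": 50, "Sex": 5, "Mail": 10},     # N3 (Japan)
]
borders = [94, 5, 1]
leeway = {"Money": 70, "Sex": 33, "Mail": 7}

deltas = reallocate(partials, borders, leeway)
adjusted = [{obj: node[obj] + d[obj] for obj in node} for node, d in zip(partials, deltas)]
```

Under these assumptions the adjusted values equal those listed after reallocation (for example Money=118, Sex=105, Mail=97 at N1), and the factors of each object sum to zero.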

Page 38: Distributed Top-K Monitoring

38

Leeway distribution to N0

Leeway is also distributed to the coordinator node N0.

$\epsilon$ is added to the leeway computation for $O_t \in T'$.

The initialization of $\delta_{t,0}$ for $O_t \in T'$ is $B_0 - \epsilon$; any addition can be “loaned” to monitor nodes.

Amount distributed to N0:

Higher ($\lambda_i / 2$) – less chance of phase 3 in resolution.

Lower (0) – fewer resolutions (more leeway for the monitor nodes).

Page 39: Distributed Top-K Monitoring

39

Proportional leeway distribution

Allocate more leeway to monitor nodes updated more often

Top-k likely to change more

Good for monitor nodes that exhibit characteristic behavior, e.g. Google locations or enterprise routers.

Page 40: Distributed Top-K Monitoring

40

Experiments

Query 1: FIFA ’98. Servers at 4 locations throughout the world; top 20 Web site pages by hit statistics.

Query 2: Most loaded server in a cluster. Single value per monitor node.

Query 3: Berkeley-to-world WAN link, with 4 monitor points; top 20 destination hosts by number of outgoing TCP packets.

Page 41: Distributed Top-K Monitoring

41

Results – Query 1

Page 42: Distributed Top-K Monitoring

42

Results – Query 2

Page 43: Distributed Top-K Monitoring

43

Results – Query 3

Page 44: Distributed Top-K Monitoring

44

Analysis of results

Allowing error improves results dramatically

Leeway for N0 is the dominant factor.

Low $\epsilon$ – half of the leeway to N0. With low $\epsilon$ there is little leeway, so resolutions are bound to happen; make them less expensive.

High $\epsilon$ – no leeway to N0.

Page 45: Distributed Top-K Monitoring

45

Analysis of results

Whether even or proportional leeway distribution is better depends on the query.

Server load – proportional.

Berkeley WAN – monitor nodes are simulated, so even distribution is better.

FIFA – proportional for lower $\epsilon$, even for higher $\epsilon$.

Page 46: Distributed Top-K Monitoring

46

Comparison to alternative

Caching: the coordinator holds cached partial data values.

A monitor must send an update to the coordinator when a partial value deviates from its cached value by $\epsilon / 2m$.

The coordinator will therefore always hold values that are correct within $\epsilon / 2$.

The top-k list is always correct within $\epsilon$.
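A sketch of the caching alternative at a single monitor node. The function name, the stream format and the numbers are illustrative, and an update is sent when the deviation strictly exceeds $\epsilon / 2m$ (whether the comparison is strict is an assumption).

```python
def cache_updates(stream, eps, m):
    """Return the updates a monitor sends when values drift from the coordinator's cache.

    stream: (object, change) pairs seen at this monitor node.
    An update is sent only when |current - cached| exceeds eps / (2 * m).
    """
    threshold = eps / (2 * m)
    cached, current, sent = {}, {}, []
    for obj, change in stream:
        current[obj] = current.get(obj, 0) + change
        if abs(current[obj] - cached.get(obj, 0)) > threshold:
            cached[obj] = current[obj]          # coordinator's copy is refreshed
            sent.append((obj, current[obj]))
    return sent

# eps = 12 and m = 3 give a per-node threshold of 2.
sent = cache_updates([("a", 1), ("a", 1), ("a", 1), ("b", 5)], eps=12, m=3)
```

Here only two of the four changes trigger a message: the third change to `a` (value 3) and the change to `b` (value 5). Each cached partial value stays within $\epsilon / 2m$ of the true one, so the sum over m nodes stays within $\epsilon / 2$.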

Page 47: Distributed Top-K Monitoring

47

Results: note the log scale!

Page 48: Distributed Top-K Monitoring

48

Summary

Problem – find the top-k set within error $\epsilon$. Distributed: multiple sources. Streaming: frequent updates.

Naive approach: transmit the streams to the coordinator node. If error is allowed, transmit only when the deviation from the cached value threatens correctness.

The new approach offers dramatic improvement over the naive approach for low to medium $\epsilon$.

Page 49: Distributed Top-K Monitoring

49

Summary

Use adjustment factors to establish constraints.

A monitor node initiates resolution when a constraint is broken.

Resolution: attempt to use the coordinator node's leeway. If successful, fix the constraints by adjustment factor reallocation. Otherwise, get the partial values for the resolution set from all nodes, compute the new top-k set, and reallocate leeway to all nodes.

Reallocation: distribute leeway evenly between the monitor nodes; distribute leeway to the coordinator node when $\epsilon$ is low.

Page 50: Distributed Top-K Monitoring

50

Questions?