NetPoirot: Taking the blame game out of data center...

NetPoirot: Taking The Blame Game Out of Data Center Operations Behnaz Arzani, Selim Ciraci, Boon Thau Loo, Assaf Schuster, Geoff Outhred


Page 1: NetPoirot: Taking the blame game out of data center …conferences.sigcomm.org/sigcomm/2016/files/program/...NetPoirot: Taking The Blame Game Out of Data Center Operations Behnaz Arzani,

NetPoirot: Taking The Blame Game Out of Data Center Operations

Behnaz Arzani, Selim Ciraci, Boon Thau Loo, Assaf Schuster, Geoff Outhred


Datacenters can fail …


Failures are disruptive


Why is debugging hard?

[Diagram labels: Penn researcher, Azure VM, Azure Network, Service X, Network]


In the case of a failure…

Someone accepts responsibility
Each blames the other


A real example… Event X


Current tools are insufficient

Sherlock (SIGCOMM-07)
NetMedic (SIGCOMM-09)
NSDI-11
TRat (SIGCOMM-02)
Netprofiler (P2Psys-05)


Can we do better? (Overview)

• Introducing… NetPoirot

Fault injector

Learning Agent


The monitoring agent


What is the TCP event digest?


Why do we think this can work?


To distinguish failures…


Decision trees…

His uncertainty is X


Decision trees…


His uncertainty is X-Y
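The two "Decision trees…" slides describe a question that lowers uncertainty from X to X-Y. The slides do not say how uncertainty is measured; a common choice for decision trees is Shannon entropy, with Y being the information gain of the question. The sketch below assumes that measure. The labels and the `high_rtt` answers are made-up toy values, not data from the talk.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, answers):
    """Entropy removed (the slide's Y) by a yes/no question.

    answers[i] is the True/False answer for sample i; labels must be non-empty.
    """
    yes = [label for label, a in zip(labels, answers) if a]
    no = [label for label, a in zip(labels, answers) if not a]
    total = len(labels)
    remaining = (len(yes) / total) * entropy(yes) + (len(no) / total) * entropy(no)
    return entropy(labels) - remaining


# Toy example: the question separates the two classes perfectly.
labels = ["network", "network", "client", "client"]
high_rtt = [True, True, False, False]

x = entropy(labels)                     # uncertainty before asking (X)
y = information_gain(labels, high_rtt)  # uncertainty removed by asking (Y)
print(f"X = {x:.2f} bits, Y = {y:.2f} bits, X-Y = {x - y:.2f} bits")
```

A tree learner built on this idea would pick, at each node, the question with the largest Y.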


Decision trees alone are not enough


[Figure: plot with axes Feature 1 and Feature 2]


[Figure: plot with axes Feature 1 and Feature 2; regions labeled "Easiest to classify" and "Hardest to classify"]


What we do to deal with this

[Figure: plot with axes Feature 1 and Feature 2]


Upper portion of an example tree…


Mean of max congestion window

Min of the last congestion window

50th percentile of number of triple duplicate ACKs

50th percentile of connection duration

Max of the number of triple duplicate ACKs

95th percentile of the max congestion window
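The tree nodes listed above are all summary statistics (mean, min, max, percentiles) taken over per-connection TCP metrics. A minimal sketch of how such features could be computed is shown here. The record field names (`max_cwnd`, `triple_dup_acks`), the nearest-rank percentile definition, and the sample values are assumptions for illustration; the slides do not specify any of them.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, with 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(flows, metric):
    """Summarise one per-connection metric across all flows in a window."""
    vals = [flow[metric] for flow in flows]
    return {
        f"mean_{metric}": mean(vals),
        f"min_{metric}": min(vals),
        f"max_{metric}": max(vals),
        f"p50_{metric}": percentile(vals, 50),
        f"p95_{metric}": percentile(vals, 95),
    }


# Hypothetical per-connection records for one measurement window.
flows = [
    {"max_cwnd": 10, "triple_dup_acks": 0},
    {"max_cwnd": 20, "triple_dup_acks": 1},
    {"max_cwnd": 30, "triple_dup_acks": 0},
    {"max_cwnd": 40, "triple_dup_acks": 3},
]

features = {"num_flows": len(flows)}
for metric in ("max_cwnd", "triple_dup_acks"):
    features.update(aggregate(flows, metric))

print(features["mean_max_cwnd"], features["p95_max_cwnd"])
```

Each window of connections then becomes one feature vector that a tree can split on, for example on `mean_max_cwnd` or `p50_triple_dup_acks`.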


What we do to deal with this

[Figure: plot with axes Feature 1 and Feature 2]


Upper portion of an example tree…


50th percentile of the max RTT

Number of flows

50th percentile of amount of data received

95th percentile of the number of timeouts


Decision trees alone are not enough

[Figure: plot with axes Feature 1 and Feature 2]


The upper portion of an example tree…


Mean time spent in zero window probing

95th percentile of the ratio of number of bytes posted to received

Number of flows

Number of flows

95th percentile of connection durations

Minimum of the number of bytes received


Is it a network failure?

Is it a server problem?

Is it a client side problem?
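One way to read the three questions above is as a sequence of yes/no classifiers asked in turn, where the first "yes" assigns the blame. That reading is an assumption, not something the slides state. In the sketch below the lambdas are hypothetical stand-ins for trained trees, and the pairing of features with questions and every threshold are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Stage = Tuple[str, Callable[[Features], bool]]


def diagnose(features: Features, stages: List[Stage], default: str = "normal") -> str:
    """Ask each yes/no question in order; the first 'yes' gives the label."""
    for label, says_yes in stages:
        if says_yes(features):
            return label
    return default


# Hypothetical stand-ins for trained classifiers; thresholds are made up.
stages: List[Stage] = [
    ("network", lambda f: f["p95_timeouts"] > 3),
    ("server", lambda f: f["mean_zero_window_probe_time"] > 0.5),
    ("client", lambda f: f["p50_bytes_received"] < 1000),
]

sample = {
    "p95_timeouts": 0,
    "mean_zero_window_probe_time": 0.9,
    "p50_bytes_received": 5000,
}
verdict = diagnose(sample, stages)
print(verdict)
```

If no stage answers "yes", the sample falls through to the default label, here called `normal`.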


Other details


If throughput < x: Open more connections

If throughput < x: Send more data on the same connection


What did we learn from all this?


Evaluation


How did we get labeled data?


Worst-case application


What if we haven’t seen the failure before?


Performance on real applications


General label   Normal   Client   Network
Precision       97.78%   99.7%    100%
Recall          99.68%   98.25%   99.37%

YouTube

Event X
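For readers unfamiliar with the two metrics reported on this slide: precision for a label is the fraction of samples predicted as that label that truly carry it, and recall is the fraction of samples truly carrying the label that were predicted as such. The sketch below computes both per label. The `truth` and `predicted` lists are made-up toy values and are unrelated to the numbers on the slide.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def per_label_scores(truth, predicted, label):
    """Return (precision, recall) for one label, treating it as the positive class."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return precision(tp, fp), recall(tp, fn)


# Toy ground truth and predictions, for illustration only.
truth = ["normal", "client", "network", "network", "client"]
predicted = ["normal", "client", "network", "client", "client"]

client_precision, client_recall = per_label_scores(truth, predicted, "client")
network_precision, network_recall = per_label_scores(truth, predicted, "network")
print(f"client: precision={client_precision:.2f} recall={client_recall:.2f}")
print(f"network: precision={network_precision:.2f} recall={network_recall:.2f}")
```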


Things we did not talk about


What’s next?


Conclusion
