Fast and Cautious: Leveraging Multi-path Diversity for Transport Loss Recovery in Data Centers
Guo Chen, Yuanwei Lu, Yuan Meng, Bojie Li, Kun Tan, Dan Pei, Peng Cheng, Layong (Larry) Luo, Yongqiang Xiong, Xiaoliang Wang, and Youjian Zhao


Motivation
- Services care about the tail flow completion time (tail FCT)
  - A large number of flows is generated in each operation
  - Overall performance is governed by the last flows to complete
[Figure: a large-scale web application hosted in a Data Center Network (DCN); many AppLogic components respond to a single user request]

16/6/25 2

Motivation (continued)
- But packet loss hurts tail FCT
  - Real case in a Microsoft Azure DCN
  - A spine switch with a 2% random drop rate increased the 99th-percentile latency of all users
[Figure: DCN tail latency visualization [Pingmesh (SIGCOMM’15)]. Panels: (a) Normal, (b) Spine failure]

Outline
- Motivation
- Packet Loss in DCN
- Impact of Packet Loss
- Challenge for Loss Recovery
- FUSO Design
- Evaluation
- Summary

Packet Loss in DCN
- Loss characteristics
  - Measured in a Microsoft production DCN during Dec. 1st-5th, 2015
[Figure: loss rate and location distribution of lossy links (loss rate > 1%). Annotations: mean loss rate 4%; 78% above ToR; similar across the 5 days]
1) Loss happens frequently (the overall loss rate is low)
2) Most losses happen in the network instead of at the edge

Packet Loss in DCN
- Reasons causing loss
  - Congestion loss
    - Uneven load-balance
    - Incast
    - Bursty; transient
    - Greatly mitigated (e.g., 1% -> 0.01%) [Jupiter Rising SIGCOMM’15]
  - Failure loss
    - Silent random drop
    - Packet black-hole
    - Complex; hard to detect
    - Common & huge impact on performance [Pingmesh SIGCOMM’15]

Outline
- Impact of Packet Loss
  - Why does loss hurt the tail?
  - How badly does loss hurt?

How TCP Handles Loss?
- Fast recovery
  - Wait for a certain number of duplicate ACKs (DACKs) to detect the loss and retransmit
[Figure: sender/receiver timeline with labels 1-2, Ack 1-2, 3-6, DupAck 3, Retran 3; each exchange takes one RTT]


How TCP Handles Loss? (continued)
- Timeout (RTO)
  - If not enough DACKs return, retransmit after a timeout
  - RTO >> RTT, e.g. RTO = 5 ms, RTT < 100 us [Pingmesh (SIGCOMM’15), DCTCP (SIGCOMM’10)]
[Figure: sender/receiver timeline with labels 1-2, Ack 1-2, 3-6, Retran 3; the first exchange takes one RTT, and the retransmission of 3 happens only after a timeout]
- Encountering one RTO dramatically increases the FCT
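To make the cost of a single timeout concrete, the sketch below compares a rough flow completion time with and without one RTO. It uses the slide's example values (RTO = 5 ms, and RTT = 100 us as an upper bound). The number of round trips a flow needs is an assumed illustrative value, not a figure from the talk.

```python
RTT_US = 100      # slide example: RTT < 100 us (upper bound used here)
RTO_US = 5_000    # slide example: RTO = 5 ms


def fct_us(rtts_needed: int, timeouts: int = 0) -> int:
    """Rough flow completion time: round trips plus any RTO stalls."""
    return rtts_needed * RTT_US + timeouts * RTO_US


# Assumption for illustration: a short flow that finishes in 3 round trips.
clean = fct_us(3)
with_rto = fct_us(3, timeouts=1)
print(f"no loss: {clean} us, one RTO: {with_rto} us, "
      f"slowdown: {with_rto / clean:.1f}x")
```

With these values a single RTO costs as much as 50 round trips, which is why one timeout dominates the completion time of a short flow.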


Loss Incurs Timeout
- A little loss causes enough timeouts to hurt the tail FCT
[Figure: timeout probability of flows of different sizes passing a path with different packet loss rates. Curves: 10KB (testbed), 100KB (testbed), 10KB (analysis), 100KB (analysis). Annotations: 99th FCT > RTO; 3% loss -> ~10% timeout]
1. 1% loss -> more than 1% of flows time out
2. Larger flows (e.g. 100KB)
   a. The timeout ratio grows sharply when the loss rate > 1%
Goal: avoid RTO
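The relationship between loss rate, flow size, and timeouts can be explored with a back-of-envelope model. The sketch below is not the analysis behind the slide's curves. It is a simplified stand-in with loudly stated assumptions: packets are lost independently, a loss among the last three packets cannot gather three duplicate ACKs and so times out, and any earlier packet times out only if both the original and its fast retransmission are lost. The 1460-byte payload and the function name are also assumptions.

```python
import math

MSS_BYTES = 1460  # assumed payload bytes per packet


def timeout_probability(flow_bytes: int, loss_rate: float,
                        dupack_threshold: int = 3) -> float:
    """Probability that a flow hits at least one RTO under the toy model."""
    n = max(1, math.ceil(flow_bytes / MSS_BYTES))
    tail = min(dupack_threshold, n)
    # Tail packets: any loss here lacks enough later packets for 3 DACKs.
    p_tail_ok = (1 - loss_rate) ** tail
    # Earlier packets: a timeout needs the packet AND its retransmission lost.
    p_body_ok = (1 - loss_rate ** 2) ** (n - tail)
    return 1 - p_tail_ok * p_body_ok


for loss in (0.01, 0.03):
    for size in (10_000, 100_000):
        prob = timeout_probability(size, loss)
        print(f"loss={loss:.0%} size={size // 1000}KB timeout~{prob:.1%}")
```

Even this crude model shows the qualitative points on the slide: at 1% loss more than 1% of flows time out, and larger flows are hit harder as the loss rate rises.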


Challenge for TCP Loss Recovery
- Prior works add aggressiveness to congestion control to do loss recovery before timeout (RTO)
  - Tail Loss Probe (TLP) [SIGCOMM’13, RFC5827]
    - Transmit one probe after 2 RTTs
  - Instant Recovery (TCP-IR) [SIGCOMM’13, RFC5827]
    - Generate an FEC packet for every group of packets (up to 16)
    - FEC packets also act as probes, delayed 1/4 RTT before being sent
  - Proactive/RepFlow [SIGCOMM’13, INFOCOM’14]
    - Duplicate every packet/flow
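The FEC idea mentioned for TCP-IR can be illustrated with a single XOR parity packet per group. The slide only says that one FEC packet is generated for every group of up to 16 packets. The XOR coding, the function names, and the equal-length payloads below are assumptions for illustration, not a description of the actual TCP-IR implementation.

```python
from functools import reduce
from typing import List, Optional, Tuple

MAX_GROUP = 16  # from the slide: one FEC packet per group of up to 16 packets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length payloads."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(group: List[bytes]) -> bytes:
    """Build one parity packet covering the whole group."""
    if not 1 <= len(group) <= MAX_GROUP:
        raise ValueError("group must hold 1..16 packets")
    size = max(len(p) for p in group)
    # Pad to a common length; a real scheme would also carry the lengths.
    return reduce(xor_bytes, (p.ljust(size, b"\x00") for p in group))


def recover_missing(received: List[Optional[bytes]],
                    parity: bytes) -> Tuple[int, bytes]:
    """Rebuild the single missing packet (marked None) from the parity."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one loss per group")
    acc = parity
    for p in received:
        if p is not None:
            acc = xor_bytes(acc, p.ljust(len(parity), b"\x00"))
    return missing[0], acc


group = [b"pkt-0001", b"pkt-0002", b"pkt-0003"]
parity = make_parity(group)
index, payload = recover_missing([group[0], None, group[2]], parity)
print(index, payload)
```

A single parity packet repairs only one loss per group, which is one reason such schemes struggle with the bursty, consecutive losses typical of congestion.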

Challenge for TCP Loss Recovery
- How long to wait before sending recovery packets?
  - For congestion loss
    - Should delay long enough to avoid worsening the congestion
    - Bursty: leads to multiple consecutive losses [Incast (WREN’09), DCTCP (SIGCOMM’10)]

Challenge for TCP Loss Recovery (continued)
- How long to wait before sending recovery packets?
  - For failure loss such as random drop
    - Should recover as fast as possible; otherwise the wait itself already increases the FCT
    - Waiting 2 RTTs is too costly [TLP SIGCOMM’13, RFC5827]
    - Accurate, high-precision RTT measurement is challenging

Brief Summary
- Loss easily incurs timeouts, which hurt the tail
- To prevent timeouts, prior works add fixed aggressiveness to recover loss before timeout
- Hard to adapt to various loss conditions
  - Should be fast for failure loss
  - Should be cautious for congestion loss
Question: how to recover from loss as fast as possible under various loss conditions, without causing congestion?


FUSO: Fast Multi-path Loss Recovery
- Utilize the “good” paths to proactively conduct loss recovery for “bad” paths
  - Leverages path diversity (multiple paths; only a few encounter loss)
- Fast and Cautious
  - Fast
    - Proactive (immediate) recovery for potential packet loss, utilizing spare transmission opportunities
  - Cautious
    - Strictly follows congestion control without adding aggressiveness

Multi-path Transport Background
[Figure: a sender and a receiver connected by three sub-flows (SF1, SF2, SF3). Multi-path congestion control maintains CWNDtotal and per-sub-flow windows CWND1, CWND2, CWND3; data distribution spreads data over the sub-flows]
- Sub-flows: implicitly/explicitly mapped to physical paths

FUSO
[Animated figure, slides 21-43: a sender and a receiver connected by three sub-flows (SF1, SF2, SF3) with windows CWND1, CWND2, CWND3 and CWNDtotal. The sender distributes packets P1-P5 over the sub-flows. P2 is lost, while ACKs return for P1, P3, P4, and P5. The sender then has spare CWND and no new data, so it performs proactive loss recovery: P2, still un-ACKed on the “worst” sub-flow, is retransmitted over the “best” sub-flow. Done!]

Standard MPTCP
[Figure: the same sender, receiver, and sub-flows (SF1, SF2, SF3) with packets P1-P5 under standard MPTCP. The lost packet is retransmitted only after an RTO]

FUSO: Path Selection
- “Worst” sub-flow
  - Has un-ACKed data
  - Most likely to have loss
- “Best” sub-flow
  - Has spare CWND
  - Least likely to have loss
[Figure: sender with sub-flows SF1, SF2, SF3 (CWND1, CWND2, CWND3, CWNDtotal) along an axis labeled “Possibility of encountering loss”. The lost packet P2 is part of the un-ACKed data of the “worst” sub-flow; the “best” sub-flow has spare CWND]

FUSO in 1 Slide
- If (spare CWND) && (no new data)
  - Utilize the transmission opportunity to proactively recover
  - Use “good” paths to help “bad” paths
- Multi-path diversity offers many transmission opportunities
  - “Good” paths have spare window
[Figure: sender and receiver with Sub-Flow 1, Sub-Flow 2, ..., Sub-Flow N. App data passes through multipath congestion control. Sub-flows hold un-ACKed data or have spare window; recovery packets (R) for un-ACKed data are sent to the best sub-flow]
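The rule on this slide can be sketched as a small sender-side decision function. Only the trigger condition (spare CWND and no new data) and the worst/best criteria come from the slides. The `loss_score` field, the choice of the oldest un-ACKed packet, and the requirement that the best sub-flow differ from the worst one are hypothetical choices for illustration, not the kernel implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SubFlow:
    name: str
    cwnd: int                  # congestion window, in packets
    loss_score: float          # hypothetical estimate: higher = more likely lossy
    unacked: List[int] = field(default_factory=list)  # in-flight packet ids

    @property
    def spare_cwnd(self) -> int:
        """Packets this sub-flow may still send without exceeding its CWND."""
        return self.cwnd - len(self.unacked)


def pick_recovery(subflows: List[SubFlow],
                  has_new_data: bool) -> Optional[Tuple[str, str, int]]:
    """Return (worst, best, packet) for a proactive recovery, or None."""
    if has_new_data:
        return None  # new data is sent first; recovery uses only spare chances
    with_unacked = [sf for sf in subflows if sf.unacked]
    if not with_unacked:
        return None
    # "Worst": has un-ACKed data and is most likely to have loss.
    worst = max(with_unacked, key=lambda sf: sf.loss_score)
    # "Best": has spare CWND and is least likely to have loss.
    # Cautious: only sub-flows with spare CWND qualify, so CWND is never exceeded.
    candidates = [sf for sf in subflows
                  if sf.spare_cwnd > 0 and sf is not worst]
    if not candidates:
        return None
    best = min(candidates, key=lambda sf: sf.loss_score)
    return worst.name, best.name, worst.unacked[0]  # oldest un-ACKed packet


flows = [
    SubFlow("SF1", cwnd=2, loss_score=0.1, unacked=[1]),
    SubFlow("SF2", cwnd=2, loss_score=0.9, unacked=[2]),
    SubFlow("SF3", cwnd=2, loss_score=0.0),
]
print(pick_recovery(flows, has_new_data=False))
```

Because the recovery packet is charged against the best sub-flow's existing window, this style of recovery adds no traffic beyond what congestion control already permits.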

FUSO Implementation
- Implemented in Linux kernel; ~900 lines of code
- https://github.com/1989chenguo/FUSO


Testbed Settings
- Network
  - 1Gbps fabric & 1Gbps hosts; ECMP routing; ECN enabled
- TCP
  - Init_cwnd=16; min_RTO=5ms

Testbed Results
- Failure loss
  - Random drop; loss rate: 0.125%-4%; latency-sensitive flows
[Figure: two panels, 99th FCT and % of flows encountering timeout; an arrow marks the better direction]
- Fast
  - Reduces the 99th FCT by up to ~82.3%
  - Reduces the number of timeout flows by up to 100%

Testbed Results
- Congestion loss
  - Incast
[Figure: results plotted against the number of concurrent responses; an arrow marks the better direction]
- Cautious
  - Performs the best

Testbed Results
- Failure loss & congestion loss
  - From failure-loss-dominated to congestion-loss-dominated
  - Loss rate: 2%; latency-sensitive flows with background long flows
[Figure: results across the mix of failure loss and congestion loss; an arrow marks the better direction]
- Adapts to various loss conditions

Larger-scale Simulations
- Simulation settings
  - NS2 simulator; 3-layer, 4-port FatTree fabric
  - 40Gbps fabric, 10Gbps hosts; 64 hosts, 20 switches
  - Empirical failure generation
[Figure: simulated topology with latency-sensitive flows, background long flows, and random failure]
- Results
  - Reduces the average FCT by up to ~60.3%
  - Reduces the 99th FCT by up to ~87.4%


Summary
- Loss hurts tail latency
  - Loss is not uncommon
  - A little loss leads to enough timeouts to hurt the tail
- Challenges for loss recovery
  - How to accelerate loss recovery under various loss conditions without causing congestion?
- Philosophy of FUSO
  - Being fast and being cautious are equally important
  - Fast: proactive loss recovery utilizing spare transmission opportunities, leveraging multipath diversity
  - Cautious: strictly follows congestion control without adding aggressiveness


Q&A?

Thanks