Performance Engineering: E2EpiPEs and FastTCP
Internet2 Member Meeting - Indianapolis / World Telecom 2003 - Geneva
October 15, 2003
[email protected]
Agenda
- High TCP performance over wide area networks: TCP at Gbps speed, MTU bias, RTT bias, TCP fairness
- How to use 100% of the link capacity with TCP Reno; the impact of network buffers
- New Internet2 Land Speed Record
- Single TCP stream performance under periodic losses
Effect of packet loss
[Figure: Bandwidth utilization (%) vs. packet loss frequency (%) on a link with 1 Gbps of available bandwidth, comparing WAN (RTT = 120 ms) and LAN (RTT = 0.04 ms). At a loss rate of 0.01%, LAN bandwidth utilization is 99% while WAN bandwidth utilization is only 1.2%.]
TCP throughput is much more sensitive to packet loss in WANs than in LANs.
TCP's congestion control algorithm (AIMD) is not suited to gigabit networks:
- Poor, limited feedback mechanisms
- The effect of packet loss is disastrous
TCP is inefficient in high bandwidth*delay networks. The future performance of computational grids looks bad if we continue to rely on the widely deployed TCP Reno.
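The WAN/LAN gap in the figure above follows from the well-known Mathis et al. steady-state model for TCP Reno, throughput <= 1.22 * MSS / (RTT * sqrt(p)). As a hedged sketch (the link speed, MSS and RTTs below are the slide's parameters, not new measurements):

```python
import math

def mathis_throughput(mss_bytes, rtt_s, loss_rate, capacity_bps):
    """Mathis et al. steady-state TCP Reno throughput bound (bit/s),
    capped at the link capacity."""
    bw = 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))
    return min(bw, capacity_bps)

CAPACITY = 1e9   # 1 Gbps of available bandwidth, as in the figure
MSS = 1460       # bytes
LOSS = 1e-4      # 0.01% packet loss

lan = mathis_throughput(MSS, 0.04e-3, LOSS, CAPACITY)  # RTT = 0.04 ms
wan = mathis_throughput(MSS, 120e-3, LOSS, CAPACITY)   # RTT = 120 ms

print(f"LAN utilization: {100 * lan / CAPACITY:.0f}%")  # link-limited: 100%
print(f"WAN utilization: {100 * wan / CAPACITY:.1f}%")  # loss-limited: 1.2%
```

At the same 0.01% loss rate, the LAN flow is limited only by the link while the WAN flow is throttled to about 12 Mbps by its 120 ms RTT, matching the 99% vs. 1.2% utilizations quoted above.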
Responsiveness (I)
The responsiveness measures how quickly we go back to using the network link at full capacity after experiencing a loss, assuming that the congestion window size is equal to the bandwidth-delay product when the packet is lost:

    responsiveness = C * RTT^2 / (2 * MSS)

where C is the capacity of the link.
[Figure: TCP responsiveness - recovery time (s, 0-18000) vs. RTT (ms, 0-200) for C = 622 Mbit/s, C = 2.5 Gbit/s and C = 10 Gbit/s.]
Responsiveness (II)
Case                               | C        | RTT (ms)       | MSS (bytes)        | Responsiveness
Typical LAN today                  | 1 Gb/s   | 2 (worst case) | 1460               | 96 ms
WAN Geneva <-> Chicago             | 1 Gb/s   | 120            | 1460               | 10 min
WAN Geneva <-> Sunnyvale           | 1 Gb/s   | 180            | 1460               | 23 min
WAN Geneva <-> Tokyo               | 1 Gb/s   | 300            | 1460               | 1 h 04 min
WAN Geneva <-> Sunnyvale           | 2.5 Gb/s | 180            | 1460               | 58 min
Future WAN CERN <-> Starlight      | 10 Gb/s  | 120            | 1460               | 1 h 32 min
Future WAN link CERN <-> Starlight | 10 Gb/s  | 120            | 8960 (jumbo frame) | 15 min
The Linux kernel 2.4.x implements delayed acknowledgments. Due to delayed acknowledgments, the responsiveness is multiplied by two; therefore, the values above have to be multiplied by two!
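The table entries can be reproduced from the formula in Responsiveness (I); a minimal sketch using consistent units (C in bit/s, MSS in bytes), with the delayed-ACK doubling noted above as an option. A couple of table entries appear to use slightly different rounding, but the Geneva <-> Chicago case comes out almost exactly:

```python
def responsiveness(capacity_bps, rtt_s, mss_bytes, delayed_ack=False):
    """Time (s) to grow cwnd back from BDP/2 to BDP after a loss:
    cwnd/2 packets recovered at one MSS per RTT (two RTTs per MSS
    with delayed ACKs)."""
    r = capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)
    return 2 * r if delayed_ack else r

# WAN Geneva <-> Chicago: 1 Gb/s, RTT 120 ms, MSS 1460 bytes
print(responsiveness(1e9, 0.120, 1460) / 60)   # ~10.3 minutes
# Same path with jumbo frames (MSS 8960 bytes): recovery is ~6x faster
print(responsiveness(1e9, 0.120, 8960) / 60)
```

The quadratic dependence on RTT is what makes the WAN rows so much worse than the LAN row, and the linear dependence on 1/MSS is why the jumbo-frame row recovers in minutes rather than hours.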
Single TCP stream
TCP connection between Geneva and Chicago: C = 1 Gbit/s; MSS = 1,460 bytes; RTT = 120 ms
- Time to increase the throughput from 100 Mbps to 900 Mbps: 35 minutes
- Loss occurs when the bandwidth reaches the pipe size
- 75% bandwidth utilization (assuming no buffering)
- Cwnd < BDP: throughput < bandwidth, RTT constant, throughput = Cwnd / RTT
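The ~35-minute ramp can be estimated from additive increase alone: congestion avoidance adds one MSS per RTT, and the delayed-ACK behavior noted earlier halves that growth rate. A hedged back-of-the-envelope sketch (the helper name is mine):

```python
def ramp_time(from_bps, to_bps, rtt_s, mss_bytes, mss_per_rtt=0.5):
    """Time for TCP Reno congestion avoidance to grow throughput from
    `from_bps` to `to_bps`: cwnd grows by `mss_per_rtt` MSS each RTT
    (0.5 models the Linux 2.4 delayed-ACK halving)."""
    delta_cwnd_bytes = (to_bps - from_bps) * rtt_s / 8
    rtts_needed = delta_cwnd_bytes / (mss_bytes * mss_per_rtt)
    return rtts_needed * rtt_s

# Geneva -> Chicago: RTT 120 ms, MSS 1460 bytes, 100 Mbps -> 900 Mbps
t = ramp_time(100e6, 900e6, 0.120, 1460)
print(f"{t / 60:.0f} minutes")   # roughly 33 minutes, close to the 35 above
```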
Measurements with Different MTUs
TCP connection between Geneva and Chicago (CERN, GVA <-> Starlight, Chi): C = 1 Gbit/s; RTT = 120 ms
- In both cases: 75% link utilization
- A large MTU accelerates the growth of the window
- The time to recover from a packet loss decreases with a large MTU
- A larger MTU reduces the per-frame overhead (saves CPU cycles, reduces the number of packets)
MTU and Fairness
- Two TCP streams share a 1 Gbps bottleneck
- RTT = 117 ms
- MTU = 1500 bytes: average throughput over a period of 4000 s = 50 Mb/s
- MTU = 9000 bytes: average throughput over a period of 4000 s = 698 Mb/s
- A factor of 14!
- Connections with a large MTU quickly increase their rate and grab most of the available bandwidth
[Diagram: testbed topology - Host #1 and Host #2 at CERN (GVA) connect through 1 GE links and a GbE switch, across a POS 2.5 Gbps wide-area link, to Host #1 and Host #2 at Starlight (Chi) / Sunnyvale; the 1 GE links form the 1 Gbps bottleneck.]
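The MTU bias can be illustrated with a toy AIMD model (not the testbed above): two Reno flows each grow by their own MSS every RTT, and both halve their windows on a shared loss when the bottleneck pipe fills. This is a simplified sketch under an assumed synchronized-loss model, with my own function name:

```python
def aimd_share(mss_a, mss_b, pipe_bytes, rounds=200_000):
    """Toy synchronized-loss AIMD: each flow adds its own MSS per RTT;
    when the combined windows exceed the bottleneck pipe, both halve.
    Returns the ratio of the two flows' accumulated throughput."""
    a = b = float(min(mss_a, mss_b))   # both start with a small window
    total_a = total_b = 0.0
    for _ in range(rounds):
        a += mss_a
        b += mss_b
        if a + b > pipe_bytes:         # shared loss event
            a /= 2
            b /= 2
        total_a += a
        total_b += b
    return total_a / total_b

# 1 Gbps bottleneck, RTT = 117 ms -> a pipe of about 14.6 MB per RTT
ratio = aimd_share(9000, 1500, 14.6e6)
print(f"large-MTU flow gets {ratio:.1f}x the small-MTU flow's throughput")
```

Under perfectly synchronized losses this model converges to the MSS ratio (6x here); the measured factor of 14 is even larger, presumably because real losses are not perfectly shared between the two streams.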
RTT and Fairness
[Diagram: testbed topology - Host #1 and Host #2 at CERN (GVA) connect through 1 GE links and a GbE switch, across a POS 2.5 Gb/s link toward Starlight and a POS 10 Gb/s / 10GE path toward Sunnyvale; the 1 GE links form the 1 Gbps bottleneck.]
- Two TCP streams share a 1 Gbps bottleneck
- CERN <-> Sunnyvale: RTT = 181 ms; average throughput over a period of 7000 s = 202 Mb/s
- CERN <-> Starlight: RTT = 117 ms; average throughput over a period of 7000 s = 514 Mb/s
- MTU = 9000 bytes
- The connection with the smaller RTT quickly increases its rate and grabs most of the available bandwidth
[Figure: Throughput (Mbps, 0-1000) over time (0-7000 s) of two streams with different RTTs sharing a 1 Gbps bottleneck: the RTT = 181 ms and RTT = 117 ms streams, each with its average over the life of the connection.]
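Because AIMD grows the window by one MSS per RTT, a flow's steady-state rate scales roughly as 1/RTT^2 when competing flows see the same loss events. A quick check of that prediction against the measured averages quoted above (202 and 514 Mb/s are the slide's numbers):

```python
# AIMD RTT bias: predicted throughput ratio ~ (RTT_long / RTT_short)^2
predicted = (181 / 117) ** 2   # Sunnyvale vs. Starlight RTTs
measured = 514 / 202           # ratio of the measured 7000 s averages
print(f"predicted bias: {predicted:.2f}x, measured bias: {measured:.2f}x")
```

The prediction (~2.4x) is close to the measured imbalance (~2.5x), consistent with the RTT bias being an intrinsic property of Reno's control law rather than an artifact of the testbed.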
How to use 100% of the bandwidth?
Bandwidth-delay product
- Single TCP stream GVA - CHI: MSS = 8960 bytes; throughput = 980 Mbps
- Cwnd > BDP => throughput = bandwidth
- The RTT increases
- Extremely large buffer at the bottleneck
- Network buffers have an important impact on performance
- Do buffers have to be dimensioned to scale with the BDP?
- Why not use the end-to-end delay as the congestion indication?
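The window and buffer sizes implied above are straightforward to compute; a small sketch using the slide's GVA - CHI parameters (the buffer rule of thumb is the classic one-BDP sizing, not a claim from the slide):

```python
def bdp_bytes(capacity_bps, rtt_s):
    """Bandwidth-delay product: the cwnd needed to keep the pipe full."""
    return capacity_bps * rtt_s / 8

bdp = bdp_bytes(1e9, 0.120)          # GVA - CHI: 1 Gb/s, RTT = 120 ms
print(f"BDP = {bdp / 1e6:.0f} MB")   # 15 MB must be kept in flight
# With jumbo frames (MSS = 8960 bytes) that is ~1674 packets in flight.
# The classic rule of thumb also sizes the bottleneck buffer at one BDP,
# so the link stays full while cwnd recovers from a halving.
print(f"packets in flight: {bdp / 8960:.0f}")
```

This is why filling the pipe requires both large socket buffers at the endpoints and, for Reno, a comparably large buffer at the bottleneck.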
Single stream TCP performance
Date   | From Geneva to | Size of transfer | Duration (s) | RTT (ms) | MTU (bytes) | IP version | Throughput | Record / Award
Feb 27 | Sunnyvale      | 1.1 TByte        | 3700         | 180      | 9000        | IPv4       | 2.38 Gbps  | Internet2 LSR, CENIC award, Guinness World Record
May 27 | Tokyo          | 65.1 GByte       | 600          | 277      | 1500        | IPv4       | 931 Mbps   |
May 2  | Chicago        | 385 GByte        | 3600         | 120      | 1500        | IPv6       | 919 Mbps   |
May 2  | Chicago        | 412 GByte        | 3600         | 120      | 9000        | IPv6       | 983 Mbps   | Internet2 LSR
NEW submission (Oct 11): 5.65 Gbps from Geneva to Los Angeles across the LHCnet, Starlight, Abilene and CENIC.
Early 10 Gb/s, 10,000 km TCP Testing
- Single TCP stream at 5.65 Gbps
- Transferring a full CD in less than 1 s
- Uncongested network
- No packet loss during the transfer
- Probably qualifies as a new Internet2 LSR
- Monitoring of the Abilene traffic in LA
Conclusion
- The future performance of computational grids looks bad if we continue to rely on the widely deployed TCP Reno.
- How to define fairness? Taking into account the MTU; taking into account the RTT.
- Larger packet sizes (jumbogram: payload larger than 64 KB). Is the standard MTU the largest bottleneck? New Intel 10GE cards: MTU = 16 KB. J. Cain (Cisco): "It's very difficult to build switches to switch large packets such as jumbograms."
- Our vision of the network: "The network, once viewed as an obstacle for virtual collaborations and distributed computing in grids, can now start to be viewed as a catalyst instead. Grid nodes distributed around the world will simply become depots for dropping off information for computation or storage, and the network will become the fundamental fabric for tomorrow's computational grids and virtual supercomputers."