Performance Engineering: E2EpiPEs and FastTCP
Internet2 Member Meeting - Indianapolis / World Telecom 2003 - Geneva
October 15, 2003
[email protected]
Agenda
- High TCP performance over wide area networks: TCP at Gbps speed, MTU bias, RTT bias, TCP fairness
- How to use 100% of the link capacity with TCP Reno; the impact of network buffers
- New Internet2 Land Speed Record
- Single TCP stream performance under periodic losses
Effect of packet loss
[Figure: Bandwidth utilization (%) vs. packet loss frequency (%) on a link with 1 Gbps of available bandwidth, comparing WAN (RTT = 120 ms) and LAN (RTT = 0.04 ms). At a loss rate of 0.01%, LAN bandwidth utilization is 99% while WAN bandwidth utilization is only 1.2%.]
TCP throughput is much more sensitive to packet loss in WANs than in LANs.
TCP's congestion control algorithm (AIMD) is not suited to gigabit networks:
- Poor, limited feedback mechanisms
- The effect of packet loss is disastrous
TCP is inefficient in high bandwidth*delay networks. The future performance of computational grids looks bad if we continue to rely on the widely deployed TCP Reno.
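The WAN/LAN gap in the figure above follows from the well-known Mathis et al. steady-state model for TCP Reno, throughput <= 1.22 * MSS / (RTT * sqrt(p)). As a hedged sketch (the link speed, MSS and RTTs below are the slide's parameters, not new measurements):

```python
import math

def mathis_throughput(mss_bytes, rtt_s, loss_rate, capacity_bps):
    """Mathis et al. steady-state TCP Reno throughput bound (bit/s),
    capped at the link capacity."""
    bw = 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))
    return min(bw, capacity_bps)

CAPACITY = 1e9   # 1 Gbps of available bandwidth, as in the figure
MSS = 1460       # bytes
LOSS = 1e-4      # 0.01% packet loss

lan = mathis_throughput(MSS, 0.04e-3, LOSS, CAPACITY)  # RTT = 0.04 ms
wan = mathis_throughput(MSS, 120e-3, LOSS, CAPACITY)   # RTT = 120 ms

print(f"LAN utilization: {100 * lan / CAPACITY:.0f}%")  # link-limited: 100%
print(f"WAN utilization: {100 * wan / CAPACITY:.1f}%")  # loss-limited: 1.2%
```

At the same 0.01% loss rate, the LAN flow is limited only by the link while the WAN flow is throttled to about 12 Mbps by its 120 ms RTT, matching the 99% vs. 1.2% utilizations quoted above.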
Responsiveness (I)
The responsiveness measures how quickly we go back to using the network link at full capacity after experiencing a loss, assuming that the congestion window size is equal to the bandwidth-delay product when the packet is lost:

    responsiveness = C * RTT^2 / (2 * MSS)

where C is the capacity of the link.
[Figure: TCP responsiveness - recovery time (s, 0-18000) vs. RTT (ms, 0-200) for C = 622 Mbit/s, C = 2.5 Gbit/s and C = 10 Gbit/s.]
Responsiveness (II)
Case                               | C        | RTT (ms)       | MSS (bytes)        | Responsiveness
Typical LAN today                  | 1 Gb/s   | 2 (worst case) | 1460               | 96 ms
WAN Geneva <-> Chicago             | 1 Gb/s   | 120            | 1460               | 10 min
WAN Geneva <-> Sunnyvale           | 1 Gb/s   | 180            | 1460               | 23 min
WAN Geneva <-> Tokyo               | 1 Gb/s   | 300            | 1460               | 1 h 04 min
WAN Geneva <-> Sunnyvale           | 2.5 Gb/s | 180            | 1460               | 58 min
Future WAN CERN <-> Starlight      | 10 Gb/s  | 120            | 1460               | 1 h 32 min
Future WAN link CERN <-> Starlight | 10 Gb/s  | 120            | 8960 (jumbo frame) | 15 min
The Linux kernel 2.4.x implements delayed acknowledgments. Due to delayed acknowledgments, the responsiveness is multiplied by two; therefore, the values above have to be multiplied by two!
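The table entries can be reproduced from the formula in Responsiveness (I); a minimal sketch using consistent units (C in bit/s, MSS in bytes), with the delayed-ACK doubling noted above as an option. A couple of table entries appear to use slightly different rounding, but the Geneva <-> Chicago case comes out almost exactly:

```python
def responsiveness(capacity_bps, rtt_s, mss_bytes, delayed_ack=False):
    """Time (s) to grow cwnd back from BDP/2 to BDP after a loss:
    cwnd/2 packets recovered at one MSS per RTT (two RTTs per MSS
    with delayed ACKs)."""
    r = capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)
    return 2 * r if delayed_ack else r

# WAN Geneva <-> Chicago: 1 Gb/s, RTT 120 ms, MSS 1460 bytes
print(responsiveness(1e9, 0.120, 1460) / 60)   # ~10.3 minutes
# Same path with jumbo frames (MSS 8960 bytes): recovery is ~6x faster
print(responsiveness(1e9, 0.120, 8960) / 60)
```

The quadratic dependence on RTT is what makes the WAN rows so much worse than the LAN row, and the linear dependence on 1/MSS is why the jumbo-frame row recovers in minutes rather than hours.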
Single TCP stream
TCP connection between Geneva and Chicago: C = 1 Gbit/s; MSS = 1,460 bytes; RTT = 120 ms
- Time to increase the throughput from 100 Mbps to 900 Mbps: 35 minutes
- Loss occurs when the bandwidth reaches the pipe size
- 75% bandwidth utilization (assuming no buffering)
- Cwnd < BDP: throughput < bandwidth, RTT constant, throughput = Cwnd / RTT
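The ~35-minute ramp can be estimated from additive increase alone: congestion avoidance adds one MSS per RTT, and the delayed-ACK behavior noted earlier halves that growth rate. A hedged back-of-the-envelope sketch (the helper name is mine):

```python
def ramp_time(from_bps, to_bps, rtt_s, mss_bytes, mss_per_rtt=0.5):
    """Time for TCP Reno congestion avoidance to grow throughput from
    `from_bps` to `to_bps`: cwnd grows by `mss_per_rtt` MSS each RTT
    (0.5 models the Linux 2.4 delayed-ACK halving)."""
    delta_cwnd_bytes = (to_bps - from_bps) * rtt_s / 8
    rtts_needed = delta_cwnd_bytes / (mss_bytes * mss_per_rtt)
    return rtts_needed * rtt_s

# Geneva -> Chicago: RTT 120 ms, MSS 1460 bytes, 100 Mbps -> 900 Mbps
t = ramp_time(100e6, 900e6, 0.120, 1460)
print(f"{t / 60:.0f} minutes")   # roughly 33 minutes, close to the 35 above
```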
Measurements with Different MTUs
TCP connection between Geneva and Chicago (CERN, GVA <-> Starlight, Chi): C = 1 Gbit/s; RTT = 120 ms
- In both cases: 75% link utilization
- A large MTU accelerates the growth of the window
- The time to recover from a packet loss decreases with a large MTU
- A larger MTU reduces the per-frame overhead (saves CPU cycles, reduces the number of packets)
MTU and Fairness
- Two TCP streams share a 1 Gbps bottleneck
- RTT = 117 ms
- MTU = 1500 bytes: average throughput over a period of 4000 s = 50 Mb/s
- MTU = 9000 bytes: average throughput over a period of 4000 s = 698 Mb/s
- A factor of 14!
- Connections with a large MTU quickly increase their rate and grab most of the available bandwidth
[Diagram: testbed topology - Host #1 and Host #2 at CERN (GVA) connect through 1 GE links and a GbE switch, across a POS 2.5 Gbps wide-area link, to Host #1 and Host #2 at Starlight (Chi) / Sunnyvale; the 1 GE links form the 1 Gbps bottleneck.]
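The MTU bias can be illustrated with a toy AIMD model (not the testbed above): two Reno flows each grow by their own MSS every RTT, and both halve their windows on a shared loss when the bottleneck pipe fills. This is a simplified sketch under an assumed synchronized-loss model, with my own function name:

```python
def aimd_share(mss_a, mss_b, pipe_bytes, rounds=200_000):
    """Toy synchronized-loss AIMD: each flow adds its own MSS per RTT;
    when the combined windows exceed the bottleneck pipe, both halve.
    Returns the ratio of the two flows' accumulated throughput."""
    a = b = float(min(mss_a, mss_b))   # both start with a small window
    total_a = total_b = 0.0
    for _ in range(rounds):
        a += mss_a
        b += mss_b
        if a + b > pipe_bytes:         # shared loss event
            a /= 2
            b /= 2
        total_a += a
        total_b += b
    return total_a / total_b

# 1 Gbps bottleneck, RTT = 117 ms -> a pipe of about 14.6 MB per RTT
ratio = aimd_share(9000, 1500, 14.6e6)
print(f"large-MTU flow gets {ratio:.1f}x the small-MTU flow's throughput")
```

Under perfectly synchronized losses this model converges to the MSS ratio (6x here); the measured factor of 14 is even larger, presumably because real losses are not perfectly shared between the two streams.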
RTT and Fairness
[Diagram: testbed topology - Host #1 and Host #2 at CERN (GVA) connect through 1 GE links and a GbE switch, across a POS 2.5 Gb/s link toward Starlight and a POS 10 Gb/s / 10GE path toward Sunnyvale; the 1 GE links form the 1 Gbps bottleneck.]
- Two TCP streams share a 1 Gbps bottleneck
- CERN <-> Sunnyvale: RTT = 181 ms; average throughput over a period of 7000 s = 202 Mb/s
- CERN <-> Starlight: RTT = 117 ms; average throughput over a period of 7000 s = 514 Mb/s
- MTU = 9000 bytes
- The connection with the smaller RTT quickly increases its rate and grabs most of the available bandwidth
[Figure: Throughput (Mbps, 0-1000) over time (0-7000 s) of two streams with different RTTs sharing a 1 Gbps bottleneck: the RTT = 181 ms and RTT = 117 ms streams, each with its average over the life of the connection.]
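Because AIMD grows the window by one MSS per RTT, a flow's steady-state rate scales roughly as 1/RTT^2 when competing flows see the same loss events. A quick check of that prediction against the measured averages quoted above (202 and 514 Mb/s are the slide's numbers):

```python
# AIMD RTT bias: predicted throughput ratio ~ (RTT_long / RTT_short)^2
predicted = (181 / 117) ** 2   # Sunnyvale vs. Starlight RTTs
measured = 514 / 202           # ratio of the measured 7000 s averages
print(f"predicted bias: {predicted:.2f}x, measured bias: {measured:.2f}x")
```

The prediction (~2.4x) is close to the measured imbalance (~2.5x), consistent with the RTT bias being an intrinsic property of Reno's control law rather than an artifact of the testbed.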
How to use 100% of the bandwidth?
Bandwidth-delay product
- Single TCP stream GVA - CHI: MSS = 8960 bytes; throughput = 980 Mbps
- Cwnd > BDP => throughput = bandwidth
- The RTT increases
- Extremely large buffer at the bottleneck
- Network buffers have an important impact on performance
- Do buffers have to be dimensioned to scale with the BDP?
- Why not use the end-to-end delay as the congestion indication?
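The window and buffer sizes implied above are straightforward to compute; a small sketch using the slide's GVA - CHI parameters (the buffer rule of thumb is the classic one-BDP sizing, not a claim from the slide):

```python
def bdp_bytes(capacity_bps, rtt_s):
    """Bandwidth-delay product: the cwnd needed to keep the pipe full."""
    return capacity_bps * rtt_s / 8

bdp = bdp_bytes(1e9, 0.120)          # GVA - CHI: 1 Gb/s, RTT = 120 ms
print(f"BDP = {bdp / 1e6:.0f} MB")   # 15 MB must be kept in flight
# With jumbo frames (MSS = 8960 bytes) that is ~1674 packets in flight.
# The classic rule of thumb also sizes the bottleneck buffer at one BDP,
# so the link stays full while cwnd recovers from a halving.
print(f"packets in flight: {bdp / 8960:.0f}")
```

This is why filling the pipe requires both large socket buffers at the endpoints and, for Reno, a comparably large buffer at the bottleneck.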
Single stream TCP performance
Date   | From Geneva to | Size of transfer | Duration (s) | RTT (ms) | MTU (bytes) | IP version | Throughput | Record / Award
Feb 27 | Sunnyvale      | 1.1 TByte        | 3700         | 180      | 9000        | IPv4       | 2.38 Gbps  | Internet2 LSR, CENIC award, Guinness World Record
May 27 | Tokyo          | 65.1 GByte       | 600          | 277      | 1500        | IPv4       | 931 Mbps   |
May 2  | Chicago        | 385 GByte        | 3600         | 120      | 1500        | IPv6       | 919 Mbps   |
May 2  | Chicago        | 412 GByte        | 3600         | 120      | 9000        | IPv6       | 983 Mbps   | Internet2 LSR
NEW submission (Oct 11): 5.65 Gbps from Geneva to Los Angeles across the LHCnet, Starlight, Abilene and CENIC.
Early 10 Gb/s, 10,000 km TCP Testing
- Single TCP stream at 5.65 Gbps
- Transferring a full CD in less than 1 s
- Uncongested network
- No packet loss during the transfer
- Probably qualifies as a new Internet2 LSR
- Monitoring of the Abilene traffic in LA
Conclusion
- The future performance of computational grids looks bad if we continue to rely on the widely deployed TCP Reno.
- How to define fairness? Taking into account the MTU; taking into account the RTT.
- Larger packet sizes (jumbogram: payload larger than 64 KB). Is the standard MTU the largest bottleneck? New Intel 10GE cards: MTU = 16 KB. J. Cain (Cisco): "It's very difficult to build switches to switch large packets such as jumbograms."
- Our vision of the network: "The network, once viewed as an obstacle for virtual collaborations and distributed computing in grids, can now start to be viewed as a catalyst instead. Grid nodes distributed around the world will simply become depots for dropping off information for computation or storage, and the network will become the fundamental fabric for tomorrow's computational grids and virtual supercomputers."