TCP/IP Optimizations for
High Performance WANs
Dr. Joseph L White, Juniper Networks
TCP/IP Optimizations for High Performance WANs © 2008 Storage Networking Industry Association. All Rights Reserved.
SNIA Legal Notice
The material contained in this tutorial is copyrighted by the SNIA.
Member companies and individuals may use this material in presentations and literature under the following conditions:
Any slide or slides used must be reproduced without modification
The SNIA must be acknowledged as source of any material used in the body of any document containing material from these presentations.
This presentation is a project of the SNIA Education Committee.
Abstract
TCP/IP Optimizations for High Performance WANs
This session provides an overview of high-performance TCP/IP behavior and optimizations for bulk data transfers in WAN environments. TCP/IP provides the underlying transport for most bulk data transfers across wide area networks (WANs), including distance extension for block storage (iSCSI, FCIP, and iFCP), WAN acceleration for local TCP/IP sessions, and wide area file systems (WAFS) acceleration. Knowledge of TCP/IP performance and behavior characteristics in WAN environments is critical for storage networking professionals and administrators. In particular, understanding the effects of high-bandwidth, long-latency, impaired, and congested networks, as well as the TCP/IP modifications and optimizations used to mitigate these effects, is critical for optimal deployment of IP storage solutions.
TCP/IP based protocols are in your critical path
iFCP/FCIP Distance Extension
WAN Acceleration
NAS (CIFS/NFS)
iSCSI
Data Backup and Recovery
Remote Office Optimized Access
Low end block storage access
High Throughput
Low Latency
Robustness
Wide Scalability
High Availability
Demanding Data Center Requirements Must Still Be Satisfied
TCP/IP does not
get a free pass!
The WAN as seen by Storage
This presentation explores the consequences of high speed WAN connections used for block data transport
Typical Implementation
Special purpose gateways provide TCP connectivity and acceleration
For FC devices, gateways also provide protocol conversion or tunneling
Consequence 1
Long Fat Networks (LFNs) have a large bandwidth-delay product
Bandwidth-delay product = amount of data ‘in flight’ needed to saturate the network link
1 ms of RTT = about 128 KB of buffering at 1 Gb/s
1 ms of RTT = about 100 km maximum separation
1 Gb pipe: for this example we need 2.56 MB of both transmit data and receive window to sustain line rate
100 Mb pipe: …but for this example only 256 KB is needed to sustain line rate
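As a rough check on these figures, the bandwidth-delay product can be computed directly. This is only a sketch: the function name `bdp_bytes` is hypothetical, and the 20 ms RTT is an assumption inferred from the slide's numbers (2.56 MB is 20 times 128 KB). It uses decimal units, so it gives 125 KB per millisecond at 1 Gb/s where the slide rounds to 128 KB.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link full for one RTT."""
    return bandwidth_bits_per_s / 8 * rtt_s


# Assumed RTT of 20 ms, inferred from the slide's 2.56 MB example.
ASSUMED_RTT_S = 0.020

for name, bandwidth in (("1 Gb/s", 1e9), ("100 Mb/s", 100e6)):
    in_flight = bdp_bytes(bandwidth, ASSUMED_RTT_S)
    print(f"{name}: {in_flight / 1e6:.2f} MB in flight to sustain line rate")
```

Both the sender's buffer of unacknowledged data and the receiver's window need to be at least this large, which is why the slide asks for the same amount on both sides.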
Consequence 2
Effect of packet drops in the network magnified
Slow recovery due to large RTT (we’ll explore this in detail)
Many more hops and a greater variety of equipment increase the chances of problems due to design flaws, incorrect configuration, and failures
What can be done?
Understand TCP/IP dynamic behavior
  Problems can be avoided in the first place
    Good Data Center and Network Design
  Problems can be efficiently diagnosed and fixed
    Improved Monitoring and Error Triangulation
Protocol Optimization
  TCP/IP
    Various standard and non-standard modifications
  Upper layers (remove chattiness)
    Fast Write, Tape Acceleration, CIFS/NFS acceleration
Characteristics of TCP
Connection Oriented
  Full Duplex
  Byte Stream (to the application)
  TCP Port Numbers allow multiple connections between an IP address pair
  Well known port numbers for some services
  Reliable connection open and close
  Capabilities negotiated at connection initialization (TCP Options)
Reliable
  Guaranteed In-Order Delivery
  Segments carry sequence and acknowledgement information
  Sender keeps data until received
  Sender times out and retransmits when needed
  Segments protected by checksum
Flow Control and Congestion Avoidance
  Sender Congestion Window
  Receiver Sliding Window
TCP Header
IP header: VERS (4) | HLEN (5), Service Type, Total Length = 20 + payload, Identification, Flags | Fragment Offset, TTL, Protocol (0x06 TCP), Header Checksum, Source IP Address, Destination Address
TCP header: source port, dest port, sequence number, ack number, hlen | res | flags, window size, TCP checksum, urgent pointer, TCP Options
TCP Payload
Flags: SYN, FIN, RST, ACK, PSH, URG
Connection identified by 4-tuple
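The fixed 20-byte part of the TCP header listed here can be unpacked with Python's standard `struct` module. This is a minimal sketch, not a full parser: `parse_tcp_header` is a hypothetical name, options and checksum verification are ignored, and the sample segment (including its port numbers) is made up for illustration.

```python
import struct

# Flag bits in the low 6 bits of the 16-bit offset/flags word.
FLAG_NAMES = ("FIN", "SYN", "RST", "PSH", "ACK", "URG")


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header (network byte order)."""
    (src_port, dst_port, seq, ack, offset_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    flag_bits = offset_flags & 0x3F
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        # Header length is carried in 32-bit words in the top 4 bits.
        "header_len": (offset_flags >> 12) * 4,
        "flags": [n for i, n in enumerate(FLAG_NAMES) if flag_bits & (1 << i)],
        "window": window,
        "checksum": checksum,
        "urgent": urgent,
    }


# Made-up SYN segment: 5-word header, SYN flag set, 65535-byte window.
sample = struct.pack("!HHIIHHHH", 49152, 3260, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
print(parse_tcp_header(sample))
```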
TCP Transmit Congestion Controls
Data Stream chopped into Segments: Sent as IP Datagrams
Congestion Window (cwnd) uses AIMD (Additive Increase, Multiplicative Decrease)
Nagle Algorithm (RFC 896): don’t send less than one segment of data unless all sent data has been ACK’d
[Figure: the sender's byte stream in order of increasing sequence number]
  Bytes sent and ACK’d (up to Send Unacknowledged)
  Bytes sent but not ACK’d (from Send Unacknowledged to Send Next)
  Bytes that can be sent (inside both the Receiver’s Advertised Window and the Sender’s Congestion Window)
  Unsent bytes
    Can’t send these bytes until receiver opens the window
    Can’t send these bytes until the sender opens his congestion window
  New bytes added here
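The interaction of the two windows can be sketched in a few lines. The names below (`sendable_bytes`, `snd_una`, `snd_nxt`) are illustrative assumptions rather than any particular stack's API, and the sketch ignores 32-bit sequence number wraparound.

```python
def sendable_bytes(snd_una: int, snd_nxt: int, cwnd: int, rwnd: int, unsent: int) -> int:
    """Bytes the sender may transmit right now.

    snd_una: oldest sent but unacknowledged sequence number
    snd_nxt: next sequence number to send
    cwnd:    sender's congestion window in bytes
    rwnd:    receiver's advertised window in bytes
    unsent:  bytes queued by the application but not yet sent
    """
    in_flight = snd_nxt - snd_una
    # The effective window is the smaller of the two limits.
    window = min(cwnd, rwnd)
    return max(0, min(unsent, window - in_flight))


# 4000 bytes in flight, cwnd is the tighter limit at 8000 bytes.
print(sendable_bytes(snd_una=1000, snd_nxt=5000, cwnd=8000, rwnd=16000, unsent=10000))
```

Whichever window is smaller decides which of the two "can't send these bytes" cases applies.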
TCP Transmitter Illustration
ACK Received
ACK Received but without window growing
Application has data
ACK Received
Application has more data; can’t send all of it due to limits
Updates allow sending, but there is still unsent data
TCP Receive Window
New receive buffer space added here
These bytes have been sent to the application
Sliding window protocol
Receiver advertises a window to the sender
Out of Order arrival and reassembly
Duplicate segments or out of window segments are discarded
Avoid the silly window syndrome: don’t advertise too small windows
Sender’s Persist Timer generates Window Probes to recover from lost window updates
TCP Receiver Illustration
Receive Buffer added causes Window Update (not considered a duplicate ACK)
Out of Order segment generates a Duplicate ACK
Everything previous ACK’d and ready to receive…
TCP Segment Received
Receive Buffer allocated and ACK sent
Receive Buffer NOT allocated, but ACK still sent
TCP Segment Received
Slow Start
Rate of packet injection into the network equals the rate at which ACKs are received
Leads to exponential sender cwnd ramp
Exponential ramp stops when
  The limit of the receiver’s advertised window is reached (TCP can only move one window’s worth of data per RTT)
  The limit of the sender’s un-acknowledged data buffering or outstanding data is reached
  The limit of the network to send data is reached (network saturation)
  A congestion event occurs
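The exponential ramp can be illustrated with a small model. This is a simplified sketch under stated assumptions: cwnd doubles once per RTT (every segment is ACK'd individually, no delayed ACKs), no loss occurs, and the single `limit` stands in for whichever of the limits above is reached first. The function name, the 1460-byte MSS, the two-segment initial window, and the 2.5 MB limit are all illustrative choices.

```python
def slow_start_ramp(initial_cwnd: int, limit: int) -> list:
    """cwnd in bytes at the start of each RTT until `limit` is reached."""
    cwnd = initial_cwnd
    ramp = [cwnd]
    while cwnd < limit:
        # Each ACK releases more data, so cwnd doubles per round trip.
        cwnd = min(cwnd * 2, limit)
        ramp.append(cwnd)
    return ramp


MSS = 1460  # assumed segment size in bytes
ramp = slow_start_ramp(initial_cwnd=2 * MSS, limit=2_500_000)
print(f"{len(ramp) - 1} RTTs to open the window to {ramp[-1]} bytes")
```

Because the ramp is counted in round trips, the same number of doublings takes proportionally longer on a path with a larger RTT.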
TCP ACK Schemes
Duplicate ACK
  Sent when segment is received out of order
  May indicate a missing segment to the sender
Delayed ACK
  Do not send an ACK right away; wait a short time to see if additional segments arrive which can also be ACK’d
ACK/N
  Wait for a specific number (N) of segments to arrive before sending an ACK
  Send anyway after a short time interval
TCP/IP for Block Storage
Scaled receive windows
Quick Start
Modify Congestion Controls
Deal with network reordering
Detect retransmission timeouts faster
Implement Selective Acknowledgement (SACK)
Reduce the amount of data transferred (compression)
Aggregate multiple TCP/IP sessions together
Bandwidth Management, Rate Limiting, Traffic Shaping
RTT and Receive window plateau
For window sizes big enough to support line rate
[Figure: throughput (MB/s) versus time (s) for 10 ms RTT and 30 ms RTT]
LFNs and Receive window plateau
For 30 ms RTT
[Figure: throughput (MB/s) versus time (s) for a 1.9 MB window and a 3.8 MB window]
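The plateau follows from TCP moving at most one window of data per RTT. The sketch below estimates it for the two window sizes in this slide; the 125 MB/s cap is an assumption (a 1 Gb/s link in decimal units), and `plateau_mb_per_s` is a hypothetical name.

```python
def plateau_mb_per_s(window_mb: float, rtt_s: float, line_rate_mb_per_s: float = 125.0) -> float:
    """Steady-state throughput: one window per RTT, capped by the link."""
    return min(window_mb / rtt_s, line_rate_mb_per_s)


RTT_S = 0.030  # the 30 ms RTT used in this slide
for window_mb in (1.9, 3.8):
    rate = plateau_mb_per_s(window_mb, RTT_S)
    print(f"{window_mb} MB window: about {rate:.1f} MB/s")
```

Under these assumptions the smaller window plateaus at roughly half of line rate, while the larger one is limited by the link instead of the window.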
Quick Start
Two common meanings
  Quick Start means giving the sender cwnd a head start
  Quick Start means giving the sender no cwnd limit at start
[Figure: throughput (MB/s) versus time (s) at 50 ms RTT, 64 KB quick start compared with no quick start]
Packet Loss
Packet (segment) loss can occur for several reasons
  Congestion
  Ethernet/IP proactive flow control schemes (RED)
  Faulty equipment
  Uncorrectable bit errors
When packet loss does occur it can be extremely detrimental to the throughput of the TCP connection
Extent of disruption determined by the pattern of drops
There are TCP features and modifications which mitigate effects of packet loss
TCP Retransmission Timeout
Time the oldest sent, unacknowledged data
Requires RTT estimation for the connection (typically 500 ms resolution TCP clock)
Retransmission timeouts are 500 ms to 1 s with exponential back-off as more timeouts occur
Too many of these and the session closes
  Can take minutes
[Figure: rate (MB/s) versus time (s) at 10 ms RTT, showing unrecoverable drops followed by retransmission timeouts]
Improve Retransmission Timeouts
With a finer grained TCP clock we can estimate timeouts with much better accuracy
Instead of 500 ms to 1 s timeouts we could, for example, have 50-60 ms timeouts
Also helps the long TCP connection close times (reduced from minutes to seconds)
[Figure: rate (MB/s) versus time (s) at 10 ms RTT, showing unrecoverable drops followed by retransmission timeouts]
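The effect of clock granularity on the timeout can be sketched with the smoothed RTT estimator described in RFC 2988. This is an illustration under stated assumptions, not any vendor's implementation: the class and function names are hypothetical, the RTO floors (500 ms for the coarse clock, 50 ms for the fine clock) are chosen to match the ranges quoted in these slides rather than the RFC's recommended minimum, and the 60 s cap on back-off is also an assumption.

```python
class RtoEstimator:
    """Smoothed RTT / RTT variance estimator in the style of RFC 2988."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4

    def __init__(self, clock_granularity_s: float, min_rto_s: float):
        self.granularity_s = clock_granularity_s
        self.min_rto_s = min_rto_s  # assumed floor, see lead-in
        self.srtt = None
        self.rttvar = None

    def sample(self, rtt_s: float) -> float:
        """Feed one RTT measurement and return the updated RTO."""
        if self.srtt is None:
            self.srtt = rtt_s
            self.rttvar = rtt_s / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt_s)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt_s
        return self.rto()

    def rto(self) -> float:
        variance_term = max(self.granularity_s, self.K * self.rttvar)
        return max(self.min_rto_s, self.srtt + variance_term)


def backed_off(rto_s: float, consecutive_timeouts: int, max_rto_s: float = 60.0) -> float:
    """Exponential back-off: the timeout doubles after each expiry."""
    return min(rto_s * 2 ** consecutive_timeouts, max_rto_s)


coarse = RtoEstimator(clock_granularity_s=0.5, min_rto_s=0.5)
fine = RtoEstimator(clock_granularity_s=0.001, min_rto_s=0.05)
for _ in range(20):  # steady 10 ms RTT, as in the figure
    coarse_rto = coarse.sample(0.010)
    fine_rto = fine.sample(0.010)
print(f"coarse clock RTO: {coarse_rto * 1000:.0f} ms, fine clock RTO: {fine_rto * 1000:.0f} ms")
```

With a 10 ms RTT the coarse clock dominates the timeout, so each unrecoverable drop idles the connection for many round trips; the fine clock keeps the idle period much closer to the RTT itself.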
TCP Fast Retransmit, Fast Recovery
Dropped frames can be detected by looking for duplicate ACKs
3 dup ACKs trigger Fast Retransmit and Fast Recovery
With Fast Retransmit there is no retransmission timeout.
[Figure: rate (MB/s) versus time (s) at 10 ms RTT, showing a packet drop, Fast Recovery, and Congestion Avoidance]
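A minimal sketch of duplicate ACK counting on the sender side follows. It is deliberately simplified: `DupAckDetector` is a hypothetical name, and a real stack would also check that the duplicate carries no data and no window update before counting it. The threshold is a parameter so that values other than the usual 3 can be tried.

```python
class DupAckDetector:
    """Counts duplicate ACKs and signals when to fast-retransmit."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_no: int):
        """Return the sequence number to retransmit, or None."""
        if ack_no == self.last_ack:
            self.dup_count += 1
            if self.dup_count == self.threshold:
                # The receiver keeps asking for ack_no: resend it now
                # instead of waiting for the retransmission timer.
                return ack_no
        else:
            self.last_ack = ack_no
            self.dup_count = 0
        return None


detector = DupAckDetector()
for ack in (1000, 2000, 2000, 2000, 2000):
    segment = detector.on_ack(ack)
    if segment is not None:
        print(f"fast retransmit segment starting at {segment}")
```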
Congestion Control Modifications
Don’t want to break TCP’s fundamental congestion behavior
Modify Congestion Controls
  Different ramp up or starting value for slow start (aka QuickStart)
  Less reduction during fast recovery
  Ignore or reduce the effects of congestion avoidance
  Eifel Algorithm
Modify Fast Retransmit and Fast Recovery detection scheme
  Ignore a larger fixed number of duplicate ACKs but backstop with a short timer
  RFC 4653: TCP-NCR (non-congestion robustness)
    Retransmission detection is based upon cwnd of data leaving network instead of a fixed limit
Change Fast Recovery Threshold
During fast recovery reduce sender cwnd by 1/8 instead of 1/2
[Figure: rate (MB/s) versus time (s) at 10 ms RTT, showing recovery after a packet drop]
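To see why the smaller reduction matters, the sketch below estimates how many round trips congestion avoidance needs to grow cwnd back to its value before the drop. It assumes the standard additive increase of one segment per RTT and no further losses; the 400,000-byte window and 1460-byte MSS are illustrative values, not taken from the figure, and `recovery_rtts` is a hypothetical name.

```python
import math


def recovery_rtts(cwnd_bytes: int, reduction: float, mss: int = 1460) -> int:
    """RTTs of additive increase needed to regain the pre-drop cwnd."""
    deficit = cwnd_bytes * reduction
    # Congestion avoidance adds roughly one MSS per round trip.
    return math.ceil(deficit / mss)


RTT_S = 0.010  # the 10 ms RTT used in this slide
for reduction in (1 / 2, 1 / 8):
    rtts = recovery_rtts(400_000, reduction)
    print(f"reduce by {reduction}: {rtts} RTTs, about {rtts * RTT_S:.2f} s to recover")
```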
Remove Congestion Avoidance
Retain exponential ramp even after congestion avoidance should kick in
[Figure: throughput (MB/s) versus time (s) at 10 ms RTT, showing recovery after a packet drop]
Network Reordering
Networks which dynamically load balance cause excessive false congestion events: reordering is extensive, and TCP normally uses a duplicate ACK threshold of only 3
Out-of-order arrivals generate Duplicate ACKs
  Causes Fast Retransmit and Fast Recovery by the sender even though no segments were lost!
Can be helped by ignoring more duplicate ACKs before Fast Retransmit/Recovery
  Must be careful not to miss a retransmit that should have gone out
Selective Acknowledgement (SACK)
Both sides track the current list of holes using additional TCP state
Allows the sender to retransmit only segments 4, 5, 9, and 11, and the sender can fill in multi-segment holes!
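The hole tracking can be sketched as follows. For readability this uses whole segment numbers, whereas real SACK blocks describe byte ranges. The received pattern (segments 1 to 3 in order, then 6 to 8, 10, and 12) is an assumption chosen to reproduce the slide's example, and `missing_segments` is a hypothetical name.

```python
def missing_segments(cum_acked: int, sack_blocks: list) -> list:
    """Segments the sender should retransmit.

    cum_acked:   highest segment received in order (cumulative ACK)
    sack_blocks: inclusive (first, last) ranges received out of order
    """
    received = set()
    for first, last in sack_blocks:
        received.update(range(first, last + 1))
    if not received:
        return []
    # Anything between the cumulative ACK and the highest SACKed
    # segment that was not reported is a hole.
    return [s for s in range(cum_acked + 1, max(received)) if s not in received]


print(missing_segments(3, [(6, 8), (10, 10), (12, 12)]))
```

Without SACK the sender only learns about the first hole from the cumulative ACK, so each further hole costs at least one more round trip to discover.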
Compression
Compression Ratio
  The size of the incoming data divided by the outgoing data
  Determined by the data pattern and algorithm
  History buffers help the compression ratio since they retain more data for potential matches
Compression Rate
  Speed of incoming data processing
  Different algorithms need different processing power
A higher compression ratio generally requires more processing power for the same rate
Many algorithms
Can also be used for data at rest
Encrypted data is incompressible
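The dependence of the ratio on the data pattern is easy to demonstrate with Python's standard `zlib` module. This is only an illustration: zlib stands in for whatever algorithm a given product uses, the repetitive payload is made up, and random bytes from `os.urandom` are used as a stand-in for encrypted data.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Size of the incoming data divided by the size of the outgoing data."""
    return len(data) / len(zlib.compress(data, level))


repetitive = b"block storage payload " * 4096
random_like = os.urandom(len(repetitive))  # stand-in for encrypted data

print(f"repetitive data: {compression_ratio(repetitive):.1f} : 1")
print(f"random data:     {compression_ratio(random_like):.3f} : 1")
```

The random input comes out slightly larger than it went in, which is why compression needs to happen before encryption to be of any use.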
Multiple bonded TCP connections
Carry a single session’s traffic on multiple TCP sessions at once
Allows parallel paths across the network to be used for a single session
Congestion controls and responses overlap giving the effect of TCP being less reactive to drops and recovering faster
Receive Window sizes effectively add to each other giving larger bandwidth-delay capacity
[Figure: traffic sessions at each end pass through load distribution and encapsulation onto multiple TCP connections]
Rate Limiting
Receiver can limit TCP transfer rate by controlling its receive window size
  Consequence of 1 window per RTT fundamental property
Sender can control its transmit rate directly
  Best done in hardware
Why control the transfer rate?
  Avoid problems due to overrunning equipment in intermediate network
  Avoid congestion through intermediate networks or links
[Figure: a ‘router’ with no flow control, high input bandwidth, and low bandwidth output; the sender can overrun the input FIFO, leading to tail drops]
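Receiver-side rate limiting amounts to choosing a window equal to the target rate multiplied by the RTT. The sketch below also picks the smallest window scale shift (RFC 1323) that lets that window fit the 16-bit header field. The function names and the 50 MB/s target are illustrative assumptions.

```python
def window_for_rate(target_bytes_per_s: float, rtt_s: float) -> int:
    """Receive window (bytes) that caps throughput at the target rate."""
    return round(target_bytes_per_s * rtt_s)


def window_scale_shift(window_bytes: int) -> int:
    """Smallest shift so the scaled window fits the 16-bit window field."""
    shift = 0
    while (window_bytes >> shift) > 0xFFFF:
        shift += 1
    if shift > 14:  # RFC 1323 limits the shift count to 14
        raise ValueError("window too large for TCP window scaling")
    return shift


window = window_for_rate(50e6, 0.030)  # cap at 50 MB/s over a 30 ms RTT
print(f"advertise {window} bytes using scale shift {window_scale_shift(window)}")
```

The result is only as good as the RTT estimate: if the real RTT is shorter than assumed, the same window permits a higher rate.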
Summary
TCP/IP is both good and bad for block storage traffic
TCP/IP’s fundamental characteristics are Good™
  Connection oriented, full duplex, guaranteed in-order delivery
TCP/IP’s congestion controls and recovery of lost segments can cause problems for block storage
However, many of TCP/IP’s drawbacks can be mitigated
  Some changes only improve TCP behavior
    For example, better resolution TCP timers leading to more precise retransmission timeouts
    Or SACK
  Some have a possible negative effect on other traffic
    For example, removing congestion avoidance completely
Q&A / Feedback
Please send any questions or comments on this presentation to SNIA: [email protected]
Many thanks to the following individuals for their contributions to this tutorial.
SNIA Education Committee
Joseph L White
Howard Goldstein
Appendix: High Speed TCP/IP Projects
Industry Proprietary Implementations
  WAN Acceleration Companies
  Distance Extension for SAN
  Distance Extension for TCP
Scalable TCP (STCP)
High Speed TCP (HS-TCP)
FAST TCP
H-TCP
Appendix: Relevant Internet RFCs
RFC 793: Transmission Control Protocol
RFC 896: Congestion control in IP/TCP internetworks
RFC 1122: Requirements for Internet Hosts - Communication Layers
RFC 1323: TCP Extensions for High Performance
RFC 2018: TCP Selective Acknowledgment Options
RFC 2140: TCP Control Block Interdependence
RFC 2581: TCP Congestion Control
RFC 2861: TCP Congestion Window Validation
RFC 2883: An Extension to the Selective Acknowledgement (SACK) Option for TCP
RFC 2988: Computing TCP's Retransmission Timer
RFC 3042: Enhancing TCP's Loss Recovery Using Limited Transmit
RFC 3124: The Congestion Manager
RFC 3155: End-to-end Performance Implications of Links with Errors
RFC 3168: The Addition of Explicit Congestion Notification (ECN) to IP
RFC 3390: Increasing TCP's Initial Window
RFC 3449: TCP Performance Implications of Network Path Asymmetry
RFC 3465: TCP Congestion Control with Appropriate Byte Counting (ABC)
RFC 3517: A Conservative Selective Acknowledgment based Loss Recovery Algorithm for TCP
RFC 3522: The Eifel Detection Algorithm for TCP
RFC 3649: HighSpeed TCP for Large Congestion Windows
RFC 3742: Limited Slow-Start for TCP with Large Congestion Windows
RFC 3782: The NewReno Modification to TCP's Fast Recovery Algorithm
RFC 4015: The Eifel Response Algorithm for TCP
RFC 4138: Forward RTO-Recovery (F-RTO)
RFC 4653: Improving the Robustness of TCP to Non-Congestion Events
RFC 4782: Quick-Start for TCP and IP
RFC 4828: TCP Friendly Rate Control (TFRC): The Small-Packet (SP) Variant
Appendix: Miscellaneous TCP Features
MTU Discovery
  Packets probe the network path to determine maximum packet size
Timestamp Option
  Enables PAWS
  Allows RTT calculation on each ACK instead of once per window
  Produces better smoothed averages
  Timer rate guidelines: 1 ms <= period <= 1 second
PAWS: protection against wrapped sequences
  In very fast networks where data can be held, protects against old sequence numbers accidentally appearing as though they are in the receiver’s valid window
  Uses timestamp as 32-bit extension to the sequence number
  Requires that the timestamp increment at least once per window
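How quickly sequence numbers can wrap is simple arithmetic: the 32-bit sequence space covers 2^32 bytes. The sketch below computes the wrap time at a few line rates; the function name and the chosen rates are illustrative assumptions, and the link is assumed to be fully utilized by a single connection.

```python
SEQUENCE_SPACE_BYTES = 2 ** 32


def seq_wrap_seconds(line_rate_bits_per_s: float) -> float:
    """Time for one connection at full line rate to reuse a sequence number."""
    bytes_per_s = line_rate_bits_per_s / 8
    return SEQUENCE_SPACE_BYTES / bytes_per_s


for name, rate in (("100 Mb/s", 100e6), ("1 Gb/s", 1e9), ("10 Gb/s", 10e9)):
    print(f"{name}: sequence numbers wrap after about {seq_wrap_seconds(rate):.1f} s")
```

At gigabit rates the wrap time is well within the period a delayed segment might survive in the network, which is the case PAWS guards against.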
Appendix: TCP Options
End of option list
No operation
Maximum (Receive) Segment Size [SYN only]
Window Scale Factor [SYN only]
Timestamp
Selective ACK Permitted [SYN packet only]
Selective ACK block
Options are usually 4 byte aligned with leading NOPs
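The options listed in this appendix share a simple kind/length/value layout, which the sketch below walks through. It is a minimal illustration, not a complete parser: `parse_tcp_options` is a hypothetical name, the option values are returned as raw bytes, and the sample option block is made up (MSS 1460, window scale shift 7, SACK permitted, padded with NOPs).

```python
OPTION_NAMES = {
    0: "End of option list",
    1: "No operation",
    2: "Maximum Segment Size",
    3: "Window Scale Factor",
    4: "Selective ACK Permitted",
    5: "Selective ACK block",
    8: "Timestamp",
}


def parse_tcp_options(raw: bytes) -> list:
    """Return (kind, value_bytes) pairs from a TCP options field."""
    options, i = [], 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:            # End of option list: stop parsing
            options.append((kind, b""))
            break
        if kind == 1:            # NOP is a single byte used for alignment
            options.append((kind, b""))
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]      # length covers kind and length bytes too
        if length < 2 or i + length > len(raw):
            raise ValueError("malformed option length")
        options.append((kind, raw[i + 2:i + length]))
        i += length
    return options


sample = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7, 1, 1, 4, 2])
for kind, value in parse_tcp_options(sample):
    print(OPTION_NAMES.get(kind, f"kind {kind}"), value.hex())
```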