Applications and network performance

Introduction

Considerable effort improving networks and operating systems has vastly improved the performance of large file transfers. Increasingly, the causes of poor network performance are found in application design and programming.

People with long memories will recall the same experience with the performance of relational databases. Just as programmers learnt how to use SQL effectively, programmers now need to learn to use fast long-distance networks effectively. This paper can be a beginning for that learning.

How fast, how long?

The typical host is connected at one gigabit per second to a workgroup ethernet switch. This in turn connects to a building ethernet switch, a routing core, a firewall, and finally to a border router with a 1Gbps link to AARNet. AARNet has sufficient capacity to take this potential 1Gbps to many of the world's universities.

The longest production 1Gbps path through the network is from Perth to Sydney, Sydney to Seattle (USA), Seattle to Amsterdam (Netherlands), Amsterdam to Tomsk (Russia). It takes a packet roughly 600ms to travel this path and back.

The average 1Gbps path is much shorter than this. However, Australia's position makes most sites of interest at least a Pacific Ocean, or 100ms, away.

Application developers in other countries do not face such long paths, with their large round-trip times. 100ms is enough to comfortably cross Europe or North America, but is the minimum value for applications run from Australia.

Figure 1. Typical campus network: hosts, workgroup switch, core switch, core router, border firewall, border router, Internet.

Figure 2. Longest 1Gbps path is 600ms.

TCP

Understanding some behaviours of the Transmission Control Protocol is needed to write applications which use the network well.

TCP has two major states: slow start and congestion avoidance.[1] These two algorithms determine the scheduling and quantity of data transmitted by the sender.

Slow start

When starting, TCP does not know how much bandwidth is available. It starts by sending two to four packets.

If all of these packets are acknowledged then the number of packets to be sent is incremented and those packets are sent.

This is repeated until an Ack is lost or until the receiver's TCP buffer size is reached.

At the end of the slow start state we have a good estimate of the round-trip time and available bandwidth of the link. This is described in a single figure, the congestion window: the number of bytes which can be sent without congesting the path or overrunning the receiver.

Congestion avoidance

The goal of this mode is to schedule packets to arrive just as they are needed by the receiver and not to congest the path. During this mode we send a full congestion window of packets per round-trip time.

If all the packets are Acked, then the congestion window is slowly advanced[2] to probe for any unused bandwidth on the path which may have become available.

An estimate and variance of the round-trip time are maintained by timing the transmitted packets and their Acks. These are used to determine if a transmitted packet has a late or missing Ack.
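A minimal sketch of how such an estimate can be kept. It assumes the smoothing gains (1/8 and 1/4) and the "estimate plus four deviations" timeout rule of RFC 6298, which this paper does not cite; the class and method names are illustrative only, and the one-second minimum timeout of that RFC is left out.

```python
class RttEstimator:
    """Smoothed round-trip time and its deviation, after Jacobson's algorithm."""

    ALPHA = 1 / 8   # gain for the smoothed RTT (assumed, from RFC 6298)
    BETA = 1 / 4    # gain for the RTT deviation (assumed, from RFC 6298)

    def __init__(self):
        self.srtt = None     # smoothed round-trip time, in seconds
        self.rttvar = None   # smoothed mean deviation, in seconds

    def sample(self, rtt):
        """Fold one measured round-trip time (seconds) into the estimate."""
        if self.srtt is None:
            # First measurement: take it as the estimate, half of it as deviation.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Update the deviation first, using the previous smoothed value.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt

    def timeout(self):
        """Seconds after which an Ack is treated as late or missing."""
        return self.srtt + 4 * self.rttvar


est = RttEstimator()
for rtt in (0.100, 0.100, 0.300):   # a steady 100ms path, then one slow Ack
    est.sample(rtt)
print(f"srtt={est.srtt:.3f}s rttvar={est.rttvar:.3f}s timeout={est.timeout():.3f}s")
```

A single slow Ack moves the estimate only a little but widens the deviation a lot, which is why the timeout grows quickly on a path with variable delay.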

If a transmitted packet appears to have been lost due to congestion, then the congestion window and slow start threshold are halved and slow start mode is invoked.
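The two states can be illustrated with a toy model, counted in packets per round-trip time. This is a sketch under stated assumptions rather than any real TCP implementation: the window doubles each round trip in slow start (one extra packet per Ack, as in RFC 2581), grows by one packet per round trip in congestion avoidance, and on loss the threshold is halved and the window restarts from its initial size. The receiver's buffer limit is not modelled, and the function name and default values are hypothetical.

```python
def simulate_cwnd(round_trips, loss_at, initial_cwnd=2, ssthresh=64):
    """Return the congestion window (in packets) used in each round trip.

    loss_at is a set of round-trip indexes in which a packet is lost.
    """
    cwnd = initial_cwnd
    history = []
    for index in range(round_trips):
        history.append(cwnd)
        if index in loss_at:
            # Loss: halve the threshold and fall back to slow start.
            ssthresh = max(cwnd // 2, 2)
            cwnd = initial_cwnd
        elif cwnd < ssthresh:
            # Slow start: each Ack adds a packet, so the window doubles.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: one extra packet per round-trip time.
            cwnd += 1
    return history


history = simulate_cwnd(round_trips=12, loss_at={8})
print(history)
```

On a 100ms path each entry in the list costs a tenth of a second, so a single loss costs several round trips of reduced throughput before the window recovers.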

New TCP algorithms

A large number of modified TCP algorithms exist. These seek to improve the performance of the TCP algorithm. They variously feature: faster discovery of available bandwidth, better estimation of the delay-bandwidth product, faster recovery from congestion, better resilience to media packet loss.

All recent operating system versions have modified TCP algorithms available.

Inter-flow fairness

Differing TCP algorithms have radically different behaviours when faced with multiple flows.

[1] Jacobson's TCP congestion control algorithm is most recently described in Allman, Paxson, Stevens. RFC2581, TCP congestion control. 1999.

[2] Specifically, the congestion window is increased by one packet per round-trip time.

Ideally the available bandwidth will be equally shared between running connections. This ideal is rarely met.

Initiating multiple connections to improve throughput is limited by the inter-flow fairness of the TCP algorithm. Hamilton TCP has particularly good inter-flow fairness.

Round-trip time fairness

Almost all algorithms have connections with a long round-trip time treated much more unfairly than connections with a small round-trip time. All connections from Australia have a long round-trip time, so congestion near an offshore server is particularly felt by Australian clients.

Tuning the operating system for TCP

There are three major changes in operating systems' treatment of TCP. Firstly, buffer tuning is becoming more automatic. Secondly, more and better TCP algorithms are available for use. Thirdly, effort is concentrating on removing overheads from the operating system itself, for example zero-copy transmission and Van Jacobson's API.

This effort is wasted if applications are hosted on an operating system that does not include these improvements. Microsoft Windows XP, Microsoft Windows Server 2003, and Linux kernels 2.4 and earlier are not suitable if high network performance is desired.

There is often conflict between the requirement of networking performance to use a recent operating system and other systems administration objectives. Most systems administrators do not realise that this tradeoff exists.

Buffer sizes

The sender must be able to resend the entire data in flight across the link in case none of the packets is acknowledged. The in-flight data includes the data being transmitted across links plus the router buffers on the path. If the first retransmission is lost then there will be a second round-trip time of data to buffer.

Set the sender's buffer to twice the bandwidth-delay product (BDP). Modern TCP algorithms need less buffer, say 1.2 times the BDP, since they back off less radically than the traditional Reno TCP algorithm.

The receiver must be able to accept a full flight's worth of data. The in-flight data includes the data being transmitted across links plus the router buffers on the path.

Set the receiver's buffer to a tad more than the bandwidth-delay product. There is no need to allow for all of the routers in the path to have full buffers, since this path congestion will lower TCP performance in any case.
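To put numbers on these rules, the following sketch computes the bandwidth-delay product and the resulting buffer sizes for the two 1Gbps paths mentioned earlier (100ms across the Pacific, 600ms for the longest path). The function name is illustrative, and the multipliers are simply the rules of thumb given above.

```python
def bandwidth_delay_product(bits_per_second, round_trip_seconds):
    """Bytes that must be in flight to keep the path full."""
    return bits_per_second * round_trip_seconds / 8


GIGABIT = 1_000_000_000

for label, rtt in (("trans-Pacific, 100ms", 0.100), ("longest path, 600ms", 0.600)):
    bdp = bandwidth_delay_product(GIGABIT, rtt)
    print(f"{label}: BDP {bdp / 1e6:.1f}MB, "
          f"sender buffer {2 * bdp / 1e6:.1f}MB (Reno) "
          f"or {1.2 * bdp / 1e6:.1f}MB (modern algorithm), "
          f"receiver buffer a little over {bdp / 1e6:.1f}MB")
```

Even the minimum offshore path needs buffers of tens of megabytes to fill a gigabit link, far more than the defaults of older operating systems.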

Linux buffer auto-tuning

Linux 2.6 has steadily improving automated buffer tuning of both receive and send buffers. To take advantage of this, it is important to use a recent kernel. This can conflict with other system administration goals.

Manually setting the buffer size disables auto-tuning.
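In application code the manual setting is a pair of socket options. The sketch below uses a hypothetical helper, open_socket(), and an arbitrary 256KB value; it leaves the options alone by default so that the kernel's automatic tuning stays in effect.

```python
import socket


def open_socket(buffer_bytes=None):
    """Create a TCP socket, optionally with fixed send and receive buffers.

    Leaving buffer_bytes as None keeps the kernel's automatic tuning.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if buffer_bytes is not None:
        # Set before connect() or listen(), so that the window scale
        # negotiated when the connection opens suits the buffer.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


tuned = open_socket()              # preferred on a recent kernel
fixed = open_socket(256 * 1024)    # only when auto-tuning is known to fall short
granted = fixed.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("receive buffer granted:", granted)
tuned.close()
fixed.close()
```

The value granted can differ from the value requested, because the operating system applies its own limits and accounting, so it is worth reading the option back as shown.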

Linux destination route cache

Linux caches the TCP parameters in a destination cache, which records the IP MTU and TCP window. This cache reduces the slow-start period of new connections. Look in /proc/net/rt_cache for the contents of the cache.

When benchmarking TCP performance clear the route cache and the neighbour cache[3] between each test.

Path loss

Packet loss on a path can cause TCP to falsely believe that congestion is occurring. The sender enters “congestion avoidance” mode, immediately halving the packet transmission rate.

Loss on underground optical links of gigabit speeds is essentially zero: perhaps one error second per year. Loss is so low that there is no visible difference in the reported error rates between gigabit ethernet LAN PHY (which does not have forward error correction) and SDH (which does have FEC).

Loss on undersea fibre segments is much higher than loss on underground fibre segments. AARNet's routers record roughly 400 error seconds per year on cross-Pacific links. Between these error seconds loss is well under 10^-13, mainly because the random spread of errors then allows SDH forward error correction to work well.

Distressingly, megabit speed SDH links purchased from other carriers seem to have high loss. The 155Mbps STM-1 links Darwin-Adelaide and Melbourne-Hobart show 10 to 100 error seconds per year, mainly as path loss errors. It is assumed that these links are more environmentally vulnerable, although why an Adelaide-Darwin link should be more vulnerable than an Adelaide-Perth link is not clear.

Packet loss on wireless networks is greater and more random. A busy wireless LAN may have no error-free seconds whilst carrying traffic. Wireless LAN links should not be used where good network performance is desired. If there is no choice but to use a wireless LAN then the sender should select a TCP algorithm designed for paths with high media loss, such as Westwood TCP.

Configuration error and loss

Two configuration errors commonly lead to high loss links.

Firstly, network engineers may fail to calculate the loss budget for optical and microwave links. This commonly drives the receiver with too little or too much power, causing loss. Optical budgets should be calculated for all links greater than 220m with 1000BaseSX, all links greater than 2,000m using 1000BaseLX, and all other optical links. Electrical power budgets should be calculated for all G.703 links: the receiver expects 1.0V±10% peak-to-peak as the input power, the output power varies by device and DIP switch setting, and the path loss varies by coaxial cable type and length. Attenuation is typically installed on the receiving interface, as this minimises the opportunity to destroy an interface by connecting unattenuated power.

[3] The “neighbour cache” is the general form of an ARP cache.

Secondly, system administrators and network engineers often misunderstand the function of ethernet autonegotiation. Autonegotiation was designed to allow the connection of 100BaseTX interfaces to 100BaseTX interfaces or to 10BaseT interfaces.

Since older 10BaseT interfaces did not support autonegotiation, the autonegotiation protocol assumes that a neighbour which does not run the protocol has speed=10, duplex=half.

When an administrator sets a 100BaseTX interface to speed=100, duplex=full they are also unknowingly disabling the autonegotiation protocol. If the neighbouring interface continues to run the autonegotiation protocol then it will conclude that the other interface is 10BaseT and set speed=10, duplex=half. The clock on the twisted pair cable pulls the speed back up to speed=100. The duplex mismatch remains. The duplex mismatch causes a large number of “late collision” errors on the duplex=half port.

To make matters worse, autonegotiation was incorrectly implemented on some early 100BaseTX network interface cards, particularly those with the first DEC Tulip controller. To work at all at 100Mbps these simply had to have both ports explicitly set to speed=100, duplex=full. This led to many administrators having a bias to setting the speed and duplex manually, leading to widespread duplex mismatch problems.
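The rules described above can be written down as a small model. This is only a sketch of the behaviour as this paper describes it, assuming both interfaces are 100BaseTX capable; the Port type and the function names are hypothetical and are not part of any driver or switch API.

```python
from typing import NamedTuple, Optional


class Port(NamedTuple):
    autoneg: bool
    forced_speed: Optional[int] = None     # Mbps, used when autoneg is off
    forced_duplex: Optional[str] = None    # "full" or "half"


def link_settings(local, peer):
    """Return the (speed, duplex) that `local` ends up using."""
    if not local.autoneg:
        return local.forced_speed, local.forced_duplex
    if peer.autoneg:
        return 100, "full"    # best mode that both ends advertise
    # The peer is silent, so it is assumed to be 10BaseT at half duplex.
    # The clock on the cable corrects the speed; nothing corrects the duplex.
    return peer.forced_speed, "half"


def duplex_mismatch(a, b):
    """True when the two ends of the link disagree about duplex."""
    return link_settings(a, b)[1] != link_settings(b, a)[1]


auto = Port(autoneg=True)
forced = Port(autoneg=False, forced_speed=100, forced_duplex="full")
print("auto and auto:    ", duplex_mismatch(auto, auto))
print("forced and auto:  ", duplex_mismatch(forced, auto))
print("forced and forced:", duplex_mismatch(forced, forced))
```

The only failing combination is the mixed one, which is why the usual advice is to configure both ends of a link the same way.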

Poor network elements

Some packet forwarding elements in the connection's path have poor design.

These devices introduce quantised delays, leading to Ack compression and reducing the ability of the TCP transmitter to accurately measure the available bandwidth.

These devices under high load on an unrelated flow add increased jitter, or even loss, to other flows.

These devices have insufficient output buffer. A device on a network path needs to have output buffers of at least ¼ of the bandwidth-delay product.

These devices are called firewalls.

It is not surprising that firewalls often cause TCP performance issues: routers and switches use carefully designed and tested hardware in the forwarding plane, whereas firewall hardware is often nothing more complex than a server.

Avoiding round trips

Many applications incur unnecessary round trips.

For example, painting a web page involves: fetching the HTML, optionally fetching CSS, fetching each image. In the worst case each fetch is sequential and uses a new TCP connection.

Database protocols are the worst. An application which tries to run ODBC, Oracle Net or a similar protocol across a link with a long round-trip time is doomed to poor performance. Each SQL statement incurs a round-trip penalty. Programmers who expect to run applications remotely from the database need to move up the stack and use a transaction technique which allows complete transactions to make the long journey. Choices range from a traditional transaction monitor like Tuxedo to the newer service oriented architecture protocols like SOAP and REST.

Even so, remote procedure calls like SOAP still contain a round-trip time. The trick is to make each round trip do as much as possible. Traditional remote procedure calls, such as Sun's RPC, appear in the program source as identical to function calls. Whilst desirable for programmers, it also makes it easy for them to forget that the line of code could introduce a large delay. The RPC coding style is particularly prevalent in some Grid Computing applications; for example, one program issues an RPC to its neighbour to ask its neighbour to send the next block of data.
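The cost of chatty protocols is simple arithmetic, sketched below. The 200-statement transaction and the 0.5ms campus round-trip time are made-up figures for illustration; the 100ms value is this paper's minimum for an offshore server. Server processing time is ignored.

```python
def elapsed_seconds(round_trips, round_trip_time):
    """Time spent waiting on the network for strictly sequential exchanges."""
    return round_trips * round_trip_time


RTT_CAMPUS = 0.0005    # assumed 0.5ms within a campus
RTT_PACIFIC = 0.100    # minimum for applications run from Australia

statements = 200       # hypothetical transaction of 200 SQL statements

local = elapsed_seconds(statements, RTT_CAMPUS)
remote = elapsed_seconds(statements, RTT_PACIFIC)
batched = elapsed_seconds(1, RTT_PACIFIC)

print(f"one statement per round trip, on campus: {local:.1f}s")
print(f"one statement per round trip, offshore:  {remote:.1f}s")
print(f"whole transaction in one round trip:     {batched:.1f}s")
```

The same transaction that is unnoticeable on campus takes tens of seconds offshore, while sending it as a single request brings it back to one round-trip time.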

Avoiding application windows

Some applications avoid the total round-trip delay by running their own windowing, often unknowingly echoing the window algorithms of TCP or HDLC. These windowing algorithms often lack sufficient size: a linear window needs to be recorded in at least 32 bits to fill a long fat pipe.[4]

OpenSSH is one application known to have a deficient window size; it uses 64KB. The HPN-SSH patch increases the size of the SSH window to match the size of the TCP window. OpenSSH 4.7 will use a 1.2MB window.[5]
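An application window caps throughput at one window per round-trip time, whatever the speed of the link. The sketch below applies that rule to the two window sizes quoted above on a 100ms path; the function names are illustrative.

```python
def window_limited_bps(window_bytes, round_trip_seconds):
    """Best possible throughput when one window is sent per round trip."""
    return window_bytes * 8 / round_trip_seconds


def window_needed_bytes(bits_per_second, round_trip_seconds):
    """Window required to keep a path of the given speed full."""
    return bits_per_second * round_trip_seconds / 8


RTT = 0.100   # trans-Pacific round-trip time, in seconds

old_limit = window_limited_bps(64 * 1024, RTT)
new_limit = window_limited_bps(1_200_000, RTT)
needed = window_needed_bytes(1_000_000_000, RTT)

print(f"64KB window:  {old_limit / 1e6:.1f}Mbps")
print(f"1.2MB window: {new_limit / 1e6:.1f}Mbps")
print(f"window needed for 1Gbps: {needed / 1e6:.1f}MB")
```

Neither window comes close to filling a gigabit path at this distance, which is the motivation for matching the application window to the TCP window.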

Nagle's algorithm and packet delays

Nagle's algorithm[6] coalesces small packets until an Ack is received for the data already sent. This is intended to coalesce keystrokes from SSH and Telnet into fewer packets.

A side effect of Nagle's algorithm is to delay strings of three packets. For example, an application which sends a 4KB disk block will send three 1500B packets. Nagle's algorithm will delay that last packet for one round-trip time. This problem was first noticed when Sun's Network File System was modified to use TCP. Nagle's algorithm should be disabled for block protocols.

Another side effect is to delay applications which send a few bytes and then wait for a reply. Nagle's algorithm will delay these transactions for 500ms. This problem was first noticed with X Windows. Nagle's algorithm should be disabled for transaction protocols, such as SOAP, REST, RPC and ODBC.
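Disabling Nagle's algorithm is a single socket option, TCP_NODELAY, set by the application on each socket. A minimal sketch, with a hypothetical helper name:

```python
import socket


def disable_nagle(sock):
    """Send small writes immediately rather than waiting for an Ack."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
disable_nagle(sock)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", bool(nodelay))
sock.close()
```

With the option set, the application becomes responsible for not dribbling out tiny writes: each request or block should be handed to the socket in one call.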

[4] TCP's window size is only 16 bits on the wire, but RFC1323 TCP extensions for high performance extends the range of these bits to 2^30 by multiplying the transmitted window size by a pre-negotiated scaling factor.

[5] See OpenSSH Bugzilla bug 1311 Performance on high BDP networks.

[6] Nagle. RFC896 [historic]. Congestion control in IP/TCP internetworks. 1984.

Loss of round-trip time estimate

TCP maintains an estimate of the round-trip time of the path: the estimated value and its variance. The variance increases over time and decreases when the estimate is refreshed by incoming Acks. When the variance renders the estimate unusable, TCP re-enters slow start mode.

Applications which send data with idle periods find it difficult to leave TCP slow start mode, as the variance of the round-trip time estimate is always degraded.

Constant bit rate sources

TCP assumes that a sending application that is providing data to TCP more quickly than TCP can deal with it can be blocked by the operating system. The application is forced to sleep whilst TCP drains the sending buffers.

Some applications do not stop generating data when they are blocked. Particularly, data acquisition hardware provides new data at regular intervals. Blocking the related application causes unacceptable loss of data.

If the hardware itself buffers samples until the application is ready to receive them, and then hands over a flood of samples, then TCP is stressed and can fail in ways we don't yet entirely understand. Experiments suggest that once TCP re-enters slow start mode, it can never leave this mode. The application becomes near-permanently blocked.

A constant bit rate source should use UDP, with broad congestion control. This method is not yet as convenient for applications programmers as using TCP.

What the operating system wants

Operating systems work best when data is streamed to the network. There are optimisations in Windows, BSD and Linux for the case where disk data is to be streamed through a network socket.

In Linux the optimal process is: set TCP_CORK on the socket, send any header data, call sendfile() to send the data from a file, and uncork the socket to send the last packet.
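The sequence above might look like the following from Python, as a sketch. The function name stream_file() is hypothetical. socket.TCP_CORK exists only on Linux, so the sketch skips corking elsewhere, and socket.sendfile() uses the operating system's sendfile() call where one is available and falls back to ordinary sends where it is not.

```python
import socket


def stream_file(sock, header, path):
    """Send `header`, then the contents of the file at `path`, over `sock`."""
    cork = getattr(socket, "TCP_CORK", None)   # Linux only
    if cork is not None:
        # Cork: hold back partial packets until the socket is uncorked.
        sock.setsockopt(socket.IPPROTO_TCP, cork, 1)
    try:
        sock.sendall(header)
        with open(path, "rb") as source:
            # The kernel copies from the file to the socket directly.
            sock.sendfile(source)
    finally:
        if cork is not None:
            # Uncork: flush the final, possibly short, packet.
            sock.setsockopt(socket.IPPROTO_TCP, cork, 0)
```

The socket passed in should be a connected, blocking TCP socket. Corking lets the header share a packet with the start of the file instead of travelling in a small packet of its own.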

It is very easy to wander away from this optimal scenario: for example, by on-the-fly data reformatting, such as the encryption done by ssh. It might be tempting to move data reformatting to the receiver, but it is under even more load than the sender.

The place where data reformatting is done, usually regarded as too trivial to model, needs deeper consideration during the design of systems and applications.

For example, if privacy of transmitted data is needed then the data could be encrypted as it is written to disk and then simply streamed across the network using FTP or HTTP.

Load allocation

One of the flaws of TCP is that the sender makes all of the decisions whilst the receiver does all of the work.

There are a host of optimisations to assist the sender: TCP_CORK and sendfile(), scatter-gather buffers, zero-copy buffers, and so on. But the receiver has to deal with the incoming data with much less help from the operating system: there's no recvfile() option to send a socket to a file.

This may not matter. Perhaps the server is servicing hundreds of connections and it is more appropriate to place the load on the receivers. This is the decision taken by HTTP: an HTTP server is simple enough to be written as a student assignment; an HTTP client is complex.

In high-performance computing the server is often running just one connection. In this scenario moving load to the server will improve performance.

Systems design needs to consider the most appropriate point to inflict load.

Caches

Moving the data closer to the client will decrease the round-trip time, making high performance easier to achieve. “Closer” is in the sense of network topology, so the USA is closer than Japan. Some research projects cut the world into zones and assign a cache to each zone. Australian researchers should join a North America zone rather than an Asia zone.

Systems design needs to take care that a cache does not simply move the problem from server-client performance to server-cache performance.

Poor performance other than the network

Hammering the operating system

Each received packet can cause an interrupt. This can reduce the receiver to doing no more than servicing interrupts. Network interfaces have a feature called interrupt coalescing; this lowers the interrupt rate by delivering multiple packets per interrupt. Some circumstances can cause interrupt coalescing to fail: use ifconfig to compare the number of received packets to the number of interrupts.

Communicating with the operating system is done using system calls. There is considerable interest in operating systems design in improving the system call interface for networking beyond BSD sockets. Applications programmers should become familiar with these alternatives as they appear.

Operating system designers go to considerable lengths not to copy the data held in packets. Copying data not only takes a long time, but it also loads the CPU cache with useless data, causing everything else to be slower too. Applications programmers should not undo the good work done by the operating system. There are scatter-gather system calls and these should be used rather than doing a copy within the application.
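As an illustration of a scatter-gather call, the sketch below hands a header and a payload to the operating system as two separate buffers in one sendmsg() call, instead of first joining them into a new buffer inside the application. The helper name send_record() and the record format are hypothetical; socket.sendmsg() is available on Unix-like systems.

```python
import socket


def send_record(sock, header, payload):
    """Send header and payload in one system call without joining them."""
    # sendmsg() gathers the buffers in order; nothing is copied in Python.
    return sock.sendmsg([header, payload])


left, right = socket.socketpair()
sent = send_record(left, b"LEN=5\n", b"hello")
received = right.recv(64)
print(sent, received)
left.close()
right.close()
```

The alternative, sock.send(header + payload), builds a third buffer holding a copy of both, which is exactly the kind of copy this section advises against for large payloads.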

Overall system load can be reduced by doing less work. Jumbo frames send six times fewer packets and greatly reduce the load on the operating system.
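The factor of six follows from the frame sizes. The sketch below assumes the common 9000-byte jumbo MTU against the standard 1500-byte MTU (neither figure is stated in this paper) and ignores header overhead.

```python
def packets_per_second(bits_per_second, mtu_bytes):
    """Packets needed each second to fill a link with full-sized packets."""
    return bits_per_second / (mtu_bytes * 8)


GIGABIT = 1_000_000_000

standard = packets_per_second(GIGABIT, 1500)   # assumed standard MTU
jumbo = packets_per_second(GIGABIT, 9000)      # assumed jumbo MTU

print(f"standard frames: {standard:,.0f} packets per second")
print(f"jumbo frames:    {jumbo:,.0f} packets per second")
print(f"ratio:           {standard / jumbo:.0f}")
```

Every packet avoided is one less trip through the interrupt handler and protocol stack, so the saving applies at both the sender and the receiver.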

Hammering the disks

The disk bandwidth of the average server is decreasing. Disk rotational speeds have been dropping from 10,000RPM, to 7,200RPM, to 5,400RPM. The faster 10,000RPM disks can still be purchased, but these have much less capacity than slower 7,200RPM disks. The move to 5,400RPM disks is driven by the increased speed of CPUs (in a one rack unit chassis the heat from faster CPUs leaves less cooling available to the disks) and by the desire for more shallow chassis.

Rotational rate   Capacity     Transfer rate   Application
(RPM)             (1,000 MB)   (MB/s)
15,000            300          125 to 73       3.5in server
15,000            72           112 to 79       2.5in server
10,000            300          80 to 39        3.5in server
10,000            146          89 to 55        2.5in server
7,200             1,000        105 to 51       3.5in server
7,200             1,000        105 to 51       3.5in desktop
7,200             750          78 to 38        3.5in server
5,400             160          44 to 22        2.5in laptop

The disk throughput figures should be taken with a grain of salt. The www.storagereview.com website tests manufacturers' claims, and its measurements often do not match the claimed specifications.

Faulty hardware design

Computers have a design life of less than one year, so computers have less solid design than in the past. Finding a computer that performs as specified is surprisingly difficult. I have seen gigabit interfaces which cannot do a gigabit of traffic, and SATA disk controllers that fail when all channels are in use.

Benchmarking computers to see if they perform adequately doesn't seem to be an effective response. One research group finished its benchmarking to find that the winning computer was no longer available and its replacement used different chipsets.

The supply of computers is mostly a commodity business. Few suppliers will allow a system to be tentatively purchased, contingent on adequate performance. Most suppliers simply don't have the commercial margin to take that risk.

Tools

A number of testing tools exist.

ping

A simple program for determining round-trip time, using ICMP Echo Request and ICMP Echo Reply. Originally written by Mike Muuss, ping should already be installed on your operating system.

Because of the misuse of ICMP traffic, ICMP Echo traffic often follows a path with worse performance than other traffic, so ping results need to be regarded as the worst case round-trip time. Some ping responses are entirely bogus, such as those from MPLS tunnel midpoints.

traceroute ― Trace route

A program for determining the network path. Originally written by Van Jacobson, traceroute should already be installed on your operating system. Traceroute is known as tracert in Microsoft Windows, because of file naming limitations in MS-DOS.

Traceroute shows the path transmitted packets take, so a traceroute from both hosts is needed to diagnose some performance problems. Like ping, the times printed by traceroute are the worst case and the performance of a typical packet can be considerably better.

ttcp ― Test TCP

A small and light utility to send a TCP stream between cooperating hosts with command line access. Originally written by Mike Muuss and Terry Slattery, ttcp is available for most operating systems but is not present in a typical installation.

Some versions of Cisco Systems' IOS have ttcp available as a hidden command. This is useful for connectivity testing. For realistic results set ip tcp path-mtu-discovery, ip tcp timestamp and ip tcp window-size.

iperf ― Internet performance test

Iperf is the standard TCP performance measurement tool. It is a client-server application. Iperf is available for most operating systems but is not present in a typical installation.

Iperf is a single-threaded program, so asking for ongoing performance output will also add jitter to the transfer. This is particularly annoying when using iperf to simulate VoIP traffic.

NetEm ― Network emulator

NetEm is a Linux traffic control (tc) module. It can add delay and loss. Applications acceptance testing should use NetEm to add 600ms of delay so that the performance of the application with the worst case link can be assessed. If the application is intended to be accessed from a wireless network then applications acceptance testing can use NetEm to add 3% of random loss.

Web100

The Web100 kernel patch from Pittsburgh Supercomputing Centre adds instrumentation to the Linux TCP implementation. It allows all TCP variables to be examined. It comes with a useful set of tools, including a “triage” tool which allocates performance issues to the server, network or client.

NPAD ― Network path and application diagnosis

NPAD is a Java plugin from Pittsburgh Supercomputing Centre which tests network performance to NPAD servers. The servers run Web100 and they use these detailed statistics from a set of test transfers to identify common performance bottlenecks. NPAD allows many of the diagnostic gains of a Web100 kernel to be achieved without the need to run the Linux operating system and a modified kernel.

Glen Turner, Australia's Academic and Research Network. Presented 13 July 2007 at Questnet 2007.
