Network-aware OS

Transcript of Network-aware OS

Page 1: Network-aware OS

Network-aware OS

ESCC Miami

February 5, 2003

Tom Dunigan [email protected], Matt Mathis [email protected], Brian Tierney [email protected]

Page 2: Network-aware OS

Roadmap
• Motivation
• Net100 project overview
  – Web100
  – network probes & sensors
  – protocol analysis and tuning
• Year 1 results
  – a TCP tuning daemon
  – tuning experiments
• Year 2
  – ongoing research, Web100 update (Mathis)

www.net100.org

DOE-funded project (Office of Science): $1M/yr for 3 years beginning 9/01; partners LBL, ORNL, PSC, NCAR.

Net100 project objectives (network-aware operating systems):
• measure, understand, and improve end-to-end network/application performance
• tune network protocols and applications (grid and bulk transfer)
• first-year emphasis: TCP bulk transfer over high delay/bandwidth nets

Page 3: Network-aware OS

Motivation

• Poor network application performance
  – high-bandwidth paths, but applications are slow
  – Is it the application? The OS? The network? … Yes
  – often need a network “wizard”
• Changing: bandwidths
  – 9.6 Kb/s … 1.5 Mb/s … 45 … 100 … 1000 … ? Gb/s
• Unchanging: TCP
  – speed of light (RTT)
  – MTU (still 1500 bytes)
  – TCP congestion avoidance
• TCP is lossy by design!
  – 2x overshoot at startup, sawtooth
  – recovery after a loss can be very slow on today’s high delay/bandwidth links
  – non-congestive loss: throughput bounded by C·MSS/(RTT·√p) for loss rate p
  – recovery proportional to MSS/RTT² (worked numbers below)
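To make those two bounds concrete, here is a back-of-the-envelope sketch in Python. The loss rates and the C ≈ 1.22 constant are illustrative assumptions taken from the well-known Mathis et al. throughput model, not numbers from the slides:

    from math import sqrt

    MSS = 1460                # bytes per segment (1500-byte MTU minus headers)
    RTT = 0.080               # ORNL-NERSC round-trip time from the slides, seconds
    C = 1.22                  # constant from the Mathis et al. throughput model

    # Loss bound: sustainable throughput <= C * MSS / (RTT * sqrt(p)).
    for p in (1e-4, 1e-6, 1e-8):
        bw = C * MSS * 8 / (RTT * sqrt(p))
        print(f"loss rate {p:g}: throughput bound ~ {bw / 1e6:.0f} Mb/s")

    # Linear recovery: cwnd grows ~1 MSS per RTT, so the ramp rate is
    # MSS/RTT^2 -- on this path only ~1.8 Mb/s regained per second.
    print(f"recovery ramp: {MSS * 8 / RTT**2 / 1e6:.1f} Mb/s per second")

Even a 10⁻⁶ loss rate caps a 1500-byte-MTU flow on this path near 178 Mb/s, which is why the later slides push a bigger (virtual) MSS and modified AIMD.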

[Plot: ORNL to NERSC ftp, GigE/OC12, 80 ms RTT; instantaneous vs. average bandwidth (~8 Mb/s); early startup losses; linear recovery at 0.5 Mb/s!]

Page 4: Network-aware OS

TCP tuning

• “enable” high speed
  – need buffer = bandwidth × RTT; autotune (ORNL/NERSC, 80 ms OC12, needs ~6 MB; sizing sketch below)
  – faster slow-start
• avoid losses
  – modified slow-start
  – reduce bursts
  – anticipate loss (ECN, Vegas?)
  – reorder threshold
• speed recovery
  – bigger MTU or “virtual MSS”
  – modified AIMD (0.5, 1)
  – delayed ACKs, initial window, slow-start increment
• avoid congestion collapse
• be fair (?) … intranets, QoS

ns simulation: 500 Mb/s link, 80 ms RTT. Packet loss early in slow start; standard TCP with delayed ACKs takes 10 minutes to recover!
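The buffer = bandwidth × RTT rule above is easy to apply per path. A minimal sketch, assuming the ORNL/NERSC OC12 numbers from this slide and a hypothetical endpoint; the buffers must be requested before connect() so a large enough window scale is negotiated, and the kernel may clamp them to its configured maximums:

    import socket

    def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> int:
        """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
        return int(bandwidth_bps * rtt_s / 8)

    buf = bdp_bytes(622e6, 0.080)      # OC12, 80 ms RTT: ~6.2 MB, the "6 MB" above
    print(f"target buffer: {buf / 1e6:.1f} MB")

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf)   # before connect()
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf)
    # s.connect(("bulk-transfer.example.org", 5001))         # hypothetical peer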

Page 5: Network-aware OS

Net100 components for tuning
• TCP protocol analysis
  – simulation/emulation
  – kernel tuning extensions
• Web100 Linux kernel (NSF) www.web100.org
  – instrumented TCP stack (IETF MIB draft)
  – 100+ variables per flow (/proc/web100); a hypothetical reader sketch follows this list
  – socket open/close event notification
  – API and tools for tracing and tuning, e.g., bandwidth tester: http://firebird.ccs.ornl.gov:7123
• Path characterization
  – Network Tuning and Analysis Framework (NTAF)
  – both active and passive measurement (iperf, pipechar)
  – Web100 data augments probe data
  – schedules probes and distributes/archives results
  – database of measurements
  – NTAF/Net100 hosts at PSC, NCAR, LBL, ORNL, NERSC, CERN, UT, SLAC
• TCP tuning daemon
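Since the instruments live under /proc/web100, a tuning daemon can sample them per flow. The sketch below is purely illustrative: the real layout and encoding are defined by the Web100 kernel patch and are normally accessed through its library and tools, so the per-connection directory of ASCII name/value files assumed here is hypothetical:

    import os

    WEB100_ROOT = "/proc/web100"   # hypothetical ASCII layout, for illustration only

    def list_flows():
        """Yield one identifier per instrumented connection."""
        for entry in os.listdir(WEB100_ROOT):
            if entry.isdigit():    # assumed: one numeric directory per flow
                yield entry

    def read_var(cid, name):
        """Read a single per-flow instrument, e.g. a MIB-style 'CurCwnd'."""
        with open(os.path.join(WEB100_ROOT, cid, name)) as f:
            return f.read().strip()

    for cid in list_flows():
        print(cid, read_var(cid, "CurCwnd"))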

Page 6: Network-aware OS

TCP Tuning Daemon

• Work-around Daemon (WAD)
  – tunes an unknowing sender/receiver at startup and/or during the flow
  – Web100 kernel extensions:
    • pre-sets the window scale to allow dynamic tuning
    • uses netlink to alert the daemon of socket open/close (or polls)
    • besides the existing Web100 buffer tuning, new tuning options via WAD_* variables
    • knobs to disable Linux 2.4 caching, burst management, and sendstall
  – config file with static tuning data
    • mode specifies dynamic tuning (Floyd AIMD, NTAF buffer size, concurrent streams)
  – daemon periodically polls the NTAF for fresh tuning data
  – written in C (also a Python version)

WAD config file:

    [bob]
    src_addr: 0.0.0.0
    src_port: 0
    dst_addr: 10.5.128.74
    dst_port: 0
    mode: 1
    sndbuf: 2000000
    rcvbuf: 100000
    wadai: 6
    wadmd: 0.3
    maxssth: 100
    divide: 1
    reorder: 9
    sendstall: 0
    delack: 0
    floyd: 1
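The stanza above is close enough to INI syntax that Python's standard configparser reads it directly with its default "key: value" delimiters; a minimal sketch of the static-tuning side (the wad.conf filename is assumed):

    from configparser import ConfigParser

    cfg = ConfigParser()
    cfg.read("wad.conf")                      # assumed name for the file above

    for flow in cfg.sections():               # one section, e.g. [bob], per flow
        dst = cfg.get(flow, "dst_addr")
        sndbuf = cfg.getint(flow, "sndbuf")
        wadai = cfg.getint(flow, "wadai")     # additive increase, segments/RTT
        wadmd = cfg.getfloat(flow, "wadmd")   # multiplicative decrease factor
        print(f"{flow}: dst={dst} sndbuf={sndbuf} AIMD=({wadmd}, {wadai})")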

Page 7: Network-aware OS

Experimental results (year 1)

• Evaluating the tuning daemon in the wild
  – emphasis: bulk transfers over high delay/bandwidth nets (Internet2, ESnet)
  – tests over: 10GigE, OC48, OC12, OC3, ATM/VBR, GigE, FDDI, 100/10T, cable, ISDN, wireless (802.11b), dialup
  – tests over NistNet 100T testbed
• Various TCP tuning options
  – buffer tuning
  – AIMD mods (including Floyd, both in-kernel and in the WAD)
  – slow-start mods
  – parallel streams vs. a single tuned stream
• Results are anecdotal
  – more systematic testing is ongoing
  – your mileage may vary …

Network professionals on a closed course. Do not attempt this at home.

Page 8: Network-aware OS

WAD tuning results

Classic buffer tuning
• ORNL to PSC, OC12, 80 ms RTT
• network-challenged application gets 10 Mb/s
• same application with a WAD/NTAF-tuned buffer gets 143 Mb/s

Virtual MSS
• tune TCP’s additive increase (WAD_AI)
• add k segments per RTT during recovery
• k = 6 acts like a GigE jumbo frame (toy comparison below), but:
  – the interrupt rate is not reduced
  – it doesn’t send k segments for the initial window
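To see why a larger additive increase acts like a bigger MSS, here is a toy recovery comparison; pure AIMD arithmetic with illustrative window sizes, ignoring slow-start, bursts, and queueing:

    def rtts_to_recover(cwnd, ai):
        """RTTs to climb back after cwnd is halved, growing `ai` segments/RTT."""
        lost = cwnd - cwnd // 2
        return -(-lost // ai)                 # ceiling division

    W = 4100                                  # ~6 MB window in 1460-byte segments
    for ai in (1, 6):                         # standard TCP vs. WAD_AI = 6
        rtts = rtts_to_recover(W, ai)
        print(f"AI={ai}: {rtts} RTTs (~{rtts * 0.080:.0f} s at 80 ms RTT)")

On an 80 ms path, k = 6 cuts the post-loss climb from roughly 164 s to 27 s, the growth-rate effect of a jumbo frame without the interrupt-rate benefit noted above.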

Page 9: Network-aware OS

Tuning around Linux (2.4) TCP

• Tunable ssthresh caching
• Tunable “sendstall” (TXQUEUELEN)

[Plot: Amsterdam to Chicago, GigE via 10GigE, 100 ms RTT, ~600 Mb/s; sendstalls and a UDP event visible; Floyd AIMD vs. standard AIMD]

Floyd AIMD: as cwnd grows, increase AI and decrease MD; do the reverse as cwnd shrinks (toy sketch below).

Added to the Net100 kernel and to the WAD (WAD-tunable).
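A toy sketch of that idea: AI and MD become functions of the current cwnd rather than the fixed classic values. The breakpoints below are invented for illustration and are not Floyd's published table:

    def floyd_params(cwnd):
        """Illustrative cwnd-dependent AIMD parameters (invented breakpoints)."""
        if cwnd < 100:            # small window: classic TCP behaviour
            return 1, 0.50        # (AI segments/RTT, MD fraction)
        if cwnd < 1000:
            return 4, 0.25
        return 8, 0.125           # big window: grow faster, back off less

    cwnd = 38
    for rtt_round in range(1, 201):
        ai, md = floyd_params(cwnd)
        cwnd += ai                # additive increase each RTT
        if rtt_round % 50 == 0:   # pretend a loss every 50 RTTs
            cwnd = int(cwnd * (1 - md))
    print("cwnd after 200 RTTs:", cwnd)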

Page 10: Network-aware OS

WAD tuning

Modified slow-start and AI
• ORNL to NERSC, OC12, 80 ms RTT
• often losses in slow-start
• WAD-tuned Floyd slow-start and fixed AI (6)

WAD-tuned AIMD and slow-start
• ORNL to CERN, OC12, 150 ms RTT
• k parallel streams behave like AIMD (1/(2k), k)
• WAD-tuned single stream (0.125, 4); see the check below
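The (0.125, 4) pair follows from the equivalence stated above: k standard streams gain about k segments per RTT in aggregate, while a loss halves only one stream, cutting the aggregate by roughly 1/(2k). A one-line check for the k = 4 case assumed here:

    k = 4
    ai, md = k, 1 / (2 * k)       # aggregate increase k seg/RTT; decrease 1/(2k)
    print(f"{k} parallel streams ~ one stream with AIMD ({md}, {ai})")  # (0.125, 4)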

Page 11: Network-aware OS

GridFTP tuning

Can a tuned single stream compete with parallel streams? Mostly not with “equivalence” tuning, but sometimes… Parallel streams have a slow-start advantage.

The WAD can divide a buffer among concurrent flows (fairer? faster?). Tests are inconclusive so far… Testing on the real Internet is problematic.

Is there a “congestion metric”? Per unit of time? (One candidate is sketched below.)

Flow       Mb/s   congestion   re-xmits
untuned    28     4            30
tuned      74     5            295
parallel   52     30           401

untuned    25     7            25
tuned      67     2            420
parallel   88     17           440

Data/plots from Web100 tracer. Buffers: 64 KB I/O, 4 MB TCP (untuned 64 KB TCP: 8 Mb/s, 200 s).
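One way to answer the "per unit of time" question is to normalize the event counts by delivered throughput. The sketch below does this for the first table, assuming equal-length runs; it is one candidate metric offered purely for illustration, not a Net100 result:

    runs = {                      # flow: (throughput Mb/s, congestion events, re-xmits)
        "untuned":  (28, 4, 30),
        "tuned":    (74, 5, 295),
        "parallel": (52, 30, 401),
    }
    for name, (mbps, events, rexmits) in runs.items():
        # Events per Mb/s of goodput: penalizes flows that cause more
        # congestion or ship more duplicate data for the same useful rate.
        print(f"{name:9s} {rexmits / mbps:5.1f} re-xmits/Mb/s, "
              f"{events / mbps:5.2f} congestion events/Mb/s")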

Page 12: Network-aware OS

Ongoing/Planned Net100 research (year 2)

– analyze effectiveness/fairness of current tuning options
  • simulation
  • emulation
  • on the net (systematic tests)
– NTAF probes: characterizing a path to tune a flow
  • router data (passive)
  • monitoring applications with Web100
  • latest probe tools
– additional tuning algorithms
  • Vegas
  • slow-start increment, reorder resilience, delayed ACKs
  • non-TCP (SABUL, FOBS, TSUNAMI, ?)
  • identify non-congestive loss, ECN?
– parallel/multipath selection/tuning
– WAD-to-WAD tuning
– jumbo frame experiments … the quest for bigger and bigger MTUs
– more user-friendly, usable accelerants
– port to the Cray X1 network front-end
– ports to other OSes

Page 13: Network-aware OS

Future TCP tuning

Reorder threshold
• seeing more out-of-order packets
• WAD tunes a bigger reorder threshold for the path: 40x improvement!
• Linux 2.4 already does a good job:
  – adjusts and caches the reorder threshold
  – can “undo” congestion avoidance

Delayed ACKs
• the WAD could turn off delayed ACKs: 2x improvement in recovery rate and slow-start
• Linux 2.4 already turns off delayed ACKs for the initial slow-start

ns simulation: 500 Mb/s link, 80 ms RTT. Packet loss early in slow-start; standard TCP with delayed ACKs takes 10 minutes to recover! NOTE: aggressive static AIMD (Floyd pre-tune).

LBL to ORNL (using our TCP-over-UDP): the dup3 case had 289 retransmits, but all were unneeded!
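Why the threshold matters: a sender fast-retransmits after dupthresh duplicate ACKs, so a path that merely reorders packets makes a low threshold fire spurious retransmits, exactly the dup3 behavior above. A toy sketch with an invented arrival pattern (each batch's first segment arrives last):

    def spurious_retransmits(arrivals, dupthresh):
        """Count fast retransmits triggered purely by reordering (no loss)."""
        delivered, expected, dupacks, retrans = set(), 0, 0, 0
        for seq in arrivals:
            delivered.add(seq)
            if seq == expected:
                while expected in delivered:   # cumulative ACK advances
                    expected += 1
                dupacks = 0
            else:
                dupacks += 1                   # out-of-order arrival: dup ACK
                if dupacks == dupthresh:
                    retrans += 1               # spurious: nothing was lost
        return retrans

    pattern = []
    for base in range(0, 80, 8):               # ten 8-segment batches
        batch = list(range(base, base + 8))
        batch.append(batch.pop(0))             # first segment delayed to last
        pattern += batch

    for t in (3, 9):                           # standard dupthresh vs. 'reorder: 9'
        print(f"dupthresh={t}: {spurious_retransmits(pattern, t)} spurious retransmits")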

Page 14: Network-aware OS

Summary

• Novel approaches
  – non-invasive dynamic tuning of legacy applications
  – using TCP to tune TCP (Web100)
  – tuning on a per-flow/per-path basis
• Effective evaluation framework
  – protocol analysis and tuning plus net/app/OS debugging
  – out-of-kernel tuning
• Beneficial interactions
  – TCP protocols (Floyd, Wu Feng (DRS), Web100, parallel/non-TCP)
  – path characterization research (SciDAC, CAIDA, PingER, pathrate, SCNM)
  – scientific applications and data grids (SciDAC, CERN)

• Performance improvements

www.net100.org