
Page 1: Network-aware OS DOE/MICS Project Review August 18, 2003 Tom Dunigan thd@ornl.gov Matt Mathis mathis@psc.edu Brian Tierney bltierney@lbl.gov.

Network-aware OS

DOE/MICS Project Review

August 18, 2003

Tom Dunigan thd@ornl.gov, Matt Mathis mathis@psc.edu, Brian Tierney bltierney@lbl.gov


U.S. Department of Energy Office of Science LBNL/ORNL/PSC

Roadmap

• Motivation & Background
• Net100 project components
  – Web100
  – network probes & sensors
  – protocol analysis and tuning
• Results
  – TCP tuning daemon
  – Tuning experiments
• Ongoing & future research

www.net100.org

DOE-funded project (Office of Science): $2.6M over 3 years beginning 9/01; LBNL, ORNL, PSC, NCAR

Net100 project objectives (network-aware operating systems):
• measure, understand, and improve end-to-end network/application performance
• tune network protocols and applications (grid and bulk transfer)
• emphasis: TCP bulk transfer over high delay/bandwidth nets


Motivation

• Poor network application performance
  – high-bandwidth paths, but apps are slow
  – Is it the application? OS? network? … Yes
  – often need a network “wizard”

• Changing: bandwidths
  – 9.6 kb/s … 1.5 Mb/s … 45 … 100 … 1000 … ? Gb/s

• Unchanging: TCP
  – speed of light (RTT)
  – packet size (MSS/MTU) still 1500 bytes
  – TCP congestion control

• TCP is lossy by design!
  – 2x overshoot at startup, sawtooth
  – recovery rate proportional to MSS/RTT^2
  – recovery after a loss can be very slow on today’s high delay/bandwidth links, and unacceptable on tomorrow’s links:
    • 10 Gb/s cross country: recovery time > 1 hr!

[Figure: ORNL to NERSC ftp over GigE/OC12 (600 Mb/s), 80 ms RTT. Plot of instantaneous bandwidth and average bandwidth. Labels: early startup losses; linear recovery at 0.5 Mb/s; 8 Mb/s; 40 seconds.]
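The "recovery time > 1 hr" figure can be checked with a little arithmetic. The sketch below is not from the slides: it assumes standard AIMD (cwnd halves on a loss, then grows by one packet per RTT), the 1500-byte packet size quoted above, and a 100 ms cross-country RTT, which is an assumed value. The function names are illustrative only.

```python
def aimd_recovery_seconds(bandwidth_bps, rtt_s, mss_bytes=1500):
    """Seconds for standard TCP to regrow cwnd from W/2 back to W after one loss."""
    # Packets that must be in flight to fill the path (bandwidth * RTT).
    window_pkts = bandwidth_bps * rtt_s / (mss_bytes * 8)
    # Additive increase adds one packet per RTT, so W/2 round trips are needed.
    rtts_needed = window_pkts / 2
    return rtts_needed * rtt_s


def recovery_rate_bps_per_s(rtt_s, mss_bytes=1500):
    """Bandwidth regained per second during additive increase: MSS/RTT^2."""
    return mss_bytes * 8 / rtt_s ** 2


# Assumed example: 10 Gb/s path, 100 ms RTT (the RTT is an assumption, not from the slides).
minutes = aimd_recovery_seconds(10e9, 0.100) / 60
print(f"recovery time: {minutes:.0f} minutes")
print(f"recovery rate at 80 ms RTT: {recovery_rate_bps_per_s(0.080) / 1e6:.2f} Mb/s per second")
```

Under these assumptions the recovery takes a little over an hour, and because the time scales with bandwidth times RTT squared, faster or longer paths get worse quickly.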


TCP 101

• adaptable and fair
• flow-controlled by sender/receiver buffer sizes
• self-clocking with positive ACKs of in-sequence data
• sensitive to packet size (MTU) and RTT
• slow start: +1 packet for each packet ACKed (exponential)
• congestion window (cwnd): max packets that can be in flight
• packet loss: 3 dup ACKs or timeout (AIMD)
  – cut cwnd in half (Multiplicative Decrease)
  – add 1 packet to cwnd per RTT (Additive Increase)
• Workarounds:
  – parallel streams
  – non-TCP (UDP) applications
  – Net100 (no changes to applications)
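The slow-start and AIMD rules above can be sketched as a per-RTT toy model. This is a simplification for illustration, not the kernel's behavior: it assumes a loss occurs whenever cwnd exceeds a fixed path capacity, that slow start doubles cwnd each RTT, and that every loss is recovered by halving (no timeouts). The function name and parameters are hypothetical.

```python
def simulate_cwnd(rounds, capacity_pkts):
    """Return cwnd (in packets) at the start of each RTT for a toy TCP sender."""
    cwnd = 1.0
    ssthresh = float("inf")
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_pkts:
            # Loss signalled (e.g. 3 dup ACKs): multiplicative decrease.
            ssthresh = cwnd / 2
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: +1 packet per ACKed packet, i.e. doubling per RTT.
            cwnd *= 2
        else:
            # Congestion avoidance: additive increase of 1 packet per RTT.
            cwnd += 1
    return trace


trace = simulate_cwnd(rounds=120, capacity_pkts=100)
print(trace[:10])
```

The trace shows the startup overshoot past the path capacity followed by the sawtooth: each loss halves cwnd, and the climb back is linear.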


Net100 components

• Web100 Linux kernel (NSF)
  – instrumented TCP stack (IETF MIB draft)
• Path characterization
  – Network Tool Analysis Framework (NTAF)
  – both active and passive measurement tools
  – database of measurements
• TCP protocol analysis and tuning
  – simulation/emulation
    • ns
    • TCP-over-UDP (atou)
    • NISTNet
  – kernel tuning extensions
  – tuning daemon


Web100

• NSF funded (PSC/NCAR/NCSA), web100.org
• Modified Linux kernel
  – instrumented kernel to read/set TCP variables for a specific flow
  – readable: RTT, counts (bytes, pkts, retransmits, dups), state (SACKs, windowscale, cwnd, ssthresh)
  – settable: buffer sizes
  – 100+ TCP variables (IETF MIB) (/proc/web100/)
• GUI to display/modify a flow’s TCP variables in real time
• API for network-aware applications or tuning daemon
• Net100 extensions:
  – additional tuning variables and algorithms
  – event notification
  – Java bandwidth tester http://firebird.ccs.ornl.gov:7123


Network Tool Analysis Framework (NTAF)

• Configure and launch network tools
  – measure bandwidth/latency (iperf, pchar, pipechar)
  – augment tools to report Web100 data
• Collect and transform tool results
  – use NetLogger to transform results into a common format
• Save results for short-term auto-tuning and archive for later analysis
  – compare predicted to actual performance
  – measure effectiveness of tools and auto-tuning
  – provide data that can be used to predict future performance
  – invaluable for comparing tools (pathload/pchar/netest)

Net100 hosts at: LBNL, ORNL, PSC, NCAR, NERSC, SLAC, UT, CERN, Amsterdam, ANL


TCP flow visualization

- Web interface for data archive and visualization


Monitoring Tool Comparison


TCP tuning

• “enable” high speed
  – need buffer = bandwidth*RTT (autotune)
    ORNL/NERSC (80 ms, OC12) needs 6 MB
  – faster slow-start
• avoid losses
  – modified slow-start
  – reduce bursts
  – anticipate loss (ECN, Vegas?)
  – reorder threshold
• speed recovery
  – bigger MTU or “virtual MSS”
  – modified AIMD (0.5,1) (Floyd, Kelly)
  – delayed ACKs, initial window, slow-start increment
• avoid congestion collapse, be fair (?) … intranets, QoS
• Net100: ns simulation, NISTNet emulation, “almost TCP over UDP” (atou), WAD/Internet

[Figure: ns simulation, 500 Mb/s link, 80 ms RTT. Packet loss early in slow start. Standard TCP with delayed ACKs takes 10 minutes to recover!]
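The "need 6 MB" figure for ORNL/NERSC follows from buffer = bandwidth*RTT. The sketch below works that out, assuming the OC-12 line rate of 622.08 Mb/s (usable payload rate is somewhat lower), and also computes the TCP window-scale shift such a buffer requires, since the unscaled window field tops out at 65535 bytes. The function names are illustrative, not part of Net100.

```python
OC12_BPS = 622.08e6  # OC-12 line rate in bits per second


def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to keep the path full."""
    return bandwidth_bps * rtt_s / 8


def window_scale_needed(buffer_bytes):
    """Smallest window-scale shift (0..14) whose maximum window covers buffer_bytes."""
    shift = 0
    while (65535 << shift) < buffer_bytes and shift < 14:
        shift += 1
    return shift


need = bdp_bytes(OC12_BPS, 0.080)
print(f"buffer needed: {need / 1e6:.1f} MB, window scale shift: {window_scale_needed(need)}")
```

Because the window-scale option is negotiated only at connection setup, a buffer of this size cannot be granted later unless the shift was large enough from the start.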


TCP Tuning Daemon

• Work-around Daemon (WAD)
  – tune unknowing sender/receiver at startup and/or during flow
  – Web100 kernel extensions
    • pre-set windowscale to allow dynamic tuning
    • uses netlink to alert daemon of socket open/close (or poll)
    • besides existing Web100 buffer tuning, new tuning parameters and algorithms
    • knobs to disable Linux 2.4 caching, burst mgt., and sendstall
  – config file with static tuning data
    • mode specifies dynamic tuning (AIMD options, NTAF buffer size, concurrent streams)
  – daemon periodically polls NTAF for fresh tuning data
  – can do out-of-kernel tuning (e.g., Floyd)
  – written in C (also Python version)

WAD config file

[bob]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 10.5.128.74
dst_port: 0
mode: 1
sndbuf: 2000000
rcvbuf: 100000
wadai: 6
wadmd: 0.3
maxssth: 100
divide: 1
reorder: 9
sendstall: 0
delack: 0
floyd: 1
kellyai: 0
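The config entry above has an INI-like shape, so a reader experimenting with the format could load it with Python's standard configparser. This is only a sketch of how a daemon might look up tuning data for a new flow, not the WAD's actual parser. It assumes that an address of 0.0.0.0 and a port of 0 act as wildcards, which the slide does not state, and the helper names are hypothetical.

```python
import configparser

WAD_CONFIG = """\
[bob]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 10.5.128.74
dst_port: 0
mode: 1
sndbuf: 2000000
rcvbuf: 100000
wadai: 6
wadmd: 0.3
maxssth: 100
divide: 1
reorder: 9
sendstall: 0
delack: 0
floyd: 1
kellyai: 0
"""


def load_entries(text):
    """Parse the config text into {section_name: {key: string_value}}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def _matches(pattern, value, wildcard):
    # ASSUMPTION: the wildcard value matches anything; not confirmed by the slides.
    return pattern == wildcard or pattern == str(value)


def find_entry(entries, src_addr, src_port, dst_addr, dst_port):
    """Return (name, settings) of the first entry matching the flow, else None."""
    for name, e in entries.items():
        if (_matches(e["src_addr"], src_addr, "0.0.0.0")
                and _matches(e["src_port"], src_port, "0")
                and _matches(e["dst_addr"], dst_addr, "0.0.0.0")
                and _matches(e["dst_port"], dst_port, "0")):
            return name, e
    return None


entries = load_entries(WAD_CONFIG)
print(find_entry(entries, "192.0.2.10", 40000, "10.5.128.74", 5001))
```

With this lookup, any flow to 10.5.128.74 would pick up the [bob] settings (for example sndbuf 2000000 and wadai 6), while flows to other destinations would be left untuned.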


Experimental results

• Evaluating the tuning daemon in the wild
  – emphasis: bulk transfers over high delay/bandwidth nets (Internet2, ESnet)
  – tests over: 10GigE/OC192, OC48, OC12, OC3, ATM/VBR, GigE, FDDI, 100/10T, cable, ISDN, wireless (802.11b), dialup
  – tests over NISTNet testbed (speed, loss, delay)
• Various TCP tuning options
  – buffer tuning (static and dynamic/NTAF)
  – AIMD mods (including Floyd, Kelly, static, virtual MSS, and autotuning)
  – slow-start mods
  – parallel streams vs. single tuned

[Figure: testbed diagram with NISTNet host]


Buffer tuning

Classic buffer tuning
• network-challenged app gets 10 Mb/s
• same app with WAD/NTAF-tuned buffer gets 143 Mb/s

Autotuning buffers (kernel)
• Linux 2.4, Feng’s Dynamic Right-Sizing
• Net100 autotuning
  – receiver estimates RTT
  – receiver advertises window 2 times data received in RTT
  – buffer size grows dynamically to 2x bandwidth*RTT
  – separate application buffers from kernel buffers

[Figure: ORNL to PSC, OC192, 30 ms RTT]
[Figure: ORNL to PSC, OC12, 80 ms RTT]
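The Net100 autotuning rule (advertise a window of twice the data received in the last RTT) can be illustrated with a toy model. This is a sketch under stated assumptions rather than the kernel code: the sender always fills the smaller of the advertised window and the path's bandwidth*RTT, the window never shrinks, and the 64 KB starting window and the cap are hypothetical values.

```python
def simulate_autotune(path_bdp_bytes, rtts, initial_window=65536,
                      max_window=64 * 2**20):
    """Return the receiver's advertised window after each RTT."""
    window = initial_window
    trace = []
    for _ in range(rtts):
        # Data arriving in one RTT is limited by the window or by the path itself.
        received = min(window, path_bdp_bytes)
        # Advertise twice what arrived, never shrinking, up to a hard cap.
        window = min(max(window, 2 * received), max_window)
        trace.append(window)
    return trace


# Example path: about 6.2 MB of bandwidth*RTT (622.08 Mb/s at 80 ms).
trace = simulate_autotune(path_bdp_bytes=6_220_800, rtts=12)
print(trace)
```

In this model the window doubles each RTT while the receiver is the bottleneck and then settles at 2x bandwidth*RTT once the path limits the arrival rate, matching the growth described in the bullets.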


Speeding recovery

[Figure: Amsterdam-Chicago, GigE via 10GigE, 100 ms RTT; label: UDP burst]

Selectable TCP AIMD algorithms:
• Floyd HS TCP: as cwnd grows, increase AI and decrease MD; do the reverse when cwnd shrinks
• Kelly scalable TCP: use MD of 1/8 instead of 1/2 and add a percentage of cwnd (e.g., 1%) each RTT

Virtual MSS
• tune TCP’s additive increase (WAD_AI)
• add k segments per RTT during recovery
• k=6 is like a GigE jumbo frame, but:
  – interrupt rate not reduced
  – doesn’t do k segments for initial window
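The AIMD variants above differ only in how much they add per RTT and how far they back off on a loss, so they can be compared as (AI, MD) policies. The sketch below is illustrative and is not the Net100 kernel code. The Kelly policy uses the 1% and 1/8 values from the slide; the Floyd policy uses the response function and default constants from the HighSpeed TCP specification (RFC 3649) as I understand them; the function names and the recovery model are assumptions.

```python
import math


def standard(cwnd):
    """Standard TCP: add 1 packet per RTT, halve on loss."""
    return 1.0, 0.5


def virtual_mss(cwnd, k=6):
    """Virtual MSS: add k packets per RTT (k=6 resembles a 9000-byte jumbo frame)."""
    return float(k), 0.5


def kelly_scalable(cwnd, pct=0.01, md=0.125):
    """Kelly scalable TCP: add a fixed percentage of cwnd per RTT, back off by 1/8."""
    return pct * cwnd, md


def floyd_hs(cwnd, low=38, high=83000, high_decrease=0.1):
    """Floyd HS TCP: AI grows and MD shrinks as cwnd grows (RFC 3649 defaults assumed)."""
    if cwnd <= low:
        return 1.0, 0.5
    frac = min((math.log(cwnd) - math.log(low)) / (math.log(high) - math.log(low)), 1.0)
    md = (high_decrease - 0.5) * frac + 0.5
    loss_rate = 0.078 / cwnd ** 1.2          # simplified HS TCP response function
    ai = cwnd ** 2 * loss_rate * 2 * md / (2 - md)
    return ai, md


def rtts_to_recover(policy, target_cwnd):
    """RTTs needed to climb back to target_cwnd after a single loss at that size."""
    _, md = policy(target_cwnd)
    cwnd = target_cwnd * (1 - md)
    rtts = 0
    while cwnd < target_cwnd:
        ai, _ = policy(cwnd)
        cwnd += ai
        rtts += 1
    return rtts


for policy in (standard, virtual_mss, kelly_scalable, floyd_hs):
    print(policy.__name__, rtts_to_recover(policy, 1000))
```

Even at a modest window of 1000 packets the modified policies recover in a fraction of the round trips standard TCP needs, and the gap widens as the window grows.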


WAD tuning

Modified slow-start and AI
• often losses in slow-start
• WAD-tuned Floyd slow-start and fixed AI (6)

WAD-tuned AIMD and slow-start
• parallel streams AIMD (1/(2k),k)
  – exploit TCP’s fairness
• WAD-tuned single stream (0.125,4)
• WAD-tuned single stream + Floyd slow-start

[Figure: ORNL to NERSC, OC12, 80 ms RTT]
[Figure: ORNL to CERN, OC12, 150 ms RTT]


Workaround: parallel streams

• Takes advantage of TCP’s fairness
• Faster startup, k buffers
• Faster recovery
  – often only 1 stream loses a packet
  – MD: 1/(2k) rather than 1/2
  – AI: k times faster linear phase
• BUT
  – requires rewrite of applications
  – how many streams? Buffer size?
• GridFTP, bbftp, psocket lib

[Figure: Alice and Bob sharing a link; clever Alice uses 3 streams. Bad girl ...]
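The MD and AI figures for parallel streams follow from simple bookkeeping, sketched below. It assumes k streams share the aggregate window equally and that a loss hits exactly one of them; the function names are illustrative. For k=4 the result is the (0.125, 4) pair used for the WAD-tuned single stream.

```python
def equivalent_aimd(k):
    """(MD, AI) for a single flow that mimics the aggregate of k parallel streams."""
    # One of k equal streams halving removes 1/(2k) of the total;
    # k streams each adding 1 packet per RTT add k packets in total.
    return 1 / (2 * k), k


def aggregate_after_loss(total_cwnd, k):
    """Aggregate window after exactly one of k equal streams halves its cwnd."""
    per_stream = total_cwnd / k
    return total_cwnd - per_stream / 2


md, ai = equivalent_aimd(4)
print(md, ai, aggregate_after_loss(1000, 4))
```

This is the "equivalence" tuning idea: a single stream configured with these values backs off and recovers like the bundle, without the application rewrite that parallel streams require.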


GridFTP tuning

Can a tuned single stream compete with parallel streams? Mostly not with “equivalence” tuning, but sometimes…. Parallel streams have a slow-start advantage.

WAD can divide the buffer among concurrent flows: fairer/faster? Tests inconclusive so far…. Testing on the real Internet is problematic.

Is there a “congestion metric”? Per unit of time?

Flow       Mb/s   congestion   re-xmits
untuned     28        4           30
tuned       74        5          295
parallel    52       30          401

Flow       Mb/s   congestion   re-xmits
untuned     25        7           25
tuned       67        2          420
parallel    88       17          440

Data/plots from Web100 tracer
Buffers: 64K I/O, 4MB TCP


Ongoing Net100 research

– more user-friendly WAD
– invited to submit Web100/Net100 mods to Linux 2.6
– port of Web100 to FreeBSD (Web100 team)
  • base for AIX, SGI, Solaris, OSF
– port to ORNL Cray X1
  • Linux network front-end
  • added Net100 kernel, 4x improvement in wide-area TCP!
– TCP Vegas
  • Vegas avoids loss (if RTT increasing, Vegas backs off)
  • can be configured to compete with standard TCP (Feng)
  • Caltech’s FAST
– comparison with other “work arounds”
  • parallel streams
  • non-TCP (SABUL, FOBS, TSUNAMI, RBUDP, SCTP)
– additional accelerants
  • slow-start initial/increment
  • reorder resilience
  • delayed ACKs


TCP tuning for other OS’s

Reorder threshold
• seeing more out-of-order packets (future: multipath?)
• WAD tunes a bigger reorder threshold for the path
  – 40x improvement!
• Linux 2.4 does a good job already
  – adjusts and caches reorder threshold
  – “undo” congestion avoidance

Delayed ACKs
• WAD could turn off delayed ACKs: 2x improvement in recovery rate and slow-start
• Linux 2.4 already turns off delayed ACKs for initial slow-start

[Figure: ns simulation, 500 Mb/s link, 80 ms RTT. Packet loss early in slow-start. Standard TCP with delayed ACKs takes 10 minutes to recover! NOTE: aggressive static AIMD (Floyd pre-tune)]

[Figure: LBL to ORNL (using our TCP-over-UDP): dup3 case had 289 retransmits, but all were unneeded!]
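The delayed-ACK effect can be estimated with a rough model. It assumes cwnd grows by one packet per ACK received, so acknowledging every second segment halves the growth per RTT in both slow start and the linear phase; it also assumes 1500-byte packets and the 500 Mb/s, 80 ms path from the ns simulation. This is an order-of-magnitude sketch with hypothetical function names, not a reproduction of the simulation.

```python
def slow_start_rtts(target_pkts, segments_per_ack=1):
    """RTTs for slow start to grow cwnd from 1 packet to target_pkts."""
    growth = 1 + 1 / segments_per_ack   # 2x per RTT, or 1.5x with delayed ACKs
    cwnd, rtts = 1.0, 0
    while cwnd < target_pkts:
        cwnd *= growth
        rtts += 1
    return rtts


def linear_growth_seconds(from_pkts, to_pkts, rtt_s, segments_per_ack=1):
    """Seconds of additive increase needed to grow cwnd from from_pkts to to_pkts."""
    per_rtt = 1 / segments_per_ack      # 1 packet per RTT, or 0.5 with delayed ACKs
    return (to_pkts - from_pkts) / per_rtt * rtt_s


window = 500e6 * 0.080 / (1500 * 8)     # packets needed to fill the simulated path
print(slow_start_rtts(window), slow_start_rtts(window, segments_per_ack=2))
print(linear_growth_seconds(0, window, 0.080, segments_per_ack=2) / 60, "minutes")
```

If a loss very early in slow start forces almost the whole window to be rebuilt linearly, this model gives roughly nine minutes with delayed ACKs and half that without, which is in line with the figure note and the 2x claim.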


Planned Net100 research

– improve ease of use (WAD WAND)
– analyze effectiveness/fairness of current tuning options
  • simulation
  • emulation
  • on the net (systematic tests)
– NTAF probes: characterizing a path to tune a flow
  • integration with SCNM
  • monitoring applications with Web100
  • latest probe tools
– additional tuning algorithms
  • identify non-congestive loss, ECN?
  • tuning for dedicated path (lambda/10GigE)
– parallel/multipath selection/tuning
– WAD-to-WAD tuning
– WAD caching
– SGI/Linux
– jumbo frame experiments… the quest for bigger and bigger MTUs


Interactions

• Scientific applications
  – SciDAC supernova and global climate
  – Data grids (CERN, SLAC)
  – Radio telescopes (MIT)
• Middleware
  – Globus/GridFTP
  – HSI/HPSS
• Network measurement
  – Internet2 end-to-end
  – PingER (Cottrell)
  – Claffy/Dovrolis pathload
  – netest (Guojun)
  – SCNM
• Protocol research
  – Dynamic Right-Sizing (Feng)
  – HS TCP (Floyd)
  – Scalable TCP (Kelly)
  – TCP Vegas (Feng, Low)
  – Tsunami/SABUL/FOBS/RBUDP
  – parallel streams (Hacker)
• OS vendors
  – Linux
  – IBM AIX/Linux
  – Cray X1
• Talks/papers/software: www.net100.org


Summary

• Novel approaches
  – non-invasive dynamic tuning of legacy applications
  – out-of-kernel tuning
  – using TCP to tune TCP
  – tuning per flow/destination based on recent path metrics or policy (QoS)
• Effective evaluation framework
  – protocol analysis and tuning
  – network/application/OS debugging
  – path characterization tools, archive, and visualization tools
• Performance improvements
  – WAD tuned:
    • buffers 10x
    • AIMD 2x to 10x
    • delayed ACK 2x
    • slow-start 3x
    • reorder 40x
• Timely: needed for science on today’s and tomorrow’s networks