Transcript of "High Performance Data Transfer over TransPAC" (Masaki Hirabaru, NICT)

Page 1: High Performance Data Transfer over TransPAC

Masaki Hirabaru
masaki@nict.go.jp
NICT

The 3rd International HEP DataGrid Workshop
August 26, 2004
Kyungpook National Univ., Daegu, Korea

Page 2: Acknowledgements

• NICT Kashima Space Research Center: Yasuhiro Koyama, Tetsuro Kondo

• MIT Haystack Observatory: David Lapsley, Alan Whitney

• APAN Tokyo NOC

• JGN II NOC

• NICT R&D Management Department

• Indiana U. Global NOC

Page 3: Contents

• e-VLBI

• Performance Measurement

• TCP test over TransPAC

• TCP test in the Laboratory

Page 4: Motivations

• MIT Haystack – NICT Kashima e-VLBI Experiment on August 27, 2003, to measure UT1-UTC in 24 hours
  – 41.54 GB CRL => MIT: 107 Mbps (~50 mins)
  – 41.54 GB MIT => CRL: 44.6 Mbps (~120 mins)
  – RTT ~220 ms, UDP throughput 300-400 Mbps; however, TCP only ~6-8 Mbps (per session, tuned)
  – BBFTP with 5 x 10 TCP sessions to gain performance

• HUT – NICT Kashima Gigabit VLBI Experiment
  – RTT ~325 ms, UDP throughput ~70 Mbps; however, TCP only ~2 Mbps (as is), ~10 Mbps (tuned)
  – NetAnts (5 TCP sessions with FTP stream restart extension)

These applications need high-speed, real-time, reliable transfer of huge data volumes over long-haul paths.

Page 5: VLBI (Very Long Baseline Interferometry)

[Diagram: a radio signal from a star is received at two antennas, A/D-sampled against a local clock at each site, and sent over the Internet to a correlator, which measures the delay between the two signals]

• e-VLBI: geographically distributed observation, interconnecting radio antennas around the world
• Gigabit / real-time VLBI: multi-gigabit-rate sampling (data rate 512 Mbps and up)

A high bandwidth-delay product network issue
(NICT Kashima Radio Astronomy Applications Group)

Page 6: Recent Experiment of UT1-UTC Estimation between NICT Kashima and MIT Haystack (via Washington DC)

• July 30, 2004, 4am-6am JST
• Kashima was upgraded to 1G through the JGN II 10G link.
• All processing done in ~4.5 hours (last time ~21 hours)
• Average ~30 Mbps transfer by bbftp (under investigation)

Page 7: Network Diagram for e-VLBI and Test Servers

[Network diagram (distances approximate): Kashima – Koganei – Tokyo XP over JGN II at 1G (10G planned); TransPAC / JGN II across the Pacific (~9,000 km, 10G and 2.4G x2) to Los Angeles; Abilene (10G, ~4,000 km) via Indianapolis, Chicago, and Washington DC to MIT Haystack; APII/JGN II from Fukuoka / Kitakyushu (Genkai XP) to Busan, with KOREN (2.5G SONET) linking Busan, Kwangju, Taegu, Daejon, and Seoul XP; bwctl, perf, and e-VLBI test servers deployed along the path]

*Info and key exchange page needed, like: http://e2epi.internet2.edu/pipes/ami/bwctl/

e-VLBI:
– Done: 1 Gbps upgrade at Kashima
– On-going: 2.5 Gbps upgrade at Haystack
– Experiments using 1 Gbps or more
– Using real-time correlation

Page 8: APAN JP Maps (written in perl and fig2dev)

Page 9: Purposes

• Measure, analyze, and improve end-to-end performance in high bandwidth-delay product networks
  – to support networked science applications
  – to help operations find a bottleneck
  – to evaluate advanced transport protocols (e.g. Tsunami, SABUL, HSTCP, FAST, XCP, [ours])

• Improve TCP under easier conditions
  – with a single TCP stream
  – memory to memory
  – a bottleneck but no cross traffic

Goal: consume all the available bandwidth

Page 10: Path

[Diagram: Sender → access link (B1) → backbone (B2) → access link (B3) → Receiver]

a) Without bottleneck: B1 <= B2 and B1 <= B3 (the sender's access link is the slowest, so no queue builds up inside the path)

b) With bottleneck: B1 > B2 or B1 > B3 (a slower link inside the path becomes the bottleneck and a queue builds up in front of it)

Page 11: TCP on a Path with a Bottleneck

[Diagram: the queue at the bottleneck overflows and packets are lost]

• The sender may generate burst traffic.
• The sender recognizes the overflow only after a delay (< RTT).
• The bottleneck may change over time.

Page 12: Limiting the Sending Rate

a) Sender transmits at 1 Gbps → congestion → 20 Mbps throughput at the Receiver

b) Sender limited to 100 Mbps → congestion → 90 Mbps throughput at the Receiver (better!)
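
One simple way to cap a sender's rate in practice is to limit the TCP window, since throughput cannot exceed window / RTT. A minimal sketch with iperf, assuming the ~220 ms RTT quoted on the Motivations slide and a placeholder receiver host:

  # Cap TCP at roughly 100 Mbps over a ~220 ms RTT path:
  #   window = 100e6 bit/s * 0.22 s / 8 = ~2.75 MB
  iperf -c <receiver> -w 2750K -t 60    # -w sets the socket buffer (window) size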

Page 13: Web100 (http://www.web100.org)

• A kernel patch for monitoring/modifying TCP metrics in the Linux kernel
• We need to know TCP behavior to identify a problem.

• Iperf (http://dast.nlanr.net/Projects/Iperf/)
  – TCP/UDP bandwidth measurement

• bwctl (http://e2epi.internet2.edu/bwctl/)
  – Wrapper for iperf with authentication and scheduling
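
Typical invocations of these tools (a sketch only; hostnames are placeholders and exact options depend on the installed versions):

  # On the receiver:
  iperf -s -w 64M                          # TCP server with a large socket buffer
  # On the sender:
  iperf -c <receiver> -w 64M -t 60 -i 5    # 60 s TCP test, report every 5 s
  iperf -c <receiver> -u -b 900M -t 30     # UDP test at a 900 Mbps offered rate

  # bwctl wraps iperf, adding authentication and scheduling; it can be
  # invoked toward bwctl servers running on the sender and receiver:
  bwctl -s <sender> -c <receiver> -t 30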

Page 14: 1st Step: Tuning a Host with UDP

• Remove any bottlenecks on a host
  – CPU, memory, bus, OS (driver), …
• Dell PowerEdge 1650 (*not enough power)
  – Intel Xeon 1.4 GHz x1(2), Memory 1 GB
  – Intel Pro/1000 XT onboard PCI-X (133 MHz)
• Dell PowerEdge 2650
  – Intel Xeon 2.8 GHz x1(2), Memory 1 GB
  – Intel Pro/1000 XT PCI-X (133 MHz)
• Iperf UDP throughput 957 Mbps
  – GbE wire rate; headers: UDP (8B) + IP (20B) + Ethernet II (38B)
  – Linux 2.4.26 (RedHat 9) with web100
  – PE1650: TxIntDelay=0
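
A sketch of the driver setting mentioned for the PE1650 and a quick wire-rate check (interface and host names are assumptions, not from the slides):

  # Reload the e1000 driver with the transmit interrupt delay disabled:
  rmmod e1000
  modprobe e1000 TxIntDelay=0
  # Verify UDP wire rate (the receiver runs "iperf -s -u"):
  iperf -c <receiver> -u -b 1000M -t 30    # ~957 Mbps payload is the GbE wire rate at 1500 B MTU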

Page 15: 2nd Step: Tuning a Host with TCP

• Maximum socket buffer size (TCP window size)
  – net.core.wmem_max, net.core.rmem_max (64MB)
  – net.ipv4.tcp_wmem, net.ipv4.tcp_rmem (64MB)
• Driver descriptor length
  – e1000: TxDescriptors=1024, RxDescriptors=256 (default)
• Interface queue length
  – txqueuelen=100 (default)
  – net.core.netdev_max_backlog=300 (default)
• Interface queueing discipline
  – fifo (default)
• MTU
  – mtu=1500 (IP MTU)
• Iperf TCP throughput 941 Mbps
  – GbE wire rate; headers: TCP (32B) + IP (20B) + Ethernet II (38B)
  – Linux 2.4.26 (RedHat 9) with web100
• Web100 (incl. HighSpeed TCP)
  – net.ipv4.web100_no_metric_save=1 (do not store TCP metrics in the route cache)
  – net.ipv4.WAD_IFQ=1 (do not send a congestion signal on buffer full)
  – net.ipv4.web100_rbufmode=0, net.ipv4.web100_sbufmode=0 (disable auto tuning)
  – net.ipv4.WAD_FloydAIMD=1 (HighSpeed TCP)
  – net.ipv4.web100_default_wscale=7 (default)
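
A minimal sketch of applying these settings with sysctl. The 64 MB maximums come from the slide; the min/default values in tcp_rmem/tcp_wmem and the interface name are assumptions, and the web100_* / WAD_* variables exist only on a Web100-patched kernel:

  sysctl -w net.core.wmem_max=67108864
  sysctl -w net.core.rmem_max=67108864
  sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"
  sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
  sysctl -w net.ipv4.web100_no_metric_save=1
  sysctl -w net.ipv4.web100_rbufmode=0
  sysctl -w net.ipv4.web100_sbufmode=0
  sysctl -w net.ipv4.WAD_IFQ=1
  sysctl -w net.ipv4.WAD_FloydAIMD=1        # enable HighSpeed TCP
  ifconfig eth0 txqueuelen 100              # interface name assumed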

Page 16: Network Diagram for TransPAC/I2 Measurement (Oct. 2003)

[Network diagram: Kashima (0.1G) – Tokyo XP – Koganei; TransPAC (2.5G, ~9,000 km) to Los Angeles (1G x2); Abilene (10G, ~4,000 km) via Indianapolis to Washington DC; MIT Haystack and the I2 venue connected at 1G (~100 km); general and e-VLBI test servers at both ends]

Sender/receiver hosts:
• Mark5: Linux 2.4.7 (RH 7.1), P3 1.3 GHz, Memory 256 MB, GbE SK-9843
• PE1650: Linux 2.4.22 (RH 9), Xeon 1.4 GHz, Memory 1 GB, GbE Intel Pro/1000 XT

Iperf UDP ~900 Mbps (no loss)

Page 17: TransPAC/I2 #1: High Speed (60 mins)

Page 18: TransPAC/I2 #2: Reno (10 mins)

Page 19: TransPAC/I2 #3: High Speed (Win 12MB)

Page 20: Test in a laboratory – with bottleneck

[Diagram: Sender and Receiver hosts (PE 2650 and PE 1650, GbE/T) connected through an L2SW (FES12GCF) and a PacketSphere network emulator (GbE/SX); emulator settings: Bandwidth 800 Mbps, Buffer 256 KB, Delay 88 ms, Loss 0]

• #1: Reno => Reno
• #2: High Speed TCP => Reno

2*BDP = 16 MB

Page 21: Laboratory #1, #2: 800M Bottleneck

[Throughput graphs: Reno vs. HighSpeed]

Page 22: Laboratory #3, #4, #5: High Speed (Limiting)

[Throughput graphs comparing three ways of limiting the sender: Window Size (16MB), Rate Control (270 us every 10 packets), and Cwnd Clamp (95%), each combined with limited slow-start (100 or 1000)]

Page 23: How to Know When the Bottleneck Changed

• The end host probes periodically (e.g. with a packet train)
• The router notifies the end host (e.g. XCP)

Page 24: Another Approach: Enough Buffer on the Router

• At least 2 x BDP (bandwidth-delay product),
  e.g. 1 Gbps x 200 ms x 2 = 400 Mb ≈ 50 MB
• Replace fast SRAM with DRAM in order to reduce space and cost
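
A quick check of the buffer figure above (plain shell arithmetic; rate in bit/s, RTT in ms, result in bytes):

  # 2 x BDP for a 1 Gbps path with 200 ms RTT:
  echo $(( 2 * 1000000000 / 8 * 200 / 1000 ))   # 50000000 bytes, i.e. ~50 MB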

Page 25: Test in a laboratory – with bottleneck (2)

[Diagram: same setup as Page 20, but the network emulator is configured with a larger buffer; Sender and Receiver hosts (PE 2650 and PE 1650, GbE/T) connected through an L2SW (FES12GCF) and the emulator (GbE/SX); emulator settings: Bandwidth 800 Mbps, Buffer 64 MB, Delay 88 ms, Loss 0]

• #6: High Speed TCP => Reno

2*BDP = 16 MB

Page 26: Laboratory #6: 800M Bottleneck

[Throughput graph: HighSpeed]

Page 27: Report on MTU

• Increasing the MTU (packet size) results in better performance. The standard MTU is 1500 B. A 9 KB MTU is available throughout the Abilene, TransPAC, and APII backbones.

• On Aug 25, 2004, a remaining 1500 B link in Tokyo XP was upgraded to 9 KB. A 9 KB MTU is now available from Busan to Los Angeles.
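
A sketch of enabling a 9 KB MTU on a host and checking that the path carries it end to end (interface and host names are assumptions):

  ip link set dev eth0 mtu 9000        # or: ifconfig eth0 mtu 9000
  # Send a non-fragmenting ping that fills a 9000 B IP packet
  # (8972 B payload + 8 B ICMP + 20 B IP = 9000 B):
  ping -M do -s 8972 <remote-host>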

Page 28: Current and Future Plans of e-VLBI

• KOA (Korean Observatory of Astronomy) has one existing radio telescope but in a different band from ours. They are building another three radio telescopes.

• Using a dedicated light path from Europe to Asia through US is being considered.

• e-VLBI Demonstration in SuperComputing2004 (November) is being planned, interconnecting radio telescopes from Europe, US, and Japan.

• A gigabit A/D converter is ready, and a 10G version is now being implemented.

• Our performance measurement infrastructure will be merged into a framework of a Global (Network) Observatory maintained by NOC people (Internet2 piPEs, APAN CMM, and e-VLBI).

Page 29: Questions?

• See http://www2.nict.go.jp/ka/radioastro/index.html for VLBI