Towards a Common Communication Infrastructure for Clusters and Grids
Darius Buntinas
Argonne National Laboratory
2
Overview
Cluster Computing vs. Distributed Grids
InfiniBand
– IB for WAN
IP and Ethernet
– Improving performance
Other LAN/WAN Options
Summary
3
Cluster Computing vs. Distributed Grids
Typical clusters
– Homogeneous architecture
– Dedicated environments
Compatibility is not a concern
– Clusters can use high-speed LAN networks
• E.g., VIA, Quadrics, Myrinet, InfiniBand
– And specific hardware accelerators
• E.g., protocol offload, RDMA
4
Cluster Computing vs. Distributed Grids, cont'd
Distributed environments
– Heterogeneous architecture
– Communication over WAN
– Multiple administrative domains
Compatibility is critical
– Most WAN stacks are IP/Ethernet
– Popular grid communication protocols
• TCP/IP/Ethernet
• UDP/IP/Ethernet
But what about performance?
– TCP/IP/Ethernet latency: 10s of µs
– InfiniBand latency: 1s of µs
How do you maintain high intra-cluster performance while enabling inter-cluster communication?
5
Solutions
Use one network for LAN and another for WAN
– You need to manage two networks
– Your communication library needs to be multi-network capable
• May impact performance or resource utilization
Maybe a better solution: A common network subsystem
– One network for both LAN and WAN
– Two popular network families
• InfiniBand
• Ethernet
6
InfiniBand
Initially introduced as a LAN
– Now expanding onto WAN
Issues with using IB on the WAN
– IB copper cables have limited lengths
– IB uses end-to-end credit-based flow control
7
Cable Lengths
IB copper cabling
– Signal integrity decreases with length and data rate
– IB 4x-QDR (32 Gbps) max cable length is < 1 m
Solution: optical cabling for IB
E.g., Intel Connects Cables
– Optical cables
– Electrical-to-optical converters at ends
• ~50 ps conversion delay
– Plug into existing copper-based adapters
8
End-to-End Flow Control
IB uses end-to-end credit-based flow control
– One credit corresponds to one buffer unit at receiver
– Sender can send one unit of data per credit
– Long one-way latencies limit achievable throughput (see the sketch below)
• WAN latencies are on the order of ms
Solution: Hop-by-hop flow control
– E.g., Obsidian Networks Longbow switches
– Switches have internal buffering
– Link-level flow control is performed between node and switch
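A minimal sketch of the arithmetic behind this limitation (not from the slides; the buffer-unit size and credit count are assumed values): with a fixed credit window, the sender can have at most credits × unit_size bytes in flight, so achievable throughput falls off directly with round-trip time.

/* Illustrative sketch (assumed numbers, not from the slides): with end-to-end
 * credit-based flow control, the sender may have at most CREDITS buffer units
 * in flight, so it can push at most CREDITS * UNIT_SIZE bytes per round trip
 * regardless of the link's raw bandwidth. */
#include <stdio.h>

#define UNIT_SIZE 4096   /* hypothetical receiver buffer unit (bytes) */
#define CREDITS   64     /* hypothetical credit window                */

int main(void)
{
    /* Round-trip times from LAN scale to WAN scale. */
    const double rtt_s[] = { 10e-6, 1e-3, 10e-3, 20e-3 };

    for (int i = 0; i < 4; i++) {
        double bytes_per_rtt = (double)CREDITS * UNIT_SIZE;
        double gbps = bytes_per_rtt * 8.0 / rtt_s[i] / 1e9;
        printf("RTT %7.3f ms -> credit-limited throughput <= %8.3f Gb/s\n",
               rtt_s[i] * 1e3, gbps);
    }
    return 0;
}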
9
Effect of Delay on Bandwidth
Distance (km)   Delay (µs)
1               5
2               10
20              100
200             1000
2000            10000
Source: S. Narravula et al., Performance of HPC Middleware over InfiniBand WAN, Ohio State University Technical Report OSU-CISRC-12/07-TR77, 2007.
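A rough back-of-the-envelope check (not from the report; the ~5 µs/km propagation figure and the 32 Gbps link rate are assumptions): one-way delay scales with distance, and the bandwidth-delay product gives the amount of buffering, i.e. credits, needed to keep the link full.

/* Illustrative arithmetic (assumed numbers): light in fiber propagates at
 * roughly 2e8 m/s, i.e. ~5 us of one-way delay per km, which reproduces the
 * table above.  The bandwidth-delay product then gives the buffering needed
 * to keep a 32 Gb/s link busy over that distance. */
#include <stdio.h>

int main(void)
{
    const double km[]      = { 1, 2, 20, 200, 2000 };
    const double link_gbps = 32.0;                 /* IB 4x-QDR data rate */

    for (int i = 0; i < 5; i++) {
        double delay_us = km[i] * 5.0;             /* one-way, ~5 us per km */
        double rtt_s    = 2.0 * delay_us / 1e6;
        double bdp_mb   = link_gbps * 1e9 / 8.0 * rtt_s / 1e6;
        printf("%6.0f km: delay %8.0f us, need ~%7.2f MB in flight\n",
               km[i], delay_us, bdp_mb);
    }
    return 0;
}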
10
IP and Ethernet
Traditionally
– IP/Ethernet is used for WAN
– and as a low-cost alternative on the LAN
– Software-based TCP/IP stack implementation
• Software overhead limits performance
Performance limitations
– Small 1500-byte maximum transfer unit (MTU)
– TCP/IP software stack overhead
11
Increasing Maximum Transfer Unit
Ethernet standard specifies 1500-byte MTU
– Each packet requires hardware and software processing
– This processing is considerable at gigabit speeds (see the arithmetic below)
MTU can be increased
– 9K Jumbo frames
– Reduce per-byte processing overhead
But not compatible across the WAN
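To see why the MTU matters, a small illustrative calculation (not from the slides) of the per-packet processing rate implied by line rate and frame size:

/* Illustrative arithmetic: per-packet processing rate implied by the MTU.
 * Larger frames mean fewer packets (and fewer interrupts and header touches)
 * for the same byte rate. */
#include <stdio.h>

int main(void)
{
    const double rates_gbps[] = { 1.0, 10.0 };
    const double mtus[]       = { 1500.0, 9000.0 };

    for (int r = 0; r < 2; r++)
        for (int m = 0; m < 2; m++) {
            double pkts_per_s = rates_gbps[r] * 1e9 / 8.0 / mtus[m];
            printf("%4.0f Gb/s, %4.0f-byte MTU: ~%9.0f packets/s\n",
                   rates_gbps[r], mtus[m], pkts_per_s);
        }
    return 0;
}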
12
Large Segment Offload Engine on NIC
a.k.a. Virtual MTU
Introduced by Intel and Broadcom
Allows the TCP/IP software stack to use 9K or 16K MTUs
– Reducing software overhead
Fragmentation is performed by the NIC
Standard 1500-byte MTU on the wire
– Compatible with upstream switches and routers
13
Offload Protocol Processing to NIC
Handling packets at gigabit speeds requires considerable processing
– Even with large MTU
– Uses CPU time that would otherwise be used by the application
Protocol Offload Engines (POEs)
– Perform communication processing on NIC
– Myrinet, Quadrics, IB
TCP Offload Engines (TOEs) are a specific kind of POE
– Chelsio, NetEffect
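One practical point, offered as a general observation rather than a claim from the slides: with a kernel-integrated TOE, applications typically keep using the ordinary sockets API unchanged, so a plain sender like the sketch below is what gets accelerated (the address and port are hypothetical).

/* Illustrative sketch: a plain sockets sender.  With a kernel-integrated TCP
 * Offload Engine, code like this typically runs unmodified; the protocol
 * processing simply moves from the host CPU to the NIC.  Host "10.0.0.1" and
 * port 5000 are hypothetical. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(5000);
    inet_pton(AF_INET, "10.0.0.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    const char msg[] = "hello over (possibly offloaded) TCP";
    send(fd, msg, sizeof(msg), 0);
    close(fd);
    return 0;
}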
14
TOE vs Non-TOE: Latency
Source: P. Balaji, W. Feng, and D. K. Panda, Bridging the Ethernet-Ethernot Performance Gap, IEEE Micro, Special Issue on High-Performance Interconnects, Issue 3, pp. 24-40, May/June 2006.
17
Other LAN/WAN Options
iWARP protocol offload
– Runs over IP
– Has functionality similar to TCP
– Adds RDMA
Myricom
– Myri-10G adapter
– Uses 10G Ethernet physical layer
– POE
– Can handle both TCP/IP and MX
Mellanox
– ConnectX adapter
– Has multiple ports that can be configured for IB or Ethernet
– POE
– Can handle both TCP/IP and IB
Convergence in software stack: OpenFabrics
– Supports IB and Ethernet adapters
– Provides a common API to the upper layers
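As a rough illustration of that common API (a sketch, not from the slides), the OpenFabrics verbs library lets an application enumerate RDMA devices the same way whether the adapter underneath is InfiniBand or RDMA-capable Ethernet:

/* Illustrative sketch: enumerating RDMA devices through the OpenFabrics verbs
 * API (libibverbs).  The same calls work whether the adapter is InfiniBand or
 * an RDMA-capable Ethernet/iWARP NIC.  Build with -libverbs. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) { perror("ibv_get_device_list"); return 1; }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}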