Myricom, Inc. 325 N. Santa Anita Ave. Arcadia CA 91006
626-821-5555 Fax: 626-821-5316 http://www.myri.com
Scalable Cluster Interconnect
Overview and Technology Roadmap
Charles L. [email protected]
Linux Superclusters Users Conference, Albuquerque, NM, 13 September 2000
What is Myrinet?
• A high-performance, cost-effective, packet communication and switching technology
– ANSI Standard (ANSI/VITA 26-1998)
– Packets follow the route specified by the source host (source routing).
– Processing power at the hosts and in the interfaces
– This architecture allows an elegant, streamlined, switching technology
• A descendant of packet communication and routing in MPPs, but commodity and open
• Used principally for scalable clusters
Myrinet Products and Applications
Myricom supplies all that is required to make a high-performance cluster from a collection of computers.
Software Host Interface
PCI Interfaces
Link Cables: SAN (to 3m), Serial (to 10m), Fiber (to 200m), Long-wave Fiber
Cut-Through Switches
In-Cabinet Clusters
Desktop Hosts
VME Single-Board-Computer Clusters
Any Network Topology
2+2 Gbits/s
Myrinet Technology “in the Large”
Sandia National Laboratory Cplant™
2,576 Compaq Alpha Personal Workstations, 400 EV-5 + 768 EV-6 + 1408 EV-6, but not all in one cluster.
Compaq Custom Systems was the integrator. The system was built in three phases, in the summers of 1998, 1999, and 2000.
Cplant originally used 16-port Myrinet switches in each 8-host cabinet. The latest increment uses a mesh variant of the M2LM-Clos64 “Network in a Box” products for switching.
(Photo adapted from http://www.cs.sandia.gov/cplant/)
Myrinet Technology “in the Small”
CSPI Quad-PowerPC VME Signal-Processing Board
This CSPI two-level-multicomputer product uses the Myricom LANai-5 chip to interface the PowerPCs to the message-passing network.
This single-width VME board includes a packet-switched Myrinet network interconnecting the 4 nodes on the board and 4 external ports with an 8-port Myrinet switch (a chip not visible in this photo).
Why Myrinet? The “selling points.”
• Low latency
– ~8.5µs today (UNIX user process to user process, fully protected, with end-to-end data integrity checking)
– The lower the latency, the wider the application span
• High data rate
– 2+2 Gb/s shipping now
– 1.28+1.28 Gb/s legacy
– Copper and fiber links
• Unlimited scalability
• Very low host-CPU utilization
– LogP overhead ~1µs
• “Peg-the-needle” PCI implementations
• High Availability features
– Self-mapping, self-healing
– Link-continuity monitoring
• Data Integrity features
– Memory and bus parity
– Link CRC
– Packet payload CRC
• More cost-effective than Gigabit Ethernet or Fibre Channel
– Cost per node < $1,500 today
– Cost per node < $1,000 soon
• Software drivers for all major platforms
– Download them from the Web
– Open source
Myrinet = ANSI/VITA 26-1998
Myrinet is defined at the Data-Link level (level 2 of the ISO reference model for computer networks) by its packet format and flow control. Think of Myrinet as the simplest packet-switched network you can devise.
Packet format (fields in order, in bytes):
– Source route: used by the switches, which strip the bytes as they are used
– Type: allows multiple protocols on one Myrinet
– Payload: any length
– CRC
http://www.myri.com/open-specs/
There are multiple Physical-level implementations.
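The packet format above can be illustrated with a minimal Python sketch of source routing, in which each switch consumes the leading route byte and forwards the remainder. This is a toy model under stated assumptions, not the ANSI/VITA 26-1998 encoding: the `Packet` class, the function names, the use of a raw output-port number as the route byte, and CRC-32 from `zlib` as the check value are all hypothetical choices made for illustration. The actual field encodings are defined in the open specification.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Packet:
    route: bytes    # one byte per switch hop, consumed front to back
    ptype: bytes    # type field (lets several protocols share one network)
    payload: bytes  # arbitrary length
    crc: int        # trailing check value (illustrative CRC-32, an assumption)


def checksum(route: bytes, ptype: bytes, payload: bytes) -> int:
    """Check value over everything that precedes the CRC field."""
    return zlib.crc32(route + ptype + payload)


def make_packet(route, ptype: bytes, payload: bytes) -> Packet:
    """The source host chooses the whole route before sending."""
    route = bytes(route)
    return Packet(route, ptype, payload, checksum(route, ptype, payload))


def switch_forward(pkt: Packet):
    """Model one switch: verify, strip the leading route byte, forward."""
    if checksum(pkt.route, pkt.ptype, pkt.payload) != pkt.crc:
        raise ValueError("CRC mismatch on arrival")
    out_port, rest = pkt.route[0], pkt.route[1:]
    # The header is now one byte shorter, so this sketch recomputes the CRC.
    return out_port, Packet(rest, pkt.ptype, pkt.payload,
                            checksum(rest, pkt.ptype, pkt.payload))


def deliver(pkt: Packet):
    """Pass the packet through switches until the route is used up."""
    ports = []
    while pkt.route:
        port, pkt = switch_forward(pkt)
        ports.append(port)
    return ports, pkt


if __name__ == "__main__":
    ports, arrived = deliver(make_packet([3, 7, 1], b"\x00\x01", b"hello"))
    print(ports, arrived)
```

Because the route travels in the packet itself, the switches in this model hold no routing tables, which is the property the slide describes as the simplest packet-switched network one can devise.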
Myrinet Switches -- “Just Technology”
16-port 2nd-generation Myrinet switch (M2LM-SW16) with 8 SAN ports and 8 LAN ports
• 20.48 Gb/s bisection data rate (!) from a single-chip 16x16 crossbar.
• Path-formation latency 100ns SAN-SAN, 200ns SAN-LAN, 300ns LAN-LAN.
• 32 Watts, 2U rack mount size, no fan.
• SNMP/Ethernet monitoring & control (out of band) + Myrinet heartbeat.
• $5K US-list. The “workhorse” 2nd-generation Myrinet switch.
2nd-Generation Myrinet “Network in a Box”
Clos network of 16 16-port switches, with 64 LAN host ports, and 64 SAN inter-switch ports.
Full (maximal) bisection data rate between the 64 host ports = 32 links (41+41 Gb/s). Data rate between the host ports and the inter-switch ports = 64 links (82+82 Gb/s).
160 Watts, 12U rack mount size
SNMP/Ethernet monitoring and control, with the full set of Myrinet high-availability features.
$40K US-list.
$40K US-list.
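The data-rate figures quoted for the 2nd-generation switch products follow from multiplying a link count by the 1.28 Gb/s per-direction link rate. The short Python sketch below reproduces that arithmetic; the function and variable names are illustrative only, and the reading of the single-switch 20.48 Gb/s figure as the sum of both directions is an inference from the numbers.

```python
LINK_RATE_GBPS = 1.28  # per direction, 2nd-generation Myrinet link


def one_way_rate(links: int, link_rate: float = LINK_RATE_GBPS) -> float:
    """Aggregate data rate in one direction across `links` parallel links."""
    return links * link_rate


# Clos64: cutting the 64 host ports into two halves leaves 32 links across the cut.
clos64_bisection = one_way_rate(64 // 2)

# All 64 host ports against all 64 inter-switch ports.
clos64_hosts_to_uplinks = one_way_rate(64)

# Single 16-port crossbar: 8 links cross the cut. The quoted 20.48 Gb/s
# matches both directions added together (an inference, see the note above).
sw16_bisection_both_ways = 2 * one_way_rate(16 // 2)

print(f"Clos64 bisection:        {clos64_bisection:.2f}+{clos64_bisection:.2f} Gb/s")
print(f"Clos64 hosts to uplinks: {clos64_hosts_to_uplinks:.2f}+{clos64_hosts_to_uplinks:.2f} Gb/s")
print(f"SW16 bisection (total):  {sw16_bisection_both_ways:.2f} Gb/s")
```

The exact products are 40.96 and 81.92 Gb/s per direction, which the slide rounds to 41 and 82.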
Myrinet Interfaces
M3M-PCI64B-2: Universal 64/32-bit, 66/33MHz Myrinet-2000-SAN/PCI Interface
(Block diagram labels: Network Interface, Fast SRAM, RISC, DMA controller & bus bridge, Packet DMA, SAN port. The diagram groups the blocks into “Parts of the LANai chip” and “PCIDMA chip.”)
From a customer: “What makes Myrinet effective for clusters is the autonomy of the interfaces, which lets us get the OS out of the way.”
Myrinet Software Interfaces
(Layer diagram; labels as shown)
• Applications
• MPI Middleware
• VIA
• TCP, UDP
• IP
• Ethernet (10/100/1000 Mb/s) and Myrinet (1280+1280 Mb/s)
• Host OS
• OS-bypass APIs (multiple host processes)
• Myrinet Control Program (MCP) (executes in the Myrinet interface)
The GM Message-Passing System
No Compromises
• Concurrent, protected, user-level access
• Reliable, ordered message delivery
• Very low CPU overhead
• Robust under network faults
• Mapping
• Segmentation and reassembly of long messages
• High-level flow control
• “Clean” API, with exception handling
• Zero-copy layering of other APIs
(Chart: GM Data-Rate Performance, Myrinet-2000 SAN Interfaces)
GM short-message latency (Myrinet-2000 interfaces): ~8.5µs (best numbers)
GM CPU overhead = 1-2µs per message (LogP)
UNIX user process to user process; fully protected; end-to-end data integrity
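To put the 1-2µs per-message overhead in perspective, the following Python sketch converts it into the fraction of one host CPU consumed at a given message rate. The message rate used here is a hypothetical load chosen for illustration, not a measurement from the slides, and the function name is likewise an assumption.

```python
def cpu_fraction(messages_per_second: float, overhead_us: float) -> float:
    """Fraction of one host CPU spent on per-message overhead."""
    return messages_per_second * overhead_us * 1e-6


# Hypothetical load: 50,000 messages per second.
RATE = 50_000
low = cpu_fraction(RATE, 1.0)   # lower end of the quoted 1-2 µs range
high = cpu_fraction(RATE, 2.0)  # upper end of the quoted range

print(f"CPU busy with messaging at {RATE} msg/s: {low:.0%} to {high:.0%}")
```

The rest of the CPU time stays available to the application, which is the practical meaning of low overhead in the LogP sense.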
GM and MPICH-over-GM Latencies
(Chart annotations: UNIX user process to user process; fully protected; end-to-end data integrity)
MPICH over GM Data Rate
(Chart annotations: UNIX user process to user process; fully protected; end-to-end data integrity)
Myrinet 2000 – Third-Generation Myrinet
This evolutionary step improves the links at the Physical level (both the performance and the “look and feel” of Myrinet) and introduces interfaces with 1.7x and 2.5x faster RISCs, but Myrinet-2000 is compatible with 2nd-generation Myrinet at the Data Link level and in the software. (Don’t try to innovate along too many dimensions at once! This is a technology push, not an architecture change.)
• SAN-1280 → SAN-2000: circuit boards & ribbon cables (3m)
• LAN → Serial: copper HSSDC, 2+2 Gb/s to 10m
• Low-cost fiber: multimode fiber, 2+2 Gb/s to 200m
Myrinet-2000
• 2+2 Gb/s links using the same Physical media and signaling as {2.5GbE, 2xFC, & 1xInfiniBand}.
– HSSDC cables to 10m and low-cost fiber to 200m.
• 64/32-bit, 66/33MHz, Myrinet/PCI interfaces (LANai 9)
– 132 MHz RISC, 1,056 MB/s local-memory data rate (achieves 8.5µs GM latency)
– In 1Q01, 200MHz RISC, 1,600 MB/s local-memory data rate (~6.5µs GM latency)
• Modular Switches
– 16-port crossbar and 32|64|128-host Clos switches, with line-card options for SAN, serial, or fiber links.
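The RISC clock and local-memory data-rate figures quoted for the interfaces can be related with simple arithmetic. The Python sketch below divides one by the other for both configurations; the dictionary and function names are illustrative. Both ratios come out at 8 bytes per clock, which is consistent with a 64-bit path to local memory running at the RISC clock rate, though that interpretation is an inference and is not stated on the slide.

```python
# (RISC clock in MHz, local-memory data rate in MB/s), as quoted on the slide.
LANAI_CONFIGS = {
    "shipping (LANai 9)": (132, 1056),
    "planned for 1Q01": (200, 1600),
}


def bytes_per_clock(clock_mhz: float, rate_mb_per_s: float) -> float:
    """Bytes moved per RISC clock cycle implied by the two quoted figures."""
    return rate_mb_per_s / clock_mhz


for name, (clock, rate) in LANAI_CONFIGS.items():
    print(f"{name}: {bytes_per_clock(clock, rate):.1f} bytes per clock")
```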
Myrinet-2000 128-Host “Network in a Box”
This family of products supports hot-plugging of line cards, fans, and dual redundant power supplies. Microcomputer monitoring (SNMP over Ethernet) provides extensive diagnostic capabilities and the management features needed for high-availability applications.
Different types of line cards have Serial, Fiber, SAN, or legacy LAN ports
(Diagram: the spine of the Clos network on the backplane, joined by the Clos spreader network to 16 line cards of 8 hosts each, giving ports to up to 128 hosts.)
The family of Myrinet-2000 switch products
• 17-slot enclosure, up to 128 hosts: 8 16-port switches on the backplane; up to 16 16-port switches on the line cards; Clos “spreader” network (128 links)
• 9-slot enclosure, up to 64 hosts: 4 16-port switches on the backplane; up to 8 16-port switches on the line cards; Clos “spreader” network (64 links)
• 5-slot enclosure, up to 32 hosts: 2 16-port switches on the backplane; up to 4 16-port switches on the line cards; … (32 links)
• 3-slot enclosure, up to 16 hosts: one line card with a 16-port switch, and one straight-through line card
Add the optional monitoring line card to provide SNMP/Ethernet monitoring and control. The monitoring line card includes a microcontroller and dual Ethernet ports. All line cards are interchangeable across the product family.
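The switch and link counts for the three Clos enclosures follow one pattern: each line-card switch uses 8 of its 16 ports for hosts and 8 for the backplane, and the backplane switches use all 16 ports for the spreader network. The Python sketch below derives the quoted counts from that pattern. The function name and the "line cards plus one" slot rule are assumptions read off the quoted 17/9/5-slot figures, not statements from the slides.

```python
SWITCH_PORTS = 16
HOSTS_PER_LINE_CARD = SWITCH_PORTS // 2  # 8 ports face hosts, 8 face the backplane


def clos_enclosure(hosts: int) -> dict:
    """Switch, link, and slot counts for a fully populated Clos enclosure."""
    if hosts not in (32, 64, 128):
        raise ValueError("this sketch covers only the 32-, 64-, and 128-host enclosures")
    line_card_switches = hosts // HOSTS_PER_LINE_CARD
    # Every line-card switch sends its remaining ports up to the backplane.
    spreader_links = line_card_switches * (SWITCH_PORTS - HOSTS_PER_LINE_CARD)
    # Backplane switches devote all 16 ports to spreader links.
    backplane_switches = spreader_links // SWITCH_PORTS
    return {
        "line_card_switches": line_card_switches,
        "backplane_switches": backplane_switches,
        "spreader_links": spreader_links,
        "slots": line_card_switches + 1,  # assumption: one slot beyond the line cards
    }


for size in (128, 64, 32):
    print(size, clos_enclosure(size))
```

The 3-slot, 16-host enclosure is not a Clos network (a single 16-port switch suffices), so the sketch leaves it out.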
Why Clos Networks?
• Maximal performance under arbitrary traffic patterns
– Minimum bisection is the largest possible
– “Rearrangeable Network” (can route any permutation)
– Network looks the same from any host (simplifies cluster management)
• Multiple paths
– All progressive routes are deadlock-free
– Use multiple paths for redundancy
– Use multiple paths to avoid hot spots (random dispersion)
• Scales well. For n hosts (minimum bisection = n/2):
– Diameter varies as log(n)
– Cost varies as n log(n)
– Modular
• Economies of sharing the power supply and microcontroller between many switches, and implementing many of the inter-switch links on circuit boards rather than cables.
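The scaling claims can be made concrete with an idealized model of a Clos network built from 16-port switches, where every switch below the top level uses half its ports downward and half upward, and top-level switches use all ports downward. The Python sketch below computes the number of levels, the worst-case switch count on a path, and the total number of switches. It is a simplified model under those assumptions; the function names and the "switches crossed" definition of diameter are illustrative choices, not taken from the slides.

```python
import math


def clos_levels(hosts: int, ports: int = 16) -> int:
    """Smallest number of switch levels that can reach `hosts` hosts."""
    half = ports // 2
    levels, capacity = 1, ports   # a single switch serves `ports` hosts
    while capacity < hosts:
        levels += 1
        capacity *= half          # each added level multiplies reach by ports/2
    return levels


def clos_diameter(hosts: int, ports: int = 16) -> int:
    """Worst-case number of switches a packet crosses (up, then back down)."""
    return 2 * clos_levels(hosts, ports) - 1


def clos_switch_count(hosts: int, ports: int = 16) -> int:
    """Switches needed: hosts/(ports/2) per lower level, hosts/ports at the top."""
    half = ports // 2
    levels = clos_levels(hosts, ports)
    return (levels - 1) * math.ceil(hosts / half) + math.ceil(hosts / ports)


for n in (16, 128, 1024, 8192):
    print(n, clos_levels(n), clos_diameter(n), clos_switch_count(n))
```

In this model each factor of 8 in host count adds one level, so the diameter grows as log(n), and since every level holds a number of switches proportional to n, the switch count grows as n log(n). For 128 hosts it gives 24 switches, matching the 16 line-card plus 8 backplane switches of the 128-host enclosure.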
Myrinet Technology – History & Roadmap
(Timeline chart, 1994 to 2003, running from past to future)
Link generations:
• 1st Generation: 0.64+0.64 Gb/s links
• 2nd Generation: 1.28+1.28 Gb/s links
• 3rd Generation (“Myrinet 2000”): 2+2 Gb/s links
• 4x InfiniBand links
Products & Features, in the order shown:
• 32-bit SBus (SPARC) interfaces, 8-port switches
• 32-bit PCI interfaces (LANai 4), 8-port switches
• SAN PHY level
• Clos “network in a box” of 8-port switches
• 16-port switches, HA features
• 64-bit PCI interfaces (LANai 7), GM message system
• Clos “network in a box” of 16-port switches
• 64-bit PCI interfaces (LANai 9), SW16, Clos128
• PCI-X, multiple virtual channels
• Gigabit Ethernet ports on Myrinet switches
• Full interoperability with 1x InfiniBand
Myrinet Technology Roadmap
• In mid-2001, PCI-X interfaces
– PCI-X is not only 2x faster than 66MHz PCI; it also allows concurrent, interleaved transactions.
• Also in mid-2001, multiple virtual channels.
– Allows “express lanes” for latency-sensitive traffic.
– Coordinated with PCI-X, because today’s PCI would otherwise get in the way of latency-sensitive transactions.
– Required later for full interoperability with InfiniBand.
• Programmable bridges/routers between {Myrinet, Gigabit Ethernet, InfiniBand} with “Myrinet inside.”
• Support or converge with InfiniBand.
– We have all of the necessary technology now for the PHY layer.
– Track and support the protocols and APIs in firmware.