Case study: IBM Blue Gene/L system and InfiniBand


Page 1:

Case study

• IBM Blue Gene/L system

• InfiniBand

Page 2:

Interconnect Family share for 06/2011 Top 500 supercomputers

Interconnect Family   Count   Share %    Rmax Sum (GF)   Rpeak Sum (GF)   Processor Sum
Myrinet                   4    0.80 %          384451          524412           55152
Quadrics                  1    0.20 %           52840           63795            9968
Gigabit Ethernet        232   46.40 %        11796979        22042181         2098562
InfiniBand              206   41.20 %        22980393        32759581         2411516
Mixed                     1    0.20 %           66567           82944           13824
NUMAlink                  2    0.40 %          107961          121241           18944
SP Switch                 1    0.20 %           75760           92781           12208
Proprietary              29    5.80 %         9841862        13901082         1886982
Fat Tree                  1    0.20 %          122400          131072            1280
Custom                   23    4.60 %        13500813        15460859         1271488
Totals                  500  100.00 %     58930025.59     85179949.00         7779924

Page 3:

Overview of the IBM Blue Gene/L System Architecture

• Design objectives

• Hardware overview
  – System architecture
  – Node architecture
  – Interconnect architecture

Page 4:

Highlights

• A 64K-node, highly integrated supercomputer based on system-on-a-chip technology
  – Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL)

• Distributed memory, massively parallel processing (MPP) architecture.

• Uses the message-passing programming model (MPI); a minimal example follows this list.

• 360 Tflops peak performance
• Optimized for cost/performance
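A minimal sketch of the MPI message-passing model on a distributed-memory machine of this kind (the program is generic MPI, not BG/L-specific):

    #include <mpi.h>
    #include <stdio.h>

    /* Each process contributes one value (its rank) and a global sum is
     * computed across all processes; on a BG/L-class machine such a
     * reduction could map onto the collective network described later. */
    int main(int argc, char **argv)
    {
        int rank, size, sum;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("processes=%d, sum of ranks=%d\n", size, sum);
        MPI_Finalize();
        return 0;
    }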

Page 5:

Design objectives

• Objective 1: a 360-Tflops supercomputer
  – Earth Simulator (Japan, fastest supercomputer from 2002 to 2004): 35.86 Tflops
• Objective 2: power efficiency (a worked check follows)
  – Performance/rack = performance/watt × watt/rack
    • Watt/rack is roughly constant at around 20 kW
    • Performance/watt therefore determines performance/rack
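A hedged back-of-the-envelope check of this identity (the 64-rack split of the 64K-node system is a commonly cited BG/L figure, assumed here rather than taken from the slides):

\[
\frac{\text{perf}}{\text{rack}} = \frac{\text{perf}}{\text{watt}} \times \frac{\text{watt}}{\text{rack}},
\qquad
\frac{360\ \text{Tflops}}{64\ \text{racks}} \approx 5.6\ \text{Tflops/rack}
\;\Rightarrow\;
\frac{5.6\ \text{Tflops}}{20\ \text{kW}} \approx 0.28\ \text{Gflops/W}.
\]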

Page 6:

• Power efficiency:
  – 360 Tflops would require about 20 megawatts with conventional processors
  – Need a low-power processor design (2-10 times better power efficiency)

Page 7:

Design objectives (continued)

• Objective 3: extreme scalability
  – Optimizing for cost/performance means using low-power, less powerful processors, which in turn requires a large number of them
    • Up to 65,536 processors
  – Interconnect scalability

Page 8:

Blue Gene/L system components

Page 9:

Blue Gene/L Compute ASIC

• 2 PowerPC 440 cores with floating-point enhancements
  – 700 MHz
  – Everything of a typical superscalar processor
    • Pipelined microarchitecture with dual instruction fetch and decode, and out-of-order issue, dispatch, execution, and completion

– 1 W each through extensive power management

Page 10:

Blue Gene/L Compute ASIC

Page 11:

Memory system on a BGL node

• BG/L supports only the distributed-memory paradigm.
• No need for efficient hardware support for cache coherence on each node.
  – Coherence is enforced by software if needed.
• The two cores operate in two modes:
  – Communication coprocessor mode
    • Needs coherence, managed in system-level libraries
  – Virtual node mode
    • Memory is physically partitioned (not shared).

Page 12:

Blue Gene/L networks

• Five networks:
  – 100 Mbps Ethernet control network for diagnostics, debugging, and other management tasks
  – 1000 Mbps Ethernet for I/O
  – Three high-bandwidth, low-latency networks for data transmission and synchronization:
    • 3-D torus network for point-to-point communication
    • Collective network for global operations
    • Barrier network
• All network logic is integrated in the BG/L node ASIC
  – Memory-mapped interfaces accessible from user space

Page 13:

3-D torus network

• Supports point-to-point communication
• Link bandwidth 1.4 Gb/s; 6 bidirectional links per node (1.2 GB/s)
• 64×32×32 torus: diameter 32+16+16 = 64 hops; worst-case hardware latency 6.4 µs (the arithmetic is sketched below)
• Cut-through routing
• Adaptive routing
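A hedged sketch of the arithmetic behind these numbers (the 100 ns/hop figure is inferred from the slide's 64 hops and 6.4 µs, not stated independently):

    #include <stdio.h>

    /* Worst-case hop count in a torus: with wraparound links, the
     * farthest node along a dimension of size d is d/2 hops away. */
    static int torus_diameter(int x, int y, int z)
    {
        return x / 2 + y / 2 + z / 2;
    }

    int main(void)
    {
        int hops = torus_diameter(64, 32, 32);   /* 32 + 16 + 16 = 64 */
        double per_hop_ns = 100.0;               /* inferred: 6.4 us / 64 hops */
        printf("diameter = %d hops, worst case ~ %.1f us\n",
               hops, hops * per_hop_ns / 1000.0);
        return 0;
    }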

Page 14:

Collective network

• Binary tree topology, static routing
• Link bandwidth: 2.8 Gb/s
• Maximum hardware latency: 5 µs
• With arithmetic and logical hardware: can perform integer operations on the data
  – Efficient support for reduce, scan, global sum, and broadcast operations
  – Floating-point operations can be done with 2 passes (sketched below).
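A hedged sketch of how a floating-point sum can be built from two integer reductions, which is the idea behind the 2-pass scheme (the helper names are illustrative, and the integer reduces are stubbed for a single node rather than run on real collective hardware):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Pass 1: integer max-reduce of the exponents.
     * Pass 2: integer sum-reduce of mantissas shifted to that common
     * exponent. Stubs stand in for the network's integer reduce;
     * overflow handling across many nodes is elided. */
    static int64_t int_reduce_max(int64_t v) { return v; } /* stub */
    static int64_t int_reduce_sum(int64_t v) { return v; } /* stub */

    double two_pass_fp_sum(double local)
    {
        int e;
        frexp(local, &e);                    /* local = m * 2^e, 0.5 <= |m| < 1 */
        int emax = (int)int_reduce_max(e);   /* pass 1: common exponent */
        int64_t fixed = llround(scalbn(local, 52 - emax)); /* align mantissa */
        int64_t total = int_reduce_sum(fixed);             /* pass 2: integer sum */
        return scalbn((double)total, emax - 52);
    }

    int main(void)
    {
        printf("%g\n", two_pass_fp_sum(3.25)); /* single-node check: prints 3.25 */
        return 0;
    }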

Page 15:

Barrier network

• Hardware support for global synchronization.

• 1.5 µs for a barrier on 64K nodes.

Page 16:

IBM BlueGene/L summary

• Optimized for cost/performance
  – at the expense of limiting the range of suitable applications
  – Uses a low-power design
    • Lower frequency, system-on-a-chip
    • Great performance-per-watt metric
• Scalability support
  – Hardware support for global communication and barriers
  – Low-latency, high-bandwidth support

Page 17:

• Case 2: InfiniBand architecture
  – Specification (InfiniBand Architecture Specification, release 1.2.1, January 2008/Oct. 2006) available from the InfiniBand Trade Association (http://www.infinibandta.org)

Page 18:

• InfiniBand architecture overview

Page 19:

• InfiniBand architecture overview
  – Components:
    • Links
    • Channel adapters
    • Switches
    • Routers
  – The specification allows InfiniBand wide-area networks, but it is mostly adopted as a system/storage area network.
  – Topology:
    • Irregular
    • Regular: fat tree
  – Link speed (an encoding caveat follows this list):
    • Single data rate (SDR): 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X)
    • Double data rate (DDR): 5 Gbps (1X), 20 Gbps (4X)
    • Quad data rate (QDR): 40 Gbps (4X)
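One caveat worth adding (a well-known property of these link generations, not stated on the slide): SDR, DDR, and QDR links use 8b/10b encoding, so the usable data rate is 80% of the signaling rate:

\[
\text{data rate} = \text{signaling rate} \times \tfrac{8}{10},
\qquad \text{e.g. SDR 4X: } 10\ \text{Gbps} \times 0.8 = 8\ \text{Gbps}.
\]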

Page 20:

• Layers: somewhat similar to TCP/IP
  – Physical layer
  – Link layer
    • Error detection (CRC checksum)
    • Flow control (credit based)
    • Switching, virtual lanes (VL)
    • Forwarding table computed by the subnet manager
      – Single-path deterministic routing (not adaptive)
  – Network layer: routing across subnets
    • Not used in the cluster environment
  – Transport layer
    • Reliable/unreliable, connection/datagram
  – Verbs: the interface between adapters and the OS/users

Page 21:

• InfiniBand link-layer packet format (a C sketch of the LRH follows this list):
  • Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet.
  • Global Route Header (GRH): 40 bytes. Used for routing between subnets.
  • Base Transport Header (BTH): 12 bytes, for IBA transport.
  • Extended transport headers:
    – Reliable datagram extended transport header (RDETH): 4 bytes, only for reliable datagrams
    – Datagram extended transport header (DETH): 8 bytes
    – RDMA extended transport header (RETH): 16 bytes
    – Atomic, ACK, and Atomic ACK extended transport headers
    – Immediate data extended transport header: 4 bytes, optimized for small packets
  • Invariant CRC and variant CRC:
    – CRCs over the fields that do not change in flight and those that do, respectively.
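As a hedged illustration, the LRH fields can be written out as a C struct (bit-field layout is compiler-dependent, so real code packs and byte-swaps these fields explicitly; the widths follow the IBA LRH definition):

    #include <stdint.h>

    /* Local Route Header (8 bytes). NOTE: C bit-field ordering is
     * implementation-defined; this is a layout sketch, not a portable
     * wire-format definition. */
    struct ib_lrh {
        uint8_t  vl     : 4;   /* virtual lane */
        uint8_t  lver   : 4;   /* link version */
        uint8_t  sl     : 4;   /* service level */
        uint8_t  rsvd1  : 2;
        uint8_t  lnh    : 2;   /* link next header: what follows the LRH */
        uint16_t dlid;         /* destination LID -- switches route on this */
        uint16_t rsvd2  : 5;
        uint16_t pktlen : 11;  /* packet length in 4-byte words */
        uint16_t slid;         /* source LID */
    };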

Page 22:

• Local Route Header:

– Switching is based on the destination port address (LID)

– Multipath switching is achieved by allocating multiple LIDs to one port (a minimal lookup sketch follows)
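A hedged sketch of what LID-based switching amounts to inside a switch: a linear forwarding table indexed by the DLID (the table size and names are illustrative):

    #include <stdint.h>

    #define LFT_SIZE 49152        /* covers the unicast LID range; illustrative */

    /* One output port per destination LID. Because the lookup keys on
     * the DLID alone, assigning several LIDs to one destination port
     * lets the subnet manager steer different LIDs over different
     * paths -- the multipath trick mentioned above. */
    static uint8_t lft[LFT_SIZE];

    static uint8_t out_port(uint16_t dlid)
    {
        return lft[dlid];         /* O(1) and fully deterministic */
    }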

Page 23:

Subnet management

• Initialize the network
  – Discover the subnet topology and topology changes, compute the paths, assign LIDs, distribute the routes, and configure devices.
  – Related devices and entities:
    • Devices: channel adapters (CAs), host channel adapters (HCAs), switches, routers
    • Subnet manager (SM): discovers, configures, activates, and manages the subnet
    • A subnet management agent (SMA) in every device generates and responds to control packets (subnet management packets, SMPs) and configures local components for subnet management
    • The SM exchanges control packets with the SMAs through the subnet management interface (SMI).

Page 24:

• Subnet management phases (a path-computation sketch follows this list):
  – Topology discovery: sending directed-route SMPs to every port and processing the responses.
  – Path computation: computing valid paths between each pair of end nodes.
  – Path distribution: configuring the forwarding tables.
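A hedged sketch of the path-computation phase for one destination (a min-hop BFS over the switch graph; the graph representation, sizes, and names are illustrative, not from the specification):

    #include <string.h>

    #define MAX_SW   256
    #define MAX_PORT 36

    /* Caller fills these from topology discovery; -1 marks an
     * unconnected port. */
    int peer_sw[MAX_SW][MAX_PORT];   /* switch on the far end of each port */
    int peer_port[MAX_SW][MAX_PORT]; /* matching port on that far switch */
    int nports[MAX_SW];

    /* Record, for every switch, the output port one hop closer to dst. */
    void compute_routes(int nsw, int dst, int out_port[])
    {
        int dist[MAX_SW], queue[MAX_SW], head = 0, tail = 0;
        memset(dist, -1, sizeof(int) * nsw);
        dist[dst] = 0;
        out_port[dst] = 0;           /* port 0: the switch's own management port */
        queue[tail++] = dst;
        while (head < tail) {
            int sw = queue[head++];
            for (int p = 1; p <= nports[sw]; p++) {
                int nb = peer_sw[sw][p];
                if (nb >= 0 && dist[nb] < 0) {
                    dist[nb] = dist[sw] + 1;
                    /* nb reaches dst through its port back toward sw */
                    out_port[nb] = peer_port[sw][p];
                    queue[tail++] = nb;
                }
            }
        }
    }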

Page 25:

• Base transport header:
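The slide presumably showed the BTH field diagram; as a hedged stand-in, its 12 bytes can be sketched in C (again, bit-field layout is compiler-dependent, so this is a layout sketch rather than a wire format):

    #include <stdint.h>

    /* Base Transport Header (12 bytes), field widths per the IBA spec. */
    struct ib_bth {
        uint8_t  opcode;        /* operation type, e.g. SEND or RDMA WRITE */
        uint8_t  se     : 1;    /* solicited event */
        uint8_t  migreq : 1;    /* migration request */
        uint8_t  padcnt : 2;    /* pad count */
        uint8_t  tver   : 4;    /* transport header version */
        uint16_t pkey;          /* partition key */
        uint32_t rsvd   : 8;
        uint32_t destqp : 24;   /* destination queue pair number */
        uint32_t ackreq : 1;    /* acknowledge request */
        uint32_t rsvd2  : 7;
        uint32_t psn    : 24;   /* packet sequence number */
    };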

Page 26:

• Verbs (a resource-setup sketch follows this list)
  – The OS and users access the adapter through verbs.
  – Communication mechanism: the Queue Pair (QP)
    • Users can queue up a set of instructions that the hardware executes.
    • Each QP is a pair of queues: one for send, one for receive.
    • Users post send requests to the send queue and receive requests to the receive queue.
    • Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), and MEMORY BINDING
    • One receive operation (matching SEND)
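A hedged sketch of the setup step using libibverbs, the common userspace verbs library on Linux (error handling is abbreviated and values such as queue depths are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0]) { fprintf(stderr, "no IB device\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);            /* protection domain */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

        char *buf = malloc(4096);                         /* register memory */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);

        struct ibv_qp_init_attr attr = {
            .send_cq = cq, .recv_cq = cq,
            .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,                        /* reliable connection */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        printf("local QP number: %u\n", qp->qp_num);

        /* Moving the QP through INIT/RTR/RTS and exchanging QP numbers
         * with the remote side are elided here. */
        ibv_destroy_qp(qp); ibv_dereg_mr(mr); free(buf);
        ibv_destroy_cq(cq); ibv_dealloc_pd(pd);
        ibv_close_device(ctx); ibv_free_device_list(devs);
        return 0;
    }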

Page 27:

Page 28:

• To communicate (a post/poll sketch follows this list):
  – Make system calls to set everything up (create the QP, bind the QP to a port, attach completion queues, connect the local QP to the remote QP, register memory, etc.).
  – Post send/receive requests as user-level instructions.
  – Check for completion.
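A hedged sketch of the data path with libibverbs, assuming the resources from the setup sketch above and an already-connected remote QP: post one SEND as a user-level operation, then busy-poll the completion queue; no system call is involved:

    #include <stdint.h>
    #include <stddef.h>
    #include <infiniband/verbs.h>

    int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *mr, void *buf, size_t len)
    {
        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id = 1, .sg_list = &sge, .num_sge = 1,
            .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED,
        }, *bad;

        if (ibv_post_send(qp, &wr, &bad))   /* hand the request to the adapter */
            return -1;

        struct ibv_wc wc;                   /* spin until one completion arrives */
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;
        return wc.status == IBV_WC_SUCCESS ? 0 : -1;
    }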

Page 29:

• InfiniBand has an almost perfect software/network interface:
  – The network subsystem realizes most user-level functionality.
    • The network supports in-order delivery and fault tolerance.
    • Buffer management is pushed out to the user.
  – OS bypass: user-level access to the network interface. A few machine instructions accomplish a transmission without involving the OS.