Page 1:

InfiniBand

• Was originally designed as a “system area network”: connecting CPUs and I/O devices.
  – A larger role: replacing all I/O standards for data centers (PCI backplane, Fibre Channel, and Ethernet): everything connects through InfiniBand. Not long haul yet.
  – A lesser role: low latency, high bandwidth, low overhead interconnect for commercial data centers between servers and storage.

• Can form local area or even wide area networks.

• Has become the de facto interconnect for high performance clusters (100+ systems in the Top 500 supercomputer list).

Page 2:

• InfiniBand architecture
  – Specification (InfiniBand Architecture Specification Release 1.3, March 3, 2015) available from the InfiniBand Trade Association (http://www.infinibandta.org)

Page 3:

• InfiniBand architecture overview

Page 4:

• InfiniBand architecture overview
  – Components:
    • Links, channel adapters, switches, routers
  – The specification allows InfiniBand wide area networks, but it is mostly adopted as a system/storage area network.
    • Cabling specification?
  – Topology:
    • Irregular
    • Regular: fat tree, hypercube, etc.

Page 5:

• InfiniBand architecture overview
  – Physical layer: cabling standard?
  – Link speed (signal rate):
    • Single data rate (SDR): 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X)
    • Double data rate (DDR): 5 Gbps (1X), 20 Gbps (4X), 60 Gbps (12X)
    • Quad data rate (QDR): 10 Gbps (1X), 40 Gbps (4X), 120 Gbps (12X)
    • Fourteen data rate (FDR): 14 Gbps (1X), 56 Gbps (4X), 168 Gbps (12X)
    • Enhanced data rate (EDR): 25 Gbps (1X), 100 Gbps (4X), 300 Gbps (12X)
  – 8b/10b encoding in SDR, DDR, and QDR
    • Maps each 8-bit symbol to a 10-bit symbol to maintain DC balance (a similar number of 0’s and 1’s in every 20 bits, no more than five 1’s or 0’s in a row, etc.)
  – 64b/66b encoding in FDR and EDR (a short effective-rate calculation follows below)
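
The encoding overhead explains the gap between signal rate and usable bandwidth: 8b/10b keeps 80% of the signal rate, while 64b/66b keeps about 97%. The short C program below is my own illustration, using the rounded per-lane rates from the list above (the exact FDR lane rate is 14.0625 Gbps); it folds the encoding efficiency into an effective data rate.

    #include <stdio.h>

    /* Effective (usable) data rate = per-lane signal rate * lanes * coding efficiency.
     * 8b/10b (SDR/DDR/QDR) carries 8 data bits per 10 line bits -> 0.8 efficiency.
     * 64b/66b (FDR/EDR) carries 64 data bits per 66 line bits -> ~0.97 efficiency. */
    struct ib_rate {
        const char *name;
        double lane_gbps;     /* per-lane signal rate in Gbps */
        double efficiency;    /* encoding efficiency */
    };

    int main(void) {
        struct ib_rate rates[] = {
            { "SDR", 2.5,  8.0 / 10.0 },
            { "DDR", 5.0,  8.0 / 10.0 },
            { "QDR", 10.0, 8.0 / 10.0 },
            { "FDR", 14.0, 64.0 / 66.0 },
            { "EDR", 25.0, 64.0 / 66.0 },
        };
        int lanes[] = { 1, 4, 12 };

        for (unsigned i = 0; i < sizeof(rates) / sizeof(rates[0]); i++) {
            for (unsigned j = 0; j < sizeof(lanes) / sizeof(lanes[0]); j++) {
                double signal = rates[i].lane_gbps * lanes[j];
                double usable = signal * rates[i].efficiency;
                printf("%s %2dX: signal %6.1f Gbps, usable %6.1f Gbps\n",
                       rates[i].name, lanes[j], signal, usable);
            }
        }
        return 0;
    }

For example, QDR 4X signals at 40 Gbps but delivers roughly 32 Gbps of payload bits after 8b/10b, which is one reason FDR and EDR moved to 64b/66b.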

Page 6:

InfiniBand link speed

InfiniBand roadmap from the InfiniBand Trade Association: http://www.infinibandta.org/content/pages.php?pg=technology_overview

Page 7:

• Layer architecture: somewhat similar to TCP/IP
  – Physical layer: relatively simple
  – Link layer
    • Error detection (CRC checksums)
    • Flow control (credit based)
    • Switching, virtual lanes (VL)
    • Forwarding tables computed by the subnet manager (not adaptive/adaptive)
  – Network layer: across subnets
    • Not used in the cluster environment
  – Transport layer
    • Reliable/unreliable, connection/datagram
  – Verbs: the interface between adapters and the OS/users

Page 8:

• Link layer packet format:
  – Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet (field layout sketched below).
  – Global Route Header (GRH): 40 bytes. Used for routing between subnets.
  – Base Transport Header (BTH): 12 bytes, for IBA transport.
  – Reliable Datagram Extended Transport Header (RDETH): 4 bytes, only for reliable datagram.
  – Datagram Extended Transport Header (DETH): 8 bytes.
  – RDMA Extended Transport Header (RETH): 16 bytes.
  – Atomic, ACK, and Atomic ACK extended transport headers.
  – Immediate Data Extended Transport Header: 4 bytes, optimized for small packets.
  – Invalidate.
  – Invariant CRC and Variant CRC: CRCs over the fields that do not change in flight and the fields that may change, respectively.
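
The 8-byte LRH is a good example of how tightly these headers are packed. The struct below follows the commonly published field layout (virtual lane, link version, service level, link next header, destination LID, packet length in 4-byte words, source LID); treat it only as an illustration of how the 64 bits are spent, since C bit-field ordering is implementation defined and real code packs the header by hand.

    #include <stdint.h>

    /* Illustrative layout of the 8-byte Local Route Header (LRH).
     * Field widths follow the usual IBA description; bit-field order in C is
     * implementation defined, so this is not a wire-accurate definition. */
    struct lrh {
        uint8_t  vl      : 4;   /* virtual lane the packet travels on */
        uint8_t  lver    : 4;   /* link version */
        uint8_t  sl      : 4;   /* service level */
        uint8_t  rsvd1   : 2;   /* reserved */
        uint8_t  lnh     : 2;   /* link next header: what follows the LRH */
        uint16_t dlid;          /* destination LID; switches forward on this */
        uint16_t rsvd2   : 5;   /* reserved */
        uint16_t pkt_len : 11;  /* packet length in 4-byte words */
        uint16_t slid;          /* source LID */
    };                          /* 4+4+4+2+2+16+5+11+16 = 64 bits = 8 bytes */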

Page 9:

• Local Route Header:

– Switching based on the destination port address (LID)

– Multipath switching by allocating multiple LIDs to one port

Page 10:

• Local Route Header:
  – Switching based on the destination port address (LID).
    • Forwarding table entry: (LID, outgoing port) – see the sketch below.
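
Because forwarding is a pure destination-LID lookup, a switch’s linear forwarding table is conceptually just an array indexed by LID. A toy sketch (the names lft_set and lft_lookup are mine, not from the specification):

    #include <stdint.h>
    #include <string.h>

    #define MAX_LID   1024      /* toy table size; the real unicast LID space is 16 bits */
    #define PORT_NONE 0xff      /* marker for "no route to this LID" */

    /* Linear forwarding table: entry i holds the output port for destination LID i. */
    struct lft {
        uint8_t out_port[MAX_LID];
    };

    static void lft_init(struct lft *t) {
        memset(t->out_port, PORT_NONE, sizeof(t->out_port));
    }

    /* The subnet manager writes entries like this when it distributes routes. */
    static void lft_set(struct lft *t, uint16_t lid, uint8_t port) {
        t->out_port[lid] = port;
    }

    /* A switch forwards by reading the DLID out of the LRH and indexing the table. */
    static uint8_t lft_lookup(const struct lft *t, uint16_t dlid) {
        return t->out_port[dlid];
    }

The subnet manager fills this table during path distribution; the switch itself never computes routes.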

Page 11:

• Local Route Header:
  – Multipath switching by allocating multiple LIDs to one port; see the previous example (and the LMC sketch below).

• GRH: uses the same address format as IPv6 (16-byte addresses).
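
The “multiple LIDs per port” mechanism is the LID Mask Control (LMC) value: a port answers to 2^LMC consecutive LIDs starting at its base LID, and the subnet manager may install a different path for each of them. A minimal sketch of how a sender could spread traffic over those alternatives (the helper pick_dlid is hypothetical):

    #include <stdint.h>

    /* With LID Mask Control (LMC) = k, a port owns the 2^k LIDs
     * base_lid, base_lid + 1, ..., base_lid + 2^k - 1, and the subnet manager
     * may install a different path for each of them. A sender picks among the
     * paths simply by choosing which of those destination LIDs to address. */
    static uint16_t pick_dlid(uint16_t base_lid, uint8_t lmc, uint32_t path_index) {
        uint16_t npaths = (uint16_t)1 << lmc;      /* number of alternative LIDs */
        return (uint16_t)(base_lid + (path_index % npaths));
    }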

Page 12:

Subnet management

• Discover the subnet topology and topology changes, compute the paths, assign LIDs, distribute the routes, configure devices
  – Not well defined in the specification
  – The forwarding tables must be computed such that all devices in the network can be reached.

• References
  – A. Bermudez, R. Casado, F. J. Quiles, T. M. Pinkston, J. Duato, “Evaluation of a Subnet Management Mechanism for InfiniBand Networks”, ICPP 2003.
  – A. Vishnu, A. R. Mamidala, H. Jin, D. K. Panda, “Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM”, Workshop on System Management Tools on Large Scale Parallel Systems, held in conjunction with IPDPS 2005.

Page 13:

• InfiniBand devices and entities related to subnet management
  – Devices: Channel Adapters (CA) / Host Channel Adapters, switches, routers
  – Subnet manager (SM): discovers, configures, activates, and manages the subnet
  – A subnet management agent (SMA) in every device generates and responds to control packets (subnet management packets, SMPs) and configures local components for subnet management
  – The SM exchanges control packets with SMAs through the subnet management interface (SMI).

Page 14:

Page 15:

• Subnet management packets (SMP)
  – 256 bytes of data
  – Use the unreliable datagram service on the management virtual lane (VL 15)
  – Two routing schemes
    • LID routed: uses the lookup tables for forwarding
      – Used after the subnet is set up, e.g., to check the status of an active port
    • Direct routed: carries the output port for each intermediate hop
      – Used for subnet discovery, before the subnet is set up

Page 16:

• Subnet management packets (SMP)
  – Define the operation to be performed by the SM
  – Get: get information about a CA, switch, or port
  – Set: set an attribute of a port (e.g., its LID)
  – GetResp: the response to a Get
  – Trap: informs the SM about the state of a local node
    • An SMA keeps sending Trap messages until it receives a TrapRepress packet.
    • Topology information can be obtained by a sweep and by periodic Traps.

Page 17:

• Subnet management phases:
  – Topology discovery: sending direct routed SMPs to every port and processing the responses (a conceptual sketch follows below).
  – Path computation: computing valid paths between each pair of end nodes.
  – Path distribution: configuring the forwarding tables.
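
Topology discovery is essentially a breadth-first sweep: probe every port of every known node and record the direct-routed path (the sequence of output ports) that reaches whatever sits on the other side. The sketch below simulates that idea over an in-memory topology; it is a conceptual illustration, not OpenSM code, and all the structures (topo, struct route, discover) are made up for the example. Reading topo[n][p] stands in for sending a direct-routed Get SMP out of port p of node n and parsing the response.

    #include <stdio.h>
    #include <string.h>

    #define MAX_NODES 16
    #define MAX_PORTS 8
    #define MAX_HOPS  16

    /* Toy topology: topo[n][p] is the node reached through port p of node n,
     * or -1 if the port is not connected. */
    static int topo[MAX_NODES][MAX_PORTS];

    struct route {
        int hops;
        int port[MAX_HOPS];     /* output port to take at each hop */
    };

    /* Breadth-first sweep from the SM's node: every time a probe reaches a node
     * we have not seen, remember the direct route that got us there and keep
     * extending it by one hop ("SMPs with increasing depth"). */
    static void discover(int sm_node, int nnodes, struct route *routes, int *seen) {
        int queue[MAX_NODES], head = 0, tail = 0;

        memset(seen, 0, nnodes * sizeof(int));
        seen[sm_node] = 1;
        routes[sm_node].hops = 0;
        queue[tail++] = sm_node;

        while (head < tail) {
            int n = queue[head++];
            for (int p = 0; p < MAX_PORTS; p++) {
                int peer = topo[n][p];
                if (peer < 0 || seen[peer])
                    continue;
                seen[peer] = 1;
                routes[peer] = routes[n];                /* copy the route so far */
                routes[peer].port[routes[peer].hops++] = p;
                queue[tail++] = peer;
            }
        }
    }

    int main(void) {
        struct route routes[MAX_NODES];
        int seen[MAX_NODES];

        memset(topo, -1, sizeof(topo));
        /* node 0 --port1-- node 1 --port2-- node 2 */
        topo[0][1] = 1; topo[1][0] = 0;
        topo[1][2] = 2; topo[2][0] = 1;

        discover(0, 3, routes, seen);
        for (int n = 0; n < 3; n++)
            printf("node %d reached in %d hop(s)\n", n, routes[n].hops);
        return 0;
    }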

Page 18:

• Subnet discovery
  – The SM starts by sending a direct routed Get SMP to its local node. Upon receiving the response, the SM sends SMPs with increasing path depth.

Page 19:

• Path computation:
  – Compute paths between all pairs of nodes
  – For irregular topologies:
    • Up/down routing does not work directly
      – It needs information about both the incoming interface and the destination, while InfiniBand forwards based only on the destination
      – Potential solution:
        » Find all possible paths
        » At each node, remove the paths that would traverse an up link after a down link (the up*/down* restriction; see the legality check sketched below)
        » Find one output port for each destination
      – Other solutions: destination renaming
  – For fat tree topologies:
    • What is the best that can be achieved (optimal routing) is also not clear.
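
The deadlock-freedom condition behind the up*/down* restriction mentioned above is simple to state: along any path, once a link has been traversed in the down direction, no later link may be traversed up. A tiny checker for that rule (conceptual; the up/down labels would come from the spanning-tree orientation the SM computes):

    #include <stdbool.h>
    #include <stddef.h>

    enum dir { UP, DOWN };

    /* Up/down legality: a path is allowed only if it never goes UP again
     * after it has gone DOWN, i.e. it looks like UP...UP DOWN...DOWN. */
    static bool updown_legal(const enum dir *path, size_t len) {
        bool gone_down = false;
        for (size_t i = 0; i < len; i++) {
            if (path[i] == DOWN)
                gone_down = true;
            else if (gone_down)      /* UP after a DOWN: forbidden transition */
                return false;
        }
        return true;
    }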

Page 20:

• Path distribution:
  – Ordering issue: the network may be in an inconsistent state while it is partially updated, which may result in deadlock during this period.
    • Traditional solution: stop injecting data packets for a period of time
    • Deadlock-free reconfiguration schemes
  – How to do this correctly, effectively, and incrementally is still open.

Page 21:

• Base Transport Header (BTH):

Page 22:

• Verbs
  – The OS/users access the adapter through verbs
  – Communication mechanism: Queue Pair (QP)
    • Users can queue up a set of instructions that the hardware executes.
    • Each QP is a pair of queues: one for send, one for receive.
    • Users post send requests to the send queue and receive requests to the receive queue.
    • Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), MEMORY BINDING
    • One receive operation (matching SEND)

Page 23:

Page 24:

Page 25:

• Queue Pair:
  – The status of the result of an operation (send/receive) is stored in the completion queue.
  – Send/receive queues can bind to different completion queues.

• Related system level verbs (sketched below):
  – Open QP, create completion queue, open HCA, open protection domain, register memory, allocate memory window, etc.

• User level verbs:
  – Post send/receive requests, poll for completion.
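
On Linux, the de facto user-level implementation of the verbs interface is libibverbs, and the system-level verbs on this slide map onto a short sequence of calls. The code below is a minimal sketch assuming a single HCA, with most error handling omitted: it only allocates the local resources, and a real program must still move the QP through its state transitions and exchange QP numbers and LIDs with the peer before it can communicate.

    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void) {
        /* Open the first HCA found. */
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0]) { fprintf(stderr, "no IB device\n"); return 1; }
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        if (!ctx) return 1;

        /* Protection domain: the resources below are only usable together within it. */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* Register a buffer so the HCA may DMA into/out of it. */
        char *buf = malloc(4096);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_REMOTE_READ);

        /* Completion queue: work completions for posted requests land here. */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

        /* Queue pair: one send queue and one receive queue, both bound to cq. */
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                         .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,      /* reliable connection */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);

        printf("created QP number 0x%x\n", qp ? qp->qp_num : 0);

        /* A real program would now move the QP through INIT/RTR/RTS and exchange
         * addressing information with the peer before posting work requests. */
        if (qp) ibv_destroy_qp(qp);
        if (cq) ibv_destroy_cq(cq);
        if (mr) ibv_dereg_mr(mr);
        free(buf);
        if (pd) ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }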

Page 26:

• To communicate:
  – Make system calls to set everything up (open the QP, bind the QP to a port, bind completion queues, connect the local QP to the remote QP, register memory, etc.).
  – Post send/receive requests.
  – Check completion (see the fragment below).
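
Continuing the libibverbs sketch from the previous slide: once the QPs are connected, “post a request, then check completion” becomes an ibv_post_send (or ibv_post_recv) followed by polling the completion queue. The fragment below is a hedged sketch assuming qp, cq, mr, and buf were created as above and that the QP is already connected to its peer; the helper name send_and_wait is mine.

    #include <stdint.h>
    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Post one SEND carrying the registered buffer, then busy-poll the CQ until
     * its completion shows up. Assumes qp is already connected (RTS) and that
     * buf/mr/cq come from the setup sketch on the previous slide. */
    static int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                             struct ibv_mr *mr, void *buf, size_t len) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,          /* local key from memory registration */
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,             /* user cookie, comes back in the completion */
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,   /* RDMA write/read would use IBV_WR_RDMA_* */
            .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad = NULL;

        if (ibv_post_send(qp, &wr, &bad))
            return -1;

        /* Poll the completion queue: this is the "check completion" step. */
        struct ibv_wc wc;
        int n;
        do {
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);

        return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }

The receiving side would symmetrically post a receive request with ibv_post_recv before the SEND arrives, then poll its own completion queue.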

Page 27:

• InfiniBand has an almost perfect software/network interface (Chien ’94 paper):
  – The network subsystem realizes all user-level functionality.
  – User-level access to the network interface: a few machine instructions accomplish the transmission task without involving the OS.
  – The network supports in-order delivery and fault tolerance.
  – Buffer management is pushed out to the user.

Page 28:

• Mellanox product brief: “Switch-2 Virtual Protocol Interconnect Optimized for SDN”

Page 29:

• Mellanox product brief: “Switch-2 Virtual Protocol Interconnect Optimized for SDN”
  – Virtual protocol interconnect
    • Automatically senses InfiniBand, Ethernet, Fibre Channel, and data center bridging
    • Flexible port configuration
      – 36 IB FDR ports or 40/56 GbE ports
      – 64 10 GbE ports
      – 24 2/4/8 Gb FC ports
  – SDN support
    • Complete support for OpenFlow and subnet management
    • Remotely configurable routing tables, overlays, and control plane

Page 30:

Some claims (InfiniBand advantages)

• InfiniBand and Ethernet can carry each other’s traffic (EoIB and IBoE), and both can carry TCP/IP

• InfiniBand is in general faster
  – 10G Ethernet vs. IB DDR (20G) and QDR (40G)
  – 40G Ethernet vs. IB EDR (100G)

• InfiniBand is no longer hard to use

• InfiniBand is optimized for fat trees

• InfiniBand still has more features than Ethernet
  – Fault tolerance, multicast, etc.