

Interconnection Network for Tightly Coupled Accelerators Architecture

Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato

Center for Computational Sciences, University of Tsukuba, Japan

Hot Interconnects 21, Aug 22, 2013


What is “Tightly Coupled Accelerators (TCA)”?

Concept:
- Direct connection between accelerators (GPUs) across nodes
  - Eliminates extra memory copies to the host
  - Reduces latency and improves strong scaling for scientific applications with small data sizes
- Uses PCIe as the communication link between accelerators across nodes
  - PCIe simply transfers packets, so direct device-to-device (P2P) communication is available.
- PEACH2: PCI Express Adaptive Communication Hub ver. 2
  - To configure TCA, each node is connected to the other nodes through a PEACH2 chip.



Design policy of PEACH2

- Implemented on an FPGA with four PCIe Gen2 IP blocks
  - Altera Stratix IV GX
  - Allows prototyping and flexible enhancement
- Sufficient communication bandwidth
  - PCI Express Gen2 x8 for each port
- Sophisticated DMA controller
  - Chaining DMA
- Latency reduction
  - Hardwired logic
  - Low-overhead routing mechanism
    - Efficient address mapping in the PCIe address space using otherwise unused bits
    - Simple comparator to decide the output port (see the sketch after this list)

PEACH2 is not only a proof-of-concept implementation; it will also be used for production runs in a GPU cluster.
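The routing idea can be illustrated with a minimal sketch. Purely as an assumption for illustration, the destination node ID is taken from high, otherwise unused PCIe address bits (here bits 35:32), and a plain comparison picks the output port; the actual bit layout and port-selection logic inside the FPGA are not published here.

```c
/* Minimal sketch of address-based routing: the node ID assumed to live in
 * bits 35:32 of the PCIe address is compared against this node's ID.
 * (Port S, toward the other ring, is omitted from this sketch.) */
#include <stdint.h>

enum port { PORT_N, PORT_E, PORT_W, PORT_S };

/* assumed layout: bits 35:32 of the PCIe address carry the destination node ID */
static inline unsigned dest_id(uint64_t pcie_addr) { return (pcie_addr >> 32) & 0xF; }

enum port route(uint64_t pcie_addr, unsigned my_id)
{
    unsigned dest = dest_id(pcie_addr);

    if (dest == my_id)
        return PORT_N;                        /* local delivery to host/GPUs          */
    return (dest > my_id) ? PORT_E : PORT_W;  /* one comparator picks the ring direction */
}
```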



TCA node structure example

- PEACH2 can access all GPUs
  - NVIDIA Kepler architecture + CUDA 5.0 “GPUDirect Support for RDMA”
  - Performance across QPI is quite bad => supported only for GPU0 and GPU1
- Three nodes are connected using PEACH2


[Figure: TCA node block diagram. Two Xeon E5 CPUs connected by QPI; four GPUs (NVIDIA K20/K20X, Kepler architecture, PCIe Gen2 x16 each), an IB HCA (Gen3 x8), and the PEACH2 board (Gen2 x8 host link plus Gen2 x8 external ports) all sit in a single PCIe address space.]


Overview of PEACH2 chip

- Fully compatible with the PCIe Gen2 spec.
  - Root and Endpoint must be paired according to the PCIe spec.
- Port N: connected to the host and GPUs
- Ports E and W: form the ring topology
- Port S: connected to the other ring
  - Selectable between Root Complex and Endpoint
- All ports except Port N are write-only
  - Instead, a “proxy write” on the remote node realizes a pseudo-read (see the sketch after the figure note below).

CPU & GPU side (Endpoint)

To PEACH2 (Root Complex /

Endpoint)

To PE

AC

H2

(Endpoint)

To PE

AC

H2

(Root C

omplex)

NIOS (CPU)

Memory

DMAC

Routing function

Port N

Port E

Port W

Port S
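Because the inter-node ports only carry writes, a remote read has to be emulated. The sketch below is a hedged illustration of the proxy-write idea under assumed names: tca_write(), REQ_MAILBOX, and the flag layout are hypothetical stand-ins, not the actual PEACH2 driver interface.

```c
/* Hedged sketch of a pseudo-read built from writes only ("proxy write"). */
#include <stddef.h>
#include <stdint.h>

#define REQ_MAILBOX 0x0   /* assumed fixed mailbox offset on the remote node (hypothetical) */

/* hypothetical: forward a PCIe write of `len` bytes to `remote_addr` on `node` */
extern void tca_write(int node, uint64_t remote_addr, const void *src, size_t len);

struct read_req { uint64_t remote_src; uint64_t local_dst; uint32_t len; };

static volatile uint32_t reply_flag;   /* set by the remote node's final write */

void pseudo_read(int remote_node, uint64_t remote_src, void *local_dst, uint32_t len)
{
    struct read_req req = { remote_src, (uint64_t)(uintptr_t)local_dst, len };

    reply_flag = 0;
    /* 1. write the request into the remote node's mailbox */
    tca_write(remote_node, REQ_MAILBOX, &req, sizeof req);

    /* 2. the remote side performs a "proxy write": it writes the requested
     *    data back into local_dst, then writes 1 into reply_flag            */
    while (reply_flag == 0)
        ;   /* 3. spin until the data has been pushed back */
}
```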



Communication by PEACH2

- PIO
  - The CPU can store data to a remote node directly using mmap (a minimal sketch follows this list).
- DMA
  - Chaining mode
    - DMA requests are prepared as DMA descriptors chained in host memory (see the figure and descriptor sketch below).
    - Multiple DMA transactions are executed automatically by the hardware, following the descriptor chain.
  - Register mode
    - Up to 16 requests
    - No overhead for transferring descriptors from the host
  - Block-stride transfer function
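A minimal sketch of the PIO path, assuming the driver exposes a window onto remote memory through an mmap-able device file; the device path "/dev/peach2", the window size, and the offset are hypothetical.

```c
/* Hedged sketch of PIO communication: mmap a window that the PEACH2 driver
 * maps onto a remote node's memory, so an ordinary store becomes a remote
 * PCIe write.  Device path and window parameters are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/peach2", O_RDWR);            /* hypothetical device node */
    if (fd < 0)
        return 1;

    /* map a 4 KiB window of the remote node's memory */
    volatile uint64_t *remote =
        mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (remote == MAP_FAILED)
        return 1;

    remote[0] = 42;      /* this store is forwarded over PEACH2 as a PCIe write */

    munmap((void *)remote, 4096);
    close(fd);
    return 0;
}
```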


[Figure: chaining DMA. Descriptor 0 through Descriptor (n-1) are linked in host memory; each descriptor holds Source, Destination, Length, Flags, and a Next pointer to the following descriptor.]
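A hedged C sketch of the descriptor layout in the figure and of the single "kick" that starts a prepared chain; the field widths, flag encoding, and kick register are assumptions, not the real PEACH2 register map.

```c
/* Hypothetical descriptor layout matching the figure above. */
#include <stdint.h>

struct dma_desc {
    uint64_t src;       /* Source: PCIe address to read from                  */
    uint64_t dst;       /* Destination: PCIe address to write to              */
    uint32_t len;       /* Length in bytes                                    */
    uint32_t flags;     /* Flags: e.g. block-stride mode, interrupt on done   */
    uint64_t next;      /* Next: host address of the next descriptor, 0 = end */
};

/* Chaining mode: build the descriptors once in host memory, then hand the
 * address of the first one to the DMAC; the hardware walks the chain. */
static inline void kick_chain(volatile uint64_t *dmac_kick_reg, struct dma_desc *head)
{
    *dmac_kick_reg = (uint64_t)(uintptr_t)head;   /* one store starts all transfers */
}
```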


PEACH2 board (Production version for HA-PACS/TCA)

- PCI Express Gen2 x8 peripheral board
- Compatible with the PCIe spec.


[Photos: top view and side view of the PEACH2 board.]


PEACH2 board (Production version for HA-PACS/TCA)


- Main board + sub board
- Most of the design operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)

[Annotated photo: PCI Express x8 card edge, power supplies for the various voltages, DDR3 SDRAM, FPGA (Altera Stratix IV 530GX), PCIe x16 cable connector, PCIe x8 cable connector.]


Performance Evaluation

- Environment: 8-node GPU cluster (TCAMINI)
  - CPU: Intel Xeon E5 (Sandy Bridge-EP) 2.6 GHz, 2 sockets
  - MB: SuperMicro X9DRG-QF
  - Memory: DDR3 128 GB
  - OS: CentOS 6.3 (kernel 2.6.32-279.22.1.el6.x86_64)
  - GPU: NVIDIA K20 (GDDR5 5 GB) x1
  - CUDA: 5.0, NVIDIA-Linux-x86_64-310.32
  - PEACH2 board: Altera Stratix IV 530GX
  - MPI: MVAPICH2 1.9 with IB FDR10



Evaluation items

- Ping-pong performance between nodes
  - Latency and bandwidth
  - Written as an application (a skeleton of the measurement loop follows this list)
  - Compared with MVAPICH2 1.9 (with CUDA support) for GPU-to-GPU communication
- To let another device access GPU memory, the “GPUDirect Support for RDMA” API in CUDA 5 is used.
  - A special driver named “TCA p2p driver” was developed to enable the memory mapping.
  - A “PEACH2 driver” that controls the board was also developed.
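A hedged skeleton of how such a ping-pong measurement is typically structured, reporting one-way latency as half the averaged round trip; tca_put() and tca_wait() are hypothetical stand-ins for the actual driver interface.

```c
/* Skeleton of a ping-pong latency test under assumed transfer primitives. */
#include <stddef.h>
#include <stdio.h>
#include <sys/time.h>

extern void tca_put(int dest_node, const void *buf, size_t len);  /* hypothetical */
extern void tca_wait(void);            /* hypothetical: wait for incoming data */

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Returns the one-way latency in microseconds. */
double pingpong(int peer, void *buf, size_t len, int iters, int is_sender)
{
    double t0 = now_sec();
    for (int i = 0; i < iters; i++) {
        if (is_sender) { tca_put(peer, buf, len); tca_wait(); }
        else           { tca_wait(); tca_put(peer, buf, len); }
    }
    /* half of the average round trip = one-way latency */
    return (now_sec() - t0) / iters / 2.0 * 1e6;
}
```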



Ping-pong Latency

Minimum latency (nearest-neighbor communication):
- PIO, CPU to CPU: 0.9 µs
- DMA, CPU to CPU: 1.9 µs
- DMA, GPU to GPU: 2.3 µs (cf. MVAPICH2 1.9: 19 µs)


[Graph: ping-pong latency (µs, 0 to 8) vs. data size (8 to 32768 bytes) for PIO, DMA (CPU), and DMA (GPU); PIO gives the best performance below 64 bytes.]


Ping-pong Latency (cont.)

Forwarding overhead per hop:
- 200 to 300 ns


[Graph: ping-pong latency (µs) vs. data size (bytes) for DMA Direct, DMA 1 hop, DMA 2 hop, DMA 3 hop, and DMA (CPU).]


Ping-pong Bandwidth

- Max. 3.5 GByte/s
  - 95% of the theoretical peak
  - Converges to the same peak for various hop counts
- GPU-to-GPU DMA saturates at about 880 MByte/s
  - The PCIe switch embedded in Sandy Bridge does not have enough resources for PCIe device reads.
  - On Ivy Bridge the performance is expected to improve…
  - …and is then expected to match the CPU-to-CPU case.


[Graph: ping-pong bandwidth (MBytes/s, 0 to 4000) vs. data size (8 to 262144 bytes) for DMA Direct, DMA 1/2/3 hops, GPU Direct, and MVAPICH2 GPU. With Max Payload Size = 256 bytes, the theoretical peak is 4 GByte/s × 256 / (256 + 16 + 2 + 4 + 1 + 1) = 3.66 GByte/s; the measured curves reach 3.5 GByte/s, while GPU-to-GPU saturates at 880 MByte/s.]


Programming for TCA cluster

- A data transfer to a remote GPU within TCA can be treated like a transfer to a local GPU.
- Particularly suitable for stencil computation
  - Good performance for nearest-neighbor communication thanks to the direct network
  - Chaining DMA can bundle the data transfers for all “halo” planes (a descriptor-chain sketch follows the figure note below):
    - XY-plane: contiguous array
    - XZ-plane: block stride
    - YZ-plane: stride
  - In each iteration the DMA descriptors can be reused; only a DMA kick operation is needed.
  => Improves strong scaling for small data sizes


[Figure: the three halo-plane transfers are bundled into a single chained DMA.]
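A hedged sketch of bundling the three halo planes of an nx × ny × nz grid into one reusable descriptor chain, using the same hypothetical dma_desc idea as earlier but with assumed stride fields; the real PEACH2 block-stride descriptor format may differ. Each iteration then only re-kicks d[0], as in the chaining-mode sketch above.

```c
/* Hypothetical chained descriptors for the three halo planes of an
 * nx x ny x nz grid (element size `elem` bytes).  Field layout and
 * stride handling are illustrative assumptions. */
#include <stdint.h>

struct dma_desc {
    uint64_t src, dst;          /* base PCIe addresses                        */
    uint32_t len;               /* bytes per block                            */
    uint32_t nblocks;           /* number of blocks (1 = contiguous transfer) */
    uint64_t stride;            /* byte distance between successive blocks    */
    uint64_t next;              /* next descriptor, 0 ends the chain          */
};

void build_halo_chain(struct dma_desc d[3], uint64_t src, uint64_t dst,
                      uint32_t nx, uint32_t ny, uint32_t nz, uint32_t elem)
{
    uint64_t row   = (uint64_t)nx * elem;        /* one X row    */
    uint64_t plane = (uint64_t)nx * ny * elem;   /* one XY plane */

    /* XY-plane halo: a single contiguous block */
    d[0] = (struct dma_desc){ src, dst, (uint32_t)plane, 1, 0,
                              (uint64_t)(uintptr_t)&d[1] };
    /* XZ-plane halo: nz blocks of one row each, one plane apart (block stride) */
    d[1] = (struct dma_desc){ src, dst, (uint32_t)row, nz, plane,
                              (uint64_t)(uintptr_t)&d[2] };
    /* YZ-plane halo: ny*nz blocks of one element each, one row apart (stride) */
    d[2] = (struct dma_desc){ src, dst, elem, ny * nz, row, 0 };
}
```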


Related Work

- Non-Transparent Bridge (NTB)
  - NTB adds a bridging function to a downstream port of a PCIe switch.
  - Inflexible: the host must recognize it during the BIOS scan.
  - It is not defined in the PCIe standard and is incompatible across vendors.
- APEnet+ (Italy)
  - GPU direct copy using Fermi GPUs; a protocol different from TCA is used.
  - Latency between GPUs is around 5 µs(?)
  - Custom 3-D torus network, QSFP+ cables
- MVAPICH2 + GPUDirect
  - CUDA 5, Kepler
  - Latency between GPUs was reported as 4.5 µs in June, but it has not been released yet (Q4 2013?).



Summary

- TCA: Tightly Coupled Accelerators
  - TCA enables direct communication among accelerators as an element technology, and is expected to become a basic technology for next-generation accelerated computing in the exascale era.
- PEACH2 board: an implementation realizing TCA using PCIe technology
  - Bandwidth: max. 3.5 GByte/s between CPUs (over 95% of the theoretical peak)
  - Min. latency: 0.9 µs (PIO), 1.9 µs (DMA between CPUs), 2.3 µs (DMA between GPUs)
  - GPU-to-GPU communication across nodes was demonstrated on an 8-node cluster.
  - In the ping-pong benchmark, PEACH2 achieves lower latency than existing technology such as MVAPICH2 for small data sizes.
- HA-PACS/TCA with 64 nodes will be installed at the end of Oct. 2013.
  - Proof system of the TCA architecture with 4 GPUs per node
  - Development of HPC applications using TCA, and production runs
  - Performance comparison between TCA and InfiniBand
  - A hybrid communication scheme using TCA and InfiniBand will be investigated.
