

Interconnection Network for Tightly Coupled Accelerators Architecture

Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato

Center for Computational Sciences, University of Tsukuba, Japan

Hot Interconnects 21, Aug 22, 2013


What is “Tightly Coupled Accelerators (TCA)”?

Concept:
- Direct connection between accelerators (GPUs) across nodes
  - Eliminates extra memory copies to the host
  - Reduces latency and improves strong scaling for scientific applications with small data sizes
- Uses PCIe as the communication link between accelerators across nodes
  - PCIe simply transfers packets, so direct device-to-device (P2P) communication is available.
- PEACH2: PCI Express Adaptive Communication Hub ver. 2
  - To configure TCA, each node is connected to the other nodes through a PEACH2 chip.



Design policy of PEACH2

- Implemented on an FPGA with four PCIe Gen2 IP blocks
  - Altera Stratix IV GX
  - Allows prototyping and flexible enhancement
- Sufficient communication bandwidth
  - PCI Express Gen2 x8 for each port
- Sophisticated DMA controller
  - Chaining DMA
- Latency reduction
  - Hardwired logic
  - Low-overhead routing mechanism
    - Efficient address mapping in the PCIe address space using otherwise unused bits
    - Simple comparator to decide the output port (see the sketch after this list)

PEACH2 is not only a proof-of-concept implementation; it will also be used for production runs in a GPU cluster.
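The routing idea can be illustrated with a minimal sketch. Purely as an assumption for illustration, the destination node ID is taken from high, otherwise unused PCIe address bits (here bits 35:32), and a plain comparison picks the output port; the actual bit layout and port-selection logic inside the FPGA are not published here.

```c
/* Minimal sketch of address-based routing: the node ID assumed to live in
 * bits 35:32 of the PCIe address is compared against this node's ID.
 * (Port S, toward the other ring, is omitted from this sketch.) */
#include <stdint.h>

enum port { PORT_N, PORT_E, PORT_W, PORT_S };

/* assumed layout: bits 35:32 of the PCIe address carry the destination node ID */
static inline unsigned dest_id(uint64_t pcie_addr) { return (pcie_addr >> 32) & 0xF; }

enum port route(uint64_t pcie_addr, unsigned my_id)
{
    unsigned dest = dest_id(pcie_addr);

    if (dest == my_id)
        return PORT_N;                        /* local delivery to host/GPUs          */
    return (dest > my_id) ? PORT_E : PORT_W;  /* one comparator picks the ring direction */
}
```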



TCA node structure example

- PEACH2 can access all GPUs
  - NVIDIA Kepler architecture + CUDA 5.0 “GPUDirect Support for RDMA”
  - Performance across QPI is quite bad => supported only for GPU0 and GPU1
- Three nodes are connected using PEACH2


[Figure: TCA node block diagram. Two Xeon E5 CPUs connected by QPI; four GPUs (NVIDIA K20/K20X, Kepler architecture, PCIe Gen2 x16 each), an IB HCA (Gen3 x8), and the PEACH2 board (Gen2 x8 host link plus Gen2 x8 external ports) all sit in a single PCIe address space.]


Overview of PEACH2 chip

- Fully compatible with the PCIe Gen2 spec.
  - Root and Endpoint must be paired according to the PCIe spec.
- Port N: connected to the host and GPUs
- Ports E and W: form the ring topology
- Port S: connected to the other ring
  - Selectable between Root Complex and Endpoint
- All ports except Port N are write-only
  - Instead, a “proxy write” on the remote node realizes a pseudo-read (see the sketch after the figure note below).

CPU & GPU side (Endpoint)

To PEACH2 (Root Complex /

Endpoint)

To PE

AC

H2

(Endpoint)

To PE

AC

H2

(Root C

omplex)

NIOS (CPU)

Memory

DMAC

Routing function

Port N

Port E

Port W

Port S
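Because the inter-node ports only carry writes, a remote read has to be emulated. The sketch below is a hedged illustration of the proxy-write idea under assumed names: tca_write(), REQ_MAILBOX, and the flag layout are hypothetical stand-ins, not the actual PEACH2 driver interface.

```c
/* Hedged sketch of a pseudo-read built from writes only ("proxy write"). */
#include <stddef.h>
#include <stdint.h>

#define REQ_MAILBOX 0x0   /* assumed fixed mailbox offset on the remote node (hypothetical) */

/* hypothetical: forward a PCIe write of `len` bytes to `remote_addr` on `node` */
extern void tca_write(int node, uint64_t remote_addr, const void *src, size_t len);

struct read_req { uint64_t remote_src; uint64_t local_dst; uint32_t len; };

static volatile uint32_t reply_flag;   /* set by the remote node's final write */

void pseudo_read(int remote_node, uint64_t remote_src, void *local_dst, uint32_t len)
{
    struct read_req req = { remote_src, (uint64_t)(uintptr_t)local_dst, len };

    reply_flag = 0;
    /* 1. write the request into the remote node's mailbox */
    tca_write(remote_node, REQ_MAILBOX, &req, sizeof req);

    /* 2. the remote side performs a "proxy write": it writes the requested
     *    data back into local_dst, then writes 1 into reply_flag            */
    while (reply_flag == 0)
        ;   /* 3. spin until the data has been pushed back */
}
```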



Communication by PEACH2

- PIO
  - The CPU can store data to a remote node directly using mmap (a minimal sketch follows this list).
- DMA
  - Chaining mode
    - DMA requests are prepared as DMA descriptors chained in host memory (see the figure and descriptor sketch below).
    - Multiple DMA transactions are executed automatically by the hardware, following the descriptor chain.
  - Register mode
    - Up to 16 requests
    - No overhead for transferring descriptors from the host
  - Block-stride transfer function
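A minimal sketch of the PIO path, assuming the driver exposes a window onto remote memory through an mmap-able device file; the device path "/dev/peach2", the window size, and the offset are hypothetical.

```c
/* Hedged sketch of PIO communication: mmap a window that the PEACH2 driver
 * maps onto a remote node's memory, so an ordinary store becomes a remote
 * PCIe write.  Device path and window parameters are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/peach2", O_RDWR);            /* hypothetical device node */
    if (fd < 0)
        return 1;

    /* map a 4 KiB window of the remote node's memory */
    volatile uint64_t *remote =
        mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (remote == MAP_FAILED)
        return 1;

    remote[0] = 42;      /* this store is forwarded over PEACH2 as a PCIe write */

    munmap((void *)remote, 4096);
    close(fd);
    return 0;
}
```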


[Figure: chaining DMA. Descriptor 0 through Descriptor (n-1) are linked in host memory; each descriptor holds Source, Destination, Length, Flags, and a Next pointer to the following descriptor.]
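A hedged C sketch of the descriptor layout in the figure and of the single "kick" that starts a prepared chain; the field widths, flag encoding, and kick register are assumptions, not the real PEACH2 register map.

```c
/* Hypothetical descriptor layout matching the figure above. */
#include <stdint.h>

struct dma_desc {
    uint64_t src;       /* Source: PCIe address to read from                  */
    uint64_t dst;       /* Destination: PCIe address to write to              */
    uint32_t len;       /* Length in bytes                                    */
    uint32_t flags;     /* Flags: e.g. block-stride mode, interrupt on done   */
    uint64_t next;      /* Next: host address of the next descriptor, 0 = end */
};

/* Chaining mode: build the descriptors once in host memory, then hand the
 * address of the first one to the DMAC; the hardware walks the chain. */
static inline void kick_chain(volatile uint64_t *dmac_kick_reg, struct dma_desc *head)
{
    *dmac_kick_reg = (uint64_t)(uintptr_t)head;   /* one store starts all transfers */
}
```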


PEACH2 board (Production version for HA-PACS/TCA)

- PCI Express Gen2 x8 peripheral board
- Compatible with the PCIe spec.


[Photos: top view and side view of the PEACH2 board.]


PEACH2 board (Production version for HA-PACS/TCA)


- Main board + sub board
- Most of the design operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)

[Annotated photo: PCI Express x8 card edge, power supplies for the various voltages, DDR3 SDRAM, FPGA (Altera Stratix IV 530GX), PCIe x16 cable connector, PCIe x8 cable connector.]


Performance Evaluation

- Environment: 8-node GPU cluster (TCAMINI)
  - CPU: Intel Xeon E5 (Sandy Bridge-EP) 2.6 GHz, 2 sockets
  - MB: SuperMicro X9DRG-QF
  - Memory: DDR3 128 GB
  - OS: CentOS 6.3 (kernel 2.6.32-279.22.1.el6.x86_64)
  - GPU: NVIDIA K20 (GDDR5 5 GB) x1
  - CUDA: 5.0, NVIDIA-Linux-x86_64-310.32
  - PEACH2 board: Altera Stratix IV 530GX
  - MPI: MVAPICH2 1.9 with IB FDR10



Evaluation items

- Ping-pong performance between nodes
  - Latency and bandwidth
  - Written as an application (a skeleton of the measurement loop follows this list)
  - Compared with MVAPICH2 1.9 (with CUDA support) for GPU-to-GPU communication
- To let another device access GPU memory, the “GPUDirect Support for RDMA” API in CUDA 5 is used.
  - A special driver named “TCA p2p driver” was developed to enable the memory mapping.
  - A “PEACH2 driver” that controls the board was also developed.
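A hedged skeleton of how such a ping-pong measurement is typically structured, reporting one-way latency as half the averaged round trip; tca_put() and tca_wait() are hypothetical stand-ins for the actual driver interface.

```c
/* Skeleton of a ping-pong latency test under assumed transfer primitives. */
#include <stddef.h>
#include <stdio.h>
#include <sys/time.h>

extern void tca_put(int dest_node, const void *buf, size_t len);  /* hypothetical */
extern void tca_wait(void);            /* hypothetical: wait for incoming data */

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Returns the one-way latency in microseconds. */
double pingpong(int peer, void *buf, size_t len, int iters, int is_sender)
{
    double t0 = now_sec();
    for (int i = 0; i < iters; i++) {
        if (is_sender) { tca_put(peer, buf, len); tca_wait(); }
        else           { tca_wait(); tca_put(peer, buf, len); }
    }
    /* half of the average round trip = one-way latency */
    return (now_sec() - t0) / iters / 2.0 * 1e6;
}
```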



Ping-pong Latency

Minimum latency (nearest-neighbor communication):
- PIO, CPU to CPU: 0.9 µs
- DMA, CPU to CPU: 1.9 µs
- DMA, GPU to GPU: 2.3 µs (cf. MVAPICH2 1.9: 19 µs)


[Graph: ping-pong latency (µs, 0 to 8) vs. data size (8 to 32768 bytes) for PIO, DMA (CPU), and DMA (GPU); PIO gives the best performance below 64 bytes.]


Ping-pong Latency (cont.)

Forwarding overhead per hop:
- 200 to 300 ns


[Graph: ping-pong latency (µs) vs. data size (bytes) for DMA Direct, DMA 1 hop, DMA 2 hop, DMA 3 hop, and DMA (CPU).]


Ping-pong Bandwidth

- Max. 3.5 GByte/s
  - 95% of the theoretical peak
  - Converges to the same peak for various hop counts
- GPU-to-GPU DMA saturates at about 880 MByte/s
  - The PCIe switch embedded in Sandy Bridge does not have enough resources for PCIe device reads.
  - On Ivy Bridge the performance is expected to improve…
  - …and is then expected to match the CPU-to-CPU case.


[Graph: ping-pong bandwidth (MBytes/s, 0 to 4000) vs. data size (8 to 262144 bytes) for DMA Direct, DMA 1/2/3 hops, GPU Direct, and MVAPICH2 GPU. With Max Payload Size = 256 bytes, the theoretical peak is 4 GByte/s × 256 / (256 + 16 + 2 + 4 + 1 + 1) = 3.66 GByte/s; the measured curves reach 3.5 GByte/s, while GPU-to-GPU saturates at 880 MByte/s.]


Programming for TCA cluster

- A data transfer to a remote GPU within TCA can be treated like a transfer to a local GPU.
- Particularly suitable for stencil computation
  - Good performance for nearest-neighbor communication thanks to the direct network
  - Chaining DMA can bundle the data transfers for all “halo” planes (a descriptor-chain sketch follows the figure note below):
    - XY-plane: contiguous array
    - XZ-plane: block stride
    - YZ-plane: stride
  - In each iteration the DMA descriptors can be reused; only a DMA kick operation is needed.
  => Improves strong scaling for small data sizes


[Figure: the three halo-plane transfers are bundled into a single chained DMA.]
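A hedged sketch of bundling the three halo planes of an nx × ny × nz grid into one reusable descriptor chain, using the same hypothetical dma_desc idea as earlier but with assumed stride fields; the real PEACH2 block-stride descriptor format may differ. Each iteration then only re-kicks d[0], as in the chaining-mode sketch above.

```c
/* Hypothetical chained descriptors for the three halo planes of an
 * nx x ny x nz grid (element size `elem` bytes).  Field layout and
 * stride handling are illustrative assumptions. */
#include <stdint.h>

struct dma_desc {
    uint64_t src, dst;          /* base PCIe addresses                        */
    uint32_t len;               /* bytes per block                            */
    uint32_t nblocks;           /* number of blocks (1 = contiguous transfer) */
    uint64_t stride;            /* byte distance between successive blocks    */
    uint64_t next;              /* next descriptor, 0 ends the chain          */
};

void build_halo_chain(struct dma_desc d[3], uint64_t src, uint64_t dst,
                      uint32_t nx, uint32_t ny, uint32_t nz, uint32_t elem)
{
    uint64_t row   = (uint64_t)nx * elem;        /* one X row    */
    uint64_t plane = (uint64_t)nx * ny * elem;   /* one XY plane */

    /* XY-plane halo: a single contiguous block */
    d[0] = (struct dma_desc){ src, dst, (uint32_t)plane, 1, 0,
                              (uint64_t)(uintptr_t)&d[1] };
    /* XZ-plane halo: nz blocks of one row each, one plane apart (block stride) */
    d[1] = (struct dma_desc){ src, dst, (uint32_t)row, nz, plane,
                              (uint64_t)(uintptr_t)&d[2] };
    /* YZ-plane halo: ny*nz blocks of one element each, one row apart (stride) */
    d[2] = (struct dma_desc){ src, dst, elem, ny * nz, row, 0 };
}
```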


Related Work

- Non-Transparent Bridge (NTB)
  - NTB adds a bridging function to a downstream port of a PCIe switch.
  - Inflexible: the host must recognize it during the BIOS scan.
  - It is not defined in the PCIe standard and is incompatible across vendors.
- APEnet+ (Italy)
  - GPU direct copy using Fermi GPUs; a protocol different from TCA is used.
  - Latency between GPUs is around 5 µs(?)
  - Custom 3-D torus network, QSFP+ cables
- MVAPICH2 + GPUDirect
  - CUDA 5, Kepler
  - Latency between GPUs was reported as 4.5 µs in June, but it has not been released yet (Q4 2013?).



Summary

- TCA: Tightly Coupled Accelerators
  - TCA enables direct communication among accelerators as an element technology, and is expected to become a basic technology for next-generation accelerated computing in the exascale era.
- PEACH2 board: an implementation realizing TCA using PCIe technology
  - Bandwidth: max. 3.5 GByte/s between CPUs (over 95% of the theoretical peak)
  - Min. latency: 0.9 µs (PIO), 1.9 µs (DMA between CPUs), 2.3 µs (DMA between GPUs)
  - GPU-to-GPU communication across nodes was demonstrated on an 8-node cluster.
  - In the ping-pong benchmark, PEACH2 achieves lower latency than existing technology such as MVAPICH2 for small data sizes.
- HA-PACS/TCA with 64 nodes will be installed at the end of Oct. 2013.
  - Proof system of the TCA architecture with 4 GPUs per node
  - Development of HPC applications using TCA, and production runs
  - Performance comparison between TCA and InfiniBand
  - A hybrid communication scheme using TCA and InfiniBand will be investigated.
