Post on 22-May-2020
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
Yuetsu KODAMA Center for Computational Sciences
University of Tsukuba kodama@cs.tsukuba.ac.jp
Tightly Coupled Accelerator for HA-PACS
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
!! Two parts !! Base Cluster:
!! for development of GPU-accelerated code for target fields, and performing product-run of them
!! TCA: (Tightly Coupled Accelerators) !! for elementary research on new communication technology for
reducing latency between accelerators. !! Our original communication system based on PCI-Express named
“PEARL”, and a prototype communication chip named “PEACH” !! Improve PEACH for TCA : PEACH2
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
Problems of GPU Cluster!! Problems of GPGPU for HPC
!! Data I/O performance limitation !! Ex) GPGPU: PCIe gen2 x16 (8GB/s ) 665 GFLOPS (NVIDIA M2090)
!! Memory size limitation !! Ex) M2090: 6GByte vs CPU: 128GByte
!! Communication between accelerators: no direct path (external) communication latency via CPU becomes large
!! Researches for direct communication between GPUs are required
TThhee ttaarrggeett ooff TTCCAA iiss ddeevveellooppiinngg aa ddiirreecctt ccoommmmuunniiccaattiioonn ssyysstteemm bbeettwweeeenn eexxtteerrnnaall GGPPUUss ffoorr aa ffeeaassiibbiilliittyy ssttuuddyy ffoorr ffuuttuurree aacccceelleerraatteedd ccoommppuuttiinngg
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
PCIe
Node
PCIe
PCIe
PCIeGPU
GPUCPU
CPU
IB HCA
IB HCA
IB Switch
MEMMEM
MEM MEM
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
HA-PACS: TCA (Tightly Coupled Accelerator)
!! TCA: Tightly Coupled Accelerator !! Direct connection between accelerators (GPUs) !! Using PCIe as a communication device between accelerator
!! Most acceleration device and other I/O device are connected by PCIe as PCIe end-point (slave device)
!! An intelligent PCIe device logically enables an end-point device to directly communicate with other end-point devices
!! PEARL: PCI Express Adaptive and Reliable Link !! We already developed such PCIe device (PEACH, PCI Express Adaptive
Communication Hub) on JST-CREST project “low power and dependable network for embedded system”
Improving PEACH for HPC to realize TCA : PEACH2
LBNL and CCS-Tsukuba Joint Workshop 20122012/03/19
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
PEACH, PCI Express Adaptive Communication Hub
!! Applying PCIe link as “a direct link between nodes” !! PCI-E link edge control feature: “root complex” and “end points” are
automatically switched (flipped) according to the connection handling !! Each PCIe is an independent bus, and Intelligent controller on PEACH
transfers each packet between PCIe buses.
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
PCIePCIe PCIe
PEACHCPUPCIe
RC RCEPRC
EP
CPUPCIe
PEACH
RCEPEP
EP
RC
PEACHCPUPCIe
RC RCEPRC
EP
CPUPCIe
PEACH
RCEPEP
EP
RC
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
PEACH chip [Otani et al., ISSCC2011]
!"#$%&'()
*+,-./0
*+,-./'
Transfer Processing block
Control Processing block
*+,-./&
*+,-./1
23435678493:7;
*83<7;;38<387
!! CPU: Renesas M32R 4core SMP (max. 400MHz) !! small core size, low power!! SMP !! Controlling PCIe 4 port!! health check for the node, link!! communication link management!! route reconfiguration
PCIe#0
L2 Cache
PCIe#3
PCIe#2
CPU SRAM
DD
R3
I/F
PCIe#1
•! comm. link PCI Express Gen2 x4 lanes (20Gbps) x4 port
•! SuperHyway DMAC
•! Low power –! Controlling #lanes for each port –! gen1/gen2 switching –! core frequency control
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
HA-PACS/TCA (Tightly Coupled Accelerator)!! Enhanced version of PEACH
PPEEAACCHH22 !! x4 lanes -> x8 lanes !! hardwired on main data path and
PCIe interface fabric
PEACH2CPU
PCIe
CPUPCIe
Node
PEACH2
PCIe
PCIe
PCIe
GPU
GPU
!! True GPU-direct !! current GPU clusters require 3-
hop communication (3-5 times memory copy)
!! For strong scaling, Inter-GPU direct communication protocol is needed for lower latency and higher throughput
PCIe
PCIe
Node
PCIe
PCIe
PCIeGPU
GPUCPU
CPU
IB HCA
IB HCA
IB Switch
MEMMEM
MEM
MEMMEM MEM
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
Implementation of PEACH2: ASIC FPGA!! FPGA based implementation
!! today’s advanced FPGA allows to use PCIe hub with multiple ports !! currently gen2 x 8 lanes x 4 ports are available
soon gen3 will be available (?) !! easy modification and enhancement !! fits to standard (full-size) PCIe board !! internal multi-core general purpose CPU with programmability is available
easily split hardwired/firmware partitioning on certain level on control layer
!! Controlling PEACH2 for GPU communication protocol !! collaboration with NVIDIA for information sharing and discussion !! based on CUDA4.0 device to device direct memory copy protocol
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
HA-PACS/TCANode Cluster = NC
PEACH2
C x 2
PPEEAARRLL RRiinngg NNeettwwoorrkk
IInnffiinniibbaanndd LLiinnkk
Node Cluster with 16 nodes •! GPUx64 (G) •! CPUx32 (C) •! GPU comm with PCIe •! IB link / node •! CPU: Xeon E5 •! GPU: Kepler
Node Cluster
Node Cluster
Node Cluster
Node Cluster
Node Cluster
IInnffiinniibbaanndd NNeettwwoorrkk
................
44 NNCC wwiitthh 1166 nnooddeess,, oorr 88 NNCC wwiitthh 88 nnooddeess == 336600 TTFFLLOOPPSS eexxtteennssiioonn ttoo bbaassee cclluusstteerr
•!High speed GPU-GPU comm. by PEACH within NC (PCI-E gen2x8 = 4GB/s/link) •!Infiniband QDR (x2) for NC-NC comm. (4GB/s/link)
Gx4
PEACH2
C x 2
Gx4
PEACH2
C x 2
Gx4
.....
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
PEARL/PEACH2 variation (1)
!! Option 1: !! GPU host driver is available as-is !! 4 GPUs are equivalent from PEACH2
C C
C C
C C
C CQPI
PCIe
GPU GPU GPU IB HCA
PEACH2
GPU
G2 x8
G3 x16
G3 x16
G3 x8
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
PEARL/PEACH2 variation (2)
C C
C C
C C
C CQPI
PCIe
GPU GPU GPUIB
HCA
GPUPEACH2
PCIe SWG2 x8
G3 x16
G3 x16
G3 x8
G3 x8
!! Option 1: !! Performance comparison among IB and PEARL can be evenly
compared !! Additional latency by PCIe switch
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
PEARL/PEACH2 variation (3)
C C
C C
C C
C CQPI
PCIe
GPU GPU
GPU
IB HCA
GPU PEACH2
PCIe SWG2 x8
G3 x16
G3 x16
G3 x16
G3 x16
G3 x8
!! Option 3: !! Requires only 72 lanes in total !! asymmetric connection among 3 blocks of GPUs
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
PEACH2 FPGA test bedHSMC-PCIe converter board (newly
developed) -! HSMC (High-Speed Mezzanine Card)
General I/O port -! PCIe x8 cable connector -! RootComplex / Endpoint switchable -! both x4 / x8 available
generic FPGA evaluation board (DEV-4SGX530N)
-! PCIe x8 endpoint -! HSMC support PCIe x 8 Gen2 -! HSMC support PCIe x 4 Gen2 -! Stratix IV GX530KH40
(1517pin, 531K LE, 20Mbit, 4 PCIe IP, 24 8.5Gbps Transceiver)
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
Read: 3.2GB/s, Write 3.3GB/s in 64K
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
PEACH2 PCIe compatible board(~ Mar. 2012)
Altera Stratix IV GX D
DR3-
SODIMM
PCI-Express gen2 x8 card edge
PCI-Express gen2 x8 cable connector
PCI-Express gen2 x8 cable connector
PCI-Express gen2 x16 (using only x8) cable
connector
Stratix IV GX530NF45 (1932pin, 32 8.5Gbps Transceiver)
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
Gate Usage: 16% Clock: 250MHz Latency: 376ns
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
Summary!! HA-PACS/TCA for elementary study for advanced technology on
direct communication among accelerating devices (GPUs) !! FPGA implementation of PEACH2 will be finished for the
prototype version on Mar. 2012 and enhanced for final version in following 6 months
!! HA-PACS/TCA with at least 200 TFLOPS additional performance will be installed around Mar. 2013
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa 2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
PCI express IP core performance (Gen2 x 8)
2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012
0
500
1000
1500
2000
2500
3000
3500
4000
128 512 2048 8192 32768 131072
bbaanndd
wwiidd
tthh ((MM
bbyyttee
ss//ssee
cc))
ddaattaa ssiizzee ((bbyytteess))
read MB/s write MB/s
CCeenntteerr ffoorr CCoommppuuttaattiioonnaall SScciieenncceess,, UUnniivv.. ooff TTssuukkuubbaa
PCI Express
!! I/O Interface for PC after PCI, PCI-X !! Standard by PCI-SIG !! Used in connecton between PC and NIC
!! Serial communication !! Speed of 1 lane: 2.5Gbps(Gen1),
5.0Gbps(Gen2) !! Multilane is supported for high
throughput !! x1, x2, x4, x8, x16
!! Line length is limited !! 30cm for Gen2 on board !! 2m with external cable !! Enough for Embeded or small scale
cluster
CPU
Root Complex (RC)
Device Device
Switch
EndPoint (EP)