
Center for Computational Sciences, Univ. of Tsukuba

Yuetsu KODAMA Center for Computational Sciences

University of Tsukuba kodama@cs.tsukuba.ac.jp

Tightly Coupled Accelerator for HA-PACS

2012/03/19 LBNL and CCS-Tsukuba Joint Workshop 2012

HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)

- Two parts
- Base Cluster:
  - for development of GPU-accelerated code for target fields, and for performing production runs of that code
- TCA (Tightly Coupled Accelerators):
  - for elementary research on new communication technology for reducing latency between accelerators
  - Our original communication system based on PCI-Express, named “PEARL”, and a prototype communication chip named “PEACH”
  - Improve PEACH for TCA: PEACH2

Problems of GPU Cluster

- Problems of GPGPU for HPC
  - Data I/O performance limitation
    - Ex) GPGPU: PCIe gen2 x16 (8GB/s) vs. 665 GFLOPS (NVIDIA M2090)
  - Memory size limitation
    - Ex) M2090: 6GByte vs. CPU: 128GByte
  - Communication between accelerators: no direct path (external), so communication latency via CPU becomes large
- Research on direct communication between GPUs is required

The target of TCA is developing a direct communication system between external GPUs for a feasibility study for future accelerated computing

[Figure: node diagram; labels: CPU, GPU, MEM, PCIe, IB HCA, IB Switch]
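The I/O limitation above can be made concrete with a little arithmetic. The sketch below uses only the figures quoted on the slide (8GB/s for PCIe gen2 x16, 665 GFLOPS and 6GByte for the M2090, 128GByte for the CPU); the function and variable names are illustrative and not part of the original material.

```python
# Figures quoted on the slide for an NVIDIA M2090 attached via PCIe gen2 x16.
PCIE_GEN2_X16_GB_PER_S = 8.0   # host <-> GPU link bandwidth
M2090_GFLOPS = 665.0           # peak floating-point rate
GPU_MEM_GBYTE = 6              # M2090 device memory
CPU_MEM_GBYTE = 128            # host memory in the slide's example


def bytes_per_flop(bandwidth_gb_s: float, gflops: float) -> float:
    """Bytes the link can deliver per floating-point operation at peak rates."""
    return bandwidth_gb_s / gflops


ratio = bytes_per_flop(PCIE_GEN2_X16_GB_PER_S, M2090_GFLOPS)
memory_ratio = CPU_MEM_GBYTE / GPU_MEM_GBYTE

print(f"PCIe feeds about {ratio:.4f} bytes per flop")
print(f"Host memory is about {memory_ratio:.1f}x the GPU memory")
```

At peak rates the link supplies roughly one byte per 80 floating-point operations, which is why data that has to cross PCIe repeatedly dominates the run time.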

HA-PACS: TCA (Tightly Coupled Accelerator)

- TCA: Tightly Coupled Accelerator
  - Direct connection between accelerators (GPUs)
  - Using PCIe as a communication device between accelerators
    - Most acceleration devices and other I/O devices are connected by PCIe as PCIe end-points (slave devices)
    - An intelligent PCIe device logically enables an end-point device to communicate directly with other end-point devices
- PEARL: PCI Express Adaptive and Reliable Link
  - We already developed such a PCIe device (PEACH, PCI Express Adaptive Communication Hub) in the JST-CREST project “low power and dependable network for embedded system”

Improving PEACH for HPC to realize TCA: PEACH2


PEACH, PCI Express Adaptive Communication Hub

- Applying a PCIe link as “a direct link between nodes”
  - PCI-E link edge control feature: “root complex” and “end points” are automatically switched (flipped) according to the connection handling
  - Each PCIe is an independent bus, and the intelligent controller on PEACH transfers each packet between PCIe buses.

[Figure: CPUs and PEACH chips connected by PCIe links; each port is labeled RC (root complex) or EP (end point)]
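The role flipping described above can be pictured with a toy model. Every PCIe link needs one root-complex side and one end-point side, so a port that can take either role simply takes the opposite of whatever its peer presents. The sketch below illustrates only that rule; the names `Role`, `flip` and `port_role_for` are hypothetical, and the slides do not describe how PEACH actually performs the switch.

```python
from enum import Enum


class Role(Enum):
    RC = "root complex"
    EP = "end point"


def flip(role: Role) -> Role:
    """Return the opposite PCIe link role."""
    return Role.EP if role is Role.RC else Role.RC


def port_role_for(peer_role: Role) -> Role:
    """Role a switchable port must take so the link has one RC and one EP."""
    return flip(peer_role)


# A host CPU always presents a root complex, so the port facing it is an EP.
cpu_facing_port = port_role_for(Role.RC)

# On a link between two hub chips, whichever side comes up as RC
# forces the other side to act as EP (and vice versa).
neighbor_port = port_role_for(Role.EP)

print(cpu_facing_port.name, neighbor_port.name)
```

The point of the model is that the role is a property of each link, not of the chip: one hub can be an end point toward its host CPU and a root complex toward a neighboring hub at the same time.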

PEACH chip [Otani et al., ISSCC2011]

[Figure: PEACH chip block diagram; labels: Transfer Processing block, Control Processing block]

- CPU: Renesas M32R 4core SMP (max. 400MHz)
  - small core size, low power
  - SMP
- Controlling PCIe 4 port
  - health check for the node, link
  - communication link management
  - route reconfiguration

[Figure: chip layout; labels: PCIe#0, PCIe#1, PCIe#2, PCIe#3, L2 Cache, CPU, SRAM]

- comm. link: PCI Express Gen2 x4 lanes (20Gbps) x4 port
- SuperHyway DMAC
- Low power
  - Controlling #lanes for each port
  - gen1/gen2 switching
  - core frequency control

HA-PACS/TCA (Tightly Coupled Accelerator)

- Enhanced version of PEACH: PEACH2
  - x4 lanes -> x8 lanes
  - hardwired on main data path and PCIe interface fabric
- True GPU-direct
  - current GPU clusters require 3-hop communication (3-5 times memory copy)
  - For strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput

[Figure: node diagram; labels: CPU, GPU, MEM, PCIe, PEACH2, IB HCA, IB Switch]

Implementation of PEACH2: ASIC -> FPGA

- FPGA based implementation
  - today’s advanced FPGAs allow a PCIe hub with multiple ports
    - currently gen2 x 8 lanes x 4 ports are available; soon gen3 will be available (?)
  - easy modification and enhancement
  - fits on a standard (full-size) PCIe board
  - internal multi-core general purpose CPU with programmability is available
    - hardwired/firmware partitioning can easily be split at a certain level of the control layer
- Controlling PEACH2 for GPU communication protocol
  - collaboration with NVIDIA for information sharing and discussion
  - based on CUDA4.0 device to device direct memory copy protocol

HA-PACS/TCA

Node Cluster = NC

[Figure: Node Clusters connected by an Infiniband Network; inside each Node Cluster, nodes with PEACH2 are joined by a PEARL Ring Network and each node has an Infiniband Link]

Node Cluster with 16 nodes
- GPU x64 (G)
- CPU x32 (C)
- GPU comm with PCIe
- IB link / node
- CPU: Xeon E5
- GPU: Kepler

4 NC with 16 nodes, or 8 NC with 8 nodes = 360 TFLOPS extension to base cluster

- High speed GPU-GPU comm. by PEACH within NC (PCI-E gen2x8 = 4GB/s/link)
- Infiniband QDR (x2) for NC-NC comm. (4GB/s/link)
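To get a feel for what a ring inside a 16-node Node Cluster implies, the sketch below counts hops between nodes. It assumes a bidirectional ring with shortest-path routing, which the slides do not state (they only name a "PEARL Ring Network"); the per-node GPU and CPU counts are derived from the totals on the slide, and all identifiers are illustrative.

```python
NODES_PER_NC = 16
GPUS_PER_NC = 64
CPUS_PER_NC = 32

# Per-node counts implied by the Node Cluster totals.
gpus_per_node = GPUS_PER_NC // NODES_PER_NC
cpus_per_node = CPUS_PER_NC // NODES_PER_NC


def ring_hops(src: int, dst: int, n_nodes: int = NODES_PER_NC) -> int:
    """Hops between two nodes on a bidirectional ring (assumed topology)."""
    clockwise = (dst - src) % n_nodes
    return min(clockwise, n_nodes - clockwise)


# Worst case and average distance from node 0 to every other node.
distances = [ring_hops(0, d) for d in range(1, NODES_PER_NC)]
max_hops = max(distances)
mean_hops = sum(distances) / len(distances)

print(f"{gpus_per_node} GPUs and {cpus_per_node} CPUs per node")
print(f"max hops: {max_hops}, mean hops: {mean_hops:.2f}")
```

Under these assumptions the farthest node is half the ring away, so per-hop latency in the hub matters directly for worst-case GPU-to-GPU latency within a Node Cluster.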

PEARL/PEACH2 variation (1)

- Option 1:
  - GPU host driver is available as-is
  - 4 GPUs are equivalent from PEACH2
- Option 2:
  - Performance of IB and PEARL can be compared on equal terms
  - Additional latency by PCIe switch
- Option 3:
  - Requires only 72 lanes in total
  - asymmetric connection among 3 blocks of GPUs

[Figures: node configuration for each option; labels: CPU (C), QPI, GPU, IB HCA, PEACH2, PCIe SW, G2 x8, G3 x16]

PEACH2 FPGA test bed

HSMC-PCIe converter board (newly developed)
- HSMC (High-Speed Mezzanine Card) General I/O port
- PCIe x8 cable connector
- RootComplex / Endpoint switchable
- both x4 / x8 available

generic FPGA evaluation board (DEV-4SGX530N)
- PCIe x8 endpoint
- HSMC support PCIe x 8 Gen2
- HSMC support PCIe x 4 Gen2
- Stratix IV GX530KH40 (1517pin, 531K LE, 20Mbit, 4 PCIe IP, 24 8.5Gbps Transceiver)

Read: 3.2GB/s, Write: 3.3GB/s at 64K
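The measured test-bed rates can be put next to the 4GB/s per-link figure that these slides quote for PCI-E gen2 x8. The short calculation below is only a sketch of that comparison; treating 4GB/s as the reference peak is an assumption made here, and the names are illustrative.

```python
GEN2_X8_PEAK_GB_S = 4.0  # per-link figure quoted in the slides for gen2 x8

# Measured test-bed rates at 64K, as reported on the slide.
measured_gb_s = {"read": 3.2, "write": 3.3}

# Fraction of the reference peak achieved by each operation.
efficiency = {op: bw / GEN2_X8_PEAK_GB_S for op, bw in measured_gb_s.items()}

for op, eff in efficiency.items():
    print(f"{op}: {eff:.1%} of the gen2 x8 figure")
```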

PEACH2 PCIe compatible board (~ Mar. 2012)

[Board photo; labels:]
- Altera Stratix IV GX D
- SODIMM
- PCI-Express gen2 x8 card edge
- PCI-Express gen2 x8 cable connector
- PCI-Express gen2 x16 (using only x8) cable connector

Stratix IV GX530NF45 (1932pin, 32 8.5Gbps Transceiver)

Gate Usage: 16%, Clock: 250MHz, Latency: 376ns
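The latency and clock figures above can be related by converting the latency into clock cycles. The sketch assumes the 376ns latency is measured in the same 250MHz clock domain, which the slide does not state; the function name is illustrative.

```python
def latency_in_cycles(latency_ns: int, clock_mhz: int) -> float:
    """Convert a latency in ns to cycles of a clock given in MHz.

    One cycle lasts 1000 / clock_mhz nanoseconds.
    """
    return latency_ns * clock_mhz / 1000


cycles = latency_in_cycles(376, 250)
print(f"376 ns at 250 MHz is {cycles:g} cycles")
```

At 250MHz one cycle is 4ns, so the reported latency corresponds to 94 cycles through the FPGA logic under that assumption.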

Summary

- HA-PACS/TCA is for elementary study of advanced technology for direct communication among accelerating devices (GPUs)
- FPGA implementation of PEACH2 will be finished for the prototype version in Mar. 2012 and enhanced for the final version in the following 6 months
- HA-PACS/TCA with at least 200 TFLOPS additional performance will be installed around Mar. 2013


PCI express IP core performance (Gen2 x 8)

[Figure: bandwidth (Mbytes/sec) vs. data size (bytes), from 128 to 131072 bytes; series: read MB/s, write MB/s]

PCI Express

- I/O interface for PC after PCI, PCI-X
  - Standard by PCI-SIG
  - Used in connection between PC and NIC
- Serial communication
  - Speed of 1 lane: 2.5Gbps (Gen1), 5.0Gbps (Gen2)
  - Multilane is supported for high throughput
    - x1, x2, x4, x8, x16
- Line length is limited
  - 30cm for Gen2 on board
  - 2m with external cable
  - Enough for embedded or small scale cluster

[Figure: PCIe topology; labels: Root Complex (RC), Switch, Device, EndPoint (EP)]
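The per-lane speeds and lane widths listed above combine into link bandwidth as sketched below. The per-lane rates come from the slide; the 8b/10b line coding (8 data bits carried in every 10 transmitted bits) is a property of PCIe Gen1 and Gen2 that the slide does not mention and is added here as background. Function and constant names are illustrative.

```python
LANE_RATE_GBPS = {"gen1": 2.5, "gen2": 5.0}  # raw signaling rate per lane
VALID_WIDTHS = (1, 2, 4, 8, 16)


def raw_gbps(gen: str, lanes: int) -> float:
    """Raw link rate in Gbit/s, before line-coding overhead."""
    if lanes not in VALID_WIDTHS:
        raise ValueError(f"unsupported lane count: {lanes}")
    return LANE_RATE_GBPS[gen] * lanes


def data_gb_per_s(gen: str, lanes: int) -> float:
    """Payload bandwidth in GByte/s per direction after 8b/10b coding."""
    data_gbps = raw_gbps(gen, lanes) * 8 / 10  # 8b/10b: 8 data bits per 10 line bits
    return data_gbps / 8                       # bits -> bytes


for lanes in VALID_WIDTHS:
    print(f"gen2 x{lanes}: {raw_gbps('gen2', lanes):g} Gbps raw, "
          f"{data_gb_per_s('gen2', lanes):g} GB/s")
```

With these rules a Gen2 x4 link is 20Gbps raw, Gen2 x8 gives 4GB/s and Gen2 x16 gives 8GB/s, which agrees with the figures quoted elsewhere in the slides.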
