www.bsc.es
Pedraforca: a First
ARM + GPU Cluster for HPC
Nikola Puzovic, Alex Ramirez
We’ve hit the power wall
ALL computers are limited by power consumption
Multi-core
– Fujitsu SPARC64 VIIIfx
– Intel SandyBridge
– AMD Bulldozer
Low-power processors
– IBM BlueGene/Q
Compute accelerators
– IBM Cell
– NVIDIA Tesla
– AMD Radeon
– Intel Xeon Phi
Energy-efficient approaches
Build the next HPC system on commodity and super-commodity components
– 100M tablets in 2012
– 750M smartphones in 2012
The next step in the commodity chain
HPC
Servers
Desktop
Mobile
Tegra 2
– Dual-core ARM Cortex-A9
– ULP Embedded GPU
Tegra 3
– Quad-core ARM Cortex-A9
– 12-core Embedded GPU
Tegra 4
– Quad-core ARM Cortex-A15
– 72-core Embedded GPU
NVIDIA Tegra: Commodity CPU + GPU platform
Proof of concept
– It is possible to deploy a cluster of smartphone processors
Enable software stack development
Tibidabo: The first ARM multicore cluster
Q7 carrier board
2 x Cortex-A9
2 GFLOPS
1 GbE + 100 MbE
7 Watts
0.3 GFLOPS / W
Q7 Tegra 2
2 x Cortex-A9 @ 1GHz
2 GFLOPS
5 Watts (?)
0.4 GFLOPS / W
1U Rackable blade
8 nodes
16 GFLOPS
65 Watts
0.25 GFLOPS / W
2 Racks
32 blade containers
256 nodes
512 cores
9x 48-port 1GbE switch
512 GFLOPS
3.4 kW
0.15 GFLOPS / W
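The efficiency figures above follow from dividing peak GFLOPS by power at each packaging level. A minimal Python sketch of that arithmetic, using only the numbers quoted on the slide (the dictionary layout and names are illustrative, not part of the original deck):

```python
# Peak performance (GFLOPS) and power (W) per packaging level, as quoted on the slide.
levels = {
    "Q7 Tegra 2 module": (2.0, 5.0),        # power is marked "(?)" on the slide
    "Q7 carrier board": (2.0, 7.0),
    "1U blade (8 nodes)": (16.0, 65.0),
    "2 racks (256 nodes)": (512.0, 3400.0),
}

def gflops_per_watt(gflops, watts):
    """Energy efficiency as peak GFLOPS divided by power draw in watts."""
    return gflops / watts

efficiency = {name: gflops_per_watt(g, w) for name, (g, w) in levels.items()}

for name, value in efficiency.items():
    print(f"{name:22s} {value:.2f} GFLOPS/W")
```

Each level adds power without adding FLOPS: eight carrier boards at 7 W account for 56 W of the 65 W blade, and 32 blades at 65 W account for about 2.1 kW of the 3.4 kW total.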
Open source system software stack
– Ubuntu/Debian Linux OS
– GNU compilers
• gcc, g++, gfortran
– Scientific libraries
• ATLAS, FFTW, HDF5, ...
– Slurm cluster management
Runtime libraries
– MPICH2, CUDA, …
– OmpSs toolchain*
Developer tools
– Paraver, Scalasca
– Allinea DDT debugger
HPC System software stack on ARM
[Diagram: software stack, one layer per line]
Source files (C, C++, FORTRAN, …)
Compiler(s): gcc, gfortran, OmpSs, …
Executable(s)
Scientific libraries: ATLAS, FFTW, HDF5, …
Developer tools: Paraver, Scalasca, …
Cluster management (Slurm)
OmpSs runtime library (NANOS++)
CUDA, OpenCL, MPI, GASNet
Linux on each node
CPU and GPU on each node
* S3232 - OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware Accelerators. Thursday, 10:00, Marriott Ballroom 3
Porting applications to ARM
                                                               Prog. Model
Application       Domain                          Institution  MPI  OpenMP  Other            Scalability     ARM port
YALES2            Combustion                      CNRS/CORIA   Y                             >32K
EUTERPE           Fusion                          BSC          Y    Y                        >60K
SPECFEM3D         Wave propagation                CNRS         Y            CUDA, SMPSs      >150K, >1K GPU
MP2C              Multi-particle collision        JSC          Y                             >65K
BigDFT            Elect. Structure                CEA          Y    Y       CUDA, OpenCL     >2K, >300 GPU
Quantum ESPRESSO  Elect. Structure                CINECA       Y    Y       CUDA             Good
PEPC              Coulomb + gravitational forces  JSC          Y            Pthreads, SMPSs  >300K
SMMP              Protein folding                 JSC          Y            OpenCL           16K
ProFASI           Protein folding                 JSC          Y                             Good
COSMO             Weather forecast                CINECA       Y    Y
BQCD              Particle physics                LRZ          Y    Y                        ~300K
Porting full-scale HPC applications to an ARM cluster requires minimal effort
Tegra3 SoC
– Quad-core ARM Cortex-A9
– 6 PCIe lanes (gen1)
Quadro 1000M
– CUDA supported
1 GbE
First hybrid ARM + CUDA platform
CARMA: CUDA on ARM developer kit
CARMA Kit: Energy Efficiency
CARMA platform is much more energy-efficient than Tegra3 alone
Pedraforca v1: The first ARM + GPU cluster
Development cluster of 16 CARMA kits @ BSC
Pedraforca v1: Initial application performance results
Only 3.72 GFLOPS in Linpack … but
– DGEMM: 21.3 GFLOPS (0.78 GFLOPS/W)
– SGEMM: 127.8 GFLOPS (5.04 GFLOPS/W)
– Low PCIe bandwidth (400 MB/s peak)
– No overlap of data transfers and computation
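To see why the last two points matter, here is a rough back-of-envelope model in Python. It assumes a square double-precision matrix multiply of a hypothetical size n, that all three matrices cross PCIe once at the quoted 400 MB/s peak, and that the quoted 21.3 GFLOPS DGEMM rate is a compute-only figure. Only the two rates come from the slide; everything else is an assumption for illustration.

```python
PCIE_BYTES_PER_S = 400e6      # peak PCIe bandwidth quoted on the slide
DGEMM_FLOPS_PER_S = 21.3e9    # DGEMM rate quoted on the slide (assumed compute-only)

def dgemm_times(n):
    """Return (transfer_s, compute_s) for C = A * B with n x n float64 matrices."""
    bytes_moved = 3 * n * n * 8          # A and B to the GPU, C back (assumption)
    flops = 2 * n ** 3                   # multiply and add per inner-product term
    return bytes_moved / PCIE_BYTES_PER_S, flops / DGEMM_FLOPS_PER_S

def serial_time(n):
    """No overlap: transfer and compute phases run one after the other."""
    transfer, compute = dgemm_times(n)
    return transfer + compute

def overlapped_time(n):
    """Ideal overlap: the longer of the two phases hides the shorter one."""
    transfer, compute = dgemm_times(n)
    return max(transfer, compute)

n = 4096  # hypothetical problem size
transfer, compute = dgemm_times(n)
print(f"transfer {transfer:.2f} s, compute {compute:.2f} s")
print(f"serial {serial_time(n):.2f} s, ideal overlap {overlapped_time(n):.2f} s")
```

In this model the transfer-to-compute ratio grows as n shrinks, so smaller problems are increasingly bound by the PCIe link. On CUDA platforms, overlap of this kind is typically obtained with asynchronous copies on separate streams.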
Pedraforca v2: Next generation ARM + GPU platform
Tegra3 Q7 module
4x ARM Cortex-A9 @ 1.3 GHz
2GB DDR2
Mini-ITX carrier
4x PCIe Gen1
SATA 2.0
1 GbE
2.5” SSD
250 GB
SATA 3 MLC
NVIDIA Tesla K20
16x PCIe Gen3
1170 GFLOPS (peak)
Mellanox ConnectX-3
8x PCIe Gen3
40 Gb/s
Ethernet 1 Gb/s (service + storage)
InfiniBand 40 Gb/s (MPI)
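The PCIe generations and lane counts listed above imply very different theoretical link rates. The Python sketch below compares them using the standard per-lane signalling rates and encodings for PCIe Gen1 and Gen3. The slide does not say how the GPU and the network adapter are wired to the Tegra3's four Gen1 lanes, so this compares nominal link capabilities only and says nothing about measured throughput.

```python
# Per-lane PCIe parameters: (transfer rate in GT/s, encoding efficiency).
PCIE_GEN = {
    1: (2.5, 8 / 10),      # Gen1 uses 8b/10b encoding
    3: (8.0, 128 / 130),   # Gen3 uses 128b/130b encoding
}

def link_gb_per_s(gen, lanes):
    """Theoretical one-direction bandwidth of a PCIe link in GB/s."""
    rate_gt, efficiency = PCIE_GEN[gen]
    return rate_gt * efficiency * lanes / 8   # 8 bits per byte

links = {
    "Tegra3 carrier (Gen1 x4)": link_gb_per_s(1, 4),
    "Tesla K20 (Gen3 x16)": link_gb_per_s(3, 16),
    "ConnectX-3 (Gen3 x8)": link_gb_per_s(3, 8),
}

for name, bandwidth in links.items():
    print(f"{name:26s} {bandwidth:5.2f} GB/s")
```

The host side tops out at about 1 GB/s in theory, well below what either device can accept, which is consistent with the deck's emphasis on direct GPU to GPU transfers rather than staging data through the ARM CPU.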
Pedraforca: Rack enclosure
2x GbE switch
4x IB switch
Login nodes
Intel SandyBridge E5
64x Compute nodes
4x ARM Cortex-A9
1x NVIDIA Tesla K20
NFS Storage
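As a quick sanity check on scale, the node count and the per-GPU peak quoted in this deck can be multiplied out. This Python sketch counts GPU peak only and ignores the ARM cores' contribution and the login nodes, which is a simplifying assumption.

```python
NODES = 64                  # compute nodes in the rack enclosure (from the slide)
K20_PEAK_GFLOPS = 1170.0    # Tesla K20 peak quoted in the node description
ARM_CORES_PER_NODE = 4      # Cortex-A9 cores per compute node

gpu_peak_tflops = NODES * K20_PEAK_GFLOPS / 1000.0
total_arm_cores = NODES * ARM_CORES_PER_NODE

print(f"GPU peak: {gpu_peak_tflops:.2f} TFLOPS across {NODES} nodes")
print(f"ARM cores: {total_arm_cores}")
```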
Pedraforca: Interconnect
GbE network for service and storage
IB network for MPI
– With extra ports to connect to other clusters …
[Diagram labels: GbE (x2), IB (x4)]
Current GPU clusters
– Fixed ratio of CPU to GPU
– Unused GPU in non-accelerated apps
– Unused CPU in heavily accelerated apps
Decouple CPU from GPU
– Off-load kernels to remote GPU
– Direct GPU to GPU data transfers
• Orchestrated by light-weight ARM CPU
GPU-accelerated cluster vs. GPU-accelerator cluster
[Diagram: GPU-accelerated cluster, with CPU + GPU pairs attached to the interconnection network, next to a GPU-accelerator cluster, with CPUs and GPUs attached to the interconnection network separately]
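The argument for decoupling can be illustrated with a toy model in Python. It assumes two extreme job types, CPU-only jobs that never touch a GPU and heavily accelerated jobs that leave the host CPU idle, with each job occupying one device of the kind it needs. The job counts and the one-orchestrator-per-GPU ratio are hypothetical, chosen only to mirror the node layout described in this deck.

```python
def coupled_cluster(cpu_only_jobs, gpu_heavy_jobs):
    """Fixed 1:1 CPU:GPU nodes: every job takes a whole CPU + GPU pair."""
    nodes = cpu_only_jobs + gpu_heavy_jobs
    return {
        "cpus": nodes,
        "gpus": nodes,
        "idle_cpus": gpu_heavy_jobs,   # CPUs stranded under accelerated apps
        "idle_gpus": cpu_only_jobs,    # GPUs stranded under non-accelerated apps
    }

def decoupled_cluster(cpu_only_jobs, gpu_heavy_jobs):
    """Separate pools: GPUs sit behind light-weight ARM orchestrators."""
    return {
        "cpus": cpu_only_jobs,
        "gpus": gpu_heavy_jobs,
        "arm_orchestrators": gpu_heavy_jobs,   # assumed one per GPU
    }

coupled = coupled_cluster(cpu_only_jobs=10, gpu_heavy_jobs=6)
decoupled = decoupled_cluster(cpu_only_jobs=10, gpu_heavy_jobs=6)
print("coupled:  ", coupled)
print("decoupled:", decoupled)
```

Real workloads sit between these extremes, so the model overstates the idle capacity, but it shows where the waste in a fixed-ratio design comes from.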
Conclusions
CARMA is not an HPC solution …
… but it enables software development already
Pedraforca is the second generation ARM + GPU prototype
– GPU-accelerator cluster, instead of GPU-accelerated cluster
• ARM CPU used to orchestrate direct GPU to GPU communication
CPU + GPU integration is happening already
– Embedded mobile platforms with OpenCL capable GPU
Get ready for your next generation CPU + GPU platforms!
We’re hiring!
Do you want to work on the next generation of energy-efficient HPC systems?
Lead the way to the Exascale?
Change the HPC world forever?
http://www.bsc.es/about_bsc/employment/vacancies
– Senior Researchers in Energy-Efficient Supercomputers
– HPC Application Developers