Increasing the efficiency of your GPU-enabled cluster with rCUDA
Federico Silla, Technical University of Valencia, Spain
HPC Advisory Council European Conference 2014, Leipzig
Outline
Why remote GPU virtualization?
How does rCUDA work?
The performance of the rCUDA framework
Scheduling virtual GPUs with SLURM
rCUDA and KVM virtual machines
Low-power processors and rCUDA
Why remote GPU virtualization?
Current computing needs
• Many applications require a large amount of computing resources
• Their execution times are usually long
• Applications are accelerated in order to reduce their execution time
• GPU computing has experienced remarkable growth in recent years
GPUs reduce energy and time
GPU-Blast: accelerated version of the NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool
blastp -db sorted_env_nr -query SequenceLength_00001300.txt -num_threads X -gpu [t|f]
Dual-socket Intel Xeon E5-2620 v2 node with an NVIDIA K20 GPU
[Figure: execution on GPU vs. on CPU; improvements of 2.36x and 3.56x are highlighted]
Current GPU computing facilities
The basic building block is a node with one or more GPUs
Current GPU computing facilities
From the programming point of view, the cluster is a set of nodes, each one with:
• one or more CPUs (with several cores per CPU)
• one or more GPUs (typically between 1 and 4)
plus an interconnection network
A GPU computing facility is usually a set of independent, self-contained nodes that leverage the shared-nothing approach:
• Nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster)
• GPUs can only be used within the node they are attached to
[Figure: each node contains one or more CPUs, main memory, a network interface and PCIe-attached GPUs with their own memory; the nodes are linked by an interconnection network]
Money leakage in current clusters?
For many workloads, GPUs may be idle for long periods of time:
• Initial acquisition costs are not amortized
• Space: GPUs reduce CPU density
• Energy: idle GPUs keep consuming power
[Figure: idle power consumption (watts) over time for a 1-GPU node and a 4-GPU node; a 25% figure is highlighted]
• 1-GPU node: 2 E5-2620 v2 sockets, 32 GB DDR3 RAM, one Tesla K20 GPU
• 4-GPU node: 2 E5-2620 v2 sockets, 128 GB DDR3 RAM, four Tesla K20 GPUs
Further concerns in accelerated clusters
• Applications can only use the GPUs located within their own node
• Non-accelerated applications keep GPUs idle when they use all the cores of a node
• Multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if those GPUs are idle)
We need something else in the cluster
What is missing is some flexibility for using the GPUs in the cluster
How to address current concerns: increasing flexibility
A way of addressing the "idle" concern is to share the GPUs present in the cluster among all the nodes; once GPUs are shared, their number can be reduced.
This would increase GPU utilization and lower power consumption, while also reducing initial acquisition costs.
What is needed for increased flexibility?
This new cluster configuration requires:
• A way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization)
• Enhanced job schedulers that take the new virtual GPUs into account
The remote GPU virtualization vision
Remote GPU virtualization allows a new view of a GPU deployment, moving from the usual cluster configuration:
[Figure: the usual cluster configuration, with GPUs attached to every node]
... to the following one:
The remote GPU virtualization vision
[Figure: physical configuration (GPUs installed in only some nodes) vs. logical configuration (GPUs decoupled from the nodes and logically available to every node through the interconnection network)]
Busy cores are no longer a problem
[Figure: physical vs. logical configuration; even when all the CPU cores of a node are busy, the GPUs in the cluster remain logically accessible to applications running on other nodes]
GPU virtualization is also useful for multi-GPU applications
• Without GPU virtualization, only the GPUs in the node can be provided to the application
• With GPU virtualization, many GPUs in the cluster can be provided to the application
Multi-GPU applications benefit
[Figure: a multi-GPU application restricted to its own node's GPUs vs. the same application logically connected to GPUs in many nodes]
The remote GPU virtualization vision
• Without GPU virtualization: applications use the real local GPUs
• With GPU virtualization: applications use virtualized remote GPUs
GPU virtualization allows all nodes to access all GPUs
[Figure: every node sees the whole set of GPUs in the cluster as if they were local]
More about reducing energy consumption
One step further: enhance the scheduling process so that GPU servers are put into low-power sleeping modes as soon as their acceleration features are not required
Networked GPU boxes
Going even further: consolidate GPUs into dedicated GPU boxes (no CPU power required) and allow GPU task migration
TRUE GREEN GPU COMPUTING
GPU task migration
• Box A has 4 GPUs and only one is busy
• Box B has 8 GPUs but only two are busy
• Move the jobs from Box B to Box A and switch off Box B
• Migration should be transparent to applications (decided by the global scheduler)
TRUE GREEN GPU COMPUTING
Problem with remote GPU virtualization
The main drawback of remote GPU virtualization is the increased latency and reduced bandwidth to the remote GPU.
[Figure: influence of data transfers for SGEMM; percentage of time devoted to data transfers vs. matrix size, for pinned and non-pinned memory. Data from a matrix-matrix multiplication using a local GPU]
Remote GPU virtualization frameworks
Several efforts have been made regarding GPU virtualization during the last years (some publicly available, others not):
• rCUDA (CUDA 6.0)
• GVirtuS (CUDA 3.2)
• DS-CUDA (CUDA 4.1)
• vCUDA (CUDA 1.1)
• GViM (CUDA 1.1)
• GridCUDA (CUDA 2.3)
• V-GPU (CUDA 4.0)
Remote GPU virtualization frameworks
Performance comparison of GPU virtualization solutions:
• Intel Xeon E5-2620 (6 cores) 2.0 GHz (Sandy Bridge architecture)
• NVIDIA Tesla K20 GPU
• Mellanox ConnectX-3 single-port InfiniBand adapter (FDR)
• SX6025 Mellanox switch
• CentOS 6.3 + Mellanox OFED 1.5.3

Latency (measured by transferring 64 bytes):

            Pageable H2D   Pinned H2D   Pageable D2H   Pinned D2H
CUDA             34.3           4.3          16.2           5.2
rCUDA            94.5          23.1         292.2           6.0
GVirtuS         184.2         200.3         168.4         182.8
DS-CUDA          45.9            -           26.5            -
Remote GPU virtualization frameworks
Bandwidth of a copy between GPU and CPU memories
[Figure: bandwidth (MB/s) vs. transfer size (MB) for Host-to-Device and Device-to-Host copies, with pageable and pinned memory]
Applications tested with rCUDA
rCUDA has been successfully tested with the following applications:
• NVIDIA CUDA SDK samples
• LAMMPS
• WideLM
• CUDASW++
• OpenFOAM
• HOOMDBlue
• mCUDA-MEME
• GPU-Blast
• Gromacs
• GAMESS
• DL-POLY
• HPL
The list keeps growing.
How does rCUDA work?
Basics of the rCUDA framework
rCUDA is a framework enabling a CUDA-based application running in one (or several) node(s) to access GPUs located in other nodes.
It is useful for:
• Applications that do not make use of GPUs all the time (moderate level of data parallelism)
• Multi-GPU applications
Basics of the rCUDA framework
[Figure slides: basic CUDA behavior with a local GPU, followed by the rCUDA operation, in which the client side forwards CUDA calls over the network to a server process running on the node that hosts the GPU]
How to declare remote GPUs
Environment variables are properly initialized on the client side and used by the rCUDA client library, transparently to the application. They specify:
• the number of GPUs exposed to the application
• for each of them, the server name/IP address and the GPU within that server ("server:GPU")
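For illustration only (the host names and application name below are made up; the variable names and the "server:GPU" format are the ones shown later in the SLURM examples), a client-side setup could look like this:

export RCUDA_DEVICE_COUNT=2            # two remote GPUs exposed to the application
export RCUDA_DEVICE_0=gpuserver1:0     # first GPU of node gpuserver1
export RCUDA_DEVICE_1=gpuserver2:0     # first GPU of node gpuserver2
./my_cuda_app                          # unmodified CUDA application, linked against the rCUDA client library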
Basics of the rCUDA framework
rCUDA uses a proprietary communication protocol between the client and the server. Example sequence:
1) initialization
2) memory allocation on the remote GPU
3) CPU-to-GPU transfer of the input data
4) kernel execution
5) GPU-to-CPU transfer of the results
6) GPU memory release
7) communication channel closing and server process finalization
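As a rough illustration (a minimal sketch, not code from the rCUDA distribution), the sequence above maps onto the ordinary CUDA runtime calls issued by an unmodified application; the rCUDA client library intercepts these calls and forwards them to the remote server:

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *v, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));                    // steps 1-2: initialization and allocation on the (remote) GPU
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // step 3: CPU-to-GPU transfer of the input data
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                   // step 4: kernel execution
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // step 5: GPU-to-CPU transfer of the results
    cudaFree(d);                                                   // step 6: GPU memory release
    free(h);
    return 0;                                                      // step 7: channel closed when the process finishes
}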
rCUDA presents a modular architecture
rCUDA uses optimized transfers
rCUDA features optimized communications:
• Use of GPUDirect RDMA to move data between GPUs
• Pipelined transfers to improve performance
• Preallocated pinned memory buffers
• Optimal pipeline block size
A sketch of the pipelining idea is given below.
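The following is a minimal sketch of the general double-buffering technique, not rCUDA's actual implementation: a large pageable host buffer is streamed to the device through two preallocated pinned staging buffers, so that filling one staging buffer overlaps with the asynchronous transfer of the previous block. In rCUDA the asynchronous stage would additionally cross the network to the remote GPU; a local device is assumed here.

#include <cuda_runtime.h>
#include <string.h>

// Copy 'size' bytes from a pageable host buffer to the device through two
// preallocated pinned staging buffers of 'block' bytes each.
void pipelined_h2d(void *dev_dst, const void *host_src, size_t size, size_t block)
{
    void *pinned[2];
    cudaEvent_t drained[2];
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    for (int b = 0; b < 2; b++) {
        cudaMallocHost(&pinned[b], block);       // preallocated pinned memory buffers
        cudaEventCreate(&drained[b]);
    }

    size_t off = 0;
    for (int i = 0; off < size; off += block, i++) {
        int b = i % 2;
        size_t chunk = (size - off < block) ? (size - off) : block;
        if (i >= 2)
            cudaEventSynchronize(drained[b]);    // wait until this staging buffer has been drained
        memcpy(pinned[b], (const char *)host_src + off, chunk);      // fill block i on the CPU...
        cudaMemcpyAsync((char *)dev_dst + off, pinned[b], chunk,     // ...while block i-1 may still be
                        cudaMemcpyHostToDevice, stream);             // transferred asynchronously
        cudaEventRecord(drained[b], stream);
    }
    cudaStreamSynchronize(stream);

    for (int b = 0; b < 2; b++) {
        cudaFreeHost(pinned[b]);
        cudaEventDestroy(drained[b]);
    }
    cudaStreamDestroy(stream);
}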
Basic performance analysis
Pipeline block size for InfiniBand FDR
• NVIDIA Tesla K20; Mellanox ConnectX-3 + SX6025 Mellanox switch
[Figure: attained bandwidth vs. pipeline block size; the optimal block size was 2 MB with InfiniBand QDR]
Basic performance analysis
Bandwidth of a copy between GPU and CPU memories
[Figure: Host-to-Device bandwidth for pageable and pinned memory]
Basic performance analysis
Latency study: copying a small dataset (64 bytes)
The performance of the rCUDA framework
Test system:
• Intel Xeon E5-2620 (6 cores) 2.0 GHz (Sandy Bridge architecture)
• NVIDIA Tesla K20 GPU
• Mellanox ConnectX-3 single-port InfiniBand adapter (FDR)
• SX6025 Mellanox switch
• Cisco SLM2014 switch (1 Gbps Ethernet)
• CentOS 6.3 + Mellanox OFED 1.5.3
Single-GPU applications
[Figure: normalized execution time of a set of CUDA SDK examples for CUDA, rCUDA over FDR InfiniBand and rCUDA over Gigabit Ethernet]
Single-GPU applications
CUDASW++: bioinformatics software for Smith-Waterman protein database searches
[Figure: execution time (s) and rCUDA overhead (%) vs. sequence length, for CUDA and for rCUDA over FDR InfiniBand, QDR InfiniBand and Gigabit Ethernet]
Single-GPU applications
GPU-Blast: accelerated version of the NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool
[Figure: execution time (s) and rCUDA overhead (%) vs. sequence length, for CUDA and for rCUDA over FDR InfiniBand, QDR InfiniBand and Gigabit Ethernet]
Multi-GPU applications
Test system for multi-GPU applications:
• For CUDA tests, one node with 2 quad-core Intel Xeon E5440 processors and a Tesla S2050 (4 Tesla GPUs); each thread (1-4) uses one GPU
• For rCUDA tests, 8 nodes, each with 2 quad-core Intel Xeon E5520 processors, 1 NVIDIA Tesla C2050 and InfiniBand QDR; the test runs in one node and uses up to all the GPUs of the other nodes; each thread (1-8) uses one GPU
An illustrative sketch of how such an application selects among its GPUs is shown below.
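For context only (an illustrative sketch, not code from the talk), a multi-GPU application simply selects among the CUDA devices it sees; with rCUDA those devices may live in different nodes, but the selection code does not change:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);         // with rCUDA, the remote GPUs declared to the client are reported here
    printf("Visible GPUs: %d\n", count);
    for (int dev = 0; dev < count; dev++) {
        cudaSetDevice(dev);             // select one (possibly remote) GPU
        void *buf;
        cudaMalloc(&buf, 1 << 20);      // subsequent allocations and launches target the selected GPU
        cudaFree(buf);
    }
    return 0;
}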
Multi-GPU applications
MonteCarloMultiGPU (from the NVIDIA SDK)
Scheduling virtual GPUs with SLURM
Integrating rCUDA with SLURM
• SLURM (Simple Linux Utility for Resource Management) is used as the job scheduler
• SLURM does not understand virtualized GPUs
• A new GRES (generic resource) is added in order to manage virtualized GPUs
• Where the GPUs are located in the system is completely transparent to the user
• In the job script, or in the submission command, the user specifies the number of rGPUs (remote GPUs) required by the job; the amount of GPU memory required by the job may also be specified
The basic idea about SLURM
[Figure: example job schedule with conventional, node-attached GPUs]
The basic idea about SLURM + rCUDA
[Figure: GPUs are decoupled from nodes; all jobs are executed in less time]
Sharing remote GPUs among jobs
[Figure: GPUs are decoupled from nodes and GPU 0 is scheduled to be shared among jobs; all jobs are executed in even less time]
SLURM configuration
slurm.conf:
ClusterName=rcu
…
GresTypes=gpu,rgpu
…
NodeName=rcu16 NodeAddr=rcu16 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7990 Gres=rgpu:4,gpu:4
SLURM configuration
scontrol show node:
NodeName=rcu16 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
Gres=rgpu:4,gpu:4
NodeAddr=rcu16 NodeHostName=rcu16
OS=Linux RealMemory=7682 Sockets=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
BootTime=2013-04-24T18:45:35
SlurmdStartTime=2013-04-30T10:02:04
SLURM configuration
gres.conf:
Name=rgpu File=/dev/nvidia0 Cuda=2.1 Mem=1535m
Name=rgpu File=/dev/nvidia1 Cuda=3.5 Mem=1535m
Name=rgpu File=/dev/nvidia2 Cuda=1.2 Mem=1535m
Name=rgpu File=/dev/nvidia3 Cuda=1.2 Mem=1535m
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Submitting jobs with SLURM
Submit a job:
$ srun -N1 --gres=rgpu:4:512M script.sh ...
Environment variables are initialized by SLURM and used by the rCUDA client, transparently to the user, following the "server name/IP address : GPU" format:
RCUDA_DEVICE_COUNT=4
RCUDA_DEVICE_0=rcu16:0
RCUDA_DEVICE_1=rcu16:1
RCUDA_DEVICE_2=rcu16:2
RCUDA_DEVICE_3=rcu16:3
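As an illustration only (not taken from the slides; gpu_app is a placeholder binary and the rgpu GRES is assumed to be configured as shown above), an equivalent batch submission could look like this:

#!/bin/bash
#SBATCH --job-name=rcuda_test
#SBATCH --nodes=1
#SBATCH --gres=rgpu:2:512M     # two remote GPUs with at least 512 MB of memory each
./gpu_app                      # the rCUDA client reads the RCUDA_DEVICE_* variables set by SLURM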
Integrating rCUDA with SLURM
Test bench for checking SLURM correctness: an old heterogeneous cluster with 28 nodes:
• 24 nodes without GPU, each with 2 AMD Opteron 1.4 GHz processors and 2 GB RAM
• 1 node without GPU, quad-core i7 3.5 GHz, 8 GB RAM
• 3 nodes with GeForce GTX 670 or GTX 590 GPUs (2 GB), quad-core i7 3 GHz
• 1 Gbps Ethernet interconnect
• Scientific Linux release 6.1 and Open Suse 11.2 (dual boot)
Integrating rCUDA with SLURM
Test bench for analyzing SLURM + rCUDA performance:
• InfiniBand ConnectX-3 based cluster
• CentOS 6.4 Linux
• Dual-socket Intel Xeon E5-2620 v2 based nodes: 3 nodes without GPU, 1 node with an NVIDIA K40 GPU, 10 nodes with an NVIDIA K20 GPU, and 1 node with 4 NVIDIA K20 GPUs
First tests carried out (more complex tests are on the way), using:
• 1 node hosting the main SLURM controller
• 1 node without GPU
• 1 node with the K20 GPU
• 1 node with the K40 GPU
GPU-Blast using a K40 GPU remotely
[Figure: execution times for theoretical, exclusive and shared GPU usage with CUDA, rCUDA over Ethernet and rCUDA over InfiniBand]
rCUDA and KVM virtual machines
Why rCUDA with virtual machines?
• Current clusters frequently leverage virtual machines in order to attain energy and cost reductions
• Xen and KVM virtual machines are commonly used
• Applications executed inside virtual machines need access to GPU computing resources
• Different Xen and KVM virtual machines can share an InfiniBand network card thanks to the virtualization features of the InfiniBand driver
• NVIDIA drivers do not provide such virtualization features, thus preventing the concurrent usage of a GPU from several virtual machines
• rCUDA can be used to provide concurrent access to GPUs
How to connect an InfiniBand card to a VM
• PCI pass-through is used to assign an InfiniBand card (either the real card or a virtual function) to a given VM
• The InfiniBand card manages the several virtual copies (virtual functions, VFs)
[Figure: one real InfiniBand card exposing several virtual InfiniBand cards (VFs)]
Performance analysis on KVM VMs
Setup: a computer running KVM VMs and a remote computer, connected through InfiniBand.
• RealX3: actual ConnectX-3 (1-port) card
• VF: virtual copy used from the host OS
• VM: virtual card copy used from a VM
[Figure: RDMA write bandwidth (MB/s), normalized bandwidth and normalized latency vs. transfer size (bytes) for the three configurations]
CUDA BWtest from a KVM VM
Setup: a computer running KVM VMs and a remote computer with a Tesla K20.
[Figure: measured bandwidth from inside the VM]
Low-power processors and rCUDA
rCUDA and low-power processors
• There is a clear trend toward improving the performance-power ratio in HPC facilities:
  • by leveraging accelerators (Tesla K20 by NVIDIA, FirePro by AMD, Xeon Phi by Intel, …)
  • more recently, by using low-power processors (Intel Atom, ARM, …)
• rCUDA allows attaching a "pool of accelerators" to a computing facility:
  • fewer accelerators than nodes: accelerators are shared among nodes
  • the performance-power ratio is probably improved further
• How interesting is a heterogeneous platform leveraging powerful processors, low-power ones, and remote GPUs?
  • Which is the best configuration?
  • Is energy effectively saved?
rCUDA and low-power processors
The computing platforms analyzed in this study are:
• KAYLA: NVIDIA Tegra 3 ARM Cortex-A9 quad-core CPU (1.4 GHz), 2 GB DDR3 RAM and Intel 82574L Gigabit Ethernet controller
• ATOM: Intel Atom quad-core S1260 CPU (2.0 GHz), 8 GB DDR3 RAM and Intel I350 Gigabit Ethernet controller; no PCIe connector for a GPU
• XEON: Intel Xeon X3440 quad-core CPU (2.4 GHz), 8 GB DDR3 RAM and Intel 82574L Gigabit Ethernet controller
The accelerators analyzed are:
• CARMA: NVIDIA Quadro 1000M GPU (96 cores) with 2 GB DDR5 RAM; the rest of the system is the same as for KAYLA
• FERMI: NVIDIA GeForce GTX480 "Fermi" GPU (448 cores) with 1,280 MB DDR3/GDDR5 RAM
rCUDA and low-power processors
1st analysis: using local GPUs; CUDASW++ (top) and LAMMPS (bottom)
[Figures: execution time, power and energy for the different CPU + GPU combinations]
• The XEON+FERMI combination achieves the best performance
• LAMMPS makes a more intensive use of the GPU than CUDASW++
• The lower performance of the PCIe bus in KAYLA reduces performance
• The 96 cores of CARMA provide a noticeably lower performance than a more powerful GPU
• CARMA requires less power
• When execution time and required power are combined into energy, XEON+FERMI is more efficient
Summing up:
• If execution time is a concern, then use XEON+FERMI
• If power is to be minimized, CARMA is a good choice
• The lower performance of KAYLA's PCIe decreases the interest of this option for single-system, local-GPU usage
• If energy must be minimized, XEON+FERMI is the right choice
rCUDA and low-power processors
2nd analysis: use of remote GPUs
• The Quadro GPU of the CARMA system is discarded as an rCUDA server due to its poor performance; only the FERMI GPU is considered
• Six different combinations of rCUDA clients and servers are feasible (only one client system and one server system are used in each experiment):

  Combination   Client   Server
  1             KAYLA    KAYLA+FERMI
  2             ATOM     KAYLA+FERMI
  3             XEON     KAYLA+FERMI
  4             KAYLA    XEON+FERMI
  5             ATOM     XEON+FERMI
  6             XEON     XEON+FERMI
rCUDA and low-power processors
2nd analysis: use of remote GPUs, CUDASW++
[Figures: execution time, power and energy for each client (KAYLA, ATOM, XEON) against each server (KAYLA+FERMI, XEON+FERMI)]
• The KAYLA-based server presents a much higher execution time
• The network interface of KAYLA delivers much less than the expected 1 Gbps bandwidth
• The lower bandwidth of KAYLA (and ATOM) can also be noticed when using the XEON server
• Most of the power and energy is required by the server side, as it hosts the GPU
• The much shorter execution time of XEON-based servers renders them more energy-efficient
rCUDA and low-power processors
2nd analysis: use of remote GPUs, LAMMPS
Four combinations are used:

  Combination   Client   Server
  1             KAYLA    KAYLA+FERMI
  2             KAYLA    XEON+FERMI
  3             ATOM     XEON+FERMI
  4             XEON     XEON+FERMI

[Figures: execution time, power and energy for each combination]
• Similar conclusions to CUDASW++ apply to LAMMPS, due to the network bottleneck
rCUDA and low-power processors
• Atom-based systems with PCIe 2.0 x8 may host an InfiniBand card
• InfiniBand QDR adapters would noticeably increase the network bandwidth
Summary about rCUDA
rCUDA is the enabling technology for:
• High-throughput computing
  • Using remote GPUs does not make applications execute faster
  • Sharing remote GPUs makes applications execute slower
  • BUT more throughput (jobs per unit of time) is achieved
  • Datacenter administrators can choose between HPC and HTC
• Green computing
  • GPU migration and application migration allow devoting just the required computing resources to the current load
• More flexible system upgrades
  • GPU and CPU updates become independent from each other; adding GPU boxes to non-GPU-enabled clusters is possible

You can get a free copy of rCUDA at http://www.rcuda.net