rCUDA is multiplatform
rCUDA is available for the x86, ARM and Power processor architectures. Furthermore, rCUDA clients and rCUDA servers running on different processor architectures can interact with each other. This opens up several options. For instance:
a) You can run your applications as always in x86-based systems while accelerating them with the extraordinary performance of multi-GPU IBM Power systems leveraging the NVLink technology.
b) You can run your applications in low-power ARM-based systems while accelerating the compute-intensive parts of your applications in remote GPUs located either in x86-based systems or in IBM Power nodes. The following charts are an example of this possibility.
rCUDA is currently available for Linux
rCUDA will be available for Windows in the future
CPU vs GPU performance of the CloverLeaf application. CPU: up to 64 ThunderX ARM nodes (estimation based on performance in one node). GPU: single ThunderX ARM node using 3 GPUs located in a remote x86 server.
CPU vs GPU performance of the Flow application. CPU: up to 64 ThunderX ARM nodes (estimation based on performance in one node). GPU: single ThunderX ARM node using 3 GPUs located in a remote x86 server.
The rCUDA middleware will also be available for the Windows operating system as soon as funding availability allows devoting resources to that new development.
Make your GPUs even better with rCUDA!
How rCUDA works
The rCUDA middleware leverages a client-server architecture. The rCUDA client is a set of libraries installed in the cluster node executing the CUDA application, where it replaces NVIDIA's libraries. The server side of rCUDA runs on the computers owning GPUs. In order to use rCUDA, applications do not need to be modified; they simply need to be linked against the rCUDA libraries instead of the CUDA ones.
The rCUDA client provides applications with the illusion of dealing with local GPUs. Whenever an application accesses the virtual GPU, the rCUDA client forwards the request to the server owning the actual GPU. The rCUDA server receives those requests and forwards them to the real GPU.
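A minimal sketch of this deployment, assuming the client environment-variable names from the rCUDA user's guide (RCUDA_DEVICE_COUNT and RCUDA_DEVICE_n); the installation path and hostname below are hypothetical examples:

```shell
# On the server node that owns the GPU, start the rCUDA daemon
# (example path; use your installation directory):
#   /opt/rCUDA/bin/rCUDAd

# On the client node running the CUDA application, describe the
# remote GPU to use (hostname below is a hypothetical example):
export RCUDA_DEVICE_COUNT=1            # number of remote GPUs to expose
export RCUDA_DEVICE_0=gpu-server-01:0  # <server>:<gpu index> for virtual device 0

# Make the application load the rCUDA libraries instead of NVIDIA's:
export LD_LIBRARY_PATH=/opt/rCUDA/lib:$LD_LIBRARY_PATH

# Launch the unmodified CUDA application as usual:
#   ./my_cuda_app
```

Since the application is only relinked against the rCUDA libraries, the same binary runs unchanged whether the GPU is local or remote.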
[Diagram: rCUDA client-server architecture. On the client side, the CUDA application calls the CUDA API; the rCUDA client engine (replacing the CUDA libraries) forwards the requests over the network to the rCUDA server engine, which drives the GPU on the server side. Hardware and software layers are shown for both sides.]
CUDA is limited to using GPUs that are physically installed in the node where the application is executed. With rCUDA, applications are not limited to local GPUs: they can use any GPU in the cluster.
[Diagram: nodes without GPUs running CUDA applications access GPUs in other nodes through rCUDA (remote CUDA).]
Make better use of the GPUs in your cluster
Make your GPUs flexible! ... detach ... share ... aggregate
Save acquisition and energy costs
High energy savings by sharing the GPUs installed in the cluster
Use all the GPUs installed in your cluster from a single node
Seamless use of GPUs installed in remote computers
The rCUDA middleware virtualizes GPUs located in different nodes of the cluster, creating a pool of GPUs that can be accessed from any cluster node. Additionally, rCUDA enables concurrent access to them. This is done transparently to programmers and applications. Furthermore, application source code does not need to be modified. In this way, rCUDA provides applications with the illusion of dealing with real local GPUs.
The rCUDA team thanks Mellanox for its support on promoting the rCUDA technology
The integration of rCUDA with the SLURM job scheduler creates the perfect suite for introducing remote GPU virtualization into your cluster, making it possible to effectively reduce the number of GPUs requested without significantly reducing overall performance. Overall energy consumption is noticeably reduced.
rCUDA is a development by:
Technical University of Valencia
Traditional cluster configuration
Cluster configuration with rCUDA
· Support for CUDA 9.1
· Binary compatibility
· Thread-safe
· Linux compatible
· TCP/IP communications
· Specific InfiniBand support
· Specific support for RoCE
· GPUDirect RDMA enabled
· Multi-GPU server enabled
· Supports multi-GPU applications
Main rCUDA features
rCUDA is binary compatible with CUDA. Just install rCUDA in your cluster and keep using your applications as usual.
Application performance not changed
Applications using a remote GPU with EDR InfiniBand perform similarly to when executed on local GPUs. More information can be found in the paper: C. Reaño et al. "Local and Remote GPUs Perform Similar with EDR 100G InfiniBand". ACM Middleware Conference 2015.
[Chart: execution time of BarraCUDA, CUDA-MEME, MAGMA, GPU-BLAST, LAMMPS, GPU-LIBSVM and CUDASW++ with CUDA and rCUDA, normalized to the CUDA case (scale 0.0-1.0).]
Amazing performance when moving data to/from remote GPUs
[Charts, CUDA and rCUDA cases: bandwidth (MB/s) vs copy size (MB) for K20 with FDR InfiniBand, K40 with EDR InfiniBand and P100 with EDR InfiniBand.]
[Charts: execution time (s) vs number of nodes (up to 64) for the CloverLeaf and Flow runs described in the captions above; CPU series on ARM nodes, GPU series on a single ARM node using 3 remote P100 GPUs.]
10 years virtualizing GPUs
[Chart: bandwidth (MB/s) vs copy size (MB) for K20 (CUDA vs FDR rCUDA), K40 (CUDA vs EDR rCUDA) and P100 (CUDA vs EDR rCUDA).]
rCUDA for Cloud
Using GPUs in cloud computing is increasingly becoming mainstream. CUDA applications running inside a virtual machine (VM) can be accelerated by using the GPUs in the host. However, assigning GPUs to VMs presents an important burden when using the inexpensive PCI passthrough mechanism: efficiently sharing the GPUs among the VMs.
VMs can access GPUs in the host by means of the PCI passthrough technique. This mechanism assigns a PCI device to a VM in an exclusive way. That is, once a GPU is assigned to a VM, it cannot be assigned to any other VM until the device is detached from the former one. This is an important limitation. The figure on the left shows a server with four GPUs, each of them assigned to one VM.
How can a VM access CUDA GPUs?
[Diagram: four GPUs, each exclusively assigned to one VM through PCI passthrough.]
rCUDA is designed following the client-server distributed approach. Therefore, rCUDA clients can be placed inside VMs whereas the rCUDA server is placed in the native domain owning the GPU. Using the PCI passthrough mechanism is not required. As a consequence, GPUs can be concurrently shared by all the VMs. The figure on the right shows a 4-GPU cluster node with the rCUDA middleware serving four VMs.
How does rCUDA avoid PCI passthrough?
[Diagram: a 4-GPU cluster node running the rCUDA server, concurrently serving four VMs that run the rCUDA client.]
Using rCUDA for providing GPUs to VMs allows for an increased level of flexibility. For instance, the figure on the left shows that the number of GPUs installed in the node can be reduced. On the other hand, the figure on the right shows how a node providing GPU acceleration can be shared among several nodes hosting VMs. The sharing degree of the GPUs is limited by their physical memory. A scheduler is provided with rCUDA in order to control GPU memory occupancy.
[Chart: execution time (seconds) of CUDA-MEME, CUDASW++, GPU-BLAST and LAMMPS under PCI passthrough with 4 VMs and under rCUDA with 4, 8 and 12 VMs.]
[Chart: completed jobs for CUDA-MEME, CUDASW++, GPU-BLAST and LAMMPS under PCI passthrough with 4 VMs and under rCUDA with 4, 8 and 12 VMs.]
What performance can be expected?
Four VMs were created in a node with four GPUs. Each of the VMs was assigned one GPU and executed a series of randomly selected applications for one hour. Results from this test are shown in the charts as "PCI passthrough 4 VMs". Moreover, rCUDA was used in order to share the four GPUs among up to twelve VMs. Results labeled as "rCUDA 4 VMs", "rCUDA 8 VMs" and "rCUDA 12 VMs" refer to these experiments. More information about this experiment can be found in the paper by J. Prades and F. Silla titled "Made-to-Measure GPUs on Virtual Machines with rCUDA" published at NextGenClouds 2018.
GPU migration with rCUDA
With rCUDA it is possible to change, at execution time, the GPU assigned to an application. That is, the application is initially provided one or more GPUs but, during its execution, the GPU part of the application is transparently migrated to another GPU (or GPUs) elsewhere in the cluster. The figure on the right shows the migration time for five different CUDA applications when using FDR InfiniBand, EDR InfiniBand and 1 Gbps Ethernet. It can be seen that migration time is very small when InfiniBand is used.
[Chart: GPU utilization (%) of cluster nodes 1-14 before migration.]
Resource utilization in data centers evolves over time. Thus, at some point in time, the utilization of the GPUs in the cluster may be similar to that depicted in the figure on the left ("Before migration"). It can be seen that GPU utilization is uneven. In this scenario it would be useful to consolidate the GPU part of the applications into a smaller number of servers, so that the nodes that become free can be switched off. The figure on the right ("After migration") depicts this consolidation of GPU jobs: jobs generating a lower GPU utilization, such as the ones in nodes 2, 4, 5, 6, 8, 10 and 11, have been migrated to other nodes. After migration, the nodes sourcing the movement of GPU jobs have been switched off.
Using GPU migration for server consolidation
[Chart: migration times of 3.83, 27.52 and 22.25 seconds shown as data labels.]
Using GPU migration for load balancing
When several applications share a given GPU in the cluster, rCUDA allows migrating each of the GPU jobs of these applications to a different destination GPU. That is, it is not required that GPUs are migrated as a whole. This individual migration of GPU jobs allows load to be balanced across the GPUs in the cluster.
The figure below on the left shows three applications sharing a GPU with rCUDA while two other GPUs remain idle. The job scheduler can detect this uneven GPU utilization and can balance load by migrating some of the applications to other GPUs.
[Diagrams: before load balancing, CloverLeaf, GPU-Blast and CUDA-MEME share one GPU through rCUDA while two GPUs remain idle; after load balancing, the three applications run on separate GPUs.]
[Chart: options per second vs number of GPUs (1-14) for CUDA and rCUDA.]
rCUDA detaches GPUs from the nodes where they are installed, thus creating a pool of GPUs that can be used from any node of the cluster. Therefore, rCUDA allows non-MPI applications to use all the GPUs in the cluster from a single node.
With rCUDA, the number of GPUs that can be assigned to a single non-MPI application is not limited to the GPUs physically installed in the node running the application.
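As a hedged sketch of this aggregation (hostnames are hypothetical; the variable names are assumed from the rCUDA user's guide), a single node could be given GPUs taken from two different servers:

```shell
# Expose four remote GPUs, taken from two servers, as the local
# devices 0-3 of this node (hostnames are examples):
export RCUDA_DEVICE_COUNT=4
export RCUDA_DEVICE_0=node-a:0
export RCUDA_DEVICE_1=node-a:1
export RCUDA_DEVICE_2=node-b:0
export RCUDA_DEVICE_3=node-b:1

# The application now sees four GPUs (e.g. through cudaGetDeviceCount),
# even if this node has no GPU installed:
#   ./MonteCarloMultiGPU
```

Extending the list to every GPU in the cluster is what lets a single node drive all of them at once.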
Access all the GPUs in the cluster from a single node
64 GPUs!!!
Options per second computed by the MonteCarlo MultiGPU sample from NVIDIA. CUDA leverages up to the 4 GPUs available in one node. rCUDA makes use of up to the 14 K20 GPUs available in the cluster.
Use all the GPUs in your cluster to boost rendering
rCUDA supports renderers such as Blender or Octane. These renderers typically split the scene to render into smaller parts and then compute each part independently of the others. When several GPUs are provided to the renderer, it computes the different parts of the scene in parallel on the available GPUs, thus reducing rendering time.
[Chart: rendering time (minutes) vs number of GPUs (1, 2, 4, 8, 12 and 16).]
TensorFlow gets accelerated by using many GPUs
Deep learning frameworks benefit from having access to many GPUs in order to reduce the time required to train a neural network. TensorFlow, as well as other deep learning frameworks such as Caffe, can make use of multiple GPUs. These frameworks can benefit from rCUDA by making use of several remote GPUs.
Work funded by Generalitat Valenciana under Grant PROMETEO/2017/077. The rCUDA Team is grateful for the support provided by Mellanox Technologies.
The need for integrating rCUDA and SLURM
SLURM is not able to deal with the virtual GPUs provided by rCUDA. Thus, when a job submitted to the SLURM queues includes one or more GPUs per node among its requirements, SLURM can only assign that job to nodes owning the requested number of GPUs, neglecting the benefits of remote GPU virtualization. By integrating rCUDA with SLURM, applications are able to use GPUs independently of the exact nodes where the applications, on the one hand, and the GPUs, on the other, are physically located.
rCUDA allows remote GPUs to be shared among several jobs. Thus, integrating rCUDA and SLURM increases cluster throughput as well as GPU utilization while reducing the energy consumed.
The integration of rCUDA with SLURM is currently being developed. Moreover, a GPU scheduler is being developed in order to manage virtual GPUs in scenarios where SLURM is not required, such as the cloud computing domain.
Increased performance with rCUDA+Slurm
Some initial performance measurements have been carried out in a small 16-node cluster. Each node includes two Intel processors and one FDR InfiniBand adapter as well as one NVIDIA Tesla K20 GPU. A workload composed of 400 jobs randomly selected from 8 applications has been used. These applications are GPU-Blast, LIBSVM, Gromacs, LAMMPS, mCUDA-MEME, NAMD, BarraCUDA, and MUMmer.
[Charts: energy (kWh), execution time (s) and average GPU utilization for the workload scheduled with CUDA and with rCUDA.]
Charts compare the results when the 400 jobs are run using SLURM+CUDA in the usual way with those obtained when they are executed with SLURM+rCUDA. More details can be found in the paper by S. Iserte et al. titled "Increasing the Performance of Data Centers by Combining Remote GPU Virtualization with Slurm", CCGRID 2016.
Cheaper cluster upgrades with rCUDA
[Charts: execution time (s), energy (kWh) and average GPU utilization for CUDA vs rCUDA when GPU nodes are attached to a non-GPU cluster.]
Thanks to rCUDA, clusters not including GPUs can be upgraded by attaching to them one or more computers containing GPUs. As shown in the plots, when four nodes with GPUs are attached to a cluster with 16 non-GPU nodes, the overall performance with rCUDA is much higher than with CUDA, because all the nodes in the original cluster can make concurrent use of the GPUs in the new boxes.
Combining rCUDA with SLURM provides large benefits in overall cluster throughput as well as important reductions in energy consumption.
[Chart: GPU utilization (%) after migration; only nodes 1, 3, 7, 9, 12, 13 and 14 remain on, while the remaining nodes are switched off.]
[Diagrams: on the left, a node with fewer GPUs serving several VMs through rCUDA; on the right, a GPU node shared among several nodes hosting VMs.]
Nov 2018