rCUDA is multiplatform
rCUDA is available for the x86, ARM and Power processor architectures. Furthermore, rCUDA clients and rCUDA servers running on different processor architectures can interact with each other. This opens up several options. For instance:
a) You can run your applications as always in x86-based systems while accelerating them with the extraordinary performance of multi-GPU IBM Power systems leveraging the NVLink technology.
b) You can run your applications in low-power ARM-based systems while accelerating the compute-intensive parts of your applications in remote GPUs located either in x86-based systems or in IBM Power nodes. The following charts are an example of this possibility.
rCUDA is currently available for Linux
rCUDA will be available for Windows in the future
CPU vs GPU performance of the CloverLeaf application. CPU: up to 64 ThunderX ARM nodes (estimation based on performance in one node). GPU: single ThunderX ARM node using 3 GPUs located in a remote x86 server.
CPU vs GPU performance of the Flow application. CPU: up to 64 ThunderX ARM nodes (estimation based on performance in one node). GPU: single ThunderX ARM node using 3 GPUs located in a remote x86 server.
The rCUDA middleware will also be available for the Windows operating system as soon as funding availability allows devoting resources to that new development.
Make your GPUs even better with rCUDA!
How rCUDA works
The rCUDA middleware leverages a client-server architecture. The rCUDA client is a set of libraries installed in the cluster node executing the CUDA application, where it replaces NVIDIA's libraries. The server side of rCUDA runs on the computers owning GPUs. In order to use rCUDA, applications do not need to be modified; they simply need to be linked against the rCUDA libraries instead of the CUDA ones.
The rCUDA client provides applications with the illusion of dealing with local GPUs. Whenever an application accesses the virtual GPU, the rCUDA client forwards the request to the server owning the actual GPU. The rCUDA server receives those requests and forwards them to the real GPU.
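A minimal sketch of this deployment, assuming the client environment-variable names from the rCUDA user's guide (RCUDA_DEVICE_COUNT and RCUDA_DEVICE_n); the installation path and hostname below are hypothetical examples:

```shell
# On the server node that owns the GPU, start the rCUDA daemon
# (example path; use your installation directory):
#   /opt/rCUDA/bin/rCUDAd

# On the client node running the CUDA application, describe the
# remote GPU to use (hostname below is a hypothetical example):
export RCUDA_DEVICE_COUNT=1            # number of remote GPUs to expose
export RCUDA_DEVICE_0=gpu-server-01:0  # <server>:<gpu index> for virtual device 0

# Make the application load the rCUDA libraries instead of NVIDIA's:
export LD_LIBRARY_PATH=/opt/rCUDA/lib:$LD_LIBRARY_PATH

# Launch the unmodified CUDA application as usual:
#   ./my_cuda_app
```

Since the application is only relinked against the rCUDA libraries, the same binary runs unchanged whether the GPU is local or remote.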
[Diagram: rCUDA client-server architecture. On the client side, the CUDA application calls the CUDA API; the rCUDA client engine (replacing the CUDA libraries) forwards the requests over the network to the rCUDA server engine, which drives the GPU on the server side. Hardware and software layers are shown for both sides.]
CUDA is limited to using GPUs that are physically installed in the node where the application is executed. With rCUDA, applications are not limited to local GPUs: they can use any GPU in the cluster.
[Diagram: nodes without GPUs running CUDA applications access GPUs in other nodes through rCUDA (remote CUDA).]
Make better use of the GPUs in your cluster
Make your GPUs flexible! ... detach ... share ... aggregate
Save acquisition and energy costs
High energy savings by sharing the GPUs installed in the cluster
Use all the GPUs installed in your cluster from a single node
Seamless use of GPUs installed in remote computers
The rCUDA middleware virtualizes GPUs located in different nodes of the cluster, creating a pool of GPUs that can be accessed from any cluster node. Additionally, rCUDA enables concurrent access to them. This is done transparently to programmers and applications. Furthermore, application source code does not need to be modified. In this way, rCUDA provides applications with the illusion of dealing with real local GPUs.
The rCUDA team thanks Mellanox for its support on promoting the rCUDA technology
The integration of rCUDA with the SLURM job scheduler creates the perfect suite for introducing remote GPU virtualization into your cluster, making it possible to effectively reduce the number of GPUs requested without significantly reducing overall performance. Overall energy consumption is noticeably reduced.
rCUDA is a development by:
Technical University of Valencia
Traditional cluster configuration
Cluster configuration with rCUDA
· Support for CUDA 9.1
· Binary compatibility
· Thread-safe
· Linux compatible
· TCP/IP communications
· Specific InfiniBand support
· Specific support for RoCE
· GPUDirect RDMA enabled
· Multi-GPU server enabled
· Supports multi-GPU applications
Main rCUDA features
rCUDA is binary compatible with CUDA. Just install rCUDA in your cluster and keep using your applications as usual.
Application performance not changed
Applications using a remote GPU with EDR InfiniBand perform similarly to when executed on local GPUs. More information can be found in the paper: C. Reaño et al. "Local and Remote GPUs Perform Similar with EDR 100G InfiniBand". ACM Middleware Conference 2015.
[Chart: execution time of BarraCUDA, CUDA-MEME, MAGMA, GPU-BLAST, LAMMPS, GPU-LIBSVM and CUDASW++ with CUDA and rCUDA, normalized to the CUDA case (scale 0.0-1.0).]
Amazing performance when moving data to/from remote GPUs
[Charts, CUDA and rCUDA cases: bandwidth (MB/s) vs copy size (MB) for K20 with FDR InfiniBand, K40 with EDR InfiniBand and P100 with EDR InfiniBand.]
[Charts: execution time (s) vs number of nodes (up to 64) for the CloverLeaf and Flow runs described in the captions above; CPU series on ARM nodes, GPU series on a single ARM node using 3 remote P100 GPUs.]
10 years virtualizing GPUs
[Chart: bandwidth (MB/s) vs copy size (MB) for K20 (CUDA vs FDR rCUDA), K40 (CUDA vs EDR rCUDA) and P100 (CUDA vs EDR rCUDA).]
rCUDA for Cloud
Using GPUs in cloud computing is increasingly becoming mainstream. CUDA applications running inside a virtual machine (VM) can be accelerated by using the GPUs in the host. However, assigning GPUs to VMs presents an important burden when using the inexpensive PCI passthrough mechanism: efficiently sharing the GPUs among the VMs.
VMs can access GPUs in the host by means of the PCI passthrough technique. This mechanism assigns a PCI device to a VM in an exclusive way. That is, once a GPU is assigned to a VM, it cannot be assigned to any other VM until the device is detached from the former one. This is an important limitation. The figure on the left shows a server with four GPUs, each of them assigned to one VM.
How can a VM access CUDA GPUs?
[Diagram: four GPUs, each exclusively assigned to one VM through PCI passthrough.]
rCUDA is designed following the client-server distributed approach. Therefore, rCUDA clients can be placed inside VMs whereas the rCUDA server is placed in the native domain owning the GPU. Using the PCI passthrough mechanism is not required. As a consequence, GPUs can be concurrently shared by all the VMs. The figure on the right shows a 4-GPU cluster node with the rCUDA middleware serving four VMs.
How does rCUDA avoid PCI passthrough?
[Diagram: a 4-GPU cluster node running the rCUDA server, concurrently serving four VMs that run the rCUDA client.]
Using rCUDA for providing GPUs to VMs allows for an increased level of flexibility. For instance, the figure on the left shows that the number of GPUs installed in the node can be reduced. On the other hand, the figure on the right shows how a node providing GPU acceleration can be shared among several nodes hosting VMs. The sharing degree of the GPUs is limited by their physical memory. A scheduler is provided with rCUDA in order to control GPU memory occupancy.
[Chart: execution time (seconds) of CUDA-MEME, CUDASW++, GPU-BLAST and LAMMPS under PCI passthrough with 4 VMs and under rCUDA with 4, 8 and 12 VMs.]
[Chart: completed jobs for CUDA-MEME, CUDASW++, GPU-BLAST and LAMMPS under PCI passthrough with 4 VMs and under rCUDA with 4, 8 and 12 VMs.]
What performance can be expected?
Four VMs were created in a node with four GPUs. Each of the VMs was assigned one GPU and executed a series of randomly selected applications for one hour. Results from this test are shown in the charts as "PCI passthrough 4 VMs". Moreover, rCUDA was used in order to share the four GPUs among up to twelve VMs. Results labeled as "rCUDA 4 VMs", "rCUDA 8 VMs" and "rCUDA 12 VMs" refer to these experiments. More information about this experiment can be found in the paper by J. Prades and F. Silla titled "Made-to-Measure GPUs on Virtual Machines with rCUDA" published at NextGenClouds 2018.
GPU migration with rCUDA
With rCUDA it is possible to change, at execution time, the GPU assigned to an application. That is, the application is initially provided one or more GPUs but, during its execution, the GPU part of the application is transparently migrated to another GPU (or GPUs) elsewhere in the cluster. The figure on the right shows the migration time for five different CUDA applications when using FDR InfiniBand, EDR InfiniBand and 1 Gbps Ethernet. It can be seen that migration time is very small when InfiniBand is used.
[Chart: GPU utilization (%) of cluster nodes 1-14 before migration.]
Resource utilization in data centers evolves over time. Thus, at some point in time, the utilization of the GPUs in the cluster may be similar to that depicted in the figure on the left ("Before migration"). It can be seen that GPU utilization is uneven. In this scenario it would be useful to consolidate the GPU part of the applications into a smaller number of servers, so that the nodes that become free can be switched off. The figure on the right ("After migration") depicts this consolidation of GPU jobs: jobs generating a lower GPU utilization, such as the ones in nodes 2, 4, 5, 6, 8, 10 and 11, have been migrated to other nodes. After migration, the nodes sourcing the movement of GPU jobs have been switched off.
Using GPU migration for server consolidation
[Chart: migration times of 3.83, 27.52 and 22.25 seconds shown as data labels.]
Using GPU migration for load balancing
When several applications share a given GPU in the cluster, rCUDA allows migrating each of the GPU jobs of these applications to a different destination GPU. That is, it is not required that GPUs are migrated as a whole. This individual migration of GPU jobs allows load to be balanced across the GPUs in the cluster.
The figure below on the left shows three applications sharing a GPU with rCUDA while two other GPUs remain idle. The job scheduler can detect this uneven GPU utilization and can balance load by migrating some of the applications to other GPUs.
[Diagrams: before load balancing, CloverLeaf, GPU-Blast and CUDA-MEME share one GPU through rCUDA while two GPUs remain idle; after load balancing, the three applications run on separate GPUs.]
[Chart: options per second vs number of GPUs (1-14) for CUDA and rCUDA.]
rCUDA detaches GPUs from the nodes where they are installed, thus creating a pool of GPUs that can be used from any node of the cluster. Therefore, rCUDA allows non-MPI applications to use all the GPUs in the cluster from a single node.
With rCUDA, the number of GPUs that can be assigned to a single non-MPI application is not limited to the GPUs physically installed in the node running the application.
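As a hedged sketch of this aggregation (hostnames are hypothetical; the variable names are assumed from the rCUDA user's guide), a single node could be given GPUs taken from two different servers:

```shell
# Expose four remote GPUs, taken from two servers, as the local
# devices 0-3 of this node (hostnames are examples):
export RCUDA_DEVICE_COUNT=4
export RCUDA_DEVICE_0=node-a:0
export RCUDA_DEVICE_1=node-a:1
export RCUDA_DEVICE_2=node-b:0
export RCUDA_DEVICE_3=node-b:1

# The application now sees four GPUs (e.g. through cudaGetDeviceCount),
# even if this node has no GPU installed:
#   ./MonteCarloMultiGPU
```

Extending the list to every GPU in the cluster is what lets a single node drive all of them at once.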
Access all the GPUs in the cluster from a single node
64 GPUs!!!
Options per second computed by the MonteCarlo MultiGPU sample from NVIDIA. CUDA leverages up to the 4 GPUs available in one node. rCUDA makes use of up to the 14 K20 GPUs available in the cluster.
Use all the GPUs in your cluster to boost rendering
rCUDA supports renderers such as Blender or Octane. These renderers typically split the scene to render into smaller parts and then compute each part independently of the others. When several GPUs are provided to the renderer, it computes the different parts of the scene in parallel on the available GPUs, thus reducing rendering time.
[Chart: rendering time (minutes) vs number of GPUs (1, 2, 4, 8, 12 and 16).]
TensorFlow gets accelerated by using many GPUs
Deep learning frameworks benefit from having access to many GPUs in order to reduce the time required to train a neural network. TensorFlow, as well as other deep learning frameworks such as Caffe, can make use of multiple GPUs. These frameworks can benefit from rCUDA by making use of several remote GPUs.
Work funded by Generalitat Valenciana under Grant PROMETEO/2017/077. The rCUDA Team is grateful for the support provided by Mellanox Technologies.
The need for integrating rCUDA and SLURM
SLURM is not able to deal with the virtual GPUs provided by rCUDA. Thus, when a job submitted to the SLURM queues includes one or more GPUs per node among its requirements, SLURM can only assign that job to nodes owning the requested number of GPUs, neglecting the benefits of remote GPU virtualization. By integrating rCUDA with SLURM, applications are able to use GPUs independently of the exact nodes where the applications, on the one hand, and the GPUs, on the other, are physically located.
rCUDA allows remote GPUs to be shared among several jobs. Thus, integrating rCUDA and SLURM increases cluster throughput as well as GPU utilization while reducing the energy consumed.
The integration of rCUDA with SLURM is currently being developed. Moreover, a GPU scheduler is being developed in order to manage virtual GPUs in scenarios where SLURM is not required, such as the cloud computing domain.
Increased performance with rCUDA+Slurm
Some initial performance measurements have been carried out in a small 16-node cluster. Each node includes two Intel processors and one FDR InfiniBand adapter as well as one NVIDIA Tesla K20 GPU. A workload composed of 400 jobs randomly selected from 8 applications has been used. These applications are GPU-Blast, LIBSVM, Gromacs, LAMMPS, mCUDA-MEME, NAMD, BarraCUDA, and MUMmer.
[Charts: energy (kWh), execution time (s) and average GPU utilization for the workload scheduled with CUDA and with rCUDA.]
Charts compare the results when the 400 jobs are run using SLURM+CUDA in the usual way with those obtained when they are executed with SLURM+rCUDA. More details can be found in the paper by S. Iserte et al. titled "Increasing the Performance of Data Centers by Combining Remote GPU Virtualization with Slurm", CCGRID 2016.
Cheaper cluster upgrades with rCUDA
[Charts: execution time (s), energy (kWh) and average GPU utilization for CUDA vs rCUDA when GPU nodes are attached to a non-GPU cluster.]
Thanks to rCUDA, clusters not including GPUs can be upgraded by attaching to them one or more computers containing GPUs. As shown in the plots, when four nodes with GPUs are attached to a cluster with 16 non-GPU nodes, the overall performance with rCUDA is much higher than with CUDA, because all the nodes in the original cluster can make concurrent use of the GPUs in the new boxes.
Combining rCUDA with SLURM provides large benefits in overall cluster throughput as well as important reductions in energy consumption.
[Chart: GPU utilization (%) after migration; only nodes 1, 3, 7, 9, 12, 13 and 14 remain on, while the remaining nodes are switched off.]
[Diagrams: on the left, a node with fewer GPUs serving several VMs through rCUDA; on the right, a GPU node shared among several nodes hosting VMs.]
Nov 2018