Transcript of "Increasing the efficiency of your GPU-enabled cluster with rCUDA", a talk given at the HPC Advisory Council European Conference 2014, Leipzig (87 slides).

Page 1:

Increasing the efficiency of your GPU-enabled cluster with rCUDA

Federico Silla
Technical University of Valencia, Spain

Page 2:

Outline

Why remote GPU virtualization?

How does rCUDA work?

The performance of the rCUDA framework

Scheduling virtual GPUs with SLURM

rCUDA and KVM virtual machines

Low-power processors and rCUDA

Page 3:

Outline (same as Page 2)

Page 4:

Current computing needs

• Many applications require a lot of computing resources
• As a consequence, execution times become longer and longer
• Applications are therefore accelerated to reduce their execution time
• GPU computing has experienced remarkable growth in recent years

Page 5:

GPUs reduce energy and time

[Bar charts comparing runs on GPU vs. on CPU, showing 2.36x and 3.56x improvements.]

blastp -db sorted_env_nr -query SequenceLength_00001300.txt -num_threads X -gpu [t|f]

Dual-socket Intel Xeon E5-2620 v2 node with an NVIDIA K20 GPU

GPU-Blast: accelerated version of NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool

Page 6:

Current GPU computing facilities

The basic building block is a node with one or more GPUs

Page 7:

Current GPU computing facilities

From the programming point of view, the facility is a set of nodes, each one with:

• one or more CPUs (with several cores per CPU)
• one or more GPUs (typically between 1 and 4)

plus an interconnection network.

Page 8:

Current GPU computing facilities

A GPU computing facility is usually a set of independent, self-contained nodes that leverage the shared-nothing approach:

• Nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster)
• GPUs can only be used within the node they are attached to

[Diagram: several nodes, each with a CPU, main memory, a network interface, and PCIe-attached GPUs with their own GPU memory, joined by an interconnection network.]

Page 9:

Money leakage in current clusters?

For many workloads, GPUs may be idle for long periods of time:

• Initial acquisition costs are not amortized
• Space: GPUs reduce CPU density
• Energy: idle GPUs keep consuming power

[Plot: power (Watts) over time (s) for a 1-GPU node and a 4-GPU node; a 25% marker highlights the idle power draw.]

• 1-GPU node: 2 E5-2620 v2 sockets, 32 GB DDR3 RAM, one Tesla K20 GPU
• 4-GPU node: 2 E5-2620 v2 sockets, 128 GB DDR3 RAM, four Tesla K20 GPUs

Page 10:

Further concerns in accelerated clusters

• Applications can only use the GPUs located within their node
• Non-accelerated applications keep GPUs idle in the nodes when they use all the cores
• Multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if those GPUs are idle)

Page 11:

We need something else in the cluster

What is missing is some flexibility for using the GPUs in the cluster.

Page 12:

How to address current concerns: increasing flexibility

A way of addressing the "idle" concern is to share the GPUs present in the cluster among all the nodes; once GPUs are shared, their amount can be reduced.

This would increase GPU utilization and lower power consumption, while at the same time reducing initial acquisition costs.

Page 13:

What is needed for increased flexibility?

This new cluster configuration requires:

• A way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization)
• Enhanced job schedulers that take the new virtual GPUs into account

Page 14:

Remote GPU virtualization envision

Remote GPU virtualization allows a new vision of a GPU deployment, moving from the usual cluster configuration:

[Diagram: the usual configuration — nodes, each with its own PCIe-attached GPUs and GPU memory, joined by an interconnection network.]

... to the following one:

Page 15:

Remote GPU virtualization envision

[Diagram: physical configuration vs. logical configuration. Physically, GPUs remain attached to a few nodes; logically, connections across the interconnection network make every GPU reachable from every node.]

Page 16:

Busy cores are no longer a problem

[Diagram: the same physical vs. logical configuration; with GPUs decoupled from nodes, GPUs are no longer left idle when all the CPU cores of their node are busy.]

Page 17:

Multi-GPU applications get benefit

GPU virtualization is also useful for multi-GPU applications:

• Without GPU virtualization, only the GPUs in the node can be provided to the application
• With GPU virtualization, many GPUs in the cluster can be provided to the application

[Diagram: a multi-GPU job restricted to the GPUs of its own node vs. the same job logically connected to GPUs spread across the cluster.]

Page 18:

Remote GPU virtualization envision

GPU virtualization allows all nodes to access all GPUs:

• Without GPU virtualization: real local GPUs only
• With GPU virtualization: virtualized remote GPUs

[Diagram: each node's view of the available GPUs, without and with GPU virtualization.]

Page 19:

More about reducing energy consumption

One step further: enhancing the scheduling process so that GPU servers are put into low-power sleeping modes as soon as their acceleration features are not required.

[Diagram: cluster nodes annotated with "+ GPU".]

Page 20:

Networked GPU boxes

Going even beyond: consolidating GPUs into dedicated GPU boxes (no CPU power required) and allowing GPU task migration.

TRUE GREEN GPU COMPUTING

Page 21:

GPU task migration

• Box A has 4 GPUs and only one is busy
• Box B has 8 GPUs but only two are busy
• Move the jobs from Box B to Box A and switch off Box B
• Migration should be transparent to applications (decided by the global scheduler)

TRUE GREEN GPU COMPUTING

[Diagram: Box A and Box B with their busy and idle GPUs.]

Page 22:

GPU task migration (same as Page 21)

Page 23:

Problem with remote GPU virtualization

The main drawback of remote GPU virtualization is the increased latency and reduced bandwidth to the remote GPU.

[Plot: "Influence of data transfers for SGEMM" — percentage of time devoted to data transfers vs. matrix size (up to 18000), for pinned and non-pinned memory. Data from a matrix-matrix multiplication using a local GPU!]

Page 24:

Remote GPU virtualization frameworks

Several efforts have been made regarding GPU virtualization during the last years:

• rCUDA (CUDA 6.0)
• GVirtuS (CUDA 3.2)
• DS-CUDA (CUDA 4.1)
• vCUDA (CUDA 1.1)
• GViM (CUDA 1.1)
• GridCUDA (CUDA 2.3)
• V-GPU (CUDA 4.0)

Some of these frameworks are publicly available; the others are NOT publicly available.

Page 25:

Remote GPU virtualization frameworks

Performance comparison of GPU virtualization solutions:

• Intel Xeon E5-2620 (6 cores) at 2.0 GHz (Sandy Bridge architecture)
• NVIDIA Tesla K20 GPU
• Mellanox ConnectX-3 single-port InfiniBand adapter (FDR)
• Mellanox SX6025 switch
• CentOS 6.3 + Mellanox OFED 1.5.3

Latency (measured by transferring 64 bytes):

           Pageable H2D   Pinned H2D   Pageable D2H   Pinned D2H
CUDA            34.3          4.3           16.2          5.2
rCUDA           94.5         23.1          292.2          6.0
GVirtuS        184.2        200.3          168.4        182.8
DS-CUDA         45.9          -             26.5          -

Page 26:

Remote GPU virtualization frameworks

Bandwidth of a copy between GPU and CPU memories:

[Four plots: bandwidth (MB/s) vs. transfer size (MB) for Host-to-Device and Device-to-Host copies, each with pageable and pinned memory.]

Page 27:

Applications tested with rCUDA

rCUDA has been successfully tested with the following applications:

• NVIDIA CUDA SDK Samples
• LAMMPS
• WideLM
• CUDASW++
• OpenFOAM
• HOOMDBlue
• mCUDA-MEME
• GPU-Blast
• Gromacs
• GAMESS
• DL-POLY
• HPL

The list keeps growing.

Page 28:

Outline

Why remote GPU virtualization?

How does rCUDA work?

The performance of the rCUDA framework

Scheduling virtual GPUs with SLURM

rCUDA and KVM virtual machines

Low-power processors and rCUDA

Page 29:

Basics of the rCUDA framework

rCUDA is a framework enabling a CUDA-based application running in one (or some) node(s) to access GPUs in other nodes.

It is useful for:

• Applications that do not make use of GPUs all the time (moderate level of data parallelism)
• Multi-GPU computing

Page 30:

Basics of the rCUDA framework

Basic CUDA behavior (figure-only slide)

Pages 31-33:

Basics of the rCUDA framework (figure-only slides)

Page 34:

How to declare remote GPUs

Environment variables are properly initialized on the client side and used by the rCUDA client (transparently to the application):

• one variable holds the amount of GPUs exposed to applications
• one variable per GPU holds "server name/IP address : GPU"
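For illustration, these are the variables involved, in the format used later on the SLURM submission slide (the server names here are hypothetical):

RCUDA_DEVICE_COUNT=2          # amount of GPUs exposed to applications
RCUDA_DEVICE_0=gpuserver1:0   # server name/IP address : GPU
RCUDA_DEVICE_1=gpuserver2:0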

Page 35:

Basics of the rCUDA framework

rCUDA uses a proprietary communication protocol.

Example:

1) initialization
2) memory allocation on the remote GPU
3) CPU-to-GPU memory transfer of the input data
4) kernel execution
5) GPU-to-CPU memory transfer of the results
6) GPU memory release
7) communication channel closing and server process finalization
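As a minimal CUDA sketch (illustrative code, not from the talk), the seven steps above correspond to this standard client-side sequence, which rCUDA intercepts and forwards to the remote server:

// scale.cu -- minimal CUDA program following the seven steps above
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *v, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= f;
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);                            // 1-2) initialization + allocation on the (remote) GPU
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // 3) CPU-to-GPU transfer of the input data
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);      // 4) kernel execution
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // 5) GPU-to-CPU transfer of the results
    cudaFree(d);                                      // 6) GPU memory release
    free(h);
    return 0;                                         // 7) channel closed when the process finishes
}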

Page 36:

rCUDA presents a modular architecture (figure-only slide)

Page 37:

rCUDA uses optimized transfers

rCUDA features optimized communications:

• Use of GPUDirect RDMA to move data between GPUs
• Pipelined transfers to improve performance
• Preallocated pinned memory buffers
• Optimal pipeline block size
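A simplified sketch of the pipelining idea (illustrative only, not rCUDA's actual implementation): data is staged through two preallocated pinned buffers, so copying one block to the GPU overlaps with staging the next one.

// pipeline.cu -- chunked host-to-device copy through pinned staging buffers
#include <cuda_runtime.h>
#include <string.h>

#define BLOCK ((size_t)1 << 20)   // pipeline block size; the optimal value depends on the network

void pipelined_h2d(float *dst_dev, const float *src_host, size_t bytes) {
    char *pin[2];                  // preallocated pinned staging buffers
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) {
        cudaMallocHost((void **)&pin[i], BLOCK);
        cudaStreamCreate(&s[i]);
    }
    for (size_t off = 0, k = 0; off < bytes; off += BLOCK, k++) {
        int b = (int)(k % 2);
        size_t chunk = (bytes - off < BLOCK) ? (bytes - off) : BLOCK;
        cudaStreamSynchronize(s[b]);                          // wait until buffer b is free again
        memcpy(pin[b], (const char *)src_host + off, chunk);  // stage the block into pinned memory
        cudaMemcpyAsync((char *)dst_dev + off, pin[b], chunk,
                        cudaMemcpyHostToDevice, s[b]);        // overlaps with staging the next block
    }
    for (int i = 0; i < 2; i++) {
        cudaStreamSynchronize(s[i]);
        cudaFreeHost(pin[i]);
        cudaStreamDestroy(s[i]);
    }
}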

Page 38:

Basic performance analysis

Pipeline block size for InfiniBand FDR (it was 2 MB with InfiniBand QDR):

• NVIDIA Tesla K20; Mellanox ConnectX-3 adapter + Mellanox SX6025 switch

[Plot: attained bandwidth for different pipeline block sizes over FDR.]

Page 39:

Basic performance analysis

Bandwidth of a copy between GPU and CPU memories:

[Plots: Host-to-Device bandwidth with pageable memory and with pinned memory.]

Page 40:

Basic performance analysis

Latency study: copy a small dataset (64 bytes).

Page 41:

Outline

Why remote GPU virtualization?

How does rCUDA work?

The performance of the rCUDA framework

Scheduling virtual GPUs with SLURM

rCUDA and KVM virtual machines

Low-power processors and rCUDA

Page 42:

Performance of the rCUDA framework

Test system:

• Intel Xeon E5-2620 (6 cores) at 2.0 GHz (Sandy Bridge architecture)
• NVIDIA Tesla K20 GPU
• Mellanox ConnectX-3 single-port InfiniBand adapter (FDR)
• Mellanox SX6025 switch
• Cisco SLM2014 switch (1 Gbps Ethernet)
• CentOS 6.3 + Mellanox OFED 1.5.3

Page 43:

Single-GPU applications

Test: CUDA SDK examples.

[Bar chart: normalized execution time of a set of CUDA SDK samples under CUDA, rCUDA over FDR InfiniBand, and rCUDA over Gigabit Ethernet.]

Page 44:

Single-GPU applications

CUDASW++: bioinformatics software for Smith-Waterman protein database searches.

[Plot: execution time (s) and rCUDA overhead (%) vs. sequence length (144 to 5478) for CUDA, rCUDA FDR, rCUDA QDR, and rCUDA GbE.]

Page 45:

Single-GPU applications

GPU-Blast: accelerated version of NCBI-BLAST (Basic Local Alignment Search Tool), a widely used bioinformatics tool.

[Plot: execution time (s) and rCUDA overhead (%) vs. sequence length (2 to 1400) for CUDA, rCUDA FDR, rCUDA QDR, and rCUDA GbE.]

Page 46:

Multi-GPU applications

Test system for multi-GPU applications:

For CUDA tests, one node with:
• 2 quad-core Intel Xeon E5440 processors
• Tesla S2050 (4 Tesla GPUs)
• Each thread (1-4) uses one GPU

For rCUDA tests, 8 nodes with:
• 2 quad-core Intel Xeon E5520 processors
• 1 NVIDIA Tesla C2050
• InfiniBand QDR
• The test runs in one node and uses up to all the GPUs of the others
• Each thread (1-8) uses one GPU

Page 47:

Multi-GPU applications

MonteCarlo MultiGPU (from the NVIDIA SDK) (figure-only slide)

Page 48:

Outline

Why remote GPU virtualization?

How does rCUDA work?

The performance of the rCUDA framework

Scheduling virtual GPUs with SLURM

rCUDA and KVM virtual machines

Low-power processors and rCUDA

Page 49:

Integrating rCUDA with SLURM

• SLURM (Simple Linux Utility for Resource Management) is a job scheduler
• SLURM does not understand virtualized GPUs
• A new GRES (generic resource) is added in order to manage virtualized GPUs
• Where the GPUs are in the system is completely transparent to the user
• In the job script, or in the submission command, the user specifies the number of rGPUs (remote GPUs) required by the job; the amount of GPU memory required by the job may also be specified (see the sketch below)
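As an illustrative sketch (hypothetical script and binary name; the --gres syntax follows the srun example shown later), a batch submission could look like:

#!/bin/bash
#SBATCH --gres=rgpu:2:512M
# requests 2 remote GPUs and 512M of GPU memory; the rCUDA client
# redirects the unmodified CUDA binary to the remote GPUs published
# through the RCUDA_DEVICE_* environment variables set by SLURM
./my_cuda_application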

Page 50:

The basic idea about SLURM

Page 51:

The basic idea about SLURM + rCUDA

GPUs are decoupled from nodes. All jobs are executed in less time.

Page 52:

Sharing remote GPUs among jobs

GPUs are decoupled from nodes, and GPU 0 is scheduled to be shared among jobs. All jobs are executed in even less time.

Page 53:

SLURM configuration

slurm.conf:

ClusterName=rcu
GresTypes=gpu,rgpu
NodeName=rcu16 NodeAddr=rcu16 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7990 Gres=rgpu:4,gpu:4

Page 54:

SLURM configuration

scontrol show node:

NodeName=rcu16 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null)
Gres=rgpu:4,gpu:4
NodeAddr=rcu16 NodeHostName=rcu16
OS=Linux RealMemory=7682 Sockets=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
BootTime=2013-04-24T18:45:35
SlurmdStartTime=2013-04-30T10:02:04

Page 55:

SLURM configuration

gres.conf:

Name=rgpu File=/dev/nvidia0 Cuda=2.1 Mem=1535m
Name=rgpu File=/dev/nvidia1 Cuda=3.5 Mem=1535m
Name=rgpu File=/dev/nvidia2 Cuda=1.2 Mem=1535m
Name=rgpu File=/dev/nvidia3 Cuda=1.2 Mem=1535m
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

Page 56:

Submitting jobs with SLURM

Submit a job:

$ srun -N1 --gres=rgpu:4:512M script.sh ...

Environment variables are initialized by SLURM and used by the rCUDA client (transparently to the user), with the format "server name/IP address : GPU":

RCUDA_DEVICE_COUNT=4
RCUDA_DEVICE_0=rcu16:0
RCUDA_DEVICE_1=rcu16:1
RCUDA_DEVICE_2=rcu16:2
RCUDA_DEVICE_3=rcu16:3

Page 57:

Integrating rCUDA with SLURM

Test bench for checking SLURM correctness: an old "heterogeneous" cluster with 28 nodes:

• 24 nodes without GPU: 2 AMD Opteron 1.4 GHz, 2 GB RAM
• 1 node without GPU: quad-core i7 3.5 GHz, 8 GB RAM
• 3 nodes with GeForce GTX 670 or 590 GPUs (2 GB): quad-core i7 3 GHz
• 1 Gbps Ethernet interconnect
• Scientific Linux release 6.1 and also openSUSE 11.2 (dual boot)

Page 58:

Integrating rCUDA with SLURM

Test bench for analyzing SLURM+rCUDA performance: an InfiniBand ConnectX-3 based cluster running CentOS 6.4 Linux, built from dual-socket Intel Xeon E5-2620 v2 nodes:

• 3 nodes without GPU
• 1 node with an NVIDIA K40 GPU
• 10 nodes with an NVIDIA K20 GPU
• 1 node with 4 NVIDIA K20 GPUs

The first tests, already carried out (more complex tests are on the way), used:

• 1 node hosting the main SLURM controller
• 1 node without GPU
• 1 node with the K20 GPU
• 1 node with the K40 GPU

Page 59:

GPUBlast using a K40 GPU remotely

[Plot: GPU-Blast execution under nine configurations — Theoretical, Exclusive, and Shared access, each for CUDA, rCUDA over Ethernet, and rCUDA over InfiniBand.]

Page 60:

Outline

Why remote GPU virtualization?

How does rCUDA work?

The performance of the rCUDA framework

Scheduling virtual GPUs with SLURM

rCUDA and KVM virtual machines

Low-power processors and rCUDA

Page 61:

Why rCUDA with virtual machines?

• Current clusters frequently leverage virtual machines in order to attain energy and cost reductions
• Xen and KVM virtual machines are commonly leveraged
• Applications being executed inside virtual machines need access to GPU computing resources
• Different Xen and KVM virtual machines can share an InfiniBand network card thanks to the virtualization features of the InfiniBand driver
• NVIDIA drivers do not provide such virtualization features, thus preventing the concurrent usage of the GPU from several virtual machines
• rCUDA can be used to provide concurrent access to GPUs

Page 62:

How to connect an IB card to a VM

• PCI pass-through is used to assign an IB card (either real or virtual) to a given VM
• The IB card manages the several virtual copies (virtual functions, VFs)

[Diagram: a real IB card and its virtual IB cards (VFs).]
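As an illustrative sketch (the talk does not detail the mechanism; the PCI address and disk image are hypothetical), assigning such a virtual function to a KVM guest with QEMU's VFIO pass-through could look like:

# hand the IB virtual function at PCI address 0000:04:00.1 to the guest
qemu-system-x86_64 -enable-kvm -m 4096 \
    -device vfio-pci,host=0000:04:00.1 \
    -drive file=guest.img,format=raw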

Page 63:

Performance analysis on KVM VMs

Setup: a computer hosting KVM VMs and a remote computer. Legend:

• RealX3: actual ConnectX-3 (1 port) card
• VF: virtual copy used from the host OS
• VM: virtual card copy used from a VM

[Plots: RDMA write bandwidth (MB/s), normalized bandwidth, and normalized latency vs. transfer size (bytes) for the three cases.]

Page 64:

CUDA BWtest from a KVM VM

[Setup: a computer hosting KVM VMs connected to a remote computer with a Tesla K20.]

Page 65:

Outline

Why remote GPU virtualization?

How does rCUDA work?

The performance of the rCUDA framework

Scheduling virtual GPUs with SLURM

rCUDA and KVM virtual machines

Low-power processors and rCUDA

Page 66:

rCUDA and low-power processors

• There is a clear trend to improve the performance-power ratio in HPC facilities:
  • by leveraging accelerators (Tesla K20 by NVIDIA, FirePro by AMD, Xeon Phi by Intel, ...)
  • more recently, by using low-power processors (Intel Atom, ARM, ...)
• rCUDA allows attaching a "pool of accelerators" to a computing facility:
  • fewer accelerators than nodes: accelerators are shared among nodes
  • the performance-power ratio is probably improved further
• How interesting is a heterogeneous platform leveraging powerful processors, low-power ones, and remote GPUs?
  • Which is the best configuration?
  • Is energy effectively saved?

Page 67:

rCUDA and low-power processors

The computing platforms analyzed in this study are:

• KAYLA: NVIDIA Tegra 3 ARM Cortex-A9 quad-core CPU (1.4 GHz), 2 GB DDR3 RAM, and Intel 82574L Gigabit Ethernet controller
• ATOM: Intel Atom S1260 quad-core CPU (2.0 GHz), 8 GB DDR3 RAM, and Intel I350 Gigabit Ethernet controller; no PCIe connector for a GPU
• XEON: Intel Xeon X3440 quad-core CPU (2.4 GHz), 8 GB DDR3 RAM, and Intel 82574L Gigabit Ethernet controller

The accelerators analyzed are:

• CARMA: NVIDIA Quadro 1000M GPU (96 cores) with 2 GB DDR5 RAM; the rest of the system is the same as for KAYLA
• FERMI: NVIDIA GeForce GTX 480 "Fermi" GPU (448 cores) with 1,280 MB GDDR5 RAM

[Photo: the CARMA platform.]

Page 68:

rCUDA and low-power processors

• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)

Page 69:

rCUDA and low-power processors

• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)

• The XEON+FERMI combination achieves the best performance

Page 70:

rCUDA and low-power processors

• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)

• The XEON+FERMI combination achieves the best performance

• LAMMPS makes a more intensive use of the GPU than CUDASW++

Page 71:

rCUDA and low-power processors

• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)

• The XEON+FERMI combination achieves the best performance

• LAMMPS makes a more intensive use of the GPU than CUDASW++

• The lower performance of the PCIe bus in Kayla reduces performance

Page 72:

rCUDA and low-power processors

• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)

• The XEON+FERMI combination achieves the best performance

• LAMMPS makes a more intensive use of the GPU than CUDASW++

• The lower performance of the PCIe bus in Kayla reduces performance

• The 96 cores of CARMA provide a noticeably lower performance than a more powerful GPU

Page 73:

rCUDA and low-power processors

• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)

CARMA requires less power.

Page 74:

rCUDA and low-power processors

• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)

When execution time and required power are combined into energy, XEON+FERMI is more efficient.

Page 75:

rCUDA and low-power processors

• 1st analysis: using local GPUs; CUDASW++ (top) LAMMPS (bottom)

Summing up:

• If execution time is a concern, then use XEON+FERMI
• If power is to be minimized, CARMA is a good choice
• The lower performance of KAYLA's PCIe decreases the interest of this option for single-system local-GPU usage
• If energy must be minimized, XEON+FERMI is the right choice

Page 76:

rCUDA and low-power processors

2nd analysis: use of remote GPUs.

• The Quadro GPU within the CARMA system is discarded as an rCUDA server due to its poor performance; only the FERMI GPU is considered
• Six different combinations of rCUDA clients and servers are feasible
• Only one client system and one server system are used in each experiment

Combination   Client   Server
1             KAYLA    KAYLA+FERMI
2             ATOM     KAYLA+FERMI
3             XEON     KAYLA+FERMI
4             KAYLA    XEON+FERMI
5             ATOM     XEON+FERMI
6             XEON     XEON+FERMI

Page 77:

rCUDA and low-power processors

2nd analysis: use of remote GPUs — CUDASW++.

[Plots: execution time, power, and energy for each client (KAYLA, ATOM, XEON) against each server (KAYLA+FERMI, XEON+FERMI).]

Page 78:

rCUDA and low-power processors

2nd analysis: use of remote GPUs — CUDASW++.

The KAYLA-based server presents much higher execution time.

[Same plots as the previous slide.]

Page 79:

rCUDA and low-power processors

2nd analysis: use of remote GPUs — CUDASW++.

The KAYLA-based server presents much higher execution time: the network interface for KAYLA delivers much less than the expected 1 Gbps bandwidth.

[Same plots as the previous slide.]

Page 80:

rCUDA and low-power processors

2nd analysis: use of remote GPUs — CUDASW++.

The lower bandwidth of KAYLA (and ATOM) can also be noticed when using the XEON server.

[Same plots as the previous slide.]

Page 81:

rCUDA and low-power processors

2nd analysis: use of remote GPUs — CUDASW++.

Most of the power and energy is required by the server side, as it hosts the GPU.

[Same plots as the previous slide.]

Page 82:

rCUDA and low-power processors

2nd analysis: use of remote GPUs — CUDASW++.

The much shorter execution time of XEON-based servers renders more energy-efficient systems.

[Same plots as the previous slide.]

Page 83:

rCUDA and low-power processors

2nd analysis: use of remote GPUs — LAMMPS.

Combination   Client   Server
1             KAYLA    KAYLA+FERMI
2             KAYLA    XEON+FERMI
3             ATOM     XEON+FERMI
4             XEON     XEON+FERMI

[Plots: execution time, power, and energy for the four combinations.]

Page 84:

rCUDA and low-power processors

2nd analysis: use of remote GPUs — LAMMPS.

Similar conclusions to CUDASW++ apply to LAMMPS, due to the network bottleneck.

[Same plots as the previous slide.]

Page 85:

rCUDA and low-power processors

• Atom-based systems with PCIe 2.0 x8 may hold an InfiniBand card
• InfiniBand QDR adapters would noticeably increase network bandwidth

Page 86:

Summary about rCUDA

rCUDA is the enabling technology for:

• High Throughput Computing
  • Using remote GPUs does not make applications execute faster
  • Sharing remote GPUs makes applications execute slower
  • BUT more throughput (jobs/time) is achieved
  • Datacenter administrators can choose between HPC and HTC
• Green Computing
  • GPU migration and application migration allow devoting just the required computing resources to the current load
• More flexible system upgrades
  • GPU and CPU updates become independent from each other; adding GPU boxes to non-GPU-enabled clusters is possible

Page 87:

You can get a free copy of rCUDA at http://www.rcuda.net