MULTI-GPU TRAINING WITH NCCL
Sylvain Jeaugey

Transcript of MULTI-GPU TRAINING WITH NCCL (on-demand.gputechconf.com/gtc/2018/presentation/s...)

Page 1

Sylvain Jeaugey

MULTI-GPU TRAINING WITH NCCL

Page 2

MULTI-GPU COMPUTING: Harvesting the power of multiple GPUs

1 GPU → Multiple GPUs per system → Multiple systems connected

NCCL : NVIDIA Collective Communication Library

Page 3

MULTI-GPU DL TRAINING: Single-GPU

[Diagram: single-GPU training loop. A database holding GBs of input data (images, sound, …) feeds batches (e.g. 256 images) into Forward/Backward passes; the resulting gradients are used to update the parameters.]

Page 4

MULTI-GPU DL TRAINING: Data parallel

[Diagram: data-parallel training. Each GPU holds a copy of the parameters and runs Forward/Backward on its own batch to produce local gradients.]

NCCL Allreduce : Sum gradients across GPUs
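
As a concrete illustration of this step, here is a minimal sketch (not taken from the slides) of summing one gradient buffer across all GPUs driven by a single process; the names allreduce_gradients, grads, nelems and ngpus are illustrative, and error checking is omitted.

#include <cuda_runtime.h>
#include <nccl.h>

// Sum grads[g] (nelems floats, already resident on GPU g) across ngpus GPUs, in place.
void allreduce_gradients(float** grads, size_t nelems, int ngpus) {
  ncclComm_t comms[ngpus];
  cudaStream_t streams[ngpus];
  int devs[ngpus];

  // One communicator and one stream per GPU, all inside this process.
  for (int g = 0; g < ngpus; g++) devs[g] = g;
  ncclCommInitAll(comms, ngpus, devs);
  for (int g = 0; g < ngpus; g++) {
    cudaSetDevice(g);
    cudaStreamCreate(&streams[g]);
  }

  // Launch the allreduce on every GPU; grouping the calls lets one host
  // thread drive all devices without blocking between launches.
  ncclGroupStart();
  for (int g = 0; g < ngpus; g++) {
    ncclAllReduce(grads[g], grads[g], nelems, ncclFloat, ncclSum,
                  comms[g], streams[g]);
  }
  ncclGroupEnd();

  // Wait for completion before the parameter update reads the summed gradients.
  for (int g = 0; g < ngpus; g++) {
    cudaSetDevice(g);
    cudaStreamSynchronize(streams[g]);
  }

  for (int g = 0; g < ngpus; g++) {
    cudaSetDevice(g);
    cudaStreamDestroy(streams[g]);
    ncclCommDestroy(comms[g]);
  }
}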

Page 5

NCCL: A multi-GPU communication library

Between systems : Sockets (Ethernet), InfiniBand with GPU Direct RDMA
Within a system : PCIe, NVLink, GPU Direct P2P
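
Not shown on the slide, but for context: communication between systems uses a single NCCL communicator spanning all ranks. A sketch of the standard initialization, assuming one process per GPU and assuming the unique id is shared by some out-of-band mechanism (the exchangeId() helper below is hypothetical):

// Rank 0 creates the id; every rank must receive the same id out-of-band
// (sockets, a shared file, MPI, ...) before creating its communicator.
ncclUniqueId id;
ncclComm_t comm;
cudaStream_t stream;

if (myRank == 0) ncclGetUniqueId(&id);
exchangeId(&id, myRank);              // hypothetical out-of-band broadcast

cudaSetDevice(localDevice);
cudaStreamCreate(&stream);
ncclCommInitRank(&comm, nRanks, id, myRank);

// The communicator now spans all systems; collectives such as ncclAllReduce
// use NVLink/PCIe within a node and sockets or InfiniBand across nodes.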

Page 6

NCCL: Architecture

[Diagram: software stack. Deep Learning Frameworks (TensorFlow (+Horovod), PyTorch, MXNet, Caffe2, Caffe, CNTK) sit on top of NCCL, cuDNN and cuBLAS, which run on CUDA and NVIDIA GPUs.]

Page 7

TIMELINE: NCCL history & roadmap

1.x : Intra-node communication
2.0 : Inter-node communication
2.1 : Improved latency
2.2 : Aggregated operations
2.3 : Large scale algorithms
2.4 : Point-to-point primitives (Send/Recv)

Page 8

NCCL 2.0: Provide best performance to DL apps

[Bar chart: Allreduce bandwidth (OMB, size=128MB), in GB/s, comparing QPI, CPU, PCI Switch, DGX1 (Pascal) and DGX1 (Volta) topologies; labeled values range from 12 to 132 GB/s.]

Page 9

NCCL 2.1: ResNet50 buffer size

Latency is important in some workloads, e.g. ResNet 50, in particular when reductions are done for each layer.

[Histogram: ResNet50 / MXNet, number of occurrences of each allreduce buffer size from 64 B to 2 MB, shown separately for FP32 and FP16.]

Page 10

NCCL 2.1: Latency improvement

[Bar chart: NCCL latency (in µs) for 2, 4, 8 and 16 GPUs, comparing 2.0.5 and 2.1.0, on 1 node over NVLink and on 2 nodes over NVLink+InfiniBand; 2.1.0 cuts latency by roughly 5x within a node and 7x across nodes (e.g. 265 µs down to 36 µs at 16 GPUs).]

Page 11

NCCL 2.2: Aggregated operations : principle

[Diagram: call path from the DL framework through ncclAllReduce (NCCL) down to cudaLaunchKernel (CUDA).]

Principle : Merge multiple operations on the same CUDA device. Pay the launch overhead only once (more operations per second) and use multiple NVLinks simultaneously (more bandwidth).

Page 12

NCCL 2.2: Aggregated operations : overhead

[Chart: per-operation time (µs, 0 to 15) versus number of aggregated operations (1 to 256), for an 8-byte reduction on 8 GPUs; the per-operation launch overhead drops as more operations are aggregated.]

Page 13

NCCL 2.2: Aggregated operations : usage

Use ncclGroupStart() / ncclGroupEnd() around the NCCL operations we want to aggregate :

ncclGroupStart();
for (int op=0; op<nops; op++) {
  ncclAllReduce(layers[op].localGradients, layers[op].globalGradients,
                layers[op].gradientSize, ncclFloat, ncclSum, ncclComm, ncclStream);
}
ncclGroupEnd();

// All operations are only guaranteed to be posted on the stream after ncclGroupEnd
cudaStreamSynchronize(ncclStream);

Page 14

NCCL 2.2: Aggregated operations : usage

Can be combined/nested with multi-GPU grouping :

ncclGroupStart();
for (int op=0; op<nops; op++) {
  for (int gpu=0; gpu<ngpus; gpu++) {
    ncclGroupStart();
    ncclAllReduce(layers[op].localGradients[gpu], layers[op].globalGradients[gpu],
                  layers[op].gradientSize, ncclFloat, ncclSum,
                  ncclComms[gpu], ncclStreams[gpu]);
    ncclGroupEnd();
  }
}
ncclGroupEnd();

// All operations are only guaranteed to be posted on the stream after the last ncclGroupEnd
for (int gpu=0; gpu<ngpus; gpu++)
  cudaStreamSynchronize(ncclStreams[gpu]);
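
The snippets above omit error handling. NCCL calls return an ncclResult_t, so in real code each call would typically be wrapped in a checking macro; a minimal sketch (not from the slides) could look like this:

#include <stdio.h>
#include <stdlib.h>
#include <nccl.h>

// Abort with a readable message if an NCCL call fails.
#define NCCLCHECK(cmd) do {                                   \
    ncclResult_t res = (cmd);                                 \
    if (res != ncclSuccess) {                                 \
      fprintf(stderr, "NCCL error %s:%d '%s'\n",              \
              __FILE__, __LINE__, ncclGetErrorString(res));   \
      exit(EXIT_FAILURE);                                     \
    }                                                         \
  } while (0)

// Usage: NCCLCHECK(ncclAllReduce(...));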

Page 15

NCCL 2.2: Aggregated operations : other uses

ReduceScatterV = Aggregation of multiple reduce operations:

ncclGroupStart();
for (int rank=0; rank<nranks; rank++) {
  ncclReduce(sendbuff+offsets[rank], recvbuff+offsets[rank],
             recvcounts[rank], datatype, redOp, rank, comm, stream);
}
ncclGroupEnd();

AllGatherV = Aggregation of multiple broadcast operations:

ncclGroupStart();
for (int rank=0; rank<nranks; rank++) {
  ncclBroadcast(sendbuff+offsets[rank], recvbuff+offsets[rank],
                recvcounts[rank], datatype, rank, comm, stream);
}
ncclGroupEnd();
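
The offsets and recvcounts arrays above are not defined on the slide. One plausible setup (illustrative names, with offsets as the exclusive prefix sum of the per-rank element counts) would be:

// Rank r contributes/receives recvcounts[r] elements starting at offsets[r].
size_t offsets[nranks];
size_t total = 0;
for (int rank = 0; rank < nranks; rank++) {
  offsets[rank] = total;            // exclusive prefix sum of recvcounts
  total += recvcounts[rank];
}
// sendbuff and recvbuff must each hold "total" elements of "datatype".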

Page 16

NCCL 2.3: Large scale algorithms

[Chart: Allreduce latency versus number of GPUs (2 to 128), comparing NCCL 2.2 with the projected NCCL 2.3 large-scale algorithms.]

Page 17

NCCL 2.4: Point-to-point primitives

Send / Receive, Scatter[v], Gather[v], Alltoall[v,w], neighbor collectives, …

[Diagram: scatter, gather, alltoall and neighbor communication patterns.]
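
As an illustration of how such primitives compose into a Scatter, here is a sketch of a scatter built from send/receive calls. This is an assumption about usage, written with the ncclSend/ncclRecv API that shipped in later NCCL releases (2.7+); root, myRank, nranks, chunkCount and the buffers are illustrative names.

// Scatter from "root": the root sends one equally-sized chunk to every rank
// (including itself), and each rank receives its own chunk.
ncclGroupStart();
if (myRank == root) {
  for (int peer = 0; peer < nranks; peer++)
    ncclSend(sendbuff + peer * chunkCount, chunkCount, ncclFloat,
             peer, comm, stream);
}
ncclRecv(recvbuff, chunkCount, ncclFloat, root, comm, stream);
ncclGroupEnd();
cudaStreamSynchronize(stream);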

Page 18

NCCL: Summary

Optimized inter-GPU communication for DL and HPC. Optimized for all NVIDIA platforms, most OEMs and Cloud. Scales to 100s of GPUs, targeting 10,000s in the near future.

Aims at covering all communication needs for multi-GPU computing.

Only relies on CUDA. No dependency on MPI or any parallel environment.

More questions? Connect with the Experts : NCCL, Wed 28, 3pm
