Transcript of "NCCL 2.0" - GTC 2017 On-Demand Featured Talk
on-demand.gputechconf.com/gtc/2017/presentation/s...

Page 1:

Sylvain Jeaugey

NCCL 2.0

Page 2:

DEEP LEARNING ON GPUS
Making DL training times shorter

[Diagram: Multi-core CPU → GPU (CUDA) → Multi-GPU (NCCL 1) → Multi-GPU, multi-node (NCCL 2)]

Deeper neural networks, larger data sets… training is a very, very long operation!

Page 3:

NCCL
A multi-GPU communication library

Within a system : PCIe, NVLink, GPU Direct P2P
To other systems : Sockets (Ethernet), Infiniband with GPU Direct RDMA

Page 4:

NCCL Architecture

[Stack diagram: Deep Learning Frameworks (Caffe, Caffe2, Torch, TF, MXNET, CNTK) on top of NCCL, CUDNN and CUBLAS, on top of CUDA, on top of NVIDIA GPUs.]

Page 5:

AGENDA

NCCL : History, Design

NCCL 2.0 : New features, API changes

Performance

Future

Page 6:

HISTORY

Q4 2015: NCCL 1.x

Open-source research project on GitHub, helping Deep Learning frameworks compute on multiple GPUs with efficient collective operations.

Limited to intra-node.

Q2 2017: NCCL 2.x and beyond

NVIDIA library, with multi-node support and an improved API.

Page 7:

DESIGN What is NCCL ?

Optimized collective communication library between CUDA devices.

Easy to integrate into any DL framework, as well as traditional HPC apps using MPI.

Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions and NVLink usage.

Operates on CUDA pointers. Operations are tied to a CUDA stream.

Uses as few threads as possible, to let other computation progress simultaneously.
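A minimal sketch of what this looks like in host code: one process drives an AllReduce on every visible GPU, passing device pointers and one CUDA stream per GPU. This is an illustrative example (not from the talk); it assumes at most 8 GPUs and omits error checking and data initialization.

  #include <cuda_runtime.h>
  #include <nccl.h>

  void allreduce_on_all_gpus(size_t count) {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);              /* assume ngpus <= 8 here */

    ncclComm_t comms[8];
    ncclCommInitAll(comms, ngpus, NULL);     /* one communicator per device */

    float* sendbuff[8]; float* recvbuff[8]; cudaStream_t streams[8];
    for (int i = 0; i < ngpus; i++) {
      cudaSetDevice(i);
      cudaMalloc((void**)&sendbuff[i], count * sizeof(float));
      cudaMalloc((void**)&recvbuff[i], count * sizeof(float));
      cudaStreamCreate(&streams[i]);
    }

    /* The collective takes device pointers and is enqueued on a stream,
       so it runs asynchronously like any other CUDA kernel. The group
       calls (required in NCCL 2.0 when one thread manages several GPUs)
       are described later in the talk. */
    ncclGroupStart();
    for (int i = 0; i < ngpus; i++)
      ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                    comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ngpus; i++) {        /* wait for completion */
      cudaSetDevice(i);
      cudaStreamSynchronize(streams[i]);
    }
    for (int i = 0; i < ngpus; i++) ncclCommDestroy(comms[i]);
  }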

Page 8:

DESIGN Rings

NCCL uses rings to move data across all GPUs and perform reductions.

Page 9:

DESIGN Rings

NCCL uses rings to move data across all GPUs and perform reductions.

PCIe / QPI : 1 unidirectional ring

Page 10:

DESIGN Rings

NCCL uses rings to move data across all GPUs and perform reductions.

PCIe / QPI : 1 unidirectional ring
DGX-1 : 4 unidirectional rings
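The ring schedule itself can be illustrated without any GPU code. The sketch below simulates, on the host, how a reduce-scatter pass followed by an allgather pass moves chunks around a unidirectional ring until every rank holds the fully reduced buffer. It describes the algorithm only; it is not NCCL's implementation.

  /* Host-only illustration of a ring allreduce (sum). Each "rank" stands
     for one GPU; sends along the ring are modelled as array accesses. */
  #include <stdio.h>

  #define RANKS 4                      /* GPUs in the ring */
  #define CHUNK 2                      /* elements per chunk */
  #define COUNT (RANKS * CHUNK)        /* elements per rank */

  int main(void) {
    float buf[RANKS][COUNT];
    for (int r = 0; r < RANKS; r++)
      for (int i = 0; i < COUNT; i++) buf[r][i] = (float)(r + 1);

    /* Phase 1 - reduce-scatter: after RANKS-1 steps, rank r holds the
       fully reduced chunk (r + 1) % RANKS. */
    for (int step = 0; step < RANKS - 1; step++)
      for (int r = 0; r < RANKS; r++) {
        int prev = (r - 1 + RANKS) % RANKS;        /* previous GPU in the ring */
        int c = (prev - step + RANKS) % RANKS;     /* chunk received this step */
        for (int i = 0; i < CHUNK; i++)
          buf[r][c * CHUNK + i] += buf[prev][c * CHUNK + i];
      }

    /* Phase 2 - allgather: the reduced chunks travel once more around the
       ring so that every rank ends up with the complete result. */
    for (int step = 0; step < RANKS - 1; step++)
      for (int r = 0; r < RANKS; r++) {
        int prev = (r - 1 + RANKS) % RANKS;
        int c = (prev + 1 - step + RANKS) % RANKS; /* chunk forwarded this step */
        for (int i = 0; i < CHUNK; i++)
          buf[r][c * CHUNK + i] = buf[prev][c * CHUNK + i];
      }

    /* With 4 ranks initialized to 1, 2, 3, 4, every element is now 10. */
    printf("rank 0, element 0: %g\n", buf[0][0]);
    return 0;
  }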

Page 11:

DESIGN Kernels

[Diagram: on each GPU, the NCCL kernel reads from sendbuff, combines it (reduction) with data arriving from the previous GPU in the ring through a FIFO, forwards the result to the next GPU in the ring, and writes into recvbuff.]
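The arithmetic at the heart of this pipeline is simple. A toy CUDA kernel for the reduction stage alone could look like the sketch below: add the chunk received from the previous GPU (staged in a FIFO slot) to the local send buffer, producing the data forwarded to the next GPU. This is only an illustration; the real NCCL kernels additionally handle FIFO flow control, synchronization flags and multi-ring striping.

  /* Toy reduction stage, not NCCL's actual kernel. */
  __global__ void reduce_and_forward(const float* fifo_from_prev, /* received from previous GPU */
                                     const float* sendbuff,       /* local contribution */
                                     float* to_next,              /* staged for the next GPU */
                                     int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      to_next[i] = fifo_from_prev[i] + sendbuff[i];
  }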

Page 12:

NCCL 2.0

Page 13:

NCCL 2.0

Inter-node communication using Sockets or Infiniband verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA.

Optimal combination of NVLink, PCI and network interfaces to maximize bandwidth and create rings across nodes.

Inter-node communication

[Examples: PCIe + Infiniband; DGX-1 : NVLink + 4x Infiniband]


Page 15:

NCCL 2.0

Supports a combination of processes (potentially across nodes), threads per process and GPUs per thread.

Processes, threads and GPUs

[Diagram: n nodes, each with two CPU sockets (CPU0, CPU1) and 8 GPUs, 4 GPUs per socket.]

Page 16:

NCCL 2.0

Processes, threads and GPUs

[Same diagram, with 1 process per GPU: processes P0 … P15 on the first two nodes, up to P8n-8 … P8n-1 on node n-1.]
Page 17:

NCCL 2.0

Processes, threads and GPUs

[Same diagram, with 1 process per CPU socket (Process 0 … Process 2n-1) and 1 thread per GPU (t0 … t3 in each process).]
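For this mapping (one process per socket, one thread per GPU), each thread can create and use its own communicator directly with ncclCommInitRank, with no group call needed, as long as every thread passes the same unique id and a distinct global rank. A rough sketch, assuming 4 GPUs per socket and leaving out the out-of-band broadcast of the id and all error handling:

  #include <pthread.h>
  #include <cuda_runtime.h>
  #include <nccl.h>

  #define GPUS_PER_SOCKET 4

  struct worker_arg {
    ncclUniqueId id;   /* same id on every rank, broadcast beforehand */
    int nranks;        /* total number of GPUs across all nodes */
    int rank;          /* global rank of the GPU this thread drives */
    int device;        /* local CUDA device index */
  };

  static void* worker(void* p) {
    struct worker_arg* a = (struct worker_arg*)p;
    ncclComm_t comm;
    cudaSetDevice(a->device);
    ncclCommInitRank(&comm, a->nranks, a->id, a->rank);  /* one device per thread */
    /* ... ncclAllReduce(..., comm, stream) and other collectives here ... */
    ncclCommDestroy(comm);
    return NULL;
  }

  /* The process owning one socket spawns one thread per local GPU,
     giving them consecutive global ranks. */
  void run_socket(ncclUniqueId id, int nranks, int first_rank, int first_dev) {
    pthread_t t[GPUS_PER_SOCKET];
    struct worker_arg a[GPUS_PER_SOCKET];
    for (int i = 0; i < GPUS_PER_SOCKET; i++) {
      a[i].id = id; a[i].nranks = nranks;
      a[i].rank = first_rank + i; a[i].device = first_dev + i;
      pthread_create(&t[i], NULL, worker, &a[i]);
    }
    for (int i = 0; i < GPUS_PER_SOCKET; i++) pthread_join(t[i], NULL);
  }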

Page 18:

NCCL 2.0

Processes, threads and GPUs

[Same diagram, with 1 process per node (Process 0, Process 1, … Process n-1) and 8 GPUs per process.]

Page 19:

NCCL 2.0 API Group calls

NCCL 2.0 introduces new verbs, ncclGroupStart / ncclGroupEnd, which are mandatory when managing multiple devices from a single thread.

NCCL 1.x :

  for (int i=0; i<ngpus; i++) {
    cudaSetDevice(devices[i]);
    ncclAllReduce(…, comms[i], streams[i]);
  }

NCCL 2.0 :

  ncclGroupStart();
  for (int i=0; i<ngpus; i++) {
    ncclAllReduce(…, comms[i], streams[i]);
  }
  ncclGroupEnd();

[Diagram: a single process (Process 0) driving all 8 GPUs of a node.]

Page 20:

NCCL 2.0 API

Inter-node communicator creation still uses the NCCL 1.x verbs : ncclGetUniqueId / ncclCommInitRank.

  if (rank == 0) ncclGetUniqueId(&id);
  My_Bcast(&id);
  ncclCommInitRank(&comm, nranks, id, rank);

Multi-process + multi-GPU per process (from a single thread) : combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd.

  if (rank == 0) ncclGetUniqueId(&id);
  My_Bcast(&id);
  ncclGroupStart();
  for (int i=0; i<ndev; i++) {
    cudaSetDevice(devices[i]);
    ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
  }
  ncclGroupEnd();

Integration with parallel environments
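My_Bcast above stands for whatever out-of-band broadcast the parallel environment already provides. With MPI, for example, it could simply be an MPI_Bcast of the opaque id (a sketch, not part of the NCCL API):

  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
  ncclCommInitRank(&comm, nranks, id, rank);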

[Diagrams: one process per GPU (P0 … P7), and a single process (Process 0) managing all 8 GPUs of a node.]

Page 21:

NCCL 2.0 API

Other small API adjustments over the NCCL 1.x API :

Counts are now of type size_t instead of int.

allGather argument order has been fixed to be consistent with the other operations (see the sketch below).

Additions/clarifications on datatypes :
  integral : int8 = char, uint8, int32 = int, uint32, int64, uint64
  floating point : float16 = half, float32 = float, float64 = double

Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations.

Others
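As an illustration of the allGather change, the same call written against both APIs (a sketch; arguments other than the buffers, count and datatype are unchanged):

  /* NCCL 1.x : recvbuff came after the datatype, count was an int. */
  ncclAllGather(sendbuff, count, ncclFloat, recvbuff, comm, stream);

  /* NCCL 2.0 : buffers first, like the other collectives, size_t count. */
  ncclAllGather(sendbuff, recvbuff, count, ncclFloat, comm, stream);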

Page 22:

PERFORMANCE

Page 23:

PERFORMANCE Intra-node performance

[Bar chart: AllReduce bandwidth (OMB, size=128MB, in GB/s), 0-60 GB/s scale, for configurations 4 QPI, 4 CPU, 4 PCI and DGX-1.]

Page 24:

PERFORMANCE Inter-node performance

[Bar chart: AllReduce bandwidth (OMB, size=128MB, in GB/s), 0-45 GB/s scale, comparing MPI, Baidu Allreduce and NCCL on 2 nodes x 4 GPUs (IB EDR, PCI Switch) and 4 nodes x 8 GPUs (DGX-1 : 4x IB EDR, 4x NVLink).]

Page 25:

PERFORMANCE Deep Learning - CNTK

[Chart: CNTK scaling, ResNet50, images/s versus number of GPUs (0 to 32), comparing Ideal, MPI and NCCL. Data labels on the chart: 217, 1645, 1684, 1744, 3281, 3360, 6569 images/s.]

Page 26:

FUTURE

Page 27:

FUTURE

Additional communication primitives :
  point-to-point communication
  scatter (1 to N), gather (N to 1), alltoall (N to N)
  neighbor collectives (send/receive in multiple dimensions)

User-defined reduction operations; also, trying to better merge computation and communication.

Windows support

Please let us know your needs ! Connect with experts / NCCL session : Wed Apr 10, 4pm

Top asked features

Page 28: