Graph500: From Kepler to Pascal - NVIDIAon-demand.gputechconf.com/gtc/2017/presentation/s7309...I 2...


Graph500: From Kepler to Pascal

Julien Loiseau, Michaël Krajecki, François Alin and Christophe Jaillet

GTC 2017


University of Reims Champagne Ardenne (URCA)

Multidisciplinary University

- About 27,000 students
- 5 campuses: Reims, Troyes, Charleville-Mézières, Chaumont and Châlons-en-Champagne
- A wide initial undergraduate studies program
- Graduate studies and PhD programs linked with research labs

Graph500: From Kepler to Pascal J. Loiseau et al. 1 / 19


HPC issues

Power efficiency

Exascale architecture

- Computational power: Petaflop → ×1000 → Exaflop
- Moore's law is over
- Energy efficiency: 8MW → ×1000 → 8GW ??


HPC issues

HPC Architectures

Figure: architecture diagram. Several "CPU(s) + Accelerator(s)" nodes are linked by MPI. Labels in the diagram: accelerators (Xeon Phi, GPU, FPGA, ASIC); memory (in-core, out-of-core); SSD, HDD; CPU/GPU.


HPC issues

ROMEO, Reims, France

ROMEO supercomputer

- Reims, Champagne-Ardenne, France
- 130 nodes
  - 2 × CPU Intel Xeon E5-2650v2 2.6GHz (16GB RAM)
  - 2 × GPU NVIDIA K20Xm (6GB RAM)
- FatTree with InfiniBand
- 1 × DGX-1 node


HPC issues

ROMEO, Regional HPC Center

Its mission is to deliver, for both industrial and academic researchers:

- high performance computing resources
- secured storage spaces
- specific & scientific software
- advanced user support in exploiting these resources
- in-depth expertise in different engineering fields: HPC, applied mathematics, physics, biophysics and chemistry, ...
- promote and diffuse HPC and simulation to companies / SMB
- identify, experiment and master breakthrough technologies
  - which give new opportunities for our users
  - from technology-watching to production
  - for all research domains


HPC issues

GPU Integration

2008: 1 server, Tesla S1070 (960 cores/U)

2010: 10% of cluster nodes with GPU (Fermi - M2090)

2013: 100% of cluster nodes with GPU (260 K20x)

TOP 500 & GREEN 500

2012

2015

2016

2016: DGX1 server, first server dedicated to Deep Learning (8 GPU P100)


HPC issues

Benchmarking HPC Architectures

How to compare the computing power of parallel architectures?

TOP500

- LINPACK
- Solving n equations with n unknowns
- "Regular"

Floating-point Operations Per Second, FLOPS

GRAPH500

- BFS
- Large randomly generated graphs
- "Irregular"

Traversed Edges Per Second, TEPS
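As a minimal sketch of the TEPS metric, assuming one BFS run traversed a known number of edges in a measured time. The function names and the timing value are illustrative and do not come from the slides.

```python
def teps(edges_traversed: int, seconds: float) -> float:
    """Traversed edges per second for a single BFS run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return edges_traversed / seconds


def gteps(edges_traversed: int, seconds: float) -> float:
    """Same metric expressed in giga-TEPS (1e9 edges per second)."""
    return teps(edges_traversed, seconds) / 1e9


# Hypothetical run: 2**25 edges traversed in 12.5 ms (made-up timing).
example = gteps(2 ** 25, 0.0125)
print(f"{example:.2f} GTEPS")
```

Unlike FLOPS, the numerator counts edges visited rather than arithmetic operations, which is why memory access and communication dominate the score.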


GRAPH500

Protocol and ranking

Graph algorithms:

- Irregular memory access
- Irregular communications
- No heavy computation step

Data dense application

Steps

- Graph generation (SKG, RMAT)
- Randomly sample 64 unique root vertices
- Structure generation
- For each root vertex:
  - BFS
  - Validate BFS tree
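The steps above can be sketched as a small sequential harness. This is an illustration under assumptions, not the reference implementation: the graph is a plain adjacency list, the validation is a simplified check (tree edges exist, no cycles, BFS levels differ by at most 1 across every edge), and roots are drawn only from vertices that have at least one edge. All function names are hypothetical.

```python
import random
from collections import deque


def bfs_parents(adj, root):
    """BFS from root; returns a parent array (-1 means unreached)."""
    parent = [-1] * len(adj)
    parent[root] = root
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if parent[v] == -1:
                parent[v] = u
                queue.append(v)
    return parent


def validate(adj, root, parent):
    """Simplified validation of a BFS parent array."""
    n = len(adj)
    if parent[root] != root:
        return False
    level = [-1] * n
    level[root] = 0
    for v in range(n):
        path, u = [], v
        # Walk up until a vertex with a known level (or an unreached one).
        while parent[u] != -1 and level[u] == -1:
            if parent[u] not in adj[u] or len(path) > n:
                return False  # tree edge missing from the graph, or a cycle
            path.append(u)
            u = parent[u]
        if parent[u] == -1:
            if path:
                return False  # reached vertex hangs off an unreached one
            continue
        for w in reversed(path):
            level[w] = level[parent[w]] + 1
    # Every graph edge must join vertices at most one level apart.
    for u in range(n):
        for v in adj[u]:
            if (level[u] == -1) != (level[v] == -1):
                return False
            if level[u] != -1 and abs(level[u] - level[v]) > 1:
                return False
    return True


def run_benchmark(adj, num_roots=64, seed=0):
    """Sample unique roots, run one BFS per root, validate each tree."""
    rng = random.Random(seed)
    candidates = [v for v in range(len(adj)) if adj[v]]  # assumption: degree >= 1
    roots = rng.sample(candidates, min(num_roots, len(candidates)))
    return [(r, validate(adj, r, bfs_parents(adj, r))) for r in roots]


# Tiny example: an 8-vertex ring plus one isolated vertex.
ring = [[(i - 1) % 8, (i + 1) % 8] for i in range(8)] + [[]]
results = run_benchmark(ring)
```

In a real run each BFS would also be timed so that a TEPS figure can be reported per root.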


GRAPH500

Protocol and ranking

Goal: Breadth First Search on a random graph

Figure: BFS from a source vertex, reaching the vertices of Level 1, Level 2 and Level 3 in turn.


GRAPH500

Problem Scale

- 2^SCALE ⇒ vertices
- 2^(SCALE+4) ⇒ edges

Problem size   Scale   Memory (TB)
Toy            26      0.0172
Mini           29      0.1374
Small          32      1.0995
Medium         36      17.5922
Large          39      140.7375
Huge           42      1125.8999

- For graph generation
- Converted before use
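The Memory (TB) column above is consistent with an edge list of 2^(SCALE+4) edges stored at 16 bytes per edge, counted in decimal terabytes. The 16 bytes per edge (two 64-bit vertex identifiers) is an assumption inferred from the numbers, not something stated on the slide. A short sketch of the arithmetic:

```python
EDGE_FACTOR = 16      # 2**4 edges per vertex, from "2^(SCALE+4) edges"
BYTES_PER_EDGE = 16   # assumption: two 64-bit vertex ids per edge


def edge_list_terabytes(scale: int) -> float:
    """Size of the generated edge list in decimal terabytes (1e12 bytes)."""
    edges = EDGE_FACTOR * 2 ** scale
    return edges * BYTES_PER_EDGE / 1e12


sizes = {"Toy": 26, "Mini": 29, "Small": 32, "Medium": 36, "Large": 39, "Huge": 42}
table = {name: round(edge_list_terabytes(s), 4) for name, s in sizes.items()}
for name, tb in table.items():
    print(f"{name:<7}{sizes[name]:>3}  {tb}")
```

This is the size used for graph generation only; the structure actually traversed is converted to a more compact format first.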

Graph500 current list (Nov. 2016)

Best CPU & GPU machines:

Rank   Name          Scale   GTEPS
1      K computer    40      38621.4
2      Sunway        40      23755.7
3      Sequoia       41      23751
...    ...           ...     ...
31     TSUBAME 2.0   35      462.25
39     GSIC Center   35      317.09
43     HA-PACS       32      223.634

- BlueGene ⇒ 19 in top 30
- GPU ⇒ NVIDIA


GRAPH500

Graph generation

Figure: adjacency matrix ("From" nodes × "To" nodes) split recursively into four quadrants a, b, c, d, producing a sparse graph.

- Kronecker
- Generation: a = 0.57, b = 0.19, c = 0.19, d = 0.05
- Edges: 16× more than vertices
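A minimal sketch of a Kronecker (R-MAT style) edge generator using the probabilities above. It assumes that a, b, c, d are the top-left, top-right, bottom-left and bottom-right quadrants, and it omits the vertex relabelling and other refinements a production generator would apply. The function names are illustrative.

```python
import random

A, B, C, D = 0.57, 0.19, 0.19, 0.05  # quadrant probabilities from the slide


def rmat_edge(scale, rng):
    """Pick one edge by choosing a quadrant at each of `scale` levels."""
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        src <<= 1
        dst <<= 1
        if r < A:
            pass                 # quadrant a: both bits stay 0
        elif r < A + B:
            dst |= 1             # quadrant b
        elif r < A + B + C:
            src |= 1             # quadrant c
        else:
            src |= 1             # quadrant d
            dst |= 1
    return src, dst


def rmat_edges(scale, edge_factor=16, seed=0):
    """Generate edge_factor * 2**scale edges over 2**scale vertices."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng) for _ in range(edge_factor * 2 ** scale)]


edges = rmat_edges(scale=6, seed=1)
print(len(edges), "edges over", 2 ** 6, "vertices")
```

Because a is much larger than d, low-numbered vertices collect most of the edges, which gives the skewed degree distribution that makes the traversal irregular.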


GRAPH500

Data structure format

Structure format

→ Bitmap
- Natural representation
- 2^20 vertices = 128GB
- BG/Q version

→ CSR/CSC (Compressed Sparse Rows/Columns)
- Compressed format
- 2^20 vertices < 1GB
- BG/P and GPU version
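The two sizes quoted above can be checked with a short calculation. The sketch assumes one bit per (from, to) pair for the bitmap, and 8-byte indices with an edge factor of 16 for CSR; these storage widths are assumptions, not values given on the slide.

```python
def bitmap_bytes(scale: int) -> int:
    """Dense adjacency bitmap: one bit per possible (from, to) pair."""
    n = 2 ** scale
    return n * n // 8


def csr_bytes(scale: int, edge_factor: int = 16, index_bytes: int = 8) -> int:
    """CSR: (n + 1) row pointers plus one column index per edge."""
    n = 2 ** scale
    m = edge_factor * n
    return (n + 1) * index_bytes + m * index_bytes


GIB = 2 ** 30
bitmap_gib = bitmap_bytes(20) / GIB
csr_gib = csr_bytes(20) / GIB
print(f"bitmap: {bitmap_gib:.0f} GiB, CSR: {csr_gib:.3f} GiB")
```

The bitmap grows with the square of the vertex count, while CSR grows with the number of edges, which is why the compressed format is the one that fits in GPU memory.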

Figure: Compressed Sparse Row example, showing a sparse matrix together with its "Row pointers" and "Column indices" arrays.
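To make the Compressed Sparse Row layout concrete, here is a minimal sketch that builds the row-pointer and column-index arrays from an edge list. The 4-vertex edge list is a made-up example and is not the matrix shown on the slide.

```python
def build_csr(num_vertices, edges):
    """Return (row_ptr, col_idx) for a directed edge list."""
    row_ptr = [0] * (num_vertices + 1)
    for u, _ in edges:
        row_ptr[u + 1] += 1                  # count edges per row
    for i in range(num_vertices):
        row_ptr[i + 1] += row_ptr[i]         # prefix sum gives row starts
    col_idx = [0] * len(edges)
    cursor = row_ptr[:-1]                    # next free slot in each row (a copy)
    for u, v in edges:
        col_idx[cursor[u]] = v
        cursor[u] += 1
    return row_ptr, col_idx


def neighbors(row_ptr, col_idx, u):
    """Out-neighbours of u are one contiguous slice of col_idx."""
    return col_idx[row_ptr[u]:row_ptr[u + 1]]


# Hypothetical 4-vertex graph.
example_edges = [(0, 1), (0, 3), (1, 2), (2, 0), (2, 3), (3, 1)]
row_ptr, col_idx = build_csr(4, example_edges)

# CSC is the same construction applied to the reversed edges.
col_ptr, row_idx = build_csr(4, [(v, u) for u, v in example_edges])
```

CSR gives fast access to the children of a vertex, and CSC to its parents, which matches the two exploration directions used later in the deck.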


GRAPH500

Parallel algorithm

- Split the adjacency matrix into blocks

A0,0 A0,1 A0,2 A0,3
A1,0 A1,1 A1,2 A1,3
A2,0 A2,1 A2,2 A2,3
A3,0 A3,1 A3,2 A3,3

Figure: every block Ai,j generates its output queue, then the input queue is shared; repeat until end.

- l × l machines with l = 2^k (k ∈ N)
- Machine Mi,j ⇔ block Ai,j
  → Vertices "In" Ri
  → Vertices "Out" Rj
- Predecessors, 1D distribution: Mi,j gets 1/4 of the vertices in Ri
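The block decomposition above can be sketched as an index mapping. This assumes that the ranges Ri are contiguous and of equal size, that the row index follows the first endpoint of an edge, and that the predecessor array of Ri is split evenly across the l machines of row i. These choices and all function names are illustrative; the slides do not spell them out.

```python
def block_size(num_vertices: int, l: int) -> int:
    """Vertices per range R_i (ceiling division)."""
    return -(-num_vertices // l)


def owner(u: int, v: int, num_vertices: int, l: int):
    """Machine M(i, j) holding adjacency entry (u, v)."""
    size = block_size(num_vertices, l)
    return u // size, v // size


def partition_edges(edges, num_vertices: int, l: int):
    """Group an edge list into the l x l blocks A(i, j)."""
    blocks = {(i, j): [] for i in range(l) for j in range(l)}
    for u, v in edges:
        blocks[owner(u, v, num_vertices, l)].append((u, v))
    return blocks


def predecessor_owner(vertex: int, num_vertices: int, l: int):
    """1D distribution: R_i is cut into l equal parts, one per machine M(i, j)."""
    size = block_size(num_vertices, l)
    part = -(-size // l)
    return vertex // size, (vertex % size) // part


# Example with 16 vertices on a 4 x 4 grid (l = 2**2).
blocks = partition_edges([(0, 5), (6, 1), (15, 15), (9, 2)], 16, 4)
```

With l = 4, each machine of a row stores the predecessors of one quarter of the vertices in Ri, which is the 1/4 quoted on the slide.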


GRAPH500

Exploration

CSR

- Top-down
- in_queue → out_queue
- Search for children: from the current frontier towards the not yet visited vertices

CSC

- Bottom-up
- out_queue & visited ← in_queue
- Search for parents: from the not yet visited vertices towards the current frontier

Iteration   Top-down     Bottom-up    Hybrid version
0           27           22,090,111   27
1           8,156        1,568,798    8,156
2           3,695,684    587,893      587,893
3           19,565,465   12,586       12,586
4           214,578      8,256        8,256
5           5,865        1,201        1,201
6           12           156          12
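The two exploration directions and a hybrid of them can be sketched sequentially as follows. The switching rule used here (go bottom-up when the frontier exceeds a fraction of the vertices) is a hypothetical heuristic chosen for illustration; the slides only show that the hybrid version takes the cheaper direction at each iteration.

```python
def top_down_step(adj, frontier, parent):
    """Frontier vertices look for unvisited children (CSR-style access)."""
    next_frontier = []
    for u in frontier:
        for v in adj[u]:
            if parent[v] == -1:
                parent[v] = u
                next_frontier.append(v)
    return next_frontier


def bottom_up_step(adj_in, frontier, parent):
    """Unvisited vertices look for a parent in the frontier (CSC-style access)."""
    in_frontier = set(frontier)
    next_frontier = []
    for v in range(len(adj_in)):
        if parent[v] == -1:
            for u in adj_in[v]:
                if u in in_frontier:
                    parent[v] = u
                    next_frontier.append(v)
                    break  # one parent is enough
    return next_frontier


def hybrid_bfs(adj, adj_in, root, threshold=0.05):
    """BFS that picks a direction per level; returns (parent, directions)."""
    n = len(adj)
    parent = [-1] * n
    parent[root] = root
    frontier, directions = [root], []
    while frontier:
        if len(frontier) > threshold * n:  # hypothetical switching rule
            frontier = bottom_up_step(adj_in, frontier, parent)
            directions.append("bottom-up")
        else:
            frontier = top_down_step(adj, frontier, parent)
            directions.append("top-down")
    return parent, directions


# Small undirected example, so adj and adj_in are the same lists.
g = [[1], [0, 2, 4, 5], [1, 3], [2], [1], [1]]
parent, directions = hybrid_bfs(g, g, root=0, threshold=0.3)
```

Bottom-up pays off in the middle iterations, when the frontier is large and most unvisited vertices find a parent after checking only a few neighbours.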


Results and prospects

Performance analysis

CPU/GPU Comparison

- one CPU or GPU
- Different graph scales

Figure: "CPU/GPU Comparison", GTEPS (0 to 3) as a function of SCALE (16 to 21). Series: GTX 970, GTX 780 Ti, K20X, CPU E5-2650v2, TX1, CPU GRAPH500.


Results and prospects

Scalability

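Figure: weak scaling, GTEPS (0 to 14) as a function of #GPU (1, 4, 16, 64), with a second axis marked 21, 23, 25, 27. Series: CPU, GPU.

Figure: strong scaling (SCALE=21), GTEPS (0 to 6) as a function of #GPU (1, 4, 16, 64). Series: CPU, GPU.

- ROMEO → 105th (Nov. 2016)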
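One way to read the weak-scaling axes, assuming the values 21, 23, 25, 27 are the SCALE used with 1, 4, 16 and 64 GPUs respectively (this pairing is inferred from the tick labels, not stated in the slides): the number of vertices per GPU stays constant, whereas in strong scaling at SCALE=21 it shrinks as GPUs are added.

```python
def vertices_per_gpu(scale: int, num_gpus: int) -> int:
    """Vertices handled by each GPU for a graph of 2**scale vertices."""
    return 2 ** scale // num_gpus


gpu_counts = [1, 4, 16, 64]

# Weak scaling: the problem grows with the machine (assumed pairing).
weak = [vertices_per_gpu(s, g) for s, g in zip([21, 23, 25, 27], gpu_counts)]

# Strong scaling: fixed SCALE=21, work per GPU shrinks.
strong = [vertices_per_gpu(21, g) for g in gpu_counts]

print("weak:  ", weak)
print("strong:", strong)
```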


Results and prospects

P100 GPU

- Pascal Architecture
- Several improvements
- Base component of DGX-1

Communication, NVLink

- DGX-1, Power 8
- 40 GB/s bidirectional

⇒ Advantage for graph


Kepler vs Pascal

Product            K20X          Tesla P100
Arch               Kepler        Pascal
GPU                GK110         GP100
SMs                14            56            | More concurrent blocks
FP32/SM            192           64
FP32/GPU           2688          3584          | More concurrent threads
FP64/SM            64            32
FP64/GPU           896           1792
Base Clock         732 MHz       1328 MHz
FP32 GFLOPs        3950          10600
FP64 GFLOPs        1310          5300
Memory Interface   384b GDDR5    4096b HBM2
Memory Size        6 GB          16 GB         | 2.6× more mem.
L2 Cache Size      1536 KB       4096 KB
Register/SM        256 KB        256 KB        | same register size
Register/GPU       3584 KB       14336 KB
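The per-GPU rows of the table follow from the per-SM rows multiplied by the SM count. A small sketch that recomputes them from the figures above; the dictionary layout and field names are illustrative.

```python
specs = {
    "K20X": {"sms": 14, "fp32_per_sm": 192, "fp64_per_sm": 64,
             "reg_kb_per_sm": 256, "mem_gb": 6},
    "P100": {"sms": 56, "fp32_per_sm": 64, "fp64_per_sm": 32,
             "reg_kb_per_sm": 256, "mem_gb": 16},
}


def per_gpu(s):
    """Whole-GPU totals derived from per-SM values."""
    return {
        "fp32": s["sms"] * s["fp32_per_sm"],
        "fp64": s["sms"] * s["fp64_per_sm"],
        "reg_kb": s["sms"] * s["reg_kb_per_sm"],
    }


k20x, p100 = per_gpu(specs["K20X"]), per_gpu(specs["P100"])
sm_ratio = specs["P100"]["sms"] / specs["K20X"]["sms"]
mem_ratio = specs["P100"]["mem_gb"] / specs["K20X"]["mem_gb"]
reg_ratio = p100["reg_kb"] / k20x["reg_kb"]
print(k20x, p100, sm_ratio, round(mem_ratio, 2), reg_ratio)
```

Pascal has fewer cores per SM but four times as many SMs, so the register file per SM is unchanged while the total register space is four times larger.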


Kepler vs Pascal

ROMEO Supercomputer

- 105th in GRAPH500 (November 2016 list)

Benchmark for architectures and accelerators

- MESCA machine: 12TB of RAM + 8 sockets (256 threads)
- FPGA: Intel partnership
- Xeon Phi: Knights Landing?
- IBM OpenPower: new communications device (NVLink)

Applications

- Social networks
- Management of electrical networks
- Big data and deep learning
