Graph500: From Kepler to Pascal - NVIDIAon-demand.gputechconf.com/gtc/2017/presentation/s7309...I 2...

Graph500: From Kepler to Pascal

Julien Loiseau, Michaël Krajecki, François Alin and Christophe Jaillet

GTC 2017

University of Reims Champagne Ardenne (URCA)

Multidisciplinary University

- About 27,000 students
- 5 campuses: Reims, Troyes, Charleville-Mézières, Chaumont and Châlons-en-Champagne
- A wide initial undergraduate studies program
- Graduate studies and PhD programs linked with research labs

Graph500: From Kepler to Pascal J. Loiseau et al. 1 / 19

HPC issues

Power efficiency

Exascale architecture

- Computational power: Petaflop → ×1000 → Exaflop
- Moore's law is over
- Energy efficiency: 8 MW → ×1000 → 8 GW ??


HPC issues

HPC Architectures

[Diagram: nodes made of CPU(s) + Accelerator(s), connected through MPI]

- Accelerators: Xeon Phi, GPU, FPGA, ASIC
- Memory, in-core: CPU/GPU
- Memory, out-of-core: SSD, HDD


HPC issues

ROMEO, Reims, France

ROMEO supercomputer

- Reims, Champagne-Ardenne, France
- 130 nodes
  - 2 × CPU Intel Xeon E5-2650v2 2.6GHz (16GB RAM)
  - 2 × GPU NVIDIA K20Xm (6GB RAM)
- FatTree with InfiniBand
- 1 × DGX-1 node


HPC issues

ROMEO, Regional HPC Center

Its mission is to deliver, for both industrial and academic researchers:

- high performance computing resources
- secured storage spaces
- specific & scientific software
- advanced user support in exploiting these resources
- in-depth expertise in different engineering fields: HPC, applied mathematics, physics, biophysics and chemistry, ...
- Promote and diffuse HPC and simulation to companies / SMB
- identify, experiment and master breakthrough technologies
  - which give new opportunities for our users
  - from technology-watching to production
  - for all research domains


HPC issues

GPU Integration

- 2008: 1 server, Tesla S1070, 960 cores/U
- 2010: 10% of cluster nodes with GPU, Fermi - M2090
- 2013: 100% of cluster nodes with GPU, 260 K20X, TOP 500 & GREEN 500
- 2016: DGX-1 server, first server dedicated to Deep Learning, 8 GPU P100

[Timeline axis also marks 2012, 2015 and 2016]


HPC issues

Benchmarking HPC Architectures

How to compare the computing power of parallel architectures?

TOP500

- LINPACK
- Solving n equations with n unknowns
- "Regular"

Floating-point Operations Per Second, FLOPS

GRAPH500

- BFS
- Large randomly generated graphs
- "Irregular"

Traversed Edges Per Second, TEPS


GRAPH500

Protocol and ranking

Graph algorithms:

- Irregular memory access
- Irregular communications
- No heavy computation step

Data-dense application

Steps

- Graph generation (SKG, RMAT)
- Randomly sample 64 unique root vertices
- Structure generation
- For each root vertex:
  - BFS
  - Validate BFS tree
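The steps above can be sketched as a small single-process driver. This is only an illustration under stated assumptions, not the reference code: `generate_edges` draws uniform random edges instead of using the Kronecker generator, `validate` performs a simplified subset of the checks, and all function names are hypothetical.

```python
import random
import time
from collections import deque


def generate_edges(scale, edge_factor=16, seed=0):
    """Stand-in generator: uniform random edges (assumption, not Kronecker)."""
    rng = random.Random(seed)
    n = 1 << scale
    return [(rng.randrange(n), rng.randrange(n)) for _ in range(n * edge_factor)]


def build_adjacency(n, edges):
    """Structure generation: undirected adjacency lists, self-loops dropped."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].append(v)
            adj[v].append(u)
    return adj


def bfs(adj, root):
    """Return the BFS parent array; -1 marks unreached vertices."""
    parent = [-1] * len(adj)
    parent[root] = root
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if parent[v] == -1:
                parent[v] = u
                queue.append(v)
    return parent


def validate(adj, root, parent):
    """Simplified validation of a BFS tree (a subset of possible checks)."""
    if parent[root] != root:
        return False
    for v, p in enumerate(parent):
        if p != -1 and v != root and p not in adj[v]:
            return False  # every tree edge must be a graph edge
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            if (parent[u] == -1) != (parent[v] == -1):
                return False  # an edge never joins reached and unreached
    return True


def run_benchmark(scale=10, num_roots=64, seed=0):
    n = 1 << scale
    adj = build_adjacency(n, generate_edges(scale, seed=seed))
    rng = random.Random(seed + 1)
    # Assumption: roots are sampled among vertices with at least one edge.
    candidates = [v for v in range(n) if adj[v]]
    roots = rng.sample(candidates, min(num_roots, len(candidates)))
    teps = []
    for root in roots:
        start = time.perf_counter()
        parent = bfs(adj, root)
        elapsed = max(time.perf_counter() - start, 1e-9)
        assert validate(adj, root, parent)
        traversed = sum(len(adj[v]) for v in range(n) if parent[v] != -1) // 2
        teps.append(traversed / elapsed)
    return roots, teps
```

Calling `run_benchmark(scale=10)` returns the sampled roots and one TEPS value per search; only the BFS itself is timed, not the validation.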


GRAPH500

Protocol and ranking

Goal: Breadth First Search on a random graph

[Figure: BFS from a source vertex, with the visited vertices grouped into Level 1, Level 2 and Level 3]


GRAPH500

Problem Scale

- 2^SCALE ⇒ vertices
- 2^(SCALE+4) ⇒ edges

Problem size   Scale   Memory (TB)
Toy            26      0.0172
Mini           29      0.1374
Small          32      1.0995
Medium         36      17.5922
Large          39      140.7375
Huge           42      1125.8999

- For graph generation
- Converted before use
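The memory column can be reproduced with a few lines of arithmetic. The sketch below assumes 16 bytes per generated edge (two 64-bit endpoints) and decimal terabytes; both are assumptions chosen because they match the table, and `graph500_size` is a hypothetical helper name.

```python
def graph500_size(scale, edge_factor=16, bytes_per_edge=16):
    """Return (vertices, edges, terabytes) for a given SCALE.

    bytes_per_edge=16 is an assumption: two 64-bit vertex ids per edge.
    """
    vertices = 2 ** scale
    edges = edge_factor * vertices          # 2^(SCALE+4) with edge_factor = 16
    terabytes = edges * bytes_per_edge / 1e12
    return vertices, edges, terabytes


for name, scale in [("Toy", 26), ("Mini", 29), ("Small", 32),
                    ("Medium", 36), ("Large", 39), ("Huge", 42)]:
    _, _, tb = graph500_size(scale)
    print(f"{name:<8}{scale:>4}{tb:>12.4f}")
```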

- Graph500 current list (Nov. 2016)

Best CPU & GPU machines:

Rank   Name          Scale   GTEPS
(1)    K computer    40      38621.4
(2)    Sunway        40      23755.7
(3)    Sequoia       41      23751
...    ...           ...     ...
(31)   TSUBAME 2.0   35      462.25
(39)   GSIC Center   35      317.09
(43)   HA-PACS       32      223.634

- BlueGene ⇒ 19 in top 30
- GPU ⇒ NVIDIA


GRAPH500

Graph generation

[Figure: sparse adjacency matrix ("From" nodes × "To" nodes), recursively divided into four quadrants a, b, c, d]

Sparse Graph

- Kronecker
- Generation: a = 0.57, b = 0.19, c = 0.19, d = 0.05
- Edges: 16× more than vertices
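A minimal R-MAT style sketch of this generation: each edge is placed by choosing one quadrant per bit of the vertex id, with the probabilities a, b, c, d listed above. It assumes the quadrant layout a = top-left, b = top-right, c = bottom-left, d = bottom-right, and it leaves out refinements such as noise or vertex relabelling that a production generator may apply. Function names are hypothetical.

```python
import random


def rmat_edge(scale, rng, a=0.57, b=0.19, c=0.19):
    """Draw one edge by descending `scale` levels of quadrants (d = 1 - a - b - c)."""
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:
            pass                 # quadrant a: top-left, both bits stay 0
        elif r < a + b:
            col |= 1             # quadrant b: top-right
        elif r < a + b + c:
            row |= 1             # quadrant c: bottom-left
        else:
            row |= 1             # quadrant d: bottom-right
            col |= 1
    return row, col


def rmat_edges(scale, edge_factor=16, seed=0):
    """Generate edge_factor * 2^scale edges over 2^scale vertices."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng) for _ in range(edge_factor << scale)]
```

Because a is much larger than d, edges concentrate on low vertex ids, which gives the skewed degree distribution that makes the workload irregular.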


GRAPH500

Data structure format

Structure format

→ Bitmap
- Natural representation
- 2^20 vertices = 128 GB
- BG/Q version

→ CSR/CSC (Compressed Sparse Rows/Columns)
- Compressed format
- 2^20 vertices < 1 GB
- BG/P and GPU version

Compressed Sparse Row

[Figure: a small graph, its sparse adjacency matrix, and the CSR encoding of that matrix as a row-pointer array and a column-index array]
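As an illustration of the CSR layout, the sketch below builds the row-pointer and column-index arrays from an edge list with a counting pass and a prefix sum. The tiny graph used here is made up for the example and is not the one drawn on the slide; `build_csr` and `neighbours` are hypothetical names.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) from a list of (row, col) edges."""
    row_ptr = [0] * (num_vertices + 1)
    for u, _ in edges:
        row_ptr[u + 1] += 1                 # count entries per row
    for i in range(num_vertices):
        row_ptr[i + 1] += row_ptr[i]        # prefix sum gives row offsets
    col_idx = [0] * len(edges)
    next_slot = row_ptr[:-1]                # slicing copies: next free slot per row
    for u, v in edges:
        col_idx[next_slot[u]] = v
        next_slot[u] += 1
    return row_ptr, col_idx


def neighbours(row_ptr, col_idx, u):
    """Column indices stored for row u."""
    return col_idx[row_ptr[u]:row_ptr[u + 1]]


edges = [(0, 1), (0, 3), (1, 0), (2, 1), (2, 3), (3, 0)]
row_ptr, col_idx = build_csr(4, edges)
print(row_ptr)   # [0, 2, 3, 5, 6]
print(col_idx)   # [1, 3, 0, 1, 3, 0]
```

Storage is one entry per vertex plus one per edge, whereas a bitmap needs one bit per vertex pair: for 2^20 vertices that is 2^40 bits, i.e. 128 GB, which is the figure quoted above.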


GRAPH500

Parallel algorithm

- Split the adjacency matrix into blocks

A0,0 A0,1 A0,2 A0,3
A1,0 A1,1 A1,2 A1,3
A2,0 A2,1 A2,2 A2,3
A3,0 A3,1 A3,2 A3,3

[Diagram: each block Ai,j generates its output queue, the input queue is shared, and the cycle repeats until the end]

- l × l machines with l = 2^k (k ∈ N)
- Machine Mi,j ⇔ block Ai,j
  → Vertices "In": Ri
  → Vertices "Out": Rj
- Predecessors, 1D distribution: Mi,j gets 1/4 of the vertices in Ri
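The block-to-machine mapping can be sketched as index arithmetic. The code below assumes that l × l divides the number of vertices, that the edge source selects the block row i and the destination the block column j, and that the predecessor array of row range Ri is split evenly over the l machines of that row (1/4 each when l = 4). These conventions and all names are assumptions for illustration.

```python
def block_owner(u, v, num_vertices, l):
    """Return (i, j) such that machine M[i][j] holds adjacency entry (u, v)."""
    block = num_vertices // l            # vertices per row/column range
    return u // block, v // block


def partition_edges(edges, num_vertices, l):
    """Distribute an edge list over the l x l blocks A[i][j]."""
    blocks = {(i, j): [] for i in range(l) for j in range(l)}
    for u, v in edges:
        blocks[block_owner(u, v, num_vertices, l)].append((u, v))
    return blocks


def predecessor_owner(v, num_vertices, l):
    """1D distribution of predecessors: vertex v of range R_i goes to one
    of the l machines of row i, each holding 1/l of that range."""
    block = num_vertices // l
    i = v // block
    j = (v % block) // (block // l)
    return i, j
```

For example, with 16 vertices and l = 4, entry (5, 14) belongs to block A1,3, and the predecessor of vertex 5 is stored on M1,1.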


GRAPH500

Exploration

CSR

- Top-down
- in_queue → out_queue

CSC

- Bottom-up
- out_queue & visited ← in_queue

[Figure, two panels showing the current frontier and the not yet visited vertices: top-down searches for children, bottom-up searches for parents]

Iteration   Top-down     Bottom-up    Hybrid version
0           27           22 090 111   27
1           8 156        1 568 798    8 156
2           3 695 684    587 893      587 893
3           19 565 465   12 586       12 586
4           214 578      8 256        8 256
5           5 865        1 201        1 201
6           12           156          12
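The two exploration directions and a hybrid that picks one per iteration can be sketched as follows. The switching rule used here (go bottom-up when the edges leaving the frontier, scaled by `alpha`, outnumber the edges of unvisited vertices) is a simple heuristic assumed for the example, and `alpha=14` is a hypothetical default; the slide does not state how the hybrid version chooses. The adjacency lists are assumed symmetric (undirected graph).

```python
def top_down_step(adj, frontier, parent):
    """Search for children: scan the edges leaving the current frontier."""
    next_frontier = []
    for u in frontier:
        for v in adj[u]:
            if parent[v] == -1:
                parent[v] = u
                next_frontier.append(v)
    return next_frontier


def bottom_up_step(adj, frontier, parent):
    """Search for parents: each unvisited vertex looks for a frontier neighbour."""
    in_frontier = set(frontier)
    next_frontier = []
    for v in range(len(adj)):
        if parent[v] != -1:
            continue
        for u in adj[v]:
            if u in in_frontier:
                parent[v] = u
                next_frontier.append(v)
                break                    # one parent is enough, stop scanning
    return next_frontier


def hybrid_bfs(adj, root, alpha=14):
    """Level-synchronous BFS choosing a direction at every iteration."""
    parent = [-1] * len(adj)
    parent[root] = root
    frontier = [root]
    directions = []
    while frontier:
        frontier_edges = sum(len(adj[u]) for u in frontier)
        unvisited_edges = sum(len(adj[v]) for v in range(len(adj))
                              if parent[v] == -1)
        if frontier_edges * alpha > unvisited_edges:
            directions.append("bottom-up")
            frontier = bottom_up_step(adj, frontier, parent)
        else:
            directions.append("top-down")
            frontier = top_down_step(adj, frontier, parent)
    return parent, directions
```

Setting `alpha=0` forces top-down at every level, which is a convenient way to check that both directions reach the same vertices.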


Results and prospects

Performance analysis

CPU/GPU Comparison

- one CPU or GPU
- Different graph scales

[Figure: "CPU/GPU Comparison", GTEPS (0 to 3) against SCALE (16 to 21) for GTX 970, GTX 780 Ti, K20X, CPU E5-2650v2, TX1 and CPU GRAPH500]



Scalability

[Figure: weak scaling, GTEPS (0 to 14) against #GPU (1, 4, 16, 64) for CPU and GPU; points labelled 21, 23, 25, 27]

[Figure: strong scaling (SCALE=21), GTEPS (0 to 6) against #GPU (1, 4, 16, 64) for CPU and GPU]

- ROMEO → 105th (Nov. 2016)



P100 GPU

- Pascal Architecture
- Several improvements
- Base component of DGX-1

Communication, NVLink

- DGX-1, Power 8
- 40 GB/s bidirectional

⇒ Advantage for graphs


Kepler vs Pascal

Product            K20X          Tesla P100
Arch               Kepler        Pascal
GPU                GK110         GP100
SMs                14            56            | More concurrent blocks
FP32/SM            192           64
FP32/GPU           2688          3584          | More concurrent threads
FP64/SM            64            32
FP64/GPU           896           1792
Base Clock         732 MHz       1328 MHz
FP32 GFLOPs        3950          10600
FP64 GFLOPs        1310          5300
Memory Interface   384b GDDR5    4096b HBM2
Memory Size        6 GB          16 GB         | 2.6× more mem.
L2 Cache Size      1536 KB       4096 KB
Register/SM        256 KB        256 KB        | same register size
Register/GPU       3584 KB       14336 KB


Kepler vs Pascal

ROMEO Supercomputer

- 105th of GRAPH500 (November 2016 list)

Benchmark for architectures and accelerators

- Machine MESCA: 12TB of RAM + 8 sockets (256 threads)
- FPGA: Intel partnership
- Xeon Phi: Knights Landing ?
- IBM OpenPower: new communications device (NVLINK)

Applications

- Social networks
- Management of electrical networks
- Big data and deep learning

