INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why...

68
Lecture 1: Analyzing and Parallelizing with OpenACC, October 26, 2016 INTRODUCTION TO OPENACC

Transcript of INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why...

Page 1: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

Lecture 1: Analyzing and Parallelizing with OpenACC, October 26, 2016

INTRODUCTION TO OPENACC

Page 2: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

2

Course Objective:

Enable you to to accelerate your applications with OpenACC.

Page 3: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

3

Course Syllabus

Oct 26: Analyzing and Parallelizing with OpenACC

Nov 2: OpenACC Optimizations

Nov 9: Advanced OpenACC

Recordings:https://developer.nvidia.com/intro-to-openacc-course-2016

Page 4: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

Lecture 1: Jeff Larkin, NVIDIA

ANALYZING AND PARALLELIZING WITH OPENACC

Page 5: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

5

Today’s Objectives

Understand what OpenACC is and why to use it

Understand some of the differences between CPU and GPU hardware.

Know how to obtain an application profile using PGProf

Know how to add OpenACC directives to existing loops and build with OpenACC using PGI

Analyze

ParallelizeOptimize

Page 6: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

6

Why OpenACC?

Page 7: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

7

OpenACCSimple | Powerful | Portable

Fueling the Next Wave of

Scientific Discoveries in HPC

University of IllinoisPowerGrid- MRI Reconstruction

70x Speed-Up

2 Days of Effort

http://www.cr ay.com/sites/default/files/r esources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf

http://www.hpcw ire.com/off-the-w ir e/first-round-of-2015-hackathons-gets-under way

http://on-demand.gputechconf.com/gtc/2015/pr esentation/S5297-Hisashi-Yashir o.pdf

http://www.openacc.or g/content/ex periences-por ting-molecular -dynamics-code-gpus-cr ay-x k7

RIKEN JapanNICAM- Climate Modeling

7-8x Speed-Up

5% of Code Modified

main() {

<serial code>#pragma acc kernels//automatically runs on GPU

{ <parallel code>

}}

Page 8: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

8

0.0x

2.0x

4.0x

6.0x

8.0x

10.0x

12.0x

Alanine-113 Atoms

Alanine-223 Atoms

Alanine-333 AtomsJanus Juul Eriksen, PhD Fellow

qLEAP Center for Theoretical Chemistry, Aarhus University

OpenACC makes GPU computing approachable for

domain scientists. Initial OpenACC implementation required only minor effort, and more importantly,no modifications of our existing CPU implementation.

LS-DALTON

Large-scale application for calculating high-accuracy molecular energies

Lines of Code

Modified

# of Weeks

Required

# of Codes to

Maintain

<100 Lines 1 Week 1 Source

Big Performance

Minimal Effort

LS-DALTON CCSD(T) ModuleBenchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)

Page 9: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

9

OpenACC Directives

Manage

Data

Movement

Initiate

Parallel

Execution

Optimize

Loop

Mappings

#pragma acc data copyin(x,y) copyout(z){

...#pragma acc parallel {#pragma acc loop gang vector

for (i = 0; i < n; ++i) {z[i] = x[i] + y[i];...

}}...

}

Performance portable

Interoperable

Single source

Incremental

Page 10: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

10

Wayne Gaudin and Oliver PerksAtomic Weapons Establishment, UK

We were extremely impressed that we can run

OpenACC on a CPU with no code change and get

equivalent performance to our OpenMP/MPI

implementation.

OpenACC Performance Portability: CloverLeaf

Hydrodynamics Application OpenACC Performance Portability

Sp

eed

up

vs

1 C

PU

Core

Benchmarked Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, Accelerator: Tesla K80

Page 11: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

11

CloverLeaf on Dual Haswell vs Tesla K80

7x 8x

14x

0

5

10

15

20

Haswell: Intel OpenMP Haswell: PGI OpenACC Tesla K80: PGI OpenACC

Speedup v

s Sin

gle

Hasw

ell C

ore

CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores, 2.30 GHz, HT disabled

GPU: NVIDIA Tesla K80 (single GPU)

OS: CentOS 6.6, Compiler: PGI 16.5

Page 12: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

12

CloverLeaf on Tesla P100 Pascal

7x 8x

14x

40x

0

5

10

15

20

25

30

35

40

45

Haswell: Intel OpenMP Haswell: PGI OpenACC Tesla K80: PGI OpenACC Tesla P100: PGI OpenACC

Speedup v

s Sin

gle

Hasw

ell C

ore

CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores, 2.30 GHz, HT disabled

GPU: NVIDIA Tesla K80 (single GPU) , NVIDIA Tesla P100 (Single GPU)

OS: CentOS 6.6, Compiler: PGI 16.5

Migrating from multicore CPU to K80 to

P100 requires only changing a compiler

flag.

Page 13: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

13

3 Steps to Accelerate with OpenACC

Analyze

ParallelizeOptimize

Page 14: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

14

Case Study: Conjugate Gradient

A sample code implementing the conjugate gradient method has been provided in C/C++ and Fortran.

• To save space, only the C will be shown in slides.

You do not need to understand the algorithm to proceed, but should be able to understand C, C++, or Fortran.

For more information on the CG method, see https://en.wikipedia.org/wiki/Conjugate_gradient_method

10/26/2016

Page 15: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

15

Analyze

Page 16: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

16

Analyze

Obtain a performance profile

Read compiler feedback

Understand the code.

Page 17: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

17

Obtain a Profile

A application profile helps to understand where time is spent

What routines are hotspots?

Focusing on the hotspots delivers the greatest performance impact

A variety of profiling tools are available: gprof, nvprof, CrayPAT, TAU, Vampir

We’ll use PGProf, which comes with the PGI compiler

$ pgprof &

Page 18: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

18

PGPROF Profiler

Page 19: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

19

PGPROF Profiler

Page 20: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

20

PGPROF Profiler

Page 21: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

21

PGPROF Profiler

Page 22: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

22

PGPROF Profiler

Page 23: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

23

PGPROF Profiler

Page 24: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

24

PGPROF Profiler

Page 25: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

25

PGPROF Profiler

Double Click

Page 26: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

26

PGPROF Profiler

Page 27: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

27

Compiler Feedback

Before we can make changes to the code, we need to understand how the compiler is optimizing

With PGI, this can be done with the –Minfo and –Mneginfo flags

$ pgc++ -Minfo=all,ccff -Mneginfo

matvec(const matrix &, const vector &, constvector &):

23, include "matrix_functions.h"Generated 2 alternate versions of the loopGenerated vector sse code for the loopGenerated 2 prefetch instructions for the

loop

Page 28: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

28

Compiler Feedback in PGProf

Page 29: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

29

Parallelize

Page 30: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

30

Parallelize

Insert OpenACC directives around important loops

Enable OpenACC in the compiler

Run on a parallel platform

Page 31: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

31

CPUOptimized for Serial Tasks

GPU AcceleratorOptimized for Parallel Tasks

Accelerated Computing10x Performance & 5x Energy Efficiency for HPC

Page 32: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

32

CPUOptimized for Serial Tasks

GPU AcceleratorOptimized for Parallel Tasks

Accelerated Computing10x Performance & 5x Energy Efficiency for HPC

CPU Strengths

• Very large main memory

• Very fast clock speeds• Latency optimized via large caches

• Small number of threads can run

very quickly

CPU Weaknesses

• Relatively low memory bandwidth

• Cache misses very costly

• Low performance/watt

Page 33: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

33

CPUOptimized for Serial Tasks

GPU AcceleratorOptimized for Parallel Tasks

Accelerated Computing10x Performance & 5x Energy Efficiency for HPC

GPU Strengths

• High bandwidth main memory

• Significantly more compute resources

• Latency tolerant via parallelism

• High throughput

• High performance/watt

GPU Weaknesses

• Relatively low memory capacity

• Low per-thread performance

Page 34: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

34

Speed v. Throughput

Speed Throughput

*Images from Wikimedia Commons via Creative Commons

Which is better depends on your needs…

Page 35: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

35

Accelerator Nodes

PCIe

RAM RAMCPU and GPU communicate via PCIe

• Data must be copied between these memories over PCIe

• PCIe Bandwidth is much lower than either memories

Obtaining high performance on GPU nodes often requires reducing PCIecopies to a minimum

Page 36: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

36

CUDA Unified MemorySimplified Developer Effort

Without Unified Memory With Unified Memory

Unified MemorySystem Memory

GPU Memory

Sometimes referred to as

“managed memory.”

New “Pascal” GPUs handle Unified Memory in hardware.

Page 37: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

37

OpenACC Parallel Directive

#pragma acc parallel

{

}

Generates parallelism

When encountering the parallel directive,

the compiler will generate 1 or more

parallel gangs, which execute redundantly.

Page 38: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

38

OpenACC Parallel Directive

#pragma acc parallel

{

}

Generates parallelism

When encountering the parallel directive,

the compiler will generate 1 or more

parallel gangs, which execute redundantly.

Page 39: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

39

OpenACC Loop Directive

#pragma acc parallel

{

#pragma acc loop

for (i=0;i<N;i++)

{

}

}

Identifies loops to run in parallel

The loop directive informs the compiler

which loops to parallelize.

Page 40: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

40

OpenACC Loop Directive

#pragma acc parallel

{

#pragma acc loop

for (i=0;i<N;i++)

{

}

}

Identifies loops to run in parallel

The loop directive informs the compiler

which loops to parallelize.

Page 41: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

41

OpenACC Parallel Loop Directive

#pragma acc parallel loop

for (i=0;i<N;i++)

{

}

Generates parallelism and identifies loop in one directive

The parallel and loopdirectives are

frequently combined into one.

Page 42: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

42

Case Study: Parallelize

Normally we would start with the most time-consuming routine to deliver the greatest performance impact.

In order to ease you in to writing parallel code, I will instead start with the simplest routine.

Page 43: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

43

Parallelize Waxpby

void waxpby(...) {

#pragma acc parallel loopfor(int i=0;i<n;i++) {wcoefs[i] =

alpha*xcoefs[i] + beta*ycoefs[i];

}}

Adding a parallel loop around the waxpby loop informs the compiler to

Generate parallel gangs on which to execute

Parallelize the loop iterations across the parallel gangs

Page 44: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

44

Build With OpenACC

The PGI –ta flag enables OpenACC and chooses a target accelerator.

We’ll add the following to our compiler flags:

-ta=tesla:managed

Compiler feedback now:

waxpby(double, const vector &, double, const vector &, const vector &):

6, include "vector_functions.h"22, Generating implicit

copyout(wcoefs[:n])Generating implicit

copyin(xcoefs[:n],ycoefs[:n])Accelerator kernel generatedGenerating Tesla code25, #pragma acc loop gang,

vector(128) /* blockIdx.x threadIdx.x */

Page 45: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

45

PGPROF with Parallel waxpby

A significant portion

of the time is now

spent migrating data

between the host and device.

Page 46: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

46

PGPROF with Parallel waxpby

In order to improve

performance, we

need to parallelize

the remaining functions.

Page 47: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

47

Parallelize Dot

double dot(...) {

#pragma acc parallel loopreduction(+:sum)for(int i=0;i<n;i++) {sum +=

xcoefs[i]*ycoefs[i];}return sum;

}

Because each iteration of the loop adds to the variable sum, we must declare a reduction.

A parallel reduction may return a slightly different result than a sequential addition due to floating point limitations

Page 48: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

48

Parallelize Matvecvoid matvec(...) {#pragma acc parallel loopfor(int i=0;i<num_rows;i++) {double sum=0;int row_start=row_offsets[i];int row_end=row_offsets[i+1];

#pragma acc loop reduction(+:sum)for(int

j=row_start;j<row_end;j++) {unsigned int Acol=cols[j];double Acoef=Acoefs[j];double xcoef=xcoefs[Acol];sum+=Acoef*xcoef;

}ycoefs[i]=sum;

}}

The outer parallel loop generates parallelism and parallelizes the “i” loop.

The inner loop declares the iterations of “j” independent and the reduction on “sum”

Page 49: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

49

Final PGPROF Profile for Lecture 1

Now data migration

has been eliminated

during the

computation.

Page 50: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

50

OpenACC Profiling in PGPROF

PGPROF will show you

where in your code to

find an OpenACC

region.

We’ll optimize this

loop next week!

Page 51: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

51

OpenACC Performance So Far…

0.00X

5.00X

10.00X

Serial Multicore K80 (single) P100

Serial Multicore K80 (single) P100

Speed-u

p f

rom

seri

al

Source: PGI 16.9, Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz

Page 52: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

52

Where we’re going next week…

0.00X

5.00X

10.00X

15.00X

20.00X

25.00X

Serial Multicore K80 (single) P100

Serial Multicore K80 (single) P100

Speed-u

p f

rom

seri

al

Source: PGI 16.9, Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz

Page 53: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

53

Optimize (Next Week)

Page 54: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

54

Optimize (Next Week)

Get new performance data from parallel execution

Remove unnecessary data transfer to/from GPU

Guide the compiler to better loop decomposition

Refactor the code to make it more parallel

Page 55: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

55

Using QwikLabs

Page 56: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

56

Getting access

1. Create an account with NVIDIA qwikLABShttps://developer.nvidia.com/qwiklabs-signup

2. Enter a promo code OPENACC before submitting the form

3. Free credits will be added to your account

4. Start using OpenACC!

Page 57: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

57

Download link: https://developer.nvidia.com/openacc-toolkit

CERTIFICATION

1. Attend live lectures

2. Complete the test

3. Enter for a chance to win a Titan X or an OpenACC Book

Available after November 9th

Official rules: http://developer.download.nvidia.com/compute/OpenACC-

Toolkit/docs/TITANX-GIVEAWAY-OPENACC-Official-Rules-2016.pdf

OPENACC TOOLKITFree for Academia

NEW OPENACC BOOKParallel Programming with OpenACC

Available starting Nov 1st, 2016:

http://store.elsevier.com/Parallel-Programming-with-OpenACC/Rob-Farber/isbn-9780124103979/

Page 58: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

58

Where to find help

• OpenACC Course Recordings - https://developer.nvidia.com/openacc-courses

• PGI Website - http://www.pgroup.com/resources

• OpenACC on StackOverflow - http://stackoverflow.com/questions/tagged/openacc

• OpenACC Toolkit - http://developer.nvidia.com/openacc-toolkit

• Parallel Forall Blog - http://devblogs.nvidia.com/parallelforall/

• GPU Technology Conference - http://www.gputechconf.com/

• OpenACC Website - http://openacc.org/

Questions? Email [email protected]

Page 59: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

59

Course Syllabus

Oct 26: Analyzing and Parallelizing with OpenACC

Nov 2: OpenACC Optimizations

Nov 9: Advanced OpenACC

Recordings:https://developer.nvidia.com/intro-to-openacc-course-2016

Questions? Email [email protected]

Page 60: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

60

Additional Material

Page 61: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

61

OpenACC kernels Directive

#pragma acc kernels

{

for(int i=0; i<N; i++)

{

x[i] = 1.0;

y[i] = 2.0;

}

for(int i=0; i<N; i++)

{

y[i] = a*x[i] + y[i];

}

}

Identifies a region of code where I think the compiler can turn loopsinto kernels

61

kernel 1

kernel 2

The compiler identifies

2 parallel loops and

generates 2 kernels.

Page 62: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

62

Loops vs. Kernels

for (int i = 0; i < 16384; i++){

C[i] = A[i] + B[i];}

function loopBody(A, B, C, i){

C[i] = A[i] + B[i];}

Page 63: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

63

Loops vs. Kernels

for (int i = 0; i < 16384; i++){

C[i] = A[i] + B[i];}

function loopBody(A, B, C, i){

C[i] = A[i] + B[i];}

Calculate 0 -16383 in order.

Page 64: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

64

Loops vs. Kernels

for (int i = 0; i < 16384; i++){

C[i] = A[i] + B[i];}

function loopBody(A, B, C, i){

C[i] = A[i] + B[i];}

Calculate 0 -16383 in order.

Calculate 0

Page 65: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

65

Loops vs. Kernels

for (int i = 0; i < 16384; i++){

C[i] = A[i] + B[i];}

function loopBody(A, B, C, i){

C[i] = A[i] + B[i];}

Calculate 0 -16383 in order.

Calculate 0

Calculate 0Calculate 1

Calculate 2Calculate 3

Calculate 16383

Calculate 0Calculate 1

Calculate 2Calculate 3Calculate …

Calculate 0Calculate 1

Calculate 2Calculate 3Calculate 14

Calculate 0Calculate 1

Calculate 2Calculate 3

Calculate 9

Calculate 1Calculate 2

Calculate 3Calculate 4

Page 66: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

66

Parallelize Matvec with kernelsvoid matvec(...) {double *restrict ycoefs=y.coefs;#pragma acc kernelsfor(int i=0;i<num_rows;i++) {double sum=0;int row_start=row_offsets[i];int row_end=row_offsets[i+1];

#pragma acc loop reduction(+:sum)for(int

j=row_start;j<row_end;j++) {unsigned int Acol=cols[j];double Acoef=Acoefs[j];double xcoef=xcoefs[Acol];sum+=Acoef*xcoef;

}ycoefs[i]=sum;

}}

With the kernels directive, the compiler will detect a (false) data dependency on ycoefs.

It’s necessary to either mark the loop as independent or add the restrict keyword to get parallelization.

Page 67: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

67

OpenACC parallel loop vs. kernels

PARALLEL LOOP

Programmer’s responsibility to ensure safe parallelism

Will parallelize what a compiler may miss

Straightforward path from OpenMP

KERNELS

Compiler’s responsibility to analyze the code and parallelize what is safe.

Can cover larger area of code with single directive

Gives compiler additional leeway to optimize.

Compiler sometimes gets it wrong.

Both approaches are equally valid and can perform equally well.

Page 68: INTRODUCTION TO OPENACC · 2016-10-26 · 5 Today’s Objectives Understand what OpenACC is and why to use it Understand some of the differences between CPU and GPU hardware. Know

68

OpenACC Performance So Far… (kernels)

0.00X

5.00X

10.00X

15.00X

Serial Multicore K80 (single) P100

Serial Multicore K80 (single) P100

Speed-u

p f

rom

seri

al

Source: PGI 16.9, Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz