PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe

PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe at the AMD Developer Summit (APU13) November 11-13, 2013.

Page 1

OpenACC on AMD GPUs and APUs with the PGI Accelerator Compilers

Michael Wolfe

http://www.pgroup.com

APU13

San Jose, November, 2013

Page 2

C, C++, Fortran compilers

Optimizing

Vectorizing

Parallelizing

Graphical parallel tools

PGDBG debugger

PGPROF profiler

AMD, Intel, NVIDIA processors

PGI Unified Binary™ technology

Linux, MacOS, Windows

Visual Studio & Eclipse integration

PGI Accelerator support

OpenACC

CUDA Fortran

www.pgroup.com

Page 3

SMP Parallel Programming

for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);

Page 4

SMP Parallel Programming

#pragma omp parallel for private(i)
for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);

% pgcc -mp x.c …

Page 5

AMD Radeon Block Diagram*

*From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.

Multiple Compute Units

Vector Unit

Pipelining / Multithreading

Device Memory

Cache Hierarchy

SW-managed cache (LDS)

Page 6

Heterogeneous Parallel Programming

for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);

Page 7

Heterogeneous Parallel Programming

#pragma acc parallel loop private(i) \
    pcopyin(b[0:n], c[0:n]) \
    pcopyout(a[0:n])
for( i = 0; i < n; ++i )
    a[i] = sinf(b[i]) + cosf(c[i]);

% pgcc -acc -ta=radeon x.c
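The `pcopyin` and `pcopyout` clauses on this slide are the OpenACC short forms of `present_or_copyin` and `present_or_copyout`: data moves only if it is not already on the device. The following is a minimal sketch of why that matters, not code from the talk. The function names `combine` and `combine_twice` are hypothetical, and the loop body uses plain arithmetic instead of `sinf`/`cosf` so it builds without the math library. Compiled without `-acc`, the directives are ignored and the code runs on the host.

```c
/* Hypothetical kernel: a[i] = 2*b[i] + c[i].
 * Called on its own, the present_or clauses copy b and c in and a out.
 * Called inside an enclosing data region, no extra copies are made. */
void combine(float *a, const float *b, const float *c, int n)
{
    #pragma acc parallel loop pcopyin(b[0:n], c[0:n]) pcopyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * b[i] + c[i];
}

/* Hypothetical caller: the data region keeps all arrays resident on the
 * device across both calls, so tmp never travels back to the host. */
void combine_twice(float *a, float *tmp, const float *b, const float *c, int n)
{
    #pragma acc data copyin(b[0:n], c[0:n]) create(tmp[0:n]) copyout(a[0:n])
    {
        combine(tmp, b, c, n);  /* tmp = 2*b + c   */
        combine(a, tmp, c, n);  /* a   = 2*tmp + c */
    }
}
```

The same `combine` routine therefore works both as a standalone accelerated loop and as a building block inside a larger data region.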

Page 8

Overview

Parallel programming

GPU Architectural highlights

OpenACC 5 minute summary

PGI Implementation

Performance

Page 9

Abstract CPU+Accelerator Target

Page 10

Accelerator Architecture Features

Potentially separate memory (relatively small)

High bandwidth memory interface

Many degrees of parallelism

MIMD parallelism across many cores

SIMD parallelism within a core

Multithreading for latency tolerance

Asynchronous with host

Performance from Parallelism

slower clock, less ILP, simpler control unit, smaller caches
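In OpenACC, the "asynchronous with host" feature listed on this slide is exposed through the `async` clause and the `wait` directive. The sketch below is an illustration under assumptions rather than code from the talk: the function name `fill_both` and the queue numbers 1 and 2 are hypothetical, and how async queues map onto device queues is up to the implementation.

```c
/* Hypothetical example: launch two independent loops on separate
 * async queues, then block until both have finished. */
void fill_both(float *x, float *y, int n)
{
    #pragma acc parallel loop async(1) copyout(x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] = 1.0f * i;

    #pragma acc parallel loop async(2) copyout(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * i;

    /* The host is free to do unrelated work here. */

    /* Wait for all outstanding async work before x and y are read. */
    #pragma acc wait
}
```

Without `-acc` the directives are ignored and both loops simply run in order on the host.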

Page 11

OpenACC Open Programming Standard for Parallel Computing

“PGI OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.”

--Buddy Bland, Titan Project Director, Oak Ridge National Lab

“OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.”

--Michael Wong, CEO OpenMP Directives Board

Page 12

OpenACC Overview

Directive-based

Parallel Computation

Data Management

#pragma acc data copyin( a[0:n] ) \
    copy( b[0:n] ) create( tmp[0:n] )
{
    for( int i = 0; i < iters; ++i ){
        relax( a, b, tmp, n );
        relax( b, a, tmp, n );
    }
}

void relax(float *x,float *y,float *t,int n){
    #pragma acc data \
        present( x[0:n], y[0:n], t[0:n] )
    {
        #pragma acc parallel loop
        for( int j = 0; j < n; ++j )
            t[j] = x[j];
        #pragma acc parallel loop
        for( int j = 1; j < n-1; ++j )
            x[j] = 0.25f*(t[j-1]+t[j+1] +
                          y[n-j+1] + y[n-j-1]);
    }
}

Page 13

OpenACC compared to OpenMP

OpenACC

Data parallelism

Parallel per region

Flexible || mapping

Structured parallelism

Performance portability

OpenMP

Thread parallelism

Fixed number of threads

Fixed || thread mapping

Tasks and loops

?
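One way to see the two models side by side is to annotate a single loop for both and select the directive at compile time. This is a sketch under assumptions, not something shown in the talk: the function name `scale2` is hypothetical. The `_OPENACC` and `_OPENMP` macros are the ones each standard requires a conforming compiler to define when the corresponding mode is enabled.

```c
/* Hypothetical example: the same loop built as OpenACC, as OpenMP,
 * or as plain serial C, depending on the compiler flags used. */
void scale2(float *a, const float *b, int n)
{
#if defined(_OPENACC)
    /* Data parallelism: the compiler chooses the mapping for the target. */
    #pragma acc parallel loop pcopyin(b[0:n]) pcopyout(a[0:n])
#elif defined(_OPENMP)
    /* Thread parallelism: iterations are divided among host threads. */
    #pragma omp parallel for
#endif
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * b[i];
}
```

With the PGI compilers named in this deck, `-acc` would select the first branch and `-mp` the second.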

Page 14

PGI OpenACC Implementation

C, C++, Fortran

pgcc, pgc++, pgfortran

Command line options

-acc

-ta=radeon

-ta=radeon,host

-ta=radeon,nvidia

Planner

maps program parallelism to hardware parallelism

Code Generator

OpenCL API

Runtime

initialization

data management

kernel launches

Page 15

Planner

Maps parallel loops

OpenACC abstractions

gang, worker, vector

OpenCL abstractions

work group, work item

Hardware abstractions

wavefront

#pragma acc parallel loop gang
for( int j = 0; j < n; ++j )
    t[j] = x[j];

#pragma acc parallel loop gang vector
for( int j = 0; j < n; ++j )
    t[j] = x[j];

#pragma acc kernels loop independent
for( int j = 0; j < n; ++j )
    t[j] = x[j];
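The single-loop examples above leave the mapping mostly to the planner. With nested loops, the programmer can spell out which loop goes to which level. The sketch below is an illustration, not code from the talk: the function name `row_sums` is hypothetical, and the idea that gangs correspond to OpenCL work groups and vector lanes to work items is a common reading of the abstractions listed on this slide, not something the slide confirms.

```c
/* Hypothetical example: sum each row of an m-by-n row-major matrix.
 * The outer loop is spread across gangs; the inner loop runs in vector
 * mode with a reduction on the per-row accumulator. */
void row_sums(const float *mat, float *sums, int m, int n)
{
    #pragma acc parallel loop gang pcopyin(mat[0:m*n]) pcopyout(sums[0:m])
    for (int i = 0; i < m; ++i) {
        float s = 0.0f;  /* private to each outer iteration */
        #pragma acc loop vector reduction(+:s)
        for (int j = 0; j < n; ++j)
            s += mat[i * n + j];
        sums[i] = s;
    }
}
```

Built without `-acc`, the directives are ignored and the function is an ordinary serial loop nest.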

Page 16

Code Generator

Low-level OpenCL

“assembly code in C”

SPIR interface to AMD

Radeon LLVM back-end

Uses non-standard features

device addresses

Page 17

Runtime

Dynamically loads

OpenCL library

Supports multiple devices

Multiple command queues

Host as a device (*)

Memory management

device addresses

bigbuffer(s) suballocation

Profiling support
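The "bigbuffer(s) suballocation" item suggests that device memory is obtained in large blocks and then carved into smaller pieces. The slides give no detail, so the following is only a generic bump-pointer suballocator showing the idea, not PGI's runtime. The type `big_buffer` and the function `big_buffer_alloc` are hypothetical names, and offsets stand in for device addresses.

```c
#include <stddef.h>

/* Hypothetical suballocator state: one large buffer handed out in pieces. */
typedef struct {
    size_t capacity;  /* total bytes in the big buffer */
    size_t used;      /* bytes handed out so far, including padding */
} big_buffer;

/* Return the byte offset of a new suballocation inside the big buffer,
 * or (size_t)-1 if it does not fit. align must be a power of two. */
size_t big_buffer_alloc(big_buffer *buf, size_t bytes, size_t align)
{
    /* Round the next free offset up to the requested alignment. */
    size_t start = (buf->used + align - 1) & ~(align - 1);
    if (start > buf->capacity || bytes > buf->capacity - start)
        return (size_t)-1;
    buf->used = start + bytes;
    return start;
}
```

A scheme like this trades one expensive device allocation for many cheap offset computations. A real runtime would also need a way to free and reuse pieces, which this sketch leaves out.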

Page 18

Performance

AMD Piledriver 5800K

4.0GHz

2MB cache

8 cores

Single thread/core

OpenMP parallel

PGI 13.10 -fast -mp

AMD Radeon 7970

Tahiti

925 MHz

3GB memory

32 compute units

OpenACC parallel

PGI 13.10 -fast -acc -ta=radeon:tahiti

Page 19

Cloverleaf Mantevo Miniapp

Lagrangian-Eulerian hydrodynamics

compressible Euler equation solver in 2D

9500 lines of Fortran+C with OpenMP, OpenACC

Accelerating Hydrocodes with OpenACC, OpenCL and CUDA, Herdman et al., 2012 SC Companion, DOI: 10.1109/SC.Companion.2012.66

Page 20

Performance Results

[Chart: vertical axis from 0 to 40 (unit not preserved in the transcript); problem sizes 960^2x87, 1920^2x87, 3840^2x87, 960^2x2955, 1920^2x2955; series: Serial, OpenMP, R7970, S10000]

Page 21

OpenACC on AMD GPUs and APUs

OpenACC is designed for performance portability

PGI Accelerator compilers provide evidence

Target-specific tuning still underway

Open Beta compilers available now

Product version in January 2014

Page 22

Copyright Notice

© Contents copyright 2013, NVIDIA Corp. This material may not be reproduced in any manner without the expressed written permission of NVIDIA Corp.