PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe
OpenACC on AMD GPUs and APUs with the PGI Accelerator Compilers
Michael Wolfe [email protected]
http://www.pgroup.com
APU13
San Jose, November, 2013
C, C++, Fortran compilers
Optimizing
Vectorizing
Parallelizing
Graphical parallel tools
PGDBG debugger
PGPROF profiler
AMD, Intel, NVIDIA processors
PGI Unified Binary™ technology
Linux, MacOS, Windows
Visual Studio & Eclipse integration
PGI Accelerator support
OpenACC
CUDA Fortran
www.pgroup.com
SMP Parallel Programming
for( i = 0; i < n; ++i )
a[i] = sinf(b[i]) + cosf(c[i]);
SMP Parallel Programming
#pragma omp parallel for private(i)
for( i = 0; i < n; ++i )
a[i] = sinf(b[i]) + cosf(c[i]);
% pgcc -mp x.c …
AMD Radeon Block Diagram*
*From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.
Multiple Compute Units
Vector Unit
Pipelining / Multithreading
Device Memory
Cache Hierarchy
SW-managed cache (LDS)
Heterogeneous Parallel Programming
for( i = 0; i < n; ++i )
a[i] = sinf(b[i]) + cosf(c[i]);
Heterogeneous Parallel Programming
#pragma acc parallel loop private(i) \
pcopyin(b[0:n], c[0:n]) \
pcopyout(a[0:n])
for( i = 0; i < n; ++i )
a[i] = sinf(b[i]) + cosf(c[i]);
% pgcc -acc -ta=radeon x.c
Overview
Parallel programming
GPU Architectural highlights
OpenACC 5 minute summary
PGI Implementation
Performance
Abstract CPU+Accelerator Target
Accelerator Architecture Features
Potentially separate memory (relatively small)
High bandwidth memory interface
Many degrees of parallelism
MIMD parallelism across many cores
SIMD parallelism within a core
Multithreading for latency tolerance
Asynchronous with host
Performance from Parallelism
slower clock, less ILP, simpler control unit, smaller caches
OpenACC Open Programming Standard for Parallel Computing
“PGI OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.”
--Buddy Bland, Titan Project Director, Oak Ridge National Lab
“OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.”
--Michael Wong, CEO OpenMP Directives Board
OpenACC Overview
Directive-based
Parallel Computation
Data Management
#pragma acc data copyin( a[0:n] ) \
copy( b[0:n] ) create( tmp[0:n] )
{
for( int i = 0; i < iters; ++i ){
relax( a, b, tmp, n );
relax( b, a, tmp, n );
}
}
relax(float *x,float *y,float *t,int n){
#pragma acc data \
present( x[0:n], y[0:n], t[0:n] )
{
#pragma acc parallel loop
for( int j = 0; j < n; ++j )
t[j] = x[j];
#pragma acc parallel loop
for( int j = 1; j < n-1; ++j )
x[j] = 0.25f*(t[j-1]+t[j+1] +
y[n-j+1] + y[n-j-1]);
}
}
OpenACC compared to OpenMP
OpenACC:
Data parallelism
Parallel per region
Flexible parallelism mapping
Structured parallelism
Performance portability
OpenMP:
Thread parallelism
Fixed number of threads
Fixed parallelism-to-thread mapping
Tasks and loops
?
PGI OpenACC Implementation
C, C++, Fortran
pgcc, pgc++, pgfortran
Command line options
-acc
-ta=radeon
-ta=radeon,host
-ta=radeon,nvidia
Planner
maps program parallelism to
hardware parallelism
Code Generator
OpenCL API
Runtime
initialization
data management
kernel launches
Planner
Maps parallel loops
OpenACC abstractions
gang, worker, vector
OpenCL abstractions
work group, work item
Hardware abstractions
wavefront
#pragma acc parallel loop gang
for( int j = 0; j < n; ++j )
t[j] = x[j];
#pragma acc parallel loop gang vector
for( int j = 0; j < n; ++j )
t[j] = x[j];
#pragma acc kernels loop independent
for( int j = 0; j < n; ++j )
t[j] = x[j];
Code Generator
Low-level OpenCL
“assembly code in C”
SPIR interface to AMD
Radeon LLVM back-end
Uses non-standard
features
device addresses
Runtime
Dynamically loads
OpenCL library
Supports multiple devices
Multiple command
queues
Host as a device (*)
Memory management
device addresses
bigbuffer(s) suballocation
Profiling support
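The multiple-device support above is visible to the programmer through the standard OpenACC runtime API in `openacc.h`. A hedged sketch, guarded so it also compiles as a host-only build (the helper name is illustrative):

```c
#ifdef _OPENACC
#include <openacc.h>   /* provided by an OpenACC compiler, e.g. pgcc -acc */
#endif

/* Ask the runtime how many non-host accelerator devices it loaded.
   In a host-only build there is no accelerator runtime, so report zero. */
int device_count(void) {
#ifdef _OPENACC
    return acc_get_num_devices(acc_device_not_host);
#else
    return 0;
#endif
}
```

With `-ta=radeon,nvidia` the same binary can then select among devices at run time, which is what the multiple command queues support.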
Performance
AMD Piledriver 5800K
4.0GHz
2MB cache
8 cores
Single thread/core
OpenMP parallel
PGI 13.10 -fast -mp
AMD Radeon 7970
Tahiti
925 MHz
3GB memory
32 compute units
OpenACC parallel
PGI 13.10 -fast -acc -ta=radeon:tahiti
Cloverleaf Mantevo Miniapp
Lagrangian-Eulerian hydrodynamics
compressible Euler equation solver in 2D
9500 lines of Fortran+C with OpenMP, OpenACC
Accelerating Hydrocodes with OpenACC, OpenCL and CUDA,
Herdman et al, 2012 SC Companion
DOI: 10.1109/SC.Companion.2012.66
Performance Results
[Bar chart: performance of the Serial, OpenMP, R7970, and S10000 versions on CloverLeaf problem sizes 960^2x87, 1920^2x87, 3840^2x87, 960^2x2955, and 1920^2x2955; y-axis 0 to 40]
OpenACC on AMD GPUs and APUs
OpenACC is designed for performance portability
PGI Accelerator compilers provide evidence
Target-specific tuning still underway
Open Beta compilers available now
Product version in January 2014
Copyright Notice
© Contents copyright 2013, NVIDIA Corp. This material may not be
reproduced in any manner without the expressed written
permission of NVIDIA Corp.