PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe
OpenACC on AMD GPUs and APUs with the PGI Accelerator Compilers
Michael Wolfe [email protected]
http://www.pgroup.com
APU13
San Jose, November, 2013
C, C++, Fortran compilers
Optimizing
Vectorizing
Parallelizing
Graphical parallel tools
PGDBG debugger
PGPROF profiler
AMD, Intel, NVIDIA processors
PGI Unified Binary™ technology
Linux, MacOS, Windows
Visual Studio & Eclipse integration
PGI Accelerator support
OpenACC
CUDA Fortran
www.pgroup.com
SMP Parallel Programming
for( i = 0; i < n; ++i )
a[i] = sinf(b[i]) + cosf(c[i]);
SMP Parallel Programming
#pragma omp parallel for private(i)
for( i = 0; i < n; ++i )
a[i] = sinf(b[i]) + cosf(c[i]);
% pgcc -mp x.c …
AMD Radeon Block Diagram*
*From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.
Multiple Compute Units
Vector Unit
Pipelining / Multithreading
Device Memory
Cache Hierarchy
SW-managed cache (LDS)
Heterogeneous Parallel Programming
for( i = 0; i < n; ++i )
a[i] = sinf(b[i]) + cosf(c[i]);
Heterogeneous Parallel Programming
#pragma acc parallel loop private(i) \
pcopyin(b[0:n], c[0:n]) \
pcopyout(a[0:n])
for( i = 0; i < n; ++i )
a[i] = sinf(b[i]) + cosf(c[i]);
% pgcc -acc -ta=radeon x.c
Overview
Parallel programming
GPU Architectural highlights
OpenACC 5 minute summary
PGI Implementation
Performance
Abstract CPU+Accelerator Target
Accelerator Architecture Features
Potentially separate memory (relatively small)
High bandwidth memory interface
Many degrees of parallelism
MIMD parallelism across many cores
SIMD parallelism within a core
Multithreading for latency tolerance
Asynchronous with host
Performance from Parallelism
slower clock, less ILP, simpler control unit, smaller caches
OpenACC Open Programming Standard for Parallel Computing
“PGI OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.”
--Buddy Bland, Titan Project Director, Oak Ridge National Lab
“OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.”
--Michael Wong, CEO OpenMP Directives Board
OpenACC Overview
Directive-based
Parallel Computation
Data Management
#pragma acc data copyin( a[0:n] ) \
copy( b[0:n] ) create( tmp[0:n] )
{
for( int i = 0; i < iters; ++i ){
relax( a, b, tmp, n );
relax( b, a, tmp, n );
}
}
relax(float *x,float *y,float *t,int n){
#pragma acc data \
present( x[0:n], y[0:n], t[0:n] )
{
#pragma acc parallel loop
for( int j = 0; j < n; ++j )
t[j] = x[j];
#pragma acc parallel loop
for( int j = 1; j < n-1; ++j )
x[j] = 0.25f*(t[j-1]+t[j+1] +
y[n-j+1] + y[n-j-1]);
}
}
OpenACC compared to OpenMP
OpenACC:
Data parallelism
Parallel per region
Flexible parallelism mapping
Structured parallelism
Performance portability
OpenMP:
Thread parallelism
Fixed number of threads
Fixed parallelism-to-thread mapping
Tasks and loops
?
PGI OpenACC Implementation
C, C++, Fortran
pgcc, pgc++, pgfortran
Command line options
-acc
-ta=radeon
-ta=radeon,host
-ta=radeon,nvidia
Planner
maps program parallelism to
hardware parallelism
Code Generator
OpenCL API
Runtime
initialization
data management
kernel launches
Planner
Maps parallel loops
OpenACC abstractions
gang, worker, vector
OpenCL abstractions
work group, work item
Hardware abstractions
wavefront
#pragma acc parallel loop gang
for( int j = 0; j < n; ++j )
t[j] = x[j];
#pragma acc parallel loop gang vector
for( int j = 0; j < n; ++j )
t[j] = x[j];
#pragma acc kernels loop independent
for( int j = 0; j < n; ++j )
t[j] = x[j];
Code Generator
Low-level OpenCL
“assembly code in C”
SPIR interface to AMD
Radeon LLVM back-end
Uses non-standard
features
device addresses
Runtime
Dynamically loads
OpenCL library
Supports multiple devices
Multiple command
queues
Host as a device (*)
Memory management
device addresses
bigbuffer(s) suballocation
Profiling support
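The multiple-device support above is visible to the programmer through the standard OpenACC runtime API in `openacc.h`. A hedged sketch, guarded so it also compiles as a host-only build (the helper name is illustrative):

```c
#ifdef _OPENACC
#include <openacc.h>   /* provided by an OpenACC compiler, e.g. pgcc -acc */
#endif

/* Ask the runtime how many non-host accelerator devices it loaded.
   In a host-only build there is no accelerator runtime, so report zero. */
int device_count(void) {
#ifdef _OPENACC
    return acc_get_num_devices(acc_device_not_host);
#else
    return 0;
#endif
}
```

With `-ta=radeon,nvidia` the same binary can then select among devices at run time, which is what the multiple command queues support.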
Performance
AMD Piledriver 5800K
4.0GHz
2MB cache
8 cores
Single thread/core
OpenMP parallel
PGI 13.10 -fast -mp
AMD Radeon 7970
Tahiti
925 MHz
3GB memory
32 compute units
OpenACC parallel
PGI 13.10 -fast -acc -ta=radeon:tahiti
Cloverleaf Mantevo Miniapp
Lagrangian-Eulerian hydrodynamics
compressible Euler equation solver in 2D
9500 lines of Fortran+C with OpenMP, OpenACC
Accelerating Hydrocodes with OpenACC, OpenCL and CUDA,
Herdman et al, 2012 SC Companion
DOI: 10.1109/SC.Companion.2012.66
Performance Results
[Bar chart: performance of the Serial, OpenMP, R7970, and S10000 versions on CloverLeaf problem sizes 960^2x87, 1920^2x87, 3840^2x87, 960^2x2955, and 1920^2x2955; y-axis 0 to 40]
OpenACC on AMD GPUs and APUs
OpenACC is designed for performance portability
PGI Accelerator compilers provide evidence
Target-specific tuning still underway
Open Beta compilers available now
Product version in January 2014
Copyright Notice
© Contents copyright 2013, NVIDIA Corp. This material may not be
reproduced in any manner without the expressed written
permission of NVIDIA Corp.