P4343 - OPENUH: OPEN SOURCE OPENACC COMPILER
Xiaonan (Daniel) Tian, Rengan Xu and Barbara Chapman
HPCTools Group, Computer Science Department, University of Houston
GTC 2014, San Jose, CA; 03/26/2014

Outline
I. Motivation
II. Introduction to OpenUH
III. Loop Scheduling
IV. Data Movement
V. Performance
VI. Future and Conclusion

I. Motivation
• Why do we implement OpenACC support in OpenUH?
– There is a performance gap between OpenACC and CUDA, so more research on OpenACC compiler optimization is needed.
– An open source OpenACC compiler is required for research purposes.
• Why is this talk important?
– A better understanding of the OpenACC implementation.
– Better knowledge of application optimization.

II. Introduction to OpenUH
Website: http://web.cs.uh.edu/~openuh/
Source: https://github.com/pumpkin83/OpenUH-OpenACC
Email: [email protected]

Introduction to OpenUH
• Open source research compiler
– Open64 based
– Supports C/C++/Fortran/Coarray
• Parallel programming models
– OpenMP
– OpenACC
– Coarray

Introduction to OpenUH OpenACC
• OpenACC 1.0 implementation
– Directives: parallel, kernels, data, loop, wait
– Data clauses: copy/copyin/copyout/create/update
– Loop scheduling clauses: gang/worker/vector
– Async clauses: async/wait
– Unsupported: host_data/declare/cache

OpenUH OpenACC Compiler Infrastructure
[Figure: compilation flow of the OpenUH OpenACC compiler]
• Input: source code with OpenACC directives
• Compiler components shown: FRONTENDS (C, OpenACC); IPA (Inter Procedural Analyzer); PRELOWER (Preprocess OpenACC); LNO (Loop Nest Optimizer); LOWER (Transformation of OpenACC); WOPT (Global Scalar Optimizer); WHIRL2CUDA; CG (Code for IA-32, IA-64, X86_64)
• GPU path: GPU code, NVCC compiler, PTX assembler, loaded dynamically
• CPU path: CPU binary, linker with the runtime library, executable

III. Loop Scheduling
Loop Scheduling
1. What is loop scheduling?
2. Parallel loop scheduling
3. Kernels loop scheduling

Loop Scheduling
• What is loop scheduling?
– Solutions to distribute sequential loop iterations across a large number of threads
• Why do we have two different loop scheduling strategies?
– To explore the multi-dimensional topology of the NVIDIA GPGPU architecture

Loop Scheduling
• #pragma acc loop gang(4)
• for(i=0; i<11; i++){…}
[Figure: the 11 iterations (0 to 10) distributed across gangs 0 to 3]

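The gang(4) example can be sketched on the CPU as follows. This is a minimal illustration, not the OpenUH implementation: the slide text does not say how iterations are assigned, so the round-robin (cyclic) order used here is an assumption, and both function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: the gang that runs iteration i when num_gangs gangs
 * share one loop. ASSUMPTION: round-robin (cyclic) assignment; the slide
 * does not state the distribution. */
int gang_of_iteration(int i, int num_gangs)
{
    assert(i >= 0 && num_gangs > 0);
    return i % num_gangs;
}

/* Number of iterations out of n that gang g runs under the same assumption:
 * it visits g, g + num_gangs, g + 2*num_gangs, ... while below n. */
int iterations_for_gang(int g, int num_gangs, int n)
{
    assert(g >= 0 && g < num_gangs && n >= 0);
    return (n > g) ? (n - g + num_gangs - 1) / num_gangs : 0;
}
```

With 11 iterations and 4 gangs, this assignment gives gangs 0 to 2 three iterations each and gang 3 two iterations.
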
Loop Scheduling
• #pragma acc loop vector(64)
• for(i=0; i<99; i++){…}
[Figure: iteration ranges 0…31, 32…63, 64…95 and 96…99 mapped onto vector lanes 0…31 and 32…63]

Loop Scheduling
• #pragma acc loop gang(3) vector(32)
• for(i=0; i<130; i++){…}
[Figure: iteration ranges 0…31, 32…63 and 64…95 run on vector lanes 0…31 of Gang 0, Gang 1 and Gang 2; the remaining ranges 96…127 and 128…129 follow]

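The gang(3) vector(32) example can be sketched as a mapping from an iteration to a (gang, vector lane) pair. This is a hedged sketch: it assumes that consecutive chunks of vector-length iterations go to consecutive gangs and wrap around, which appears consistent with the figure but is not stated on the slide. `Slot` and `slot_of_iteration` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical result type: the gang and vector lane that run an iteration. */
typedef struct {
    int gang;
    int lane;
} Slot;

/* ASSUMPTION: chunks of vector_len consecutive iterations are handed to
 * gangs in cyclic order; inside a chunk, iteration k runs on lane k. */
Slot slot_of_iteration(int i, int num_gangs, int vector_len)
{
    assert(i >= 0 && num_gangs > 0 && vector_len > 0);
    Slot s;
    s.gang = (i / vector_len) % num_gangs;
    s.lane = i % vector_len;
    return s;
}
```

Under this assumption, iterations 96…127 wrap around to Gang 0 and iterations 128…129 use only the first two lanes of Gang 1.
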
Parallel Loop Scheduling
• Gang → (CUDA) thread-block
• Worker → (CUDA) y-dimension threads in a thread-block
• Vector → (CUDA) x-dimension threads in a thread-block
• 1D grid, and 1D/2D thread-block
• # of workers * # of vectors <= 1024
• Requires minimal lower-level knowledge
• Follows OpenACC 2.0: a gang contains workers and vectors; a worker can only contain vectors

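The mapping above can be written down as a small helper that turns gang/worker/vector sizes into a launch shape and checks the 1024-thread limit. This is an illustrative sketch only; `LaunchDims` and `parallel_launch_dims` are hypothetical names, not part of OpenUH or CUDA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS_PER_BLOCK 1024 /* limit quoted on the slide */

/* Hypothetical launch shape for a parallel region. */
typedef struct {
    int grid_x;  /* number of thread-blocks (gangs), 1D grid */
    int block_x; /* threads in x (vector length) */
    int block_y; /* threads in y (workers) */
} LaunchDims;

/* Fills *out and returns true when the sizes are positive and
 * workers * vector_len fits in one thread-block; returns false otherwise. */
bool parallel_launch_dims(int gangs, int workers, int vector_len,
                          LaunchDims *out)
{
    assert(out != NULL);
    if (gangs <= 0 || workers <= 0 || vector_len <= 0)
        return false;
    if ((long long)workers * vector_len > MAX_THREADS_PER_BLOCK)
        return false;
    out->grid_x = gangs;
    out->block_x = vector_len;
    out->block_y = workers;
    return true;
}
```

For gang(2) worker(4) vector(64) this yields a grid of 2 blocks with 64 x 4 threads each; worker(32) vector(64) would be rejected because 2048 threads exceed the limit.
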
Parallel Loop Scheduling
1. Single Loop
#pragma acc loop gang worker vector
for(...){}
2. Two-level Nested Loop
2.1. loop gang / loop worker vector
#pragma acc loop gang
for(…){
  #pragma acc loop worker vector
  for(…){ }
}
2.2. loop gang worker / loop vector
#pragma acc loop gang worker
for(…){
  #pragma acc loop vector
  for(…){ }
}
2.3. loop gang / loop vector

Parallel Loop Scheduling: Example
• #pragma acc loop gang(2) worker(4) vector(64)
• for(i=istart; i<iend; i++){…}
[Figure: loop iterations mapped onto the CUDA architecture as Block 0 and Block 1; each box stands for 32 iterations/threads]

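One way to picture the single-loop case is a CPU emulation in which every (block, worker, vector lane) triple walks the iteration space with a fixed stride. The flattened thread id and the stride of gangs * workers * vector length are assumptions made for this sketch; the slides do not give the generated code, and `emulate_single_loop` is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>

/* Emulates a gang/worker/vector schedule of one loop on the CPU.
 * ASSUMPTION: thread (b, y, x) starts at its flattened id and advances by
 * the total thread count. Returns 1 when every iteration in [istart, iend)
 * is executed exactly once, 0 otherwise (or on allocation failure). */
int emulate_single_loop(int gangs, int workers, int vector_len,
                        int istart, int iend)
{
    assert(gangs > 0 && workers > 0 && vector_len > 0 && iend >= istart);
    int n = iend - istart;
    int *visits = calloc((size_t)n + 1, sizeof *visits);
    if (visits == NULL)
        return 0;

    int threads_per_block = workers * vector_len;
    int stride = gangs * threads_per_block;
    for (int b = 0; b < gangs; b++)              /* thread-block (gang) */
        for (int y = 0; y < workers; y++)        /* y dimension (worker) */
            for (int x = 0; x < vector_len; x++) /* x dimension (vector) */
            {
                int tid = b * threads_per_block + y * vector_len + x;
                for (int i = istart + tid; i < iend; i += stride)
                    visits[i - istart]++;
            }

    int ok = 1;
    for (int k = 0; k < n; k++)
        if (visits[k] != 1)
            ok = 0;
    free(visits);
    return ok;
}
```

With gang(2) worker(4) vector(64) there are 512 emulated threads, so a loop of 1000 iterations needs two passes for most threads.
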
Parallel Loop Scheduling
#pragma acc loop gang(2)
for(i=istart; i<iend; i++){
  #pragma acc loop worker(4) vector(64)
  for(j=jstart; j<jend; j++){…}
}
[Figure: outer loop iterations plotted against inner loop iterations]

Parallel Loop Scheduling
#pragma acc loop gang(2) worker(4)
for(i=istart; i<iend; i++){
  #pragma acc loop vector(64)
  for(j=jstart; j<jend; j++){…}
}
[Figure: outer loop iterations plotted against inner loop iterations]

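The two nested-loop variants (2.1 and 2.2) differ in which loop the worker dimension is attached to. The sketch below computes which thread would own an iteration pair (i, j) in each variant. It assumes cyclic assignment in every dimension and zero-based indices (offsets from istart and jstart); the type and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical thread coordinates: block index, y (worker), x (vector). */
typedef struct {
    int block;
    int y;
    int x;
} ThreadId;

/* Variant 2.1: outer loop -> gang, inner loop -> worker and vector.
 * ASSUMPTION: cyclic assignment; i and j are zero-based offsets. */
ThreadId owner_gang_then_worker_vector(int i, int j, int gangs,
                                       int workers, int vector_len)
{
    assert(i >= 0 && j >= 0 && gangs > 0 && workers > 0 && vector_len > 0);
    ThreadId t;
    int inner = j % (workers * vector_len);
    t.block = i % gangs;
    t.y = inner / vector_len;
    t.x = inner % vector_len;
    return t;
}

/* Variant 2.2: outer loop -> gang and worker, inner loop -> vector.
 * ASSUMPTION: consecutive outer iterations fill the workers of one block
 * before moving to the next block. */
ThreadId owner_gang_worker_then_vector(int i, int j, int gangs,
                                       int workers, int vector_len)
{
    assert(i >= 0 && j >= 0 && gangs > 0 && workers > 0 && vector_len > 0);
    ThreadId t;
    int outer = i % (gangs * workers);
    t.block = outer / workers;
    t.y = outer % workers;
    t.x = j % vector_len;
    return t;
}
```

In variant 2.1 only 2 outer iterations run concurrently but each gets 256 threads for its inner loop; in variant 2.2, 8 outer iterations run concurrently with 64 threads each.
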
Parallel Loop Scheduling
3. Three-level Nested Loop
loop gang / loop worker / loop vector
#pragma acc loop gang
for(...)
  #pragma acc loop worker
  for(...)
    #pragma acc loop vector
    for(...) { }

Why do we need different strategies for implementing loop scheduling?
#pragma acc loop gang(19)
for(i=0; i<19; i++)
  #pragma acc loop worker(32)
  for(j=0; j<1000000; j++)
    #pragma acc loop vector(32)
    for(k=0; k<100000; k++) { }
What is the maximum number of threads we have? 19*32*32 = 19,456 (about 19K)

Try this loop scheduling
According to the scheduling in the code, a 2D grid and a 2D thread-block are created on the NVIDIA GPGPU.
#pragma acc loop gang(19)
for(i=0; i<19; i++)
  #pragma acc loop gang(32) vector(32)
  for(j=0; j<1000000; j++)
    #pragma acc loop vector(32)
    for(k=0; k<100000; k++) { }
What is the maximum number of threads we have here? 19*32*32*32 = 622,592 (32 * 19K)

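The two thread counts are plain products of the clause sizes, which can be checked with a few lines of C. `product` is a hypothetical helper written for this check.

```c
#include <assert.h>

/* Multiplies n clause sizes (gang, worker and vector sizes) to obtain the
 * total number of threads a schedule can launch. */
long long product(const int *sizes, int n)
{
    assert(sizes != 0 && n >= 0);
    long long p = 1;
    for (int k = 0; k < n; k++)
        p *= sizes[k];
    return p;
}
```

Calling it with {19, 32, 32} gives 19456 for the gang/worker/vector schedule, and with {19, 32, 32, 32} gives 622592 for the schedule that uses gang on two loops, 32 times as many threads.
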
Kernels Loop Scheduling
• Gang → (CUDA) thread-block, can be in the x, y, z dimensions
• Worker → ignored
• Vector → (CUDA) thread, can be in the x, y, z dimensions
• Multi-dimensional grid/thread-block; both can be extended into a 3-dimensional topology
• Fine tuning: provides more scheduling options for users
• Users need more knowledge about the compiler and the hardware (currently, no autotuning)
• In some cases, the additional loop scheduling choices do help improve performance

Kernels Loop Scheduling
1. Single Loop
#pragma acc loop gang vector
for(...){}
2. Double Nested Loop
2.1. loop gang / loop vector
2.2. loop gang vector / loop vector
2.3. loop gang / loop gang vector
2.4. loop gang vector / loop gang vector

Kernels Loop Scheduling
3. Triple Nested Loop
3.1. loop gang / loop gang vector / loop vector
3.2. loop vector / loop gang vector / loop gang
3.3. loop gang vector / loop gang vector / loop vector
3.4. loop gang vector / loop gang vector / loop gang vector
…

Kernels Loop Scheduling: Example
#pragma acc loop gang(2) vector(4)
for(i=istart; i<iend; i++){
  #pragma acc loop gang(3) vector(64)
  for(j=jstart; j<jend; j++){…}
}
[Figure: outer loop iterations plotted against inner loop iterations]

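For a kernels region, each loop level contributes one dimension to both the grid and the thread-block. The sketch below derives the launch shape from a loop nest. It assumes the innermost loop maps to the x dimension and outer loops to y and z; the slides only say that each clause can be in x, y or z, so this ordering is an assumption, and all names are hypothetical.

```c
#include <assert.h>

typedef struct { int x, y, z; } Dim3;

/* Hypothetical description of one loop level: gang(gangs) vector(vector_len).
 * A value of 0 means the clause is absent on that loop. */
typedef struct { int gangs; int vector_len; } LoopSched;

/* loops[0] is the outermost loop. ASSUMPTION: the innermost loop maps to x,
 * the next one out to y, then z. Returns 1 on success, 0 for a bad depth. */
int kernels_launch_dims(const LoopSched *loops, int depth,
                        Dim3 *grid, Dim3 *block)
{
    assert(loops && grid && block);
    if (depth < 1 || depth > 3)
        return 0;
    int g[3] = {1, 1, 1};
    int b[3] = {1, 1, 1};
    for (int d = 0; d < depth; d++) {
        const LoopSched *l = &loops[depth - 1 - d]; /* innermost first */
        g[d] = l->gangs > 0 ? l->gangs : 1;
        b[d] = l->vector_len > 0 ? l->vector_len : 1;
    }
    *grid = (Dim3){ g[0], g[1], g[2] };
    *block = (Dim3){ b[0], b[1], b[2] };
    return 1;
}
```

For the example above, gang(2) vector(4) on the outer loop and gang(3) vector(64) on the inner loop give a 3 x 2 grid of 64 x 4 thread-blocks under this assumption.
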
Parallel Loop Scheduling
#pragma acc loop gang(2) worker(4)
for(i=istart; i<iend; i++){
  #pragma acc loop vector(64)
  for(j=jstart; j<jend; j++){…}
}
[Figure: outer loop iterations plotted against inner loop iterations]

IV. Data Movement
Data Movement
1. Data transfer between CPU and GPU
[Figure: a multi-core CPU with main memory and a GPU with thousands of cores and GPU memory; copyin moves data from main memory to GPU memory, copyout moves it back]
How to optimize?

Data Movement
2. Basic Implementation
• copy → pcopy; copyin → pcopyin; copyout → pcopyout; create → pcreate
• Free buffers/variables when exiting the current region
• Goal: avoid duplicate data traffic (malloc, copyin, copyout)

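The "present or copy" behavior can be illustrated with a small lookup table on the host. This is a sketch under stated assumptions, not the OpenUH runtime: device memory is simulated with malloc and memcpy, the table is a fixed-size array, and `PresentTable`, `pcopyin` (as a C function) and `release_all` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 64

/* One host-to-device mapping; the "device" buffer is simulated with malloc. */
typedef struct { const void *host; void *dev; size_t bytes; } MapEntry;

typedef struct {
    MapEntry e[MAX_ENTRIES];
    int count;
    int transfers; /* how many simulated host-to-device copies happened */
} PresentTable;

/* Present-or-copyin: reuse the device buffer if host is already mapped,
 * otherwise allocate and copy. Returns NULL on failure. */
void *pcopyin(PresentTable *t, const void *host, size_t bytes)
{
    assert(t && host && bytes > 0);
    for (int k = 0; k < t->count; k++)
        if (t->e[k].host == host)
            return t->e[k].dev;  /* already present: no malloc, no copy */
    if (t->count == MAX_ENTRIES)
        return NULL;
    void *dev = malloc(bytes);   /* stands in for a device allocation */
    if (dev == NULL)
        return NULL;
    memcpy(dev, host, bytes);    /* stands in for a host-to-device copy */
    t->e[t->count++] = (MapEntry){ host, dev, bytes };
    t->transfers++;
    return dev;
}

/* Frees every buffer, as when the region that created them is exited. */
void release_all(PresentTable *t)
{
    assert(t);
    for (int k = 0; k < t->count; k++)
        free(t->e[k].dev);
    t->count = 0;
}
```

Calling `pcopyin` twice for the same array, as nested regions with the same data clause would, performs one allocation and one copy.
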
Data Movement
2. Basic Implementation
#pragma acc data data_clauses
{
  #pragma acc data data_clauses
  {
    #pragma acc kernels data_clauses
    { … }
  }
}

Data Movement
3. Partial Array
#pragma acc data create(xx[0:N])
{
  Foo(&xx[start])
}
…
Foo(double* x)
{
  #pragma acc parallel pcopy(x[n1:n2])
  { … }
}
[Figure: CPU memory holds xx and x; GPU memory holds xx' and x'; the CPU-GPU memory mapping table records the pairs (xx, xx') and (x, x')]

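The partial-array case comes down to recognizing that a pointer into an already mapped array is itself present. A minimal sketch of such a range lookup is shown below; the table layout and the names `Mapping` and `translate` are assumptions made for illustration, and the "device" base is just another host buffer here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of a hypothetical CPU-GPU mapping table. */
typedef struct {
    const char *host_base; /* start of the host array, e.g. xx */
    char *dev_base;        /* start of its device copy, e.g. xx' */
    size_t bytes;          /* size of the mapped range */
} Mapping;

/* If [host_addr, host_addr + bytes) lies inside a mapped range, returns the
 * matching device address (x' = xx' + offset); otherwise returns NULL,
 * meaning a new allocation and transfer would be needed. */
void *translate(const Mapping *maps, int n, const void *host_addr,
                size_t bytes)
{
    assert(maps && host_addr && n >= 0);
    uintptr_t p = (uintptr_t)host_addr;
    for (int k = 0; k < n; k++) {
        uintptr_t lo = (uintptr_t)maps[k].host_base;
        if (p < lo)
            continue;
        size_t offset = (size_t)(p - lo);
        if (offset <= maps[k].bytes && bytes <= maps[k].bytes - offset)
            return maps[k].dev_base + offset;
    }
    return NULL;
}
```

With xx mapped as a whole, a request for x = &xx[start] resolves to an offset into xx' without a second allocation, while a request that runs past the end of xx is reported as not present.
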
V. Performance
Three-Level Nested Loop Scheduling
[Figure: two bar charts, Wave13pt and Laplacian; x-axis: loop scheduling (parallel-g-w-v, kernels); y-axis: performance (seconds); series: OpenUH, CAPS, PGI, CRAY]
• Kernels: OpenUH uses g-gv-v scheduling; PGI, CAPS and CRAY use their default scheduling
• The same experimental platform is used for OpenUH, CAPS and PGI; a Cray machine is used for CRAY

NAS Benchmark
[Figure: two bar charts, NAS EP and NAS CG; x-axis: data size (A, B, C); y-axis: performance (seconds); NAS EP series: OpenUH(parallel), PGI(parallel), OpenUH(combined), PGI(combined), Cray(parallel); NAS CG series: OpenUH(parallel), PGI(parallel), Cray(parallel)]
• Combined: parallel + kernels
• Cray: uses default loop scheduling, #pragma acc loop
• The same experimental platform is used for OpenUH, CAPS and PGI; a Cray machine is used for CRAY

NAS Benchmark
[Figure: two bar charts, NAS MG and NAS SP; x-axis: data size (A, B, C); y-axis: performance (seconds); NAS MG series: OpenUH(parallel), PGI(parallel), Cray(parallel); NAS SP series: OpenUH(parallel), PGI(parallel), OpenUH(combined), PGI(combined)]
• Combined: parallel + kernels
• The same experimental platform is used for OpenUH, CAPS and PGI; a Cray machine is used for CRAY

NAS Benchmark
[Figure: two bar charts, NAS BT and NAS LU; x-axis: data size (A, B, C); y-axis: performance (seconds); series: OpenUH(parallel), OpenUH(combined)]
• Combined: parallel + kernels

VI. Future and Conclusion
Future Work
• Support Fortran
• Support Xeon Phi/AMD GPGPUs and APU
• Perform more optimization: irregular memory access optimization
• Provide a more robust OpenACC implementation

Conclusion
• Open source OpenACC research compiler, based on Open64
• Competitive performance compared to commercial compilers
• Proposed regular loop scheduling for the parallel region and non-standard loop scheduling for the kernels region
Questions?