Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification
Cedric Nugteren (presenter), Pieter Custers, Henk Corporaal
Eindhoven University of Technology (TU/e)
http://parse.ele.tue.nl/
August 28, 2013
Cedric Nugteren (TU/e) Species and Skeletons August 28, 2013 1 / 27
Species and skeletons
Are these two of the same species?
Species and skeletons
They are. Possible explanation: their skeletons look alike.
Species and skeletons
And what about these two?
Species and skeletons
They are not: their skeletons are quite different.
Species and skeletons
Functionality of the code: what you want to compute
Structure of the code: parallelism and memory access patterns
Example C to CUDA transformation
Example 1: Sum

int sum = 0;
for (int i=0; i<N; i++) {
  sum = sum + in[i];
}

[image taken from 'Optimizing Parallel Reduction in CUDA' by Mark Harris]

Just a part of the optimized CUDA code:

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sm, unsigned int tid) {
  if (blockSize >= 64) sm[tid] += sm[tid + 32];
  if (blockSize >= 32) sm[tid] += sm[tid + 16];
  if (blockSize >= 16) sm[tid] += sm[tid + 8];
  if (blockSize >= 8)  sm[tid] += sm[tid + 4];
  if (blockSize >= 4)  sm[tid] += sm[tid + 2];
  if (blockSize >= 2)  sm[tid] += sm[tid + 1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
  extern __shared__ int sm[];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x*(blockSize*2) + tid;
  unsigned int gridSize = blockSize*2*gridDim.x;
  sm[tid] = 0;
  while (i < n) {
    sm[tid] += g_idata[i];
    sm[tid] += g_idata[i+blockSize];
    i += gridSize;
  }
  __syncthreads();
  if (blockSize >= 512) {
    if (tid < 256) { sm[tid] += sm[tid + 256]; }
    __syncthreads();
  }
  if (blockSize >= 256) {
    if (tid < 128) { sm[tid] += sm[tid + 128]; }
    __syncthreads();
  }
  if (blockSize >= 128) {
    if (tid < 64) { sm[tid] += sm[tid + 64]; }
    __syncthreads();
  }
  if (tid < 32) { warpReduce<blockSize>(sm, tid); }
  if (tid == 0) { g_odata[blockIdx.x] = sm[0]; }
}
What about a second example?
Example 1: Sum

int sum = 0;
for (int i=0; i<N; i++) {
  sum = sum + in[i];
}

Example 2: Max

int max = 0;
for (int i=0; i<N; i++) {
  max = (max>in[i]) ? max : in[i];
}

Highlighted is the functionality. The remainder is the structure: the skeleton of the code.

CUDA code for example 2:

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sm, unsigned int tid) {
  if (blockSize >= 64) sm[tid] = (sm[tid]>sm[tid+32]) ? sm[tid] : sm[tid+32];
  if (blockSize >= 32) sm[tid] = (sm[tid]>sm[tid+16]) ? sm[tid] : sm[tid+16];
  if (blockSize >= 16) sm[tid] = (sm[tid]>sm[tid+ 8]) ? sm[tid] : sm[tid+ 8];
  if (blockSize >= 8)  sm[tid] = (sm[tid]>sm[tid+ 4]) ? sm[tid] : sm[tid+ 4];
  if (blockSize >= 4)  sm[tid] = (sm[tid]>sm[tid+ 2]) ? sm[tid] : sm[tid+ 2];
  if (blockSize >= 2)  sm[tid] = (sm[tid]>sm[tid+ 1]) ? sm[tid] : sm[tid+ 1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
  extern __shared__ int sm[];
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x*(blockSize*2) + tid;
  unsigned int gridSize = blockSize*2*gridDim.x;
  sm[tid] = 0;
  while (i < n) {
    sm[tid] = (sm[tid]>g_idata[i]) ? sm[tid] : g_idata[i];
    sm[tid] = (sm[tid]>g_idata[i+blockSize]) ? sm[tid] : g_idata[i+blockSize];
    i += gridSize;
  }
  __syncthreads();
  if (blockSize >= 512) {
    if (tid < 256) { sm[tid] = (sm[tid]>sm[tid+256]) ? sm[tid] : sm[tid+256]; }
    __syncthreads();
  }
  if (blockSize >= 256) {
    if (tid < 128) { sm[tid] = (sm[tid]>sm[tid+128]) ? sm[tid] : sm[tid+128]; }
    __syncthreads();
  }
  if (blockSize >= 128) {
    if (tid < 64) { sm[tid] = (sm[tid]>sm[tid+ 64]) ? sm[tid] : sm[tid+ 64]; }
    __syncthreads();
  }
  if (tid < 32) { warpReduce<blockSize>(sm, tid); }
  if (tid == 0) { g_odata[blockIdx.x] = sm[0]; }
}
What about a second example?
Example 1: Sum

int sum = 0;
for (int i=0; i<N; i++) {
  sum = sum + in[i];
}

Example 2: Max

int max = 0;
for (int i=0; i<N; i++) {
  max = (max>in[i]) ? max : in[i];
}

Highlighted is the functionality. The remainder is the structure: the skeleton of the code.

CUDA code for example 1: identical to the optimised reduction code for example 2 on the previous slide, with each max operation replaced by an addition. The skeleton is unchanged; only the functionality differs.
Outline
1 Introducing skeletons
2 Introducing algorithmic species
3 ‘Bones’: a skeleton-based source-to-source compiler
4 Host-accelerator transfer optimisations
5 Experimental results
6 Summary
Example algorithmic species
Matrix-vector multiplication:

for (i=0; i<64; i++) {
  r[i] = 0;
  for (j=0; j<128; j++) {
    r[i] += M[i][j] * v[j];
  }
}

[figure: access patterns of M (one 0:127 chunk per output), v (read in full), and r (one element per iteration)]

0:63,0:127|chunk(0:0,0:127) ∧ 0:127|full → 0:63|element

Stencil computation:

for (i=1; i<128-1; i++) {
  m[i] = 0.33 * (a[i-1] + a[i] + a[i+1]);
}

[figure: neighbourhood reads of a and element-wise writes of m]

1:126|neighbourhood(-1:1) → 1:126|element
Algorithmic species
Algorithmic species:
Classifies code based on memory access patterns and parallelism
Is more fine-grained than other skeleton classifications
Can be extracted automatically from C code using aset or a-darwin

For more information on species:

1 C. Nugteren, P. Custers, and H. Corporaal. Algorithmic Species: An Algorithm Classification of Affine Loop Nests for Parallel Programming. In ACM TACO: Transactions on Architecture and Code Optimization, 9(4): Article 40, 2013.
2 C. Nugteren, R. Corvino, and H. Corporaal. Algorithmic Species Revisited: A Program Code Classification Based on Array References. In MuCoCoS'13: International Workshop on Multi-/Many-core Computing Systems. IEEE, 2013.
Outline
1 Introducing skeletons
2 Introducing algorithmic species
3 ‘Bones’: a skeleton-based source-to-source compiler
4 Host-accelerator transfer optimisations
5 Experimental results
6 Summary
Introducing Bones (1/2)
Bones provides a set of skeletons for multiple targets:
C-to-CUDA (NVIDIA GPUs)
C-to-OpenCL (3 targets: AMD GPUs, AMD CPUs, Intel CPUs)
C-to-OpenMP (multi-core CPUs)
C-to-C (pass-through)
C-to-FPGA (under construction)
Introducing Bones (2/2)
Bones’ main strengths:
1 Code readability: skeletons allow for a lightweight compiler
2 Performance: optimised skeletons are written in the native language
3 Low programmer effort: species can be extracted automatically
Example simplified skeleton within Bones
Example OpenMP skeleton:

int count;
count = omp_get_num_procs();
omp_set_num_threads(count);
#pragma omp parallel
{
  int tid, i;
  int work, start, end;
  tid = omp_get_thread_num();
  work = <parallelism> / count;
  start = tid * work;
  end = (tid+1) * work;
  for (i=start; i<end; i++) {
    <ids>
    <code>
  }
}

Instantiated skeleton:

int count;
count = omp_get_num_procs();
omp_set_num_threads(count);
#pragma omp parallel
{
  int tid, i;
  int work, start, end;
  tid = omp_get_thread_num();
  work = 128 / count;
  start = tid * work;
  end = (tid+1) * work;
  for (i=start; i<end; i++) {
    int gid = i;
    r[gid] = 0;
    for (j=0; j<128; j++)
      r[gid] += M[gid][j] * v[j];
  }
}
Outline
1 Introducing skeletons
2 Introducing algorithmic species
3 ‘Bones’: a skeleton-based source-to-source compiler
4 Host-accelerator transfer optimisations
5 Experimental results
6 Summary
Example of host-accelerator transfer optimisations
GPU example with 2 kernels:
Kernel 1: consumes A, produces B
Kernel 2: consumes B, produces C
Naive approach: copy-in before and copy-out directly after each kernel

Asynchronous copies:
Create a second thread to perform asynchronous copies
Use synchronisation barriers

Remove redundant copies:
Remove the redundant copy-in of B

Postpone copy-outs:
Postpone the copy-out of B
Overlap transfers with computations
Outline
1 Introducing skeletons
2 Introducing algorithmic species
3 ‘Bones’: a skeleton-based source-to-source compiler
4 Host-accelerator transfer optimisations
5 Experimental results
6 Summary
Performance results for CPUs
Performance results for execution on a 4-core i7 CPU:
Based on kernels from the PolyBench benchmark set
Targets: OpenCL (2 different SDKs) and OpenMP
Showing speed-ups of individual kernels
Comparing to unoptimised sequential C-code
[bar chart: speed-up per kernel on a log scale (0.5x to 10x) for OpenCL (Intel SDK), OpenCL (AMD SDK), and OpenMP, covering the 2mm, 3mm, adi, atax, bicg, correl, covar, and doitgen kernels]
Performance results for CPUs
[bar charts: speed-up per kernel on a log scale (0.5x to 10x) for OpenCL (Intel SDK), OpenCL (AMD SDK), and OpenMP, covering the remaining PolyBench kernels (fdtd2d, gemm, gemver, gesummv, jacobi1d, jacobi2d, mvt, syr2k, syrk) and the geometric mean]
Performance results for GPUs
Comparing Bones to others on a GPU:
Based on kernels from the PolyBench benchmark set
Target: CUDA on a GTX470 GPU
Showing speed-ups of Bones over Par4All and PPCG [Par4All and PPCG are the only other automatic C-to-CUDA compilers available]
Higher is in favour of Bones
Two different experiments:
1 Individual GPU kernels
2 Full program, including host-accelerator transfers
Performance results for GPUs (kernel only)
[bar charts: kernel-only speed-up of Bones over Par4All and over PPCG on a log scale (0.5x to 10x), per PolyBench kernel plus the geometric mean; a few kernels are marked 'x']
Performance results for GPUs (full program)
[bar charts: full-program speed-up of Bones over PolyBench/GPU, Par4All, PPCG, and Bones with naive transfers, on a log scale per benchmark; annotated outliers include 368x, 933x, 378x, 19x, 13x, 17x, and one slow-down of 0.27x]
Outline
1 Introducing skeletons
2 Introducing algorithmic species
3 ‘Bones’: a skeleton-based source-to-source compiler
4 Host-accelerator transfer optimisations
5 Experimental results
6 Summary
Summary
The source-to-source compiler Bones:
Uses algorithmic skeletons
Generates readable CUDA/OpenCL/OpenMP code
Delivers competitive GPU performance
Performs host-accelerator transfer optimisations
Is based on algorithmic species
The classification ‘algorithmic species’:
Captures memory access patterns from C source code
Automates classification through aset and a-darwin
Questions / further information
Thank you for your attention!
Bones is available at: http://parse.ele.tue.nl/bones/

For more information and links to publications, visit:
http://parse.ele.tue.nl/
http://parse.ele.tue.nl/species/
http://www.cedricnugteren.nl/