
Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification

Cedric Nugteren (presenter), Pieter Custers, Henk Corporaal

Eindhoven University of Technology (TU/e)

http://parse.ele.tue.nl/

August 28, 2013

Cedric Nugteren (TU/e) Species and Skeletons August 28, 2013 1 / 27

Species and skeletons

Are these two of the same species?

They are. Possible explanation: their skeletons look alike.

And what about these two?

They are not: their skeleton is quite different.

Functionality of the code: what you want to compute

Structure of the code: parallelism and memory access patterns

Example C to CUDA transformation

Example 1: Sum

    int sum = 0;
    for (int i=0; i<N; i++) {
      sum = sum + in[i];
    }

[image taken from ‘Optimizing Parallel Reduction in CUDA’ by Mark Harris]

Just a part of the optimized CUDA code:

    template <unsigned int blockSize>
    __device__ void warpReduce(volatile int *sm, unsigned int tid) {
      if (blockSize >= 64) sm[tid] += sm[tid + 32];
      if (blockSize >= 32) sm[tid] += sm[tid + 16];
      if (blockSize >= 16) sm[tid] += sm[tid + 8];
      if (blockSize >= 8) sm[tid] += sm[tid + 4];
      if (blockSize >= 4) sm[tid] += sm[tid + 2];
      if (blockSize >= 2) sm[tid] += sm[tid + 1];
    }

    template <unsigned int blockSize>
    __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
      extern __shared__ int sm[];
      unsigned int tid = threadIdx.x;
      unsigned int i = blockIdx.x*(blockSize*2) + tid;
      unsigned int gridSize = blockSize*2*gridDim.x;
      sm[tid] = 0;
      while (i < n) {
        sm[tid] += g_idata[i];
        sm[tid] += g_idata[i+blockSize];
        i += gridSize;
      }
      __syncthreads();
      if (blockSize >= 512) {
        if (tid < 256) { sm[tid] += sm[tid + 256]; }
        __syncthreads();
      }
      if (blockSize >= 256) {
        if (tid < 128) { sm[tid] += sm[tid + 128]; }
        __syncthreads();
      }
      if (blockSize >= 128) {
        if (tid < 64) { sm[tid] += sm[tid + 64]; }
        __syncthreads();
      }
      if (tid < 32) { warpReduce<blockSize>(sm, tid); }
      if (tid == 0) { g_odata[blockIdx.x] = sm[0]; }
    }
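The shared-memory part of the kernel above halves the number of active threads at every step. A minimal sequential C sketch of that pairwise pattern, assuming a power-of-two input length; the name `tree_reduce_sum` is hypothetical and appears neither on the slides nor in Harris's code:

```c
#include <stddef.h>

/* Hypothetical helper: sums buf[0..n-1] in place with the same
 * stride-halving pattern as the shared-memory part of the kernel.
 * Assumes n is a power of two (n >= 1); buf is overwritten. */
int tree_reduce_sum(int *buf, size_t n) {
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* On the GPU, each iteration of this inner loop is one thread. */
        for (size_t tid = 0; tid < stride; tid++) {
            buf[tid] += buf[tid + stride];
        }
    }
    return buf[0];
}
```

For eight elements this takes three steps (strides 4, 2, 1) instead of eight sequential additions, which is what makes the pattern attractive for parallel hardware.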

What about a second example?

Example 1: Sum

    int sum = 0;
    for (int i=0; i<N; i++) {
      sum = sum + in[i];
    }

Example 2: Max

    int max = 0;
    for (int i=0; i<N; i++) {
      max = (max>in[i]) ? max : in[i];
    }

Highlighted is the functionality

The remainder is the structure: the skeleton of the code

CUDA code for example 2:

    template <unsigned int blockSize>
    __device__ void warpReduce(volatile int *sm, unsigned int tid) {
      if (blockSize >= 64) sm[tid] = (sm[tid]>sm[tid+32]) ? sm[tid] : sm[tid+32];
      if (blockSize >= 32) sm[tid] = (sm[tid]>sm[tid+16]) ? sm[tid] : sm[tid+16];
      if (blockSize >= 16) sm[tid] = (sm[tid]>sm[tid+ 8]) ? sm[tid] : sm[tid+ 8];
      if (blockSize >= 8) sm[tid] = (sm[tid]>sm[tid+ 4]) ? sm[tid] : sm[tid+ 4];
      if (blockSize >= 4) sm[tid] = (sm[tid]>sm[tid+ 2]) ? sm[tid] : sm[tid+ 2];
      if (blockSize >= 2) sm[tid] = (sm[tid]>sm[tid+ 1]) ? sm[tid] : sm[tid+ 1];
    }

    template <unsigned int blockSize>
    __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
      extern __shared__ int sm[];
      unsigned int tid = threadIdx.x;
      unsigned int i = blockIdx.x*(blockSize*2) + tid;
      unsigned int gridSize = blockSize*2*gridDim.x;
      sm[tid] = 0;
      while (i < n) {
        sm[tid] = (sm[tid]>g_idata[i]) ? sm[tid] : g_idata[i];
        sm[tid] = (sm[tid]>g_idata[i+blockSize]) ? sm[tid] : g_idata[i+blockSize];
        i += gridSize;
      }
      __syncthreads();
      if (blockSize >= 512) {
        if (tid < 256) { sm[tid] = (sm[tid]>sm[tid+256]) ? sm[tid] : sm[tid+256]; }
        __syncthreads();
      }
      if (blockSize >= 256) {
        if (tid < 128) { sm[tid] = (sm[tid]>sm[tid+128]) ? sm[tid] : sm[tid+128]; }
        __syncthreads();
      }
      if (blockSize >= 128) {
        if (tid < 64) { sm[tid] = (sm[tid]>sm[tid+ 64]) ? sm[tid] : sm[tid+ 64]; }
        __syncthreads();
      }
      if (tid < 32) { warpReduce<blockSize>(sm, tid); }
      if (tid == 0) { g_odata[blockIdx.x] = sm[0]; }
    }
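The split between functionality and structure can be sketched in plain C: the reduction loop plays the role of the skeleton and the combining operation is plugged in. The names `reduce`, `op_sum` and `op_max` are hypothetical illustrations and not the interface of Bones, which instantiates skeletons at the source level rather than through function pointers.

```c
#include <stddef.h>

/* Functionality: the operation that differs between the two examples. */
typedef int (*binary_op)(int, int);

int op_sum(int a, int b) { return a + b; }
int op_max(int a, int b) { return (a > b) ? a : b; }

/* Structure (the skeleton): a reduction over in[0..n-1], starting from
 * init, independent of which operation is plugged in. */
int reduce(const int *in, size_t n, int init, binary_op op) {
    int acc = init;
    for (size_t i = 0; i < n; i++) {
        acc = op(acc, in[i]);
    }
    return acc;
}
```

With `in = {3, 9, 4}`, `reduce(in, 3, 0, op_sum)` yields 16 and `reduce(in, 3, 0, op_max)` yields 9. As in the example on the slide, starting the maximum at 0 is only correct when the input contains at least one non-negative element.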

Outline

1 Introducing skeletons

2 Introducing algorithmic species

3 ‘Bones’: a skeleton-based source-to-source compiler

4 Host-accelerator transfer optimisations

5 Experimental results

6 Summary

Example algorithmic species

Matrix-vector multiplication:

    for (i=0; i<64; i++) {
      r[i] = 0;
      for (j=0; j<128; j++) {
        r[i] += M[i][j] * v[j];
      }
    }

[figure: access patterns on M (rows 0 to 63, columns 0 to 127), v (0 to 127) and r (0 to 63)]

0:63,0:127|chunk(0:0,0:127) ∧ 0:127|full → 0:63|element

Stencil computation:

    for (i=1; i<128-1; i++) {
      m[i] = 0.33 * (a[i-1]+a[i]+a[i+1]);
    }

[figure: access patterns on a (0 to 127) → m (0 to 127)]

1:126|neighbourhood(-1:1) → 1:126|element
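To make the access patterns concrete, the two loop nests can be wrapped in C functions, with comments marking which part of each array a single iteration touches. The function names and the `float` element type are assumptions; the slides do not state them.

```c
/* Matrix-vector multiplication: iteration i reads one row of M
 * (chunk 0:0,0:127) and all of v (full), and writes one element r[i]. */
void matvec(float M[64][128], const float v[128], float r[64]) {
    for (int i = 0; i < 64; i++) {
        r[i] = 0;
        for (int j = 0; j < 128; j++) {
            r[i] += M[i][j] * v[j];
        }
    }
}

/* Stencil: iteration i reads a[i-1..i+1] (neighbourhood -1:1) and
 * writes one element m[i]; m[0] and m[127] are left untouched. */
void stencil(const float a[128], float m[128]) {
    for (int i = 1; i < 128 - 1; i++) {
        m[i] = 0.33f * (a[i - 1] + a[i] + a[i + 1]);
    }
}
```

In both cases every output element is written by exactly one iteration of the outer loop, which is why the output side of both species is `element`.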

Algorithmic species

Algorithmic species:

Classifies code based on memory access patterns and parallelism

Is more fine-grained compared to other skeleton classifications

Can be extracted automatically from C-code using aset or a-darwin

For more information on species:

1 C. Nugteren, P. Custers, and H. Corporaal. Algorithmic Species: An Algorithm Classification of Affine Loop Nests for Parallel Programming. In ACM TACO: Transactions on Architecture and Code Optimisations, 9(4):Article 40. 2013.

2 C. Nugteren, R. Corvino, and H. Corporaal. Algorithmic Species Revisited: A Program Code Classification Based on Array References. In MuCoCoS’13: International Workshop on Multi-/Many-core Computing Systems. IEEE, 2013.

Introducing Bones (1/2)

Bones provides a set of skeletons for multiple targets:

C-to-CUDA (NVIDIA GPUs)

C-to-OpenCL (3 targets: AMD GPUs, AMD CPUs, Intel CPUs)

C-to-OpenMP (multi-core CPUs)

C-to-C (pass-through)

C-to-FPGA (under construction)

Introducing Bones (2/2)

Bones’ main strengths:

1 Code readability: skeletons allow for a lightweight compiler

2 Performance: Write optimised skeletons in the native language

3 Low programmer effort: Species can be extracted automatically

Example simplified skeleton within Bones

Example OpenMP skeleton:

    int count;
    count = omp_get_num_procs();
    omp_set_num_threads(count);
    #pragma omp parallel
    {
      int tid, i;
      int work, start, end;
      tid = omp_get_thread_num();
      work = <parallelism>/count;
      start = tid*work;
      end = (tid+1)*work;
      for (i=start; i<end; i++) {
        <ids>
        <code>
      }
    }

Instantiated skeleton:

    int count;
    count = omp_get_num_procs();
    omp_set_num_threads(count);
    #pragma omp parallel
    {
      int tid, i;
      int work, start, end;
      tid = omp_get_thread_num();
      work = 128/count;
      start = tid*work;
      end = (tid+1)*work;
      for (i=start; i<end; i++) {
        int gid = i;
        r[gid] = 0;
        for (j=0; j<128; j++)
          r[gid] += M[gid][j] * v[j];
      }
    }
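As written, the simplified skeleton divides the iterations over the threads with an integer division, so iterations are left out whenever the thread count does not divide the amount of work. A hedged sketch of the same start/end computation, plus one possible way to cover the remainder by giving it to the last thread; `thread_range` is a hypothetical helper and the remainder handling is an addition that is not shown on the slide:

```c
/* Hypothetical helper mirroring the skeleton's work division.
 * Thread tid (0 <= tid < count) gets iterations [*start, *end).
 * Addition not on the slide: the last thread also takes the
 * remainder left over by the integer division. */
void thread_range(int parallelism, int count, int tid,
                  int *start, int *end) {
    int work = parallelism / count;   /* as in the skeleton */
    *start = tid * work;
    *end = (tid + 1) * work;
    if (tid == count - 1) {
        *end = parallelism;           /* cover the remainder */
    }
}
```

For 128 iterations on 4 threads, thread 1 gets iterations 32 to 63, matching the instantiated skeleton. For 130 iterations, the last thread would run 96 to 129 instead of stopping at 127.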

Example of host-accelerator transfer optimisations

GPU example with 2 kernels:

Kernel 1: consumes A, produces B

Kernel 2: consumes B, produces C

Copy-in before and copy-out directly after the kernel

Asynchronous copies:

Create a second thread to perform asynchronous copies

Use synchronisation barriers

Remove redundant copies:

Remove the redundant copy-in of B

Postpone copy-outs:

Postpone the copy-out of B

Overlap transfers with computations
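The effect of removing redundant copy-ins and postponing copy-outs on the two-kernel example can be sketched with a small counting model in C. This is an illustrative model only, not how Bones implements the optimisations: arrays are bits in a mask, the type `kernel_t` and both functions are hypothetical, every produced array is assumed to be needed on the host, and the asynchronous overlap with computation is not modelled.

```c
#include <stddef.h>

/* Hypothetical model: arrays are bits in a mask (A=1, B=2, C=4). */
typedef struct {
    unsigned consumes;  /* arrays read by the kernel */
    unsigned produces;  /* arrays written by the kernel */
} kernel_t;

/* Counts the set bits in a mask. */
static int count_bits(unsigned m) {
    int c = 0;
    while (m) { c += (int)(m & 1u); m >>= 1; }
    return c;
}

/* Naive schedule: copy in every consumed array before each kernel and
 * copy out every produced array directly after it. */
int naive_transfers(const kernel_t *k, size_t n) {
    int transfers = 0;
    for (size_t i = 0; i < n; i++) {
        transfers += count_bits(k[i].consumes) + count_bits(k[i].produces);
    }
    return transfers;
}

/* Optimised schedule: skip the copy-in of arrays already resident on
 * the device, and postpone the copy-outs to the end so that each
 * produced array is copied out once. */
int optimised_transfers(const kernel_t *k, size_t n) {
    unsigned on_device = 0, produced = 0;
    int transfers = 0;
    for (size_t i = 0; i < n; i++) {
        transfers += count_bits(k[i].consumes & ~on_device); /* copy-ins */
        on_device |= k[i].consumes | k[i].produces;
        produced |= k[i].produces;
    }
    return transfers + count_bits(produced);                 /* copy-outs */
}
```

For kernel 1 (consumes A, produces B) followed by kernel 2 (consumes B, produces C), the naive schedule performs four transfers and the optimised one three: the copy-in of B disappears because B is already on the device.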

Performance results for CPUs

Performance results for execution on a 4-core i7 CPU:

Based on kernels from the PolyBench benchmark set

Targets: OpenCL (2 different SDKs) and OpenMP

Showing speed-ups of individual kernels

Comparing to unoptimised sequential C-code

[figure: bar chart of speed-up (log scale, 0.5x to 10x) per kernel for OpenCL (Intel SDK), OpenCL (AMD SDK) and OpenMP; kernels shown: 2mm-1, 2mm-2, 3mm-1, 3mm-2, 3mm-3, adi-1, atax-1, atax-2, bicg-1, bicg-2, correl-1, correl-2, correl-3, covar-1, covar-2, doitgen-1, doitgen-2]

[figure: bar chart of speed-up (log scale, 0.5x to 10x) per kernel for OpenCL (Intel SDK), OpenCL (AMD SDK) and OpenMP; kernels shown: fdtd2d-2, fdtd2d-3, fdtd2d-4, gemm, gemver-1, gemver-2, gemver-4, gesummv, jacobi1d-1, jacobi1d-2, jacobi2d-1, jacobi2d-2, mvt-1, mvt-2, syr2k, syrk, plus the geometric mean]

Performance results for GPUs

Comparing Bones to others on a GPU:

Based on kernels from the PolyBench benchmark set

Target: CUDA on a GTX470 GPU

Showing speed-ups of Bones over Par4All and ppcg

[Par4All and ppcg are the only other automatic C-to-CUDA compilers available]

Higher is in favour of Bones

Two different experiments:

1 Individual GPU kernels

2 Full program, including host-accelerator transfers

Performance results for GPUs (kernel only)

[figure: bar charts of the speed-up of Bones (log scale, 0.5x to 10x) per kernel, compared to Par4All and compared to PPCG; kernels shown: 2mm-1, 2mm-2, 3mm-1, 3mm-2, 3mm-3, atax-1, atax-2, bicg-1, bicg-2, correl-1, correl-2, correl-3, covar-1, covar-2, doitgen-1, doitgen-2, fdtd2d-1, fdtd2d-2, fdtd2d-3, fdtd2d-4, gemm, gemver-1, gemver-2, gemver-3, gemver-4, gesummv, jacobi1d-1, jacobi1d-2, jacobi2d-1, jacobi2d-2, mvt-1, mvt-2, syr2k, syrk, plus the geometric mean; some entries are marked ‘x’]

Performance results for GPUs (full program)

[figure: bar charts of the speed-up of Bones (log scale, 0.5x to 10x) for full programs, compared to PolyBench/GPU, Par4All, PPCG, and Bones with naive transfers; benchmarks shown: 2mm, 3mm, atax, bicg, correl, covar, doitgen, fdtd-2d, gemm, gemver, gesummv, jacobi-1d, jacobi-2d, mvt, syr2k, syrk; bars outside the axis range are labelled 368x, 933x, 378x, 0.27x, 19x, 13x and 17x, and some entries are marked ‘x’]

Summary

The source-to-source compiler Bones:

Uses algorithmic skeletons

Generates readable CUDA/OpenCL/OpenMP code

Delivers competitive GPU performance

Performs host-accelerator transfer optimisations

Is based on algorithmic species

The classification ‘algorithmic species’:

Captures memory access patterns from C source code

Automates classification through aset and a-darwin

Questions / further information

Thank you for your attention!

Bones is available at: http://parse.ele.tue.nl/bones/

For more information and links to publications, visit:

http://parse.ele.tue.nl/

http://parse.ele.tue.nl/species/

http://www.cedricnugteren.nl/
