Page 1: Parallelising Pipelined Wavefront Computations on the GPU

S.J. Pennycook, G.R. Mudalige, S.D. Hammond, and S.A. Jarvis
High Performance Systems Group
Department of Computer Science
University of Warwick, U.K.

1st UK CUDA Developers Conference, 7th Dec 2009 – Oxford, U.K.

Page 2: Overview

Wavefront Computations

A GPGPU Solution?

Wavefronts within Wavefronts

Performance Modelling

Beating the CPU – Optimisations to Win

Results, Validations and Model Projections

Current and Future Work

Conclusions

Page 3: Wavefront Computations

Wavefront computations are at the core of a number of large scientific computing workloads.

Centres including the Los Alamos National Laboratory (LANL) in the United States and the Atomic Weapons Establishment (AWE) in the UK use these codes heavily.

Lamport’s core (hyperplane) algorithm that underpins these codes has existed for more than thirty-five years.

Defining characteristics:

Operating on a grid of cells with each cell requiring some computation to be performed.

Each cell has a data dependency, such that the solution of up to three neighbouring cells is required.

Page 4: Cell Dependencies

Page 5: Motivation

Our previous work focused on analysing and optimising MPI applications that use the wavefront algorithm.

[Figure: a 2D grid of processors, Processor (1,1) to Processor (n,m), decomposing the Nx × Ny × Nz data; the solution proceeds as wavefronts through the 3D data cube.]

Page 6: Motivation (cont’d)

The algorithm operates over a three-dimensional structure of size Nx × Ny × Nz.

The grid is mapped onto a 2D m × n grid of processors; each processor is assigned a stack of (Nx/n) × (Ny/m) × Nz cells.

The data dependency results in a sequence of wavefronts (a sweep) that starts at one corner of the grid and progresses through the remaining cells.

We have modelled codes (e.g. Chimaera, LU, and Sweep3D) that employ wavefront computations with MPI.

Page 7: Motivation (cont’d)

Our focus is now on using GPUs to improve the per-processor solution time.

On the CPU, the per-processor computation is normally solved with the canonical algorithm below.

Listing: Canonical Algorithm

    For k=1; k<=kend do
      For j=1; j<=jend do
        For i=1; i<=iend do
          A(i,j,k) = A(i−1,j,k) + A(i,j−1,k) + A(i,j,k−1)   // Compute cell
        End for
      End for
    End for
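For reference, the same loop nest as a minimal, runnable C sketch. The flat-array layout, the cell() helper and the padded extent dim (one layer of boundary cells at index 0 on each axis) are illustrative assumptions, not taken from the slides.

Listing (sketch): Canonical algorithm in C

    #include <stddef.h>

    /* Index into a dim x dim x dim array stored flat, row-major. */
    static inline double *cell(double *A, int dim, int i, int j, int k)
    {
        return &A[((size_t)i * dim + j) * dim + k];
    }

    void canonical_sweep(double *A, int dim, int iend, int jend, int kend)
    {
        for (int k = 1; k <= kend; k++)
            for (int j = 1; j <= jend; j++)
                for (int i = 1; i <= iend; i++)
                    *cell(A, dim, i, j, k) = *cell(A, dim, i - 1, j, k)
                                           + *cell(A, dim, i, j - 1, k)
                                           + *cell(A, dim, i, j, k - 1);
    }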

Page 8: Hyperplane (Wavefront) Algorithm

Let f = i + j + k, g = k and h = j.

The plane defined by i + j + k = CONST is called a hyperplane.

Listing: Hyperplane Algorithm

    DO CONCURRENTLY ON EACH PROCESSOR
      For f = 3, iend+jend+kend do
        A(f−g−h, h, g) = A(f−g−h−1, h, g) + A(f−g−h, h−1, g) + A(f−g−h, h, g−1)
      End For

The critical dependencies are preserved, even though the solution is carried out across the grid in wavefronts.
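As a concrete illustration, here is a minimal serial C sketch of the hyperplane traversal over a single N × N × N sub-grid, using the same illustrative flat layout and cell() helper as the earlier sketch. Because every cell on plane f depends only on cells from plane f − 1, the two inner loops are safe to parallelise:

Listing (sketch): Hyperplane traversal in C

    #include <stddef.h>

    static inline double *cell(double *A, int dim, int i, int j, int k)
    {
        return &A[((size_t)i * dim + j) * dim + k];
    }

    void hyperplane_sweep(double *A, int N)
    {
        int dim = N + 1;                        /* interior 1..N, boundary at 0 */
        for (int f = 3; f <= 3 * N; f++) {      /* f = i + j + k                */
            for (int k = 1; k <= N; k++) {      /* g = k                        */
                for (int j = 1; j <= N; j++) {  /* h = j                        */
                    int i = f - j - k;          /* i = f - g - h                */
                    if (i < 1 || i > N)
                        continue;               /* (i,j,k) not on plane f       */
                    *cell(A, dim, i, j, k) = *cell(A, dim, i - 1, j, k)
                                           + *cell(A, dim, i, j - 1, k)
                                           + *cell(A, dim, i, j, k - 1);
                }
            }
        }
    }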

Page 9: A GPGPU Solution?

Can we utilise the many cores of a GPU to speed up this algorithm?

Theoretically simple...

Page 10: A GPGPU Solution? (cont’d)

For a 3D cube of cells, all cells lying on the same hyperplane can be computed in parallel.

Page 11: GPU Limitations

What’s the practical situation?

Experimental system – Daresbury Laboratory, U.K.:

8 × NVIDIA Tesla S1070 servers, each with four Tesla C1060 cards.

Compute nodes consist of quad-core Nehalem processors (2.53 GHz, 24 GB RAM).

Each CPU core sees one Tesla card.

Voltaire HCA410-4EX InfiniBand adapter.

NVIDIA Tesla C1060 GPU Specifications:

Each GPU card has 30 Streaming Multiprocessors (SMs), with 8 cores per SM.

Each card therefore has 240 cores (streaming processor cores).

Each core operates at 1.296 to 1.44 GHz.

4 GB Memory per card.

Page 12: GPU Limitations (cont’d)

CUDA device architecture:

[Figure: block diagram of the GPU – 30 SMs, each with 8 processor cores, registers, shared memory and constant/texture caches, connected to device DRAM (local, global, constant and texture memory spaces) and to the host.]

Page 13: GPU Limitations (cont’d)

Each SM is allocated a number of threads, arranged as blocks.

No synchronisation between threads in different blocks.

Limit of 512 threads per block.

Memory hierarchy:

Global memory access is slow and should be avoided.

Limit of 16KB of shared memory per SM.

Other considerations:

Limit of 16,384 registers per block.

Aligning half-warps for performance.
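Several of these limits vary with the card and its compute capability, so they can be checked at runtime through the CUDA runtime API. A minimal sketch, querying device 0 (the printed fields are the ones relevant to the limits above):

Listing (sketch): Querying device limits

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                  /* device 0 */

        printf("%s (compute %d.%d)\n", prop.name, prop.major, prop.minor);
        printf("multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
        printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("shared memory per block: %lu bytes\n",
               (unsigned long)prop.sharedMemPerBlock);
        printf("registers per block:     %d\n", prop.regsPerBlock);
        printf("warp size:               %d\n", prop.warpSize);
        return 0;
    }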

Page 14: A Solution?

Wavefronts within Wavefronts

The solution needs to be scalable: run more than 512 threads by exploiting parallelism across all of the multiprocessors.

The cells on each diagonal are decomposed into coarse sub-tasks and assigned to the SMs as thread blocks.

Page 15: Wavefronts within Wavefronts

Each diagonal is computed by a kernel:

    for (wave = 0; wave < (3 * (N / dimBlock.x)) - 2; wave++) {
        // Run the kernel.
        hyperplane_3d<<<dimGrid, dimBlock, shared_mem_size>>>(d_gpu, wave);
    }
    cudaThreadSynchronize();  // Not strictly necessary.

The time to compute one diagonal is ≈ ceiling(number of blocks on the diagonal / number of SMs) × (time to solve one block).

Each block utilises the resources available to an SM to solve the cells – we will talk about this later.
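The kernel body itself is not shown in the slides. The following is only a hypothetical sketch of what a per-diagonal kernel could look like, assuming B × B threads per block, an (N/B) × (N/B) grid of blocks launched every wave, and data kept in global memory (the shared-memory caching discussed later is omitted):

Listing (sketch): A possible per-diagonal kernel

    #define B 8   /* hypothetical tile edge: each block is B x B threads */

    /* Block (bj, bk) solves the B x B x B tile (bi, bj, bk), where bi is fixed
     * by the current coarse diagonal ('wave'). Within the tile, the same B x B
     * threads are re-used on every internal diagonal, with __syncthreads()
     * between diagonals to respect the dependency. Data: an (n+1)^3 flat array
     * with boundary cells at index 0 on each axis. */
    __global__ void hyperplane_3d(float *a, int n, int wave)
    {
        int bj = blockIdx.y, bk = blockIdx.x;
        int bi = wave - bj - bk;                   /* tile's diagonal position */
        if (bi < 0 || bi >= n / B) return;         /* block not on this wave   */

        int lj = threadIdx.y, lk = threadIdx.x;    /* thread's (j,k) in tile   */
        int j = bj * B + lj + 1;
        int k = bk * B + lk + 1;
        size_t dim = (size_t)n + 1;

        for (int d = 0; d <= 3 * (B - 1); d++) {   /* internal diagonals       */
            int li = d - lj - lk;
            if (li >= 0 && li < B) {
                int i = bi * B + li + 1;
                size_t c = ((size_t)i * dim + j) * dim + k;
                a[c] = a[c - dim * dim]            /* (i-1, j, k) */
                     + a[c - dim]                  /* (i, j-1, k) */
                     + a[c - 1];                   /* (i, j, k-1) */
            }
            __syncthreads();                       /* next internal diagonal   */
        }
    }

In this sketch the per-diagonal kernel launches provide the ordering between blocks; blocks that do not lie on the current diagonal simply exit.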

Page 16: A Performance Model

What does this solution mean in terms of a performance model?

Modelling block-level performance. Assume a 3D cube of data cells with dimension N:

P_GPU – number of SMs on the GPU.
W_g,GPU – time to solve a block of cells.
W_GPU – time to solve the 3D cube of cells using the GPU.
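A hedged sketch of how these quantities might combine, consistent with the per-diagonal cost stated on the previous slide. Here B is the block edge length and n_d is the number of blocks on coarse diagonal d; both symbols are introduced for this sketch and are not from the slides:

    W_{\mathrm{GPU}} \;\approx\; \sum_{d=1}^{3(N/B)-2} \left\lceil \frac{n_d}{P_{\mathrm{GPU}}} \right\rceil W_{g,\mathrm{GPU}}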

Page 17: Initial Results

Each cell is randomly initialised, and at each step calculates the average of itself and its top, north and west neighbours.

How the 3D data is decomposed has a significant effect on execution time.

Strange behaviour occurs when the number of cells is a multiple of 32 (especially at powers of 2).

Pages 18–20: Initial Results (cont’d) – results plots.

Page 21: Beating the CPU

Optimisation within the blocks:

Thread re-use.

Caching values in shared memory.

Coalesced memory accesses.

Avoiding shared memory bank-conflicts.

Optimisations over the blocks:

Explicit vs implicit CPU synchronisation (see the sketch after this list).

Inter-block synchronisation using mutexes.
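A brief sketch of the explicit vs. implicit host synchronisation choice, reusing the names from the launch loop on page 15 (num_waves is introduced here for brevity). Kernels launched into the same (default) stream execute in order, so the dependency between diagonals is respected without blocking the host:

Listing (sketch): Implicit vs. explicit host synchronisation

    int num_waves = 3 * (N / dimBlock.x) - 2;

    /* Option 1 - implicit: launches in the same stream run in order, so one
     * explicit wait is enough, just before the results are needed. */
    for (int wave = 0; wave < num_waves; wave++)
        hyperplane_3d<<<dimGrid, dimBlock, shared_mem_size>>>(d_gpu, wave);
    cudaThreadSynchronize();

    /* Option 2 - explicit: blocking the host after every launch also preserves
     * the ordering, but adds a host round-trip per diagonal. */
    for (int wave = 0; wave < num_waves; wave++) {
        hyperplane_3d<<<dimGrid, dimBlock, shared_mem_size>>>(d_gpu, wave);
        cudaThreadSynchronize();
    }

The mutex-based alternative listed above instead keeps a single kernel resident and synchronises blocks through global memory, avoiding the per-diagonal launch overhead.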

Page 22: Thread Reuse in a Block

[Figure: threads 0–15 of a 4 × 4 thread block, illustrating how the same threads are re-used as the wavefront sweeps the block diagonal by diagonal.]

Page 23: Coalesced Memory Access

[Figure: a 4 × 4 tile of cells, numbered 0–15 in row-major order:

     0  1  2  3
     4  5  6  7
     8  9 10 11
    12 13 14 15

re-ordered in memory anti-diagonal by anti-diagonal as 0 | 4 1 | 8 5 2 | 12 9 6 3 | 13 10 7 | 14 11 | 15, so that cells on the same wavefront are contiguous.]

Requires padding on devices below compute capability 1.3.

How does this apply to 3D?
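One possible way to realise this re-ordering for a 2D tile is sketched below; the helper is hypothetical and not from the slides. The tile is stored anti-diagonal by anti-diagonal, so the cells of one wavefront occupy consecutive addresses and a half-warp touching them can generate a coalesced transaction. Extending the mapping to 3D is the open question posed above.

Listing (sketch): Diagonal-major offset for an n x n tile

    #include <stdlib.h>

    /* Offset of cell (row, col) when the n x n tile is stored anti-diagonal by
     * anti-diagonal (for n = 4 this gives the ordering shown above:
     * 0 | 4 1 | 8 5 2 | 12 9 6 3 | 13 10 7 | 14 11 | 15). */
    static size_t diag_offset(int n, int row, int col)
    {
        int d = row + col;                          /* which anti-diagonal  */
        size_t off = 0;

        for (int p = 0; p < d; p++)                 /* earlier diagonals    */
            off += (size_t)(n - abs(p - (n - 1)));

        /* position within this diagonal, ordered by increasing column */
        return off + (size_t)(col - (d > n - 1 ? d - (n - 1) : 0));
    }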

Pages 24–25: Beating the CPU (Results)

The code was restructured for the GPU to avoid unnecessary branching; similar restructuring was applied to the CPU code in kind.

Re-use of threads and shared memory offers a 2x speedup over the naive GPU implementation.

Spikes remain, likely to be an issue at the warp level.

Kernel information:

17 registers.

2948 bytes of shared memory per block.

42% occupancy.

Page 26: The Bigger Picture

Current work:

Porting LU, Sweep3D and Chimaera to the GPU (CUDA and OpenCL).

Additional barriers from these larger programs:

Double precision.

Multiple computations per cell.

Looking towards the future:

How well does our algorithm perform on a consumer card (e.g. a GTX 295)?

How well will our algorithm perform on Fermi?

Benchmarking and analysis should facilitate predictions.

Page 27: Conclusions

Wavefront computations can utilise emerging GPU architectures, despite their dependencies.

To see a speedup:

Memcpy() transfers between host and device need to be faster.

More work must be done on the device per Memcpy().

Codes cannot be ported naively. Hardware limitations may be a problem (particularly for larger codes).

Performance modelling will offer insights into which applications can be ported successfully.