
Transcript of ClumeqCastonguay.pdf

Page 1: ClumeqCastonguay.pdf

Accelerating CFD

Simulations with GPUs

Patrice Castonguay

HPC Developer Technology Engineer

December 4th, 2012

Page 2: ClumeqCastonguay.pdf

Outline

Computational Fluid Dynamics

Why GPUs?

Accelerating ANSYS Fluent (commercial CFD software)

Background

Novel parallel algorithms to replace sequential algorithms

Performance results

Accelerating SD3D (research code developed at Stanford)

Background

Performance results

Conclusions

2

Page 3: ClumeqCastonguay.pdf

Computational Fluid Dynamics

Page 4: ClumeqCastonguay.pdf

Computational Fluid Dynamics

Simulation of fluid flows

Large number of applications

Page 5: ClumeqCastonguay.pdf

Computational Fluid Dynamics

Navier-Stokes equations: coupled nonlinear partial differential equations which govern unsteady, compressible, viscous fluid flows

5

Page 6: ClumeqCastonguay.pdf

Computational Fluid Dynamics

Partition the domain into a large number of cells, and solve for the fluid properties (density, velocity, pressure) inside each cell

6

Page 7: ClumeqCastonguay.pdf

Why GPUs?

Page 8: ClumeqCastonguay.pdf

Why GPUs

Higher performance

Intel CPU (8-core Sandy Bridge):

fp32 performance: 384 Gflops

fp64 performance: 192 Gflops

Memory bandwidth: 52 GB/s

NVIDIA GPU (Tesla GK110):

fp32 performance: 3935 Gflops (10x)

fp64 performance: 1311 Gflops (6.8x)

Memory bandwidth: 250 GB/s (4.8x)

8

Page 9: ClumeqCastonguay.pdf

Why GPUs

Power efficiency

Traditional CPUs not economically feasible

Jaguar (3rd fastest supercomputer in Nov. 2011)

2.3 petaflops @ 7 megawatts

7 megawatts ≅ 7,000 homes

9

Page 10: ClumeqCastonguay.pdf

Why GPUs

Power efficiency: traditional CPUs are not economically feasible at this scale

Jaguar @ 2.3 petaflops: 7 megawatts ≅ 7,000 homes

Scaled to 100 petaflops: 300 megawatts ≅ 300,000 homes ≅ Quebec City and its metropolitan area

10

Page 11: ClumeqCastonguay.pdf

Why GPUs

CPU: optimized for serial tasks

GPU accelerator: optimized for many parallel tasks

Higher computational power per watt

11

Page 12: ClumeqCastonguay.pdf

Why GPUs

Many scientific applications already benefit from GPUs

12

[Bar chart: relative performance of single CPU + M2090 and single CPU + K20X versus a dual-socket Sandy Bridge (E5-2687W @ 3.10 GHz) for WS-LSMS, Chroma, SPECFEM3D, AMBER and NAMD; the K20X configurations reach roughly 2.7x to 10.2x the dual-CPU performance.]

Page 13: ClumeqCastonguay.pdf

Why GPUs

A thriving gaming market funds GPU development for supercomputers

Algorithms found in computer games are surprisingly similar to the ones found in scientific applications

13

Page 14: ClumeqCastonguay.pdf

ANSYS Fluent

14

Page 15: ClumeqCastonguay.pdf

Largest share of the CFD software market

Finite volume method

Many different options (incompressible/compressible, inviscid/viscous, two-phase, explicit/implicit, …)

Most popular option: implicit incompressible Navier-Stokes solver

ANSYS Fluent

15

Page 16: ClumeqCastonguay.pdf

Incompressible Navier-Stokes

Coupled non-linear PDEs

Unknowns: u, v, w and p (density is constant)

Mass: $\nabla \cdot \mathbf{u} = 0$

Momentum x: $\frac{\partial u}{\partial t} + \mathbf{u} \cdot \nabla u = -\frac{1}{\rho}\frac{\partial p}{\partial x} + \nu \nabla^2 u$

Momentum y: $\frac{\partial v}{\partial t} + \mathbf{u} \cdot \nabla v = -\frac{1}{\rho}\frac{\partial p}{\partial y} + \nu \nabla^2 v$

Momentum z: $\frac{\partial w}{\partial t} + \mathbf{u} \cdot \nabla w = -\frac{1}{\rho}\frac{\partial p}{\partial z} + \nu \nabla^2 w$

16

Page 17: ClumeqCastonguay.pdf

Incompressible Navier-Stokes

Solution procedure:

1. Assemble the linear system of equations (~33% of runtime)

2. Solve the linear system Ax = b (~67% of runtime)

3. Converged? If no, repeat; if yes, stop.

The linear solve dominates the runtime, so accelerate it first.

17

Page 18: ClumeqCastonguay.pdf

Incompressible Navier-Stokes

Aij is 4x4 matrix

In practice,

millions of rows

xi is 4x1 vector

( xi = [pi, ui, vi, wi] )

Large sparse system of equations with 4x4 block entries

18
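For readers unfamiliar with block-sparse storage, here is a minimal sketch (not Fluent's or nvAMG's actual data structures) of a block-CSR container with 4x4 blocks, together with a CPU reference matrix-vector product over it; all names are illustrative.

```cuda
// Minimal block-CSR (BSR) container for a sparse matrix with 4x4 blocks.
// Illustrative sketch only, not ANSYS Fluent's internal format.
#include <vector>

struct BsrMatrix {
    int num_block_rows;                // number of cells
    std::vector<int>    row_ptr;       // size num_block_rows + 1
    std::vector<int>    col_idx;       // block column index of each non-zero block
    std::vector<double> blocks;        // 4x4 blocks, row-major, 16 doubles per non-zero
};

// y = A * x, where x and y hold 4 values per cell (CPU reference version).
// Assumes y.size() == 4 * A.num_block_rows.
void bsr_spmv(const BsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (int i = 0; i < A.num_block_rows; ++i) {
        double acc[4] = {0.0, 0.0, 0.0, 0.0};
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            const double* blk = &A.blocks[16 * k];       // 4x4 block A_ij
            const double* xj  = &x[4 * A.col_idx[k]];    // 4x1 block of x
            for (int r = 0; r < 4; ++r)
                for (int c = 0; c < 4; ++c)
                    acc[r] += blk[4 * r + c] * xj[c];
        }
        for (int r = 0; r < 4; ++r) y[4 * i + r] = acc[r];
    }
}
```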

Page 19: ClumeqCastonguay.pdf

Multigrid

2D Poisson equation, Jacobi smoother

[Figure: convergence of the error norm, with snapshots of the initial error and the error after 5, 10 and 15 iterations.]

19

Page 20: ClumeqCastonguay.pdf

Multigrid

Most simple iterative solvers are efficient at damping high-frequency errors, but inefficient at damping low-frequency errors

Key idea: represent the error on a coarser grid so that low-frequency errors become high-frequency errors

20

Page 21: ClumeqCastonguay.pdf

Multigrid

Fine grid: smooth → restrict → coarse grid: smooth → …

21

Page 22: ClumeqCastonguay.pdf

Algebraic Multigrid (AMG)

Example: two-level V-cycle

Pre-smooth → compute residual → restrict residual (create coarse-level system) → solve → prolongate correction → post-smooth

22

Page 23: ClumeqCastonguay.pdf

Algebraic Multigrid (AMG)

Apply recursively: at each level, pre-smooth and coarsen (from ~10^6 unknowns on the finest level down to 1-2 unknowns on the coarsest), solve on the coarsest level, then prolongate the correction and post-smooth on the way back up

23
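To make the recursion concrete, here is a small self-contained V-cycle on a 1D Poisson model problem with a weighted-Jacobi smoother. It is a geometric stand-in that only illustrates the cycle structure (pre-smooth, restrict, recurse, prolongate, post-smooth); it is not the aggregation-based AMG used in Fluent/nvAMG, and the smoother counts and problem sizes are arbitrary choices.

```cuda
// Recursive V-cycle for -u'' = f on (0,1), A = tridiag(-1,2,-1)/h^2 on n interior points.
#include <cstdio>
#include <vector>
using Vec = std::vector<double>;

// r = f - A u
static Vec residual(const Vec& u, const Vec& f, double h) {
    int n = (int)u.size();
    Vec r(n);
    for (int i = 0; i < n; ++i) {
        double left  = (i > 0)     ? u[i - 1] : 0.0;
        double right = (i < n - 1) ? u[i + 1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - left - right) / (h * h);
    }
    return r;
}

// A few sweeps of weighted Jacobi (omega = 2/3): damps high-frequency error.
static void smooth(Vec& u, const Vec& f, double h, int sweeps) {
    int n = (int)u.size();
    for (int s = 0; s < sweeps; ++s) {
        Vec un = u;
        for (int i = 0; i < n; ++i) {
            double left  = (i > 0)     ? u[i - 1] : 0.0;
            double right = (i < n - 1) ? u[i + 1] : 0.0;
            double jac = 0.5 * (left + right + h * h * f[i]);
            un[i] = u[i] + (2.0 / 3.0) * (jac - u[i]);
        }
        u = un;
    }
}

static void v_cycle(Vec& u, const Vec& f, double h) {
    int n = (int)u.size();
    if (n <= 1) {                          // coarsest level: solve the 1x1 system exactly
        if (n == 1) u[0] = f[0] * h * h / 2.0;
        return;
    }
    smooth(u, f, h, 3);                    // pre-smooth
    Vec r = residual(u, f, h);
    int nc = (n - 1) / 2;                  // coarse grid (assumes n = 2^k - 1)
    Vec fc(nc), ec(nc, 0.0);
    for (int i = 0; i < nc; ++i)           // full-weighting restriction of the residual
        fc[i] = 0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2];
    v_cycle(ec, fc, 2.0 * h);              // recurse on the coarse error equation
    for (int i = 0; i < nc; ++i) {         // linear prolongation, add correction
        u[2 * i]     += 0.5 * ec[i];
        u[2 * i + 1] += ec[i];
        u[2 * i + 2] += 0.5 * ec[i];
    }
    smooth(u, f, h, 3);                    // post-smooth
}

int main() {
    int n = 127;                           // 2^7 - 1 interior points
    double h = 1.0 / (n + 1);
    Vec u(n, 0.0), f(n, 1.0);              // solve -u'' = 1 with u(0) = u(1) = 0
    for (int it = 0; it < 10; ++it) {
        v_cycle(u, f, h);
        Vec r = residual(u, f, h);
        double rn = 0.0; for (double v : r) rn += v * v;
        printf("cycle %d, residual norm^2 = %.3e\n", it, rn);
    }
    return 0;
}
```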

Page 24: ClumeqCastonguay.pdf

ANSYS Fluent

We have accelerated the AMG solver in Fluent with GPUs

Some parts of the algorithm were easily ported to the GPU:

Sparse matrix-vector multiplications

Norm computations

Other operations were inherently sequential:

Aggregation procedure to create the restriction operator

Gauss-Seidel and DILU smoothers

Matrix-matrix multiplication

Need to develop novel parallel algorithms!

24
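As an illustration of why sparse matrix-vector products port easily, here is a minimal scalar CSR SpMV in CUDA with one thread per row; the real solver works on 4x4 blocks and uses a more tuned thread mapping, so treat this only as a sketch of the parallel pattern.

```cuda
// Scalar CSR sparse matrix-vector product, one thread per row (illustrative).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void csr_spmv(int n, const int* row_ptr, const int* col_idx,
                         const double* val, const double* x, double* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double acc = 0.0;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        acc += val[k] * x[col_idx[k]];   // each row is independent of the others
    y[row] = acc;
}

int main() {
    // Tiny 3x3 example: A = [[4,-1,0],[-1,4,-1],[0,-1,4]], x = [1,1,1]
    int h_row_ptr[] = {0, 2, 5, 7};
    int h_col_idx[] = {0, 1, 0, 1, 2, 1, 2};
    double h_val[]  = {4, -1, -1, 4, -1, -1, 4};
    double h_x[]    = {1, 1, 1}, h_y[3];

    int *d_rp, *d_ci; double *d_v, *d_x, *d_y;
    cudaMalloc(&d_rp, sizeof(h_row_ptr)); cudaMalloc(&d_ci, sizeof(h_col_idx));
    cudaMalloc(&d_v, sizeof(h_val)); cudaMalloc(&d_x, sizeof(h_x)); cudaMalloc(&d_y, sizeof(h_y));
    cudaMemcpy(d_rp, h_row_ptr, sizeof(h_row_ptr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ci, h_col_idx, sizeof(h_col_idx), cudaMemcpyHostToDevice);
    cudaMemcpy(d_v, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    csr_spmv<<<1, 32>>>(3, d_rp, d_ci, d_v, d_x, d_y);
    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);
    printf("y = %g %g %g\n", h_y[0], h_y[1], h_y[2]);  // expected: 3 2 3
    return 0;
}
```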

Page 25: ClumeqCastonguay.pdf

Aggregation

How do we create the coarser levels?

Graph representation of a matrix: edge weight w_{3,5} = f(A_{3,5}, A_{5,3})

[Figure: example graph with six numbered vertices representing the sparsity pattern of the matrix.]

25

Page 26: ClumeqCastonguay.pdf

Aggregation

How do we create the coarser levels?

Edge weight between vertices i and j: w_{i,j} = f(A_{i,j}, A_{j,i})

26

Page 27: ClumeqCastonguay.pdf

Aggregation

In aggregation-based AMG, group vertices that are strongly connected to each other

27

Page 28: ClumeqCastonguay.pdf

Aggregation

New matrix/graph:

28

Page 29: ClumeqCastonguay.pdf

Aggregation

We want to merge vertices that are strongly connected to each other (edge weight w_{i,j} = f(A_{i,j}, A_{j,i}))

Similar to the weighted graph matching problem

29

Page 30: ClumeqCastonguay.pdf

Parallel aggregation

Each vertex extends a hand to its strongest neighbor

30

Page 31: ClumeqCastonguay.pdf

Parallel aggregation

Each vertex checks if its strongest neighbor extended a hand back

31

Page 32: ClumeqCastonguay.pdf

Parallel aggregation

32

Page 33: ClumeqCastonguay.pdf

Parallel aggregation

Repeat with unmatched vertices

33
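A minimal CUDA sketch of one handshaking round, assuming the graph and edge weights are stored in CSR form and that aggregate[v] = -1 marks an unaggregated vertex; the production aggregation in nvAMG also handles ties, aggregate-size limits and leftover singletons, none of which appear here.

```cuda
// One round of pair-wise "handshaking" aggregation (illustrative only).
#include <cuda_runtime.h>

// Step 1: each unaggregated vertex extends a hand to its strongest unaggregated neighbour.
__global__ void extend_hand(int n, const int* row_ptr, const int* col_idx,
                            const double* weight, const int* aggregate, int* hand) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || aggregate[v] >= 0) return;       // already aggregated
    int best = -1; double best_w = -1.0;
    for (int k = row_ptr[v]; k < row_ptr[v + 1]; ++k) {
        int u = col_idx[k];
        if (u != v && aggregate[u] < 0 && weight[k] > best_w) {
            best_w = weight[k];
            best = u;
        }
    }
    hand[v] = best;                                // -1 if no available neighbour
}

// Step 2: if my strongest neighbour extended a hand back, merge the pair.
__global__ void match_pairs(int n, const int* hand, int* aggregate, int* num_aggregates) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || aggregate[v] >= 0) return;
    int u = hand[v];
    if (u >= 0 && hand[u] == v && v < u) {         // mutual handshake; v < u avoids duplicates
        int agg = atomicAdd(num_aggregates, 1);    // new aggregate id
        aggregate[v] = agg;
        aggregate[u] = agg;
    }
}

// Host side (not shown): relaunch extend_hand and match_pairs until (almost) all
// vertices are aggregated, then assign the remaining singletons to neighbouring aggregates.
```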

Page 34: ClumeqCastonguay.pdf

Parallel aggregation

34

Page 35: ClumeqCastonguay.pdf

Parallel aggregation

35

Page 36: ClumeqCastonguay.pdf

Parallel aggregation

36

Page 37: ClumeqCastonguay.pdf

Parallel aggregation

37

Page 38: ClumeqCastonguay.pdf

Parallel aggregation

38

Page 39: ClumeqCastonguay.pdf

Smoothers

Smoothing step: x^{k+1} = x^k + M^{-1} (b - A x^k)

M is called the preconditioning matrix

Here x is the solution vector (it includes all unknowns in the system)

39

Page 40: ClumeqCastonguay.pdf

Jacobi Smoother

For Jacobi, M is block-diagonal (the block-diagonal part of A)

Inherently parallel: each x_i can be updated independently of the others

Maps very well to the GPU

40
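A possible CUDA sketch of one block-Jacobi sweep on the 4x4-block system, assuming the inverted diagonal blocks have been precomputed; this illustrates the "one thread per block row" pattern, not Fluent's implementation.

```cuda
// One block-Jacobi sweep: x_new = x_old + D^{-1} (b - A x_old), with 4x4 blocks.
#include <cuda_runtime.h>

__global__ void block_jacobi_sweep(int n_rows, const int* row_ptr, const int* col_idx,
                                   const double* blocks,   // 16 doubles per non-zero block
                                   const double* dinv,     // 16 doubles per row: inverse of A_ii
                                   const double* b, const double* x_old, double* x_new) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rows) return;

    // r = b_i - sum_j A_ij * x_old_j  (the diagonal block is included in the sum)
    double r[4] = {b[4 * i], b[4 * i + 1], b[4 * i + 2], b[4 * i + 3]};
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
        const double* A_ij = &blocks[16 * k];
        const double* xj   = &x_old[4 * col_idx[k]];
        for (int rr = 0; rr < 4; ++rr)
            for (int c = 0; c < 4; ++c)
                r[rr] -= A_ij[4 * rr + c] * xj[c];
    }

    // x_new_i = x_old_i + A_ii^{-1} * r
    const double* Dinv = &dinv[16 * i];
    for (int rr = 0; rr < 4; ++rr) {
        double upd = 0.0;
        for (int c = 0; c < 4; ++c) upd += Dinv[4 * rr + c] * r[c];
        x_new[4 * i + rr] = x_old[4 * i + rr] + upd;
    }
}
```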

Page 41: ClumeqCastonguay.pdf

Parallel Smoothers

The smoothers available in Fluent (Gauss-Seidel and DILU) are sequential!

For the DILU smoother, M has the form M = (E + L) E^{-1} (E + U), where L and U are the strict lower and upper triangular parts of A

E is a block-diagonal matrix chosen such that diag(M) = diag(A)

Can recover ILU(0) for certain matrices

Only requires one extra diagonal of storage

41

Page 42: ClumeqCastonguay.pdf

DILU Smoother

Construction of the block-diagonal matrix E is sequential

Inversion of M is also sequential (two triangular solves)

42

Page 43: ClumeqCastonguay.pdf

Coloring

Use coloring to extract parallelism

Coloring: an assignment of colors to vertices such that no two vertices of the same color are adjacent

43

[Figure: example graph with a valid vertex coloring.]

Page 44: ClumeqCastonguay.pdf

Coloring

Use coloring to extract parallelism

With m unknowns and p colors, m/p unknowns can be processed in parallel

44
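The sketch below shows how a coloring is typically exploited: the host loops over the colors and launches one kernel per color, and every row of the current color can be relaxed in parallel. A scalar Gauss-Seidel-style sweep is used for brevity; the same loop structure would apply to a multi-color DILU smoother.

```cuda
// Multi-colour Gauss-Seidel sweep (scalar version for brevity).
#include <cuda_runtime.h>

__global__ void gs_sweep_one_color(int n, int color, const int* row_color,
                                   const int* row_ptr, const int* col_idx,
                                   const double* val, const double* diag,
                                   const double* b, double* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || row_color[i] != color) return;    // only rows of the current colour
    double acc = b[i];
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        if (col_idx[k] != i)
            acc -= val[k] * x[col_idx[k]];          // uses the latest values of other colours
    x[i] = acc / diag[i];                           // same-coloured rows are never adjacent
}

// Host side: one full sweep visits each colour once.
void gs_sweep(int n, int num_colors, const int* d_row_color, const int* d_row_ptr,
              const int* d_col_idx, const double* d_val, const double* d_diag,
              const double* d_b, double* d_x) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    for (int c = 0; c < num_colors; ++c)
        gs_sweep_one_color<<<blocks, threads>>>(n, c, d_row_color, d_row_ptr,
                                                d_col_idx, d_val, d_diag, d_b, d_x);
}
```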

Page 45: ClumeqCastonguay.pdf

DILU Smoother

With coloring, construction of the block-diagonal matrix E is now parallel

Inversion of M is also parallel

45

Page 46: ClumeqCastonguay.pdf

Parallel Graph Coloring – Min/Max

How do you color a graph/matrix in parallel?

Parallel graph coloring algorithm of Luby; a new variant was developed at NVIDIA

46

Page 47: ClumeqCastonguay.pdf

Parallel Graph Coloring – Min/Max

Assign a random number to each vertex

47

[Figure: graph with a random number assigned to each vertex.]

Page 48: ClumeqCastonguay.pdf

Parallel Graph Coloring – Min/Max

Round 1: Each vertex checks if it’s a local maximum or minimum

If max, color=dark blue. If min, color=green

48


Page 49: ClumeqCastonguay.pdf

Parallel Graph Coloring – Min/Max

Round 2: Each vertex checks if it’s a local maximum or minimum.

If max, color=pink. If min, color=red

49


Page 50: ClumeqCastonguay.pdf

Parallel Graph Coloring – Min/Max

Round 3: Each vertex checks if it’s a local maximum or minimum.

If max, color=purple. If min, color=white

50

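A minimal CUDA sketch of one round of this min/max coloring, assuming a CSR adjacency structure, a precomputed random value per vertex, and colors initialized to -1; it reads a snapshot of the colors decided in earlier rounds to avoid races, and it is not NVIDIA's exact variant.

```cuda
// One round of Luby-style min/max colouring (illustrative only).
#include <cuda_runtime.h>

__global__ void color_minmax_round(int n, int round, const int* row_ptr, const int* col_idx,
                                   const unsigned int* rnd,
                                   const int* color_in,   // colours from earlier rounds (read-only)
                                   int* color_out) {      // copy of color_in, updated this round
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || color_in[v] >= 0) return;               // already coloured
    bool is_max = true, is_min = true;
    for (int k = row_ptr[v]; k < row_ptr[v + 1]; ++k) {
        int u = col_idx[k];
        if (u == v || color_in[u] >= 0) continue;          // only compete against uncoloured neighbours
        if (rnd[u] >= rnd[v]) is_max = false;              // ties disqualify both (stays conflict-free)
        if (rnd[u] <= rnd[v]) is_min = false;
    }
    if (is_max)      color_out[v] = 2 * round;             // "local maximum" colour of this round
    else if (is_min) color_out[v] = 2 * round + 1;         // "local minimum" colour of this round
    // Host side: start with color_out = color_in, swap the two arrays after each round,
    // and repeat with round + 1 until every vertex is coloured.
}
```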

Page 51: ClumeqCastonguay.pdf

AMG Timings

CPU Fluent solver: AMG(F-cycle, agg8, DILU, 0pre, 3post)

GPU nvAMG solver: AMG(V-cycle, agg8, MC-DILU, 0pre, 3post)

[Bar chart: AMG solver timings (lower is better) for the Helix case on hex 208K and tet 1173K meshes, comparing K20X, C2090 and a six-core 3930K CPU.]

51

Page 52: ClumeqCastonguay.pdf

AMG Timings

CPU Fluent solver: AMG(F-cycle, agg8, DILU, 0pre, 3post)

GPU nvAMG solver: AMG(V-cycle, agg2, MC-DILU, 0pre, 3post)

[Bar chart: AMG solver timings (lower is better) for the Airfoil (hex 784K) and Aircraft (hex 1798K) cases, comparing K20X, C2090 and a six-core 3930K CPU.]

52

Page 53: ClumeqCastonguay.pdf

SD3D

Page 54: ClumeqCastonguay.pdf

Flux Reconstruction (FR) Method

Proposed by Huynh in 2007; similar to the Spectral Difference and Discontinuous Galerkin methods

High-order method: spatial order of accuracy is > 2

[Plot: error vs. computational cost for a low-order and a high-order method.]

54

Page 55: ClumeqCastonguay.pdf

Flux Reconstruction (FR) Method

Solution in each element approximated by a multi-dimensional polynomial of order N

Order of accuracy: h^(N+1)

Multiple DOFs per element

[Figure: solution points within an element for N = 1, 2, 3, 4.]

55

Page 56: ClumeqCastonguay.pdf

Flux Reconstruction (FR) Method

Plunging airfoil: zero AOA, Re=1850, frequency: 2.46 rad/s

5th order accuracy in space, 4th order accurate RK time stepping

56

Page 57: ClumeqCastonguay.pdf

Flux Reconstruction (FR) Method

Computations are demanding:

Millions of DOFs

Hundreds of thousands of time steps

More work per DOF compared to low-order methods

Until recently, high-order simulations over complex 3D geometries were intractable unless you had access to a large cluster

GPUs to the rescue!

57

Page 58: ClumeqCastonguay.pdf

GPU Implementation

Ported the entire code to the GPU

FR and other high-order methods for unstructured grids map well to GPUs:

Large amount of parallelism (millions of DOFs)

More work per DOF compared to low-order methods

Cell-local operations benefit from fast user-managed on-chip memory

Required some programming effort, but it was worthwhile…

58
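To illustrate the last point, here is a generic CUDA sketch (not SD3D's actual kernels) of a cell-local operation: one thread block per element stages that element's DOFs in shared memory and applies a small dense operator matrix to them. The DOF count, layout and launch configuration are assumptions for illustration only.

```cuda
// Cell-local FR-style operation: one thread block per element, DOFs staged in shared memory.
#include <cuda_runtime.h>

// n_dof: solution points per element (e.g. 35 for a 4th-order tet, hypothetical)
// op:    n_dof x n_dof operator matrix (e.g. a nodal derivative matrix), same for every element
// u:     solution DOFs, element-major layout [elem][dof]
// du:    output, same layout
__global__ void apply_cell_operator(int n_dof, const double* op,
                                    const double* u, double* du) {
    extern __shared__ double u_s[];                 // one element's DOFs in on-chip memory
    int e = blockIdx.x;                             // one block per element
    int i = threadIdx.x;                            // one thread per output DOF

    if (i < n_dof) u_s[i] = u[e * n_dof + i];       // stage the element's DOFs
    __syncthreads();

    if (i < n_dof) {
        double acc = 0.0;
        for (int j = 0; j < n_dof; ++j)             // du_i = sum_j op_ij * u_j
            acc += op[i * n_dof + j] * u_s[j];
        du[e * n_dof + i] = acc;
    }
}

// Hypothetical launch: 35 DOFs per element, shared memory sized to match.
// apply_cell_operator<<<n_elements, 64, 35 * sizeof(double)>>>(35, d_op, d_u, d_du);
```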

Page 59: ClumeqCastonguay.pdf

Single-GPU Implementation

Speedup of the single-GPU algorithm (Tesla C2050) relative to a parallel computation on a six-core Xeon X5670 (Westmere) @ 2.9 GHz

[Bar chart: speedup vs. order of accuracy (3 to 6) for tetrahedra, hexahedra and prisms; speedups range from about 5.5x to 11.3x.]

59

Page 60: ClumeqCastonguay.pdf

Applications

Unsteady viscous flow over a sphere at Reynolds number 300, Mach 0.3

28,914 prisms and 258,075 tets, 4th-order accuracy, 6.3 million DOFs

Ran on a desktop machine with 3 C2050 GPUs

60

Page 61: ClumeqCastonguay.pdf

Applications

Iso-surfaces of Q-criterion colored by Mach number for flow over a sphere at Re=300, M=0.3

3 GPUs: same performance as 21 Xeon X5670 CPUs (126 cores)

A personal computer with 3 GPUs: ~$10,000, and easy to manage

61

Page 62: ClumeqCastonguay.pdf

Multi-GPU Implementation

Speedup relative to 1 GPU for a 6th-order accurate simulation running on a mesh with 55,947 tetrahedral elements

[Plot: speedup relative to 1 GPU vs. number of GPUs (2 to 32) for three variants: no overlap, communication overlap, and communication plus GPU-transfer overlap.]

62
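A hedged sketch of the overlap idea using CUDA streams and MPI: halo data is packed and transferred on one stream while interior elements are processed on another. The kernel names are empty placeholders and the buffer handling is simplified; this shows the pattern, not SD3D's implementation.

```cuda
// Overlapping MPI halo exchange with interior computation (illustrative pattern only).
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical placeholder kernels; the real ones do the FR face/element work.
__global__ void pack_boundary_faces() { /* placeholder: gather halo data into the send buffer */ }
__global__ void compute_interior()    { /* placeholder: elements that need no remote data */ }
__global__ void compute_boundary()    { /* placeholder: elements that touch the halo */ }

// h_send/h_recv are assumed to be pinned (cudaMallocHost) so the async copies can overlap.
void time_step_with_overlap(double* d_send, double* d_recv, double* h_send, double* h_recv,
                            int n_halo, int neighbor,
                            cudaStream_t comm_stream, cudaStream_t compute_stream) {
    // 1. Pack halo data and start the device-to-host copy on the communication stream.
    pack_boundary_faces<<<1, 1, 0, comm_stream>>>();
    cudaMemcpyAsync(h_send, d_send, n_halo * sizeof(double),
                    cudaMemcpyDeviceToHost, comm_stream);

    // 2. Meanwhile, interior elements run concurrently on the compute stream.
    compute_interior<<<1, 1, 0, compute_stream>>>();

    // 3. Exchange halos over MPI once the copy has finished.
    cudaStreamSynchronize(comm_stream);
    MPI_Sendrecv(h_send, n_halo, MPI_DOUBLE, neighbor, 0,
                 h_recv, n_halo, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 4. Copy the received halo back and finish the boundary elements.
    cudaMemcpyAsync(d_recv, h_recv, n_halo * sizeof(double),
                    cudaMemcpyHostToDevice, comm_stream);
    cudaStreamSynchronize(comm_stream);
    compute_boundary<<<1, 1, 0, compute_stream>>>();
    cudaStreamSynchronize(compute_stream);
}
```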

Page 63: ClumeqCastonguay.pdf

Applications

Transitional flow over SD7003 airfoil, Re=60000, Mach=0.2, AOA=4°

4th order accurate solution, 400000 RK iterations, 21.2 million DOFs

63

Page 64: ClumeqCastonguay.pdf

Applications

64

15 hours on 16 C2070 GPUs

157 hours (> 6 days) on 16 Xeon X5670 CPUs

Page 65: ClumeqCastonguay.pdf

Conclusions

Most scientific applications have a large amount of parallelism

Parallel applications map well to GPUs, which have hundreds of simple, power-efficient cores

Higher performance, higher performance per watt

Presented two successful uses of GPUs in CFD:

Linear solver in ANSYS Fluent (hard to parallelize)

Research-oriented CFD code (SD3D)

65

Page 66: ClumeqCastonguay.pdf

Conclusions

The future of HPC is CPU + GPU/accelerator

Need to develop new parallel numerical methods to replace inherently sequential algorithms (such as Gauss-Seidel and ILU preconditioners)

The gap between peak flops and memory bandwidth is still growing

Flops are "free"

Need to develop numerical methods that have a larger flop-to-byte ratio

66

Page 67: ClumeqCastonguay.pdf

Questions?

Patrice Castonguay

[email protected]