
R Gandham, T Warburton, Department of Computational and Applied Mathematics, Rice University

ALMOND: Algebraic Multigrid On Numerous Devices

The authors gratefully acknowledge funding support from DOE, ANL, ONR, NVIDIA, and AMD, and an internship at Stoneridge Technology Company.

AMG components (diagram):
✦ Setup: aggregate, Galerkin product, construct operators.
✦ Cycling: smoothing, grid transfers.
✦ Underlying kernels: SpM transpose, graph coloring, sort, SpMV (see the SpMV sketch below).
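Sparse matrix-vector multiply (SpMV) is the workhorse among these kernels. As a point of reference, a minimal serial Python sketch of CSR-format SpMV; the indptr/indices/data layout follows SciPy's convention (an assumption), and the poster's actual kernels are GPU/OpenMP implementations, not shown:

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x for a CSR matrix (row pointers, column indices, values).

    Each row is independent, which is what makes SpMV a natural
    fine-grained parallel kernel on GPUs (one thread or warp per row).
    """
    n = len(indptr) - 1
    y = np.zeros(n)
    for row in range(n):
        start, end = indptr[row], indptr[row + 1]
        y[row] = data[start:end] @ x[indices[start:end]]
    return y
```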

Applications
✦ Reservoir simulations.
✦ Computational fluid dynamics.
✦ Solid mechanics.

Challenges
✦ Heterogeneous media.
✦ Millions of unknowns.
✦ Efficiency and portability.

Elliptic PDE model: $-\nabla \cdot (\kappa \nabla p) = \nabla \cdot u$

$Ax = b, \quad A \in \mathbb{R}^{n \times n}, \quad x, b \in \mathbb{R}^{n}$
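For a concrete (hypothetical) instance, a minimal SciPy sketch of assembling such a sparse system from a 1D finite-difference discretization of the elliptic model; the face coefficients kappa stand in for the heterogeneous medium, and the poster's results use 3D FEM instead:

```python
import numpy as np
import scipy.sparse as sp

def poisson_1d(n, kappa=None):
    """Assemble -d/dx (kappa dp/dx) on a uniform 1D grid (unit spacing).

    kappa: array of n+1 face coefficients; defaults to 1 (constant medium).
    Dirichlet boundary values are assumed folded into the right-hand side.
    Returns a sparse tridiagonal A such that A @ p = b.
    """
    k = np.ones(n + 1) if kappa is None else kappa
    lower = -k[1:-1]           # coupling to the left neighbor
    diag = k[:-1] + k[1:]      # diagonal: sum of the two face coefficients
    upper = -k[1:-1]           # coupling to the right neighbor
    return sp.diags([lower, diag, upper], [-1, 0, 1], format='csr')

A = poisson_1d(8)
b = np.ones(8)
```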

✦ Reservoir simulations are crucial for optimizing oil production rates.
✦ Simulations are accelerated by preconditioning techniques.

✦ Algebraic Multigrid (AMG) is a possible preconditioner, but ...
✦ AMG presents challenges for many-core computation.

✦ OCCA provides portability across architectures [Medina et al.].

✦ Piecewise constant interpolation [Notay et al.].
✦ Optimality via Krylov accelerated cycles [Notay et al.].
✦ GPU aggregation using a maximal independent set [Bell et al.] (see the sketch after this list).
✦ Fine-grain parallel MIS [Bell et al.].
✦ Galerkin product through sort [Gandham et al.].
✦ Radix sort on GPU [Hoberock et al.].
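A serial sketch of the aggregation and Galerkin-product ideas above, assuming a SciPy CSR matrix. The greedy MIS loop is a sequential stand-in for the fine-grain parallel MIS of [Bell et al.], and P is the piecewise constant interpolation of [Notay et al.]; the sort-based GPU Galerkin product of [Gandham et al.] is replaced here by SciPy's sparse matmul:

```python
import numpy as np
import scipy.sparse as sp

def aggregate_mis(A):
    """Group unknowns into aggregates rooted at a maximal independent set
    of the graph of A (A is a scipy.sparse.csr_matrix)."""
    n = A.shape[0]
    agg = -np.ones(n, dtype=int)   # aggregate id per node, -1 = unassigned
    n_agg = 0
    for i in range(n):
        if agg[i] == -1:           # i is not adjacent to any root: new root
            agg[i] = n_agg
            nbrs = A.indices[A.indptr[i]:A.indptr[i + 1]]
            agg[nbrs[agg[nbrs] == -1]] = n_agg  # absorb unassigned neighbors
            n_agg += 1
    return agg, n_agg

def galerkin_coarsen(A, agg, n_agg):
    """Piecewise constant P (one column per aggregate) and A_c = P^T A P."""
    n = A.shape[0]
    P = sp.csr_matrix((np.ones(n), (np.arange(n), agg)), shape=(n, n_agg))
    R = P.T.tocsr()                # restriction = transpose of interpolation
    return R @ A @ P, P, R
```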

Recursive multigrid cycle (diagram, sketched in code below): a hierarchy of operators $A_0, A_1, A_2, A_3$; descending the cycle restricts the residual, e.g. $r_2 \leftarrow R_1 r_1$, and ascending applies the prolongated coarse-grid correction, e.g. $x_1 \leftarrow x_1 + P_1 x_2$.

Construct coarse grids via aggregation (diagram): fine-grid nodes 1-12 are grouped into aggregates 1-3, and each coarse operator is formed by the Galerkin product $A_{l+1} = R_l A_l P_l$.
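A minimal serial sketch of the recursive cycle in the diagram. Damped Jacobi smoothing and the 0.8 damping factor are illustrative assumptions (the poster does not specify the smoother), and the Krylov acceleration of [Notay et al.] is omitted:

```python
import numpy as np

def v_cycle(level, levels, b, x, nu=2):
    """One recursive multigrid cycle on a hierarchy built by aggregation.

    levels[l] = (A_l, P_l, R_l) with A_{l+1} = R_l A_l P_l (Galerkin product);
    the coarsest level has P = R = None and is solved directly.
    """
    A, P, R = levels[level]
    if P is None:                           # coarsest grid: direct solve
        return np.linalg.solve(A.toarray(), b)
    D_inv = 1.0 / A.diagonal()
    for _ in range(nu):                     # pre-smoothing (damped Jacobi)
        x += 0.8 * D_inv * (b - A @ x)
    r_coarse = R @ (b - A @ x)              # restrict the residual
    e = v_cycle(level + 1, levels, r_coarse, np.zeros(len(r_coarse)))
    x += P @ e                              # prolongate and correct
    for _ in range(nu):                     # post-smoothing
        x += 0.8 * D_inv * (b - A @ x)
    return x

# Hierarchy construction with the aggregation sketch from above:
#   levels, A_l = [], A
#   while A_l.shape[0] > 100:
#       agg, n_agg = aggregate_mis(A_l)
#       A_c, P, R = galerkin_coarsen(A_l, agg, n_agg)
#       levels.append((A_l, P, R)); A_l = A_c
#   levels.append((A_l, None, None))
```

Used as a preconditioner, one such cycle is applied per Krylov iteration.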

Poster panels: 1. Introduction, 2. Algorithms, 3. Performance: 3D FEM, 4. Ongoing work.

Kernel fusion (diagram): Kernel 1 computes $\alpha_1 = x_1^T y_1$ and Kernel 2 computes $\alpha_2 = x_2^T y_2$, each paying kernel launch overhead on top of kernel execution; a single fused kernel computes $\alpha_1 = x_1^T y_1,\ \alpha_2 = x_2^T y_2$ in one launch.

✦ Kernel fusion to reduce kernel launch overhead.
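A Python-level sketch of the idea only (real kernel fusion happens in the GPU code, where each launch carries a fixed overhead): two dot products computed in two separate passes versus one fused pass over the data:

```python
def dots_unfused(x1, y1, x2, y2):
    """Two 'launches': each reduction traverses its data separately."""
    a1 = sum(a * b for a, b in zip(x1, y1))   # "Kernel 1"
    a2 = sum(a * b for a, b in zip(x2, y2))   # "Kernel 2"
    return a1, a2

def dots_fused(x1, y1, x2, y2):
    """One 'fused kernel': both partial sums accumulated in a single pass,
    so only one launch overhead would be paid on the device."""
    a1 = a2 = 0.0
    for i in range(len(x1)):
        a1 += x1[i] * y1[i]
        a2 += x2[i] * y2[i]
    return a1, a2
```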

✦ Texture memory for SpMV.
✦ MPI for computation on multiple devices.
✦ Matrix-free approaches to reduce memory overheads.
✦ Domain decomposition to minimize communications.
✦ Special reduction kernels for OpenMP.
✦ Other potential scientific computing applications.

References:
✦ Y. Notay, 2010.
✦ R. Gandham, K. Esler, and Y. Zhang, 2014.
✦ N. Bell, S. Dalton, and L. N. Olson, 2012.
✦ J. Hoberock and N. Bell, 2009.
✦ D. S. Medina, T. Warburton, and A. St. Cyr, 2014.

✦ CUDA is faster than OpenCL on an NVIDIA Titan GPU.
✦ OpenMP is faster than OpenCL on an Intel i7 CPU.
✦ ~4× speedup on an NVIDIA Titan compared to a 6-core Intel i7.

Discrete methods such as finite difference, finite volume, and finite element yield a sparse linear system $Ax = b$.

SPE 10 benchmark problem (figure). Figure source: http://www.spe.org/web/csp/datasets/set02.htm

% time spent during Setup (pie chart): Galerkin product 56%, inner products etc. 24%, aggregation 10%, smoother 7%, construct ops 4%.

% time spent during Cycling (pie chart): grid transfers 56%, smoothing 28%, inner products etc. 15%.

Scaling of Setup and Scaling of Cycling (bar charts): millions of unknowns processed per second versus # of unknowns (1.8M, 4.2M, 5.9M, 8.1M) for CUDA on Titan, OpenCL on Titan, OpenCL on Tahiti, OpenCL on Intel i7, and OpenMP on Intel i7. Y-axis range: 0 to 2.5 M unknowns/sec for Setup, 0 to 8 M unknowns/sec for Cycling.