
R Gandham, T Warburton, Department of Computational and Applied Mathematics, Rice University

ALMOND: Algebraic Multigrid On Numerous Devices

The authors gratefully acknowledge funding support from DOE, ANL, ONR, NVIDIA, and AMD, and an internship at Stoneridge Technology Company.

AMG components (diagram):
✦ Setup: aggregate, Galerkin product, construct operators.
✦ Cycling: smoothing, grid transfers.
✦ Underlying kernels: SpM transpose, graph coloring, sort, SpMV (see the SpMV sketch below).
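Sparse matrix-vector multiply (SpMV) is the workhorse among these kernels. As a point of reference, a minimal serial Python sketch of CSR-format SpMV; the indptr/indices/data layout follows SciPy's convention (an assumption), and the poster's actual kernels are GPU/OpenMP implementations, not shown:

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x for a CSR matrix (row pointers, column indices, values).

    Each row is independent, which is what makes SpMV a natural
    fine-grained parallel kernel on GPUs (one thread or warp per row).
    """
    n = len(indptr) - 1
    y = np.zeros(n)
    for row in range(n):
        start, end = indptr[row], indptr[row + 1]
        y[row] = data[start:end] @ x[indices[start:end]]
    return y
```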

Applications
✦ Reservoir simulations.
✦ Computational fluid dynamics.
✦ Solid mechanics.

Challenges
✦ Heterogeneous media.
✦ Millions of unknowns.
✦ Efficiency and portability.

Elliptic PDE model: $-\nabla \cdot (\kappa \nabla p) = \nabla \cdot u$

$Ax = b, \quad A \in \mathbb{R}^{n \times n}, \quad x, b \in \mathbb{R}^{n}$
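For a concrete (hypothetical) instance, a minimal SciPy sketch of assembling such a sparse system from a 1D finite-difference discretization of the elliptic model; the face coefficients kappa stand in for the heterogeneous medium, and the poster's results use 3D FEM instead:

```python
import numpy as np
import scipy.sparse as sp

def poisson_1d(n, kappa=None):
    """Assemble -d/dx (kappa dp/dx) on a uniform 1D grid (unit spacing).

    kappa: array of n+1 face coefficients; defaults to 1 (constant medium).
    Dirichlet boundary values are assumed folded into the right-hand side.
    Returns a sparse tridiagonal A such that A @ p = b.
    """
    k = np.ones(n + 1) if kappa is None else kappa
    lower = -k[1:-1]           # coupling to the left neighbor
    diag = k[:-1] + k[1:]      # diagonal: sum of the two face coefficients
    upper = -k[1:-1]           # coupling to the right neighbor
    return sp.diags([lower, diag, upper], [-1, 0, 1], format='csr')

A = poisson_1d(8)
b = np.ones(8)
```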

✦ Reservoir simulations are crucial for optimizing oil production rates.
✦ Simulations are accelerated by preconditioning techniques.

✦ Algebraic Multigrid (AMG) is a possible preconditioner, but ...
✦ AMG presents challenges for many-core computation.

✦ OCCA provides portability across architectures [Medina et al.].

✦ Piecewise constant interpolation [Notay et al.].
✦ Optimality via Krylov accelerated cycles [Notay et al.].
✦ GPU aggregation using a maximal independent set [Bell et al.] (see the sketch after this list).
✦ Fine-grain parallel MIS [Bell et al.].
✦ Galerkin product through sort [Gandham et al.].
✦ Radix sort on GPU [Hoberock et al.].
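A serial sketch of the aggregation and Galerkin-product ideas above, assuming a SciPy CSR matrix. The greedy MIS loop is a sequential stand-in for the fine-grain parallel MIS of [Bell et al.], and P is the piecewise constant interpolation of [Notay et al.]; the sort-based GPU Galerkin product of [Gandham et al.] is replaced here by SciPy's sparse matmul:

```python
import numpy as np
import scipy.sparse as sp

def aggregate_mis(A):
    """Group unknowns into aggregates rooted at a maximal independent set
    of the graph of A (A is a scipy.sparse.csr_matrix)."""
    n = A.shape[0]
    agg = -np.ones(n, dtype=int)   # aggregate id per node, -1 = unassigned
    n_agg = 0
    for i in range(n):
        if agg[i] == -1:           # i is not adjacent to any root: new root
            agg[i] = n_agg
            nbrs = A.indices[A.indptr[i]:A.indptr[i + 1]]
            agg[nbrs[agg[nbrs] == -1]] = n_agg  # absorb unassigned neighbors
            n_agg += 1
    return agg, n_agg

def galerkin_coarsen(A, agg, n_agg):
    """Piecewise constant P (one column per aggregate) and A_c = P^T A P."""
    n = A.shape[0]
    P = sp.csr_matrix((np.ones(n), (np.arange(n), agg)), shape=(n, n_agg))
    R = P.T.tocsr()                # restriction = transpose of interpolation
    return R @ A @ P, P, R
```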

Recursive multigrid cycle (diagram, sketched in code below): a hierarchy of operators $A_0, A_1, A_2, A_3$; descending the cycle restricts the residual, e.g. $r_2 \leftarrow R_1 r_1$, and ascending applies the prolongated coarse-grid correction, e.g. $x_1 \leftarrow x_1 + P_1 x_2$.

Construct coarse grids via aggregation (diagram): fine-grid nodes 1-12 are grouped into aggregates 1-3, and each coarse operator is formed by the Galerkin product $A_{l+1} = R_l A_l P_l$.
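A minimal serial sketch of the recursive cycle in the diagram. Damped Jacobi smoothing and the 0.8 damping factor are illustrative assumptions (the poster does not specify the smoother), and the Krylov acceleration of [Notay et al.] is omitted:

```python
import numpy as np

def v_cycle(level, levels, b, x, nu=2):
    """One recursive multigrid cycle on a hierarchy built by aggregation.

    levels[l] = (A_l, P_l, R_l) with A_{l+1} = R_l A_l P_l (Galerkin product);
    the coarsest level has P = R = None and is solved directly.
    """
    A, P, R = levels[level]
    if P is None:                           # coarsest grid: direct solve
        return np.linalg.solve(A.toarray(), b)
    D_inv = 1.0 / A.diagonal()
    for _ in range(nu):                     # pre-smoothing (damped Jacobi)
        x += 0.8 * D_inv * (b - A @ x)
    r_coarse = R @ (b - A @ x)              # restrict the residual
    e = v_cycle(level + 1, levels, r_coarse, np.zeros(len(r_coarse)))
    x += P @ e                              # prolongate and correct
    for _ in range(nu):                     # post-smoothing
        x += 0.8 * D_inv * (b - A @ x)
    return x

# Hierarchy construction with the aggregation sketch from above:
#   levels, A_l = [], A
#   while A_l.shape[0] > 100:
#       agg, n_agg = aggregate_mis(A_l)
#       A_c, P, R = galerkin_coarsen(A_l, agg, n_agg)
#       levels.append((A_l, P, R)); A_l = A_c
#   levels.append((A_l, None, None))
```

Used as a preconditioner, one such cycle is applied per Krylov iteration.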

Poster panels: 1. Introduction, 2. Algorithms, 3. Performance: 3D FEM, 4. Ongoing work.

Kernel fusion (diagram): Kernel 1 computes $\alpha_1 = x_1^T y_1$ and Kernel 2 computes $\alpha_2 = x_2^T y_2$, each paying kernel launch overhead on top of kernel execution; a single fused kernel computes $\alpha_1 = x_1^T y_1,\ \alpha_2 = x_2^T y_2$ in one launch.

✦ Kernel fusion to reduce kernel launch overhead.
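A Python-level sketch of the idea only (real kernel fusion happens in the GPU code, where each launch carries a fixed overhead): two dot products computed in two separate passes versus one fused pass over the data:

```python
def dots_unfused(x1, y1, x2, y2):
    """Two 'launches': each reduction traverses its data separately."""
    a1 = sum(a * b for a, b in zip(x1, y1))   # "Kernel 1"
    a2 = sum(a * b for a, b in zip(x2, y2))   # "Kernel 2"
    return a1, a2

def dots_fused(x1, y1, x2, y2):
    """One 'fused kernel': both partial sums accumulated in a single pass,
    so only one launch overhead would be paid on the device."""
    a1 = a2 = 0.0
    for i in range(len(x1)):
        a1 += x1[i] * y1[i]
        a2 += x2[i] * y2[i]
    return a1, a2
```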

✦ Texture memory for SpMV.
✦ MPI for computation on multiple devices.
✦ Matrix-free approaches to reduce memory overheads.
✦ Domain decomposition to minimize communications.
✦ Special reduction kernels for OpenMP.
✦ Other potential scientific computing applications.

References:
✦ Y. Notay, 2010.
✦ R. Gandham, K. Esler, and Y. Zhang, 2014.
✦ N. Bell, S. Dalton, and L. N. Olson, 2012.
✦ J. Hoberock and N. Bell, 2009.
✦ D. S. Medina, T. Warburton, and A. St. Cyr, 2014.

✦ CUDA is faster than OpenCL on an NVIDIA Titan GPU.
✦ OpenMP is faster than OpenCL on an Intel i7 CPU.
✦ ~4× speedup on an NVIDIA Titan compared to a 6-core Intel i7.

Discrete methods such as finite difference, finite volume, and finite element yield a sparse linear system $Ax = b$.

SPE 10 benchmark problem (figure). Figure source: http://www.spe.org/web/csp/datasets/set02.htm

% time spent during Setup (pie chart): Galerkin product 56%, inner products etc. 24%, aggregation 10%, smoother 7%, construct ops 4%.

% time spent during Cycling (pie chart): grid transfers 56%, smoothing 28%, inner products etc. 15%.

Scaling of Setup and Scaling of Cycling (bar charts): millions of unknowns processed per second versus # of unknowns (1.8M, 4.2M, 5.9M, 8.1M) for CUDA on Titan, OpenCL on Titan, OpenCL on Tahiti, OpenCL on Intel i7, and OpenMP on Intel i7. Y-axis range: 0 to 2.5 M unknowns/sec for Setup, 0 to 8 M unknowns/sec for Cycling.