New ALMOND: Algebraic Multigrid On Numerous Devices · 2014. 10. 21. · R Gandham, T Warburton!...
R Gandham, T Warburton
Department of Computational and Applied Mathematics, Rice University
ALMOND: Algebraic Multigrid On Numerous Devices
The authors gratefully acknowledge funding support from DOE, ANL, ONR, NVIDIA, and AMD, and an internship at Stoneridge Technology Company.
[Diagram labels: AMG Setup (Construct Ops, Aggregate, Galerkin prod); Cycling (Smoothing, Grid transfers); kernels: SpM transpose, Graph coloring, Sort, SpMV]
Applications
✦ Reservoir simulations.
✦ Computational fluid dynamics.
✦ Solid mechanics.
Challenges
✦ Heterogeneous media.
✦ Millions of unknowns.
✦ Efficiency and portability.
Elliptic PDE model: $-\nabla \cdot (\kappa \nabla p) = \nabla \cdot u$
$Ax = b, \quad A \in \mathbb{R}^{n \times n}, \quad x, b \in \mathbb{R}^{n}$
✦ Reservoir simulations are crucial in optimizing oil production rate.
✦ Simulations are accelerated by preconditioning techniques.
✦ Algebraic multigrid (AMG) is a possible preconditioner, but ...
✦ AMG presents challenges for many-core computation.
✦ OCCA provides portability across architectures [Medina et al].
✦ Piecewise constant interpolation [Notay et al].
✦ Optimality via Krylov accelerated cycles [Notay et al].
✦ GPU aggregation using maximal independent set [Bell et al].
✦ Fine-grain parallel MIS [Bell et al].
✦ Galerkin product through sort [Gandham et al].
✦ Radix sort on GPU [Hoberock et al].
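To make the aggregation and piecewise constant interpolation items above concrete, here is a minimal serial sketch in Python. It is not the poster's GPU implementation. It assumes a symmetric adjacency-list graph, uses a greedy distance-1 maximal independent set rather than the fine-grain parallel MIS of the cited work, and the names `greedy_mis` and `aggregate` are hypothetical.

```python
def greedy_mis(adj):
    """Serial greedy maximal independent set on symmetric adjacency lists."""
    UNDECIDED, IN_SET, EXCLUDED = 0, 1, -1
    state = [UNDECIDED] * len(adj)
    for v in range(len(adj)):
        if state[v] == UNDECIDED:
            state[v] = IN_SET
            # Neighbors of a selected node can no longer join the set.
            for w in adj[v]:
                if state[w] == UNDECIDED:
                    state[w] = EXCLUDED
    return [v for v in range(len(adj)) if state[v] == IN_SET]


def aggregate(adj):
    """Return (agg, n_coarse); agg[i] is the aggregate index of fine node i."""
    roots = greedy_mis(adj)
    agg = [-1] * len(adj)
    for k, r in enumerate(roots):
        agg[r] = k  # each MIS node seeds one aggregate
    is_root = [a != -1 for a in agg]
    for v in range(len(adj)):
        if not is_root[v]:
            # Maximality of the set guarantees that some neighbor is a root.
            agg[v] = next(agg[w] for w in adj[v] if is_root[w])
    return agg, len(roots)
```

With piecewise constant interpolation, the array `agg` is all that needs to be stored: the prolongation matrix has a single unit entry per row, in column `agg[i]`.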
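Recursive multigrid cycle
[Figure: grid hierarchy $A_0, A_1, A_2, A_3$ with restriction $r_2 \leftarrow R_1 r_1$ and prolongation $x_1 \leftarrow x_1 + P_1 x_2$]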
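The recursion in the cycle can be sketched as follows. This is a plain V-cycle on small dense matrices, not the Krylov accelerated cycle cited on the poster. It assumes a symmetric positive definite matrix, $R_l = P_l^T$ with piecewise constant $P_l$, weighted Jacobi smoothing, and a hypothetical pairwise aggregation (`i // 2`); all function names are illustrative.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def jacobi(A, x, b, omega=2.0 / 3.0, sweeps=2):
    """Weighted Jacobi smoother."""
    for _ in range(sweeps):
        Ax = matvec(A, x)
        x = [x[i] + omega * (b[i] - Ax[i]) / A[i][i] for i in range(len(x))]
    return x


def solve_dense(A, b):
    """Gaussian elimination without pivoting (adequate for small SPD systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def build_levels(A, n_levels):
    """Build [(A_0, agg_0), ..., (A_coarsest, None)] with A_{l+1} = P^T A_l P."""
    levels = []
    for _ in range(n_levels - 1):
        n = len(A)
        agg = [i // 2 for i in range(n)]  # hypothetical pairwise aggregation
        nc = (n + 1) // 2
        Ac = [[0.0] * nc for _ in range(nc)]
        for i in range(n):
            for j in range(n):
                Ac[agg[i]][agg[j]] += A[i][j]
        levels.append((A, agg))
        A = Ac
    levels.append((A, None))
    return levels


def cycle(levels, l, b, x):
    """One recursive V-cycle starting at level l."""
    A, agg = levels[l]
    if agg is None:  # coarsest level: direct solve
        return solve_dense(A, b)
    x = jacobi(A, x, b)  # pre-smoothing
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    nc = len(levels[l + 1][0])
    rc = [0.0] * nc
    for i, ri in enumerate(r):  # r_{l+1} <- R_l r_l
        rc[agg[i]] += ri
    ec = cycle(levels, l + 1, rc, [0.0] * nc)
    x = [xi + ec[agg[i]] for i, xi in enumerate(x)]  # x_l <- x_l + P_l x_{l+1}
    return jacobi(A, x, b)  # post-smoothing
```

Calling `cycle(levels, 0, b, x)` repeatedly gives a stationary iteration; in practice such a cycle would more often serve as a preconditioner inside a Krylov method.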
Construct coarse grids via aggregation
[Figure: fine-grid nodes labeled 1 to 12 grouped into aggregates labeled 1 to 3]
Galerkin product: $A_{l+1} = R_l A_l P_l$
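With piecewise constant interpolation and $R_l = P_l^T$, the Galerkin product reduces to relabeling each nonzero of $A_l$ by the aggregates of its row and column and summing duplicates. The sketch below shows one plausible reading of "Galerkin product through sort" in serial Python. The coordinate (row, column, value) storage and the name `galerkin_by_sort` are assumptions, and on a GPU the sort and the duplicate summation would be replaced by a parallel radix sort and a reduce-by-key step.

```python
def galerkin_by_sort(rows, cols, vals, agg):
    """Coarse matrix entries (I, J, value) for A_c = P^T A P, P piecewise constant.

    rows, cols, vals: nonzeros of the fine matrix in coordinate format.
    agg: agg[i] is the aggregate (coarse index) of fine node i.
    """
    # Step 1: relabel each fine nonzero with its coarse (row, col) key and sort.
    entries = sorted((agg[i], agg[j], v) for i, j, v in zip(rows, cols, vals))
    # Step 2: sum runs of equal keys (a segmented reduction).
    coarse = []
    for I, J, v in entries:
        if coarse and coarse[-1][0] == I and coarse[-1][1] == J:
            coarse[-1][2] += v
        else:
            coarse.append([I, J, v])
    return [tuple(e) for e in coarse]
```

For a 4 x 4 tridiagonal matrix with stencil (-1, 2, -1) and aggregates `[0, 0, 1, 1]`, this returns the 2 x 2 coarse matrix with 2 on the diagonal and -1 off the diagonal.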
Poster sections: 1 Introduction · 2 Algorithms · 3 Performance: 3D FEM · 4 Ongoing work
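[Figure: kernel launch overhead vs. kernel execution. Kernel 1 computes $\alpha_1 = x_1^T y_1$ and Kernel 2 computes $\alpha_2 = x_2^T y_2$ in separate launches; the fused kernel computes $\alpha_1 = x_1^T y_1$ and $\alpha_2 = x_2^T y_2$ in one launch.]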
✦ Kernel fusion to reduce kernel launch overhead.
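The idea of fusing two inner products can be illustrated with a serial Python analogy, in which each function call stands in for one kernel launch. This is only a sketch: the function names are hypothetical, the vectors are assumed to have equal length, and real launch overhead only appears on an actual device.

```python
def dot(x, y):
    """One inner product; on a device this would be one kernel launch."""
    return sum(a * b for a, b in zip(x, y))


def fused_dots(x1, y1, x2, y2):
    """Both inner products in a single pass (one launch instead of two)."""
    alpha1 = 0.0
    alpha2 = 0.0
    for a, b, c, d in zip(x1, y1, x2, y2):
        alpha1 += a * b
        alpha2 += c * d
    return alpha1, alpha2
```

The unfused version calls `dot` twice and pays the fixed per-call cost twice; the fused version pays it once and returns the same two values.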
✦ Texture memory for SpMV.
✦ MPI for computation on multiple devices.
✦ Matrix-free approaches to reduce memory overheads.
✦ Domain decomposition to minimize communications.
✦ Special reduction kernels for OpenMP.
✦ Other potential scientific computing applications.
✦ Y. Notay, 2010.
✦ R. Gandham, K. Esler, and Y. Zhang, 2014.
✦ N. Bell, S. Dalton, and L. N. Olson, 2012.
✦ J. Hoberock and N. Bell, 2009.
✦ D. S. Medina, T. Warburton, and A. St. Cyr, 2014.
✦ CUDA is faster than OpenCL on the NVIDIA Titan GPU.
✦ OpenMP is faster than OpenCL on the Intel i7 CPU.
✦ 4x speedup on the NVIDIA Titan compared to the 6-core Intel i7.
Discretization methods such as finite difference, finite volume, and finite element yield a sparse linear system $Ax = b$.
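As a small worked example of how a discretization produces such a sparse system, the sketch below assembles the matrix for the one-dimensional analogue $-(\kappa p')' = f$ on the unit interval with zero Dirichlet boundary values. The 1D setting, the uniform grid, the face-centered coefficients `kappa_face`, and the function name are all assumptions made for illustration; the poster's results are for 3D FEM.

```python
def assemble_1d(kappa_face):
    """Nonzeros (i, j, value) of the finite-difference matrix for -(kappa p')' = f.

    kappa_face has n + 1 entries: the coefficient on each face between the
    n interior nodes and the two boundary nodes of a uniform grid on (0, 1).
    """
    n = len(kappa_face) - 1
    h = 1.0 / (n + 1)
    scale = 1.0 / (h * h)
    entries = []
    for i in range(n):
        left, right = kappa_face[i], kappa_face[i + 1]
        if i > 0:
            entries.append((i, i - 1, -left * scale))
        entries.append((i, i, (left + right) * scale))
        if i < n - 1:
            entries.append((i, i + 1, -right * scale))
    return entries
```

Each row has at most three nonzeros regardless of `n`, which is the sparsity that makes SpMV-based solvers attractive; strongly varying `kappa_face` values mimic heterogeneous media.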
SPE 10 benchmark problem
Fig. source: http://www.spe.org/web/csp/datasets/set02.htm
% time spent during Setup: Inner prod etc 24%, Smoother 7%, Galerkin prod 56%, Ops 4%, Aggregate 10%
% time spent during Cycling: Inner prod etc 15%, Grid transfer 56%, Smooth 28%
[Figure: Scaling of Setup. x-axis: # of unknowns (1.8M, 4.2M, 5.9M, 8.1M); y-axis: M. unknowns per sec (0 to 2.5).]
[Figure: Scaling of Cycling. x-axis: # of unknowns (1.8M, 4.2M, 5.9M, 8.1M); y-axis: M. unknowns per sec (0 to 8).]
Legend: CUDA on Titan, OpenCL on Titan, OpenCL on Tahiti, OpenCL on Intel i7, OpenMP on Intel i7