
Transcript of "GAMPACK: A Scalable GPU-Accelerated Algebraic Multigrid PACKage" (GTC San Jose, May 27, 2014)

Page 1

GAMPACK: A Scalable GPU-Accelerated Algebraic Multigrid PACKage
Yongpeng Zhang, Ken Esler, Rajesh Gandham and Vincent Natoli
GTC San Jose, May 27, 2014

Page 2

GAMPACK in a Nutshell: History and Features

•  Solves Ax = b for a sparse matrix A using the algebraic multigrid (AMG) method
•  One of the first fully GPU-accelerated AMG solvers
   –  Previous talk at GTC 2012
   –  Since then, scalability has been improved (the topic of this talk)
   –  Being integrated into commercial applications
•  Features:
   –  Simple C/Fortran APIs with a few tunable parameters (a hypothetical call sequence is sketched after this list)
   –  Classical and unsmoothed-aggregation coarsening methods
   –  Supports the GMRES, CG, and BiCGStab methods
   –  Multi-GPU enabled
   –  Supports OpenMP and C++11 threading models
   –  Fast setup for value-only matrix updates
   –  Hybrid solution (uses both GPU and CPU)
   –  Setup runs on a CUDA stream and may overlap with other host-to-device DMA transfers
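The deck names the entry points only as "Simple C/Fortran APIs"; the actual symbols are not shown. As an illustration of what such a call sequence typically looks like, here is a minimal sketch in which every gampack_* identifier is a hypothetical placeholder, not the real GAMPACK API:

    /* Hypothetical sketch of a simple C-style AMG solver interface.    */
    /* All gampack_* names are invented for illustration only; they are */
    /* not the actual GAMPACK symbols.                                  */
    typedef void *gampack_handle;
    void gampack_create(gampack_handle *h);
    void gampack_set_option(gampack_handle h, const char *key, const char *val);
    void gampack_setup(gampack_handle h, int n, const int *row_ptr,
                       const int *col_idx, const double *values);
    int  gampack_solve(gampack_handle h, const double *b, double *x, double tol);
    void gampack_destroy(gampack_handle h);

    int main(void)
    {
        /* A tiny 3x3 tridiagonal system in CSR format. */
        int    row_ptr[] = {0, 2, 5, 7};
        int    col_idx[] = {0, 1, 0, 1, 2, 1, 2};
        double values[]  = {2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0};
        double b[] = {1.0, 0.0, 1.0};
        double x[] = {0.0, 0.0, 0.0};

        gampack_handle h;
        gampack_create(&h);
        gampack_set_option(h, "coarsening", "classical");  /* or "aggregation" */
        gampack_set_option(h, "krylov", "gmres");          /* or "cg", "bicgstab" */
        gampack_setup(h, 3, row_ptr, col_idx, values);     /* build hierarchy once */
        gampack_solve(h, b, x, 1e-6);                      /* reuse across solves */
        gampack_destroy(h);
        return 0;
    }

The "fast setup for value-only matrix update" feature would correspond to a re-setup entry point that keeps the hierarchy structure and refreshes only the numerical values; its actual name is likewise not given in the deck.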

Page 3

Motivation: Scaling Solvers in Scientific Applications

•  Multiphysics application Charon
   –  Running on a Cray system
•  Weak scaling
•  The solver is AMG with GMRES
•  List of challenges:
   –  Extreme levels of concurrency
   –  Resilience and non-deterministic behavior
   –  Reduced memory size per core
   –  Data storage and movement
   –  Deep memory hierarchies
   –  Portability with performance
•  These challenges motivate the work on scaling GAMPACK

Solver Dominance in the ExaMath Era. Image courtesy of the report from the DOE Workshop on Extreme-Scale Solvers, ExaMath13:

https://collab.mcs.anl.gov/display/examath/ExaMath13+Workshop

Page 4

Multigrid: Background

•  Motivation
   –  Low-frequency errors decay slowly in standard iterative methods (see the smoother sketch below)
   –  The number of iterations grows in proportion to the number of unknowns
•  Construct a hierarchy of grids (multigrid)
   –  Low-frequency error on the fine grid becomes high-frequency error on a coarser grid
   –  Requires operators to transfer errors between levels
   –  The number of iterations becomes independent of the number of unknowns
•  Geometric and algebraic multigrid
   –  Geometric
      •  Structured meshes
      •  Easy to set up
   –  Algebraic
      •  Uses only the matrix coefficients to construct the hierarchy
      •  The system can be treated as a "black box"

Image courtesy of LLNL
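To make the slow-decay point concrete, here is a minimal CUDA sketch (my illustration, not GAMPACK source) of one weighted Jacobi sweep on a CSR matrix. A few such sweeps damp the high-frequency components of the error quickly, but the smooth, low-frequency components barely change, which is exactly what the coarse-grid correction is for:

    // One weighted Jacobi sweep: x_new = x + w * D^{-1} (b - A x), A in CSR.
    // Illustrative sketch only; not GAMPACK source code.
    __global__ void jacobi_sweep(int n,
                                 const int *row_ptr, const int *col_idx,
                                 const double *val, const double *b,
                                 const double *x, double *x_new, double w)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        double diag = 1.0;   // diagonal entry of row i, found during the scan
        double r    = b[i];  // accumulates the residual b_i - sum_j a_ij x_j
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            if (col_idx[k] == i) diag = val[k];
            r -= val[k] * x[col_idx[k]];
        }
        x_new[i] = x[i] + w * r / diag;  // w ~ 2/3 is a common damping choice
    }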

Page 5

Classical and Aggregation-Based Algebraic Multigrid

[Figure: example fine grids and prolongator matrices P for Classical AMG, Unsmoothed Aggregation, and Smoothed Aggregation]

•  Classical: partitions the nodes into coarse-grid (C) and fine-grid (F) nodes
•  Aggregation: each coarse node is formed by aggregating a non-overlapping set of fine nodes
   –  Unsmoothed: simple prolongator/restrictor with at most one non-zero entry per row of P (see the sketch below)
   –  Smoothed: applies smoothing to P, which makes the coarser-level operators denser
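To make the "one non-zero per row" property concrete, here is a host-side sketch of how an unsmoothed-aggregation prolongator can be assembled (an assumed form, not GAMPACK's implementation). Given an aggregate id for every fine node, row i of P receives a single unit entry in the column of its aggregate:

    // Build the unsmoothed-aggregation prolongator P in CSR form (host side).
    // agg[i] = id of the aggregate (coarse node) that fine node i belongs to.
    // Illustrative sketch; not GAMPACK's actual implementation.
    #include <vector>

    struct Csr {
        std::vector<int>    row_ptr, col_idx;
        std::vector<double> val;
    };

    Csr build_prolongator(const std::vector<int> &agg)
    {
        const int n_fine = (int)agg.size();
        Csr P;
        P.row_ptr.resize(n_fine + 1);
        for (int i = 0; i < n_fine; ++i) {
            P.row_ptr[i] = i;             // exactly one non-zero per row
            P.col_idx.push_back(agg[i]);  // column = aggregate id
            P.val.push_back(1.0);         // unit weight: pure injection
        }
        P.row_ptr[n_fine] = n_fine;
        return P;                         // the restrictor is simply R = P^T
    }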


Page 6

Algebraic Multigrid: Implementation

•  Divided into a Setup phase and a Solve phase
   –  Setup:
      •  Generates the hierarchy of levels
      •  May be reused by many solve passes
   –  Solve:
      •  Recursively applies smoothing and coarse-grid correction on each level (a V-cycle sketch follows below)
      •  Full-cycle, V-cycle, K-cycle, W-cycle, ...
•  Parallelizing the lower (coarser) levels is key to achieving good scalability in AMG
   –  Less concurrency is available there because those levels are small
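The V-cycle listed above is the simplest of these cycles and fits in a short recursion. The sketch below is a generic AMG V-cycle, not GAMPACK's code; Matrix, Vec, smooth, residual, solve_direct, and the per-level operators are assumed helpers:

    // Generic AMG V-cycle sketch (illustration, not GAMPACK source).
    // Level 0 is the finest grid; lv.back() is the coarsest.
    // 'Matrix', 'Vec' and the helper routines are assumed to exist.
    #include <vector>

    struct Level { Matrix A, R, P; };          // R is typically P^T

    void v_cycle(const std::vector<Level> &lv, int l, const Vec &b, Vec &x)
    {
        if (l == (int)lv.size() - 1) {
            solve_direct(lv[l].A, b, x);       // coarsest level: direct solve
            return;
        }
        smooth(lv[l].A, b, x);                 // pre-smoothing (e.g. Jacobi)
        Vec r  = residual(lv[l].A, b, x);      // r = b - A x
        Vec rc = lv[l].R * r;                  // restrict residual to coarse grid
        Vec ec(rc.size(), 0.0);                // coarse error, zero initial guess
        v_cycle(lv, l + 1, rc, ec);            // recurse: coarse-grid correction
        x = x + lv[l].P * ec;                  // prolongate and apply correction
        smooth(lv[l].A, b, x);                 // post-smoothing
    }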

Page 7

Scaling GAMPACK: Challenges and Why

•  A case study
   –  13 million unknowns
   –  94 million non-zeros (NNZ)
   –  Classical AMG generates 13 levels
   –  GPU: one Tesla K40m
   –  CPU: two 8-core Sandy Bridge (E5-2680 @ 2.7-3.5 GHz)
•  Statistics for each level:

Level   Num Rows      NNZ          NNZ/Row
0       13,464,033    93,800,843    6.97
1        5,683,899    74,538,389   13.11
2        1,825,181    41,377,697   22.67
3          521,650    15,772,686   30.24
4          137,061     4,778,495   34.86
5           33,150     1,241,600   37.45
6            7,269       275,747   37.93
7            1,544        54,622   35.38
8              338        10,232   30.27
9               54         1,154   21.37
10              10            82    8.2
11               3             9    3
12               1             1    1

[Pie chart: NNZ distribution across levels, as a fraction of the ~232M total NNZ in the hierarchy. Level 0: 40%, Level 1: 32%, Level 2: 18%, Level 3: 7%, Level 4: 2%, Level 5+: 1%]

Page 8

Scaling GAMPACK: Challenges and Why

•  CPU and GPU show two different trends
•  Cache effects help the CPU implementation
   –  More efficient for small levels that fit in cache
   –  Latency-oriented processors
•  The GPU is good for large data sets
   –  Throughput-oriented processors
•  It is therefore harder to scale multigrid on GPUs
   –  Amdahl's law: the small, poorly parallelizable coarse levels limit the speedup

[Pie charts: NNZ distribution by level (Level 0: 40%, Level 1: 32%, Level 2: 18%, Level 3: 7%, Level 4: 2%, Level 5+: 1%); single-GPU runtime distribution by level (Level 0: 38%, Level 1: 32%, Level 2: 18%, Level 3: 7%, Level 4: 3%, Level 5+: 2%); multi-threaded in-house CPU runtime distribution by level (Level 0: 58%, Level 1: 25%, Level 2: 11%, Level 3: 4%, Level 4: 1%, Level 5+: 1%)]

Page 9

Scaling Optimizations: Baseline

•  Test bed
   –  8 K40m GPUs, each with 12 GB of memory
   –  Two 8-core Ivy Bridge CPUs (2.1 GHz)
   –  256 GB host memory
•  Measuring total solution time
   –  Solving to a relative residual of 1e-6
•  From 1 GPU to 8 GPUs (strong scaling)
•  13 million unknowns
•  Initial implementation
   –  Low parallel efficiency: 41% (3.28X on 8 GPUs)

[Bar chart: speedup from 1 GPU to 8 GPUs. Baseline: 3.28X]

Page 10

Scaling Optimizations, Opt I: Overlapping Communication and Computation

•  Made possible by
   –  cudaMemcpyAsync
   –  Page-locked memory (cudaMallocHost or cudaHostRegister)
•  SpMV (y = Ax + y); see the sketch below
   –  Compute y using the local part of x
   –  Exchange the remote x values at the same time
   –  Then update y with the remote x values
   –  Uses a hybrid matrix format
•  The same idea applies to SpMM and PMIS
   –  Compute the interior region while exchanging halos
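A hedged sketch of the overlap pattern follows, using two CUDA streams: the halo exchange runs through cudaMemcpyAsync on page-locked buffers while the SpMV over the purely local columns runs on a separate compute stream. The kernels spmv_local and spmv_remote, the routine exchange_with_peers, and all buffer names are assumed for illustration, not GAMPACK symbols:

    // Overlap the halo exchange with the local part of y = A x + y.
    // Assumes h_send/h_recv are page-locked host buffers and d_* are
    // device buffers; all names here are illustrative.
    cudaStream_t comm, comp;
    cudaStreamCreate(&comm);
    cudaStreamCreate(&comp);

    // Stage boundary values to the host; with pageable memory this copy
    // could not overlap with kernel execution, hence the pinned buffers.
    cudaMemcpyAsync(h_send, d_send, send_bytes,
                    cudaMemcpyDeviceToHost, comm);

    // Meanwhile: y += A_local * x using only locally owned entries of x.
    spmv_local<<<grid, block, 0, comp>>>(n_local, d_A_local, d_x, d_y);

    cudaStreamSynchronize(comm);          // boundary values are now on the host
    exchange_with_peers(h_send, h_recv);  // hand halo data to the other GPUs
    cudaMemcpyAsync(d_x_halo, h_recv, recv_bytes,
                    cudaMemcpyHostToDevice, comm);
    cudaStreamSynchronize(comm);

    // Finish: y += A_remote * x_halo once the remote entries have arrived.
    spmv_remote<<<grid, block, 0, comp>>>(n_local, d_A_remote, d_x_halo, d_y);
    cudaStreamSynchronize(comp);

The sketch assumes A is stored split into a local part (columns owned by this GPU) and a remote part (halo columns), so the first SpMV can start before any communication completes.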

[Bar chart: speedup from 1 GPU to 8 GPUs. Baseline: 3.28X; Opt I: 3.72X]

Page 11

Scaling Optimizations, Opt II: Aggregating Levels

•  Aggregate the coarser levels onto fewer GPUs (an illustrative policy is sketched below)
   –  Intermediate levels onto 4 GPUs
   –  Coarse levels onto 1 GPU
•  Use the CPU for the coarsest level
   –  SuperLU
   –  PARDISO
   –  Multi-threaded CPU AMG solver
•  Even the single-GPU configuration benefits from the CPU coarse solver
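The aggregation policy amounts to a rule that maps a level's size to a GPU count. The sketch below shows the shape of such a rule; the thresholds are invented for illustration and are not GAMPACK's actual values:

    // Choose how many GPUs participate at a given AMG level.
    // Illustrative policy only; the thresholds are made up for the sketch
    // and are not GAMPACK's actual cutoffs.
    int gpus_for_level(long long rows, int total_gpus)
    {
        if (rows > 1000000) return total_gpus;               // fine: all GPUs
        if (rows > 10000)   return total_gpus < 4 ? total_gpus
                                                  : 4;       // intermediate
        if (rows > 500)     return 1;                        // coarse: one GPU
        return 0;                                            // coarsest: CPU path
    }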

[Bar chart: speedup from 1 GPU to 8 GPUs. Baseline: 3.28X; Opt I: 3.72X; Opt II: 4.66X]

Page 12

Scaling Optimizations, Opt III: METIS and Others

•  Apply METIS partitioning to the input matrix before solving
   –  Reduces communication
   –  METIS time is not included in the measurements
•  Pin the memory of the input buffer (see the sketch below)
   –  Faster host-to-device transfers
•  Thread affinity
   –  Bind CPU threads to the socket closest to their GPU
•  Upgrade the CUDA release
   –  2% faster going from CUDA 5.5 to 6.0
   –  Compiler advances
   –  Thrust improvements
•  Final speedup factor: 5.5X (strong scaling)
•  Parallel efficiency: 68.75%
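Pinning the caller's buffer is a one-call change on the host side. A minimal sketch of the cudaHostRegister pattern follows; the buffer, size, and stream names are illustrative:

    // Pin an existing user buffer so host-to-device copies take the DMA path.
    // Assumes: double *values (host CSR values), double *d_values (device),
    // size_t nnz, cudaStream_t stream (illustrative names only).
    cudaHostRegister(values, nnz * sizeof(double), cudaHostRegisterDefault);

    cudaMemcpyAsync(d_values, values, nnz * sizeof(double),
                    cudaMemcpyHostToDevice, stream);  // now a true async DMA
    cudaStreamSynchronize(stream);

    cudaHostUnregister(values);                       // unpin when finished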

[Bar chart: speedup from 1 GPU to 8 GPUs. Baseline: 3.28X; Opt I: 3.72X; Opt II: 4.66X; Opt III: 5.5X]

Page 13

Scaling Optimizations: Future Optimizations

•  Try out different matrix formats
   –  Still an active research topic
•  Reduce kernel launch overhead
   –  Not negligible on the coarse levels
   –  Worse when multi-threaded
•  The same principles apply to the multi-node cluster solution

[Bar chart, repeated from the previous slide: Baseline 3.28X; Opt I 3.72X; Opt II 4.66X; Opt III 5.5X]

Page 14

Weak Scaling: Better Efficiency

•  Throughput measured in millions of unknowns per second
•  55 million unknowns solved in 3.43 seconds on 8 K40ms (about 16 million unknowns/sec)
•  95 million unknowns could possibly fit in the memory of 8 K40ms
•  Includes all transfer time, setup, and solve to a relative tolerance of 1e-6
•  Efficiency:
   –  2 GPUs: 92%
   –  4 GPUs: 72%
   –  8 GPUs: 74%

[Chart: throughput in millions of unknowns per second for 1, 2, 4, and 8 GPUs; GAMPACK vs. perfect scaling; y-axis 0 to 25]

Page 15


Compare to HYPRE: University of Florida Sparse Matrix Collection

•  GPU: one K40m
   –  Clock frequency maxed out
•  Reference: HYPRE 2.9.0 with OpenMP
   –  Xeon E5-2680
   –  2.7-3.5 GHz
   –  7 CPU threads
   –  64 GB host memory
•  Parameters tuned for each matrix
•  Speedup factors range from 5.53X to 16.10X

[Bar chart: GAMPACK speedup over HYPRE for each test matrix: 12.01X, 6.74X, 5.53X, 6.60X, 8.90X, 13.19X, 16.10X, 12.53X, 14.56X, 6.38X, 12.78X]

Page 16

Summary

•  GAMPACK is a fast, scalable, GPU-accelerated AMG solver
   –  Easy-to-use, highly configurable APIs
   –  Flexible threading models
   –  Solves in double, single, or mixed precision
   –  Highly tuned for Fermi and Kepler
   –  Supports CUDA 5.5 and above
   –  Runs on Win64 and Linux
•  Ongoing work
   –  A cluster version with MPI is near completion

Page 17

Team Members

Contact: [email protected]

Ken Esler, Yongpeng Zhang, Rajesh Gandham (intern 2013), Vincent Natoli

Stone Ridge Technology is located in Bel Air, MD and specializes in the development, acceleration, and extension of scientific and engineering software for conventional CPU, multicore, and hybrid many-core platforms. It employs computational physicists, applied mathematicians, computer scientists, and electrical engineers. It has been an NVIDIA partner since 2008.