Transcript of "GAMPACK: A Scalable GPU-Accelerated Algebraic Multigrid PACKage"
GAMPACK A Scalable GPU-Accelerated Algebraic
Multigrid PACKage Yongpeng Zhang, Ken Esler, Rajesh Gandham and Vincent Natoli
GTC San Jose, May 27 2014
GAMPACK in a Nutshell: History and Features
• Solves Ax = b for a sparse matrix A using the algebraic multigrid (AMG) method
• One of the first fully GPU-accelerated AMG solvers
– Previous talk at GTC 2012
– Since then: improved scalability (the topic today)
– Integration into commercial applications
• Features:
– Simple C/Fortran APIs with only a few tunable parameters
– Classical and unsmoothed-aggregation methods
– Supports the GMRES, CG, and BiCGSTAB methods
– Multi-GPU enabled
– Supports OpenMP and C++11 threading models
– Fast setup for value-only matrix updates
– Hybrid solution (uses both GPU and CPU)
– Setup on a stream, which may overlap setup with other host-to-device DMAs
Motivation: Scaling Solvers in Scientific Applications
• Multiphysics application Charon – Running on Cray
• Weak Scaling • Solver is AMG with GMRES • List of challenges:
– Extreme levels of concurrency – Resilience and non-deterministic behavior – Reduced memory sizes per core – Data storage and movement – Deep memory hierarchies – Portability with performance
• Motivations for scaling GAMPACK
Solver Dominance in the ExaMath Era
Image courtesy of the report from the DOE Workshop on Extreme-Scale Solvers (ExaMath13)
https://collab.mcs.anl.gov/display/examath/ExaMath13+Workshop
Background: Multigrid
• Motivation
– Low-frequency errors decay slowly in standard iterative methods
– The number of iterations is proportional to the number of unknowns
• Constructing a hierarchy of grids (multigrid)
– Low-frequency error becomes high-frequency error on the coarse grid
– Needs a way to transfer errors between levels
– The number of iterations becomes independent of the number of unknowns
• Geometric and Algebraic Multigrid
– Geometric
• Structured meshes
• Easy to set up
– Algebraic
• Uses only the matrix coefficients to construct the hierarchy
• The system can be treated as a "black box"
Image courtesy of LLNL
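The slow-convergence motivation can be reproduced with a tiny experiment (plain Python, illustrative only; not GAMPACK code): iteration counts for a weighted-Jacobi smoother on a 1D Poisson model grow quickly as the number of unknowns grows, which is exactly the behavior multigrid removes.

```python
# Illustrative sketch: weighted Jacobi on the 1D Poisson problem
# -u'' = 1 with u(0) = u(1) = 0, discretized as the [-1, 2, -1] stencil.

def jacobi_iterations(n, tol=1e-3):
    """Iterations for weighted Jacobi to cut the residual by `tol`."""
    h = 1.0 / (n + 1)
    b = [h * h] * n            # right-hand side of the tridiagonal system
    x = [0.0] * n
    omega = 2.0 / 3.0          # classic weighted-Jacobi damping factor
    for it in range(1, 200000):
        x_new = x[:]
        for i in range(n):
            left = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            x_new[i] = (1 - omega) * x[i] + omega * 0.5 * (b[i] + left + right)
        # max-norm of the residual r = b - A x
        res = max(abs(b[i] - (2 * x_new[i]
                              - (x_new[i - 1] if i > 0 else 0.0)
                              - (x_new[i + 1] if i < n - 1 else 0.0)))
                  for i in range(n))
        x = x_new
        if res < tol * h * h:
            return it
    return -1

# Doubling the unknowns multiplies the iteration count several-fold.
i16, i32 = jacobi_iterations(16), jacobi_iterations(32)
```

The hierarchy of coarse grids is what breaks this dependence: the smooth error components that Jacobi barely touches become oscillatory on a coarser grid, where they are cheap to remove.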
Classical and Aggregation-Based Algebraic Multigrid
[Figure: an example fine grid with its C/F splitting and prolongator P, comparing Classical AMG, Unsmoothed Aggregation, and Smoothed Aggregation]
• Classical: partitions the nodes into coarse- and fine-grid nodes
• Aggregation: each coarse node aggregates a non-overlapping set of fine nodes
– Unsmoothed: simple prolongator/restrictor; at most one non-zero entry per row of P
– Smoothed: smoothing is applied to P, which becomes denser at lower levels
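The unsmoothed-aggregation prolongator described above can be sketched in a few lines of Python (dense lists stand in for sparse storage; purely illustrative): row i of P is the indicator of the aggregate containing fine node i, and the coarse operator is the Galerkin product P^T A P.

```python
# Illustrative sketch of the unsmoothed-aggregation prolongator:
# at most one non-zero (unit) entry per row of P.

def build_prolongator(agg, n_coarse):
    """agg[i] = index of the aggregate that fine node i belongs to."""
    n_fine = len(agg)
    P = [[0.0] * n_coarse for _ in range(n_fine)]
    for i, a in enumerate(agg):
        P[i][a] = 1.0          # single unit entry per row
    return P

def galerkin(A, P):
    """A_c = P^T A P (dense triple product, for illustration only)."""
    n, m = len(A), len(P[0])
    AP = [[sum(A[i][k] * P[k][j] for k in range(n)) for j in range(m)]
          for i in range(n)]
    return [[sum(P[k][i] * AP[k][j] for k in range(n)) for j in range(m)]
            for i in range(m)]

# Six fine nodes on a 1D chain, aggregated pairwise into three coarse nodes.
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
      for j in range(6)] for i in range(6)]
P = build_prolongator([0, 0, 1, 1, 2, 2], 3)
Ac = galerkin(A, P)   # 3x3 coarse operator
```

For this chain, the coarse operator keeps the [-1, 2, -1] structure of the fine one, which is why the aggregation hierarchy can be built recursively.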
Algebraic Multigrid: Implementation
• Divided into setup and solve phases
– Setup:
• Generates the hierarchy
• May be reused across many solve passes
– Solve:
• Recursively applies smoothing and coarse-grid correction on each level
• Full-cycle, V-cycle, K-cycle, W-cycle…
• Parallelizing the lower levels is key to achieving good scalability in AMG
– Less concurrency is available there due to their small size
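The setup/solve split can be sketched as a two-grid cycle on a 1D model problem (plain Python, illustrative only; GAMPACK's actual implementation is GPU-based and multi-level). With pairwise aggregation the Galerkin coarse operator happens to be the same [-1, 2, -1] stencil, which keeps the sketch short.

```python
# Illustrative two-grid cycle: pre-smooth, restrict the residual,
# solve the coarse system, prolongate the correction, post-smooth.

def smooth(x, b, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi on the [-1, 2, -1] stencil."""
    n = len(x)
    for _ in range(sweeps):
        xn = x[:]
        for i in range(n):
            l = x[i - 1] if i > 0 else 0.0
            r = x[i + 1] if i < n - 1 else 0.0
            xn[i] = (1 - omega) * x[i] + omega * 0.5 * (b[i] + l + r)
        x = xn
    return x

def residual(x, b):
    n = len(x)
    return [b[i] - (2 * x[i]
                    - (x[i - 1] if i > 0 else 0.0)
                    - (x[i + 1] if i < n - 1 else 0.0)) for i in range(n)]

def solve_coarse(rhs):
    """Exact Thomas-algorithm solve of the [-1, 2, -1] coarse system."""
    n = len(rhs)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = -0.5, rhs[0] / 2.0
    for i in range(1, n):
        m = 2.0 + cp[i - 1]
        cp[i] = -1.0 / m
        dp[i] = (rhs[i] + dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def two_grid_cycle(x, b):
    x = smooth(x, b, 3)                                         # pre-smooth
    r = residual(x, b)
    rc = [r[2 * i] + r[2 * i + 1] for i in range(len(r) // 2)]  # restrict: P^T r
    ec = solve_coarse(rc)                                       # coarse correction
    for i, e in enumerate(ec):                                  # prolongate: x += P ec
        x[2 * i] += e
        x[2 * i + 1] += e
    return smooth(x, b, 3)                                      # post-smooth

n = 64
h = 1.0 / (n + 1)
b = [h * h] * n
x = [0.0] * n
for _ in range(25):
    x = two_grid_cycle(x, b)
final_res = max(abs(v) for v in residual(x, b))
```

Here "setup" is the (trivial) choice of aggregates and coarse operator, done once; each `two_grid_cycle` call is one pass of the solve phase, and the residual shrinks by a roughly constant factor per cycle regardless of n.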
Challenges: Why Scale GAMPACK
• A case study
– 13 million unknowns
– 94 million non-zeros (NNZs)
– Classical AMG generates 13 levels
– GPU: one K40m
– CPU: two 8-core Sandy Bridge (E5-2680 @ 2.7-3.5 GHz)
• Statistics on each level:
Level   Num Rows    NNZ         NNZ/Row
0       13464033    93800843    6.97
1       5683899     74538389    13.11
2       1825181     41377697    22.67
3       521650      15772686    30.24
4       137061      4778495     34.86
5       33150       1241600     37.45
6       7269        275747      37.93
7       1544        54622       35.38
8       338         10232       30.27
9       54          1154        21.37
10      10          82          8.2
11      3           9           3
12      1           1           1
NNZ distribution by level: Level 0: 40%, Level 1: 32%, Level 2: 18%, Level 3: 7%, Level 4: 2%, Level 5+: 1%
Challenges: Why Scale GAMPACK
• Two different trends on CPU and GPU
• Cache effects in the CPU implementation
– More efficient for small levels that fit in cache
– Latency-oriented processors
• The GPU is good for large data sets
– Throughput-oriented processors
• Harder to scale multigrid on GPUs
– Amdahl's law
Per-level distributions (Level 0 / 1 / 2 / 3 / 4 / 5+):
– NNZ: 40% / 32% / 18% / 7% / 2% / 1%
– GPU runtime (single GPU): 38% / 32% / 18% / 7% / 3% / 2%
– CPU runtime (multi-threaded in-house CPU): 58% / 25% / 11% / 4% / 1% / 1%
Scaling Optimizations: Baseline
• Test bed
– 8 K40m GPUs, 12 GB memory each
– 2 8-core Ivy Bridge CPUs (2.1 GHz)
– 256 GB host memory
• Total solution time, solving to a 1e-6 relative residual
• From 1 GPU to 8 GPUs (strong scaling)
• 13 million unknowns
• Initial implementation has low efficiency: 41%
Speedup from 1 to 8 GPUs (baseline): 3.28X
Scaling Optimizations (Opt I): Overlapping Communication and Computation
• Made possible by
– cudaMemcpyAsync
– Page-locked memory (cudaMallocHost or cudaHostRegister)
• SpMV (y = Ax + y)
– Calculate y using the local x values
– Exchange the remote x values at the same time
– Then update y with the remote x values
– Hybrid matrix format
• The same idea applies to SpMM and PMIS
– Calculate the internal region while exchanging halos
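The overlap idea can be sketched in plain Python (no CUDA; purely a sketch of the local/halo decomposition): the local-column contributions to y are computed first, standing in for the work that overlaps the halo exchange, and the halo-column contributions are folded in once the exchange completes.

```python
# Illustrative sketch of splitting y = A x + y into a local part and a
# halo part; in the real solver phase 1 runs while cudaMemcpyAsync moves
# remote x values over page-locked buffers.

def spmv_split(rows, x_local, x_halo, y):
    """rows[i] = list of (col, val); cols >= len(x_local) index halo entries."""
    n_local = len(x_local)
    # Phase 1: local columns only; this overlaps with the halo exchange.
    for i, row in enumerate(rows):
        y[i] += sum(v * x_local[c] for c, v in row if c < n_local)
    # Phase 2: after the exchange completes, fold in the halo columns.
    for i, row in enumerate(rows):
        y[i] += sum(v * x_halo[c - n_local] for c, v in row if c >= n_local)
    return y

# Two local rows; columns 0-1 are local, column 2 is owned by a neighbor.
rows = [[(0, 2.0), (1, -1.0)], [(1, 2.0), (2, -1.0)]]
y = spmv_split(rows, x_local=[1.0, 1.0], x_halo=[3.0], y=[0.0, 0.0])
```

Storing the local and halo columns separately (the hybrid matrix format mentioned above) is what makes the two phases cheap to separate.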
Speedup from 1 to 8 GPUs: baseline 3.28X, Opt I 3.72X
Scaling Optimizations (Opt II): Aggregating Levels
• Aggregate the coarser levels onto fewer GPUs
– Intermediate levels onto 4 GPUs
– Coarse levels onto 1 GPU
• Use the CPU for the coarsest level
– SuperLU
– PARDISO
– Multi-threaded CPU AMG solver
• Even the single-GPU case benefits from the CPU coarse solver
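A minimal sketch of such a level-assignment policy (the row-count thresholds here are invented for illustration; the talk only states the all-GPUs/4-GPUs/1-GPU/CPU pattern):

```python
# Hypothetical policy: assign each level to fewer devices as it shrinks,
# and hand the coarsest level to a CPU direct solver. Thresholds are
# invented for illustration, not taken from GAMPACK.

def assign_level(num_rows, total_gpus=8):
    if num_rows > 500_000:
        return ("gpu", total_gpus)                 # fine levels: all GPUs
    if num_rows > 10_000:
        return ("gpu", max(1, total_gpus // 2))    # intermediate: fewer GPUs
    if num_rows > 500:
        return ("gpu", 1)                          # coarse: a single GPU
    return ("cpu", 1)                              # coarsest: CPU solver

# Row counts taken from the case-study level table above.
plan = [assign_level(n) for n in (13_464_033, 137_061, 7_269, 54)]
```

The payoff is that tiny levels stop paying multi-GPU communication costs they cannot amortize, and the very coarsest system goes to the processor that handles small problems best.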
Speedup from 1 to 8 GPUs: baseline 3.28X, Opt I 3.72X, Opt II 4.66X
Scaling Optimizations (Opt III): METIS and Others
• Apply METIS partitioning to the input matrix before solving
– Reduces communication
– METIS time not included
• Pin memory for the input buffers
– Better host-to-device transfers
• Thread affinity
– Pin CPU threads to the socket closest to the GPU
• Upgrade the CUDA release
– 2% faster going from 5.5 to 6.0
– Compiler advances
– Thrust improvements
• Final speedup factor: 5.5X (strong scaling)
• Efficiency: 68.75%
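The thread-affinity item can be illustrated with Linux's scheduler API (a simplified sketch; real code would derive the near-socket core IDs from the hardware topology, e.g. via hwloc, rather than assume them):

```python
import os

# Hedged sketch: pin this process's threads to a chosen set of cores so
# host-side work runs on the socket closest to the GPU. The core set is
# an assumption; real code would query the topology first.
def pin_to_cores(cores):
    os.sched_setaffinity(0, cores)        # 0 = the calling process
    return sorted(os.sched_getaffinity(0))

# Demo: re-pin to the cores we are already allowed on (a safe no-op).
allowed = pin_to_cores(os.sched_getaffinity(0))
```

Keeping the host threads on the socket attached to the GPU's PCIe root shortens the path that pinned-memory DMA traffic takes across the system.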
Speedup from 1 to 8 GPUs: baseline 3.28X, Opt I 3.72X, Opt II 4.66X, Opt III 5.5X
Scaling Optimizations: Future Optimizations
• Try different matrix formats
– Still an active research topic
• Reduce kernel-launch overhead
– Not negligible on coarse levels
– Worse when multi-threaded
• The same principles apply to the cluster solution
Weak Scaling: Better Efficiency
• Measuring millions of unknowns per second
• 55 million unknowns solved in 3.43 seconds on 8 K40m GPUs
• 95 million unknowns could possibly fit in 8 K40ms
• Includes all transfer time, plus setup and solve to a 1e-6 relative tolerance
• Efficiency:
– 2 GPUs: 92%
– 4 GPUs: 72%
– 8 GPUs: 74%
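The throughput metric works out as follows (numbers from the slide):

```python
# Weak-scaling throughput: 55 million unknowns in 3.43 s on 8 K40m GPUs,
# including transfer, setup, and solve time.
unknowns = 55e6
seconds = 3.43
millions_per_sec = unknowns / seconds / 1e6   # roughly 16 M unknowns/s
```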
[Chart: millions of unknowns/s for 1, 2, 4, and 8 GPUs, GAMPACK vs. perfect scaling]
Comparison to HYPRE: University of Florida Sparse Matrix Collection
• GPU: one K40m
– Clock frequency maxed out
• HYPRE 2.9.0 with OpenMP
– Xeon E5-2680, 2.7-3.5 GHz
– 7 CPU threads
– 64 GB host memory
• Parameters tuned for each matrix
• Speedup factors: 5.53X to 16.10X
[Chart: per-matrix speedups of GAMPACK over HYPRE: 12.01X, 6.74X, 5.53X, 6.60X, 8.90X, 13.19X, 16.10X, 12.53X, 14.56X, 6.38X, 12.78X]
Summary
• GAMPACK is a fast, scalable, GPU-accelerated AMG solver
– Easy-to-use, highly configurable APIs
– Flexible threading models
– Solves in double, single, or mixed precision
– Highly tuned for Fermi and Kepler
– Supports CUDA 5.5 and above
– Runs on Win64 and Linux
• Ongoing work
– The cluster version with MPI is near completion
Team Members
Contact: [email protected]
Ken Esler Yongpeng Zhang Rajesh Gandham Vincent Natoli (intern 2013)
Stone Ridge Technology is located in Bel Air, MD and specializes in the development, acceleration, and extension of scientific and engineering software for conventional CPU, multicore, and hybrid many-core platforms. It employs computational physicists, applied mathematicians, computer scientists, and electrical engineers. It has been an NVIDIA partner since 2008.