AmgX 2.0: Scaling toward CORAL
Joe Eaton, November 19, 2015
Agenda
Introduction to AmgX
Current Capabilities
Scaling
V2.0
Roadmap for the future
AmgX
Fast, scalable linear solvers, with an emphasis on iterative methods
Flexible toolkit for GPU-accelerated Ax = b solvers
Simple API that makes it easy to solve your problems faster
"Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation."
Garf Bowen, Ridgeway Kite Software
AmgX in Reservoir Simulation
Solve Faster
Solve Larger Systems
Flexible High Level API

[Bar chart: Application Time (seconds), lower is better. CPU: 1150 s; GPU with AmgX: 197 s; custom GPU code: 98 s.]

3-phase Black Oil Reservoir Simulation, 400K grid blocks solved fully implicitly.
CPU: Intel Xeon E5-2670. GPU: NVIDIA Tesla K10.
AmgX 2.0: New Features since 1.0
Classical AMG with truncation, robust aggressive coarsening
Complex arithmetic
GPUDirect, RDMA-async
Power8 support, Maxwell support
Crash-proof object management
Re-usable setup phase
Adaptors for major solver packages: HYPRE, PETSc, Trilinos
Import data structures directly to AmgX for solve, export solution
Host or Device pointer support
JSON configuration (see the sketch below)
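The JSON configuration expresses the same key/value scopes shown on the "Minimal Example" slide later in the deck as nested objects. A minimal sketch, assuming the JSON schema used by the public AmgX releases:

    {
      "config_version": 2,
      "determinism_flag": 1,
      "solver": {
        "solver": "FGMRES",
        "max_iters": 100,
        "convergence": "RELATIVE_MAX",
        "tolerance": 0.1,
        "preconditioner": {
          "solver": "AMG",
          "algorithm": "AGGREGATION",
          "selector": "SIZE_8",
          "cycle": "V",
          "max_iters": 1,
          "max_levels": 10,
          "presweeps": 1,
          "postsweeps": 2,
          "coarsest_sweeps": 4,
          "smoother": {
            "solver": "BLOCK_JACOBI",
            "relaxation_factor": 0.75
          }
        }
      }
    }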
Key Features
Un-smoothed Aggregation AMG
Krylov methods: CG, GMRES, BiCGStab, IDR
Smoothers and Solvers:
Block-Jacobi, Gauss-Seidel
Incomplete LU, Dense LU
KPZ-Polynomial, Chebyshev
Flexible composition system
Scalar or coupled block systems, multi-precision
MPI, OpenMP support
Auto-consolidation
Flexible, simple high level C API
Minimal Example With Config
    // One header
    #include "amgx_c.h"

    // Read config file
    AMGX_create_config(&cfg, cfgfile);
    // Create resources based on config
    AMGX_resources_create_simple(&res, cfg);
    // Create solver object and A, x, b; set precision via mode
    AMGX_solver_create(&solver, res, mode, cfg);
    AMGX_matrix_create(&A, res, mode);
    AMGX_vector_create(&x, res, mode);
    AMGX_vector_create(&b, res, mode);
    // Read coefficients from a file
    AMGX_read_system(&A, &x, &b, matrixfile);
    // Setup and solve loop
    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);
    // Download result
    AMGX_download_vector(&x);
solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_MAX
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1
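The slide intentionally omits library startup and teardown. A fuller sketch of the same lifecycle, assuming the entry points of the public AmgX C API (AMGX_initialize, AMGX_config_create_from_file, the *_destroy calls, AMGX_finalize); the file names are placeholders:

    #include "amgx_c.h"

    int main(void)
    {
        const char *cfgfile = "config.json";    /* placeholder */
        const char *matrixfile = "matrix.mtx";  /* placeholder */

        AMGX_config_handle cfg;
        AMGX_resources_handle res;
        AMGX_solver_handle solver;
        AMGX_matrix_handle A;
        AMGX_vector_handle x, b;

        /* dDDI: device memory, double matrix, double vectors, int indices */
        AMGX_Mode mode = AMGX_mode_dDDI;

        AMGX_initialize();

        AMGX_config_create_from_file(&cfg, cfgfile);
        AMGX_resources_create_simple(&res, cfg);

        AMGX_solver_create(&solver, res, mode, cfg);
        AMGX_matrix_create(&A, res, mode);
        AMGX_vector_create(&x, res, mode);
        AMGX_vector_create(&b, res, mode);

        /* reads A, the right-hand side b, and (if present) an initial x */
        AMGX_read_system(A, b, x, matrixfile);

        AMGX_solver_setup(solver, A);
        AMGX_solver_solve(solver, b, x);

        /* tear down in reverse order of creation */
        AMGX_solver_destroy(solver);
        AMGX_vector_destroy(b);
        AMGX_vector_destroy(x);
        AMGX_matrix_destroy(A);
        AMGX_resources_destroy(res);
        AMGX_config_destroy(cfg);
        AMGX_finalize();
        return 0;
    }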
Integrates easily with MPI and OpenMP domain decomposition
Adding GPU support to existing applications raises new questions:
What is the proper ratio of CPU cores to GPUs?
How can multiple CPU cores (MPI ranks) share a single GPU?
How does MPI switch between two sets of ranks, one sized for CPUs and one for GPUs?
AmgX handles this via Consolidation:
Consolidates multiple smaller sub-matrices into a single matrix
Handled automatically during the PCIe data copy (see the sketch below)
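A minimal sketch of how several MPI ranks can share GPUs when creating AmgX resources. The round-robin device mapping is our illustration; AMGX_resources_create is taken from the public AmgX C API, and ranks that resolve to the same physical device are the candidates AmgX consolidates:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include "amgx_c.h"

    /* Map each MPI rank to a GPU; ranks that share a device can be
       consolidated by AmgX into one sub-matrix on that GPU. */
    void create_resources(AMGX_config_handle cfg, AMGX_resources_handle *res)
    {
        int rank, ngpus;
        MPI_Comm comm = MPI_COMM_WORLD;
        MPI_Comm_rank(comm, &rank);
        cudaGetDeviceCount(&ngpus);

        /* illustrative round-robin mapping: several ranks per GPU */
        int device = rank % ngpus;

        /* each rank registers one device; consolidation happens
           during the PCIe upload of the rank-local matrices */
        AMGX_resources_create(res, cfg, &comm, 1, &device);
    }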
[Diagram: a seven-unknown mesh (u1 to u7). The original problem is partitioned across 2 MPI ranks, with halo copies u'2 and u'4 exchanged at the partition boundary, then consolidated onto 1 GPU over PCIe.]
Consolidation Examples
1 CPU socket → 1 GPU
Dual-socket CPU → 2 GPUs
Dual-socket CPU → 4 GPUs
Arbitrary cluster: 4 nodes × [2 CPUs + 3 GPUs], InfiniBand
PETSc KSP vs AmgX performance test
PDE:
\[
\frac{\partial^2 u}{\partial x^2}+\frac{\partial^2 u}{\partial y^2}+\frac{\partial^2 u}{\partial z^2}
= -12\pi^2\cos(2\pi x)\cos(2\pi y)\cos(2\pi z)
\]
BCs (homogeneous Neumann on the unit cube):
\[
\left.\frac{\partial u}{\partial x}\right|_{x=0}
=\left.\frac{\partial u}{\partial x}\right|_{x=1}
=\left.\frac{\partial u}{\partial y}\right|_{y=0}
=\left.\frac{\partial u}{\partial y}\right|_{y=1}
=\left.\frac{\partial u}{\partial z}\right|_{z=0}
=\left.\frac{\partial u}{\partial z}\right|_{z=1}=0
\]
Exact solution:
\[
u(x,y,z)=\cos(2\pi x)\cos(2\pi y)\cos(2\pi z)
\]
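Substituting the exact solution into the PDE confirms the right-hand side: each second derivative multiplies u by \(-(2\pi)^2\), so

\[
\nabla^2 u = -3\,(2\pi)^2\,u = -12\pi^2\cos(2\pi x)\cos(2\pi y)\cos(2\pi z),
\]

and the Neumann conditions hold because each first derivative carries a factor \(\sin(2\pi x)\), \(\sin(2\pi y)\), or \(\sin(2\pi z)\), all of which vanish at 0 and 1.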
PETSc vs AmgX
7x speedup at 4M unknowns: 16 CPU cores vs 1 GPU
8x speedup at 100M unknowns: 512 CPU cores vs 32 GPUs
Machine specification
GPU nodes: two K20m per node
CPU nodes: two Intel Xeon E5-2670 per node (16 cores total per node)
CPU runs use the PETSc KSP solver
SPE10 Cases
We derived several test cases from the SPE10 permeability distribution by fixing an x-y resolution and adding resolution in z, using a TPFA stencil (see the formula below).
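For reference, the two-point flux approximation (TPFA) connects each cell only to its face neighbors, so a structured x-y-z grid yields a 7-point stencil. In our notation (not from the slides), the flux between neighboring cells i and j and the resulting matrix entries are

\[
F_{ij} = T_{ij}\,(p_i - p_j),
\qquad
T_{ij} = \frac{A_{ij}}{\Delta_{ij}}\,\frac{2\,k_i\,k_j}{k_i + k_j},
\]

where \(A_{ij}\) is the interface area, \(\Delta_{ij}\) the distance between cell centers, and the harmonic average of the SPE10 permeabilities \(k_i, k_j\) produces coefficients spanning many orders of magnitude, which is what stresses the solver. Row i has diagonal \(\sum_j T_{ij}\) and off-diagonals \(-T_{ij}\).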
SPE10 Matrix Tests
GPU: NVIDIA K40
CPU: HYPRE on a 10-core Ivy Bridge Xeon E5-2690 v2 @ 3.0 GHz
[Figure: Speedup, 1 CPU socket vs 1 GPU, as a function of problem size in millions of unknowns (0 to 10); the y-axis runs from 0 to 5x.]
Scaling up the right way
Poisson Equation / Laplace operator
Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16 core AMD Opteron 6274 @ 2.2GHz
Aggregation and Classical Weak Scaling, 8 million DOF per GPU

[Figure: Setup time (s, 0 to 12) vs number of GPUs (1 to 512, doubling) for AmgX 1.0 classical (PMIS) and AmgX 1.0 aggregation (AGG).]
[Figure: Time per Iteration vs log(P): solve time per iteration (s, 0 to 0.16) vs number of GPUs (1 to 512) on the same weak-scaling runs. Linear fits: classical AMG solve y = 0.0062x + 0.0719 (R² = 0.9249); aggregation AMG solve y = 0.0022x + 0.0585 (R² = 0.9437).]
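Taking the fit variable x to be the position along the logarithmic GPU axis (x = log2 P + 1, our assumption about how the trend lines were computed), the extrapolated per-iteration solve time at P = 512 GPUs (x = 10) is

\[
t_{\text{classical}} \approx 0.0062 \cdot 10 + 0.0719 = 0.134\ \text{s},
\qquad
t_{\text{aggregation}} \approx 0.0022 \cdot 10 + 0.0585 = 0.081\ \text{s},
\]

i.e. the per-iteration cost of both hierarchies grows only logarithmically with the GPU count, consistent with the R² values above.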
Classical AMG Preconditioner, 8 million DOF per GPU

[Figure: Iteration count (0 to 120) vs number of GPUs (1 to 512) for PCG and GMRES.]
[Figure: Solve time (s, 0 to 16) vs number of GPUs (1 to 512) for GMRES and PCG with the classical AMG preconditioner.]
AmgX 2.0: MPI with GPUDirect RDMA
4x lower latency, 3x higher bandwidth, 45% lower CPU utilization
Basic Coarsening

Aggressive Coarsening
Less Memory, Faster Setup
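An illustrative back-of-the-envelope argument (our notation, not from the slides): if each level of the hierarchy shrinks the operator by a factor r, the total storage across all levels is a geometric series,

\[
C_{\mathrm{op}} = \frac{\sum_{\ell}\mathrm{nnz}(A_\ell)}{\mathrm{nnz}(A_0)}
\approx \sum_{\ell \ge 0} r^{\ell} = \frac{1}{1-r},
\]

so standard coarsening with r ≈ 1/2 costs about 2x the fine-grid memory, while aggressive coarsening with r ≈ 1/4 costs about 1.33x, and the setup work shrinks correspondingly.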
AmgX 2.0 Licensing
Developer/Academic License: non-commercial use, free
Commercial License, Developer License, Premier Support Service
Subscription License (per node, per year)
Includes Support and Maintenance
Volume-based pricing
Site License
Perpetual License
20% Maintenance and Support
AmgX Roadmap
Continuous Improvement
Availability Features
Classical AMG: multi-node, multi-GPU, aggressive coarsening
Complex arithmetic + aggregation
Easy interfaces, Python
PETSc, HYPRE, Trilinos
Robust convergence on SPE10
GPUDirect v2.0
Scalable sparse eigensolvers
Scaling past 512 GPUs
Range decomposition AMG
Guaranteed convergence aggregation
Commerci