
AmgX 2.0: Scaling toward CORAL

Joe Eaton, November 19, 2015


Agenda

Introduction to AmgX

Current Capabilities

Scaling

V2.0

Roadmap for the future


AmgX

Fast, scalable linear solvers, with an emphasis on iterative methods

Flexible toolkit for GPU-accelerated Ax = b solvers

Simple API makes it easy to solve your problems faster


“Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation.”

Garf Bowen, Ridgeway Kite Software


AmgX in Reservoir Simulation

Solve Faster

Solve Larger Systems

Flexible High Level API

Application time in seconds (lower is better): CPU 1150, custom GPU code 197, AmgX 98

3-phase Black Oil Reservoir Simulation, 400K grid blocks solved fully implicitly.
CPU: Intel Xeon E5-2670
GPU: NVIDIA Tesla K10


AmgX 2.0: New Features since 1.0

Classical AMG with truncation, robust aggressive coarsening

Complex arithmetic

GPUDirect, RDMA-async

Power8 support, Maxwell support

Crash-proof object management

Re-usable setup phase

Adaptors for major solver packages:

HYPRE, PETSc, Trilinos

Import data structures directly to AmgX for solve, export the solution (see the sketch after this list)

Host or Device pointer support

JSON configuration
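The direct-import path, together with host-or-device pointer support, can be sketched in a few calls. The snippet below uploads a CSR matrix and right-hand side already held in application memory; the handles res and mode are the ones created in the minimal example later in the deck, the 2x2 system and its values are made up for illustration, and error checks are omitted.

/* Sketch only: a 2x2 CSR system uploaded straight from application memory. */
int n = 2, nnz = 3;
int    row_ptrs[]    = {0, 2, 3};
int    col_indices[] = {0, 1, 1};
double values[]      = {4.0, -1.0, 3.0};
double rhs[]         = {1.0, 2.0};

AMGX_matrix_handle A;
AMGX_vector_handle b, x;
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&b, res, mode);
AMGX_vector_create(&x, res, mode);

/* row_ptrs / col_indices / values may be host or device pointers */
AMGX_matrix_upload_all(A, n, nnz, 1, 1, row_ptrs, col_indices, values, NULL);
AMGX_vector_upload(b, n, 1, rhs);
AMGX_vector_set_zero(x, n, 1);

After the solve, AMGX_vector_download(x, result) copies the solution back into a caller-provided host array.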


Key Features

Un-smoothed Aggregation AMG

Krylov methods: CG, GMRES, BiCGStab, IDR

Smoothers and Solvers:

Block-Jacobi, Gauss-Seidel

Incomplete LU, Dense LU

KPZ-Polynomial, Chebyshev

Flexible composition system

Scalar or coupled block systems, multi-precision (see the mode sketch after this list)

MPI, OpenMP support

Auto-consolidation

Flexible, simple high level C API
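Precision and data location are selected per solver through a mode value passed at creation time. The fragment below is a sketch of that pattern using the standard mode naming (memory space, matrix precision, vector precision, index type); it reuses the handle names (res, cfg, solver, A, x, b) from the minimal example on the next slide, and the exact set of available modes should be treated as version-dependent.

/* AMGX_mode_dDDI: device memory, double matrix, double vectors, 32-bit indices.
   AMGX_mode_dDFI: double matrix with single-precision vectors (mixed precision). */
AMGX_Mode mode = AMGX_mode_dDDI;        /* all-double on the GPU */
/* AMGX_Mode mode = AMGX_mode_dDFI; */  /* mixed-precision variant */

AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);      /* matrix and vectors must share the solver's mode */
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);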


Minimal Example With Config

// One header
#include "amgx_c.h"

// Read config file
AMGX_config_create_from_file(&cfg, cfgfile);

// Create resources based on config
AMGX_resources_create_simple(&res, cfg);

// Create solver object and A, x, b; mode sets the precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);

// Read coefficients from a file
AMGX_read_system(A, b, x, matrixfile);

// Setup and solve loop
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);

// Download result into a caller-provided host array
AMGX_vector_download(x, result);

The matching configuration file (cfgfile):

solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_MAX
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1
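The slide leaves out library start-up, teardown, and status checking. A minimal self-contained sketch, assuming the config and matrix files are named solver.cfg and matrix.mtx, might look like the following; in production code the return value of every AMGX_* call should be checked against AMGX_RC_OK.

#include <stdio.h>
#include "amgx_c.h"

int main(void)
{
    const char *cfgfile = "solver.cfg";     /* assumed file names */
    const char *matrixfile = "matrix.mtx";

    /* Library start-up */
    AMGX_initialize();
    AMGX_initialize_plugins();

    AMGX_config_handle cfg;
    AMGX_resources_handle res;
    AMGX_solver_handle solver;
    AMGX_matrix_handle A;
    AMGX_vector_handle x, b;
    AMGX_Mode mode = AMGX_mode_dDDI;        /* double precision on the GPU */

    AMGX_config_create_from_file(&cfg, cfgfile);
    AMGX_resources_create_simple(&res, cfg);
    AMGX_solver_create(&solver, res, mode, cfg);
    AMGX_matrix_create(&A, res, mode);
    AMGX_vector_create(&x, res, mode);
    AMGX_vector_create(&b, res, mode);

    AMGX_read_system(A, b, x, matrixfile);
    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);

    AMGX_SOLVE_STATUS status;
    AMGX_solver_get_status(solver, &status);
    printf("solve status: %d\n", (int)status);

    /* Teardown in reverse order of creation */
    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(b);
    AMGX_vector_destroy(x);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(res);
    AMGX_config_destroy(cfg);
    AMGX_finalize_plugins();
    AMGX_finalize();
    return 0;
}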


Integrates easily with MPI and OpenMP domain decomposition

Adding GPU support to existing applications raises new issues

Proper ratio of CPU cores / GPU?

How can multiple CPU cores (MPI ranks) share a single GPU?

How does MPI switch between two sets of ‘ranks’: one set for CPUs, one set for GPUs?

AmgX handles this via Consolidation

Consolidate multiple smaller sub-matrices into a single matrix

Handled automatically during the PCIe data copy


Diagram: the original problem (unknowns u1–u7) is partitioned across 2 MPI ranks, with boundary exchange introducing halo copies u'2 and u'4; both ranks' sub-matrices are then consolidated onto 1 GPU during the PCIe copy.


Consolidation Examples

1 CPU socket <=> 1 GPU

Dual socket CPU <=> 2 GPUs

Dual socket CPU <=> 4 GPUs

Arbitrary Cluster:

4 nodes x [2 CPUs + 3 GPUs], connected by InfiniBand
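A rough sketch of the rank-to-GPU mapping behind consolidation: every MPI rank on a node passes the same device id when creating its AmgX resources, so their sub-matrices end up consolidated onto that single GPU. The device number, config file name, and single-node layout are assumptions for illustration, and error handling is omitted.

#include <mpi.h>
#include "amgx_c.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int rank;
    MPI_Comm_rank(comm, &rank);

    AMGX_initialize();
    AMGX_initialize_plugins();

    AMGX_config_handle cfg;
    AMGX_config_create_from_file(&cfg, "solver.cfg");   /* assumed config file */

    /* Every rank on the node names the same device, so AmgX consolidates
       their sub-matrices onto that one GPU during the PCIe upload. */
    int device = 0;
    AMGX_resources_handle rsrc;
    AMGX_resources_create(&rsrc, cfg, &comm, 1, &device);

    /* ... create matrix/vectors with rsrc, upload each rank's partition,
       then setup and solve as in the minimal example ... */

    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(cfg);
    AMGX_finalize_plugins();
    AMGX_finalize();
    MPI_Finalize();
    return 0;
}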


PETSc KSP vs AmgX performance test

PDE:

\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)

BCs (homogeneous Neumann on the unit cube):

\left.\frac{\partial u}{\partial x}\right|_{x=0} = \left.\frac{\partial u}{\partial x}\right|_{x=1} = \left.\frac{\partial u}{\partial y}\right|_{y=0} = \left.\frac{\partial u}{\partial y}\right|_{y=1} = \left.\frac{\partial u}{\partial z}\right|_{z=0} = \left.\frac{\partial u}{\partial z}\right|_{z=1} = 0

Exact solution:

u(x,y,z) = \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)
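As a quick sanity check (not on the original slide): each second derivative of the stated solution returns -4\pi^2 u, so the Laplacian reproduces the right-hand side.

\frac{\partial^2 u}{\partial x^2} = \frac{\partial^2 u}{\partial y^2} = \frac{\partial^2 u}{\partial z^2} = -4\pi^2 u
\quad\Longrightarrow\quad
\Delta u = -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)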


PETSc vs AmgX

7x speedup at 4M unknowns (16 CPU cores vs 1 GPU); 8x speedup at 100M unknowns (512 CPU cores vs 32 GPUs)

Machine specification

GPU nodes: two NVIDIA Tesla K20m GPUs per node

CPU nodes: two Intel Xeon E5-2670 CPUs per node (16 cores per node in total)

CPU baseline: PETSc KSP solver


SPE10 Cases

We derived several test cases from the SPE10 permeability distribution by fixing an x-y resolution and adding resolution in z, using a TPFA stencil.
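For context on the TPFA stencil named above, here is a generic sketch of the two-point flux approximation transmissibility between neighbouring cells; the function and variable names are illustrative only and are not taken from the actual SPE10 case setup.

/* Generic TPFA sketch: the coupling between two neighbouring cells is the
   harmonic combination of their half-transmissibilities (cell permeability
   times shared face area over centre-to-face distance). */
static double tpfa_transmissibility(double perm_i, double perm_j,
                                    double face_area,
                                    double dist_i, double dist_j)
{
    double half_i = perm_i * face_area / dist_i;
    double half_j = perm_j * face_area / dist_j;
    return (half_i * half_j) / (half_i + half_j);
}

/* In the assembled matrix, row i gets -T_ij for each neighbour j and the
   diagonal accumulates the sum of those transmissibilities. */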


SPE10 Matrix Tests

GPU: NVIDIA K40

CPU: HYPRE on a 10-core Ivy Bridge Xeon E5-2690 v2 @ 3.0 GHz

Chart: speedup of AmgX on 1 GPU versus HYPRE on 1 CPU socket, plotted against problem size in millions of unknowns.


Scaling up the right way


Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16 core AMD Opteron 6274 @ 2.2GHz

Aggregation and Classical AMG Weak Scaling, 8 million DOF per GPU

Chart: setup time (s) versus number of GPUs (1–512), comparing AmgX 1.0 (PMIS) and AmgX 1.0 (AGG).


Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16 core AMD Opteron 6274 @ 2.2GHz

Aggregation and Classical AMG Weak Scaling, 8 million DOF per GPU

Chart: solve time per iteration versus log(P) for 1–512 GPUs. The classical AMG solve fits y = 0.0062x + 0.0719 (R² = 0.9249); the aggregation AMG solve fits y = 0.0022x + 0.0585 (R² = 0.9437).


Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16 core AMD Opteron 6274 @ 2.2GHz

Classical AMG Preconditioner, 8 million DOF per GPU

Chart: iteration counts for PCG and GMRES versus number of GPUs (1–512).


Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16 core AMD Opteron 6274 @ 2.2GHz

Classical AMG Preconditioner, 8 million DOF per GPU

Chart: solve time (s) for GMRES and PCG versus number of GPUs (1–512).


AmgX 2.0: MPI with GPUDirect RDMA

4x lower latency, 3x higher bandwidth, 45% lower CPU utilization


Basic Coarsening

Aggressive Coarsening

Less Memory, Faster Setup


AmgX 2.0 Licensing

Developer/Academic License: non-commercial use, free

Commercial License, Developer License, Premier Support Service

Subscription License (per node, per year): includes Support and Maintenance, volume-based pricing

Site License

Perpetual License, with 20% Maintenance and Support


AmgX Roadmap

Continuous Improvement

Availability:

AmgX 2.0 Release: Q4 2015

AmgX 2.5: Q2 2016

Features:

Classical AMG: multi-node, multi-GPU, aggressive coarsening

Complex arithmetic + aggregation

Easy interfaces, Python

PETSc, HYPRE, Trilinos

Robust convergence on SPE10

GPUDirect v2.0

Scalable sparse eigensolvers

Scaling past 512 GPUs

Range Decomposition AMG

Guaranteed-convergence aggregation

Commercial License, Premier Support

CUDA 8.0 with Pascal support

Tuning for Maxwell

AmgX 2.0 was made by a great team of contributors.

AmgX 2.0 Team: Marat Arsaev, Joe Eaton, Alex Fender, Andrei Schaffer

Devtechs: Simon Layton, Nikolai Sakharnykh, Nikolay Markovskiy

Interns: Rohit Gupta, Constantine Stulov