
AmgX 2.0: Scaling toward CORAL

Joe Eaton, November 19, 2015


Agenda

Introduction to AmgX

Current Capabilities

Scaling

V2.0

Roadmap for the future


AmgX

Fast, scalable linear solvers, with an emphasis on iterative methods

Flexible toolkit for GPU-accelerated Ax = b solvers

Simple API makes it easy to solve your problems faster


“Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation.”

Garf Bowen, Ridgeway Kite Software


AmgX in Reservoir Simulation

Solve Faster

Solve Larger Systems

Flexible High Level API

Application time in seconds (lower is better): CPU 1150, custom GPU code 197, AmgX 98

3-phase Black Oil Reservoir Simulation, 400K grid blocks solved fully implicitly.
CPU: Intel Xeon E5-2670
GPU: NVIDIA Tesla K10


AmgX 2.0: New Features since 1.0

Classical AMG with truncation, robust aggressive coarsening

Complex arithmetic

GPUDirect, RDMA-async

Power8 support, Maxwell support

Crash-proof object management

Re-usable setup phase

Adaptors for major solver packages:

HYPRE, PETSc, Trilinos

Import data structures directly to AmgX for solve, export the solution (see the sketch after this list)

Host or Device pointer support

JSON configuration
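The direct-import path, together with host-or-device pointer support, can be sketched in a few calls. The snippet below uploads a CSR matrix and right-hand side already held in application memory; the handles res and mode are the ones created in the minimal example later in the deck, the 2x2 system and its values are made up for illustration, and error checks are omitted.

/* Sketch only: a 2x2 CSR system uploaded straight from application memory. */
int n = 2, nnz = 3;
int    row_ptrs[]    = {0, 2, 3};
int    col_indices[] = {0, 1, 1};
double values[]      = {4.0, -1.0, 3.0};
double rhs[]         = {1.0, 2.0};

AMGX_matrix_handle A;
AMGX_vector_handle b, x;
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&b, res, mode);
AMGX_vector_create(&x, res, mode);

/* row_ptrs / col_indices / values may be host or device pointers */
AMGX_matrix_upload_all(A, n, nnz, 1, 1, row_ptrs, col_indices, values, NULL);
AMGX_vector_upload(b, n, 1, rhs);
AMGX_vector_set_zero(x, n, 1);

After the solve, AMGX_vector_download(x, result) copies the solution back into a caller-provided host array.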


Key Features

Un-smoothed Aggregation AMG

Krylov methods: CG, GMRES, BiCGStab, IDR

Smoothers and Solvers:

Block-Jacobi, Gauss-Seidel

Incomplete LU, Dense LU

KPZ-Polynomial, Chebyshev

Flexible composition system

Scalar or coupled block systems, multi-precision (see the mode sketch after this list)

MPI, OpenMP support

Auto-consolidation

Flexible, simple high level C API
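Precision and data location are selected per solver through a mode value passed at creation time. The fragment below is a sketch of that pattern using the standard mode naming (memory space, matrix precision, vector precision, index type); it reuses the handle names (res, cfg, solver, A, x, b) from the minimal example on the next slide, and the exact set of available modes should be treated as version-dependent.

/* AMGX_mode_dDDI: device memory, double matrix, double vectors, 32-bit indices.
   AMGX_mode_dDFI: double matrix with single-precision vectors (mixed precision). */
AMGX_Mode mode = AMGX_mode_dDDI;        /* all-double on the GPU */
/* AMGX_Mode mode = AMGX_mode_dDFI; */  /* mixed-precision variant */

AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);      /* matrix and vectors must share the solver's mode */
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);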


Minimal Example With Config

// One header
#include "amgx_c.h"

// Read config file
AMGX_config_create_from_file(&cfg, cfgfile);

// Create resources based on config
AMGX_resources_create_simple(&res, cfg);

// Create solver object and A, x, b; mode sets the precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);

// Read coefficients from a file
AMGX_read_system(A, b, x, matrixfile);

// Setup and solve loop
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);

// Download result into a caller-provided host array
AMGX_vector_download(x, result);

The matching configuration file (cfgfile):

solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_MAX
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1
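The slide leaves out library start-up, teardown, and status checking. A minimal self-contained sketch, assuming the config and matrix files are named solver.cfg and matrix.mtx, might look like the following; in production code the return value of every AMGX_* call should be checked against AMGX_RC_OK.

#include <stdio.h>
#include "amgx_c.h"

int main(void)
{
    const char *cfgfile = "solver.cfg";     /* assumed file names */
    const char *matrixfile = "matrix.mtx";

    /* Library start-up */
    AMGX_initialize();
    AMGX_initialize_plugins();

    AMGX_config_handle cfg;
    AMGX_resources_handle res;
    AMGX_solver_handle solver;
    AMGX_matrix_handle A;
    AMGX_vector_handle x, b;
    AMGX_Mode mode = AMGX_mode_dDDI;        /* double precision on the GPU */

    AMGX_config_create_from_file(&cfg, cfgfile);
    AMGX_resources_create_simple(&res, cfg);
    AMGX_solver_create(&solver, res, mode, cfg);
    AMGX_matrix_create(&A, res, mode);
    AMGX_vector_create(&x, res, mode);
    AMGX_vector_create(&b, res, mode);

    AMGX_read_system(A, b, x, matrixfile);
    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);

    AMGX_SOLVE_STATUS status;
    AMGX_solver_get_status(solver, &status);
    printf("solve status: %d\n", (int)status);

    /* Teardown in reverse order of creation */
    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(b);
    AMGX_vector_destroy(x);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(res);
    AMGX_config_destroy(cfg);
    AMGX_finalize_plugins();
    AMGX_finalize();
    return 0;
}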


Integrates easily with MPI and OpenMP domain decomposition

Adding GPU support to existing applications raises new issues

Proper ratio of CPU cores / GPU?

How can multiple CPU cores (MPI ranks) share a single GPU?

How does MPI switch between two sets of ‘ranks’: one set for CPUs, one set for GPUs?

AmgX handles this via Consolidation

Consolidate multiple smaller sub-matrices into a single matrix

Handled automatically during the PCIe data copy


Diagram: the original problem (unknowns u1–u7) is partitioned across 2 MPI ranks, with boundary exchange introducing halo copies u'2 and u'4; both ranks' sub-matrices are then consolidated onto 1 GPU during the PCIe copy.


Consolidation Examples

1 CPU socket <=> 1 GPU

Dual socket CPU <=> 2 GPUs

Dual socket CPU <=> 4 GPUs

Arbitrary Cluster:

4 nodes x [2 CPUs + 3 GPUs], connected by InfiniBand
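A rough sketch of the rank-to-GPU mapping behind consolidation: every MPI rank on a node passes the same device id when creating its AmgX resources, so their sub-matrices end up consolidated onto that single GPU. The device number, config file name, and single-node layout are assumptions for illustration, and error handling is omitted.

#include <mpi.h>
#include "amgx_c.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int rank;
    MPI_Comm_rank(comm, &rank);

    AMGX_initialize();
    AMGX_initialize_plugins();

    AMGX_config_handle cfg;
    AMGX_config_create_from_file(&cfg, "solver.cfg");   /* assumed config file */

    /* Every rank on the node names the same device, so AmgX consolidates
       their sub-matrices onto that one GPU during the PCIe upload. */
    int device = 0;
    AMGX_resources_handle rsrc;
    AMGX_resources_create(&rsrc, cfg, &comm, 1, &device);

    /* ... create matrix/vectors with rsrc, upload each rank's partition,
       then setup and solve as in the minimal example ... */

    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(cfg);
    AMGX_finalize_plugins();
    AMGX_finalize();
    MPI_Finalize();
    return 0;
}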


PETSc KSP vs AmgX performance test

PDE:

\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)

BCs (homogeneous Neumann on the unit cube):

\left.\frac{\partial u}{\partial x}\right|_{x=0} = \left.\frac{\partial u}{\partial x}\right|_{x=1} = \left.\frac{\partial u}{\partial y}\right|_{y=0} = \left.\frac{\partial u}{\partial y}\right|_{y=1} = \left.\frac{\partial u}{\partial z}\right|_{z=0} = \left.\frac{\partial u}{\partial z}\right|_{z=1} = 0

Exact solution:

u(x,y,z) = \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)
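As a quick sanity check (not on the original slide): each second derivative of the stated solution returns -4\pi^2 u, so the Laplacian reproduces the right-hand side.

\frac{\partial^2 u}{\partial x^2} = \frac{\partial^2 u}{\partial y^2} = \frac{\partial^2 u}{\partial z^2} = -4\pi^2 u
\quad\Longrightarrow\quad
\Delta u = -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)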


PETSc vs AmgX

7x speedup at 4M unknowns (16 CPU cores vs 1 GPU); 8x speedup at 100M unknowns (512 CPU cores vs 32 GPUs)

Machine specification

GPU nodes: two NVIDIA Tesla K20m GPUs per node

CPU nodes: two Intel Xeon E5-2670 CPUs per node (16 cores per node in total)

CPU baseline: PETSc KSP solver


SPE10 Cases

We derived several test cases from the SPE10 permeability distribution by fixing an x-y resolution and adding resolution in z, using a TPFA stencil.
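For context on the TPFA stencil named above, here is a generic sketch of the two-point flux approximation transmissibility between neighbouring cells; the function and variable names are illustrative only and are not taken from the actual SPE10 case setup.

/* Generic TPFA sketch: the coupling between two neighbouring cells is the
   harmonic combination of their half-transmissibilities (cell permeability
   times shared face area over centre-to-face distance). */
static double tpfa_transmissibility(double perm_i, double perm_j,
                                    double face_area,
                                    double dist_i, double dist_j)
{
    double half_i = perm_i * face_area / dist_i;
    double half_j = perm_j * face_area / dist_j;
    return (half_i * half_j) / (half_i + half_j);
}

/* In the assembled matrix, row i gets -T_ij for each neighbour j and the
   diagonal accumulates the sum of those transmissibilities. */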


SPE10 Matrix Tests

GPU: NVIDIA K40

CPU: HYPRE on a 10-core Ivy Bridge Xeon E5-2690 v2 @ 3.0 GHz

Chart: speedup of AmgX on 1 GPU versus HYPRE on 1 CPU socket, plotted against problem size in millions of unknowns.


Scaling up the right way


Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16 core AMD Opteron 6274 @ 2.2GHz

Aggregation and Classical AMG Weak Scaling, 8 million DOF per GPU

Chart: setup time (s) versus number of GPUs (1–512), comparing AmgX 1.0 (PMIS) and AmgX 1.0 (AGG).


Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16 core AMD Opteron 6274 @ 2.2GHz

Aggregation and Classical AMG Weak Scaling, 8 million DOF per GPU

Chart: solve time per iteration versus log(P) for 1–512 GPUs. The classical AMG solve fits y = 0.0062x + 0.0719 (R² = 0.9249); the aggregation AMG solve fits y = 0.0022x + 0.0585 (R² = 0.9437).


Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16 core AMD Opteron 6274 @ 2.2GHz

Classical AMG Preconditioner, 8 million DOF per GPU

Chart: iteration counts for PCG and GMRES versus number of GPUs (1–512).


Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16 core AMD Opteron 6274 @ 2.2GHz

Classical AMG Preconditioner, 8 million DOF per GPU

Chart: solve time (s) for GMRES and PCG versus number of GPUs (1–512).


AmgX 2.0: MPI with GPUDirect RDMA

4x lower latency, 3x higher bandwidth, 45% lower CPU utilization


Basic Coarsening

Aggressive Coarsening

Less Memory, Faster Setup


AmgX 2.0 Licensing

Developer/Academic License: non-commercial use, free

Commercial License, Developer License, Premier Support Service

Subscription License (per node, per year): includes Support and Maintenance, volume-based pricing

Site License

Perpetual License, with 20% Maintenance and Support


AmgX Roadmap

Continuous Improvement

Availability:

AmgX 2.0 Release: Q4 2015

AmgX 2.5: Q2 2016

Features:

Classical AMG: multi-node, multi-GPU, aggressive coarsening

Complex arithmetic + aggregation

Easy interfaces, Python

PETSc, HYPRE, Trilinos

Robust convergence on SPE10

GPUDirect v2.0

Scalable sparse eigensolvers

Scaling past 512 GPUs

Range Decomposition AMG

Guaranteed-convergence aggregation

Commercial License, Premier Support

CUDA 8.0 with Pascal support

Tuning for Maxwell

AmgX 2.0 was made by a great team of contributors.

AmgX 2.0 Team: Marat Arsaev, Joe Eaton, Alex Fender, Andrei Schaffer

Devtechs: Simon Layton, Nikolai Sakharnykh, Nikolay Markovskiy

Interns: Rohit Gupta, Constantine Stulov