AmgX 2.0: Scaling toward CORAL

  • Joe Eaton, November 19, 2015

    AmgX 2.0: Scaling toward CORAL

  • 2

    Agenda

    Introduction to AmgX

    Current Capabilities

    Scaling

    V2.0

    Roadmap for the future

  • 3

    AmgX

    Fast, scalable linear solvers, emphasis on iterative methods

    Flexible toolkit for GPU-accelerated Ax = b solvers

    Simple API makes it easy to solve your problems faster

  • 4

    “Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation.”

    Garf Bowen, Ridgeway Kite Software

  • 5

    AmgX in Reservoir Simulation

    Solve Faster

    Solve Larger Systems

    Flexible High Level API

    [Bar chart] Application Time in seconds, lower is better: CPU 1150, GPU custom solver 197, AmgX 98

    3-phase Black Oil Reservoir Simulation, 400K grid blocks solved fully implicitly.

    CPU: Intel Xeon E5-2670. GPU: NVIDIA Tesla K10.

  • 6

    AmgX 2.0: New Features since 1.0

    Classical AMG with truncation, robust aggressive coarsening

    Complex arithmetic

    GPUDirect, RDMA-async

    Power8 support, Maxwell support

    Crash-proof object management

    Re-usable setup phase

    Adaptors for major solver packages:

    HYPRE, PETSc, Trilinos

    Import data structures directly to AmgX for solve, export the solution (see the sketch after this list)

    Host or Device pointer support

    JSON configuration
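
    The "import data structures directly" item deserves a concrete illustration. The sketch below is a minimal, hedged example of that path for a scalar CSR system already assembled in host arrays; it uses the upload/download calls from the public C header (AMGX_matrix_upload_all, AMGX_vector_upload, AMGX_vector_download), whose exact signatures should be checked against amgx_c.h. The helper name solve_csr and the array/parameter names are illustrative, not part of AmgX.

    #include "amgx_c.h"

    /* Hedged sketch: upload an already-assembled scalar CSR system to AmgX,
       solve it, and copy the solution back.  The handles are assumed to have
       been created as in the "Minimal Example With Config" slide below. */
    static void solve_csr(AMGX_solver_handle solver,
                          AMGX_matrix_handle A,
                          AMGX_vector_handle b,
                          AMGX_vector_handle x,
                          int n, int nnz,
                          const int *row_ptrs, const int *col_indices,
                          const double *values, const double *rhs,
                          double *x_host)
    {
        /* 1x1 blocks = scalar system; NULL = no separate diagonal array. */
        AMGX_matrix_upload_all(A, n, nnz, 1, 1,
                               row_ptrs, col_indices, values, NULL);
        AMGX_vector_upload(b, n, 1, rhs);
        AMGX_vector_upload(x, n, 1, x_host);   /* initial guess */

        AMGX_solver_setup(solver, A);
        AMGX_solver_solve(solver, b, x);

        /* Export the solution back to the application's host array. */
        AMGX_vector_download(x, x_host);
    }

    In a time-stepping application the same handles are typically kept alive and fresh coefficients are uploaded before each setup/solve, which is where the re-usable setup phase pays off.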

  • 7

    Key Features

    Un-smoothed Aggregation AMG

    Krylov methods: CG, GMRES, BiCGStab, IDR

    Smoothers and Solvers:

    Block-Jacobi, Gauss-Seidel

    Incomplete LU, Dense LU

    KPZ-Polynomial, Chebyshev

    Flexible composition system

    Scalar or coupled block systems, multi-precision

    MPI, OpenMP support

    Auto-consolidation

    Flexible, simple high level C API

  • 8

    Minimal Example With Config

    //One header
    #include "amgx_c.h"

    //Read config file
    AMGX_create_config(&cfg, cfgfile);

    //Create resources based on config
    AMGX_resources_create_simple(&res, cfg);

    //Create solver object, A, x, b; set precision (mode)
    AMGX_solver_create(&solver, res, mode, cfg);
    AMGX_matrix_create(&A, res, mode);
    AMGX_vector_create(&x, res, mode);
    AMGX_vector_create(&b, res, mode);

    //Read coefficients from a file
    AMGX_read_system(&A, &x, &b, matrixfile);

    //Setup and Solve Loop
    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);

    //Download Result
    AMGX_download_vector(&x);

    Config (cfgfile):

    solver(main)=FGMRES
    main:max_iters=100
    main:convergence=RELATIVE_MAX
    main:tolerance=0.1
    main:preconditioner(amg)=AMG
    amg:algorithm=AGGREGATION
    amg:selector=SIZE_8
    amg:cycle=V
    amg:max_iters=1
    amg:max_levels=10
    amg:smoother(smoother)=BLOCK_JACOBI
    amg:relaxation_factor=0.75
    amg:presweeps=1
    amg:postsweeps=2
    amg:coarsest_sweeps=4
    determinism_flag=1
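
    For completeness, the sketch below shows (in hedged form) the boilerplate that would surround this minimal example in a real application: library initialization, a status check after the solve, and teardown. The names follow the public C API (AMGX_initialize, AMGX_solver_get_status, the *_destroy calls, AMGX_finalize), but they should be verified against amgx_c.h rather than taken from this slide.

    /* Hedged sketch of the boilerplate around the minimal example above. */
    AMGX_SOLVE_STATUS status;

    AMGX_initialize();                  /* once, before any other AmgX call */

    /* ... create config, resources, solver, A, x, b and solve as above ... */

    AMGX_solver_get_status(solver, &status);
    if (status != AMGX_SOLVE_SUCCESS) {
        /* not converged or diverged: handle the failure here */
    }

    /* Tear down in reverse order of creation. */
    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(x);
    AMGX_vector_destroy(b);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(res);
    AMGX_config_destroy(cfg);
    AMGX_finalize();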

  • 9

    Integrates easily with MPI and OpenMP domain decomposition

    Adding GPU support to existing applications raises new issues

    Proper ratio of CPU cores / GPU?

    How can multiple CPU cores (MPI ranks) share a single GPU?

    How does MPI switch between two sets of ‘ranks’: one set for CPUs, one set for GPUs?

    AmgX handles this via Consolidation

    Consolidate multiple smaller sub-matrices into single matrix

    Handled automatically during the PCIe data copy (see the rank-to-device sketch below)
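
    The consolidation itself happens inside AmgX, but the application still decides which ranks share which device before creating resources. The sketch below shows one common mapping (local rank modulo the node's device count) using plain MPI-3 and the CUDA runtime; the helper name pick_local_device is hypothetical and nothing here is AmgX API.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Pick a CUDA device for this MPI rank based on its rank within the node.
       Several local ranks may land on the same device; AmgX then consolidates
       their sub-matrices onto that GPU. */
    static int pick_local_device(MPI_Comm comm)
    {
        MPI_Comm node_comm;
        int local_rank = 0, num_devices = 1;

        /* Group the ranks that share this node (and therefore its GPUs). */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &local_rank);
        MPI_Comm_free(&node_comm);

        cudaGetDeviceCount(&num_devices);

        int device = local_rank % num_devices;
        cudaSetDevice(device);
        return device;
    }

    The chosen device, together with the MPI communicator, is then what gets handed to the resources-creation call in the MPI build of AmgX.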

  • 10

    [Diagram] Consolidation: the original problem (unknowns u1-u7) is partitioned across two MPI ranks (Rank 0 and Rank 1, which exchange the boundary/halo values u'2 and u'4), and both sub-matrices are then consolidated over PCIe onto a single GPU.

  • 11

    Consolidation Examples

    1 CPU socket 1 GPU

    Dual socket CPU 2 GPUs

    Dual socket CPU 4 GPUs

    Arbitrary cluster: 4 nodes x [2 CPUs + 3 GPUs], InfiniBand interconnect

  • 12

    PETSc KSP vs AmgX performance test

    PDE:

    $$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)$$

    BCs (homogeneous Neumann on the unit cube):

    $$\left.\frac{\partial u}{\partial x}\right|_{x=0} = \left.\frac{\partial u}{\partial x}\right|_{x=1} = \left.\frac{\partial u}{\partial y}\right|_{y=0} = \left.\frac{\partial u}{\partial y}\right|_{y=1} = \left.\frac{\partial u}{\partial z}\right|_{z=0} = \left.\frac{\partial u}{\partial z}\right|_{z=1} = 0$$

    Exact solution:

    $$u(x,y,z) = \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)$$
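
    As a quick check that the stated exact solution is consistent with the PDE and the boundary conditions (each second derivative of $u$ returns $-4\pi^2 u$):

    $$\nabla^2 u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z),$$

    and $\partial u/\partial x = -2\pi \sin(2\pi x)\cos(2\pi y)\cos(2\pi z)$ vanishes at $x = 0$ and $x = 1$ (likewise in $y$ and $z$), so the homogeneous Neumann conditions hold.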

  • 13

    PETSc vs AmgX

    7x speedup at 4M unknowns (16 CPU cores vs 1 GPU); 8x speedup at 100M unknowns (512 CPU cores vs 32 GPUs)

    Machine specification

    GPU nodes: two NVIDIA K20m per node

    CPU nodes: two Intel Xeon E5-2670 per node (16 cores total per node)

    CPU solver: PETSc KSP

  • 14

    SPE10 Cases

    We derived several test cases from the SPE10 permeability distribution by fixing an x-y resolution and adding resolution in z, using a TPFA stencil.

  • 15

    SPE10 Matrix Tests

    GPU: NVIDIA K40

    CPU: HYPRE on 10 core IvyBridge Xeon E5-2690 V2 @ 3.0GHz

    [Chart] Speedup (y-axis, 0-5x) vs millions of unknowns (x-axis, 0-10): 1 Socket vs 1 GPU.

  • 16

    Scaling up the right way

  • 17

    Poisson Equation / Laplace operator

    Titan (Oak Ridge National Laboratory)

    GPU: NVIDIA K20x (one per node)

    CPU: 16 core AMD Opteron 6274 @ 2.2GHz

    Aggregation and Classical Weak Scaling, 8 million DOF per GPU

    [Chart] Setup time in seconds (y-axis, 0-12) vs number of GPUs (1-512); series: AmgX 1.0 (PMIS), AmgX 1.0 (AGG).

  • 18

    Poisson Equation / Laplace operator

    Titan (Oak Ridge National Laboratory)

    GPU: NVIDIA K20x (one per node)

    CPU: 16 core AMD Opteron 6274 @ 2.2GHz

    Aggregation and Classical Weak Scaling, 8 million DOF per GPU

    [Chart] Time per Iteration vs Log(P): solve time per iteration (y-axis, 0.00-0.16 s) vs number of GPUs (1-512); series: ClassicalAMGSolve and AggregationAMGSolve, with linear fits y = 0.0062x + 0.0719 (R² = 0.9249) and y = 0.0022x + 0.0585 (R² = 0.9437), respectively.
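
    Read as a cost model (a hedged reading that assumes the fits' x variable counts the successive doublings of the GPU count, so the intercepts depend on how that axis is indexed), the trend lines correspond to roughly

    $$T_{\mathrm{iter}}^{\mathrm{classical}}(P) \approx 0.0062 \log_2 P + 0.0719\ \mathrm{s}, \qquad T_{\mathrm{iter}}^{\mathrm{aggregation}}(P) \approx 0.0022 \log_2 P + 0.0585\ \mathrm{s},$$

    i.e. each doubling of the GPU count adds about 6 ms per iteration with classical AMG and about 2 ms with aggregation.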

  • 19

    Poisson Equation / Laplace operator

    Titan (Oak Ridge National Laboratory)

    GPU: NVIDIA K20x (one per node)

    CPU: 16 core AMD Opteron 6274 @ 2.2GHz

    Classical AMG Preconditioner, 8 million DOF per GPU

    [Chart] Iterations (y-axis, 0-120) vs number of GPUs (1-512); series: PCG, GMRES.

  • 20

    Poisson Equation / Laplace operator

    Titan (Oak Ridge National Laboratory)

    GPU: NVIDIA K20x (one per node)

    CPU: 16 core AMD Opteron 6274 @ 2.2GHz

    Classical AMG Preconditioner, 8 million DOF per GPU

    [Chart] Solve time in seconds (y-axis, 0-16) vs number of GPUs (1-512); series: GMRES, PCG.

  • 21

    AmgX 2.0: MPI with GPUDirect RDMA

    4x lower latency, 3x higher bandwidth, 45% lower CPU utilization

  • 22

    Basic Coarsening

  • 23

    Basic Coarsening

  • 24

    Aggressive Coarsening

  • 25

    Aggressive Coarsening

    Less Memory, Faster Setup

  • 26

    AmgX 2.0 Licensing

    Developer/Academic License

    Non-commercial use, free

    Commercial License, Developer License, Premier Support Service

    Subscription License (node/year)

    Includes Support and Maintenance

    Volume based pricing

    Site License

    Perpetual License

    20% Maintenance and Support

  • 27

    AmgX Roadmap

    Continuous Improvement

    Availability Features

    Classical AMG

    - multi node

    - multi GPU

    - Aggressive coarsening

    Complex Arithmetic + Aggregation

    Easy interfaces, Python

    PETSc, HYPRE, Trilinos

    Robust convergence on SPE10

    GPUDirect v2.0

    Scalable Sparse Eigensolvers

    Scaling past 512 GPUs

    Range Decomposition AMG

    Guaranteed convergence aggregation

    Commerci