LAtoolbox: A Multi-platform Sparse Linear Algebra Toolbox

Karlsruhe Institute of Technology

KIT – University of the State of Baden-Wuerttemberg and

National Research Center of the Helmholtz Association www.kit.edu

LAtoolbox:A Multi-platform Sparse Linear Algebra ToolboxVincent Heuveline, Dimitar Lukarski, Jan-Philipp WeissNVIDIA GPU Technology Conference • San Jose, CA • May 17, 2012 • Session S0291

Engineering Mathematics and Computing Lab (EMCL)

Motivation HiFlow3 LAtoolbox Performance Test Summary

The Shift towards Multicore/ManycoreParadigm shift

The paradigm shift towards manycore is well on the wayBut still many problems have to be overcome

Scalable algorithms and implementations, ubiquitous parallelism,programmers’ productivity, ...

Many processors concepts have evolved (and some have gone)x86, SPARC (2-12 cores)NVIDIA / AMD GPUs (240 - 1536 cores)Accelerated Processing Units(2-4 cores + 400 cores, hybrid CPU/GPU)NVIDIA Denver (ARM-CPUs + GPUs on a chip)Cell, ClearSpeed (8+1, 92 cores)Intel SCC, MIC (48, 32/80 cores)Tile64, Epiphany (64, 64 - 4096 cores)

Manycore trend: much higher core count, less complex cores2/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT

Manycore – Revolution or Evolution?

Manycore DevicesHigh core count and incredible computing performanceFine-grained parallelism, stream processors + wide vector unitsFast on-chip bandwidth, caches + local memory

Impact on the SoftwareWithout changing the software there will be no more speedupsFor performance benefits the mathematical models, algorithms,numerical schemes and programs need to be adapted,re-engineered and further developedNew ways of parallel thinking required

3/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT

The Programming Challenge

Zoo of languages and programming approachesOpenMP, CUDA, OpenCL, OpenACC, ArBB, TBB, pthreads,UPC/PGAS, Cilk Plus, and many many more

Who can handle them all? Which is the best approach?We need some further abstractions!

Three main questionsPortability: How portable is my program?(in terms of platforms and performance)Flexibility: Can I easily modify the software?(new modules, algorithms, platforms, functionality, ...)Scalability: How scalable is my algorithm/programming model?(ready for future platforms? ready for exascale?)

The Algorithmic ChallengeWe have to rethink our methods and approaches!

Typical way of parallel thinking (?)Cluster-like problem decomposition with many fat nodes

Substructure problem into huge blocks (of same size)Process blocks in parallelMinimize communication between blocksBut typically: sequential processing on the block-level

New way of parallel thinkingDecompose problem into blocks (of varying size)

There should be only a lower bound on the block size

Process blocks in sequential order (or in parallel)But: use scalable parallelism on the block-level !

The Algorithmic ChallengeWe have to rethink our methods and approaches!

Typical way of parallel thinking (?)Cluster-like problem decomposition with many fat nodes

Substructure problem into huge blocks (of same size)Process blocks in parallelMinimize communication between blocksBut typically: sequential processing on the block-level

New way of parallel thinkingDecompose problem into blocks (of varying size)

There should be only a lower bound on the block size

Process blocks in sequential order (or in parallel)But: use scalable parallelism on the block-level !

Two Solution ApproachesWe would like to show you two things:

I: How to build scalable and efficient algorithmsHow to re-arrange algorithms for scalable parallelismHow to make them ready for GPUs and multicore CPUsHow to keep mathematical quality and efficiency

II: How to build portable software (this talk!)Modular building of solver schemes

Single code base for GPUs, CPUs and other acceleratorsSeamless integration of accelerators by unified interfacesChoice of the platform can be taken at run time

Highly optimized implementationsTransparent to the programmer/user

I: How to build scalable and efficient algorithms

We have demonstrated new algorithms for parallel preconditionerson GPUs in Session S0289, Wednesday, May 16, 9:00am.Please check these slides for further details.

I: How to build scalable and efficient algorithms

We have demonstrated new algorithms for parallel preconditionerson GPUs in Session S0289, Wednesday, May 16, 9:00am.Please check these slides for further details.

HiFlow3 - Parallel Finite Element SoftwareApplication Fields: Numerical simulation in computationalfluid dynamics, meteorology, medical engineering and energyresearchPerformance: Scalability proven on huge clusters andGPU-accelerated systemsFlexibility: Generic, modular and extensible software formulti-purpose simulation based on object-orientation in C++

HiFlow3 - Modules

Mesh: Distributed and flexiblemesh handlingFEM: Finite element spacesDoF: Management of degrees offreedomIntegration routines: Numericalintegration and assemblyroutinesLinear algebra/solvers:Multi-platform linear algebra andhardware-aware computing

Features of the LAtoolboxCommunication and computation layer of HiFlow3

Multi-platform parallelization(GPUs, multicore-CPUs, accelerators, ...)Unified interfaces across platforms for all linear algebraoperations and solversAbstract data structuresPlatform-adapted library implementations chosen at run timeNo hardware knowledge required for building applicationsMinimal communication costs for intra-node data exchange

Solvers and preconditionersIterative linear solvers: CG, (F)GMRES, geometric multigridNewton-like solvers for nonlinear problemsParallel preconditioners for GPUs and CPUs

LAtoolbox - A Two Level Library

Parallelism is introduced on two levels:Coarse-grained parallelism by means of distributed grids anddistributed data structuresFine-grained parallelism by means of platform-optimized linearalgebra back-ends

Data Distribution and Communicationfor Sparse Matrix-Vector Multiplication (SpMV)

Figure: Domain partitioning: Degrees of Freedom (DoF) of process P0 aremarked in green (interior DoF in diagonal block); the remaining DoFrepresent inter-process couplings for process P0 (ghost DoF inoff-diagonal block)

Data Distribution and Communicationfor Sparse Matrix-Vector Multiplication (cont.) P0

︸︷︷︸

diagonal block

•••••

︸︷︷︸

interior

P1 P2 P3

︸︷︷︸

offdiagonal block

••••

︸︷︷︸

Sparse Matrix-Vector Multiplication:start asynchronous communicationexchange ghost valuesyint = Adiag xint

synchronize communicationyint = yint + Aoffdiag xghost

Advanced Features

Advanced Features of the LAtoolboxNo irregular memory access over the PCIe bus by blockingtechniquesRaw access to the data is possible with explicit declaration ofdevice-specific classes (e.g. for nested loops, irregular memoryaccess, etc.)Possible mixture of different types of platforms in the cluster(e.g. nodes with and without GPUs)

lmpLAtoolbox - Sample Source Code 1/2

lVector<double> *x, *y ;

// initialize a vector on a specific platformx = init_vector<double>(size,"vec x", platf, impl);

// clone y as xy = x->CloneWithoutContent();

// Usage of BLAS 1y->CopyFrom(*x); // y = xx->Scale(5.0); // x = x * 5.0y->Axpy(*x, 3.3 ); // y = y + 3.3*x

lmpLAtoolbox - Sample Source Code 2/2lMatrix<double> *mat, *mat_c;

// init empty matrix on a specific platformmat = init_matrix<double>(0,0,0,"A",platf,impl,CSR);

// init CPU matrixmat_c = init_matrix<double>(0,0,0,"A",CPU,OpenMP,CSR);mat_c->ReadFile("matrix.mt");

// Copy the sparse structuremat->CopyStructureFrom(*mat_c);// Copy only the values of the matrixmat->CopyFrom(*mat_c);

mat->VectorMult(*y, x); // x = mat*y

Example: Geometric Multigrid SolverModel problem: −∆u = f in Ω := (0, 1)2 \ (0, 0.5)2

Solve Poisson problem on the L-shaped domainZero Dirichlet boundary conditions, i. e. u = 0 on ∂Ω

Locally refined and statically adapted meshMixed finite elements: Q1 elements on quadrilaterals and P1elements on triangles (avoiding hanging nodes)Problem size: 3,211,425 DoF

Matrix-based geometric multigrid solverStart with zero initial guessRelative residual of 10−6 taken as stopping criterionVarious parallel smoothers: Multi-colored Gauss-Seidel,power(q)-pattern enhanced ILU(p), FSAI

Solution in the L-shaped domain

Figure: Locally refined mesh for the L-shaped domain (left) and discretesolution of the Poisson problem for f = 8π2 cos(2πx) cos(2πy) (right).

Performance Data for the Multigrid SolverRun time comparison for various smoothers and platforms

CPUseq CPUOpenMP GPU

Poisson problem - multi-grid V-cycle

r-JacobiGS

SGSILU(0,1)ILU(1,1)ILU(1,2)

FSAI1FSAI2FSAI3

Platform: 2x Xeon quadcore CPU + Tesla S1070 4G memoryBest multigrid solver with ILU(1,2) takes 2.75sec on the GPUSignificantly faster than the best preconditioned CG solver(with ILU(1,2)) on the GPU with 158sec

Example: CG Solver for a 3D problem3D-Neumann-Laplace problem

−∆u = f, in Ω = [0, 1]3

∂u∂n

= 0, on ∂Ω

Configuration of the conjugate gradient solver:Q1 finite elementswith 2.1 M degrees of freedomCG solver with absolute tolerance 10−14

510 iterations neededDouble precision computations

Solution on a cluster with 16 GPUs:8 nodes with two NVIDIATesla M1060 GPU each, 4G memory20 Gbit/s Infiniband (small DDR switch)

Strong Scalability Resultsfor the 3D Neumann problem on a 16-GPU-cluster

1 2 4 8

number of nodes

Strong Scalability Test on a GPU Cluster

IdealCPU

1xGPU/node2xGPU/node

Close to linear speedup for all GPU and CPU configurationsSame source code for all setups

SummaryWe have presented the easy-to-use Linear Algebra Toolbox

Modular building of applicationsSame source code for all platformsNo hardware knowledge requiredCan be compiled on all platformsChoice of the platform can be taken at run time

Two-level parallelism with minimal communication costsDistributed data across nodesLinear algebra backends for GPUs, CPUs, and others

We have demonstratedScalability – very good strong scalability for GPUs and CPUsPortability – same source code for CPUs and GPUsFlexibility – easy modifications of solvers and algorithmsExtensibility – add new backends for new platforms

Contact and AcknowledgementsCheck our open-source software

The LAtoolbox is part of the HiFlow3 open source parallel finiteelement package with a rich suite of solvers.See http://www.hiflow3.org

Further informationdimitar.lukarski@kit.edujan-philipp.weiss@kit.eduhttp://www.emcl.kit.eduhttp://www.hiflow3.org

The Engineering Mathematics and Computing Lab(EMCL) at KIT is an NVIDIA CUDA Research Centerwith kind support by NVIDIA.

LAtoolbox: A Multi-platform Sparse Linear Algebra Toolbox

Documents

Transcript of LAtoolbox: A Multi-platform Sparse Linear Algebra Toolbox

Practical Linear Algebra: A Geometry Toolbox Third edition ...

SpaSM: A Matlab Toolbox for Sparse Statistical Modeling · JSS JournalofStatisticalSoftware MMMMMM YYYY, Volume VV, Issue II. SpaSM: A Matlab Toolbox for Sparse Statistical Modeling

Optimization workshop - nccs.nasa.gov · •Sparse BLAS •Sparse Solvers •Iterative •PARDISO* •Cluster Sparse Solver Linear Algebra ... Change your algorithm to reduce data

GHOST: BUILDING BLOCKS FOR HIGH PERFORMANCE SPARSE LINEAR · GHOST: BUILDING BLOCKS FOR HIGH PERFORMANCE SPARSE LINEAR ALGEBRA ON HETEROGENEOUS SYSTEMS MORITZ KREUTZER†, JONAS THIES‡,

Practical Linear Algebra a Geometry Toolbox

Sparse Linear Algebra in CUDA · Tutorial: High Performance Computing - Algorithms and Applications, November 18th 2015 18 Technische Universit¨at Munc¨ hen Compressed Sparse Row

Development of a Symbolic Computer Algebra Toolbox for 2D Fourier Transforms in Polar Coordinates

CUDA Libraries and CUDA Fortran - MASSIVE Home · —CUBLAS: Dense Linear Algebra —CUSPARSE : Sparse Linear Algebra —LIBM: Standard C Math library

Exercises 7: Sparse Linear Algebra via PETSc

SSS 06 Graphical SLAM and Sparse Linear Algebra Frank Dellaert.

CUDA math libraries - dkrz.de · CUDA Libraries ... •CUBLAS – linear algebra •CUSPARSE – linear algebra with sparse ... more efficient than in library Sometimes everything

fratstock.eu › sample › Solutions-Manual-College... · 2017-05-08 · Full file at CHAPTER 2 Algebra Toolbox 121 Chapter 2 Quadratic and Other Nonlinear Functions Algebra Toolbox

spam: A Sparse Matrix R Package with Emphasis on …user.math.uzh.ch/.../documents/furrersain_jss_submitted.pdf2 spam, an R-Package for GMRF MCMC Sparse matrix algebra has seen a resurrection

L11: Sparse Linear Algebra on GPUs CS6235. 2 Sparse Linear Algebra 1 L11: Sparse Linear Algebra CS6235 .

Chapter 1: Algebra Toolbox

Matlab’s Symbolic Toolbox - Sutherland · Matlab’s Symbolic Toolbox ChEn 1703 Thursday, December 4, 2008 1. Some Capabilities of the Symbolic Toolbox Algebra - manipulate expressions

ISM, Chapter 2 - Kentreed/Instructors/MATH 11009/ch02.pdfCHAPTER 2 Algebra Toolbox 121 Chapter 2 Quadratic and Other Nonlinear Functions Algebra Toolbox 1. a. x43 43 7ixx x==+ b. 12

Sparse Linear Algebra: Direct Methods

Nearly Sparse Linear Algebra and application to Discrete ... · This section ﬁrst presents the classical problems of linear algebra that are encountered when dealing with sparse

Implementing Sparse Matrix-Vector Multiplication on ... · Sparse matrix-vector multiplication (SpMV) is of singular impor-tance in sparse linear algebra. In contrast to the uniform