Post on 04-Feb-2022
Karlsruhe Institute of Technology
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association www.kit.edu
LAtoolbox:A Multi-platform Sparse Linear Algebra ToolboxVincent Heuveline, Dimitar Lukarski, Jan-Philipp WeissNVIDIA GPU Technology Conference • San Jose, CA • May 17, 2012 • Session S0291
Engineering Mathematics and Computing Lab (EMCL)
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
The Shift towards Multicore/ManycoreParadigm shift
The paradigm shift towards manycore is well on the wayBut still many problems have to be overcome
Scalable algorithms and implementations, ubiquitous parallelism,programmers’ productivity, ...
Many processors concepts have evolved (and some have gone)x86, SPARC (2-12 cores)NVIDIA / AMD GPUs (240 - 1536 cores)Accelerated Processing Units(2-4 cores + 400 cores, hybrid CPU/GPU)NVIDIA Denver (ARM-CPUs + GPUs on a chip)Cell, ClearSpeed (8+1, 92 cores)Intel SCC, MIC (48, 32/80 cores)Tile64, Epiphany (64, 64 - 4096 cores)
Manycore trend: much higher core count, less complex cores2/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Manycore – Revolution or Evolution?
Manycore DevicesHigh core count and incredible computing performanceFine-grained parallelism, stream processors + wide vector unitsFast on-chip bandwidth, caches + local memory
Impact on the SoftwareWithout changing the software there will be no more speedupsFor performance benefits the mathematical models, algorithms,numerical schemes and programs need to be adapted,re-engineered and further developedNew ways of parallel thinking required
3/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
The Programming Challenge
Zoo of languages and programming approachesOpenMP, CUDA, OpenCL, OpenACC, ArBB, TBB, pthreads,UPC/PGAS, Cilk Plus, and many many more
Who can handle them all? Which is the best approach?We need some further abstractions!
Three main questionsPortability: How portable is my program?(in terms of platforms and performance)Flexibility: Can I easily modify the software?(new modules, algorithms, platforms, functionality, ...)Scalability: How scalable is my algorithm/programming model?(ready for future platforms? ready for exascale?)
4/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
The Algorithmic ChallengeWe have to rethink our methods and approaches!
Typical way of parallel thinking (?)Cluster-like problem decomposition with many fat nodes
Substructure problem into huge blocks (of same size)Process blocks in parallelMinimize communication between blocksBut typically: sequential processing on the block-level
New way of parallel thinkingDecompose problem into blocks (of varying size)
There should be only a lower bound on the block size
Process blocks in sequential order (or in parallel)But: use scalable parallelism on the block-level !
5/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
The Algorithmic ChallengeWe have to rethink our methods and approaches!
Typical way of parallel thinking (?)Cluster-like problem decomposition with many fat nodes
Substructure problem into huge blocks (of same size)Process blocks in parallelMinimize communication between blocksBut typically: sequential processing on the block-level
New way of parallel thinkingDecompose problem into blocks (of varying size)
There should be only a lower bound on the block size
Process blocks in sequential order (or in parallel)But: use scalable parallelism on the block-level !
5/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Two Solution ApproachesWe would like to show you two things:
I: How to build scalable and efficient algorithmsHow to re-arrange algorithms for scalable parallelismHow to make them ready for GPUs and multicore CPUsHow to keep mathematical quality and efficiency
II: How to build portable software (this talk!)Modular building of solver schemes
Single code base for GPUs, CPUs and other acceleratorsSeamless integration of accelerators by unified interfacesChoice of the platform can be taken at run time
Highly optimized implementationsTransparent to the programmer/user
6/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Two Solution ApproachesWe would like to show you two things:
I: How to build scalable and efficient algorithms
We have demonstrated new algorithms for parallel preconditionerson GPUs in Session S0289, Wednesday, May 16, 9:00am.Please check these slides for further details.
II: How to build portable software (this talk!)Modular building of solver schemes
Single code base for GPUs, CPUs and other acceleratorsSeamless integration of accelerators by unified interfacesChoice of the platform can be taken at run time
Highly optimized implementationsTransparent to the programmer/user
6/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Two Solution ApproachesWe would like to show you two things:
I: How to build scalable and efficient algorithms
We have demonstrated new algorithms for parallel preconditionerson GPUs in Session S0289, Wednesday, May 16, 9:00am.Please check these slides for further details.
II: How to build portable software (this talk!)Modular building of solver schemes
Single code base for GPUs, CPUs and other acceleratorsSeamless integration of accelerators by unified interfacesChoice of the platform can be taken at run time
Highly optimized implementationsTransparent to the programmer/user
6/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
HiFlow3 - Parallel Finite Element SoftwareApplication Fields: Numerical simulation in computationalfluid dynamics, meteorology, medical engineering and energyresearchPerformance: Scalability proven on huge clusters andGPU-accelerated systemsFlexibility: Generic, modular and extensible software formulti-purpose simulation based on object-orientation in C++
7/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
HiFlow3 - Modules
Mesh: Distributed and flexiblemesh handlingFEM: Finite element spacesDoF: Management of degrees offreedomIntegration routines: Numericalintegration and assemblyroutinesLinear algebra/solvers:Multi-platform linear algebra andhardware-aware computing
8/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Features of the LAtoolboxCommunication and computation layer of HiFlow3
Multi-platform parallelization(GPUs, multicore-CPUs, accelerators, ...)Unified interfaces across platforms for all linear algebraoperations and solversAbstract data structuresPlatform-adapted library implementations chosen at run timeNo hardware knowledge required for building applicationsMinimal communication costs for intra-node data exchange
Solvers and preconditionersIterative linear solvers: CG, (F)GMRES, geometric multigridNewton-like solvers for nonlinear problemsParallel preconditioners for GPUs and CPUs
9/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
LAtoolbox - A Two Level Library
Parallelism is introduced on two levels:Coarse-grained parallelism by means of distributed grids anddistributed data structuresFine-grained parallelism by means of platform-optimized linearalgebra back-ends
10/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Data Distribution and Communicationfor Sparse Matrix-Vector Multiplication (SpMV)
P0
P1
P2
P3
Figure: Domain partitioning: Degrees of Freedom (DoF) of process P0 aremarked in green (interior DoF in diagonal block); the remaining DoFrepresent inter-process couplings for process P0 (ghost DoF inoff-diagonal block)
11/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Data Distribution and Communicationfor Sparse Matrix-Vector Multiplication (cont.) P0
︸ ︷︷ ︸
diagonal block
•••••
︸ ︷︷ ︸
interior
+
P1 P2 P3
︸ ︷︷ ︸
offdiagonal block
••••
︸ ︷︷ ︸
ghost
Sparse Matrix-Vector Multiplication:start asynchronous communicationexchange ghost valuesyint = Adiag xint
synchronize communicationyint = yint + Aoffdiag xghost
12/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Advanced Features
Advanced Features of the LAtoolboxNo irregular memory access over the PCIe bus by blockingtechniquesRaw access to the data is possible with explicit declaration ofdevice-specific classes (e.g. for nested loops, irregular memoryaccess, etc.)Possible mixture of different types of platforms in the cluster(e.g. nodes with and without GPUs)
13/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
lmpLAtoolbox - Sample Source Code 1/2
lVector<double> *x, *y ;
// initialize a vector on a specific platformx = init_vector<double>(size,"vec x", platf, impl);
// clone y as xy = x->CloneWithoutContent();
// Usage of BLAS 1y->CopyFrom(*x); // y = xx->Scale(5.0); // x = x * 5.0y->Axpy(*x, 3.3 ); // y = y + 3.3*x
14/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
lmpLAtoolbox - Sample Source Code 2/2lMatrix<double> *mat, *mat_c;
// init empty matrix on a specific platformmat = init_matrix<double>(0,0,0,"A",platf,impl,CSR);
// init CPU matrixmat_c = init_matrix<double>(0,0,0,"A",CPU,OpenMP,CSR);mat_c->ReadFile("matrix.mt");
// Copy the sparse structuremat->CopyStructureFrom(*mat_c);// Copy only the values of the matrixmat->CopyFrom(*mat_c);
mat->VectorMult(*y, x); // x = mat*y
15/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Example: Geometric Multigrid SolverModel problem: −∆u = f in Ω := (0, 1)2 \ (0, 0.5)2
Solve Poisson problem on the L-shaped domainZero Dirichlet boundary conditions, i. e. u = 0 on ∂Ω
Locally refined and statically adapted meshMixed finite elements: Q1 elements on quadrilaterals and P1elements on triangles (avoiding hanging nodes)Problem size: 3,211,425 DoF
Matrix-based geometric multigrid solverStart with zero initial guessRelative residual of 10−6 taken as stopping criterionVarious parallel smoothers: Multi-colored Gauss-Seidel,power(q)-pattern enhanced ILU(p), FSAI
16/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Solution in the L-shaped domain
Figure: Locally refined mesh for the L-shaped domain (left) and discretesolution of the Poisson problem for f = 8π2 cos(2πx) cos(2πy) (right).
17/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Performance Data for the Multigrid SolverRun time comparison for various smoothers and platforms
0
5
10
15
20
25
30
35
40
CPUseq CPUOpenMP GPU
Tim
e [s
ec]
Poisson problem - multi-grid V-cycle
r-JacobiGS
SGSILU(0,1)ILU(1,1)ILU(1,2)
FSAI1FSAI2FSAI3
Platform: 2x Xeon quadcore CPU + Tesla S1070 4G memoryBest multigrid solver with ILU(1,2) takes 2.75sec on the GPUSignificantly faster than the best preconditioned CG solver(with ILU(1,2)) on the GPU with 158sec
18/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Example: CG Solver for a 3D problem3D-Neumann-Laplace problem
−∆u = f, in Ω = [0, 1]3
∂u∂n
= 0, on ∂Ω
Configuration of the conjugate gradient solver:Q1 finite elementswith 2.1 M degrees of freedomCG solver with absolute tolerance 10−14
510 iterations neededDouble precision computations
Solution on a cluster with 16 GPUs:8 nodes with two NVIDIATesla M1060 GPU each, 4G memory20 Gbit/s Infiniband (small DDR switch)
19/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Strong Scalability Resultsfor the 3D Neumann problem on a 16-GPU-cluster
1
2
4
8
1 2 4 8
Spe
edup
number of nodes
Strong Scalability Test on a GPU Cluster
IdealCPU
1xGPU/node2xGPU/node
Close to linear speedup for all GPU and CPU configurationsSame source code for all setups
20/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
SummaryWe have presented the easy-to-use Linear Algebra Toolbox
Modular building of applicationsSame source code for all platformsNo hardware knowledge requiredCan be compiled on all platformsChoice of the platform can be taken at run time
Two-level parallelism with minimal communication costsDistributed data across nodesLinear algebra backends for GPUs, CPUs, and others
We have demonstratedScalability – very good strong scalability for GPUs and CPUsPortability – same source code for CPUs and GPUsFlexibility – easy modifications of solvers and algorithmsExtensibility – add new backends for new platforms
21/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT
Karlsruhe Institute of Technology
Motivation HiFlow3 LAtoolbox Performance Test Summary
Contact and AcknowledgementsCheck our open-source software
The LAtoolbox is part of the HiFlow3 open source parallel finiteelement package with a rich suite of solvers.See http://www.hiflow3.org
Further informationdimitar.lukarski@kit.edujan-philipp.weiss@kit.eduhttp://www.emcl.kit.eduhttp://www.hiflow3.org
The Engineering Mathematics and Computing Lab(EMCL) at KIT is an NVIDIA CUDA Research Centerwith kind support by NVIDIA.
22/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT