GPU Algorithms for Efficient Exascale Discretizations

Ahmad Abdelfattah (e), Valeria Barra (f), Natalie Beams (e), Ryan Bleile (i), Jed Brown (f), Jean-Sylvain Camier (a), Robert Carson (j), Noel Chalmers (h), Veselin Dobrev (a), Yohann Dudouit (a), Paul Fischer (b,c,d), Ali Karakus (m), Stefan Kerkemeier (b), Tzanio Kolev (a,*), Yu-Hsiang Lan (b), Elia Merzari (b,k), Misun Min (b), Malachi Phillips (c), Thilina Rathnayake (c), Robert Rieben (i), Thomas Stitt (i), Ananias Tomboulides (b,l), Stanimire Tomov (e), Vladimir Tomov (a), Arturo Vargas (i), Tim Warburton (g), Kenneth Weiss (i)

(a) Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA 94550
(b) Mathematics and Computer Science, Argonne National Laboratory, Lemont, IL 60439
(c) Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801
(d) Department of Mechanical Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801
(e) Innovative Computing Laboratory, University of Tennessee, Knoxville, TN 37996
(f) Department of Computer Science, University of Colorado, Boulder, CO 80309
(g) Department of Mathematics, Virginia Tech, Blacksburg, VA 24061
(h) AMD Research, Advanced Micro Devices Inc., Austin, TX 78735
(i) Weapons and Complex Integration, Lawrence Livermore National Laboratory, Livermore, CA 94550
(j) Computational Engineering Division, Lawrence Livermore National Laboratory, Livermore, CA 94550
(k) Department of Nuclear Engineering, Penn State, PA 16802
(l) Department of Mechanical Engineering, Aristotle University of Thessaloniki, Greece 54124
(m) Mechanical Engineering Department, Middle East Technical University, Ankara, Turkey 06800

Abstract

In this paper we describe the research and development activities in the Center for Efficient Exascale Discretizations (CEED) within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems.

Keywords: High-Performance Computing, GPU Acceleration, High-Order Discretizations, Finite Element Methods, Exascale Applications

1. Introduction

Exascale computing will provide scientists and engineers with an advanced tool to explore physical phenomena over a large range of scales and in complex domains. To maximize this potential, the simulation codes must be efficient in their use of data movement, both in terms of implementation and algorithmic complexity. The DOE Center for Efficient Exascale Discretizations (CEED) [1, 2] seeks to meet both of these goals by providing highly performant libraries for high-order discretizations on GPU-based compute nodes that form the basis for current- and next-generation HPC platforms.

* Corresponding author. Email address: [email protected] (Tzanio Kolev)

Central to CEED is the use of matrix-free high-order finite element discretizations, which require only O(n) data movement and yield exponential convergence rates, O(h^p), for pth-order approximations to solutions having sufficient regularity. With the number of degrees of freedom scaling as n = O((p/h)^d) in d dimensions, convergence through increased p offers clear advantages over simple reductions in the grid spacing h. Kreiss and Oliger [3] noted early on the particular relevance of increased approximation order in controlling cumulative dispersion errors for large-scale transport problems where propagated feature sizes λ are much smaller than the domain length L, which is clearly in the scope of problems that are enabled by exascale architectures.


Although exponential convergence is lost in many practical applications that lack regularity, high-order methods still provide favorable error constants with respect to norms and insidious sources of error such as numerical dispersion, and the use of hp methods can sometimes restore exponential convergence [4]. Efficient implementation of methods of all orders, with a particular emphasis on high order, is the principal objective of the CEED efforts.

Practical application of spectral methods for complex domains was first considered by Orszag [5], who laid out several essential elements for performant implementations. The principal feature was the use of pth-order tensor-product polynomial approximations in a d-dimensional reference domain, r in [-1, 1]^d, transformed to a complex domain, Ω, through an invertible map, x(r). Unlike Fourier bases, stable polynomial bases (see footnote 1) can yield exponential convergence for non-periodic boundary conditions, provided the solution has sufficient regularity [6]. Orszag [5] noted that, although separability was lost in the transformed domain, the forward operator evaluation could still be effected efficiently using tensor-product sum factorization. He thus suggested using conjugate gradient iteration, preconditioned with spectrally-equivalent low-order (i.e., sparse) operators, to yield high-order accuracy at low-order cost. All three of these ideas (orthogonal-polynomial bases, tensor-product sum factorization, and low-order preconditioners) are common elements of modern high-order finite element codes (e.g., [7, 8, 9, 10, 11, 12]), although the low-order preconditioners are commonly supplanted with p-multigrid (e.g., [13, 14]) and/or overlapping Schwarz methods [15].

Let Q_p denote the Lagrange polynomial bases on Gauss-Lobatto quadrature points. In the case of tensor-product elements, with p^d degrees of freedom per element, the use of tensor-product sum factorization reduces operator evaluation costs from O(p^(2d)) to (near-optimal) O(p^(d+1)) and memory/storage costs from O(p^(2d)) to (optimal) O(p^d). Matrix-free Q_p bases form the foundation for much of the CEED software stack, including MFEM, Nek5000/RS, libCEED, MAGMA, and libParanumal. We are also interested in high-order discretization on (non-tensor) triangular and tetrahedral P_p elements.

Footnote 1: Stable bases include orthogonal polynomials or Lagrange polynomials based on Gauss-type quadrature points.

Throughout the paper we use the CEED bake-off problems (BPs), introduced in [16], to test and compare the performance of high-order codes. The CEED BPs are community benchmarks for matrix-free operator evaluation of the mass (BP1), stiffness (BP3), or collocated stiffness (BP5) matrix. They include a mixture of compute-intensive kernels, nearest-neighbor communication, and vector reductions that is representative of high-order applications. See [16] for details.

Both the tensor and non-tensor cases pose unique challenges for high-order algorithms on GPU platforms. In Section 2 we describe the GPU-specific developments that address these challenges in each of the CEED open-source libraries. We follow this with a discussion in Section 3, results from several CEED-enabled applications in Section 4, and conclusions in Section 5.

2. GPU Developments in the Center for Efficient Exascale Discretizations

The major work-intensive operations for pth-order elements are local interpolation and differentiation, which involve tensor contractions for Q_p and DGEMMs for P_p elements. For example, for tensor elements in 3D, derivatives at the nodal points (r_i, s_j, t_k) take the form u^e_r(r_i, s_j, t_k) = sum_m D_im u^e_mjk, i.e., u^e_r = D_r u^e, where the u^e_ijk are the local basis coefficients for u^e(r) on Ω^e and D is a dense (p+1) x (p+1) derivative matrix.

On a CPU, performance in the tensor Q_p case relies primarily on casting the tensor contractions, which comprise 90% of the flops, as optimized matrix-matrix products. On GPUs, vectorization must be expressed over the entire set of E elements local to a given GPU in order to leverage the node-parallel architecture and amortize kernel launch overhead. Kernels thus tend to be expressed at the operator level, such as the discrete Laplacian, w_L = A_L u_L, where A_L = block-diag(A^e), with A^e = D^T G^e D the matrix-free form of the stiffness matrix. For deformed elements A^e is dense, with (p+1)^6 entries. The factored form, however, involves only the tensor-product derivatives, D_r, D_s, and D_t, and the six diagonal matrices, G^e_rr, G^e_rs, ..., G^e_tt, of the symmetric tensor G^e. Including u^e, the number of reads is thus 7(p+1)^3 per element, with a corresponding flop count of 12(p+1)^4 + 15(p+1)^3 (double the first term for non-collocated quadrature). The O(p)


arithmetic intensity (flop-to-byte ratio) with low-memory data structures leads to high performance on GPU and CPU architectures that have sufficient registers/cache to exhibit memory locality through the sequence of dependent operations.
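To make the factored form concrete, the following is a minimal single-element C++ sketch of the stiffness action w^e = D^T G^e D u^e with collocated quadrature, using sum-factorized tensor contractions. The array layouts, loop ordering, and function name are illustrative assumptions, not code from any particular CEED library.

```cpp
#include <vector>

// Sketch: apply w^e = D^T G^e D u^e for one hexahedral element with
// collocated quadrature. N = p+1 points per direction; D is the dense
// N x N 1D derivative matrix; G stores the six symmetric geometric
// factors (Grr, Grs, Grt, Gss, Gst, Gtt) at each of the N^3 points.
void ElementStiffnessApply(int N,
                           const std::vector<double> &D, // N*N, row-major
                           const std::vector<double> &G, // 6*N*N*N
                           const std::vector<double> &u, // N*N*N
                           std::vector<double> &w)       // N*N*N (output)
{
  const int Np = N * N * N;
  auto idx = [N](int i, int j, int k) { return (k * N + j) * N + i; };
  std::vector<double> ur(Np), us(Np), ut(Np);
  w.resize(Np);

  // 1) Tensor-product derivatives: one 1D contraction per direction.
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++) {
        double dr = 0.0, ds = 0.0, dt = 0.0;
        for (int m = 0; m < N; m++) {
          dr += D[i * N + m] * u[idx(m, j, k)];
          ds += D[j * N + m] * u[idx(i, m, k)];
          dt += D[k * N + m] * u[idx(i, j, m)];
        }
        ur[idx(i, j, k)] = dr; us[idx(i, j, k)] = ds; ut[idx(i, j, k)] = dt;
      }

  // 2) Pointwise application of the symmetric geometric factors G^e.
  for (int q = 0; q < Np; q++) {
    const double Grr = G[0 * Np + q], Grs = G[1 * Np + q], Grt = G[2 * Np + q];
    const double Gss = G[3 * Np + q], Gst = G[4 * Np + q], Gtt = G[5 * Np + q];
    const double vr = Grr * ur[q] + Grs * us[q] + Grt * ut[q];
    const double vs = Grs * ur[q] + Gss * us[q] + Gst * ut[q];
    const double vt = Grt * ur[q] + Gst * us[q] + Gtt * ut[q];
    ur[q] = vr; us[q] = vs; ut[q] = vt;
  }

  // 3) Transposed derivatives accumulate the result:
  //    w = D_r^T vr + D_s^T vs + D_t^T vt.
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int m = 0; m < N; m++) {
          sum += D[m * N + i] * ur[idx(m, j, k)];
          sum += D[m * N + j] * us[idx(i, m, k)];
          sum += D[m * N + k] * ut[idx(i, j, m)];
        }
        w[idx(i, j, k)] = sum;
      }
}
```

On a GPU, a kernel of this form is typically launched over all E elements at once (e.g., one element per thread block), with the intermediate arrays ur, us, ut held in shared memory or registers.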

In the rest of this section, we describe the developments in each of the CEED packages that address the multitude of issues that arise when implementing operations of the form A_L u_L on GPUs. For a detailed description of these operations see [16].

2.1. libCEED

libCEED [17, 18] is a new library that offers a purely algebraic interface for matrix-free operator evaluation and supports run-time selection of implementations tuned for a variety of computational device types, including CPUs and GPUs. libCEED's purely algebraic interface can unobtrusively be integrated in new and legacy software to provide performance-portable interfaces. While the focus is on high-order finite elements, the approach is algebraic and thus applicable to other discretizations in factored form. libCEED's role, as a lightweight portable library that allows a wide variety of applications to share highly optimized discretization kernels, is illustrated in Figure 1, where a non-exhaustive list of specialized implementations (backends) is shown. libCEED provides a low-level

Figure 1: libCEED allows different applications (e.g., PETSc, Nek5000, MFEM) to share highly optimized discretization kernels via runtime-selectable backends (pure C, AVX, LIBXSMM, OCCA, pure CUDA, pure HIP, MAGMA) that target CPU, NVIDIA GPU, and AMD GPU hardware.

Application Programming Interface (API) for user codes so that applications with their own discretization infrastructure (e.g., those in PETSc, MFEM, and Nek5000) can evaluate and use the core operations enabled by libCEED. GPU backends are available via pure CUDA or HIP implementations, as well as through the OCCA and MAGMA libraries. libCEED provides a unified interface, so that users only need to write a single source code and can select the desired specialized implementation at run time. Moreover, each process or thread can instantiate an arbitrary number of backends.
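This run-time selection is visible at the very top of the libCEED API, where the backend is chosen by a resource string. A minimal sketch (error checking omitted; the resource strings are examples and their availability depends on how libCEED was built):

```cpp
#include <ceed.h>

int main(int argc, char **argv) {
  // Backend chosen at run time, e.g. "/cpu/self", "/gpu/cuda/gen",
  // "/gpu/hip", or "/gpu/magma".
  const char *resource = (argc > 1) ? argv[1] : "/gpu/cuda/gen";

  Ceed ceed;
  CeedInit(resource, &ceed);

  // Vectors, bases, element restrictions, and operators created from this
  // Ceed object use the selected backend's implementation.
  CeedVector u;
  CeedVectorCreate(ceed, 1000, &u);
  CeedVectorSetValue(u, 1.0);

  CeedVectorDestroy(&u);
  CeedDestroy(&ceed);
  return 0;
}
```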

Since matrix-free finite element algorithms move potentially O(p^d) times less data than algorithms using sparse matrices, where p is the polynomial order and d the dimension, an O(p^d) speedup can theoretically be achieved on architectures where the matrix-free algorithms are limited in performance by data movement, i.e., are memory bound.

The matrix-free algorithms that apply tensor finite element operators are completely memory bound on GPUs due to their low arithmetic intensities. An O(p^d) speedup can therefore be achieved in practice for tensor finite elements when compared to a standard approach using a sparse matrix, see Figure 2, because the GPU kernels are solely memory bound.
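A rough, hedged estimate of where the O(p^d) factor comes from in 3D, assuming double precision, CSR storage for the assembled matrix, and the 7(p+1)^3 reads per element quoted above (these numbers are back-of-the-envelope, not measurements from the paper):

```latex
% Hedged estimate (3D, double precision, CSR): bytes moved per degree of freedom.
\underbrace{12\,(2p+1)^3}_{\text{sparse matvec: value + index per nonzero}}
\;\Big/\;
\underbrace{56}_{\text{matrix-free: } \approx 7 \text{ doubles per dof}}
\;=\; \mathcal{O}(p^3),
\qquad \text{e.g. } p = 8:\ \tfrac{12\cdot 17^3}{56}\approx 10^3 .
```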


Figure 2: Comparison of the performance in the saturated regime for the CEED benchmark problem BP1 [16] on an NVIDIA V100, using a sparse matrix (with cuSPARSE) against the cuda-gen backend of libCEED, for different polynomial orders p (throughput in DoFs per second).

However, applying non-tensor finite element operators does not necessarily result in memory-bound GPU kernels. The libCEED interface allows the caller to provide arrays representing the evaluation and gradient of the basis functions along with the quadrature weights to be used. With the arithmetic intensity of the matrix-free algorithms increasing as O(p^d) for general non-tensor elements (compared to O(p) in the tensor element case), GPU kernels quickly become compute bound for high orders p on sufficiently many elements, especially in 3D. The higher number of basis functions and quadrature points for non-tensor elements


also results in operations that are more difficult to cache efficiently in the very limited GPU caches and registers. Therefore, achieving peak performance for non-tensor finite elements is more challenging, and the theoretical gain is not as large as for tensor finite elements. We note that there exist specialized sum factorization schemes for collapsed elements (e.g., [19, 20], with recent SIMD vectorization targeting CPU performance [10]) and other fast evaluation methods such as [21, 22]; adding such methods to libCEED would require the addition of a public interface and dedicated backend support.

On NVIDIA GPUs the libCEED library achieves close to peak performance for operators using tensor finite elements. The performance on the CEED benchmark problems is comparable to state-of-the-art specialized hand-tuned kernels [23].

The CUDA and HIP backends provide native support for non-tensor finite elements, while the OCCA and MAGMA backends, which depend on their eponymous libraries [24, 25], also support non-tensor finite elements. The MAGMA backend achieves the highest performance of all libCEED GPU backends for non-tensor finite elements.

2.2. MAGMA

MAGMA [25] is a high-performance linear algebra library that includes LAPACK for GPUs, BLAS, sparse iterative solvers, and many other general matrix computation kernels. The batched computations provided by MAGMA can be generalized to provide highly efficient tensor computations [26]. While the libCEED MAGMA backend contains specialized tensor basis kernels separate from the MAGMA library itself, the library's batched GEMM capabilities are used directly to optimize non-tensor basis computations, with a goal of hardware portability [27]. In contrast to the recent CPU-based work of Sun et al. [11], which applies code generation and transformations toward SIMD vectorization across batches of (tensor and non-tensor) element kernels, the MAGMA backend's batched computations leverage standard library BLAS routines for the non-tensor basis actions within libCEED's algebraic framework.

As the non-tensor basis computations in libCEED are basis- and quadrature-rule-agnostic, the full interpolation or gradient matrices must be applied to an input vector for every element, rather than performing a series of small tensor contractions. If we consider all elements local to the process at the same time (nelem), we can reshape the input vector into a matrix with nelem × ncomp columns, each column corresponding to one component of one element, with P and Q denoting the total number of basis and quadrature points in the element, respectively. The application of the interpolation or gradient matrix action for all the elements is then easily represented by one standard general matrix-matrix multiplication, C = AB, as represented in Figure 3 for input B, output C, and basis matrix A.

Figure 3: Shape of the DGEMM operation C = A B (or A^T B) for the non-tensor basis action in libCEED, with dimensions m = P, k = Q, and n = nelem × ncomp.

Figure 3 shows a typical shape for the GEMM call in a libCEED non-tensor basis computation, with dimensions (m, n, k). Here m and k are relatively small, as they are tied to the number of basis nodes or quadrature points in one element; n can potentially be orders of magnitude larger, as it depends on the number of elements in the local operator. Since vendor-provided GEMM routines are generally designed to achieve maximum performance for square matrices, these unbalanced dimensions may prevent the GEMM operation from reaching peak GPU performance. Therefore, to make the best possible use of the available BLAS libraries, we also consider performing the GEMM operation in Figure 3 as a batched GEMM operation, split across the n dimension, so that each batch entry uses the same A matrix but its own submatrices of B and C, with η columns each. The additional parameter of batch size η increases the space in which we can search for the best possible parameter set, given dimensions m and k, with the goal of creating a more balanced workload for the GPU. Transforming the GEMM in Figure 3 into a batched GEMM does not require setting up pointer arrays that may impact the performance, and thus does not add any overhead. Both cuBLAS and MAGMA provide stride-based batched GEMM kernels.
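This split can be expressed directly with a strided batched GEMM, reusing the same A for every batch entry by setting its stride to zero. Below is a hedged cuBLAS sketch (column-major storage; the helper name and the assumption that η divides n are ours, and the actual libCEED/MAGMA dispatch logic is more elaborate):

```cpp
#include <cublas_v2.h>

// C (m x n) = A (m x k) * B (k x n), split into n/eta batches of eta columns.
// A is shared by all batches (strideA = 0); B and C advance by eta columns.
void BatchedBasisApply(cublasHandle_t handle, int m, int k, int n, int eta,
                       const double *dA, const double *dB, double *dC)
{
  const double alpha = 1.0, beta = 0.0;
  const int batch_count = n / eta;          // assume eta divides n
  cublasDgemmStridedBatched(handle,
                            CUBLAS_OP_N, CUBLAS_OP_N,
                            m, eta, k,
                            &alpha,
                            dA, m, 0,                   // same A for each batch
                            dB, k, (long long)k * eta,  // next eta columns of B
                            &beta,
                            dC, m, (long long)m * eta,  // next eta columns of C
                            batch_count);
}
```

With η = n this reduces to a single GEMM call; the abstraction layer described below chooses η, and the vendor vs. MAGMA routine, from offline benchmark data.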

Figure 4 shows example performance numbers for the GEMM versus the batched GEMM for typical


[Figure 4 bar chart: performance in Gflop/s of single vs. batched DGEMM for problem sizes (m, k) = (P, Q) of (36, 33), (84, 126), and (165, 330), with n fixed at 10,240, on V100 and MI100; the batched speedups over the single DGEMM range from 1.02x to 2.7x.]

Figure 4: Performance of different DGEMM configurations using hipBLAS and cuBLAS. Results are for different (P, Q) pairs on an NVIDIA V100 (CUDA 11.2) and an AMD Instinct MI100 (ROCm 4.2) GPU.

sizes encountered in the MFEM-libCEED BP3 [16] benchmark for triangle or tetrahedron non-tensor elements. The figure considers an NVIDIA V100 GPU and initial experience with an AMD MI100 GPU. The best-performing routine of MAGMA and the vendor-provided BLAS is shown for each GPU. We see that batching the GEMM operation across the n dimension achieves better performance than launching a single GEMM operation for these sizes, with the speedup relative to the single GEMM indicated above each batched GEMM bar.

To determine the best possible combination of routine (vendor BLAS or MAGMA library) and batch size η (with a standard, non-batched call corresponding to η = n), we use data from offline benchmark parameter sweeps to construct a lightweight abstraction layer. This layer automatically selects the best choice for the non-tensor GEMM operations. In [27], we show that for higher orders of basis functions, the benefit of the optimized GEMM formulation is clear in comparison to the pure CUDA kernels in the cuda-ref backend (up to 10x speedup for the MAGMA formulation on the V100 GPU).

2.3. MFEM

MFEM [12] is a general-purpose finite element library that, since version 4.0, supports hardware accelerators, such as GPUs, as well as programming models and libraries, such as CUDA, HIP, OCCA [24], libCEED [17], RAJA [28], and OpenMP. The goal of the MFEM developments in CEED is to provide state-of-the-art optimized high-order kernels to applications in an easy-to-use, flexible form.

The MFEM performance portability approach is based on a system of backends and kernels working seamlessly with a lightweight memory-space manager. A distinctive feature of this approach is the ability to select the backends at runtime. For instance, different MPI ranks can choose different backends (CPU or GPU, say), allowing applications to take full advantage of heterogeneous architectures. Another important aspect of MFEM's approach is the ability to easily mix CPU-only code with code that utilizes the new backends, thus allowing for a selective, gradual transition of existing capabilities. Most of the kernels are based on a single source, while still offering good performance. For performance-critical kernels, where a single source does not provide good performance, the implementation introduces dispatch points based on the selected backend and, in some cases, on kernel parameters such as the finite element order.
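In user code the runtime backend selection amounts to a one-line device configuration. A minimal hedged sketch against the MFEM 4.x API (the backend string would normally come from the command line):

```cpp
#include "mfem.hpp"

int main(int argc, char *argv[])
{
  // Backend string examples: "cpu", "omp", "cuda", "hip", "raja-cuda",
  // "occa-cuda", "ceed-cuda:/gpu/cuda/gen". Different MPI ranks may pass
  // different strings.
  const char *device_config = (argc > 1) ? argv[1] : "cuda";

  mfem::Device device(device_config);
  device.Print();       // report the enabled backends

  // Subsequent vectors and partial-assembly kernels dispatch to the selected
  // backend through MFEM's memory manager.
  mfem::Vector x(1024);
  x.UseDevice(true);    // mark the vector for device execution
  x = 1.0;              // executes as a device kernel when a GPU backend is active
  return 0;
}
```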

Figure 5 illustrates the main components of MFEM's modular design for accelerator support. The library side of MFEM (on the left) represents the software components where new kernels have been added. Kernels and memory management are the two ingredients most programming models have to deal with when providing such an abstraction to address code portability for HPC platforms. Similarly to Figure 1, the MFEM design allows for a variety of runtime-selectable backends that can execute its kernels on both CPU and GPU hardware. Unlike the libCEED figure, though, Figure 5 includes the kernel abstraction for a much broader range of meshing, finite element, and linear algebra features that a general finite element library like MFEM needs to support. For more details, see [12, 16] and Section 3.

Figure 5: Diagram of MFEM's modular design for accelerator support, combining flexible memory management with runtime-selectable backends (OpenMP, RAJA, libCEED, OCCA, HIP, CUDA) for executing key finite element, mesh, and linear algebra kernels on CPU, NVIDIA GPU, and AMD GPU hardware.


MFEM's GPU acceleration has demonstrated excellent performance in a number of single-GPU and multi-GPU benchmarks. The high-order algorithms in MFEM are particularly well suited for GPUs, as shown by the results in Figure 6, which report performance from LLNL's Corona machine and a Linux configuration similar to the compute nodes of LLNL's Sierra supercomputer. We note that hand-tuning is still required for good performance and that the AMD results are preliminary.

Figure 6: Performance results with a selection of the backends available in MFEM v4.2: 2D Poisson problem with 1.3 million degrees of freedom solved using 200 unpreconditioned CG iterations, using an Intel Xeon Gold CPU plus an NVIDIA GV100 (ceed-raja, raja-cuda, CUDA 10.1) and an AMD Radeon Instinct MI60 GPU (hip, ROCm 3.8). Switching from serial to parallel execution on a desktop workstation leads to an order of magnitude performance improvement (note: the y-axis is logarithmic). Using the desktop GPU yields another order of magnitude. Note that these results are representative of the state of the MFEM backends as of version 4.2. We do not claim that they represent a fair comparison between CPUs and GPUs because not all backends are fully optimized. (For example, much better CPU results are reported in [13] and [14].)

2.4. libParanumal

The Paranumal project [29] started at Virginia Tech in 2017 as a new GPU effort targeting 90% of the capabilities of the CPU version of Nek5000. It was not originally designed as a user-facing library but rather as a collection of high-order finite element streaming benchmarks (streamParanumal), CEED benchmarks (benchParanumal), and self-contained mini-apps (libParanumal). These were all developed ab initio using the Open Concurrent Computing Abstraction (OCCA) [24] and kernel language (OKL) [30]. Although OKL is a generic portable kernel programming language, the initial kernels were optimized for NVIDIA P100 and V100 GPUs, see [23]. By design, the libParanumal OKL kernels use intrinsic C types without recourse to structs and classes, facilitating their use in other projects.

Many finite element operations are heavily memory bound, including Krylov updates and finite element gather and scatter operations, which admit high-throughput, hardware-agnostic implementations for both NVIDIA and AMD GPUs [31]. Implementations of these kernels have been released in the streamParanumal standalone benchmark suite [32].


Figure 7: Performance of the benchParanumal version of the CEED benchmark problem BP1 (throughput in DoFs per second vs. degrees of freedom for p = 1 to 8) on a single GPU of HPE/Tulip AMD Instinct MI100 (left) and NVIDIA V100 SXM2 (right).


Figure 8: Performance of the benchParanumal version of the CEED benchmark problem BP5 (throughput in DoFs per second vs. degrees of freedom for p = 1 to 8) on a single GPU of HPE/Tulip AMD Instinct MI100 (left) and NVIDIA V100 SXM2 (right).

Portable implementations of the CEED BP benchmarks [16] are available in the benchParanumal library. Figures 7 and 8 show the performance of the benchParanumal implementations of the CEED BP1 and BP5 benchmarks, respectively, on a single AMD MI100 (ROCm 3.9) and NVIDIA V100


SXM2 (CUDA 10.1). The benchmarks achieved sustained average memory bandwidth, depending on polynomial degree and mesh size, of up to 950 GB/s (AMD MI100) and 800 GB/s (NVIDIA V100 SXM2), despite involving sequences of tensor contractions for each element in the matrix-vector operations.

The libParanumal library is a self-contained high-order finite element library that uses the same highly optimized OKL kernels as the streamParanumal and benchParanumal benchmark suites. It also includes sub-libraries for dense linear algebra, Krylov solvers, parallel mesh handling and polynomial approximation, p-type and algebraic multigrid, time stepping, gather-scatter operations and halo exchanges, and core miscellaneous operations. The libParanumal sub-libraries support meshes consisting of triangles, quadrilaterals, tetrahedra, or hexes. The libParanumal project also includes mini-apps demonstrating GPU-accelerated PDE solvers for linearized acoustics, scalar advection, Galerkin-Boltzmann finite moment gas dynamics, compressible Navier-Stokes, elliptic equations, Fokker-Planck, and incompressible Navier-Stokes. Each solver supports multi-GPU simulation via Nek's gslib for efficient MPI-based gather-scatter operations and halo exchanges [33].

The libParanumal library is highly modular, and an early fork of the project has been used, with modifications, as a platform for the custom physics requirements of the Nek5000 user community [34], as described in the next section, albeit without the most recent developments in portable streaming and operator kernels. The algorithms and design principles of the libParanumal kernels have also influenced the design of the kernels produced by the libCEED cuda-gen backend (see Section 2.1). libParanumal is also beginning to be used in non-CEED projects, for example in high-fidelity time-dependent room acoustic modeling [35].

2.5. Nek5000/RS

Nek5000 [36] is a spectral element code that is used for a wide range of thermal-fluids applications. A companion code, NekCEM [37], is used for computational electromagnetics. These codes have scaled to millions of MPI ranks using the Nek-based gslib communication library to handle all near-neighbor and other stencil-type communications (e.g., for algebraic multigrid) [38]. On CPUs, tensor contractions constitute the principal computational kernel (typically > 90% of the flops). These can be cast as small dense matrix-matrix products, resulting in high performance with a minor amount of tuning [39].

For GPU-based platforms, node-level parallelism requires kernels written at a higher level than simple tensor contractions. Early GPU ports started with NekCEM [40], using OpenACC and running on OLCF/Titan on up to 16,384 NVIDIA K20X GPUs. The OpenACC-based implementation and performance studies were extended to Nek5000 [41, 42].

For portability and performance reasons, we decided to develop a new version of Nek5000, called NekRS, which is written in C++/OCCA [24]. The NekRS kernels started as a fork of libParanumal in late 2018 and were tailored and expanded to meet the specific requirements of large-scale turbulent flow simulations in complex domains (e.g., as illustrated in Figure 10). NekRS retains access to the standard Nek5000 interface, which allows users to leverage existing user-specific source code, such as statistical analysis tools for turbulence.

Several recent developments in NekRS have led to significant performance gains on Summit. These include (i) an accelerator-oriented variant of gslib [33] that selects from several communication strategies, including pack/unpack on the host or device and GPU-direct or host-based communication, with runtime adaptation picking the fastest strategy; (ii) Chebyshev-accelerated additive Schwarz smoothing, which combines the standard Nek5000 additive Schwarz method with the Chebyshev-accelerated Jacobi smoothing provided in libParanumal (and, e.g., deal.II [14]), where the local Schwarz solves are performed with fast diagonalization implemented with tensor contractions whose complexity is on par with operator evaluation; and (iii) projection-based initial guesses to avoid redundant iteration work in successive timesteps [43, 44] (a sketch is given below). The performance impact of these developments is described in detail in [34].
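As an illustration of item (iii), the following is a simplified sketch (not NekRS code) of the classical successive-right-hand-side projection technique of [43, 44]: previous solutions are kept A-orthonormal, the new initial guess is the A-orthogonal projection onto their span, and the iterative solver only works on the remaining correction. The matvec is assumed to be supplied by the caller, and A is assumed symmetric positive definite.

```cpp
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using MatVec = std::function<void(const Vec &, Vec &)>;  // y = A x

// Maintains an A-orthonormal basis {xt_j} of previous solutions and returns
// the projected initial guess x0 = sum_j (xt_j . b) xt_j.
struct SolutionProjector {
  std::vector<Vec> basis, Abasis;   // xt_j and cached A*xt_j
  std::size_t max_vecs = 8;

  static double Dot(const Vec &a, const Vec &b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
    return s;
  }

  void InitialGuess(const Vec &b, Vec &x0) const {
    x0.assign(b.size(), 0.0);
    for (std::size_t j = 0; j < basis.size(); j++) {
      const double alpha = Dot(basis[j], b);   // xt_j . b = xt_j . (A x)
      for (std::size_t i = 0; i < b.size(); i++) x0[i] += alpha * basis[j][i];
    }
  }

  // After the solve, A-orthonormalize the correction x - x0 and append it.
  void Update(const MatVec &A, const Vec &x, const Vec &x0) {
    if (basis.size() == max_vecs) { basis.clear(); Abasis.clear(); }
    Vec v(x.size()), Av(x.size());
    for (std::size_t i = 0; i < x.size(); i++) v[i] = x[i] - x0[i];
    A(v, Av);
    for (std::size_t j = 0; j < basis.size(); j++) {  // Gram-Schmidt in the A-inner product
      const double h = Dot(basis[j], Av);
      for (std::size_t i = 0; i < v.size(); i++) {
        v[i] -= h * basis[j][i];
        Av[i] -= h * Abasis[j][i];
      }
    }
    const double nrm = std::sqrt(Dot(v, Av));
    if (nrm < 1e-14) return;                          // nothing new to add
    for (std::size_t i = 0; i < v.size(); i++) { v[i] /= nrm; Av[i] /= nrm; }
    basis.push_back(v);
    Abasis.push_back(Av);
  }
};
```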

The advantage of basing NekRS on OCCA is clear from the results of Table 1, which shows full Navier-Stokes performance results for NVIDIA and AMD GPUs. The table gives a single-GPU comparison of the averaged walltime per timestep for the MI60, MI100, and A100, compared to a single V100 on Summit. We remark that extensive tuning has been applied for the V100, which has been the primary development platform for NekRS. Despite this, the performance on the other GPUs is within the scope of what we would


System     Device  Backend  tstep(s)   R
Summit     V100    CUDA     8.51e-02   1.00
Tulip      MI100   HIP      9.96e-02   0.85
Tulip      MI60    HIP      1.41e-01   0.60
Tulip      V100    CUDA     8.85e-02   0.96
Theta-GPU  A100    CUDA     5.59e-02   1.52

Table 1: NekRS Navier-Stokes performance on a single GPU of HPE/Tulip AMD Instinct MI100, AMD Radeon Instinct MI60, NVIDIA V100 PCIe, and ALCF/Theta-GPU NVIDIA A100 SXM4, compared to OLCF/Summit NVIDIA V100 SXM2, for a turbulent pipe flow simulation with Re = 19,000, E = 6840, p = 6, and n = 2,346,120. Time per step in seconds (tstep) is averaged over 100 steps. R is the ratio of tstep on the Summit V100 to that on the other systems.

expect for these nodes. The A100 performs remarkably well, running 1.5x faster than the V100 at a near-strong-scale-limit value of n = 2.22M gridpoints on a single GPU. Although the AMD GPU code has seen less tuning than the NVIDIA code, the fact that the performance is on par illustrates the portability provided by the OCCA base.

3. Discussion

In this section we share some of the porting experiences from the CEED project and discuss some of the GPU lessons we have learned.

Overall, we have found that porting to GPU architectures is a disruptive process, similar to the transition from serial to MPI parallel programming. As such, we recommend starting a new code for the GPU port, if possible (as with NekRS and libParanumal), as opposed to incrementally porting an existing code (as with MFEM). We also recommend taking advantage of GPU-accelerated libraries, such as libCEED, when applicable. For low-order applications, the CEED software also provides access to new classes of algorithms (high-order methods) that can take better advantage of GPU hardware compared to traditional low-order approaches [16].

For new codes, we have found the use of OCCA and its OKL language a pragmatic choice that has allowed us to make quick progress in capturing, e.g., the capabilities of Nek5000 in libParanumal without committing to a specific manufacturer's GPU programming model, since OCCA translates OKL code into CUDA, HIP, OpenCL, or OpenMP at runtime for native just-in-time (JIT) compilation. OCCA has also been instrumental in the exploration of the high-order algorithmic space, as different versions of the CEED kernels can be easily implemented, modified, and tested with it. Examples are provided with the OCCA distribution that demonstrate the simplicity of the API and kernel language [45].

For the porting of existing codes, we have found that integration of kernels at the for-loop level, as with Kokkos and RAJA, has several important benefits. For example, in MFEM, the original code was transformed to use a new for-loop abstraction defined as a set of MFEM_FORALL macros, in order to take advantage of the various backends supported via the new macros (a sketch is given below). This approach allows for gradual code transformations that are not too disruptive for either MFEM developers or users. Existing applications based on MFEM are able to continue to work as before, with an easy transition to accelerated kernels. This approach also allows interoperability with other software components and external libraries that can be used in conjunction with MFEM (e.g., hypre, PETSc, SUNDIALS). The main challenge in this transition to a kernel-centric implementation is the need to transform existing algorithms to take full advantage of the increased levels of parallelism in the accelerators while maintaining good performance on standard CPU architectures.
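A hedged sketch of the for-loop abstraction (the macro and the Vector read/write accessors are from the public MFEM API; the axpy kernel itself is just an illustration):

```cpp
#include "mfem.hpp"

// y += a * x, written once and executed on the backend selected at runtime
// (CPU, OpenMP, CUDA, HIP, ...) through the MFEM_FORALL abstraction.
void Axpy(double a, const mfem::Vector &x, mfem::Vector &y)
{
  const int n = x.Size();
  const double *d_x = x.Read();  // device pointer (or host, per backend)
  double *d_y = y.ReadWrite();   // marks y as modified on the device
  MFEM_FORALL(i, n, { d_y[i] += a * d_x[i]; });
}
```

The Read/Write/ReadWrite accessors go through the memory manager described next, which copies or moves data between host and device only when needed.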

An important aspect of GPU programming is the need to manage memory allocation and transfers between the CPU (host) and the accelerator (device). This can be a frequent source of bugs and inefficiencies in complex applications. For example, in MFEM, a special Memory class was introduced to manage a pair of host and device pointers and to provide a simple interface for copying or moving the data when needed. An important feature of this Memory class is the ability to work with externally allocated host and/or device pointers, which is essential for interoperability with other libraries. The Memory class has also been useful in the porting of MFEM-based applications, see Section 4.4.

Finally, the optimization of GPU kernels requires a real understanding of the GPU hardware and its multi-level memory hierarchy, as well as advanced techniques such as code generation and JIT compilation. To illustrate these points, we describe in the rest of this section the sequence of developments that led to the cuda-gen backend of libCEED, which achieves close to peak performance for high-order operator evaluation with tensor finite elements on NVIDIA GPUs.

The first libCEED CUDA backend was the reference backend cuda-ref, which established a blueprint for GPU porting in libCEED. The main


difficulty at this point was to handle all the runtime aspects of libCEED. Producing efficient GPU kernels relies on the compiler knowing as much as possible during compilation, which conflicts with the generality of the libCEED approach, where users are free to specify, e.g., the polynomial order and the number of quadrature points at runtime. For this reason, it was critical to use JIT compilation to generate GPU kernels on the fly with as much information as we could provide at runtime. In general, we believe that JIT compilation will play an important role in HPC in the future, and it is essentially a requirement for high-order applications with runtime order selection.
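As a conceptual illustration only (not the actual cuda-gen machinery), the benefit of knowing the order at compile time can be mimicked in standard C++ by instantiating a kernel template for a fixed number of 1D points and dispatching on the runtime value; JIT compilation achieves the same effect for any order by compiling exactly the instantiation that is needed, with the order appearing as a literal constant in the generated kernel.

```cpp
#include <stdexcept>

// Fixed-size 1D contraction: with Q known at compile time the compiler can
// fully unroll the loops and keep u and v in registers.
template <int Q>
void Contract1D(const double *B, const double *u, double *v)
{
  for (int i = 0; i < Q; i++) {
    double s = 0.0;
    for (int m = 0; m < Q; m++) s += B[i * Q + m] * u[m];
    v[i] = s;
  }
}

// Runtime dispatch to a small set of pre-built instantiations; a JIT approach
// instead generates the one instantiation requested at runtime.
void Contract1D(int q, const double *B, const double *u, double *v)
{
  switch (q) {
    case 2: Contract1D<2>(B, u, v); break;
    case 3: Contract1D<3>(B, u, v); break;
    case 4: Contract1D<4>(B, u, v); break;
    case 5: Contract1D<5>(B, u, v); break;
    default: throw std::runtime_error("order not pre-instantiated");
  }
}
```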

The second libCEED CUDA backend, cuda-shared, focused on optimizing each GPU kernel individually to achieve peak performance. The tensor kernels are highly memory bound, so the challenge was to use the different memory bandwidths efficiently. Typically, the bottlenecks are the local/shared memory bandwidth and the memory access patterns to the data. If we compare the cuda-ref and cuda-shared backends in Figure 9, we see that for low orders (1 to 3) their performance is similar; the cuda-ref kernels do not yet saturate the local/shared memory bandwidth. However, for orders of 4 and higher, we observe that the cuda-ref backend performance deteriorates with the order. This is due to the local/shared memory bandwidth becoming more and more saturated. The cuda-shared backend, on the other hand, manages, through careful memory accesses and loop unrolling, to continue saturating the global memory bandwidth, and thus achieves high performance for each GPU kernel.

The final, and best performing, backend, cuda-gen, uses a code generation approach, based on the cuda-shared backend, to generate at runtime (with JIT compilation) a single optimized GPU kernel representing the whole operator. Since the cuda-ref and cuda-shared backends decompose the matrix-free operators into a sequence of GPU kernels, they require the storage, in global memory, of unnecessary temporary results between kernel launches. Fusing the GPU kernels avoids this unnecessary data storage and movement between launches, resulting in a 2-3x speedup over the cuda-shared backend, see Figure 9, and around a 5x speedup over the reference backend cuda-ref on the CEED benchmark problem BP3.

4. Applications

The CEED effort includes the development of algorithms, software, simulation and modeling, and performance analysis and optimization for CEED-engaged applications. While we are focused on exascale applications in the ECP, CEED is also extending its contributions to a broader range of engineering and science application areas, such as nuclear energy, wind energy, fusion, solid mechanics, additive manufacturing, internal combustion, and, recently, weather modeling and aerosol transport research. In this section, we demonstrate the impact of the CEED-developed open source codes Nek5000/RS and MFEM, with full simulation capability, scaling on various accelerated architectures (including full-scale performance on Summit GPUs), in DOE's ExaSMR, ExaWind, NEAMS, MARBL, and ExaAM projects.

4.1. ExaSMR

ExaSMR's target geometry is a small modular reactor assembly comprising 37 bundles, each having a 17×17 array of rods, which totals ~10,000 long communicating channels. For development, we consider two geometries: a very long single 17×17 bundle, and a collection of 37 such bundles that are shorter in length. We analyzed NekRS performance behavior for both geometries. Detailed performance for various geometries is discussed in [34]. Here we present the baseline performance of the long 17×17 bundle, illustrated in Fig. 10(a), for which the ratio between the characteristic length L and the rod diameter D is L/D ≈ 288.

Table 2 demonstrates strong- and weak-scaling runs out to 175 million elements on Summit, roughly twelve times larger than the 15M-element cases that were "hero calculations" on Mira as recently as 2020. We measured the average walltime per step in seconds, tstep, using steps 101-200 for simulations with Re_D = 5000. For the strong scaling, we used E = 175,618,000 and N = 7, totaling 60 billion grid points. We observe that the 17×17 rod-bundle case continues to scale well to all of Summit, using n/P = 2.1M with 70% parallel efficiency from the base of 1810 nodes using n/P = 5.5M, where P is the number of V100s. We see that 80% efficiency is sustained for n/P = 2.6M. The weak scaling uses meshes increased by 120, 440, 1100, 2200, 4400, and 6340 layers in the streamwise (z) direction, extruded from a two-dimensional 17×17 mesh having E = 27,700 spectral elements. Weak-scaling this



Figure 9: Performance of the cuda-ref (left), cuda-shared (center), and cuda-gen (right) backends of libCEED for the CEED benchmark problem BP3 on an NVIDIA V100 GPU (throughput in DoFs per second vs. degrees of freedom for p = 1 to 8).

Figure 10: Nek5000/RS applications: (a) ExaSMR's 17×17 fuel rod configuration and turbulent flow profile, (b) ExaWind's atmospheric boundary layer modeling and analysis, (c) NEAMS's pebble-bed reactor configurations, with simulations demonstrating turbulent flow past 3344 pebbles in an annulus and 44257 pebbles in a cylinder.

problem from 271 to 4608 nodes (1626 to 27648 GPUs) sustains more than 80% parallel efficiency throughout, using 2.1M grid points per GPU.

We note that the pressure iteration counts, pi, are very low for the 17×17 bundle compared to the pebble cases, which have pi ~ 8 for the same timestepper and preconditioner. The geometric complexity of the 17×17 rod bundle is relatively mild compared to the pebble case, and the synthetic initial condition does not quickly transition to full turbulence. We expect higher pressure iteration counts (e.g., pi ~ 4-8) once fully turbulent flow is established for this case.

4.2. ExaWind

Efficient simulation of atmospheric boundary layer (ABL) flows is important for the study of wind farms, urban canyons, and basic weather modeling. In collaboration with the ECP ExaWind team, we identified a well-documented test case, the Global Energy and Water Cycle Experiment Atmospheric Boundary Layer Study (GABLS), to demonstrate the suitability of high-order methods for large eddy simulation (LES) of the ABL. Initial convergence results of an LES study with Nek5000 are shown in Figure 10(b). We have initiated a performance study for this problem on Theta-GPU (A100s) and Summit (V100s) for an E = 32768 spectral element mesh with N = 7 (i.e., n = 11.2M). Our simulations represent turbulent flow in a 400m × 400m × 400m physical domain with a geostrophic wind speed of 8 m/s in the x-direction and a reference potential temperature of 263.5 K, with no-slip boundary conditions and log-law-based wall functions at the lower wall, and periodic boundary conditions otherwise. Single-node scaling shows the 80% strong-scale limit to be 1.8M points/GPU for both the V100 and A100, with the A100 running at 0.055 s/step, 1.55 times faster than the V100.


ExaSMR application performance: 17×17 fuel rods simulation

case    node  gpu    E          N  E/gpu  n/gpu  vi  pi  tstep(s)   R     Rideal  Peff(%)
strong  1810  10860  175618000  7  16171  5.5M   4   2   1.855e-01  1.00  1.00    100
        2536  15216  175618000  7  11542  3.9M   4   2   1.517e-01  1.22  1.40    87
        3620  21720  175618000  7  8085   2.7M   4   2   1.120e-01  1.65  2.00    82
        4180  25080  175618000  7  7002   2.4M   4   2   1.128e-01  1.64  2.30    71
        4608  27648  175618000  7  6351   2.1M   4   2   1.038e-01  1.78  2.54    70

weak    87    522    3324000    7  6367   2.1M   4   2   8.57e-02   1.00  1.00    100
        320   1920   12188000   7  6347   2.1M   4   2   8.67e-02   0.98  1.00    98
        800   4800   30470000   7  6347   2.1M   4   2   9.11e-02   0.94  1.00    94
        1600  9600   60940000   7  6347   2.1M   4   2   9.33e-02   0.91  1.00    91
        3200  19200  121880000  7  6347   2.1M   4   2   9.71e-02   0.88  1.00    88
        4608  27648  175618000  7  6351   2.1M   4   2   1.03e-01   0.83  1.00    83

Table 2: ExaSMR: NekRS strong and weak scaling performed on Summit, using 6 GPUs per node, for simulating turbulent flow in the 17×17 rod-bundle of Figure 10(a), right, with Re_D = 5000. Time per step in seconds (tstep), velocity iteration count (vi), and pressure iteration count (pi) are all averaged over 100 steps. R is the ratio of tstep at 1810 nodes to that at other node counts for strong scaling, and of tstep at 87 nodes to that at other node counts for weak scaling, provided together with the ideal ratio Rideal and the parallel efficiency Peff.

4.3. NEAMS

Figure 10(c) shows turbulent flow past 3344 pebbles in an annulus (left) and 44257 pebbles in a cylinder (right), computed with NekRS on Summit using 840 GPUs and 1788 GPUs, respectively. The cylinder case has 13M elements of order N = 7, for a total of 4.4B grid points. The annulus configuration is a prototype for the pebble-bed reactor configurations being studied by DOE's Nuclear Energy Advanced Modeling and Simulation (NEAMS) project. We have developed a novel meshing strategy for generating high-quality hexahedral element meshes that ensure accurate representation of densely packed spheres for these geometries [46]. The meshing algorithm includes Voronoi tessellation, edge collapse, facet projection onto the spheres, and mesh smoothing with quality measurements. These simulations strong-scale well, and the NEAMS target configuration of an annulus with 300,000 pebbles will require about 30B grid points, which is well within the current performance envelope on Summit.

4.4. MARBL

MARBL is a next-generation multi-physics simulation code being developed at LLNL. The code provides multi-material radiation-magneto-hydrodynamics with applications in inertial confinement fusion (ICF), pulsed power, and equation of state/material strength experiments, as part of the NNSA ATDM program. One of the central components of MARBL is the BLAST package [47], which uses an ALE formulation to simulate conservation laws of mass, momentum, and energy in a moving material frame. The BLAST package utilizes high-order finite element discretizations of physical processes on a high-order (curved) moving mesh. BLAST's finite element discretization infrastructure is entirely based on the MFEM library. Therefore, the GPU port of BLAST makes extensive use of the matrix-free approach and GPU support in MFEM. In this section we provide specifics about the major GPU kernels in BLAST and the impact of the CEED project on these GPU development efforts.

Memory management. Since MARBL/BLAST is based on MFEM, it directly uses the high-level memory management interface for reading and writing device data. In addition, the MARBL team has enhanced MFEM's memory manager capabilities by introducing the Umpire [48] memory manager, providing access to memory pools. This approach has the following benefits: it substantially reduces slowdowns caused by cudaMalloc performance, it allows sharing of device memory buffers inside MARBL to reduce the total device memory usage, and it allows sharing of temporary memory with other external packages in MARBL that use Umpire.
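A minimal sketch of the pooling idea, assuming Umpire's QuickPool strategy and the standard ResourceManager entry points; the pool name and the helper function are illustrative, and MARBL's actual integration goes through MFEM's memory manager rather than direct calls like these.

```cpp
#include <cstddef>
#include "umpire/ResourceManager.hpp"
#include "umpire/strategy/QuickPool.hpp"

// Allocate device memory from a pool so that repeated allocations avoid the
// cost of cudaMalloc/hipMalloc; deallocation goes back through the same
// Allocator's deallocate method.
void *AllocateFromDevicePool(std::size_t bytes)
{
  auto &rm = umpire::ResourceManager::getInstance();
  // Built once, on top of the raw "DEVICE" memory resource.
  static umpire::Allocator pool =
      rm.makeAllocator<umpire::strategy::QuickPool>("EXAMPLE_DEVICE_POOL",
                                                    rm.getAllocator("DEVICE"));
  return pool.allocate(bytes);
}
```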

Lagrangian phase. In this phase the multi-material compressible Euler equations are solved on a moving curved mesh [49, 50]. The optimization of the needed mass and force operators has been aided by the matrix-free methods that were introduced in the CEED-developed Laghos miniapp [51], which models the main computational kernels of Lagrangian hydrodynamics. The GPU


kernels for these methods were implemented by the MARBL team and reside in the BLAST code. The latest Laghos GPU implementations of these kernels provide an alternative that might be adopted in the future, based on performance tests. A key CEED benefit provided to MARBL is the ability to drop in replacements for these expensive kernels as they become available. Physics-specific quadrature point computations were implemented by the MARBL team, making use of the RAJA nested parallel loop abstractions combined with MFEM's GPU capabilities, including GPU-friendly data structures, small dense matrix kernels, and use of shared memory. This phase also requires the computation of a hyperviscosity [52] coefficient, which involves consecutive applications of a Laplacian operator. This procedure has been ported to the GPU by directly applying MFEM's optimized diffusion kernels.

Remesh phase. The mesh optimization phase of BLAST is based on the Target-Matrix Optimization Paradigm (TMOP), where the mesh optimization problem is posed as a variational minimization of a nonlinear functional [53, 54]. The GPU port was developed in MFEM's mesh optimization miniapp and then transferred directly to MARBL, as both codes use the same core TMOP algorithms.

Remap phase. The remap algorithm in BLAST has two main components, namely, velocity remap, which is solved by a continuous Galerkin advection discretization, and remap of the other fields, which is modeled by flux-limited discontinuous Galerkin (DG) advection [55, 56]. Using the MFEM infrastructure, the MARBL developers have written custom GPU code for matrix-free DG advection remap. This approach is expected to be improved significantly by future work in the CEED-developed Remhos miniapp, which contains novel matrix-free DG remap methods [57]. The continuous Galerkin advection solve is also fully GPU ported. Similarly to the CG mass matrix inversion in the Lagrangian phase, the remap GPU code is implemented inside MARBL, and the alternative of switching to the optimized MFEM kernels will be explored. In Figure 11 we present a recent study of MARBL that compares the node-to-node throughput of several CPU machines at LLNL (a Commodity Technology System (CTS), Astra, and Magma) versus the LLNL Sierra machine, showing a clear advantage of the GPU executions. Table 3 shows timings of the three main phases of the application, along with the final speedup of the matrix-free CPU vs. GPU kernels.

Figure 11: 3D triple-point problem throughput test in MARBL. Comparison of three CPU-based systems versus the NVIDIA V100 GPU-based Sierra.

Phase      FA CPU   PA CPU   PA GPU  Speedup
Time Loop  3854.16  2866.54  221.03  12.9
Lagrange   1773.68  1098.42  69.73   15.7
Remesh     557.98   366.24   42.67   8.5
Remap      1513.99  1393.34  100.95  13.8

Table 3: CPU and GPU timings on 3 nodes of LLNL's rzgenie (36 tasks per node) and rzansel (4 GPUs per node) machines. FA is traditional full assembly, while PA is matrix-free partial assembly; the speedup column is PA CPU over PA GPU. This is a full 3D high-order ALE simulation with 224,160 elements.

4.5. ExaConstit

ExaConstit [58] is a general implicit quasi-static nonlinear solid mechanics velocity-based finite element application built on the MFEM framework [12]. The code is being developed at LLNL for the ExaAM project in the ECP, with the goal of connecting local additively manufactured microstructures to local macroscopic properties by means of crystal plasticity finite element methods. As part of the larger ExaAM workflow to simulate the additive manufacturing process, ExaConstit and its constitutive library, ExaCMech [59], needed to be refactored to run on the GPU. This refactoring was required to run the hundreds to thousands of high-fidelity simulations on exascale hardware in a timely manner for the larger workflow. ExaCMech was ported to the GPU using RAJA forall loops wrapped around the entire large


constitutive kernel. Within ExaConstit, the dominant computational cost lies in the linearized system solve within a Newton-Raphson scheme. Therefore, the primary focus has been transitioning from a traditional full assembly method to partial and element assembly methods. This transition required the physics/constitutive calculations to be completely separated from the assembly method and called in a separate setup phase. The setup phase is now responsible for calculating the updated stress and material tangent stiffness matrix. Afterwards, these values can be incorporated into any of the runtime-selected linear assembly methods. The initial partial and element assembly formulations are based on [60] and [61], respectively. In order to keep the compute kernels backend agnostic, MFEM's forall abstraction macros, which make use of the RAJA backends, and MFEM's memory management capabilities were leveraged within ExaConstit for the vast majority of the compute kernels. Within ExaConstit, a few reduction operations were also converted to RAJA reduction policies to take advantage of the GPU, as these are not available within the MFEM API (a sketch of this pattern is given below). The end result of this refactoring is a ~14.5x speedup when using GPU element assembly compared to CPU full assembly on Summit. For an ExaAM-challenge-problem-sized linear hexahedron mesh of 6.7 million elements undergoing 5% monotonic strain, the GPU port and assembly improvements reduce the runtime from roughly 35 node-hours to 2.5 node-hours on 8 nodes of Summit for the ExaConstit stage alone. The results of these simulations will then be used to determine the properties used in the part-scale simulation of the larger ExaAM workflow being run on exascale hardware. Finally, from a physics point of view, these improvements are also enabling the ExaAM team to study the highly complex deformation processes that occur within additively manufactured microstructures, such as those shown in Figure 12, at a level of fidelity unprecedented for typical crystal plasticity methods.
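The RAJA reduction pattern mentioned above looks roughly as follows; this is a hedged sketch, not ExaConstit source, and the sequential policies can be swapped for RAJA::cuda_exec/RAJA::cuda_reduce (or their HIP equivalents) when building for a GPU.

```cpp
#include "RAJA/RAJA.hpp"

// Execution and reduction policies; for GPU builds these would typically be
// RAJA::cuda_exec<256> and RAJA::cuda_reduce (or the HIP equivalents).
using ExecPolicy   = RAJA::seq_exec;
using ReducePolicy = RAJA::seq_reduce;

// Example: a weighted average over all quadrature points, the kind of global
// reduction that is performed with RAJA reduction policies in ExaConstit.
double WeightedAverage(const double *field, const double *weights, int npts)
{
  RAJA::ReduceSum<ReducePolicy, double> num(0.0), den(0.0);
  RAJA::forall<ExecPolicy>(RAJA::RangeSegment(0, npts),
                           [=](RAJA::Index_type i) {
    num += field[i] * weights[i];
    den += weights[i];
  });
  return num.get() / den.get();
}
```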

5. Conclusions

In this paper we described the development of GPU-oriented algorithms for high-order finite element discretizations in the ECP CEED project. We presented the current GPU capabilities of several CEED components, including libCEED, MAGMA, MFEM, libParanumal, and Nek, which can now run efficiently on both NVIDIA and AMD

Figure 12: A representative microstructure within an additively manufactured part over a 500 micron cube volume, discretized into 27 million linear hexahedron elements. The highly heterogeneous effective plastic shearing rate is plotted along with the microstructure, where each crystal is represented by a different color.

GPUs. We also discussed some of the challenges of porting to exascale GPU architectures and presented application results that use the CEED-developed GPU technologies.

Acknowledgments

This research is supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative.

This research used resources of the Argonne Leadership Computing Facility, which is supported by the U.S. Department of Energy, Office of Science, under Contract DE-AC02-06CH11357. This research also used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract DE-AC05-00OR22725. Work was performed under the auspices of the U.S. Department of Energy under Contract DE-AC52-07NA27344 (LLNL-JRNL-816034).

References

[1] Center for Efficient Exascale Discretizations, Exascale Computing Project, DOE, ceed.exascaleproject.org.

[2] T. Kolev, P. Fischer, M. Min, J. Dongarra, J. Brown, V. Dobrev, T. Warburton, S. Tomov, M. S. Shephard, A. Abdelfattah, V. Barra, N. Beams, J.-S. Camier, N. Chalmers, Y. Dudouit, A. Karakus, I. Karlin, S. Kerkemeier, Y.-H. Lan, D. Medina, E. Merzari, A. Obabko, W. Pazner, T. Rathnayake, C. W. Smith, L. Spies, K. Swirydowicz, J. Thompson, A. Tomboulides, V. Tomov, Efficient exascale discretizations: High-order finite element methods, Int. J. HPC App. (2021) 1–26. doi:10.1177/10943420211020803.

[3] H. Kreiss, J. Oliger, Comparison of accurate methods for the integration of hyperbolic problems, Tellus 24 (1972) 199–215. doi:10.3402/tellusa.v24i3.10634.

[4] I. Babuska, M. Suri, The p and h-p versions of the finite element method, basic principles and properties, SIAM Review 36 (4) (1994) 578–632. doi:10.1137/1036141.

[5] S. Orszag, Spectral methods for problems in complex geometry, J. Comput. Phys. 37 (1980) 70–92. doi:10.1016/0021-9991(80)90005-4.

[6] D. Gottlieb, S. Orszag, Numerical Analysis of Spectral Methods: Theory and Applications, SIAM-CBMS, Philadelphia, 1977.

[7] D. Arndt, N. Fehn, G. Kanschat, K. Kormann, M. Kronbichler, P. Munch, W. A. Wall, J. Witte, ExaDG: High-order discontinuous Galerkin for the exa-scale, in: H.-J. Bungartz, S. Reiz, B. Uekermann, P. Neumann, W. E. Nagel (Eds.), Software for Exascale Computing - SPPEXA 2016-2019, Springer International Publishing, Cham, 2020, pp. 189–224.

[8] P. D. Bello-Maldonado, P. F. Fischer, Scalable low-order finite element preconditioners for high-order spectral element Poisson solvers, SIAM Journal on Scientific Computing 41 (5) (2019) S2–S18. doi:10.1137/18M1194997.

[9] C. Canuto, P. Gervasio, A. Quarteroni, Finite-element preconditioning of G-NI spectral methods, SIAM Journal on Scientific Computing 31 (6) (2010) 4422–4451. doi:10.1137/090746367.

[10] D. Moxey, R. Amici, M. Kirby, Efficient matrix-free high-order finite element evaluation for simplicial elements, SIAM Journal on Scientific Computing 42 (3) (2020) C97–C123. doi:10.1137/19M1246523.

[11] T. Sun, L. Mitchell, K. Kulkarni, A. Klockner, D. A. Ham, P. H. Kelly, A study of vectorization for matrix-free finite element methods, The International Journal of High Performance Computing Applications 34 (6) (2020) 629–644. doi:10.1177/1094342020945005.

[12] R. Anderson, J. Andrej, A. Barker, J. Bramwell, J.-S. Camier, J. Cerveny, V. Dobrev, Y. Dudouit, A. Fisher, T. Kolev, W. Pazner, M. Stowell, V. Tomov, I. Akkerman, J. Dahm, D. Medina, S. Zampini, MFEM: A modular finite element library, Computers & Mathematics with Applications. doi:10.1016/j.camwa.2020.06.009.

[13] M. Kronbichler, W. A. Wall, A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers, SIAM Journal on Scientific Computing 40 (5) (2018) A3423–A3448. doi:10.1137/16M110455X.

[14] M. Kronbichler, K. Ljungkvist, Multigrid for matrix-free high-order finite element computations on graphics processors, ACM Trans. Parallel Comput. 6 (1) (2019) 1–32. doi:10.1145/3322813.

[15] J. W. Lottes, P. F. Fischer, Hybrid multigrid/Schwarz algorithms for the spectral element method, J. Sci. Comput. 24 (2005) 45–78.

[16] P. Fischer, M. Min, T. Rathnayake, S. Dutta, T. Kolev, V. Dobrev, J.-S. Camier, M. Kronbichler, T. Warburton, K. Swirydowicz, J. Brown, Scalability of high-performance PDE solvers, Int. J. HPC App. 34 (5) (2020) 562–586. doi:10.1177/1094342020915762.

[17] J. Brown, A. Abdelfattah, V. Barra, N. Beams, J.-S. Camier, V. Dobrev, Y. Dudouit, L. Ghaffari, T. Kolev, D. Medina, W. Pazner, T. Ratnayaka, J. Thompson, S. Tomov, libCEED: Fast algebra for high-order element-based discretizations, Journal of Open Source Software 6 (63) (2021) 2945. doi:10.21105/joss.02945.

[18] A. Abdelfattah, V. Barra, N. Beams, J. Brown, J.-S. Camier, V. Dobrev, Y. Dudouit, L. Ghaffari, T. Kolev, D. Medina, W. Pazner, T. Ratnayaka, J. L. Thompson, S. Tomov, libCEED user manual (Jul. 2021). doi:10.5281/zenodo.5077489.

[19] G. Karniadakis, S. Sherwin, Spectral/hp Element Methods for Computational Fluid Dynamics, Oxford University Press, Oxford, 2005.

[20] P. E. Vos, S. J. Sherwin, R. M. Kirby, From h to p efficiently: Implementing finite and spectral/hp element methods to achieve optimal performance for low- and high-order discretisations, Journal of Computational Physics 229 (13) (2010) 5161–5181. doi:10.1016/j.jcp.2010.03.031.

[21] M. Ainsworth, G. Andriamaro, O. Davydov, Bernstein-Bezier finite elements of arbitrary order and optimal assembly procedures, SIAM Journal on Scientific Computing 33 (6) (2011) 3087–3109. doi:10.1137/11082539X.

[22] R. C. Kirby, Fast simplicial finite element algorithms using Bernstein polynomials, Numerische Mathematik 117 (4) (2011) 631–652. doi:10.1007/s00211-010-0327-2.

[23] K. Swirydowicz, N. Chalmers, A. Karakus, T. Warburton, Acceleration of tensor-product operations for high-order finite element methods, The International Journal of High Performance Computing Applications 33 (4) (2019) 735–757. doi:10.1177/1094342018816368.

[24] D. S. Medina, A. St-Cyr, T. Warburton, OCCA: A unified approach to multi-threading languages, arXiv preprint arXiv:1403.0968.

[25] MAGMA: Matrix algebra on GPU and multicore architectures, icl.utk.edu/magma.

[26] A. Abdelfattah, M. Baboulin, V. Dobrev, J. J. Dongarra, C. W. Earl, J. Falcou, A. Haidar, I. Karlin, T. V. Kolev, I. Masliah, S. Tomov, High-performance tensor contractions for GPUs, in: International Conference on Computational Science, ICCS, 6-8 June 2016, San Diego, California, USA, 2016, pp. 108–118. doi:10.1016/j.procs.2016.05.302.

[27] N. Beams, A. Abdelfattah, S. Tomov, J. Dongarra, T. Kolev, Y. Dudouit, High-order finite element method using standard and device-level batch GEMM on GPUs, in: 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Proceedings. To appear, 2020.

[28] R. D. Hornung, J. A. Keasler, The RAJA Portability Layer: Overview and Status, Tech. Rep. LLNL-TR-661403, LLNL (Sep. 2014).

[29] N. Chalmers, A. Karakus, A. P. Austin, K. Swirydowicz, T. Warburton, libParanumal: a performance portable high-order finite element library (2020). doi:10.5281/zenodo.4004744. URL github.com/paranumal/libparanumal

[30] D. Medina, OKL: a unified language for parallel architectures, Ph.D. thesis, Rice University (2015).

[31] N. Chalmers, T. Warburton, Portable high-order finite element kernels I: Streaming operations, arXiv preprint arXiv:2009.10917.

[32] N. Chalmers, T. Warburton, streamParanumal: Streaming microbenchmarks for high-order finite element methods. URL github.com/paranumal/streamparanumal

[33] gslib: Gather-scatter library (2020). URL github.com/Nek5000/gslib

[34] P. Fischer, S. Kerkemeier, M. Min, Y. Lan, M. Phillips, T. Rathnayake, E. Merzari, A. Tomboulides, A. Karakus, N. Chalmers, T. Warburton, NekRS, a GPU-accelerated spectral element Navier-Stokes solver, CoRR abs/2104.05829. arXiv:2104.05829. URL https://arxiv.org/abs/2104.05829

[35] A. Melander, E. Strøm, F. Pind, A. Engsig-Karup, C.-H. Jeong, T. Warburton, N. Chalmers, J. S. Hesthaven, Massive parallel nodal discontinuous Galerkin finite element method simulator for room acoustics, Tech. rep. (2020).

[36] Nek: Open source, highly scalable and portable spectral element code, nek5000.mcs.anl.gov (2020).

[37] NekCEM: Scalable high-order computational electromagnetic code, github.com/NekCEM/NekCEM (2020).

[38] P. F. Fischer, K. Heisey, M. Min, Scaling limits for PDE-based simulation, in: 22nd AIAA Computational Fluid Dynamics Conference, 2015, p. 3049.

[39] M. O. Deville, P. Fischer, E. Mund, High-order methods for incompressible fluid flow, Cambridge University Press, Cambridge, 2002.

[40] M. Otten, J. Gong, A. Mametjanov, A. Vose, J. Levesque, P. Fischer, M. Min, An MPI/OpenACC implementation of a high order electromagnetics solver with GPUDirect communication, The International Journal of High Performance Computing Applications 30 (3) (2016) 320–334. doi:10.1177/1094342015626584.

[41] J. Gong, S. Markidis, E. Laure, M. Otten, P. Fischer, M. Min, Nekbone performance on GPUs with OpenACC and CUDA Fortran implementations, special issue on sustainability on ultrascale computing systems and applications, J. Supercomputing 72 (11) (2016) 4160–4180. doi:10.1007/s11227-016-1744-5.

[42] E. Otero, J. Gong, M. Min, P. Fischer, P. Schlatter, E. Laure, OpenACC acceleration for the PN-PN-2 algorithm in Nek5000, Journal of Parallel and Distributed Computing 132 (2019) 69–78. doi:10.1016/j.jpdc.2019.05.010.

[43] P. Fischer, Projection techniques for iterative solution of Ax = b with successive right-hand sides, Computer Methods in Applied Mechanics and Engineering 163 (1998) 193–204. doi:10.1016/S0045-7825(98)00012-7.

[44] A. P. Austin, N. Chalmers, T. Warburton, Initial guesses for sequences of linear systems in a GPU-accelerated incompressible flow solver, arXiv preprint arXiv:2009.10863.

[45] OCCA: Lightweight performance portability library, libocca.org (2020).

[46] Y.-H. Lan, P. Fischer, E. Merzari, M. Min, All-hex meshing strategies for densely packed spheres, The 29th International Meshing Roundtable.

[47] R. W. Anderson, V. A. Dobrev, T. V. Kolev, R. N. Rieben, V. Z. Tomov, High-order multi-material ALE hydrodynamics, SIAM J. Sci. Comp. 40 (1) (2018) B32–B58. doi:10.1137/17M1116453.

[48] D. Beckingsale, M. McFadden, J. Dahm, R. Pankajakshan, R. Hornung, Umpire: Application-focused management and coordination of complex hierarchical memory, IBM Journal of Research and Development (2019) 1–10. doi:10.1147/JRD.2019.2954403. URL ieeexplore.ieee.org/document/8907404

[49] V. A. Dobrev, T. V. Kolev, R. N. Rieben, High-order curvilinear finite element methods for Lagrangian hydrodynamics, SIAM J. Sci. Comp. 34 (5) (2012) B606–B641. doi:10.1137/120864672.

[50] V. A. Dobrev, T. V. Kolev, R. N. Rieben, V. Z. Tomov, Multi-material closure model for high-order finite element Lagrangian hydrodynamics, Internat. J. Numer. Methods Engrg. 82 (10) (2016) 689–706. doi:10.1002/fld.4236.

[51] Laghos: High-order Lagrangian hydrodynamics miniapp, github.com/ceed/Laghos (2020).

[52] P. D. Bello-Maldonado, T. V. Kolev, R. N. Rieben, V. Z. Tomov, A matrix-free hyperviscosity formulation for high-order ALE hydrodynamics, Comput. Fluids. doi:10.1016/j.compfluid.2020.104577.

[53] V. Dobrev, P. Knupp, T. Kolev, K. Mittal, V. Tomov, The Target-Matrix Optimization Paradigm for high-order meshes, SIAM Journal on Scientific Computing 41 (1) (2019) B50–B68. doi:10.1137/18M1167206.

[54] V. A. Dobrev, P. Knupp, T. V. Kolev, K. Mittal, R. N. Rieben, V. Z. Tomov, Simulation-driven optimization of high-order meshes in ALE hydrodynamics, Comput. Fluids 208. doi:10.1016/j.compfluid.2020.104602.

[55] R. W. Anderson, V. A. Dobrev, T. V. Kolev, R. N. Rieben, Monotonicity in high-order curvilinear finite element arbitrary Lagrangian–Eulerian remap, Internat. J. Numer. Methods Engrg. 77 (5) (2015) 249–273. doi:10.1002/fld.3965.

[56] R. W. Anderson, V. A. Dobrev, T. V. Kolev, D. Kuzmin, M. Q. de Luna, R. N. Rieben, V. Z. Tomov, High-order local maximum principle preserving (MPP) discontinuous Galerkin finite element method for the transport equation, J. Comp. Phys. 334 (2017) 102–124. doi:10.1016/j.jcp.2016.12.031.

[57] H. Hajduk, D. Kuzmin, T. V. Kolev, R. Abgrall, Matrix-free subcell residual distribution for Bernstein finite element discretizations of linear advection equations, Comput. Methods Appl. Mech. Eng. 359. doi:10.1016/j.cma.2019.112658.

[58] R. A. Carson, S. R. Wopschall, J. A. Bramwell, ExaConstit (Aug 2019). doi:10.11578/dc.20191024.2. URL github.com/LLNL/ExaConstit

[59] N. R. Barton, R. A. Carson, S. R. Wopschall, U. N. N. S. Administration, ECMech (12 2018). doi:10.11578/dc.20190809.2. URL github.com/LLNL/ExaCMech

[60] A. K. Gupta, B. Mohraz, A method of computing numerically integrated stiffness matrices, International Journal for Numerical Methods in Engineering 5 (1) (1972) 83–89. doi:10.1002/nme.1620050108.

[61] A. K. Gupta, Efficient numerical integration of element stiffness matrices, International Journal for Numerical Methods in Engineering 19 (9) (1983) 1410–1413. doi:10.1002/nme.1620190910.
