
A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations

Azzam Haidar, Ahmad Abdelfattah, Mawussi Zounon, Stanimire Tomov,

    and Jack Dongarra, Fellow, IEEE

Abstract—We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups versus CUBLAS of up to 6x for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU.

    Index Terms—Batched computation, GPUs, variable small sizes


    1 INTRODUCTION

ALTHOUGH it might seem like an attractive idea to focus the efforts of the high-performance computing (HPC) community on addressing large-scale problems, the experience of the research community over the last few years has clearly shown that applications that use many small matrices or tensors cannot make efficient use of modern HPC systems and the associated vendor-optimized linear algebra libraries. The existing libraries have been designed for large matrices, and—historically—the scope of vendor libraries has been too narrow to address matrix computation problems. Consequently, the performance that these libraries deliver tends to be inadequate. Moreover, there are good reasons to believe that neither improved compiler technology nor autotuning will make any significant headway on this problem. This lack of coverage by current library infrastructure is especially alarming because of the number of applications from important fields that fit this profile, including deep learning [1], data mining [2], astrophysics [3], image and signal processing [4], [5], hydrodynamics [6], quantum chemistry [7], and computational fluid dynamics (CFD) and the resulting partial differential equations (PDEs) through direct and multifrontal solvers [8], to name a few. Dramatically better performance on these applications can be achieved by using software that can repetitively execute small matrix/tensor operations grouped together in "batches." However, the near total lack of such capabilities in existing software has forced users to rely on different inefficient solutions.

For example, in the NWChem package used in chemistry problems, the computation of the two-electron integrals and the Fock matrix becomes a bottleneck when many integrals of small size have to be computed independently, which shows the necessity of optimized batched libraries. Moreover, in [9], the authors discussed the optimization of NWChem for Intel's MIC architecture and highlighted the need for tensor computations of about 200–2,000 matrices from 10x10 to 40x40 in size. In his dissertation, David Ozog discussed NWChem's Tensor Contraction Engine (TCE) and revealed how strongly it relies on the performance of general matrix-matrix multiplication (GEMM) in the computation of the tensor contraction. In the summary of the work done by [10], the authors noted that small matrix computations have a severe bottleneck, and specialized libraries are required. The need for efficient libraries for thousands of small matrix computations was also discussed at NVIDIA's GPU Technology Conference in 2016 by Mniszewski et al. [11] in the context of their research on quantum molecular dynamics, where the dense matrix computation can be performed as an independent set of computations. Another important motivation is that the matrix polynomial can be computed in a batch fashion as a set of small, independent computations [12].

• A. Haidar, A. Abdelfattah, and S. Tomov are with the Innovative Computing Laboratory, University of Tennessee, Knoxville, TN 37996. E-mail: {haidar, ahmad, tomov}@icl.utk.edu.

• M. Zounon is with the University of Manchester, Manchester M13 9PL, United Kingdom. E-mail: [email protected].

• J. Dongarra is with the Innovative Computing Laboratory, University of Tennessee, Knoxville, TN 37996, with the Oak Ridge National Laboratory, Oak Ridge, TN 37830, and also with the University of Manchester, Manchester M13 9PL, United Kingdom. E-mail: [email protected].

Manuscript received 25 Aug. 2017; revised 27 Nov. 2017; accepted 12 Dec. 2017. Date of publication 15 Dec. 2017; date of current version 6 Apr. 2018. (Corresponding author: Mawussi Zounon.) Recommended for acceptance by R. Cumplido. Digital Object Identifier no. 10.1109/TPDS.2017.2783929

Deep Learning communities have also shown a significant interest in computations involving many small matrices. NVIDIA [13] already highlighted the need for a batched GEMM routine and has also started providing some of the batched routines (e.g., GEMM, triangular solver matrix [TRSM], LU, QR, inversion) for fixed size in their cuBLAS library. Nervana [14], which is one of the pioneers of deep learning, demonstrated the critical need for batched matrix computation kernels for high-performance deep learning software. Intel has also provided batched GEMM and batched TRSM routines for fixed matrix sizes.

For the block mass-weighted Hessian in the molecular dynamics simulation [15], computing the eigendecomposition involves computing the eigendecomposition of many small, independent matrices, which can be viewed as a batched eigendecomposition. In addition to the packages cited above, we note that in the GROMACS package [16], because the particle-particle and mesh interactions are independent, batched computation can be used to speed up and overlap the expensive global communication in the Particle-Mesh Ewald (PME). Also, in [17], the authors noted the need for optimized and hardware-aware basic linear algebra subprograms (BLAS) to perform many independent computations inside the GROMACS MD simulation.

The approach behind Flash Principal Component Analysis (FlashPCA) performs a large number of eigendecompositions across many samples. Also, in combustion and astrophysics supernova applications [3], [18], [19], [20], [21], the study of thermonuclear reaction networks (the XNet package) requires the solution of many sparse linear systems of around 150x150. Furthermore, the need for batched routines can be illustrated in radar signal processing [5], where a batch of 200x200 QR decompositions is needed, as well as in hydrodynamic simulations [6], where thousands of matrix-matrix and matrix-vector (GEMV) products of matrices of around 100x100 are needed.

Although the brief introduction above shows some viable approaches, it mostly highlights the keen awareness of the need for batched libraries that can enable small matrix applications to finally start exploiting the full power of current and future hybrid platforms. Some vendors have started to provide some batched functionalities in their numerical libraries (e.g., NVIDIA's CUBLAS and Intel's Math Kernel Library [MKL]). Additionally, some open-source libraries from the HPC community (e.g., the Matrix Algebra on GPU and Multicore Architectures [MAGMA] library [22]) have also started to deliver batched routines [23], [24], [25]. While performance has been improving with these contributions, there is still a lack of understanding of how to design, implement, analyze, and optimize batched routines to exploit modern architectures at full efficiency. The goal of this paper is to develop a theoretical analysis and a methodology for high-performance linear solvers. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve performance close to the theoretical upper bound.

    2 CONTRIBUTIONS

The primary goal of this paper is to propose a framework design for batched algorithms and to study their efficient implementations. We believe this study will help HPC application developers more effectively harness GPUs and achieve performance close to the theoretical peak of the hardware. Our primary contributions to this end are listed below.

• In addition to efficient implementations that exhibit a good speedup, we provide a detailed analysis of optimization techniques. We also present a collection of best practices to help users understand and develop batched computation kernels in a simple and efficient fashion.

• We propose a methodology for designing a performance model that provides insight into the performance spectrum of batched kernels. The main advantage of this model is that it helps predict the achievable performance of the kernels with increased accuracy.

• We investigate a model that simplifies the autotuning process by considering hardware information and representative experiments that enable us to considerably reduce the configuration/parametrization search space. We expect this contribution to drastically reduce the significant autotuning time required by complex GPU kernels.

• We propose a modular design that relies on standard data formats and interfaces to enable a straightforward connection with mainstream application code.

    3 RELATED WORK

At the time of writing, there are no specifications for the computation of batched small linear algebra problems. Consequently, the Linear Algebra PACKage (LAPACK), which is the de facto standard library for linear algebra problems, does not provide such functionality. However, at the request of users, NVIDIA added a few batch kernels to CUBLAS 6.5 [26]. Those additions include batched versions of both BLAS and LAPACK routines. For BLAS kernels, batched GEMM and batched TRSM have been released. More effort has been put into the direction of LAPACK kernels, resulting in batched versions of the LU and QR factorizations, matrix inversion, and a least squares solver. Similarly, in response to customer demand, Intel's MKL team released a batched GEMM, while—at the time of writing—AMD does not provide any batched operations. Vendor efforts on batched BLAS may have a tremendous impact and enhance the portability of HPC applications, much like optimized BLAS did [27] when released. While real-world applications may require solving a batch of small matrices of different dimensions, the batched kernels developed by NVIDIA and Intel are limited to the case where the matrix problems in the batch are of the same dimension. NVIDIA's release of four major batched LAPACK-based routines is significant; however, they did not address the problem of portability and device-specific redesigns of the batched LAPACK algorithms.

Batched linear algebra concepts could be applied to multicore CPUs as well. Indeed, small problems can be solved efficiently on a single core (e.g., using vendor-supplied libraries like MKL [28] or the AMD Core Math Library [ACML] [29]), because the CPU's memory hierarchy would support a "natural" data reuse (small enough problems can fit into small, fast memory). Besides memory reuse, to further speed up the computation, vectorization can be added to use the supplementary single instruction, multiple data (SIMD) processor instructions—either explicitly as in the Intel Small Matrix Library (SML) [30], or implicitly through the vectorization in BLAS. Batched factorizations can then be efficiently computed for multicore CPUs by having a single core factorize a single problem at a time.

For higher-level routines, prior work has concentrated on achieving high performance for large problems using hybrid algorithms [31]. The motivation came from the fact that the GPU's compute power cannot be used on a panel factorization as efficiently as it can on trailing matrix updates [32]. As a result, various hybrid algorithms were developed—where the panels are factorized on the CPU, while the GPU is used for trailing matrix updates (mostly GEMMs) [33], [34]. For large-enough problems, the panel factorizations and associated CPU-GPU data transfers can be overlapped with GPU work. For small problems, however, this is not possible, and our experience has shown that hybrid algorithms would not be as efficient as they are for large problems.

To compensate for the lack of support for batched operations, application developers implemented customized batched kernels using various approaches. For example, targeting very small problems (no larger than 128x128), Villa et al. [35], [36] designed a GPU kernel for a batched LU factorization, where a single CUDA thread, or a single thread block, was used to solve one system at a time. Similar techniques, including the use of a single CUDA thread warp for a single factorization, were investigated by Wainwright [37] for LU factorization with full pivoting on small matrices of up to 32x32. These problems are small enough to fit in the GPU's shared memory (48 KB on a K40 GPU) and thus can benefit from data reuse. Unfortunately, the results showed these strategies do not exceed the performance of memory-bound kernels like GEMV.

    4 METHODOLOGY, ANALYSIS, AND DISCUSSION

In this section, we present our methodology for analyzing high-performance batched linear algebra computations and discuss the insight and theory required to design, implement, and optimize algorithms to run efficiently on modern GPU architectures. We also provide algorithm design guidance to ensure an efficient and effortless portability of batched linear algebra kernels across a large range of GPU architectures.

It is a common misconception that algorithm analysis is only useful for squeezing the last possible 5 percent of performance from very small matrix applications. In other words, when forgoing this analysis, one sacrifices only 5 percent in performance while avoiding a rigorous algorithm analysis. While it is true that some specific multicore CPU applications do not gain much from algorithm analysis, accelerator-based applications have far more potential, since their underlying principles are fundamentally different from conventional CPUs.

CPUs have accumulated design complexity that enables them to optimize instruction streams by looking ahead hundreds of instructions at a time. Consequently, CPUs can resolve data dependence hazards, predict branch decisions, and buffer cache/memory requests efficiently. The majority of these features are missing from GPU "cores," which—for the sake of accuracy—should be called "processing units." Intel Xeon Phi coprocessors are only marginally better with their basic cache coherency, but they still require programmer/compiler-directed scheduling to use their in-order execution at full efficiency. Thus, many factors and constraints should be studied carefully to achieve relevant insight and provide a design framework that could benefit the research and development community. One of the main issues associated with working on small matrices is that the overall execution time is dominated by data transfer, because the time required to process a small matrix on modern GPUs is negligible. Put differently, the computation of small matrices follows the trend of memory-bound algorithms, and the performance strongly correlates with data transfer rather than floating-point operations (FLOPs).

On CPUs, very small matrices are more likely to remain in at least the L2 cache. With such a configuration, using each core to solve one problem at a time is enough to achieve reasonably high performance. This makes the development of batched algorithms easy to handle and optimize with a relatively small effort on CPUs. However, for GPUs the cache size is very small (48–64 KB on newer GPUs), which makes batched linear algebra kernel implementation more challenging. Building and implementing an efficient GPU batched algorithm kernel requires an understanding and analysis of many factors, which we address in this work.

In addition, we demonstrate that, because GPU application design puts the focus on maximizing data throughput rather than minimizing processing latency, the common practice of optimizing a sequential implementation first and then optimizing the parallel version no longer applies. For well-designed kernels focused on large matrix-matrix type operations, autotuning helps a lot in achieving good performance. However, autotuning is becoming overrated in the GPU programming community, where smaller matrices reign supreme. A common misinterpretation of recent autotuning studies is to believe that an efficient autotuning framework is enough to harness a GPU's performance. However, when it comes to operations like solving a batch of very small matrix problems, even the most powerful autotuning framework, without the correct algorithm design, will fail to provide good performance. That said, if one designs the kernel using both an algorithmic analysis and a good understanding of the underlying hardware, the autotuning framework can be simplified and tested/implemented. Unlike most autotuning approaches, though, our strategy does not require hundreds of thousands of runs followed by result interpretations to provide an efficient kernel.

Finally, we analyze the effect of reshaping the data storage into a non-conventional storage to design highly optimized kernels for batches of small linear algebra operations.

4.1 Theoretical Analysis and Performance Roofline Model

In this section, we discuss the theoretical analysis of algorithms designed for batches of very small matrix problems. A detailed study based on the Cholesky and LU factorization algorithms is used for the sake of illustration. The roofline model for processing very large matrices can be easily presented without much complexity. On the other hand, working out an accurate roofline model for matrices of small to medium size involves a lot of complexities related to both the algorithm and the hardware itself.

The roofline model for large-size matrix computations has been widely studied, and a remarkable discussion of matrix computation roofline models by James Demmel can be found in [38]. In general, large matrix computation algorithms can be classified into the three categories listed below.

(1) Compute-Intensive Algorithms. The first category includes compute-intensive algorithms, which are characterized by a high arithmetic intensity. Arithmetic intensity is the ratio of total FLOPs to total data movement (bytes). In this case, the computation time is dominant compared to the data transfer time. Consequently, a good implementation could easily overlap the communication time with computation, which leads to a roofline model mostly defined by the arithmetic intensity. The upper bound is then limited by the computation's peak performance. As an example, an optimized GEMM kernel for large matrices can achieve performance close to that of the machine's peak.

(2) Memory/Bandwidth-Bound Algorithms. The second category includes low arithmetic intensity algorithms—also known as memory-bound or bandwidth-bound algorithms. Matrix-vector and vector-vector operations are typical examples of these algorithms.

(3) Compute-Intensive, Bandwidth-Bound Algorithms. The third category deals with algorithms that lie somewhere between the previous two. These algorithms require a detailed analysis in order to evaluate their performance upper bound. This is the case, for example, when applying matrix-matrix operations on small matrices. While a matrix-matrix operation itself has a high arithmetic intensity, owing to the significant data transfer latency, the time to process the small matrices may be closer to the time required to move the matrices. Since processing very small matrices falls in this category, we provide more details on assessing its performance upper bound and give some recommendations for medium-size matrix computations.
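As a back-of-the-envelope illustration of the first two categories, consider double-precision GEMM and GEMV on n x n matrices, under the simplifying assumption that each operand is moved to or from main memory exactly once:

    \[
    \mathrm{AI}_{\mathrm{GEMM}} \approx \frac{2n^3}{3 \cdot 8n^2} = \frac{n}{12}\ \mathrm{FLOPs/byte},
    \qquad
    \mathrm{AI}_{\mathrm{GEMV}} \approx \frac{2n^2}{8n^2} = 0.25\ \mathrm{FLOPs/byte}.
    \]

GEMM therefore moves into the compute-bound regime as n grows, while GEMV stays bandwidth-bound at any size; batched factorizations of very small matrices behave like the latter.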

The theoretical bound of floating-point performance is computed as P_max = FLOPs / T_min, where FLOPs is the number of floating-point operations and T_min is the minimum time to solution. T_min can be defined as

\[
T_{\min} = \min\bigl(T_{\mathrm{Read}} + T_{\mathrm{Compute}} + T_{\mathrm{Write}}\bigr). \tag{1}
\]

Depending on the implementation and the type of algorithm, there might be some overlap between the data transfer steps (read and write) and the computation step. However, on modern high-performance architectures that are capable of achieving many FLOPs per cycle per core, the time required to read and write very small matrices is about 1–2 orders of magnitude higher than the computation time. The time to read or write is predefined by the bandwidth, while the computation time is determined by both the speed of the hardware and the efficiency of the implementation. Thus, for very small matrix operations, any algorithm—even the historically compute-intensive matrix-matrix multiplication—is bounded by the data movement, and its upper-bound performance can be easily derived. For a generic description, and for matrices of medium size (larger than 32x32), we assume that the algorithm is elegantly designed. This means that the algorithm has the best computation/communication overlap, even though this is less likely to be the case for very small matrices. In such a situation, one can easily predict the minimal time T_min and thus the performance upper bound.

Let us take the LU factorization as an example. The LU factorization algorithm reads an n x n matrix, meaning n^2 elements, processes (2/3)n^3 FLOPs, and writes back n^2 elements. On the other hand, in double precision, an NVIDIA P100 GPU can perform 5,300 gigaFLOP/s, while its maximum achievable read/write bandwidth is about 600 GB/s. On the P100 hardware, the time to transfer one byte is approximately 9x higher than the time required to complete one FLOP. Consequently, the time to complete a batch of small matrix operations will be dominated by the data transfer time. In other words, even with a full overlap, the minimal time will be bounded by the data movement time, making the computation time negligible. To approximate the roofline upper bound, we can define

\[
P_{\mathrm{upper\ bound}} = \frac{\mathrm{FLOPs}}{T_{\mathrm{data\ transfer}}} = \frac{\mathrm{FLOPs} \times b}{M}, \tag{2}
\]

where FLOPs is the number of floating-point operations, b is the achievable bandwidth of the hardware, and M is the size of the data to be moved (load and store).
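As a quick sanity check of Equation (2), assuming the n^2 input and n^2 output elements of the LU factorization above are each moved exactly once in double precision and that the transfers fully overlap the computation, the batched LU upper bound on the P100 for n = 32 is

    \[
    P_{\mathrm{upper\ bound}} = \frac{\tfrac{2}{3}n^3 \times b}{2 \times 8 n^2}
    = \frac{n\,b}{24}
    = \frac{32 \times 600\ \mathrm{GB/s}}{24} \approx 800\ \mathrm{gigaFLOP/s},
    \]

which is roughly 15 percent of the P100's 5,300 gigaFLOP/s double-precision peak. This is why the rest of the analysis treats these kernels as bandwidth-bound.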

    4.2 The Occupancy and Bandwidth Analysis

GPU occupancy has been treated as a performance indicator for many years. Unfortunately, although the occupancy is correlated to performance, it does not necessarily imply that a code that achieves a high occupancy will also deliver high performance.

The importance of the occupancy parameter and its correlation to high performance varies between compute-intensive kernels and bandwidth-bound kernels. For example, with about 12 percent occupancy, a compute-intensive matrix-matrix multiplication achieves performance close to that of the machine's peak. The performance of a compute-intensive kernel is determined by the amount of data reuse (i.e., a high arithmetic intensity). Moreover, occupancy is not the most determinant factor in achieving good performance. In fact, the occupancy is defined as the ratio of active threads running over the total number of active threads supported by the hardware (2,048 threads per SMX for the NVIDIA P100 GPU). For example, if there are 6 thread blocks (TBs) executing simultaneously on each of the P100's 56 streaming multiprocessors (SMXs), i.e., 336 TBs in total, and each uses 256 threads, then the occupancy is (6 x 256)/2048 = 75%. A bandwidth-bound matrix-vector kernel attains about 3 percent of the machine peak, while it reaches about 95 percent of the achievable bandwidth with an occupancy between 60 and 90 percent.


Since it is a bandwidth-bound kernel, it is always preferable to use bandwidth as the metric rather than FLOPs per second (FLOP/s).

As shown by the representative experiment in Fig. 1, the achievable bandwidth strongly correlates to the number of threads per TB and to the number of TBs running simultaneously per SMX. In other words, the achievable bandwidth is strongly correlated to the total number of threads running on an SMX. We use the metric "per SMX" since the GPU hardware specifications are mostly defined by the number of SMXs. However, for the sake of clarity, when we show 6 TBs/SMX, we mean that there are 56 x 6 = 336 TBs running on the whole P100 GPU. This will be one of the important data points to consider when implementing a memory-bound algorithm or when designing a batched algorithm for small-size matrices. Note that any type of algorithm that operates on small amounts of data is considered to be bandwidth-bound, as shown in Section 4.1.

The number of threads per TB is mostly related to the algorithmic design of the target application. For example, a Cholesky factorization proceeds column by column, from left to right. One column is processed at a time, followed by the update of all the right-side columns, while the columns on the left remain untouched. These algorithm details, along with the matrix size, are very important and should guide the design choices when considering thread configurations.

To design a kernel for a batched Cholesky factorization, one has the choice of using a 1-D or a 2-D grid of threads for each matrix factorization. For an m x m square matrix factorization, the 2-D configuration is the most intuitive choice since the implementation can be straightforward, with an m x m thread mapping resulting in a 1:1 thread-to-matrix-entry assignment. Unfortunately, this will require using heavyweight TBs, which are relatively expensive to manage. Put differently, using a 2-D configuration will result in few TBs per SMX, consequently inducing a low-bandwidth situation, as illustrated by the experiment depicted in Fig. 1. In addition, the Cholesky algorithm is sequential by column, and it is only during the update steps that the whole 2-D grid can be involved (i.e., during a column process, all of the threads that are not involved will be in an idle state, wasting a lot of resources). Another penalty associated with the 2-D configuration is the synchronization. In fact, a 2-D configuration is more likely to exceed the warp size (32 threads), and barriers will be required when accessing shared data.

To avoid the drawbacks exhibited by the 2-D configuration, we propose a design based on a 1-D configuration. For example, for an m x m matrix, we use a TB with m threads. This means more work per thread and therefore more room for parallelism.
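A minimal CUDA sketch of such a 1-D configuration is shown below. It is illustrative only, not the exact kernel: the kernel name, the dA_array pointer-array argument, and the shared-memory sizing (the m x ib + ib^2 panel rule discussed in Section 4.3) are assumptions made here for the example.

    #include <cuda_runtime.h>

    // One thread block of m threads factorizes one matrix of the batch (1-D configuration).
    __global__ void batched_potrf_1d_sketch(double **dA_array, int m, int lda)
    {
        extern __shared__ double sPanel[];      // panel workspace, sized by the host launch
        double *A  = dA_array[blockIdx.x];      // each TB owns one matrix of the batch
        int     tx = threadIdx.x;               // one thread per row of the matrix

        // ... the panel-by-panel factorization loop (see Section 4.3) would go here,
        //     with thread tx responsible for row tx of every panel loaded into sPanel ...
        (void)A; (void)tx; (void)sPanel; (void)lda;
    }

    // Host-side launch: one TB per matrix, m threads per TB, panel storage in shared memory.
    void launch_batched_potrf(double **dA_array, int m, int lda, int batchCount, int ib)
    {
        size_t shmem = (size_t)(m * ib + ib * ib) * sizeof(double);
        batched_potrf_1d_sketch<<<batchCount, m, shmem>>>(dA_array, m, lda);
    }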

    4.3 An Analytical Study of the Algorithmic Design

This section is dedicated to the efficient implementation of the Cholesky factorization kernel based on a design that uses a 1-D grid of threads. Designing this kernel involves a relatively high level of complexity. The design is critical because a non-optimal decision could be penalizing in terms of autotuning time.

Simply put, a basic kernel design could consist of the following steps: (1) load the whole matrix into shared memory or into registers; (2) perform all necessary computations; (3) finally, write the result back to the main memory. This approach is the most common for compute-intensive kernels used to solve large matrix problems. However, this design decision is not an attractive option for solving thousands of small, independent matrix operations. On modern GPUs, the shared memory size is 64 KB per SMX. This means that, in double precision, each SMX cannot hold matrices larger than 80x80. Therefore, a shared-memory kernel that implements the algorithm, which consists of loading the whole matrix into the shared memory and performing the factorization, will be limited to solving matrices that are 80x80 or smaller in double precision and 160x160 or smaller in single precision. Similarly, if we use registers to hold the matrix, we will also be limited to matrices that are 128x128 or smaller in double precision (the register file size is about 256 KB, while about half of that will be used for internal variables and parameters). In addition, using the full shared memory or too many registers will severely limit the number of active TBs per SMX.
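These capacity limits follow directly from the storage sizes; as a quick check for the two double-precision cases quoted above:

    \[
    80^2 \times 8\ \mathrm{B} \approx 51\ \mathrm{KB} \le 64\ \mathrm{KB\ (shared\ memory)},
    \qquad
    128^2 \times 8\ \mathrm{B} = 131\ \mathrm{KB} \approx \tfrac{1}{2} \times 256\ \mathrm{KB\ (register\ file)}.
    \]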

According to our representative experiments illustrated in Fig. 1, having 1–2 TBs per SMX will result in low bandwidth. Since small-size matrix computations are bandwidth-bound, using a few TBs per SMX is a bad design decision that can lead to poor performance. To achieve good performance for very small matrices (up to 32x32), one option is to use roughly 8 KB of shared memory per TB, with 7 TBs per SMX. However, when the matrix size exceeds 32x32, another design option should be investigated.

For the sake of simplicity, we rely on the shared-memory implementation to describe the design for matrices larger than 32x32. The register-based version is very similar. Later, we will revisit both the shared memory and the register versions for the LU factorization. To address matrices larger than 32x32, the key idea is to divide the whole matrix into block columns—also known as "panels." For example, an n x n matrix will be divided into blocks of n x ib, where ib—called the "block size"—is the number of columns per block. This modification helps us avoid loading the whole matrix directly into shared memory and instead allows us to load it panel by panel. This is known as a "blocking technique." Since the panel factorization is mainly sequential, splitting the factorization into panels is a reasonable design decision.

Fig. 1. The achievable bandwidth based on the number of TBs per SMX and the number of threads in each TB on the NVIDIA P100 GPU. Note: to obtain the total number of TBs running on the whole GPU, the x-axis value should be multiplied by 56, since the P100 has 56 SMXs.

There are many versions of the Cholesky factorization. Here, we discuss the right-looking, the left-looking, and the top-looking variants. As an example, we show the analysis of the left-looking and the right-looking variants. In the right-looking variant, illustrated in Algorithm 1, the matrix is factorized, panel by panel, from left to right. At each step, the current panel is processed, followed by an immediate update of the panels to the right of the current panel. In the left-looking variant, described in Algorithm 2, at each step the updates from previous panels (left side) are applied to the current panel before performing the computations on the current panel. In other words, the updates are applied as late as possible, while in the right-looking algorithm, the updates are applied as soon as possible. We refer the reader to [39] for further details on the different Cholesky implementations.

Algorithm 1. The Right-Looking Fashion

for i = 0 to m Step ib do
    // Panel factorize A(i:m, i:i+ib)
    POTF2  A(i:i+ib, i:i+ib)
    TRSM   A(i+ib:m, i:i+ib) = A(i+ib:m, i:i+ib) x A(i:i+ib, i:i+ib)^{-1}
    // Update trailing matrix A(i+ib:m, i+ib:m)
    SYRK   A(i+ib:m, i+ib:m) = A(i+ib:m, i+ib:m) - A(i+ib:m, i:i+ib) x A(i+ib:m, i:i+ib)^T
end

Algorithm 2. The Left-Looking Fashion

for i = 0 to m Step ib do
    if (i > 0) then
        // Update current panel A(i:m, i:i+ib)
        SYRK  A(i:i+ib, i:i+ib) = A(i:i+ib, i:i+ib) - A(i:i+ib, 0:i) x A(i:i+ib, 0:i)^T
        GEMM  A(i+ib:m, i:i+ib) = A(i+ib:m, i:i+ib) - A(i+ib:m, 0:i) x A(i:i+ib, 0:i)^T
    end
    // Panel factorize A(i:m, i:i+ib)
    POTF2 A(i:i+ib, i:i+ib)
    TRSM  A(i+ib:m, i:i+ib) = A(i+ib:m, i:i+ib) x A(i:i+ib, i:i+ib)^{-1}
end

To analyze the roofline bound of the right-looking variant, let us calculate the volume of the data transfer. At each iteration, applying the updates from the current panel requires loading and storing back all the panels to the right of the current one. For the sake of exposition, the panels to the right of the current panel will be referred to as the "trailing matrix." If the shared memory is used to hold the current panel, then registers will be dedicated to the storage of a portion of the trailing matrix to update, and vice versa. The main drawback exhibited by this algorithm is the unnecessarily large amount of data movement. In fact, the first panel will be loaded once, only for its factorization. The second panel will be loaded once to apply the updates from the first panel, and a second time for its own factorization; in general, the kth panel will be loaded k times. In terms of communication volume, at iteration k, the current panel will be of size (n - (k-1)ib) x ib, and the trailing matrix will be of size (n - (k-1)ib) x (n - k*ib). The sum gives (n - (k-1)ib)^2. However, since the Cholesky factorization is applied to a symmetric positive definite matrix, only half of the matrix needs to be considered, which reduces the communication volume to (1/2)(n - (k-1)ib)^2. Since each panel is loaded and then stored back, the volume of data transfer can be multiplied by two, which means that the total communication volume at the kth iteration is (n - (k-1)ib)^2. Assuming that the matrix is divided into p same-size panels (n = p x ib), the total volume of communication for the right-looking variant is

\[
V = \text{data read} + \text{data written}
  = \sum_{k=1}^{p} \bigl(n - (k-1)\,ib\bigr)^2
  = \sum_{k=1}^{p} \bigl(p\,ib - (k-1)\,ib\bigr)^2
  = ib^2 \sum_{k=1}^{p} (p+1-k)^2
  = ib^2 \Bigl(\tfrac{p^3}{3} + \tfrac{p^2}{2} + \tfrac{p}{6}\Bigr)
  \approx ib^2\,\tfrac{p^3}{3} = ib^2\,\tfrac{n^3}{3\,ib^3}
  \approx \tfrac{1}{3}\,\tfrac{n^3}{ib}. \tag{3}
\]

In contrast to the right-looking algorithm, the left-looking variant loads a panel, applies the updates from previous panels (to the left of the panel), factorizes it, and stores it back. As in the right-looking design, the current panel is stored in shared memory, while portions of the matrix to its left are loaded into registers in order to perform its update. At the kth iteration, the volume of data involved in loading the current panel and storing it back is (n - (k-1)ib) x ib, while the volume of data required for moving the updates from previous panels is (n - (k-1)ib) x (k-1)ib. Thus, the volume of communication at iteration k is k(n - (k-1)ib)ib. Since n = p x ib, we get k(p - k + 1)ib^2. As a consequence, the total amount of communication for the left-looking variant is

\[
V = \text{data read} + \text{data written}
  = ib^2 \sum_{k=1}^{p} k\,(p - k + 1)
  = ib^2 \sum_{k=1}^{p} \bigl(k\,(p+1) - k^2\bigr)
  = ib^2 \Bigl((p+1)\,\tfrac{(p+1)\,p}{2} - \bigl(\tfrac{p^3}{3} + \tfrac{p^2}{2} + \tfrac{p}{6}\bigr)\Bigr)
  = ib^2 \Bigl(\tfrac{1}{6}p^3 + \tfrac{1}{2}p^2 + \tfrac{1}{3}p\Bigr)
  \approx \tfrac{1}{6}\,ib^2\,p^3
  \approx \tfrac{1}{6}\,\tfrac{n^3}{ib}. \tag{4}
\]

Since very small size matrix processing is bandwidth-bound, and the performance depends on the volume of data transfer, we can now derive the roofline performance upper bound for both the right-looking and the left-looking variants. The Cholesky factorization of an n x n matrix consumes about (1/3)n^3 FLOPs. Thus, with respect to Equation (2), in double precision, we can expect the right-looking version to have an asymptotic performance upper bound of (1/3)n^3 x 3*ib*b/(8*n^3) = ib*b/8. Using the same method, the left-looking variant will be bounded by ib*b/4 in double precision, which means that—in theory—the left-looking variant could achieve twice the performance of the right-looking implementation in the context of very small matrices. To decide whether to use the right-looking or the left-looking algorithm, a traditional approach consists of prototyping both approaches and going through long autotuning sweeps before assessing the performance of the two algorithms. However, based on our effective algorithm analysis, we proved that the right-looking variant is not suitable for very small matrices.
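Substituting the numbers used in the remainder of this section (ib = 8 and an ideal bandwidth of b ~ 600 GB/s) into the two bounds above gives

    \[
    P_{\mathrm{right}} \le \frac{ib\,b}{8} = \frac{8 \times 600}{8} = 600\ \mathrm{gigaFLOP/s},
    \qquad
    P_{\mathrm{left}} \le \frac{ib\,b}{4} = \frac{8 \times 600}{4} = 1{,}200\ \mathrm{gigaFLOP/s},
    \]

which quantifies the factor-of-two advantage of the left-looking variant before a single line of kernel code is written.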

At this point, the primary question is the effectiveness of our algorithm analysis. In an ideal scenario, the bandwidth is b ~ 600 GB/s; with ib = 8, the performance of the left-looking kernel is thus bounded by 1,200 gigaFLOP/s, as computed above. On the other hand, the experimental results of our kernel based on the left-looking design are depicted in Fig. 2a. As illustrated by the chart, the performance obtained is far from the 1,200 gigaFLOP/s upper bound. Achieving half of the upper bound performance does not necessarily mean that the computation is expensive or sequential. It is important to note that the results displayed have been intensively autotuned, and only the best results are reported.

A careful study of the design, along with the information reported in Fig. 1, could in fact allow an accurate guess of most of the results displayed in Fig. 2a. For the sake of illustration, let us take n = 512 and ib = 8 as an example. Based on the algorithm characteristics, the code requires about n x ib + ib^2 elements to be stored in shared memory, which is about 32.5 KB when using 512 threads. As a result, only 1 TB can run per SMX, meaning that the achievable bandwidth in this condition is about 420 GB/s. Consequently, a reasonable performance upper bound is 840 gigaFLOP/s. This demonstrates the effectiveness of our implementation. However, more investigation may provide additional information that will help us understand why we did not get close to the upper bound. An advanced analysis of the Cholesky factorization revealed that, for the first panel, the algorithm requires all 512 threads to work. However, for the next panel, the number of threads required decreases by ib, and so on until the last panel, where only ib threads have to work. This is an interesting clue for understanding why we failed to reach the performance upper bound. Since the real bandwidth is a function of the number of threads and the number of TBs, we achieve 420 GB/s with 1 TB per SMX when 512 threads are working. But when using only 128 threads with 1 TB per SMX, the bandwidth drops to about 200 GB/s. Thus, if we reformulate our performance analysis based on these details, we can display the upper bound corresponding to each ib. With this, we realized in advance that ib = 8 or ib = 10 would be among the optimal configurations. This shows that our model can also serve as a basis to prune the autotuning search space. Consequently, the autotuning process is simplified considerably.
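The 32.5 KB and 840 gigaFLOP/s figures can be reproduced directly from the model (simple arithmetic, in double precision):

    \[
    (512 \times 8 + 8^2) \times 8\ \mathrm{B} \approx 32.5\ \mathrm{KB}
    \;\Rightarrow\; \text{only 1 TB fits in the 64 KB of SMX shared memory},
    \qquad
    P \le \frac{ib\,b}{4} = \frac{8 \times 420}{4} = 840\ \mathrm{gigaFLOP/s}.
    \]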

As shown in Fig. 1, when a small number of threads is used in a TB (e.g., 32 threads), it is beneficial to run more than 8 TBs per SMX in order to extract a high bandwidth. Unfortunately, to run more TBs we have to decrease the value of ib, which is more likely to decrease our roofline bound. Therefore, ib = 8 is the best scenario for the current design. For example, for n = 512, we need to allow more than 1 TB per SMX when, say, only 64 threads are working. By studying the algorithm, we found that the shared memory requirement of n x ib + ib^2 elements, where n is the size of the matrix, is the constraint that does not allow more than 1 TB per SMX. However, for the factorization of a remaining portion of size 64x64, only 64 threads are required, and in terms of memory, only 64 x ib + ib^2 elements are required, not the entire 512 x ib + ib^2. To optimize the shared memory usage, instead of allocating n x ib + ib^2 for the entire factorization, we revisited the algorithm to enable launching a kernel for every panel with the appropriate memory requirement. For example, when the remaining portion is 64x64, we need 64 x ib + ib^2 elements of shared memory space, which, for ib = 8, allows up to 14 TBs per SMX using 64 threads each. Such a configuration is bandwidth friendly, and we can expect to extract more than 550 GB/s—instead of 100 GB/s with the previous design. This improvement is beyond the insight one can gain from autotuning.

Fig. 2. Kernel design and autotuning of the Cholesky factorization.

When implementing this third design, we used the same kernel proposed above (the left-looking design of Algorithm 2), but we now use an optimal shared memory allocation at each step. Thus, we designed a fused GPU kernel that performs the four routines of one iteration of Algorithm 2. Such a design will minimize the memory traffic, increase the data reuse from shared memory, and reduce the overhead of launching multiple kernels. Our optimized and customized fused kernel performs the update (SYRK and GEMM operations) and keeps the updated panel in shared memory to be used by the factorization step. The cost of the left-looking algorithm is dominated by the update step (SYRK and GEMM). The panel C, illustrated in Fig. 3, is updated as C = C - A x B^T. In order to decrease its cost, we implemented a double-buffering scheme, as described in Algorithm 3. We prefix the data arrays with "r" to denote registers and "s" to denote shared memory. We prefetch the next block of A into a register array while a multiplication is being performed between the register array rAkk and the array sB stored in shared memory. Since the matrix B is the shaded portion of A, our kernel avoids reading it from the global memory and transposes it in situ into shared memory as sB. Once the update is finished, the factorization (POTF2 and TRSM) is performed as one operation on the panel C, held in shared memory.

Algorithm 3. The Fused Kernel Corresponding to One Iteration of Algorithm 2

rAk <- A(i:m, 0:lb);  rC <- 0;
for k = 0 to m - i Step lb do
    rAkk <- rAk;
    sB   <- rAk(i:lb, k:k+lb)        // in-place transpose
    barrier();
    rA1  <- A(i:m, k+lb:k+2lb)       // prefetching
    rC   <- rC + rAkk x sB           // multiplying
    barrier();
end
sC <- rA1 - rC;
factorize sC;
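A simplified CUDA sketch of this prefetch-while-multiply (double-buffering) pattern is given below for an update of the form C = C - A x B^T, where B is taken as the top ib rows of A. It is illustrative only, not the fused kernel itself, and it assumes one thread per row, column-major storage with leading dimension lda, blockDim.x equal to the number of rows, and a column count that is a multiple of IB.

    #define IB 8   // block size ib (fixed here for clarity)

    // One TB updates one IB-wide panel: C(tx, j) -= sum_col A(tx, col) * A(j, col), j = 0..IB-1.
    __global__ void fused_update_sketch(const double *A, double *C,
                                        int ncols, int lda, int ldc)
    {
        __shared__ double sB[IB][IB + 1];            // transposed slice of B (+1 pad avoids bank conflicts)
        double rA[IB], rAnext[IB], rC[IB] = {0.0};
        const int tx = threadIdx.x;                  // this thread owns row tx

        for (int j = 0; j < IB; j++)                 // preload the first slice of row tx
            rA[j] = A[tx + j * lda];

        for (int k = 0; k < ncols; k += IB) {
            if (tx < IB)                             // stage B(0:IB, k:k+IB) transposed into shared memory
                for (int l = 0; l < IB; l++)
                    sB[l][tx] = A[tx + (k + l) * lda];
            __syncthreads();

            if (k + IB < ncols)                      // prefetch the next slice of row tx ...
                for (int j = 0; j < IB; j++)
                    rAnext[j] = A[tx + (k + IB + j) * lda];

            for (int j = 0; j < IB; j++)             // ... while multiplying the current one
                for (int l = 0; l < IB; l++)
                    rC[j] += rA[l] * sB[l][j];
            __syncthreads();

            if (k + IB < ncols)                      // the prefetched slice becomes the current one
                for (int j = 0; j < IB; j++)
                    rA[j] = rAnext[j];
        }

        for (int j = 0; j < IB; j++)                 // apply the accumulated update to the panel C
            C[tx + j * ldc] -= rC[j];
    }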

The fused-kernel implementation achieves close to the peak bandwidth, and—according to our model—in double precision P_upper bound = ib*b/4 = 1,200 gigaFLOP/s. The performance of this implementation is depicted in Fig. 2b. Not only did the performance improve considerably, but we also achieved performance very close to the predicted theoretical peak, which highlights both the efficiency of the implementation and the accuracy of our model.

The objective of this example is to learn and understand what to expect from a design without necessarily delving into the implementation and autotuning efforts. There is also a practical lesson from this study: we now know that autotuning strategies can only help one get close to the theoretical peak of the algorithm being designed. To achieve reasonable performance, one should investigate the design that has the highest theoretical bound before investing in autotuning. This should reduce the effort required of researchers/developers to exploit modern GPUs at full efficiency and provide an accurate performance spectrum.

4.4 Analysis and Design of Batched LU Factorization of Very Small-Size Matrices

In this section, we analyze another example of kernel design for batched computations of very small matrices, ranging in size from 2x2 to 32x32, using the LU factorization as the example. The analysis performed in this section is similar to the experiments we defined for Cholesky. Since the target sizes are very small (less than 32x32), it is beneficial to keep the whole matrix in shared memory or in registers.

To accurately estimate the number of TBs that we should run simultaneously, we need a reliable estimate of the amount of shared memory required. This is also true for the number of registers. To ensure good performance, only a limited number of registers can be allocated per thread. In fact, beyond a certain bound on the number of registers per thread, the number of TBs that can run simultaneously will be reduced to only a few, which is detrimental to the bandwidth. To start our performance analysis, we first evaluated the amount of shared memory required in cases where the shared memory implementation is beneficial. We did the same study for the register version (i.e., where the whole matrix is stored in registers). The left y-axis of Fig. 4 shows the amount of shared memory (in KB) required by the shared memory design (SH-MEM). The right y-axis of Fig. 4 shows the number of registers per thread required by the register design (REG-v1).
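As an illustration of how such an estimate is obtained (using only the 64 KB of shared memory per SMX quoted earlier), holding one full m x m double-precision matrix per TB costs 8m^2 bytes, so the shared-memory constraint alone allows at most

    \[
    \left\lfloor \frac{64\ \mathrm{KB}}{8\,m^2\ \mathrm{B}} \right\rfloor \ \mathrm{TBs/SMX},
    \qquad \text{e.g., } \left\lfloor \frac{65{,}536}{8 \times 32^2} \right\rfloor = 8 \ \text{TBs/SMX for } m = 32,
    \]

before the hardware limits on resident TBs and threads are even considered.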

Using the data from Fig. 4, we try to predict the performance bound of each design (SH-MEM and REG-v1) in order to select the most suitable candidate.

Fig. 3. Left-looking Cholesky factorization.

Fig. 4. The amount of shared memory (in KB) required by the SH-MEM design (left y-axis) and the number of registers per thread required by the REG-v1 design (right y-axis) for the LU factorization in double precision, for matrices ranging from 2x2 to 32x32, on an NVIDIA P100 GPU.

Based on the design options and hardware constraints, the optimal number of TBs per SMX can be approximated. For the SH-MEM case, the estimated optimal ratio of TBs per SMX is illustrated in Fig. 5 (orange curve). With respect to hardware constraints, the maximum number of TBs per SMX is 32 (depicted in grey). Consequently, any efficient implementation of the SH-MEM design would need to set the TBs-per-SMX ratio based on the minimum of the hardware constraint (i.e., 32) and the estimate obtained from the algorithm analysis. The SH-MEM design was implemented, with the results illustrated by the blue curve of Fig. 5, where the executed number of TBs per SMX was measured using the NVIDIA profiler tools. The effectiveness of our design is illustrated by the fact that the measured data matches the model estimations. However, our objective is not limited to determining the optimal number of TBs per SMX, but rather to deduce the possible performance peak based on the number of TBs per SMX and to figure out the best implementation to optimize (i.e., SH-MEM versus REG-v1). To this end, we did the same study for the REG-v1 design and present in Fig. 6 the optimal number of TBs per SMX (orange curve) computed from the data in Fig. 4. Similarly, we also measured the executed TBs per SMX during a real run of the code (blue curve) to assess our prediction. A comparison between Figs. 5 and 6 reveals that both versions allow and use the same number of TBs per SMX for matrices up to 20x20, above which the REG-v1 design allows a higher number of TBs per SMX. Consequently, for matrices ranging from 2x2 to 20x20, the same performance can be expected from both designs. However, for matrices beyond 20x20, the REG-v1 design should be preferred.

Fig. 7 shows the performance obtained (in gigaFLOP/s) using the two designs. This experimental result is consistent with our analysis (i.e., for matrices ranging from 2x2 to 20x20, the SH-MEM design and the REG-v1 design exhibit similar performance). Also, as expected, for matrices larger than 20x20, the REG-v1 design outperformed the SH-MEM design. Such consistency should be of great interest to the community, and the proposed model is an excellent tool for understanding and analyzing the design of algorithms prior to implementation. The model also allows us to understand our code and lets us find the correct optimization path in order to achieve performance close to the theoretical upper bound.

A further comparison of the data from Figs. 5 and 6 reveals that the hardware constraint on the maximal number of TBs per SMX was reached for matrices smaller than 12x12. Following up, we focused on REG-v1, because it provided better performance. The hard constraint we faced for the optimization was that we could not go beyond 32 TBs per SMX. However, for matrices smaller than 12x12, the TBs used fewer than 32 threads; consequently, they were using a sub-optimal amount of bandwidth. A higher bandwidth could be achieved by increasing the number of working threads. This was possible by revisiting our design and allocating a 2-D grid of threads, where each 1-D slice operated on an independent matrix. This workaround allowed us to reach the maximal ratio of TBs per SMX. Now, one TB operates on Dim.y matrices. For example, we could assign 128 threads (32 x 4) to a TB and make it operate on four independent matrices. This configuration increased the total number of registers used per SMX (4x in this example).
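A minimal sketch of this configuration (illustrative only; the kernel and array names are assumptions) assigns each threadIdx.y slice of a 32x4 thread block to a different matrix of the batch:

    #include <cuda_runtime.h>

    // Device side: each 1-D slice (fixed threadIdx.y) of a 32x4 TB factorizes its own matrix.
    __global__ void batched_getrf_2d_sketch(double **dA_array, int m, int lda, int batchCount)
    {
        int matrix_id = blockIdx.x * blockDim.y + threadIdx.y;
        if (matrix_id >= batchCount) return;
        double *A  = dA_array[matrix_id];    // the matrix owned by this 32-thread slice
        int     tx = threadIdx.x;            // one thread per row, as in the 1-D design
        // ... register-based LU factorization of the m x m matrix goes here ...
        (void)A; (void)tx; (void)lda;
    }

    // Host side: 32 threads per matrix in x, 4 matrices per TB in y.
    void launch_batched_getrf_2d(double **dA_array, int m, int lda, int batchCount)
    {
        dim3 threads(32, 4);
        dim3 grid((batchCount + threads.y - 1) / threads.y);
        batched_getrf_2d_sketch<<<grid, threads>>>(dA_array, m, lda, batchCount);
    }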

Using the data from Fig. 4, we computed the resulting number of TBs per SMX, as illustrated in Fig. 6 (purple curve). The information depicted in Figs. 1 and 6 has been decisive in evaluating the optimal configuration of 2-D threads for matrices smaller than 16x16. With this in mind, we can say confidently that the 32x4 thread configuration would help achieve even better performance. However, for matrices larger than 16x16, the new version of the register design (REG-v2) is likely to run fewer than 6 TBs per SMX, leading to an achievable bandwidth below that of 10 TBs per SMX of 32 threads each; consequently, the performance will be lower when compared to REG-v1. This is not surprising, since REG-v2 is designed specifically for matrices smaller than 16x16.

Fig. 5. Estimated versus measured number of TBs per SMX for the SH-MEM design of the LU factorization on an NVIDIA P100.

Fig. 6. Estimated versus measured number of TBs per SMX for the REG-v1 design of the LU factorization on an NVIDIA P100.

Fig. 7. Performance comparison of SH-MEM versus REG-v1 for the LU factorization on an NVIDIA P100.

In Fig. 8, we show the previous register design (REG-v1) and the latest register design (REG-v3), where we used the 2-D grid for matrices smaller than 16x16 and kept the REG-v1 design for matrices larger than 16x16. We also compared our implementation to the cuBLAS 8.0 batched LU solver. The first observation is that the results match the expectations of our analysis. The second point is related to the efficiency of our implementation, since it outperforms the cuBLAS batched LU solver by about 3x on 32x32 matrices.

This paper does not aim to provide details on the advancement of the LU factorization algorithm. However, we would like to mention that it is possible to improve the performance by delaying the algorithm's swapping process, as reported in [40]. This optimization does not affect the number of registers or shared memory required and can improve performance by 10 percent (purple curve of Fig. 8).
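To give a flavor of what delayed swapping means, the serial sketch below is our own illustration of the general idea (it is not the GPU kernel from [40]): pivot rows are only recorded during the elimination, and the row interchanges are applied in a single gather at the end, so the elimination loop itself performs no extra data movement.

```cuda
// Our own serial sketch of LU with deferred row swaps, in the spirit of the
// "delayed swapping" optimization cited from [40]. Rows are never exchanged
// during elimination; perm[] records the gathered row order (not LAPACK-style
// pairwise swaps) and is applied once at the end.
#include <cmath>
#include <cstring>
#include <vector>

void lu_deferred_swaps(double *A, int n, int *perm)   // A is n x n, row-major
{
    std::vector<bool> used(n, false);
    for (int k = 0; k < n; ++k) {
        // Partial pivoting restricted to rows not yet chosen as pivots.
        int piv = -1;
        for (int i = 0; i < n; ++i)
            if (!used[i] && (piv < 0 || std::fabs(A[i*n+k]) > std::fabs(A[piv*n+k])))
                piv = i;
        perm[k] = piv;
        used[piv] = true;
        // Eliminate column k from the remaining active rows, storing multipliers.
        for (int i = 0; i < n; ++i) {
            if (used[i]) continue;
            double m = A[i*n+k] / A[piv*n+k];
            A[i*n+k] = m;
            for (int j = k + 1; j < n; ++j)
                A[i*n+j] -= m * A[piv*n+j];
        }
    }
    // Deferred "swap": gather the rows into pivot order in one pass.
    std::vector<double> tmp(A, A + n*n);
    for (int k = 0; k < n; ++k)
        std::memcpy(&A[k*n], &tmp[perm[k]*n], n * sizeof(double));
}
```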

Finally, after all possible algorithm analysis and optimization, we found it reasonable to apply autotuning strategies. The autotuning experiments showed that an improvement of only about 5 percent could be obtained on top of our algorithm analysis.

    4.5 The Interleaved Data Layout

We also investigated alternatives to conventional data storage to assess possible improvements. Based on the analysis outlined in this work, we believe that there may be a little performance gain for matrices smaller than 20 × 20.

When the matrices are very small, it becomes more challenging to have a coalesced memory read, and as the dimensions of the matrices become smaller, it eventually becomes impossible to have any coalesced reads at all for matrices smaller than 16 × 16. The easiest and most basic way to solve this problem is to reorder the dimensions in an interleaved fashion (e.g., the first element of Matrix1 will be followed by the first element of Matrix2 and so on, eventually moving through all elements of every matrix). In this case, one warp reads 32 elements with the same row and column index in 32 consecutive matrices. It is true that the data is now 128-byte aligned, but this kind of storage will not allow for an efficient implementation on a GPU for matrices larger than 16 × 16 in double precision and for matrices larger than 32 × 32 in single precision. The reason being: 32 threads will be working on 32 different matrices, thereby making it impossible to hold the data in shared memory or in the registers. On the P100, for example, this will limit the amount of shared memory available for the panel of the Cholesky factorization to 2 KB.
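As an illustration of this storage scheme (our own example, with hypothetical names), element (i, j) of matrix k can be placed at offset (j*n + i)*batchCount + k, so that the 32 threads of a warp, each working on one of 32 consecutive matrices, read 32 consecutive elements for the same (i, j) position:

```cuda
// Sketch of the interleaved batch layout: element (i, j) of matrix k lives at
// A_int[(j * n + i) * batchCount + k]. With one matrix per thread, the threads
// of a warp touch 32 consecutive doubles for each (i, j), so every read is
// coalesced. Names and the toy reduction are illustrative only.
__global__ void sum_interleaved(const double *A_int, double *out,
                                int n, int batchCount)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // one matrix per thread
    if (k >= batchCount) return;

    double acc = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            acc += A_int[(j * n + i) * batchCount + k];  // coalesced across the warp
    out[k] = acc;
}
```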

This results in bad performance except for the sizes mentioned above, where the performance obtained from such a design was close to that obtained with the standard design. This kind of design might be interesting for a GEMV or TRSV type of operation, where the matrix is read only once. Recent studies on optimized batched BLAS kernels designed for multicore architectures have shown promising results over the classical approach of solving one problem per core at a time [41], [42].

    5 CONCLUSION AND FUTURE REMARKS

This paper presented a model and an analysis of how to design GPU kernels for very small matrix computations. We provided a detailed study of the optimization process, and we also demonstrated how a detailed performance analysis can considerably reduce the developer effort and man hours required to design an efficient GPU kernel. The proposed work will also simplify the autotuning process, where, instead of generating, running, and analyzing tens of thousands of configurations, one can dramatically decrease this number to a small subset. We took a Cholesky factorization and an LU factorization as case studies and showed how we were able to reach the theoretical peak performance of these kernels. The model is designed specifically for very small matrices, where the computation is memory-bound, and high performance can be achieved through a set of optimizations different from those used in more standard large matrix computations. Future directions include studying other algorithms of interest to the scientific community and discovering more detailed optimization techniques in the area of deep learning, where little research has been conducted from this perspective.

    ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation under Grants No. CSR 1514286 and No. OAC 1740250, NVIDIA, and the Department of Energy.

REFERENCES

[1] S. Chetlur, et al., "cuDNN: Efficient primitives for deep learning," CoRR, vol. abs/1410.0759, 2014. [Online]. Available: http://arxiv.org/abs/1410.0759

[2] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, "Tensors for data mining and data fusion: Models, applications, and scalable algorithms," ACM Trans. Intell. Syst. Technol., vol. 8, no. 2, pp. 16:1–16:44, Oct. 2016. [Online]. Available: http://doi.acm.org/10.1145/2915921

[3] O. Messer, J. Harris, S. Parete-Koon, and M. Chertkow, "Multicore and accelerator development for a leadership-class stellar astrophysics code," in Proc. 11th Int. Conf. Appl. Parallel Sci. Comput.: State-of-the-Art Scientific Parallel Comput., 2012, pp. 92–106.

[4] J. M. Molero, E. M. Garzón, I. García, E. S. Quintana-Ortí, and A. Plaza, "Efficient implementation of hyperspectral anomaly detection techniques on GPUs and multicore processors," IEEE J. Selected Topics Applied Earth Observations Remote Sensing, vol. 7, no. 6, pp. 2256–2266, Jun. 2014, doi: 10.1109/JSTARS.2014.2328614.

[5] M. Anderson, D. Sheffield, and K. Keutzer, "A predictive model for solving small linear algebra problems in GPU registers," in Proc. IEEE 26th Int. Parallel Distrib. Process. Symp., 2012, pp. 2–13.

[6] T. Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, and J. Dongarra, "A step towards energy efficient computing: Redesigning a hydrodynamic application on CPU-GPU," in Proc. IEEE 28th Int. Parallel Distrib. Process. Symp., 2014, pp. 972–981.

Fig. 8. Performance comparison of the two register designs of the LU factorization on an NVIDIA P100.



[7] A. A. Auer, et al., "Automatic code generation for many-body electronic structure methods: The tensor contraction engine," Molecular Phys., vol. 104, no. 2, pp. 211–228, 2006.

[8] S. N. Yeralan, T. A. Davis, and S. Ranka, "Sparse multifrontal QR on the GPU," University of Florida Technical Report, Gainesville, FL, USA, 2013. [Online]. Available: http://faculty.cse.tamu.edu/davis/publications_files/qrgpu_paper.pdf

[9] H. Shan, S. Williams, W. de Jong, and L. Oliker, "Thread-level parallelization and optimization of NWChem for the Intel MIC architecture," in Proc. 6th Int. Workshop Program. Models Appl. Multicores Manycores, 2015, pp. 58–67. [Online]. Available: http://doi.acm.org/10.1145/2712386.2712391

[10] J. R. Hammond, S. Krishnamoorthy, S. Shende, N. A. Romero, and A. D. Malony, "Performance characterization of global address space applications: A case study with NWChem," Concurrency Comput.: Practice Experience, vol. 24, no. 2, pp. 135–154, 2012. [Online]. Available: http://dx.doi.org/10.1002/cpe.1881

[11] M. J. C. S. M. Mniszewski, C. F. A. Negre, and A. M. N. Niklasson, "Distributed graph-based density matrix calculation for quantum molecular dynamics using GPUs," 2016. [Online]. Available: http://on-demand.gputechconf.com/gtc/2016/presentation/s6481-susan-mniszewski-graph-based-density-matrix-calculation-quantum-molecular-dynamics-using-gpus.pdf

[12] A. M. N. Niklasson, et al., "Graph-based linear scaling electronic structure theory," J. Chemical Phys., vol. 144, no. 23, 2016, Art. no. 234101. [Online]. Available: http://dx.doi.org/10.1063/1.4952650

[13] S. Chetlur, et al., "cuDNN: Efficient primitives for deep learning," CoRR, vol. abs/1410.0759, 2014. [Online]. Available: http://arxiv.org/abs/1410.0759

[14] Going beyond full utilization: The inside scoop on Nervana's Winograd kernels, 2016. [Online]. Available: https://www.nervanasys.com/winograd-2/

[15] J. C. Sweet, R. J. Nowling, T. Cickovski, C. R. Sweet, V. S. Pande, and J. A. Izaguirre, "Long timestep molecular dynamics on the graphical processing unit," J. Chemical Theory Comput., vol. 9, no. 8, pp. 3267–3281, 2013. [Online]. Available: http://dx.doi.org/10.1021/ct400331r

[16] S. Páll, E. Lindahl, and B. Hess, "Efficient multi-level heterogeneous parallelization of molecular dynamics in GROMACS," 2015. [Online]. Available: http://www.cyi.ac.cy/cscpostersession/Pall.pdf

[17] S. Páll, M. J. Abraham, C. Kutzner, B. Hess, and E. Lindahl, "Tackling exascale software challenges in molecular dynamics simulations with GROMACS," CoRR, vol. abs/1506.00716, 2015. [Online]. Available: http://arxiv.org/abs/1506.00716

[18] J. H. Chen, et al., "Terascale direct numerical simulations of turbulent combustion using S3D," Comput. Sci. Disc., vol. 2, pp. 1–31, 2009.

[19] M. W. Guidry, J. J. Billings, and W. R. Hix, "Explicit integration of extremely stiff reaction networks: Partial equilibrium methods," Comput. Sci. Discovery, vol. 6, no. 1, 2013, Art. no. 015003. [Online]. Available: http://stacks.iop.org/1749-4699/6/i=1/a=015003

[20] B. Brock, A. Belt, J. J. Billings, and M. Guidry, "Explicit integration with GPU acceleration for large kinetic networks," Jan. 2015. [Online]. Available: http://arxiv.org/abs/1409.5826

[21] W. Raphael and Friedrich-Karl, "Silicon burning II: Quasi-equilibrium and explosive burning," ApJ, vol. 511, pp. 862–875, Feb. 1999.

[22] S. Tomov, J. Dongarra, and M. Baboulin, "Towards dense linear algebra for hybrid GPU accelerated manycore systems," Parallel Comput. Syst. Appl., vol. 36, no. 5–6, pp. 232–240, 2010.

[23] T. Dong, A. Haidar, S. Tomov, and J. Dongarra, "A fast batched Cholesky factorization on a GPU," in Proc. 2014 Int. Conf. Parallel Process., Sep. 2014, pp. 432–440.

[24] T. Dong, A. Haidar, P. Luszczek, A. Harris, S. Tomov, and J. Dongarra, "LU factorization of small matrices: Accelerating batched DGETRF on the GPU," in Proc. 16th IEEE Int. Conf. High Perform. Commun., Aug. 2014, pp. 157–160.

[25] A. Haidar, P. Luszczek, S. Tomov, and J. Dongarra, "Towards batched linear solvers on accelerated hardware platforms," in Proc. 20th ACM SIGPLAN Symp. Principles Practice Parallel Program., 2015, pp. 261–262.

[26] CUBLAS 6.5, Jan. 2015. [Online]. Available: http://docs.nvidia.com/cuda/cublas/

[27] J. Dongarra, et al., "A proposed API for batched basic linear algebra subprograms," Manchester Institute for Mathematical Sciences, The University of Manchester, Manchester, U.K., Apr. 2016. [Online]. Available: http://eprints.ma.man.ac.uk/2464/

[28] Intel Math Kernel Library, 2014. [Online]. Available: http://software.intel.com/intel-mkl/

[29] ACML - AMD Core Math Library, 2014. [Online]. Available: http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml

[30] Intel Pentium III Processor - Small Matrix Library, 1999. [Online]. Available: http://www.intel.com/design/pentiumiii/sml/

[31] S. Tomov, R. Nath, and J. Dongarra, "Dense linear algebra solvers for multicore with GPU accelerators," in Proc. IEEE Int. Symp. Parallel Distrib. Process. Workshops Phd Forum, 2014. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2352711015000059

[32] V. Volkov and J. W. Demmel, "LU, QR and Cholesky factorizations using vector capabilities of GPUs," University of California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2008-49, May 13, 2008.

[33] E. Agullo, et al., "Faster, cheaper, better – a hybridization methodology to develop linear algebra software for GPUs," in GPU Computing Gems. Burlington, MA, USA: Morgan Kaufmann, Sep. 2010, vol. 2.

[34] J. Dongarra, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and A. YarKhan, "Model-driven one-sided factorizations on multicore accelerated systems," Int. J. Supercomputing Frontiers Innovations, vol. 1, no. 1, pp. 85–115, Jun. 2014.

[35] V. Oreste, M. Fatica, N. A. Gawande, and A. Tumeo, "Power/performance trade-offs of small batched LU based solvers on GPUs," in Proc. 19th Int. Conf. Parallel Process., Aug. 2013, vol. 8097, pp. 813–825.

[36] V. Oreste, N. A. Gawande, and A. Tumeo, "Accelerating subsurface transport simulation on heterogeneous clusters," in Proc. IEEE Int. Conf. Cluster Comput., Sep. 2013, pp. 1–8.

[37] I. Wainwright, "Optimized LU-decomposition with full pivot for small batched matrices," GTC'13 – ID S3069, Apr. 2013. [Online]. Available: http://on-demand.gputechconf.com/gtc/2013/presentations/S3069-LU-Decomposition-Small-Batched-Matrices.pdf

[38] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009. [Online]. Available: http://doi.acm.org/10.1145/1498765.1498785

[39] H. Ltaief, S. Tomov, R. Nath, and J. Dongarra, "Hybrid multicore Cholesky factorization with multiple GPU accelerators."

[40] A. Abdelfattah, A. Haidar, S. Tomov, and J. Dongarra, "Factorization and inversion of a million matrices using GPUs: Challenges and countermeasures," Procedia Comput. Sci., vol. 108, pp. 606–615, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050917308785

[41] J. Dongarra, S. Hammarling, N. J. Higham, S. D. Relton, P. Valero-Lara, and M. Zounon, "The design and performance of batched BLAS on modern high-performance computing systems," Procedia Comput. Sci., vol. 108, pp. 495–504, 2017.

[42] J. J. Dongarra, S. Hammarling, N. J. Higham, S. D. Relton, and M. Zounon, "Optimized batched linear algebra for modern architectures," in Proc. 23rd Int. Conf. Parallel Distrib. Comput., Santiago de Compostela, 2017, pp. 511–522. [Online]. Available: https://doi.org/10.1007/978-3-319-64203-1_37

Azzam Haidar holds a research scientist position in the Innovative Computing Laboratory, the University of Tennessee. His research revolves around parallel linear algebra for scalable distributed heterogeneous architectures such as multicore CPUs and accelerators (Intel Xeon Phi, NVIDIA and AMD GPUs). His goal is to create software that simplifies the development of applications that achieve high performance and portability. Such programming models rely on asynchronous and out-of-order scheduling of operations. These concepts are the basis for scalable and efficient software for computational linear algebra and applications. Other research interests include the development and implementation of numerical algorithms and software for large-scale parallel sparse problems, in order to develop hybrid approaches combining direct and iterative algorithms to solve systems of linear algebraic equations with large sparse matrices. Contact him at [email protected]



Ahmad Abdelfattah received the BSc and MSc degrees in computer engineering from Ain Shams University, Egypt, and the PhD degree in computer science from the King Abdullah University of Science and Technology (KAUST), in 2015, where he was a member of the Extreme Computing Research Center (ECRC). He is currently a research scientist in the Innovative Computing Laboratory, the University of Tennessee. He works on optimization techniques for many dense linear algebra algorithms at different scales. Contact him at [email protected]

Mawussi Zounon received the PhD degree in computer science and applied mathematics from the University of Bordeaux, for his contribution to numerical fault-tolerant strategies for large sparse linear algebra solvers with a special focus on Krylov subspace methods. He is a research associate in the Numerical Linear Algebra group, the University of Manchester. His research interests include parallel algorithms, numerical algorithms in linear algebra, computer architectures, and fault tolerance. Contact him at [email protected]

Stanimire Tomov received the MS degree in computer science from Sofia University, Bulgaria, and the PhD degree in mathematics from Texas A&M University. He is a research director in ICL and research assistant professor in Electrical Engineering and Computer Science, University of Tennessee. His research interests include parallel algorithms, numerical analysis, and high-performance scientific computing (HPC). Currently, his work is concentrated on the development of numerical linear algebra software, and in particular MAGMA, for emerging architectures for HPC. Contact him at [email protected]

Jack Dongarra holds an appointment with the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced computer architectures, programming methodology, and tools for parallel computers. He was awarded the IEEE Sid Fernbach Award in 2004; in 2008 he was the recipient of the first IEEE Medal of Excellence in Scalable Computing; in 2010 he was the first recipient of the SIAM Special Interest Group on Supercomputing's award for Career Achievement; in 2011 he was the recipient of the IEEE Charles Babbage Award; and in 2013 he received the ACM/IEEE Ken Kennedy Award. He is a fellow of the AAAS, ACM, IEEE, and SIAM, a foreign member of the Russian Academy of Science, and a member of the US National Academy of Engineering.

    " For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

