
Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs


Dominik Ernst1, Georg Hager1, Jonas Thies2, and Gerhard Wellein1

Abstract
General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. NVIDIA's current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate different approaches of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach enables a simultaneously flexible and specialized implementation with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest on an NVIDIA Volta GPGPU.

Keywords
performance engineering, complex, tall & skinny, matrix multiplication, CUDA, GPU

Introduction

Tall & Skinny Matrix MultiplicationsThe general matrix-matrix multiplication (GEMM) isan essential linear algebra operation used in manynumerical algorithms and hardware vendors usually supplyan implementation that is perfectly optimized for theirhardware. In case of NVIDIA, this is part of CUBLAS(NVIDIA (2019a)). However, since these implementationsare focused on mostly square matrices, they often performpoorly for matrices with unusual shapes.

This paper covers two types of matrix multiplications with tall & skinny matrices, i.e., matrices that are much taller than they are wide. We define skinny as having in the range of [1, 64] columns, and tall as having more than 10^6 rows. Both types of multiplications involve the two tall & skinny matrices A and B, with sizes K×M and K×N, respectively, and K being the long dimension. The small dimensions M and N form a small matrix C with size M×N.

The two variants are shown in Figures 1 and 2: the Tall & Skinny Matrix Transposed times Tall & Skinny Matrix (TSMTTSM) multiplication A^T B = C and the Tall & Skinny Matrix times Matrix (TSMM) multiplication AC = B.

We are interested in a highly efficient implementation of these operations using double precision real and complex data types on the NVIDIA Volta GPGPU, used nowadays in many HPC systems.

Application
Row-major tall & skinny matrices are the result of combining several vectors to block vectors. Block vector algorithms are linear algebra algorithms that compute on multiple vectors simultaneously for improved performance. For instance, by combining multiple, consecutive sparse matrix-vector (SpMV) multiplications to a sparse matrix-multiple-vector

Figure 1. The TSMTTSM operation A^T B = C with A and B being tall & skinny matrices. Note that A is transposed in the illustration.

Figure 2. The TSMM operation AC = B with A and B being tall & skinny matrices.

(SpMMV) multiplication, the matrix entries are loaded only once and used for the multiple vectors, which reduces

1 Erlangen Regional Computing Center (RRZE), 91058 Erlangen, Germany
2 German Aerospace Center (DLR), Simulation and Software Technology

Corresponding author:
Dominik Ernst, Erlangen Regional Computing Center, Martensstraße 1, 91058 Erlangen, Germany
Email: [email protected]


the overall memory traffic and consequently increases the performance of this memory-bound operation. This has first been analytically shown in Gropp et al. (1999) and is used in many applications; see, e.g., Rohrig-Zollner et al. (2015); Kreutzer et al. (2018).

The simultaneous computation on multiple vectors can also be used to gain numerical advantages. This has been shown for block vector versions of the Lanczos algorithm (see Cullum and Donath (1974)), of the biconjugate gradient algorithm (see O'Leary (1980)), and of the Jacobi-Davidson method (see Rohrig-Zollner et al. (2015)), each of which uses block vectors to compute multiple eigenvectors simultaneously. Many such algorithms require multiplications of block vectors. For example, both the TSMTTSM (A^T B) and TSMM (AC) occur in classical Gram-Schmidt orthogonalization of a number of vectors represented by B against an orthogonal basis A.

Roofline Model
We use the roofline model by Williams et al. (2009) to obtain an upper limit for the performance of these kernels. In all cases, each of the three matrices has to be transferred between the memory and the chip at least once. Even though the directions of data transfers differ between the kernels, the total data volume does not, as GPUs generally do not need a write-allocate transfer. Therefore the computational intensity is the same for both kernels if M and N are the same. 2MNK floating-point operations are performed in a matrix-matrix multiplication, so for double precision the arithmetic intensity, assuming K ≫ M,N and M = N, is

\[
I_D = \frac{2MNK}{(MK + NK + MN)\times 8}\,\frac{\text{flop}}{\text{byte}}
\;\overset{K\gg M,N}{\approx}\; \frac{2MN}{(M+N)\times 8}\,\frac{\text{flop}}{\text{byte}}
\;\overset{M=N}{=}\; \frac{M}{8}\,\frac{\text{flop}}{\text{byte}}. \qquad (1)
\]

In this symmetric case, the arithmetic intensity grows linearly with M. We will show measurements only for this symmetric case, although the nonsymmetric case is not fundamentally different, with the intensity being proportional to the harmonic mean of both dimensions and consequently dominated by the smaller number. If the achievable memory bandwidth is b_s (see below), the model predicts P_max = min(I × b_s, P_peak) as an absolute upper performance limit. In the case of complex numbers, the data volume increases by 2× and the number of floating-point operations by 4×, resulting in a doubled arithmetic intensity I_Z = M/4 flop/byte.

With proper loop optimizations in place, the GEMM is usually considered a classic example of a compute-bound problem with high arithmetic intensity. However, at M,N = 1, the arithmetic intensity of 1/8 flop/byte is far to the left of the roofline knee of modern compute devices (typical values ranging from 5 flop/byte to 17 flop/byte), and the operation is strongly memory bound. This is not surprising given that a matrix multiplication with M,N = 1 is the same as a scalar product. At the other end of the considered spectrum, at M,N = 64, the arithmetic intensity is 8 flop/byte, which is close to the roofline knee of a V100 GPU (see

Figure 3. Percentage of roofline-predicted performance achieved by CUBLAS for the TSMTTSM kernel in the range M = N ∈ [1, 64], complex (Z) and real (D) double precision, on a Tesla V100-PCIe-16GB.

Figure 4. Percentage of roofline-predicted performance achieved by CUBLAS for the TSMM kernel in the range M = N ∈ [1, 64], complex (Z) and real (D) double precision, on a Tesla V100-PCIe-16GB.

below). Therefore the performance character of the operation changes from extremely memory bound at M,N = 1 to simultaneously memory and compute bound at M,N = 64. An implementation with perfect performance thus needs to fully utilize the memory bandwidth at all sizes and additionally reach peak floating point performance for the large sizes. The very different performance characteristics make it hard to write an optimal implementation for both ends of the spectrum, i.e., different optimizations and specializations are required for both cases.
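For concreteness (our own arithmetic based on Eq. (1) and the measured read-only memory bandwidth of about 880 Gbyte/s reported in the Hardware section below), the roofline limits at the two ends of the considered range are

\[
P_{\max}(M{=}N{=}1) = \tfrac{1}{8}\,\tfrac{\text{flop}}{\text{byte}} \times 880\,\tfrac{\text{Gbyte}}{\text{s}} = 110\ \text{Gflop/s},
\qquad
P_{\max}(M{=}N{=}64) = 8\,\tfrac{\text{flop}}{\text{byte}} \times 880\,\tfrac{\text{Gbyte}}{\text{s}} \approx 7\ \text{Tflop/s} \approx P_{\text{peak}}.
\]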

It is possible to judge the quality of an implementation's performance as the percentage of the roofline limit. This metric is shown for CUBLAS in Figures 3 and 4, where the ratio of measured and roofline performance is plotted as a function of the matrix width. There is very little performance improvement headroom for CUBLAS' TSMM implementation for real-valued matrices, but there is some opportunity for complex matrices. For the TSMTTSM kernel, there is a 2× to 50× gap to the upper limit, apart from M,N = 1, where NVIDIA obviously implemented a special case handling. Similarly to the BLAS nomenclature, we use the shorthand "D" for double precision real values and "Z" for double precision complex values.


Contribution
This paper presents the necessary implementation techniques to achieve near-perfect performance for two tall & skinny matrix-matrix multiplication variants on an NVIDIA V100 GPGPU with real- and complex-valued matrices.

To this end, two parallel reduction schemes are implemented and analyzed as to their suitability for small matrices.

A code generator is implemented that produces code for specific matrix sizes and tunes many configuration options specifically to that size. This allows exploiting regularity where the size parameters allow it, while still generating the least possible overhead where they do not. As a result, our implementation outperforms state-of-the-art vendor implementations for most of the parameter range.

Related Work
This work is an extended version of Ernst et al. (2019). In comparison to that paper, we have added a different variant of matrix-matrix multiplication (TSMM), added a more in-depth performance analysis, extended the analysis to double precision complex data types, and examined a new TSMTTSM thread mapping scheme.

CUBLAS is NVIDIA's BLAS implementation. The GEMM function interface in BLAS only accepts column-major matrices. Treating the matrices as transposed column-major matrices and executing AB^T for the TSMTTSM operation and CA for TSMM are equivalent operations.

CUTLASS (NVIDIA (2019b)) is a collection of primitives for multiplications, especially of small matrices, which can be composed in different ways to form products of larger matrices. One of these is the splitK kernel, which additionally parallelizes the inner summation of the matrix multiplication for increased parallelism in the TSMTTSM kernel. We adapted the "06_splitK_gemm" code sample from the library for benchmarking.

Hardware
In this work we use NVIDIA's V100-PCIe-16GB GPGPU (Volta architecture) with CUDA 10.0. The hardware data was collected with our own CUDA micro benchmarks, which are available at Ernst (2019) together with more detailed data.

Memory Bandwidth. Whereas the TSMM operation has a read and a write stream and fits well to the "scale" kernel from the STREAM benchmarks (McCalpin (1995)), the TSMTTSM is read-only. We thus use a thread-local sum reduction to estimate the achievable memory bandwidth b_s (see Table 1). Read-only access has a much higher maximum ceiling of about 880 Gbyte/s, compared to 820 Gbyte/s for a "scale" kernel. Maximum bandwidth is only attainable with sufficient parallelism, either through high occupancy or instruction-level parallelism (ILP) in the form of multiple read streams, achieved here through unrolling.
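The following is a minimal sketch of how such a read-only bandwidth measurement with adjustable instruction-level parallelism can look. It is our own illustration, not the benchmark code from Ernst (2019); the kernel name, the ILP factor of 4, and the grid size are arbitrary choices.

#include <cstdio>
#include <cuda_runtime.h>

// Read-only streaming kernel: each thread accumulates a thread-local sum.
// The template parameter ILP keeps several independent loads in flight per
// thread, mimicking the ILP column of Table 1. The conditional store only
// exists to keep the compiler from removing the loads.
template <int ILP>
__global__ void readOnlySum(const double* __restrict__ a, double* out, size_t n) {
    size_t id     = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    double sum[ILP] = {0.0};
    for (size_t i = id; i < n; i += stride * ILP) {
#pragma unroll
        for (int u = 0; u < ILP; u++) {
            size_t idx = i + u * stride;
            if (idx < n) sum[u] += a[idx];
        }
    }
    double total = 0.0;
    for (int u = 0; u < ILP; u++) total += sum[u];
    if (total == 123.456) *out = total;  // never true for zero-filled data
}

int main() {
    size_t n = 1ull << 28;  // 2 GB of doubles
    double *a, *out;
    cudaMalloc(&a, n * sizeof(double));
    cudaMalloc(&out, sizeof(double));
    cudaMemset(a, 0, n * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    readOnlySum<4><<<80 * 32, 256>>>(a, out, n);  // vary grid size/ILP to scan Table 1
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f Gbyte/s\n", n * sizeof(double) / ms / 1e6);
    cudaFree(a);
    cudaFree(out);
    return 0;
}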

Floating-Point Throughput. The V100 can execute one 32-wide double precision (DP) floating point multiply-add (FMA) per cycle on each of its 80 streaming multiprocessors (SMs) and runs at a clock speed of 1.38 GHz for a DP peak of 80 × 32 × 2 × 1.38 Gflop/s = 7066 Gflop/s. One SM quadrant can process one instruction that is 32 warp lanes wide every four cycles at a latency of eight cycles. Full throughput can already be achieved with a single warp per quadrant if instructions are independent.

Table 1. Measured memory bandwidth (Gbyte/s) on a Tesla V100-PCIe-16GB of a read-only kernel with different amounts of load parallelism (ILP) and occupancies

occupancy           ILP=1   ILP=4   ILP=16
1 block, 4 warps      3.0    10.1    16.3
6.25%                 228     629     815
12.5%                 419     824     877
25%                   681     872     884
50%                   834     884     887
100%                  879     891     877

L1 Cache. The L1 cache is instrumental in achieving the theoretically possible arithmetic intensity. Though load and DP FMA instructions have the same throughput of 1/cy/SM, the actual L1 cache bandwidth of one 128-byte cache line per cycle means that the actual load instruction throughput depends on the number of touched cache lines. For example, a 32-wide, unit-stride DP load touches 2 cache lines and therefore takes two cycles. For that access pattern, the floating point to load instruction ratio would need to be at least 2:1 to attain peak performance.

Shared Memory. The shared memory uses the same physical structure on the chip as the L1 cache. It has the same bandwidth, but lower access latency than the L1 cache.

General Implementation Strategies

Code Generation

A single implementation cannot be suitable for all matrix sizes. In order to engineer the best code for each size, some form of meta programming is required. C++ templates allow some degree of meta programming but are limited in their expressiveness or require convoluted constructs. Usually the compiler unrolls and eliminates short loops with known iteration count in order to reduce loop overhead, combine address calculations, avoid indexed loads from arrays for the thread-local results, deduplicate and batch loads, and much more. Direct generation of the intended code offers more control, however. For example, when using a thread count per row that is not a divisor of the matrix width, some threads would need to compute fewer results than others. Guarding if statements have to be added around these computations that could exceed the matrix size. These can be omitted wherever it is safe, i.e., where all threads compute a valid value, in order to not compromise performance for evenly dividing thread mappings. We therefore use a code generating script in Python, which allows us to prototype new techniques much more quickly and with more control. Many different parameters can be configured easily and benchmarked automatically, for example whether leap frogging and unrolling (see below) are used, how the reduction is performed, and what thread mapping to set. The same reasoning for code generation is made by Herrero and Navarro (2006), where it is used to generate small matrix multiplication kernels for CPUs.


for m = 0...M:
    for n = 0...N:
        for k = 0...K:
            C[m][n] += A[k][m] * B[k][n]

Listing 1. Naive matrix-matrix multiplication (MMM) pseudo code for C = A^T B.

Figure 5. Illustration of the iteration space of the TSMTTSM operation C = A^T B.

Thread Mapping Options
The parallelization scheme, i.e., the way in which work is mapped to GPU threads, plays an important role for data flow in the memory hierarchy. The canonical formulation of an MMM is the three-level loop nest shown in Listing 1.

The iteration space of an MMM can be visualized as a cuboid spanned by the outer product of the two matrices being multiplied. For the TSMTTSM (Figure 5), the matrices A and B span the cube, and reduction along the long axis K results in the matrix C. For the TSMM (Figure 6), the cube is flipped on its side, so that the matrices A and C span the cube and a reduction along the short side M results in B.

This representation allows visualizing the locality of data transfers. Looking at a slice of the cube perpendicular to the long K axis, spanned by one row of A and B, as depicted in Figures 7–9, shows all the data uses and computations. Each such slice contains M × N cells, which correspond to one FMA each, and requires the transfer of one row each of A and B, causing a data transfer of M + N elements. The arithmetic intensity associated with the computations in one slice is the same as for the whole MMM kernel. We assume perfect caching, i.e., that A and B are transferred from memory just once and reused as many times as necessary throughout the calculation.

The fastest way to reuse values is to use a register and have the thread the register belongs to perform all required operations on this data. Data used by multiple threads can (preferably) be shared in the L1 cache for threads in the same thread block, or in the L2 cache otherwise. This works only if some spatial and temporal access locality

Figure 6. Illustration of the iteration space of the TSMM operation B = AC.

c_local[:][:] = 0
for (k = threadId; k < K; k += gridStride)
    for m = 0...M:
        for n = 0...N:
            c_local[m][n] += A[k][m] * B[k][n]
for m = 0...M:
    for n = 0...N:
        global_reduction(c_local[m][n])

Listing 2. TSMTTSM pseudo code, with the K loop parallelized as a grid stride loop.

is in place. Therefore, the mapping of cells, i.e., work, to threads determines which thread requires what data for its computations and the locality of data access.

TSMTTSM
For the TSMTTSM, the two outer loops, which are completely independent and therefore well parallelizable, are usually the target of an implementation focused on square matrices. For skinny matrices, these loops are much too short to yield enough parallelism for a GPU. In consequence, the loop over the long K dimension has to be parallelized as well, which also involves parallelizing the sum inside the loop. There are many more terms in the parallel reduction than threads, so that each thread can first serially compute a thread-local partial sum, which is afterwards reduced to a total sum, like in Listing 2. Here, a so-called grid stride loop, described in Harris (2013), is used to map rows to threads.
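A concrete CUDA rendering of Listing 2 could look as follows. This is a minimal sketch with our own naming, not the tuned generated code; the global reduction is done here with a plain atomicAdd per entry, i.e., the simple variant discussed later in the Global Reduction section.

#include <cuda_runtime.h>

// TSMTTSM C = A^T B with only the K loop parallelized (cf. Listing 2).
// A is K x M and B is K x N, both row-major; C is M x N and must be zeroed
// before the launch. M and N are small compile-time constants.
// atomicAdd on double requires compute capability >= 6.0 (Pascal or newer).
template <int M, int N>
__global__ void tsmttsmKOnly(const double* __restrict__ A,
                             const double* __restrict__ B,
                             double* C, size_t K) {
    double cLocal[M][N];
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++)
            cLocal[m][n] = 0.0;

    size_t tid        = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t gridStride = (size_t)gridDim.x * blockDim.x;

    // Grid-stride loop: each thread serially accumulates a partial sum.
    for (size_t k = tid; k < K; k += gridStride)
        for (int m = 0; m < M; m++)
            for (int n = 0; n < N; n++)
                cLocal[m][n] += A[k * M + m] * B[k * N + n];

    // Global reduction of the thread-local partial results.
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++)
            atomicAdd(&C[m * N + n], cLocal[m][n]);
}

// Example launch for M = N = 4 (device allocation and zeroing of C omitted):
// tsmttsmKOnly<4, 4><<<80 * 8, 256>>>(dA, dB, dC, K);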

For data locality, the two small loops have to be moved into the K loop. Since they are short loops with constant trip count, they can be unrolled completely, which also allows mapping the intermediates to local variables instead of indexing into a local array, like in Listing 3. Depending on whether and how the two small loops are parallelized, each thread computes only some of these MN intermediates. Figures 7 to 10 visualize this by showing a slice of the multiplication cube and which values a single thread would


for (k = threadId; k < K; k += gridStride)
    c0_0 += A[k][0] * B[k][0]
    c0_1 += A[k][0] * B[k][1]
    c1_0 += A[k][1] * B[k][0]
    c1_1 += A[k][1] * B[k][1]

Listing 3. TSMTTSM pseudo code with parallelized K loop, after unrolling the two inner loops (here shown exemplarily for M = N = 2) and mapping array entries to variables. The global reduction is omitted for brevity.

Figure 7. TSMTTSM: Parallelization over K only

Figure 8. TSMTTSM: Parallelization over the K and N loop

for (k = threadId / N; k < K; k += gridStride / N)
    n = threadId % N
    for m = 0...M:
        c[m][n] += A[k][m] * B[k][n]

Listing 4. TSMTTSM pseudo code, with the K and N loop parallelized. The global reduction is omitted.

compute. The loads that each thread has to perform are the affected values in the row of A and B, also visible in the illustrations, while each highlighted cell in the slice stands for one line in the loop body of Listing 3, which corresponds to one FMA operation and one intermediate variable.

Since the L1 cache is not able to deliver one operand per FMA instruction, a high FMA-to-load ratio is desirable. This can be achieved by maximizing the area and the "squareness" of the area that is computed by a single thread. At the same time, more intermediate results per thread increase the register count, which can limit the occupancy and eventually lead to spilling.

The approach of only parallelizing the K loop (shown in Listing 2 and Figure 7) easily achieves this goal. While it maximizes the arithmetic intensity already in the L1 cache, the MN intermediate results occupy 2MN registers, so the maximum of 256 registers per thread is already exceeded at M,N > 11, causing spilling and poor performance.

Figure 9. TSMTTSM: Parallelization over K and tiling of the two inner loops, here with tile size 2×3.

midx = (threadIdx / N) % M
nidx = threadIdx % N
for (...)
    for tm = 0...TM:
        for tn = 0...TN:
            m = midx * TM + tm
            n = nidx * TN + tn
            c[tm][tn] += A[k][m] * B[k][n]

Listing 5. TSMTTSM pseudo code, with tiled M and N loops using tile sizes TM and TN. The global reduction and the row calculation in the K loop are omitted.

Figure 10. TSMTTSM: Parallelization over K and transposed tiling of the two inner loops, here with tile size 2×3.

Parallelizing one of the inner loops as well (Listing 4) leads to the pattern shown in Figure 8. The number of registers required is only M here, so there is no spilling even at M,N = 64. However, the narrow shape results in an FMA/load ratio below 1 (i.e., a low arithmetic intensity in the L1 cache), as values from A are used just once per load.

A better approach, which combines manageable register requirements with a more square form of the tile, is to subdivide the two smaller loops into tiles (see Listing 5 and Figure 9). This mapping also allows for much more flexibility, as the tile sizes can be chosen small enough to avoid spilling or to reach a certain occupancy goal, but also large enough to create a high FMA/load ratio. Tile sizes that are not divisors of the small loop dimensions can be covered by generating guarding statements, so that tile entries that could extend beyond the border of the slice are only computed by threads whose tile index stays inside it. This is helpful for matrix dimensions that have few divisors, e.g., prime numbers.

Mapping a contiguous range of values to a thread leads to strided loads, which can be detrimental to performance. The same entry in two consecutive threads' partitions is always as far apart as the tile side length. A more advantageous, contiguous load pattern can be achieved by transposing the


threads' tiles, as shown in Figure 10. The corresponding values in a tile are now consecutive.
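The difference between the two mappings boils down to index arithmetic. The following sketch uses our own notation (threadsPerRowN = N/TN threads sharing the N direction of one slice) and contrasts the column index a thread t uses for tile entry tn in the normal and in the transposed tiling; it is an illustration, not the generated code.

#include <cuda_runtime.h>

// Normal tiling: thread t owns the contiguous columns t*TN ... t*TN + TN-1,
// so for a fixed tile entry tn, neighboring threads load addresses that are
// TN elements apart (strided).
__device__ int columnNormal(int t, int tn, int TN) {
    return t * TN + tn;
}

// Transposed tiling (Figure 10): thread t owns the interleaved columns
// t, t + threadsPerRowN, t + 2*threadsPerRowN, ..., so for a fixed tile
// entry tn, neighboring threads load consecutive addresses and the loads
// coalesce into few cache lines.
__device__ int columnTransposed(int t, int tn, int threadsPerRowN) {
    return tn * threadsPerRowN + t;
}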

Leap Frogging
On NVIDIA's GPU architectures, load operations can overlap with each other. The execution will only stall at an instruction that requires an operand from an outstanding load. The compiler maximizes this overlap by moving all loads to the beginning of the loop body, followed by the floating-point (FP) instructions that consume the loaded values. Usually at least one or two of the loads come from memory and thus take longer to complete than other queued loads, so that execution stalls at the first FP instruction. A way to circumvent this stall is to load the inputs one loop iteration ahead into a separate set of "next" registers, while the computations still happen on the current values. At the end of the loop, the next values become the current values of the next loop iteration by assignment. These assignments are the first instructions that depend on the loads, and thus the computations can happen while the loads are still in flight.
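A minimal sketch of the idea, reduced to a scalar-product-like kernel with a single accumulator (our own simplified rendering, not the generated TSMTTSM code): the operands for iteration k + stride are loaded into "next" registers while the FMA of iteration k consumes the "current" registers.

#include <cuda_runtime.h>

// Leap frogging sketch: the loads for the next loop iteration are issued
// before the computation of the current iteration, so FMAs and outstanding
// loads can overlap. The result buffer must be zeroed before the launch;
// atomicAdd on double requires compute capability >= 6.0.
__global__ void dotLeapFrog(const double* __restrict__ a,
                            const double* __restrict__ b,
                            double* result, size_t K) {
    size_t tid    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    if (tid >= K) return;

    double sum = 0.0;

    // Preload the first iteration's operands ("current" registers).
    double curA = a[tid];
    double curB = b[tid];

    for (size_t k = tid; k + stride < K; k += stride) {
        // Issue the loads for the *next* iteration first; nothing depends
        // on them yet, so execution does not stall here.
        double nextA = a[k + stride];
        double nextB = b[k + stride];

        // Compute on the current values while the next loads are in flight.
        sum += curA * curB;

        // The "next" values become the "current" values of the next iteration.
        curA = nextA;
        curB = nextB;
    }
    sum += curA * curB;  // last preloaded pair

    atomicAdd(result, sum);
}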

Global Reduction
After each thread has serially computed its partial, thread-local result, a global reduction is required, which is considered overhead. Its runtime depends only on the thread count, though, whereas the time spent in the serial summation grows linearly with the row count and therefore becomes marginal for large row counts. However, as shown by Thies et al. (2019), the performance at small row counts can still be relevant, as the available GPU memory may be shared by more data structures than just the two tall & skinny matrices, limiting the data set size.

Starting with the Pascal architecture, atomic add operations for double precision values are available for global memory, making global reductions more efficient than on older systems. Each thread can just use an atomicAdd of its partial value to update the final results. The throughput of global atomicAdd operations is limited by the amount of contention, which grows for smaller matrix sizes. We improve on this global atomic reduction variant with a local atomic variant that reduces the amount of global atomicAdd operations by first computing thread-block-local partial results using shared memory atomics. This is followed by a global reduction of the local results. Additionally, we opportunistically reduce the amount of launched threads for small row counts.
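The two-level scheme can be sketched as follows (our own simplified version that reduces a single value; the real kernels do this for every entry of the thread-local tile of C):

#include <cuda_runtime.h>

// Two-level reduction sketch: thread-local partial sums are first combined
// within the thread block through a shared-memory atomicAdd, and only one
// value per block is added to the global result. This reduces contention on
// the global atomic compared to one global atomicAdd per thread.
// atomicAdd on double (global and shared) requires compute capability >= 6.0.
__global__ void reduceTwoLevel(const double* __restrict__ partials,
                               double* globalResult, size_t n) {
    __shared__ double blockSum;
    if (threadIdx.x == 0) blockSum = 0.0;
    __syncthreads();

    size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    double myPartial = (tid < n) ? partials[tid] : 0.0;

    // Block-local reduction via shared memory atomics.
    atomicAdd(&blockSum, myPartial);
    __syncthreads();

    // One global atomicAdd per thread block instead of one per thread.
    if (threadIdx.x == 0) atomicAdd(globalResult, blockSum);
}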

TSMM

Thread Mapping
In contrast to the TSMTTSM kernel, the summation is done along the short M axis, with no need for a global reduction. Though the short sum could be parallelized, this is not necessary in this case, as the other two loop dimensions supply sufficient parallelism. The visualizations in Figures 11–13 show slices perpendicular to the M axis, since this dimension will not be parallelized.

The first option is to only parallelize over the long K dimension, as shown in Figure 11. Each entry in A would be loaded once and then reused out of a register. The N

Figure 11. TSMM: Parallelization over K; a single thread computes a full result row of B. Slice perpendicular to the M axis; the (long) K axis extends "indefinitely" on both sides.

Figure 12. TSMM: Parallelization over K and N; two threads compute two results each. Slice perpendicular to the M axis; the K axis extends "indefinitely" on both sides.

sums that each thread computes require 2N registers, which is not a prohibitive number even at N = 64 but still does reduce occupancy. A more severe disadvantage is the strided stores. As each thread produces and stores a full row of B, the addresses stored to by the different threads are far apart, leading to partially written cache lines. This in turn causes a write-allocate read stream of the result matrix B to ensure fully consistent cache lines, thereby reducing the arithmetic intensity of the kernel.

This can be avoided by parallelizing the N loop. Each thread computes a single result of the output row of B. Because consecutive threads compute consecutive results, cache lines are always written fully and no write-allocate stream is necessary. The disadvantage is a low compute/load ratio. Each value from A is loaded and used just once in each thread.

A more balanced approach is to have a smaller group of threads compute on each result row, with a few results computed by each thread. Each value loaded from A is reused multiple times, once for each result computed by this thread. Using a transposed mapping as shown in Figure 12, each thread does not compute consecutive elements; results computed by threads are interleaved, so that consecutive elements are written and the amount of partial writes is reduced. This works best if the thread count is a multiple of four, which corresponds to the L1 cache line management granularity of 32 bytes. If N is not a multiple of four, the writes will necessarily be misaligned, with some cache lines being cut. Larger thread counts slightly reduce the impact of cut cache lines.
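A minimal CUDA sketch of this mapping (our own simplified version, not the generated kernel): TPR threads share one row of B, each computing N/TPR interleaved results; C is read directly from global memory here, and the total number of launched threads is assumed to be a multiple of TPR.

#include <cuda_runtime.h>

// TSMM B = A * C with K and N parallelized: TPR threads cooperate on one row
// of B, and each thread computes N/TPR results in an interleaved ("transposed")
// pattern so that consecutive threads write consecutive elements of B.
// A is K x M, C is M x N, B is K x N, all row-major.
template <int M, int N, int TPR>   // TPR = threads per row; must divide N
__global__ void tsmmInterleaved(const double* __restrict__ A,
                                const double* __restrict__ C,
                                double* __restrict__ B, size_t K) {
    size_t tid       = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t rowStride = ((size_t)gridDim.x * blockDim.x) / TPR;
    int lane = tid % TPR;   // position within the group that shares a row

    for (size_t k = tid / TPR; k < K; k += rowStride) {
        double acc[N / TPR];
        for (int i = 0; i < N / TPR; i++) acc[i] = 0.0;

        for (int m = 0; m < M; m++) {
            double aVal = A[k * M + m];           // reused N/TPR times per thread
            for (int i = 0; i < N / TPR; i++) {
                int n = i * TPR + lane;           // interleaved column index
                acc[i] += aVal * C[m * N + n];
            }
        }
        // Consecutive lanes write consecutive elements of B (coalesced).
        for (int i = 0; i < N / TPR; i++)
            B[k * N + i * TPR + lane] = acc[i];
    }
}

// Example launch for M = N = 16 with 4 threads per row:
// tsmmInterleaved<16, 16, 4><<<80 * 8, 256>>>(dA, dC, dB, K);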


Figure 13. TSMM: Parallelization over K and N with 2× unrolling; two threads compute two results each on two rows of B in a single iteration. Slice perpendicular to the M axis; the K axis extends "indefinitely" on both sides.

Data from C
Our discussion of thread mappings and data transfers so far has ignored the entries of the matrix C. These values are the same for every index of the K loop. The fastest option would be to load all entries of C into registers and reuse them from there, but this strategy would quickly exceed the number of available registers even at moderate M and N. Since they are accessed frequently and all threads in a thread block access similar values, the contents of C should continuously stay in the L1 cache, making reloads of these values a question of L1 cache bandwidth and not memory latency. Each load from C loads between one and three 128-byte cache lines, which would then be used for a single FMA. This is higher than the sustainable ratio of one 128-byte cache line per FMA. A solution is to reuse each value loaded from C by unrolling the K loop and pulling the unrolled iterations inside the M loop. Each iteration over K loads the same values of C, which can subsequently be used for multiple iterations per load.

The loads from C can also be sped up by using the shared memory to cache these values. Threads in a thread block collaboratively load the contents of C into the shared memory at the beginning of the kernel. The loop over K is parallelized with a grid stride loop, where only as many threads as necessary for full occupancy are launched. Each kernel instantiation then computes on multiple rows of B. Therefore, loading C into shared memory can be amortized over many rows.

On the V100, the shared memory has the same bandwidth as the L1 cache, given that they occupy the same hardware structure. However, shared memory accesses guarantee cache hits, as they avoid conflict misses with other data. They also have a lower latency, since no tags have to be checked.
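A sketch of this staging step (our own simplified version, using the plain one-thread-per-row mapping of Figure 11 for brevity; the tuned kernels combine it with the mappings and unrolling described above):

#include <cuda_runtime.h>

// Staging the small matrix C in shared memory: all threads of the block
// cooperatively copy C once, and the grid-stride loop over K then reads C
// from shared memory instead of going through L1/L2 for every FMA.
template <int M, int N>
__global__ void tsmmSharedC(const double* __restrict__ A,
                            const double* __restrict__ C,
                            double* __restrict__ B, size_t K) {
    __shared__ double cS[M][N];

    // Cooperative load of C, amortized over all rows this block processes.
    for (int i = threadIdx.x; i < M * N; i += blockDim.x)
        cS[i / N][i % N] = C[i];
    __syncthreads();

    size_t tid        = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t gridStride = (size_t)gridDim.x * blockDim.x;

    // One thread per row of B here, for brevity (cf. Figure 11).
    for (size_t k = tid; k < K; k += gridStride)
        for (int n = 0; n < N; n++) {
            double sum = 0.0;
            for (int m = 0; m < M; m++)
                sum += A[k * M + m] * cS[m][n];
            B[k * N + n] = sum;
        }
}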

Results: TSMTTSM

Transposition and Leap Frogging
An exhaustive search was used to find the best tile size and configuration for each matrix size. The simpler mapping schemes are subsets of the tiled mapping; e.g., the mapping in Figure 8 corresponds to a tile size of M × 1. Figure 14 shows the performance of the four configurations of using leap frogging and a transposed mapping. The performance agrees with the roofline prediction (dashed line) perfectly until M,N = 20. Until M,N = 36, the best performance stays within 95% of the limit. Beyond that, the growing


Figure 14. Performance comparison of real-valued double-precision TSMTTSM vs. quadratic tile size with K = 2^29/M on the V100 across the four different permutations of using leap frogging (LF) and transposed mapping (trans). The best performance for each matrix size and configuration is shown. The arithmetic peak performance of the device is 7.066 Tflop/s.

Figure 15. Performance of TSMTTSM for M,N = 32 and K = 2^29/M vs. tile sizes in M and N directions, using real-valued double-precision matrices, with leap frogging and transposed mapping. The two white lines are defined by 2 × (T_M T_N + 2(T_M + T_N) + 8) = R, with R = 128, 256, to mark approximate boundaries of register usage.

arithmetic intensity does not translate into a proportional speedup anymore, although the performance is still about a factor of two away from peak. The best variants plateau at about 4700 Gflop/s, or 2/3 of peak. Both variants using leap frogging are clearly faster, but the transposed mapping is only a bit faster if leap frogging is used. This is in contrast to experiences with the Kepler GPU architecture, where strided loads are slower and this kind of transformation is more beneficial. The best tile size changes when leap frogging is used, as it requires more registers.

Tile Sizes
Figure 15 shows the dependence of performance on the tile sizes T_M and T_N for the case M,N = 32 with leap frogging and transposed mapping. Performance drops off sharply


if the tile sizes become too large and too many registers are used. The number of registers can be approximated by 2 × (T_M T_N + 2(T_M + T_N) + 8), which accounts for the thread-local sums (T_M T_N), loaded values (T_M + T_N), and eight registers for other purposes. Leap frogging introduces a factor of two for the number of loaded values (for current and next values), and double precision values generally require two 32-bit registers, for an overall factor of two. The graph shows the iso-lines of 128 and 256 registers, which represent the occupancy drop from 25% to 12.5% at 128 registers and the onset of spilling at 256 registers.
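As a concrete instance of this estimate (our own arithmetic): a tile of T_M = T_N = 8 needs about

\[
2 \times \left(8 \cdot 8 + 2\,(8+8) + 8\right) = 208\ \text{registers},
\]

which lies between the two iso-lines, i.e., it fits without spilling but limits the occupancy to 12.5%.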

The best-performing tile sizes generally sit on or just below these lines, maximizing the area of the tile for a given occupancy. The dimensions are largely symmetric but not perfectly so, as threads are mapped to tiles in the M direction first. There are clear patterns favoring powers of two, as those are divisors of the matrix size 32 and avoid both the overhead of guarding statements and idle threads.

Analysis
According to the roofline model, at M = N = 64 the upper performance limit is

\[
P = \frac{64}{8}\,\frac{\text{flop}}{\text{byte}} \times 880\,\frac{\text{Gbyte}}{\text{s}} = 7060\ \text{Gflop/s}, \qquad (2)
\]

which is almost exactly the P_peak of 7066 Gflop/s. However, our implementation cannot realize the roofline-predicted performance, and instead tops out at 4766 Gflop/s ≈ 2/3 P_peak. The reason for the limitation is memory latency, which can be shown by a simple model: whereas the memory latency for an idle memory interface measured with a pointer chasing benchmark (see Ernst (2019)) is only 435 cy, this latency increases as the load on the memory interface increases. For the values in Table 1, it is possible to calculate corresponding latency values according to Little's Law via

\[
T_\ell = \frac{f\,N \times 8\,\text{byte}}{b}, \qquad (3)
\]

with f being the clock frequency, N the thread count, and b the memory bandwidth. For the unloaded case in the first row of Table 1 (ILP=1), the latency according to (3) is T_ℓ ≈ 470 cy, which matches the measured pointer chasing latency quite well. The bandwidth of b = 681 Gbyte/s at 25% occupancy in the fourth row roughly corresponds to the highest observed memory bandwidth, based on the computational intensity, for M,N = 64, and results in T_ℓ ≈ 664 cy of memory latency.
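For the 25% occupancy row, the numbers work out as follows (our own arithmetic, assuming the V100's 80 SMs with a maximum of 2048 threads each, i.e., N = 0.25 × 163840 = 40960 concurrently running threads):

\[
T_\ell = \frac{1.38\,\text{GHz} \times 40960 \times 8\,\text{byte}}{681\,\text{Gbyte/s}} \approx 664\ \text{cy}.
\]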

The best tile size without leap frogging is 11 × 8, which requires 11 × 8 = 88 FMA operations. These can be computed on a single quadrant in 88 × 4 cy = 352 cy. At this large tile size, the register requirements of at least 2 × 11 × 8 = 176 registers allow only eight warps, i.e., two warps per quadrant, to run simultaneously on an SM. One warp doing 352 cy of compute work finishes earlier than the other warp waiting 664 cy for data from memory. It will then also wait for the next data to be loaded, which is a period of time where neither of the two warps is issuing floating point operations, and therefore counts as wasted cycles.

Leap frogging does improve the situation, as even with a single warp the memory latency and compute times can

Figure 16. TSMTTSM performance vs. occupancy of real, correct kernels and two modified (incorrect) kernels at tile sizes of 4×8 and 8×8, respectively. The first modification reduces the register count, while the second kernel additionally reduces the data set so that it resides in the L1 cache. Green circles mark the point with the highest performance of the unmodified kernels. (Real-valued double-precision matrices.)

overlap. However, additional registers are required to hold the data for the next iteration, which either necessitates smaller tile sizes or reduces occupancy, both of which are bad for overlapping. Overall, leap frogging is still beneficial, though.

Figure 16 shows an experiment that gives insight into the relationship of latency and occupancy. A modification of the generated kernels allows testing the impact of higher occupancies even for kernels with larger tile sizes, where the high register requirements usually limit the occupancy to the minimum of eight warps per SM. Instead of computing T_M × T_N intermediate results, all summands are summed up in just two accumulators. This of course does not compute the correct results any more, but all the instructions and loaded operands are the same, while the register count is reduced so that 32 warps per SM can run concurrently. Another modification to the generated kernel introduces a division of the K loop row index by a large constant. In consequence, all loop iterations compute on data of very few rows, which makes almost all accesses L1 cache hits with the corresponding much smaller latency. Repeatedly using the same row is done in such a contrived way in order to prevent the compiler from pulling the loads in front of the loop.

With tile sizes of 8 × 4 and 8 × 8, as used in this experiment, 16 and 8 warps per SM can run concurrently. At these occupancies (green circles in Figure 16), the performance of the respective real kernels (circle symbols) is highest, as the maximum possible number of thread blocks run concurrently. With an increased number of launched thread blocks, the unmodified kernels' performance does not increase anymore, as additional thread blocks do not run concurrently but are scheduled in a "second wave" of thread blocks. An imbalance in the number of thread blocks per wave leads to fluctuating performance.

The kernels modified for higher occupancy (triangle symbols) have the same performance as the unmodified kernels up to these points, but allow us to see the hypothetical speedup if more thread blocks could run concurrently, which


would be possible on a hypothetical V100 with 4× larger register files.

The performance increase is linear in all cases up to four warps per SM, as this is the minimum to fill all four quadrants of an SM. For both tile sizes, the L1 load kernels (square symbols) profit somewhat from a second warp on each quadrant to overlap the remaining latency and overhead but quickly saturate at ceilings of 6080 Gflop/s and 5700 Gflop/s, respectively, which is not a latency effect any more. The reason for these lower roofs remains open, but we suspect that it is rooted in limited instruction throughput. We noticed that the gap to the device peak performance matches one missing DP FP operation per four non-DP FP operations, i.e., integer and load instructions. DP FP operations are supposed to execute on separate execution units, and so we can only speculate whether there is a restriction in co-issuing DP FP operations with integer and load instructions.

The two experiments with the normal, higher latency from memory (triangle symbols) need many more warps to overlap their longer latency to eventually saturate at the same level as the L1 load kernels. At least two to three times larger register files would be required to get there. At the same time, it also shows how devastating it would be if the register files were half as large, a situation that is not dissimilar to the older Kepler GPU architecture, where double the number of execution units were backed by a similarly sized register file. The larger tile size saturates more quickly, because it amortizes the same latency over twice the number of floating-point operations. Note that in the end, both tile sizes have a similar real-world performance, as the higher possible occupancy of 16 warps per SM compared to 8 warps per SM balances the smaller amount of work per iteration.

This simple model also helps to explain the rather small benefit from using the transposed mapping. The transposed mapping changes the load pattern to contiguous blocks instead of long strides. This in turn reduces the number of touched cache lines and increases the rate at which the L1 cache can serve the outstanding loads after the data has arrived from memory. However, this rate is only really a limiter at low FMA/load ratios, or at the beginning of the floating-point operation phase, where the FP units still wait for enough registers to be filled for uninterrupted operation. The transposed mapping therefore only gives a small speedup in a phase that is mostly not the limiter, but at the same time also makes smaller tile sizes more feasible.

On the other hand, the strided access patterns of the nontransposed mapping touch most cache lines already on the first load, and therefore already cause most cache misses with the first load. Subsequent loads are cache hits. With the transposed mapping, with its contiguous blocks of addresses per load, cache misses are postponed until later loads, which starts the memory latency penalty later. That is why the configuration using the transposed mapping without leap frogging performs the worst (see Figure 14). However, in combination with leap frogging it is faster than the two variants with the nontransposed mapping.

Comparison with Libraries
Both CUBLAS' and CUTLASS' performance is far below the potential performance, except for M,N = 1, where


Figure 17. TSMTTSM percentage of roofline-predicted performance for real (D) and complex (Z) double-precision data in comparison with CUBLAS and CUTLASS.

Figure 18. Global reduction impact for TSMTTSM: performance when using each of the two global reduction variants as the percentage of the performance of a kernel without a global reduction, using two different matrix widths and tile sizes. (Real-valued double-precision matrices.)

CUBLAS seems to have a special detection for the scalar product corner case. The utilization of potential performance increases as matrices become wider, which makes them more square and compute bound, bringing them closer to more standard scenarios.

In contrast, the presented implementation shows full efficiency for narrow, clearly memory bandwidth limited matrices, and utilization slightly drops off as matrices become more compute bound. For complex-valued matrices, the TSMTTSM becomes compute bound already at M,N = 32. Instruction throughput becomes the limiter much earlier instead of memory bandwidth and latency, which is why the utilization drops earlier. With increasing matrix size, it fully saturates the previously explained lower ceiling due to our speculated co-issue limitation between double-precision FP instructions and integer instructions.

Impact of Reductions
Figure 18 shows the relative performance of our TSMTTSM implementation versus row count with respect to a baseline without any reduction for a selection of inner matrix sizes and tile sizes, choosing either of the two reduction methods



Figure 19. TSMM performance comparison at K = 2^29/M among different sources for the matrix C, showing the best-performing configuration of each method and matrix width. (Real-valued double-precision matrices.)

described in the section on the global reduction. As expected, the impact of the reduction generally decreases with increasing row count. The method with only global atomics is especially slow for the narrower matrices (M,N = 4). Many threads writing to a small number of result values leads to contention and causes a noticeable impact even for a matrix filling the device memory (K = 10^8). The local atomic variant drastically reduces the number of writing threads, resulting in less than 10% overhead even for the smallest sizes and near-perfect performance for K > 10^6. For the wider matrices, the difference is smaller. The global atomic version is not as slow because the writes spread out over more result values, and the local atomic variant is not as fast because the larger tile size requires more work in the local reduction. Both variants incur less than 10% overhead just above K = 10^4, a point where only about 0.2% of the GPU memory is used.

Results: TSMM
The described methods and parameters open up a large space of configurations. Each of the Figures 19, 20, and 21 shows a cross section of each configuration option by displaying the best-performing value for each choice of that configuration option.

Source of C
The data in Figure 19 demonstrates that trying to keep the values of matrix C in registers works well only for small M,N. The increasing register pressure at larger sizes reduces occupancy, which is especially bad if multiple results are computed per thread.

Reloading values from shared memory consistently has a small performance advantage, especially for sizes that are not multiples of four, due to a smaller penalty because of misaligned loads. Each additional cache line that gets touched because of misalignment costs an additional cycle.

Unrolling
Although there is little improvement with further unrolling beyond 2×, as Figure 20 shows, unrolling at least once shows


Figure 20. TSMM performance comparison of different degrees of unrolling at K = 2^29/M, showing the best-performing configuration for each unrolling depth (1, ..., 4) and matrix width. (Real-valued double-precision matrices.)


Figure 21. TSMM performance comparison of different thread counts per row at K = 2^29/M, showing the best-performing configuration for each thread count and matrix width. (Real-valued double-precision matrices.)

a clear speedup compared to no unrolling. Without unrolling, the shared memory bandwidth would limit the performance due to the high ratio of shared memory loads to DP FP instructions, and its latency could not be hidden as well with DP FP instructions from further iterations. Generally, a similar reasoning as for the TSMTTSM kernel applies, where computing more results per thread and higher unrolling counts increase the number of floating-point operations per iteration but also decrease the occupancy that would be needed to overlap the memory latency.

Thread Count
Fewer threads per row mean more work per thread. For large matrix sizes, this can result in huge kernels with high register requirements, which is why Figure 21 does not show measurements for the whole matrix size range for one and two threads per row. These two thread counts are the slowest variants, as they show the effects of strided writes the most. With four threads writing consecutive values, there is at least a chance of writing a complete 32-byte cache line sector. The



Figure 22. TSMM percentage of roofline-predicted performance for real (D) and complex (Z) double-precision data in comparison with CUBLAS.

difference between 4, 8, or 16 threads is not large, although the larger thread counts perform slightly more consistently (i.e., with less fluctuation across M).

The performance analysis for TSMM shows a clear preference for the small matrix dimension M = N to be a multiple of four. For this case, all writes of computed data to the matrix B are aligned to 4 × 8 byte = 32 byte, which is the management granularity for L1 cache lines and the cache line length for the L2 cache. With this alignment, cache lines are fully written and there is no overhead for write allocation from memory. Misalignment is the major performance hurdle for matrix widths that are not multiples of four.

Comparison with Libraries
Figure 22 shows that, except for very small M,N, CUBLAS performs very well for the real-valued TSMM kernel. With increasing width, the development in utilization is very similar to the presented implementation. Our solution works similarly well for complex-valued matrices, which is not the case for CUBLAS: here, a strong performance drop for medium-wide matrices can be observed.

Conclusion and Outlook

We have shown how to optimize the performance of two types of multiplication of double-precision, real and complex tall & skinny matrices on a V100 GPU. With matrices narrower than 32 columns, perfect performance in accordance with a roofline performance model could be achieved. Over the rest of the skinny range up to a width of 64, between 60% and 2/3 of the potential performance was attained. We used a code generator on top of a range of suitable thread mapping and tiling patterns, which enabled an exhaustive parameter space search. Two different ways to achieve fast, parallel device-wide reductions for long vectors have been devised in order to ensure a fast ramp-up of performance already for shorter matrices. An in-depth performance analysis was provided to explain observed deviations from the roofline limit. Our implementation outperforms the vendor-supplied CUBLAS

and CUTLASS libraries by far or is on par with them for most of the observed parameter range.

In future work, in order to push the limits of the current implementation, shared memory could be integrated into the mapping scheme to speed up the many loads, especially scattered ones, that are served by the L1 cache.

The presented performance figures were obtained by parameter search. An advanced performance model, currently under development, could be fed with code characteristics such as load addresses and instruction counts generated with the actual code and then used to eliminate bad candidates much faster. It will also support a better understanding of performance limiters.

Prior work by us in this area is already part of the sparse matrix toolkit GHOST (Kreutzer et al. (2016)) and we plan to integrate the presented work there as well.

Funding

This work was supported by the ESSEX-II project in the DFG Priority Programme SPPEXA.

References

Cullum J and Donath WE (1974) A block Lanczos algorithm for computing the q algebraically largest eigenvalues and a corresponding eigenspace of large, sparse, real symmetric matrices. In: 1974 IEEE Conference on Decision and Control including the 13th Symposium on Adaptive Processes. pp. 505–509. DOI: 10.1109/CDC.1974.270490.

Ernst D (2019) CUDA Microbenchmarks. http://tiny.cc/cudabench [Online; accessed 02-Feb-2020].

Ernst D, Hager G, Thies J and Wellein G (2019) Performance engineering for a tall & skinny matrix multiplication kernel on GPUs. CoRR abs/1905.03136. URL http://arxiv.org/abs/1905.03136. To be published in Proc. PPAM 2019.

Gropp WD, Kaushik DK, Keyes DE and Smith BF (1999) Towards realistic performance bounds for implicit CFD codes. In: Proceedings of Parallel CFD'99. Elsevier, pp. 233–240.

Harris M (2013) CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops. http://tiny.cc/cuda-stride [Online; accessed 02-Feb-2020].

Herrero JR and Navarro JJ (2006) Compiler-optimized kernels: An efficient alternative to hand-coded inner kernels. In: Gavrilova ML, Gervasi O, Kumar V, Tan CJK, Taniar D, Lagana A, Mun Y and Choo H (eds.) Computational Science and Its Applications - ICCSA 2006. Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-540-34080-5, pp. 762–771.

Kreutzer M, Ernst D, Bishop AR, Fehske H, Hager G, Nakajima K and Wellein G (2018) Chebyshev filter diagonalization on modern manycore processors and GPGPUs. In: Yokota R, Weiland M, Keyes D and Trinitis C (eds.) High Performance Computing. Cham: Springer International Publishing. ISBN 978-3-319-92040-5, pp. 329–349.

Kreutzer M, Thies J, Rohrig-Zollner M, Pieper A, Shahzad F, Galgon M, Basermann A, Fehske H, Hager G and Wellein G (2016) GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems. International Journal of Parallel Programming: 1–27. DOI: 10.1007/s10766-016-0464-z. URL http://dx.doi.org/10.1007/s10766-016-0464-z.

McCalpin JD (1995) Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter: 19–25.

NVIDIA (2019a) CUBLAS reference. https://docs.nvidia.com/cuda/cublas [Online; accessed 05-May-2019].

NVIDIA (2019b) CUTLASS. https://github.com/NVIDIA/cutlass [Online; accessed 05-May-2019].

O'Leary DP (1980) The block conjugate gradient algorithm and related methods. Linear Algebra and its Applications 29: 293–322. DOI: 10.1016/0024-3795(80)90247-5. URL http://www.sciencedirect.com/science/article/pii/0024379580902475. Special Volume Dedicated to Alson S. Householder.

Rohrig-Zollner M, Thies J, Kreutzer M, Alvermann A, Pieper A, Basermann A, Hager G, Wellein G and Fehske H (2015) Increasing the performance of the Jacobi-Davidson method by blocking. SIAM Journal on Scientific Computing 37(6): C697–C722. DOI: 10.1137/140976017. URL https://doi.org/10.1137/140976017.

Thies J, Rohrig-Zollner M, Overmars N, Basermann A, Ernst D and Wellein G (2019) PHIST: a pipelined, hybrid-parallel iterative solver toolkit. ACM Transactions on Mathematical Software.

Williams S, Waterman A and Patterson D (2009) Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52(4): 65–76. DOI: 10.1145/1498765.1498785. URL http://doi.acm.org/10.1145/1498765.1498785.
