
Introduction to Communication-Avoiding Algorithms

www.cs.berkeley.edu/~demmel/SC13_tutorial

Jim Demmel, EECS & Math Departments

UC Berkeley

2

Why avoid communication? (1/2)

Algorithms have two costs (measured in time or energy):
1. Arithmetic (FLOPS)
2. Communication: moving data between
   – levels of a memory hierarchy (sequential case)
   – processors over a network (parallel case)

[Figure: sequential case — CPU + cache connected to DRAM; parallel case — several CPU/DRAM nodes connected by a network]

Why avoid communication? (2/3)

• Running time of an algorithm is sum of 3 terms:
  – # flops * time_per_flop
  – # words moved / bandwidth     } communication
  – # messages * latency          }

• Time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time [FOSC]
• Avoid communication to save time

Annual improvements:
  Time_per_flop: 59%
  Network:  Bandwidth 26%,  Latency 15%
  DRAM:     Bandwidth 23%,  Latency 5%

Why Minimize Communication? (2/2)

[Figure: energy cost per operation, now and projected — on-chip data movement vs. DRAM vs. network. Source: John Shalf, LBL]

Minimize communication to save energy

Goals

6

• Redesign algorithms to avoid communication
  • Between all memory hierarchy levels
    • L1, L2, DRAM, network, etc.
• Attain lower bounds if possible
  • Current algorithms often far from lower bounds
  • Large speedups and energy savings possible

"New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems."

FY 2010 Congressional Budget, Volume 4, FY 2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress

CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); "Tall-Skinny" QR (Grigori, Hoemmen, Langou, JD)

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki, …
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary of CA Algorithms
• "Direct" Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
  • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Lower bound for all "direct" linear algebra

• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

  #messages_sent ≥ #words_moved / largest_message_size

  #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)
• Parallel case: assume either load or memory balanced

SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz

Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply

16

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) updated by row i of A times column j of B]

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}           … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}             … n^2 reads altogether
    {read column j of B into fast memory}      … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}         … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}       … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}     … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}     … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)        {do a matrix multiply on b-by-b blocks}
    {write block C(i,j) back to slow memory}   … b^2 × (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
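As a concrete illustration of the tiled loop nest above, here is a minimal NumPy sketch (not any of the tuned implementations discussed later); the block size b is a free parameter standing in for roughly (M/3)^(1/2).

```python
import numpy as np

def blocked_matmul(A, B, b):
    """C = A @ B computed block-by-block, mirroring the tiled pseudocode:
    each b-by-b block of C is accumulated from b-by-b blocks of A and B."""
    n = A.shape[0]
    assert A.shape == B.shape == (n, n) and n % b == 0
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

# quick check against NumPy's own product
A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B, 32), A @ B)
```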

Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 2n^3/(M/3)^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

21

How hard is hand-tuning matmul, anyway?

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
  • Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results

[Figure: scatter plots of the teams' achieved performance]

Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m

• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]

• True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM(A, B, n)
  if n = 1
    C = A * B
  else
    C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
    C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
    C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
    C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
  return

Recursive Matrix Multiplication (RMM) (2/2)

func C = RMM(A, B, n)   … (as above)

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not a panacea
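A minimal recursive sketch of RMM in NumPy. It assumes n is a power of 2 and, as a real implementation would, stops recursing at a small base-case block instead of going all the way to 1x1:

```python
import numpy as np

def rmm(A, B, base=16):
    """Recursive matrix multiply: same 2n^3 flops as the usual algorithm,
    performed in a cache-oblivious order. Assumes n is a power of 2;
    recursion stops at a small base block (chosen here arbitrarily)."""
    n = A.shape[0]
    if n <= base:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty((n, n))
    C[:h, :h] = rmm(A11, B11, base) + rmm(A12, B21, base)
    C[:h, h:] = rmm(A11, B12, base) + rmm(A12, B22, base)
    C[h:, :h] = rmm(A21, B11, base) + rmm(A22, B21, base)
    C[h:, h:] = rmm(A21, B12, base) + rmm(A22, B22, base)
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(rmm(A, B), A @ B)
```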

CARMA Performance: Shared Memory

Square: m = k = n
[Figure: performance (linear scale) vs. dimension (log scale) for MKL and CARMA, single and double precision, with single- and double-precision peak lines. Machine: Intel Emerald, 4 Intel Xeon X7560 x 8 cores, 4 x NUMA]

CARMA Performance: Shared Memory

Inner Product: m = n = 64
[Figure: performance (linear scale) vs. k (log scale) for MKL and CARMA, single and double precision. Same machine]

Why is CARMA Faster? L3 Cache Misses

Shared Memory Inner Product (m = n = 64, k = 524,288)
[Figure: L3 cache misses — 97% fewer misses and 86% fewer misses for CARMA vs. MKL]

Parallel MatMul with 2D Processor Layout

• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij

[Figure: C = A * B, with each of C, A, B laid out on the same 4 x 4 grid P00 … P33]

30

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages   = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/ (LAWN 96, 100)
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

31

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

• C(i,j) is n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is n/P^(1/2) x b submatrix of A
• B(k,j) is b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be square processor grid

[Figure: C(i,j) accumulates the product of column panel A(i,k) and row panel B(k,j)]

32

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

[Figure: same panel picture as before, with the broadcast column panel Acol and row panel Brow highlighted]

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^(1/2)
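The broadcast structure can be mimicked in a serial NumPy sketch that stores C, A, B as blocks on a virtual √P x √P grid; this is only a simulation of the data-movement pattern (no MPI), with the panel width b taken equal to the full block width for brevity.

```python
import numpy as np

def summa_simulated(A, B, sqrt_p):
    """Serial simulation of SUMMA's communication pattern on a virtual
    sqrt_p x sqrt_p processor grid (block size = n/sqrt_p, panel width b = block)."""
    n = A.shape[0]
    nb = n // sqrt_p                       # block dimension per "processor"
    C = np.zeros((n, n))
    blk = lambda M, i, j: M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
    for k in range(sqrt_p):                # one outer step per panel
        # "broadcast" A(i,k) along each processor row, B(k,j) along each column
        Acol = [blk(A, i, k) for i in range(sqrt_p)]
        Brow = [blk(B, k, j) for j in range(sqrt_p)]
        for i in range(sqrt_p):
            for j in range(sqrt_p):        # each processor updates only its own block
                C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] += Acol[i] @ Brow[j]
    return C

A, B = np.random.rand(96, 96), np.random.rand(96, 96)
assert np.allclose(summa_simulated(A, B, 4), A @ B)
```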

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. Eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds up to polylog(P) factors
  • Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD

Can we do better?

• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
    • Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid; example: P = 32, c = 2]

2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)
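A small back-of-the-envelope script, using the asymptotic bandwidth model implied by the lower bound above (words_moved per processor ≈ n^2/(cP)^(1/2) with replication factor c, constants dropped), just to show how replication trades extra memory for less bandwidth; the numbers are illustrative, not measurements.

```python
# Asymptotic bandwidth cost per processor (constants dropped):
#   2D   (c = 1): words_moved ~ n^2 / P^(1/2)
#   2.5D (c > 1): words_moved ~ n^2 / (c*P)^(1/2)   -> c^(1/2) times less
def words_moved(n, P, c=1):
    return n**2 / (c * P) ** 0.5

n, P = 65536, 16384
for c in (1, 2, 4, 16):
    mem_per_proc = c * n**2 / P            # replicated storage per processor
    print(f"c={c:2d}  words_moved~{words_moved(n, P, c):.3e}  "
          f"memory/proc~{mem_per_proc:.3e} words")
```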

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
[Figure: execution-time breakdown for 2D vs. 2.5D matmul; annotations: 12x faster, 2.7x faster]

Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
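A quick numerical check of the scaling claim, plugging invented machine constants into the two formulas above; the only point is that T(cP) = T(P)/c and E(cP) = E(P) when memory per processor M is held fixed.

```python
# Made-up constants, for illustration only (not measurements of any machine).
gT, bT, aT = 1e-11, 1e-9, 1e-6      # secs per flop, per word, per message
gE, bE, aE = 1e-10, 2e-9, 1e-6      # joules per flop, per word, per message
dE, eE = 1e-12, 1.0                 # joules/word of memory/sec, leakage joules/sec
n, m = 2**16, 1e4                   # problem size, message size

def time_energy(P, M):
    T = (n**3 / P) * (gT + bT / M**0.5 + aT / (m * M**0.5))
    E = P * ((n**3 / P) * (gE + bE / M**0.5 + aE / (m * M**0.5))
             + dE * M * T + eE * T)
    return T, E

P0 = 64
M = 3 * n**2 / P0                   # minimal memory per processor, kept fixed
T0 = time_energy(P0, M)[0]
for c in (1, 2, 4):
    T, E = time_energy(c * P0, M)
    print(f"c={c}:  T={T:.3e} s (T(P)/c={T0/c:.3e}),  E={E:.3e} J")
```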

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi γi + Fi βi / Mi^(1/2) + Fi αi / Mi^(3/2) = Fi [ γi + βi / Mi^(1/2) + αi / Mi^(3/2) ] = Fi ξi
  – Choose Fi so Σi Fi = n^3 and minimizing T = maxi Ti
  – Answer: Fi = n^3 (1/ξi) / Σj (1/ξj) and T = n^3 / Σj (1/ξj)   (see the sketch below)
• Optimal Algorithm for nxn matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
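A minimal sketch, assuming the model above, of computing the optimal work split Fi from per-processor constants; the machine constants below are invented for illustration.

```python
# Optimal static work assignment under the heterogeneous model above:
#   xi_i = gamma_i + beta_i / M_i**0.5 + alpha_i / M_i**1.5   (effective sec per flop)
#   F_i  = n^3 * (1/xi_i) / sum_j (1/xi_j),   T = n^3 / sum_j (1/xi_j)
n = 4096
procs = [  # (gamma, beta, alpha, M) -- invented values for three unlike processors
    (1e-11, 1e-9, 1e-6, 1e8),
    (5e-11, 2e-9, 2e-6, 4e7),
    (2e-11, 1e-9, 5e-7, 2e8),
]
xi = [g + b / M**0.5 + a / M**1.5 for g, b, a, M in procs]
inv_sum = sum(1 / x for x in xi)
F = [n**3 * (1 / x) / inv_sum for x in xi]
T = n**3 / inv_sum
print("flops per processor:", [f"{f:.3e}" for f in F])
# Every processor finishes at the same time: F_i * xi_i = T for all i.
print("T =", f"{T:.3e}", " max_i F_i*xi_i =", f"{max(f*x for f, x in zip(F, xi)):.3e}")
```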

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n) * B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σm A(i,j,m)*B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

TSQR: QR of a Tall, Skinny matrix

W = [ W0; W1; W2; W3 ]
  = [ Q00 R00; Q10 R10; Q20 R20; Q30 R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01 R01; Q11 R11 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02 R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
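A minimal sequential sketch of the reduction above in NumPy: QR each block row, stack the R factors, and combine. Only R is returned, and the combine here is done in one flat step rather than the binary tree drawn above.

```python
import numpy as np

def tsqr_R(W, nblocks=4):
    """R factor of a tall-skinny W via a TSQR-style reduction:
    local QR on each block row, then one QR of the stacked R factors."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]   # local factorizations
    return np.linalg.qr(np.vstack(Rs), mode='r')         # combine at the root

W = np.random.rand(10000, 50)
R_tsqr = tsqr_R(W)
R_ref = np.linalg.qr(W, mode='r')
# R is unique up to signs of its rows, so compare absolute values
assert np.allclose(np.abs(R_tsqr), np.abs(R_ref), atol=1e-8)
```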

TSQR: An Architecture-Dependent Algorithm

[Figure: three reduction trees on W = [W0; W1; W2; W3] —
  Parallel: binary tree (R00, R10, R20, R30 → R01, R11 → R02);
  Sequential: flat, pipelined tree (R00 → R01 → R02 → R03);
  Dual Core: hybrid of the two]

Can choose reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core

TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

W_nxb = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1'; W2'; W3'; W4' ] = [ P12·L12·U12; P34·L34·U34 ]
  Choose b pivot rows, call them W12'
  Choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234
  Choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
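A toy NumPy sketch of the tournament above, assuming Gaussian elimination with partial pivoting (GEPP) is the method used to pick the b candidate rows from each block; it illustrates only the pivot-selection logic, not a full CALU factorization.

```python
import numpy as np

def pivot_rows(block, b):
    """Indices (into `block`) of the b pivot rows chosen by GEPP on its b columns."""
    A = block.astype(float).copy()
    rows = np.arange(A.shape[0])
    for k in range(b):
        p = k + np.argmax(np.abs(A[k:, k]))                 # partial pivoting
        A[[k, p]] = A[[p, k]]
        rows[[k, p]] = rows[[p, k]]
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])  # eliminate
    return rows[:b]

# One tournament on W split into 4 block rows, seeking b pivot rows.
b = 4
W = np.random.rand(400, b)
blocks = np.array_split(W, 4, axis=0)
cands = [blk[pivot_rows(blk, b)] for blk in blocks]          # W1'..W4'
semis = [np.vstack(cands[:2]), np.vstack(cands[2:])]
finalists = [S[pivot_rows(S, b)] for S in semis]             # W12', W34'
winners = np.vstack(finalists)[pivot_rows(np.vstack(finalists), b)]
print(winners.shape)   # (4, 4): the b rows to move to the top of W
```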

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters
Source: DOE Exascale Workshop

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup over the plane of log2(p) vs. log2(n^2/p) = log2(memory_per_proc); up to 29x faster]

Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A*P = Q*R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading rxr submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Communication Lower Bounds for Strassen-like matmul algorithms

  Classical O(n^3) matmul:       #words_moved = Ω( M (n/M^(1/2))^3 / P )
  Strassen's O(n^lg7) matmul:    #words_moved = Ω( M (n/M^(1/2))^lg7 / P )
  Strassen-like O(n^ω) matmul:   #words_moved = Ω( M (n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

vs.

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  if EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

Communication Avoiding Parallel Strassen (CAPS)
In practice, how to best interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!

Conventional vs CA-SBR

[Figure: bulge-chasing sweep patterns — Conventional vs. Communication-Avoiding]

Many tuning parameters; right choices reduce #words_moved by a factor of M/bw, not just M^(1/2)

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), mink( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊗ D12
    D21 = D21 ⊗ D11
    D22 = D21 ⊗ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊗ D21
    D12 = D12 ⊗ D22
    D11 = D12 ⊗ D21
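The semiring product ⊗ used above is ordinary matmul with (+, ×) replaced by (min, +); a dense NumPy sketch, for illustration only:

```python
import numpy as np

def minplus(D, A, B):
    """D = A (x) B in the (min, +) semiring:
    D[i, j] = min(D[i, j], min_k(A[i, k] + B[k, j]))."""
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

# Tiny APSP check: repeatedly squaring D in this semiring gives shortest paths.
INF = np.inf
A = np.array([[0, 3, INF, 7],
              [8, 0, 2, INF],
              [5, INF, 0, 1],
              [2, INF, INF, 0]], dtype=float)
D = A.copy()
for _ in range(2):          # ceil(log2(n-1)) squarings suffice for n = 4
    D = minplus(D, D, D)
print(D)                    # all-pairs shortest path distances
```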

Performance of 2.5D APSP using Kleene

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Figure: performance vs. number of nodes for 2D and 2.5D DC-APSP; annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: #Words_moved = Ω( w^3 / M^(1/2) ), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable, new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both are Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), iid
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

  #Words_moved = Ω( min( dn/P^(1/2), d^2 n/P ) )

  – Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: #Words_moved = Ω( d^2 n / (P M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry / DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
  for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
  for i1 = 1:b:n,  for j1 = 1:b:n,  for k1 = 1:b:n
    for i2 = 0:b-1,  for j2 = 0:b-1,  for k2 = 0:b-1
      i = i1+i2,  j = j1+j2,  k = k1+k2
      C(i,j) += A(i,k)*B(k,j)        … the inner three loops do a b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω( n^3 / M^(1/2) )
• Where does 1/2 come from?

New Thm applied to Matmul

• for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

        i  j  k
  Δ = [ 1  0  1 ]   A
      [ 0  1  1 ]   B
      [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: #words_moved = Ω( n^3 / M^(e-1) ) = Ω( n^3 / M^(1/2) )
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
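The small LP can be checked directly; a sketch with scipy.optimize.linprog (which minimizes, so the objective is negated), using the Δ for matmul above:

```python
import numpy as np
from scipy.optimize import linprog

# max 1^T x  s.t.  Delta @ x <= 1,  x >= 0   (matmul's index matrix Delta)
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
res = linprog(c=[-1, -1, -1], A_ub=Delta, b_ub=[1, 1, 1], bounds=(0, None))
print(res.x, -res.fun)          # -> [0.5 0.5 0.5] and e = 1.5
```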

New Thm applied to Direct N-Body

• for i = 1:n, for j = 1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

        i  j
  Δ = [ 1  0 ]   F
      [ 1  0 ]   P(i)
      [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: #words_moved = Ω( n^2 / M^(e-1) ) = Ω( n^2 / M^1 )
  Attained by block sizes M^xi, M^xj = M^1, M^1

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

[Figure: speedup vs. replication factor; 11.8x speedup]

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1 = 1:n, for i2 = 1:n, …, for i6 = 1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

        i1 i2 i3 i4 i5 i6
  Δ = [ 1  0  1  0  0  1 ]   A1
      [ 1  1  0  1  0  0 ]   A2
      [ 0  1  1  0  1  0 ]   A3
      [ 0  0  1  1  0  1 ]   A3, A4
      [ 0  1  0  0  0  1 ]   A5
      [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: #words_moved = Ω( n^6 / M^(e-1) ) = Ω( n^6 / M^(8/7) )
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
  for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3
      Access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
    C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
    D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k
      Access locations indexed by "projections", e.g.
        φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
        φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that

  #words_moved = Ω( #iterations / M^(e-1) )

  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej · rank(φj(H)) for all subgroups H < Z^k

• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
  • Given S subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
  • Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,

    |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has same solution, is decidable (but expensive, so far)
  – Thm (better news): Easy to write LP down explicitly in many cases of interest (e.g., when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A2x, …, Akx]

Simple block-row partitioning → (hyper)graph partitioning
Top-to-bottom processing → Traveling Salesman Problem
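For reference, here is a plain (communication-oblivious) computation of the kernel's output; the communication-avoiding versions produce exactly these vectors but block the work as sketched above. Tridiagonal A, n = 32, k = 3 as in the example.

```python
import numpy as np

def matrix_powers(A, x, k):
    """Return [A@x, A@A@x, ..., A^k @ x] -- the vectors the CA kernel produces."""
    vecs, v = [], x
    for _ in range(k):
        v = A @ v
        vecs.append(v)
    return vecs

n, k = 32, 3
# Tridiagonal example matrix, as in the slides.
A = np.diag(2 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
x = np.random.rand(n)
V = matrix_powers(A, x, k)
print([v.shape for v in V])    # three vectors: A·x, A²·x, A³·x
```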

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A2v, …, Akv ]
  [Q, R] = TSQR(W)              … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• Oops – W from power method, precision lost!

"Monomial" basis [Ax, …, Akx] fails to converge
Different polynomial basis [p1(A)x, …, pk(A)x] does converge
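A tiny experiment illustrating the "precision lost" remark: the condition number of the monomial Krylov basis W grows rapidly with k, which is why CA-GMRES switches to Newton or Chebyshev bases. (The column scaling below is only a sanity check, not the fix used in practice.)

```python
import numpy as np

np.random.seed(0)
n = 200
A = np.random.rand(n, n)
v = np.random.rand(n)

for k in (4, 8, 12, 16):
    # Monomial basis W = [v, Av, A^2 v, ..., A^k v]
    W = [v]
    for _ in range(k):
        W.append(A @ W[-1])
    W = np.column_stack(W)
    Wn = W / np.linalg.norm(W, axis=0)     # normalizing columns doesn't save it
    print(f"k={k:2d}  cond(W) = {np.linalg.cond(W):.2e}   "
          f"cond(scaled W) = {np.linalg.cond(Wn):.2e}")
```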

93

Speedups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

With Residual Replacement (RR), a la Van der Vorst and Ye:

                     Naive     Monomial                                Newton         Chebyshev
  Replacement Its.   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors

                                   Indices: Explicit (O(nnz))   Indices: Implicit (o(nnz))
  Nonzero entries: Explicit (O(nnz))   CSR and variations           Vision, climate, AMR, …
  Nonzero entries: Implicit (o(nnz))   Graph Laplacian              Stencils

• Operations
  • [x, Ax, A2x, …, Akx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A2x, …, Akx] and [y, A^T y, (A^T)2 y, …, (A^T)k y], or [y, A^T Ay, (A^T A)2 y, …, (A^T A)k y]
  • Return all vectors or just last one
• Cotuning and/or interleaving
  • W = [x, Ax, A2x, …, Akx] and {TSQR(W) or W^T W or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, summation order not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
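A one-liner showing the nonassociativity that motivates ReproBLAS: summing the same numbers in a different order gives a (slightly) different result, so a parallel reduction whose tree shape changes between runs is not bit-for-bit reproducible.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10**6) * 10.0**rng.integers(-8, 8, 10**6)  # wide dynamic range

s_forward = sum(x.tolist())            # left-to-right summation
s_reverse = sum(x[::-1].tolist())      # same numbers, reversed order
print(s_forward == s_reverse)          # typically False
print(s_forward - s_reverse)           # small but nonzero difference
```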

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

Page 2: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

2

Why avoid communication (12)

Algorithms have two costs (measured in time or energy)1 Arithmetic (FLOPS)2 Communication moving data between

ndash levels of a memory hierarchy (sequential case) ndash processors over a network (parallel case)

CPUCache

DRAM

CPUDRAM

CPUDRAM

CPUDRAM

CPUDRAM

Why avoid communication (23)

bull Running time of an algorithm is sum of 3 termsndash flops time_per_flopndash words moved bandwidthndash messages latency

3

communication

bull Time_per_flop ltlt 1 bandwidth ltlt latencybull Gaps growing exponentially with time [FOSC]

bull Avoid communication to save time

Annual improvements

Time_per_flop Bandwidth Latency

Network 26 15

DRAM 23 559

Why Minimize Communication (22)

Source John Shalf LBL

Why Minimize Communication (22)

Source John Shalf LBL

Minimize communication to save energy

Goals

6

bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels

bull L1 L2 DRAM network etc bull Attain lower bounds if possible

bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible

ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo

FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress

CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazaki hellipbull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle bull bebopcsberkeleyedu

Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc

bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA

MAGMAbull Large speed-ups possible

bull Autotuning to find optimal implementationbull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= +

C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= +

C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= +

C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

How hard is hand-tuning matmul anyway

22

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, sym. and nonsym. eigenproblems, SVD

Can we do Better?

Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
    • Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit c*n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid; Example: P = 32, c = 2]

2.5D Matrix Multiplication

• Assume can fit c*n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n*(c/P)^(1/2) x n*(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
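A hedged mpi4py sketch of steps (1)-(3), assuming the P = q*q*c ranks are laid out layer by layer, that c divides q, and an unshifted split of the SUMMA steps across layers; the block size nb and the data are illustrative, not from the slides.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    P, c = comm.Get_size(), 2
    q = int(round((P / c) ** 0.5))               # front face is q x q; assume q*q*c == P and c divides q
    k, rem = divmod(comm.Get_rank(), q * q)      # my layer k and position within it
    i, j = divmod(rem, q)

    fiber = comm.Split(color=i * q + j, key=k)   # the c ranks that will hold copies of block (i,j)
    row   = comm.Split(color=k * q + i, key=j)   # processor row i within layer k
    col   = comm.Split(color=k * q + j, key=i)   # processor column j within layer k

    nb = 128
    A = np.random.rand(nb, nb) if k == 0 else np.empty((nb, nb))
    B = np.random.rand(nb, nb) if k == 0 else np.empty((nb, nb))
    fiber.Bcast(A, root=0)                       # (1) replicate A(i,j), B(i,j) to all c layers
    fiber.Bcast(B, root=0)

    C = np.zeros((nb, nb))
    for m in range(k * (q // c), (k + 1) * (q // c)):   # (2) layer k does its 1/c-th of SUMMA
        Acol = A.copy() if m == j else np.empty_like(A)
        Brow = B.copy() if m == i else np.empty_like(B)
        row.Bcast(Acol, root=m)
        col.Bcast(Brow, root=m)
        C += Acol @ Brow

    Ctot = np.zeros_like(C)
    fiber.Reduce(C, Ctot, op=MPI.SUM, root=0)    # (3) sum-reduce along the k-axis onto layer 0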

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

[Plot: 2.5D vs 2D matmul performance; annotations: 12x faster, 2.7x faster]

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P*M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) * [ γT + βT/M^(1/2) + αT/(m*M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for the same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec, for leakage etc.
  – E(cP) = cP * { n^3/(cP) * [ γE + βE/M^(1/2) + αE/(m*M^(1/2)) ] + δE*M*T(cP) + εE*T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
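A tiny calculator for the model above, with made-up machine constants, just to check the claimed invariants T(cP) = T(P)/c and E(cP) = E(P):

    def model(n, P, c, M, m, gT, bT, aT, gE, bE, aE, dE, eE):
        flops = n**3 / (c * P)                       # flops per processor
        T = flops * (gT + bT / M**0.5 + aT / (m * M**0.5))
        E = c * P * (flops * (gE + bE / M**0.5 + aE / (m * M**0.5)) + dE * M * T + eE * T)
        return T, E

    args = dict(M=2**20, m=2**10, gT=1e-9, bT=1e-8, aT=1e-6,
                gE=1e-9, bE=1e-8, aE=1e-6, dE=1e-10, eE=1e-3)   # illustrative constants only
    T1, E1 = model(4096, P=64, c=1, **args)
    T4, E4 = model(4096, P=64, c=4, **args)
    print(T1 / T4, E4 / E1)   # -> 4.0 and 1.0: perfect strong scaling in time, constant energy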

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of the P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time? (see the sketch after this list)
  – Ti = Fi*γi + Fi*βi/Mi^(1/2) + Fi*αi/Mi^(3/2) = Fi*[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi*ξi
  – Choose Fi so that Σi Fi = n^3 and T = maxi Ti is minimized
  – Answer: Fi = n^3*(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
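A small sketch of the optimal work split above, with made-up per-processor parameters (two fast processors and one slow one):

    def optimal_split(n, gammas, betas, alphas, mems):
        # xi_i = gamma_i + beta_i/M_i^(1/2) + alpha_i/M_i^(3/2); F_i is proportional to 1/xi_i.
        xis = [g + b / M**0.5 + a / M**1.5 for g, b, a, M in zip(gammas, betas, alphas, mems)]
        denom = sum(1.0 / xi for xi in xis)
        F = [n**3 * (1.0 / xi) / denom for xi in xis]
        T = n**3 / denom
        return F, T

    F, T = optimal_split(1024,
                         gammas=[1e-9, 1e-9, 4e-9],
                         betas=[1e-8, 1e-8, 1e-8],
                         alphas=[1e-6, 1e-6, 1e-6],
                         mems=[2**20, 2**20, 2**20])
    # Every processor finishes at the same time: F[i] * xi_i == T for all i.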

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)*B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σm A(i,j,m)*B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)*B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
  – Solomonik, Hammond, Matthews

TSQR: QR of a Tall, Skinny matrix

45

W = [ W0; W1; W2; W3 ]
  = [ Q00*R00; Q10*R10; Q20*R20; Q30*R30 ]
  = diag(Q00, Q10, Q20, Q30) * [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01*R01; Q11*R11 ] = diag(Q01, Q11) * [ R01; R11 ]

[ R01; R11 ] = Q02*R02

TSQR: QR of a Tall, Skinny matrix

46

W = [ W0; W1; W2; W3 ]
  = [ Q00*R00; Q10*R10; Q20*R20; Q30*R30 ]
  = diag(Q00, Q10, Q20, Q30) * [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01*R01; Q11*R11 ] = diag(Q01, Q11) * [ R01; R11 ]

[ R01; R11 ] = Q02*R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
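A sequential NumPy sketch of the binary-tree reduction shown above: each block is factored locally, then pairs of R factors are stacked and re-factored until one R remains (the Q factors, which a real TSQR keeps in implicit form, are simply discarded here).

    import numpy as np

    def tsqr_r(W, nblocks=4):
        Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks, axis=0)]
        while len(Rs) > 1:                        # combine pairs of R's up the tree
            Rs = [np.linalg.qr(np.vstack(pair), mode='r')
                  for pair in zip(Rs[0::2], Rs[1::2])]
        return Rs[0]                              # the final R02

    W = np.random.rand(1 << 12, 50)               # tall and skinny
    R = tsqr_r(W)
    R_ref = np.linalg.qr(W, mode='r')             # direct QR for comparison
    print(np.allclose(np.abs(R), np.abs(R_ref)))  # equal up to signs of rows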

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees on W = [W0; W1; W2; W3]
  Parallel (binary tree): (W0,W1,W2,W3) -> (R00,R10,R20,R30) -> (R01,R11) -> R02
  Sequential (flat tree): W0 -> R00, then fold in W1, W2, W3 one at a time: R01 -> R02 -> R03
  Dual core: a hybrid of the two trees]

Can choose the reduction tree dynamically
• Multicore, multisocket, multirack, multisite, out-of-core: same idea

TSQR Performance Results
• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 cores, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

48

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"

49

W (n x b) = [ W1; W2; W3; W4 ] = [ P1*L1*U1; P2*L2*U2; P3*L3*U3; P4*L4*U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1'; W2' ] = P12*L12*U12   – choose b pivot rows, call them W12'
[ W3'; W4' ] = P34*L34*U34   – choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234*L1234*U1234   – choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to the top, do LU without pivoting
  • Extra work, but a lower-order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
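A sequential SciPy sketch of one tournament (a real TSLU runs the leaf factorizations in parallel and picks its own reduction tree); the block counts and data are illustrative.

    import numpy as np
    from scipy.linalg import lu

    def choose_pivots(rows, W, b):
        # GEPP on the candidate rows; return the global indices of its first b pivot rows.
        P, L, U = lu(W[rows, :])                  # W[rows,:] = P @ L @ U
        return rows[np.argmax(P, axis=0)[:b]]     # rows chosen as pivots 1..b

    def tournament_pivoting(W, b, nblocks=4):
        cands = [choose_pivots(r, W, b)
                 for r in np.array_split(np.arange(W.shape[0]), nblocks)]
        while len(cands) > 1:                     # pairwise rounds up the tree
            cands = [choose_pivots(np.concatenate(pair), W, b)
                     for pair in zip(cands[0::2], cands[1::2])]
        return cands[0]

    W = np.random.rand(1 << 10, 8)
    pivots = tournament_pivoting(W, b=8)
    # These b rows are moved to the top of W, and LU of the panel proceeds without pivoting.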

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination:
2D CA-LU vs ScaLAPACK-LU

[Plot: log2(p) on the horizontal axis, log2(n^2/p) = log2(memory_per_proc) on the vertical axis; up to 29x faster]

Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A*P = Q*R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
  words_moved = Ω( M*(n/M^(1/2))^3 / P )

Strassen's O(n^lg7) matmul:
  words_moved = Ω( M*(n/M^(1/2))^lg7 / P )

Strassen-like O(n^ω) matmul:
  words_moved = Ω( M*(n/M^(1/2))^ω / P )

vs

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  if EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

Communication Avoiding Parallel Strassen (CAPS)

In practice, how best to interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24% - 184% (over previous Strassen-based algorithms)

Invited to appear as a Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A -> Q*A*Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for the SVD of a band matrix
• Use alone, or as the second phase when A is dense:
  – Dense -> Banded -> Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA - SBR

[Figure: sweep patterns for Conventional (left) vs Communication-Avoiding (right) symmetric band reduction]

Many tuning parameters; the right choices reduce #words_moved by a factor of M/bw, not just M^(1/2)

Conventional: touch all data 4 times.   Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
     for i = 1:n
      for j = 1:n
       D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder the outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
     D = A
     Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
     D11 = DC-APSP(D11, n/2)
     D12 = D11 ⊗ D12
     D21 = D21 ⊗ D11
     D22 = D21 ⊗ D12
     D22 = DC-APSP(D22, n/2)
     D21 = D22 ⊗ D21
     D12 = D12 ⊗ D22
     D11 = D12 ⊗ D21

61
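A NumPy sketch of both pieces above – the triply nested Floyd-Warshall loop (vectorized over i,j) and the divide-and-conquer Kleene recursion – where the ⊗ product also takes the min with the existing left-hand side, ∞ marks a missing edge, and the sizes and data are illustrative.

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) )
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D.copy()
        h = n // 2
        D11, D12 = D[:h, :h].copy(), D[:h, h:].copy()
        D21, D22 = D[h:, :h].copy(), D[h:, h:].copy()
        D11 = dc_apsp(D11)
        D12 = minplus(D12, D11, D12)
        D21 = minplus(D21, D21, D11)
        D22 = minplus(D22, D21, D12)
        D22 = dc_apsp(D22)
        D21 = minplus(D21, D22, D21)
        D12 = minplus(D12, D12, D22)
        D11 = minplus(D11, D12, D21)
        return np.block([[D11, D12], [D21, D22]])

    n = 8
    rng = np.random.default_rng(0)
    A = np.where(rng.random((n, n)) < 0.4, rng.random((n, n)), np.inf)
    np.fill_diagonal(A, 0.0)

    D_ref = A.copy()                      # Floyd-Warshall reference
    for k in range(n):
        D_ref = np.minimum(D_ref, D_ref[:, k:k+1] + D_ref[k:k+1, :])

    print(np.allclose(dc_apsp(A), D_ref))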

Performance of 2.5D APSP using Kleene

62

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives an optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

63

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; a new one:
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω( min( d*n/P^(1/2), d^2*n/P ) )
  – Proof exploits the fact that reuse of entries of C = A*B is unlikely
• Contrast the general lower bound: #Words_moved = Ω( d^2*n/(P*M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)        ... b x b matmul

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record the array indices in matrix Δ:

             i  j  k
           [ 1  0  1 ]  A
       Δ = [ 0  1  1 ]  B
           [ 1  1  0 ]  C

• Solve the LP for x = [xi, xj, xk]^T: max 1^T*x  s.t.  Δ*x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T*x = 3/2 = e
• Thm: words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
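A quick check of this LP with SciPy (linprog minimizes, so the objective is negated); swapping in the Δ from the N-body or "random code" slides below reproduces their exponents e as well.

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)

    # maximize 1^T x  subject to  Delta @ x <= 1,  x >= 0
    res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
    print(res.x, res.x.sum())   # -> [0.5 0.5 0.5], e = 1.5, so words_moved = Ω(n^3 / M^(1/2))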

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record the array indices in matrix Δ:

             i  j
           [ 1  0 ]  F
       Δ = [ 1  0 ]  P(i)
           [ 0  1 ]  P(j)

• Solve the LP for x = [xi, xj]^T: max 1^T*x  s.t.  Δ*x ≤ 1
  – Result: x = [1, 1], 1^T*x = 2 = e
• Thm: words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

[Plot: speedup vs replication factor; up to 11.8x speedup]

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record the array indices in matrix Δ:

             i1 i2 i3 i4 i5 i6
           [ 1  0  1  0  0  1 ]  A1
           [ 1  1  0  1  0  0 ]  A2
       Δ = [ 0  1  1  0  1  0 ]  A3
           [ 0  0  1  1  0  1 ]  A3, A4
           [ 0  1  0  0  0  1 ]  A5
           [ 1  0  0  1  1  0 ]  A6

• Solve the LP for x = [x1, …, x6]^T: max 1^T*x  s.t.  Δ*x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T*x = 15/7 = e
• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that
    words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej * rank(φj(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S subset of Z^k, group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:
    |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive, so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", database join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy, by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A*x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A

[Figure: animation frames over the unknowns 1, 2, 3, 4, …, 32, filling in the levels x, A*x, A^2*x, A^3*x]

• Sequential Algorithm
  [Frames: Step 1, Step 2, Step 3, Step 4 sweep across the unknowns]

• Parallel Algorithm
  • Each processor communicates once with its neighbors
  • Each processor works on an (overlapping) trapezoid
  [Frames: Proc 1, Proc 2, Proc 3, Proc 4 each own a contiguous block of unknowns]

Same idea works for general sparse matrices:
  Simple block-row partitioning -> (hyper)graph partitioning
  Top-to-bottom processing -> Traveling Salesman Problem
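A small NumPy sketch of the parallel idea for the tridiagonal example: each simulated processor gathers its block of x plus k ghost entries per side once (one message per neighbor), then builds its pieces of A*x, A^2*x, A^3*x with no further communication. A periodic (circulant) tridiagonal A is assumed so the ghost-zone logic stays short; names and sizes are illustrative.

    import numpy as np

    def local_powers(xg, k):
        # xg: my block plus k ghost entries per side; one [1, -2, 1] stencil sweep per level.
        levels = [xg]
        for _ in range(k):
            prev = levels[-1]
            levels.append(prev[:-2] - 2.0 * prev[1:-1] + prev[2:])
        return levels

    n, k, p = 32, 3, 4
    x = np.random.rand(n)

    blocks = []
    for r in range(p):                                # one trapezoid per simulated processor
        lo, hi = r * n // p, (r + 1) * n // p
        ghosts = np.arange(lo - k, hi + k) % n        # my block plus k ghosts per side
        levels = local_powers(x[ghosts], k)
        blocks.append([lvl[k - j: lvl.shape[0] - (k - j)]   # keep only my entries of A^j x
                       for j, lvl in enumerate(levels)])

    A = -2.0 * np.eye(n) + np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
    for j in range(k + 1):                            # compare against explicit SpMVs
        yj = np.concatenate([b[j] for b in blocks])
        assert np.allclose(yj, np.linalg.matrix_power(A, j) @ x)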

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A * v(i-1)              ... SpMV
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, A*v, A^2*v, …, A^k*v ]
  [Q,R] = TSQR(W)               ... "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

92

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^k x] fails to converge
A different polynomial basis [p1(A)x, …, pk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

With Residual Replacement (RR), a la Van der Vorst and Ye:

  Basis:              Naive      Monomial                                Newton        Chebyshev
  Replacement Its:    74 (1)     [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

                                  Indices:
  Nonzero entries:                Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))               CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))               Graph Laplacian        Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^T y, (A^T)^2 y, …, (A^T)^k y], or [y, A^T A y, (A^T A)^2 y, …, (A^T A)^k y]
  • return all vectors or just the last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^T*W or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative; summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
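A two-line illustration of the nonassociativity ReproBLAS is designed to hide (plain IEEE double precision; this is not ReproBLAS code):

    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False: fp addition is not associative

    a = [1.0, 1e100, 1.0, -1e100]
    print(sum(a), sum(reversed(a)))                 # 0.0 vs 1.0: the answer depends on summation order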

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record the array indices in a matrix Δ:

              i  j  k
              1  0  1      A
        Δ =   0  1  1      B
              1  1  0      C

• Solve the LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
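The little LP can be checked mechanically; here is a sketch using scipy.optimize.linprog (my own illustrative code, not part of any lower-bound tool):

   import numpy as np
   from scipy.optimize import linprog

   # Rows of Delta record which loop indices (i, j, k) each array reference touches:
   # A(i,k), B(k,j), C(i,j)
   Delta = np.array([[1, 0, 1],
                     [0, 1, 1],
                     [1, 1, 0]])

   # maximize 1^T x  s.t.  Delta x <= 1, x >= 0   (linprog minimizes, so negate the objective)
   res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
   x = res.x        # -> [0.5, 0.5, 0.5]
   e = x.sum()      # -> 1.5, so words_moved = Omega(n^3 / M^(e-1)) = Omega(n^3 / M^(1/2))

Plugging in the Δ matrices from the n-body and "random code" slides below reproduces e = 2 and e = 15/7 in the same way.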

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record the array indices in a matrix Δ:

              i  j
              1  0      F
        Δ =   1  0      P(i)
              0  1      P(j)

• Solve the LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ·x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
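And a sketch of the corresponding blocked loop order that attains the bound (illustrative Python for 1-D positions; force is assumed to be a user-supplied, NumPy-vectorized pairwise force, so both names here are hypothetical):

   import numpy as np

   def blocked_forces(P, force, M):
       # F(i) += sum_j force(P(i), P(j)), processed in M-by-M tiles of (i, j) pairs
       n = len(P)
       F = np.zeros_like(P)
       for i0 in range(0, n, M):           # load one tile of "targets" ...
           Pi = P[i0:i0+M]
           for j0 in range(0, n, M):       # ... and stream all tiles of "sources" past it
               Pj = P[j0:j0+M]
               F[i0:i0+M] += force(Pi[:, None], Pj[None, :]).sum(axis=1)
       return F

Each tile of roughly M targets is reused against all n/M source tiles, which is what brings the traffic down to the Ω(n^2/M) words of the bound.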

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
(Plot: up to 11.8x speedup)
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
     A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
     A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record the array indices in a matrix Δ:

              i1 i2 i3 i4 i5 i6
              1  0  1  0  0  1      A1
              1  1  0  1  0  0      A2
        Δ =   0  1  1  0  1  0      A3
              0  0  1  1  0  1      A3, A4
              0  1  0  0  0  1      A5
              1  0  0  1  1  0      A6

• Solve the LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
     for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3,
     access locations indexed by (i,j), (i,k), (k,j)
• General case:
     for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
        C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
        D(something else) = func(something else)
        …
  => for (i1,i2,…,ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
        φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
        φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on the #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array references given by projections φ_j, there is an e ≥ 1 such that
     words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
     minimize e = Σ_j e_j subject to
     rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
  given S ⊆ Z^k and group homomorphisms φ_1, φ_2, …, bound |S| in terms of |φ_1(S)|, |φ_2(S)|, …, |φ_m(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s_1, …, s_m,
     |S| ≤ Π_j |φ_j(S)|^(s_j)
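As a worked instance (my own paraphrase of how the theorem gets used), take matmul with the exponents s = (1/2, 1/2, 1/2) found earlier. In any segment of the execution that moves O(M) words, each image φ_A(S), φ_B(S), φ_C(S) has size O(M), so the theorem caps the number of iterations a segment can perform, and summing over segments recovers the classical bound:

   |S| ≤ ( |φ_A(S)| · |φ_B(S)| · |φ_C(S)| )^(1/2) = O(M^(3/2))
   =>  words_moved ≥ M · n^3 / O(M^(3/2)) = Ω( n^3 / M^(1/2) )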

Is this bound attainable? (1/2)

• But first: can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g., when the subscripts are subsets of the indices)
  – Also easy to get upper/lower bounds on e
    • Tarski-decidable to get a superset of the constraints (may make s_HBL too large)

Is this bound attainable? (2/2)

• Depends on the loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of the indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies) – work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3; works for any "well-partitioned" A
  (Figure, shown over several animation steps: the entries of x, A·x, A^2·x, A^3·x for indices 1, 2, 3, 4, …, 32, with the dependencies needed at each level highlighted)
• Sequential algorithm: sweep across the dependency diagram in Steps 1, 2, 3, 4, computing all k levels for one block of columns before moving on, so the matrix and vectors are read from slow memory only O(1) times
• Parallel algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a contiguous block of columns
  – Each processor communicates once with its neighbors
  – Each processor works on an (overlapping) trapezoid of the diagram
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
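To make the overlap idea concrete, here is a minimal sketch for the tridiagonal example (illustrative NumPy code with made-up names, not the actual kernel from the papers; domain boundaries are ignored for simplicity). Each processor receives its chunk of x extended by k ghost entries on each side in a single neighbor exchange, then computes its pieces of Ax, …, A^kx with no further communication:

   import numpy as np

   def local_matrix_powers(x_ext, k, lo, diag, up):
       # This processor's pieces of [Ax, A^2x, ..., A^kx] for a tridiagonal A with
       # constant diagonals (lo, diag, up), given its chunk of x extended by k ghost
       # entries on each side (obtained in ONE exchange with the neighboring processors).
       v, out = x_ext, []
       for _ in range(k):
           v = lo * v[:-2] + diag * v[1:-1] + up * v[2:]   # one local application of A
           out.append(v)                                   # valid region shrinks by 1 per side
       return out   # with k ghosts, out[k-1] covers exactly the rows this processor owns

   # e.g. a chunk of 8 owned entries, k = 3 ghosts per side, A = tridiag(-1, 2, -1):
   Ax, A2x, A3x = local_matrix_powers(np.arange(14.0), 3, -1.0, 2.0, -1.0)

The entries recomputed inside the overlapping ghost regions are the "redundant computation" price noted earlier; for a general sparse A, the ghost region is the k-hop neighborhood of the owned rows in the graph of A.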

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)               … SpMV
      MGS(w, v(0), …, v(i-1))      … Modified Gram-Schmidt; update v(i), H
   endfor
   solve LSQ problem with H

Communication-avoiding GMRES:
   W = [ v, Av, A^2v, …, A^kv ]
   [Q,R] = TSQR(W)                 … "Tall Skinny QR"
   build H from R
   solve LSQ problem with H

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• Oops – W is from the power method, so precision is lost!
92
(Convergence plot:) The "monomial" basis [Ax, …, A^kx] fails to converge; a different polynomial basis [p1(A)x, …, pk(A)x] does converge
93
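For illustration, here is a toy sketch of the two kernels CA-GMRES combines – the matrix powers kernel and TSQR (plain NumPy, my own simplified code, monomial basis only; as the note above says, a practical implementation would use a Newton or Chebyshev basis, and would go on to build the Hessenberg matrix H from R):

   import numpy as np

   def matrix_powers(A, v, k):
       # W = [v, Av, ..., A^k v]  (monomial basis; p_j(A)v would be used in practice)
       W = [v]
       for _ in range(k):
           W.append(A @ W[-1])
       return np.column_stack(W)

   def tsqr(W, nblocks=4):
       # "Tall Skinny QR" via a one-level reduction tree: QR each row block,
       # stack the small R factors, and QR the stack.  Only R is returned here.
       blocks = np.array_split(W, nblocks, axis=0)
       Rs = [np.linalg.qr(B, mode='r') for B in blocks]
       return np.linalg.qr(np.vstack(Rs), mode='r')

   # Example: R has the same (k+1)-by-(k+1) shape a sequential Householder QR would give
   n, k = 1000, 5
   A = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
   R = tsqr(matrix_powers(A, np.random.rand(n), k))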

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
94

Requires Co-tuning Kernels
95

CA-BiCGStab with Residual Replacement (RR), à la Van der Vorst and Ye

   Basis:             Naive     Monomial                               Newton         Chebyshev
   Replacement Its:   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

                                          Indices
                              Explicit (O(nnz))        Implicit (o(nnz))
   Nonzero entries:
     Explicit (O(nnz))        CSR and variations       Vision, climate, AMR, …
     Implicit (o(nnz))        Graph Laplacian          Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with the vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Co-tuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and TSQR(W) or W^TW or …
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative; the summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
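A two-line illustration of the nonassociativity that motivates this (plain Python; ReproBLAS itself uses a binned, pre-rounded summation, which is not shown here):

   # The same three summands, added in two orders, give two different answers:
   a, b, c = 1.0, 1e16, -1e16
   print((a + b) + c)   # 0.0  -- a is absorbed into b before c cancels it
   print(a + (b + c))   # 1.0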

Summary

Don't Communic…
102

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

Page 4: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Why Minimize Communication (22)

Source John Shalf LBL

Why Minimize Communication (22)

Source John Shalf LBL

Minimize communication to save energy

Goals

6

bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels

bull L1 L2 DRAM network etc bull Attain lower bounds if possible

bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible

ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo

FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress

CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazaki hellipbull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle bull bebopcsberkeleyedu

Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc

bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA

MAGMAbull Large speed-ups possible

bull Autotuning to find optimal implementationbull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= +

C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= +

C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= +

C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

How hard is hand-tuning matmul anyway

22

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 5: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Why Minimize Communication (22)

Source John Shalf LBL

Minimize communication to save energy

Goals

6

• Redesign algorithms to avoid communication
  • Between all memory hierarchy levels
    • L1 ↔ L2 ↔ DRAM ↔ network, etc.
• Attain lower bounds if possible
  • Current algorithms often far from lower bounds
  • Large speedups and energy savings possible

"New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems."

FY 2010 Congressional Budget, Volume 4, FY 2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress

CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); "Tall-Skinny" QR (Grigori, Hoemmen, Langou, JD)

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki, …
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary of CA Algorithms
• "Direct" Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax = λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: ScaLAPACK, PLASMA, MAGMA
    • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)

12

• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

  #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )

• Parallel case: assume either load or memory balanced

Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)

13

• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

  #messages_sent ≥ #words_moved / largest_message_size

• Parallel case: assume either load or memory balanced

Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)

14

• Let M = "fast" memory size (per processor)

  #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

  #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )

• Parallel case: assume either load or memory balanced

SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz

Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply

15

16

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) is updated by the product of row A(i,:) and column B(:,j)]

17

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

[Figure: C(i,j) is updated by the product of row A(i,:) and column B(:,j)]

18

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}             … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}               … n^2 reads altogether
    {read column j of B into fast memory}        … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}           … n^2 writes altogether

[Figure: C(i,j) is updated by the product of row A(i,:) and column B(:,j)]

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
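A minimal executable version of the triple loop above, just to make the counting concrete (Python/NumPy; the function name and sizes are my own, not from the slides):

import numpy as np

def naive_matmul(A, B, C):
    # C = C + A*B: the i,j,k loop whose slow-memory traffic is counted above
    n = A.shape[0]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

n = 64
A = np.random.rand(n, n); B = np.random.rand(n, n)
assert np.allclose(naive_matmul(A, B, np.zeros((n, n))), A @ B)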

19

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}
    for k = 1 to n/b
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

[Figure: b-by-b block C(i,j) is updated by block row A(i,:) times block column B(:,j)]

20

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}         … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}       … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}       … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}     … b^2 × (n/b)^2 = n^2 writes

[Figure: b-by-b block C(i,j) is updated by block row A(i,:) times block column B(:,j)]

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
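A sketch of the same tiled loop in Python, with b chosen as on the next slide (3b^2 ≤ M); the fast-memory size M here is invented for illustration:

import numpy as np

def blocked_matmul(A, B, C, b):
    # C = C + A*B with b-by-b tiles; with 3*b*b <= M this incurs the
    # 2*n^3/b + 2*n^2 reads/writes counted above
    n = A.shape[0]
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]                 # view: updates C in place
            for k in range(0, n, b):
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n, M = 256, 3 * 32 * 32          # pretend fast memory holds three 32x32 blocks
b = int((M / 3) ** 0.5)
A = np.random.rand(n, n); B = np.random.rand(n, n)
assert np.allclose(blocked_matmul(A, B, np.zeros((n, n)), b), A @ B)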

Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 2n^3/(M/3)^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

21

How hard is hand-tuning matmul, anyway?

22

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each Aij etc. is 1x1 or n/2 x n/2

24

func C = RMM (A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

Recursive Matrix Multiplication (RMM) (2/2)

25

func C = RMM (A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM( . , . , n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM( . , . , n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not panacea
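A compact Python version of the RMM recursion above (a sketch; the power-of-two restriction matches the slide, and the halving point h is my notation):

import numpy as np

def rmm(A, B):
    # Recursive (cache-oblivious) matmul: same O(n^3/M^(1/2) + n^2) traffic
    # as blocked matmul, but with no explicit block size b
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    C = np.empty((n, n))
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])
    return C

n = 2 ** 5                          # n = 2^m, as on the slide
A = np.random.rand(n, n); B = np.random.rand(n, n)
assert np.allclose(rmm(A, B), A @ B)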

CARMA Performance: Shared Memory

[Plot (log and linear scales): square case, m = k = n; series: MKL (double), CARMA (double), MKL (single), CARMA (single), with single- and double-precision peak lines]

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Plot (log and linear scales): inner-product-shaped case, m = n = 64; series: MKL (double), CARMA (double), MKL (single), CARMA (single)]

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster? L3 Cache Misses

Shared Memory Inner Product (m = n = 64; k = 524,288)

[Bar chart (linear scale): CARMA incurs 97% fewer misses in one configuration and 86% fewer misses in the other, relative to MKL]

Parallel MatMul with 2D Processor Layout

• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij

C = A * B

[Figure: three 4x4 grids of processors P00 … P33, one each for C, A and B]

30

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96,100.ps
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

31

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

• C(i, j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be square processor grid

[Figure: C(i,j) accumulates the products of block column A(i,k) and block row B(k,j) over k]

32

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

[Figure: same block picture as above, with the broadcast pieces Acol and Brow highlighted]

For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^(1/2)
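A serial emulation of the SUMMA loop above in Python (no MPI; the "broadcasts" are modeled by slicing, and the grid/tile sizes are invented for illustration):

import numpy as np

def summa_emulated(A, B, pgrid, b):
    # Emulate SUMMA on a pgrid x pgrid processor grid: for each panel k,
    # a block column of A and block row of B are "broadcast", and every
    # processor (i,j) updates its own C tile with Acol(i) * Brow(j)
    n = A.shape[0]
    t = n // pgrid                          # tile size owned by each processor
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k+b]                  # broadcast along processor rows
        Brow = B[k:k+b, :]                  # broadcast along processor columns
        for i in range(pgrid):
            for j in range(pgrid):
                C[i*t:(i+1)*t, j*t:(j+1)*t] += (
                    Acol[i*t:(i+1)*t, :] @ Brow[:, j*t:(j+1)*t])
    return C

n, pgrid, b = 64, 4, 8
A = np.random.rand(n, n); B = np.random.rand(n, n)
assert np.allclose(summa_emulated(A, B, pgrid, b), A @ B)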

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. Eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD

Can we do Better

Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul": uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
• Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: processor grid with dimensions (P/c)^(1/2) x (P/c)^(1/2) x c]

Example: P = 32, c = 2

2.5D Matrix Multiplication

• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid (axes i, j, k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

[Plot annotations: 12x faster; 2.7x faster]

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
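A quick numerical check of the model above (a sketch; all hardware constants below are made up purely to verify that T(cP) = T(P)/c and E(cP) = E(P) under the stated assumptions):

gT, bT, aT = 1e-11, 1e-9, 1e-6      # secs per flop, per word moved, per message
gE, bE, aE = 1e-10, 1e-8, 1e-5      # joules for the same operations
dE, eE = 1e-12, 1.0                 # joules per word of memory per sec; leakage
n, m = 10_000, 1_000
P0 = 3 * n**2 // (64 * 10**6)       # minimal P, assuming ~64M words/processor
M = 3 * n**2 / P0                   # so that P*M = 3n^2, as on the slide

def T(P): return n**3 / P * (gT + bT / M**0.5 + aT / (m * M**0.5))
def E(P): return P * (n**3 / P * (gE + bE / M**0.5 + aE / (m * M**0.5))
                      + dE * M * T(P) + eE * T(P))

for c in (1, 2, 4):
    print(c, c * T(c * P0) / T(P0), E(c * P0) / E(P0))   # both ratios are 1.0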

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n^3 and minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal Algorithm for nxn matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
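The optimal split formula above is easy to evaluate directly; a small sketch (the per-processor parameters are invented for illustration):

n = 4096
gamma = [1e-11, 2e-11, 5e-11]             # secs per flop, per processor
beta  = [1e-9,  2e-9,  4e-9]              # secs per word moved
alpha = [1e-6,  1e-6,  2e-6]              # secs per message
M     = [1e8,   5e7,   2e7]               # fast memory sizes (words)

xi = [g + b / m**0.5 + a / m**1.5 for g, b, a, m in zip(gamma, beta, alpha, M)]
inv_sum = sum(1 / x for x in xi)
F = [n**3 * (1 / x) / inv_sum for x in xi]  # flops assigned to each processor
T = n**3 / inv_sum                          # the common completion time
print(F, T, [f * x for f, x in zip(F, xi)]) # each F_i * xi_i equals T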

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σm A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

TSQR: QR of a Tall, Skinny matrix

45

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

TSQR: QR of a Tall, Skinny matrix

46

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
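A small Python sketch of the binary-tree reduction above, returning only the final R factor (the function name and block count are mine; a real TSQR would also keep the implicit Q factors listed in the output set):

import numpy as np

def tsqr_R(W, p):
    # QR-factor each of the p row blocks, stack the small R factors,
    # and recurse up the tree until a single R remains
    blocks = np.array_split(W, p, axis=0)
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]
    stacked = np.vstack(Rs)
    if p == 1:
        return np.triu(stacked[:W.shape[1], :])
    return tsqr_R(stacked, p // 2)

m, ncols, p = 10000, 50, 4
W = np.random.rand(m, ncols)
R = tsqr_R(W, p)
R_ref = np.linalg.qr(W, mode='r')
assert np.allclose(np.abs(R), np.abs(R_ref))   # R agrees up to row signs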

TSQR: An Architecture-Dependent Algorithm

[Figure: three reduction trees over W = [W0; W1; W2; W3].
 Parallel: binary tree, {R00, R10, R20, R30} → {R01, R11} → R02.
 Sequential: flat (serial) tree, R00 → R01 → R02 → R03.
 Dual Core: a hybrid of the two trees.]

Can choose reduction tree dynamically

Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?

TSQR Performance Results
• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

48

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

49

W (n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1'; W2' ] = P12·L12·U12   → choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34   → choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234   → choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
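A sketch of the tournament above in Python, using Gaussian elimination with partial pivoting on each block to pick pivot-row candidates (function names, block count and sizes are mine):

import numpy as np
from scipy.linalg import lu

def choose_pivot_rows(W, b):
    # Rows of W reordered by LU with partial pivoting; the first b rows of
    # P^T W are the b pivot rows GEPP would select
    P, L, U = lu(W)
    return (P.T @ W)[:b, :]

def tournament_pivot_rows(W, b, p):
    # Select b candidates from each of p blocks, then reduce pairwise
    cands = [choose_pivot_rows(Wi, b) for Wi in np.array_split(W, p, axis=0)]
    while len(cands) > 1:
        cands = [choose_pivot_rows(np.vstack(cands[i:i+2]), b)
                 for i in range(0, len(cands), 2)]
    return cands[0]    # b rows to move to the top before LU without pivoting

W = np.random.rand(1024, 8)
print(tournament_pivot_rows(W, b=8, p=4).shape)    # (8, 8)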

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); up to 29x faster]

Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading rxr submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11): Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^3 / P )

Strassen's O(n^lg7) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )

Strassen-like O(n^ω) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^ω / P )

vs.

[BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory.
 DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory.]

CAPS: If EnoughMemory and P ≥ 7, then BFS step, else DFS step, end if

Communication Avoiding Parallel Strassen (CAPS)

In practice, how to best interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!

Conventional vs CA - SBR

[Animation: Conventional vs Communication-Avoiding band reduction]

Many tuning parameters; right choices reduce #words_moved by a factor of M/bw, not just M^(1/2)

Conventional: touch all data 4 times.  CA-SBR: touch all data once.

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), mink( A(i,k)+B(k,j) ) ) by D = A*B (min-plus product)
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

61

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11*D12
    D21 = D21*D11
    D22 = D21*D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22*D21
    D12 = D12*D22
    D11 = D12*D21
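A serial Python sketch of the recursion above on the min-plus semiring, checked against Floyd-Warshall (the 2.5D version distributes these same block operations; names are mine):

import numpy as np

def semiring_update(D, A, B):
    # D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) ) -- the "D = A*B" of the slide
    return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

def dc_apsp(D):
    n = D.shape[0]
    if n == 1:
        return D
    h = n // 2
    D = D.copy()
    D[:h, :h] = dc_apsp(D[:h, :h])
    D[:h, h:] = semiring_update(D[:h, h:], D[:h, :h], D[:h, h:])
    D[h:, :h] = semiring_update(D[h:, :h], D[h:, :h], D[:h, :h])
    D[h:, h:] = semiring_update(D[h:, h:], D[h:, :h], D[:h, h:])
    D[h:, h:] = dc_apsp(D[h:, h:])
    D[h:, :h] = semiring_update(D[h:, :h], D[h:, h:], D[h:, :h])
    D[:h, h:] = semiring_update(D[:h, h:], D[:h, h:], D[h:, h:])
    D[:h, :h] = semiring_update(D[:h, :h], D[:h, h:], D[h:, :h])
    return D

n = 16
G = np.random.default_rng(0).uniform(1, 10, (n, n)); np.fill_diagonal(G, 0)
R = G.copy()
for k in range(n):                                   # Floyd-Warshall reference
    R = np.minimum(R, R[:, [k]] + R[[k], :])
assert np.allclose(dc_apsp(G), R)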

Performance of 2.5D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

[Plot annotations: 6.2x speedup; 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

63

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; new one?
• Ex: A*B, both diagonal: no communication in parallel case
• Ex: A*B, both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: #Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i = i1+i2, j = j1+j2, k = k1+k2
      C(i,j) += A(i,k)*B(k,j)          … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

           i  j  k
      Δ = [1  0  1]   A
          [0  1  1]   B
          [1  1  0]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: #words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2)). Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

           i  j
      Δ = [1  0]   F
          [1  0]   P(i)
          [0  1]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: #words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1). Attained by block sizes M^xi, M^xj = M^1, M^1
N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

           i1 i2 i3 i4 i5 i6
      Δ = [1  0  1  0  0  1]   A1
          [1  1  0  1  0  0]   A2
          [0  1  1  0  1  0]   A3
          [0  0  1  1  0  1]   A3, A4
          [0  1  0  0  0  1]   A5
          [1  0  0  1  1  0]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: #words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7)). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
  for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
    C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
    D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k,
    access locations indexed by "projections", e.g.
      φC(i1,i2,…,ik) = (i1+2*i3-i7)
      φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φC(S), φA(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that
    #words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej·rank(φj(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
  • Given S ⊆ Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
  • Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
      |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive, so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k-steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3

[Figure, repeated over several animation frames: a grid with columns 1, 2, 3, 4, …, 32 and rows x, A·x, A^2·x, A^3·x, showing which entries each computed entry depends on]

• Works for any "well-partitioned" A
• Sequential Algorithm: process the columns in blocks, left to right (Steps 1, 2, 3, 4); each block of A and of the vectors moves from slow to fast memory once
• Parallel Algorithm: Procs 1–4 each own a block of columns; each processor communicates once with its neighbors, then works on an (overlapping) trapezoid – a little redundant computation in exchange for fewer messages

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
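A small Python sketch of the tridiagonal example above: the plain kernel does k SpMVs, while one block plus k "ghost" entries per side can compute its piece of all k levels after a single neighbor exchange (block boundaries below are my own choice):

import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    # Straightforward kernel: k SpMVs, i.e. k passes over A and the vectors
    V = [x]
    for _ in range(k):
        V.append(A @ V[-1])
    return V                        # [x, Ax, A^2 x, ..., A^k x]

n, k = 32, 3
A = sp.diags([np.ones(n - 1), 2 * np.ones(n), np.ones(n - 1)], [-1, 0, 1],
             format='csr')          # tridiagonal example, as on the slide
x = np.random.rand(n)
V = matrix_powers(A, x, k)

# CA version: one "processor" owns columns [left, right) plus k ghost cells
# on each side; after one exchange it can compute all k levels locally
# (the ghost-region flops near the edges are the redundant computation).
left, right = 8, 16
lo, hi = max(0, left - k), min(n, right + k)
Vloc = matrix_powers(A[lo:hi, lo:hi], x[lo:hi], k)
assert np.allclose(Vloc[k][left - lo:right - lo], V[k][left:right])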

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                  … SpMV
    MGS(w, v(0), …, v(i-1))         … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                   … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

92

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge
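A toy illustration of why the basis choice matters (a sketch: the matrix, spectrum and basis length are made up; in this run the monomial basis is far worse conditioned than a Chebyshev basis built from the same starting vector):

import numpy as np

n, k = 200, 12
rng = np.random.default_rng(1)
lam_min, lam_max = 1.0, 100.0
A = np.diag(np.linspace(lam_min, lam_max, n))      # toy SPD matrix
x = rng.standard_normal(n); x /= np.linalg.norm(x)

W = np.empty((n, k + 1)); W[:, 0] = x              # monomial basis, normalized
for j in range(k):
    w = A @ W[:, j]; W[:, j + 1] = w / np.linalg.norm(w)

B = (2 * A - (lam_max + lam_min) * np.eye(n)) / (lam_max - lam_min)
C = np.empty((n, k + 1)); C[:, 0] = x; C[:, 1] = B @ x
for j in range(1, k):                              # Chebyshev three-term recurrence
    C[:, j + 1] = 2 * (B @ C[:, j]) - C[:, j - 1]

print(np.linalg.cond(W), np.linalg.cond(C))        # monomial >> Chebyshev here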

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye:

  Basis:             Naive    Monomial                               Newton        Chebyshev
  Replacement Its.:  74 (1)   [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors

                                  Nonzero entries:
                                  Explicit (O(nnz))      Implicit (o(nnz))
  Indices: Explicit (O(nnz))      CSR and variations     Vision, climate, AMR, …
  Indices: Implicit (o(nnz))      Graph Laplacian        Stencils

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
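A two-line illustration of the nonassociativity that makes summation order matter (plain Python/NumPy; the random example typically, though not necessarily, differs between orders):

import numpy as np

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))         # False: rounding differs
vals = np.random.default_rng(0).standard_normal(10**6)
print(sum(vals.tolist()) - sum(vals[::-1].tolist()))  # typically nonzero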

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 6: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Goals

6

bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels

bull L1 L2 DRAM network etc bull Attain lower bounds if possible

bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible

ldquoNew Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems On modern computer architectures communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor ASCR researchers have developed a new method derived from commonly used linear algebra methods to minimize communications between processors and the memory hierarchy by reformulating the communication patterns specified within the algorithm This method has been implemented in the TRILINOS framework a highly-regarded suite of software which provides functionality for researchers around the world to solve large scale complex multi-physics problemsrdquo

FY 2010 Congressional Budget Volume 4 FY2010 Accomplishments Advanced Scientific Computing Research (ASCR) pages 65-67

President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress

CA-GMRES (Hoemmen Mohiyuddin Yelick JD)ldquoTall-Skinnyrdquo QR (Grigori Hoemmen Langou JD)

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazaki hellipbull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle bull bebopcsberkeleyedu

Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc

bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA

MAGMAbull Large speed-ups possible

bull Autotuning to find optimal implementationbull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= +

C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= +

C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= +

C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

How hard is hand-tuning matmul anyway

22

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
   • What is the minimum energy required for a computation?
   • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
   • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
   • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
   • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
   – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
   – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[ γi + βi/Mi^(1/2) + αi/Mi^(3/2) ] = Fi·ξi
   – Choose Fi so Σi Fi = n^3 while minimizing T = maxi Ti
   – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)  (see the sketch after this list)
• Optimal Algorithm for n x n matmul
   – Recursively divide into 8 half-sized subproblems
   – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
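
A small numpy sketch of the optimal work assignment above (the per-processor parameters are made-up values for illustration):

    import numpy as np

    # Hypothetical per-processor parameters: sec/flop, sec/word, sec/message, memory
    gamma = np.array([1e-9, 2e-9, 5e-10, 1e-9])
    beta  = np.array([1e-8, 1e-8, 2e-8, 5e-9])
    alpha = np.array([1e-6, 2e-6, 1e-6, 1e-6])
    M     = np.array([1e6, 5e5, 2e6, 1e6])

    n = 2048
    xi = gamma + beta / np.sqrt(M) + alpha / M**1.5   # ξi: effective time per flop
    F  = n**3 * (1.0 / xi) / np.sum(1.0 / xi)         # optimal flop assignment Fi
    T  = n**3 / np.sum(1.0 / xi)                      # resulting runtime

    # every processor finishes at the same time T (perfect load balance)
    assert np.allclose(F * xi, T)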

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
   – Communication lower bounds apply
• Complex symmetries possible
   – Ex: B(m,n,k) = B(k,m,n) = …
   – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
   – Ex: NWChem
• CTF: Cyclops Tensor Framework
   – Exploits 2.5D algorithms, symmetries
   – Solomonik, Hammond, Matthews

(figure: contraction C(i,j,k) = Σm A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric)

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
   – Communication lower bounds apply
• Complex symmetries possible
   – Ex: B(m,n,k) = B(k,m,n) = …
   – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
   – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
   – Exploits 2.5D algorithms, symmetries
   – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
   – Solomonik, Hammond, Matthews

TSQR: QR of a Tall, Skinny matrix

45

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
                       = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

TSQR: QR of a Tall, Skinny matrix

46

(same reduction as above, collecting the output)

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
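
A minimal sequential sketch of the TSQR reduction tree in numpy (the block count and matrix shape are arbitrary; in the parallel version each leaf QR runs on a different processor and only the small R factors move up a binary tree):

    import numpy as np

    def tsqr_r(W, nblocks=4):
        # R factor of a tall-skinny W via a binary reduction tree:
        # QR each block, then repeatedly stack pairs of R factors and QR again.
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(Wi, mode='reduced')[1] for Wi in blocks]     # leaf QRs
        while len(Rs) > 1:
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='reduced')[1]
                  for i in range(0, len(Rs), 2)]                        # combine up the tree
        return Rs[0]

    m, n = 10000, 50
    W = np.random.rand(m, n)
    R_tree = tsqr_r(W)
    R_ref  = np.linalg.qr(W, mode='reduced')[1]
    # R is unique up to the signs of its rows
    assert np.allclose(np.abs(R_tree), np.abs(R_ref))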

TSQR: An Architecture-Dependent Algorithm

Parallel:   W = [W0; W1; W2; W3] → leaf QRs give R00, R10, R20, R30 → pairwise combines give R01, R11 → final combine gives R02 (binary reduction tree)

Sequential: W = [W0; W1; W2; W3] → R00 → fold in W1 to get R01 → fold in W2 to get R02 → fold in W3 to get R03 (flat tree, one block at a time)

Dual Core:  each core reduces its own blocks with a flat tree, then the two partial R factors are combined (a hybrid of the trees above)

Can choose reduction tree dynamically
Multicore, Multisocket, Multirack, Multisite, Out-of-core: ?

TSQR Performance Results
• Parallel
   – Intel Clovertown
      • Up to 8x speedup (8 core, dual socket, 10M x 10)
   – Pentium III cluster, Dolphin Interconnect, MPICH
      • Up to 6.7x speedup (16 procs, 100K x 200)
   – BlueGene/L
      • Up to 4x speedup (32 procs, 1M x 50)
   – Tesla C2050 / Fermi
      • Up to 13x (110,592 x 100)
   – Grid – 4x on 4 cities (Dongarra et al)
   – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
   – "Infinite speedup" for out-of-core on PowerPC laptop
      • As little as 2x slowdown vs (predicted) infinite DRAM
      • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

48

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

49

W (n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
   Choose b pivot rows of W1, call them W1'
   Choose b pivot rows of W2, call them W2'
   Choose b pivot rows of W3, call them W3'
   Choose b pivot rows of W4, call them W4'

[ W1'; W2'; W3'; W4' ] = [ P12·L12·U12; P34·L34·U34 ]
   Choose b pivot rows, call them W12'
   Choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234
   Choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
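
A minimal scipy sketch of the tournament above (block count and panel shape are arbitrary): each round selects b candidate pivot rows per block via LU with partial pivoting, then the winners play off pairwise up the tree:

    import numpy as np
    from scipy.linalg import lu

    def pivot_rows(block, b):
        # Rows chosen by Gaussian elimination with partial pivoting on this
        # block: P^T·block = L·U, so the first b rows of P^T·block win.
        P, L, U = lu(block)
        return (P.T @ block)[:b]

    def tournament_pivoting(W, b, nblocks=4):
        # Leaf round: pick b pivot rows from each block of W.
        candidates = [pivot_rows(Wi, b) for Wi in np.array_split(W, nblocks, axis=0)]
        # Reduction tree: stack pairs of candidate sets and re-select.
        while len(candidates) > 1:
            candidates = [pivot_rows(np.vstack(candidates[i:i+2]), b)
                          for i in range(0, len(candidates), 2)]
        return candidates[0]          # b pivot rows to move to the top of W

    W = np.random.rand(1000, 8)       # hypothetical tall-skinny panel, b = 8
    print(tournament_pivoting(W, b=8))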

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

(plot: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc))

Up to 29x faster

Other CA algorithms
• Need for pivoting arises beyond LU, e.g. in QR
   – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
   – Usual approach, like Partial Pivoting:
      • Put longest column first, update rest of matrix, repeat
      • Hard to do using BLAS3 at all, let alone hit lower bound
   – Use Tournament Pivoting:
      • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
      • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
   – Idea extends to other pivoting schemes
      • Cholesky with diagonal pivoting
      • LU with complete pivoting
      • LDLT with complete pivoting

54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
   – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg 7) matmul:   words_moved = Ω( M·(n/M^(1/2))^lg 7 / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs. DFS step:
   BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
   DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
   if EnoughMemory and P ≥ 7
      then BFS step
      else DFS step
   end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%-184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
   – A = A^T is banded
   – T tridiagonal
   – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
   – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
   – It can communicate even less

Conventional vs CA-SBR

Conventional: touch all data 4 times.   Communication-Avoiding: touch all data once.
Many tuning parameters; right choices reduce words_moved by factor M/bw (bw = bandwidth), not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
   – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
   – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
   – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
   – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
   – Best sequential speedup vs MKL: 1.9x
   – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

   for k = 1:n
      for i = 1:n
         for j = 1:n
            D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), mink( A(i,k)+B(k,j) ) ) by D = A⊗B
   – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

61

   D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
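
A small numpy sketch of the min-plus product ⊗ and the recursive scheme above (the "min with the old value" from the definition of ⊗ is folded into minplus; the graph size is an arbitrary power of two):

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) )
        return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return A.copy()
        D = A.copy()
        h = n // 2
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    n = 8
    A = np.random.rand(n, n)
    np.fill_diagonal(A, 0.0)

    # reference: classical Floyd-Warshall
    D_ref = A.copy()
    for k in range(n):
        D_ref = np.minimum(D_ref, D_ref[:, [k]] + D_ref[[k], :])

    assert np.allclose(dc_apsp(A), D_ref)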

Performance of 2.5D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

(plot annotations: "6.2x speedup", "2x speedup")

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
   – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
   – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
   – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

63

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
   Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
   – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
   – Algorithms:
      • LDLT, QR with pivoting, other pivoting schemes, eigenproblems, …
      • All-pairs-shortest-path, …
      • Both 2D (c=1) and 2.5D (c>1)
      • But only bandwidth may decrease with c>1, not latency
      • Sparse matrices
   – Platforms:
      • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
   – Software:
      • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
   – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
   – Qbox, with LLNL, IBM, on IBM BG/Q
   – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
   • Lower bounds on communication
   • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
   for i = 1:n
      for j = 1:n
         for k = 1:n
            C(i,j) += A(i,k)·B(k,j)

• "Blocked" code (b x b matmul at the innermost level):
   for i1 = 1:b:n
      for j1 = 1:b:n
         for k1 = 1:b:n
            for i2 = 0:b-1
               for j2 = 0:b-1
                  for k2 = 0:b-1
                     i = i1+i2, j = j1+j2, k = k1+k2
                     C(i,j) += A(i,k)·B(k,j)

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from? (see the sketch below)
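
A runnable Python version of the blocked loop nest (the fast-memory size M below is a made-up value, used only to pick b so that three b x b blocks fit, 3·b^2 ≤ M):

    import numpy as np

    def blocked_matmul(A, B, M=3 * 64 * 64):
        # Tile size chosen so three b x b blocks (of A, B, C) fit in a
        # hypothetical fast memory of M words.
        n = A.shape[0]
        b = max(1, int((M / 3) ** 0.5))
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # b x b (or smaller, at the edges) block update
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    n = 200
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B), A @ B)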

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)
• Record array indices in matrix Δ:

            i  j  k
        Δ = 1  0  1    A
            0  1  1    B
            1  1  0    C

• Solve LP for x = [xi, xj, xk]^T: max 1^T·x s.t. Δ·x ≤ 1
   – Result: x = [1/2, 1/2, 1/2]^T, 1^T·x = 3/2 = e
• Thm: words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2)). Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)  (see the LP check below)
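
A small scipy check of this LP (maximizing 1^T·x is done by minimizing -1^T·x):

    import numpy as np
    from scipy.optimize import linprog

    # Rows of Δ: one per array, columns ordered (xi, xj, xk)
    delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)

    # maximize 1^T x  s.t.  Δ x <= 1, x >= 0   ==  minimize -1^T x
    res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3),
                  bounds=[(0, None)] * 3, method='highs')

    print(res.x)        # [0.5 0.5 0.5]
    print(-res.fun)     # e = 1.5, so words_moved = Ω(n^3 / M^(e-1)) = Ω(n^3 / M^(1/2))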

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
        Δ = 1  0    F
            1  0    P(i)
            0  1    P(j)

• Solve LP for x = [xi, xj]^T: max 1^T·x s.t. Δ·x ≤ 1
   – Result: x = [1, 1], 1^T·x = 2 = e
• Thm: words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1). Attained by block sizes M^xi, M^xj = M^1, M^1

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

(plot: 11.8x speedup from communication-avoiding replication)

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
     A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
     A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
            1  0  1  0  0  1    A1
            1  1  0  1  0  0    A2
        Δ = 0  1  1  0  1  0    A3
            0  0  1  1  0  1    A3, A4
            0  1  0  0  0  1    A5
            1  0  0  1  1  0    A6

• Solve LP for x = [x1,…,x6]^T: max 1^T·x s.t. Δ·x ≤ 1
   – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = e
• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7)). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
   for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)
   ⇒ for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
   for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2·i3-i7) = func( A(i2+3·i4, i1, i2, i1+i2, …), B(pnt(3·i4)), … )
      D(something else) = func(something else), …
   ⇒ for (i1,i2,…,ik) in S = subset of Z^k,
      access locations indexed by "projections", e.g.
         φC(i1,i2,…,ik) = (i1+2·i3-i7)
         φA(i1,i2,…,ik) = (i2+3·i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φC(S), φA(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that

   words_moved = Ω( #iterations / M^(e-1) )

  where e is the value of a linear program:
   minimize e = Σj ej subject to
   rank(H) ≤ Σj ej·rank(φj(H)) for all subgroups H < Z^k

• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
   • Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
   • Thm (Christ/Tao/Carbery/Bennett): Given s1,…,sm:

      |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
   – One inequality per subgroup H < Z^d, but still finitely many
   – Thm (bad news): Writing down all inequalities in LP reduces to Hilbert's 10th problem over Q
      • Could be undecidable: open question
   – Thm (good news): Another LP has same solution, is decidable (but expensive so far)
   – Thm (better news): Easy to write LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
   – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies); work in progress

Ongoing Work

• Identify more decidable cases
   – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
   • Lower bounds on communication
   • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
   – Does k SpMVs with A and starting vector
   – Many such "Krylov Subspace Methods"
• Goal: minimize communication
   – Assume matrix "well-partitioned"
   – Serial implementation
      • Conventional: O(k) moves of data from slow to fast memory
      • New: O(1) moves of data – optimal
   – Parallel implementation on p processors
      • Conventional: O(k·log p) messages (k SpMV calls, dot prods)
      • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
   – Price: some redundant computation

78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

(figure: the dependency diagram has a row of 32 entries (1, 2, 3, 4, …, 32) for each of x, A·x, A^2·x, A^3·x; entry i of A^(j+1)·x depends on entries i-1, i, i+1 of A^j·x)

• Sequential Algorithm: process one block of columns at a time (Steps 1 to 4); each step reads its part of A and x once and computes all k levels before moving on, so data moves between slow and fast memory O(1) times instead of O(k)

• Parallel Algorithm: Proc 1 to Proc 4 each own a contiguous block of columns; each processor communicates once with neighbors, then works on an (overlapping) trapezoid of the diagram, trading some redundant computation for fewer messages

Same idea works for general sparse matrices:
   Simple block-row partitioning → (hyper)graph partitioning
   Top-to-bottom processing → Traveling Salesman Problem
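
A minimal numpy sketch of the parallel idea for the tridiagonal example (4 blocks and k=3 as in the figure; each "processor" fetches a halo of k extra entries of x from its neighbors once, then does redundant local work on the halo):

    import numpy as np

    def local_matrix_powers(A, x, lo, hi, k):
        # One processor computes rows lo:hi of [A·x, A^2·x, ..., A^k·x]
        # using only its overlapping trapezoid of the dependency diagram.
        n = len(x)
        glo, ghi = max(0, lo - k), min(n, hi + k)        # ghost (halo) region
        xt = x[glo:ghi].copy()
        out = []
        for level in range(1, k + 1):
            xt = A[glo:ghi, glo:ghi] @ xt                # local (partly redundant) SpMV
            out.append(xt[lo - glo : hi - glo].copy())   # keep only owned rows
        return out

    n, k, nblocks = 32, 3, 4
    A = np.diag(2 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    x = np.random.rand(n)

    # stitch the per-processor pieces together and compare with k plain SpMVs
    pieces = [local_matrix_powers(A, x, p * n // nblocks, (p + 1) * n // nblocks, k)
              for p in range(nblocks)]
    for level in range(k):
        full = np.concatenate([pieces[p][level] for p in range(nblocks)])
        assert np.allclose(full, np.linalg.matrix_power(A, level + 1) @ x)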

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k·b} minimizing || Ax-b ||2

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)                 … SpMV
      MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt
      update v(i), H
   endfor
   solve LSQ problem with H

Communication-avoiding GMRES:
   W = [ v, Av, A^2v, …, A^kv ]
   [Q,R] = TSQR(W)                   … "Tall Skinny QR"
   build H from R
   solve LSQ problem with H

• Oops – W from power method, precision lost!

92

• Sequential case: #words_moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge
A different polynomial basis [p1(A)x, …, pk(A)x] does converge

93
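
A tiny numpy illustration of why the monomial (power-method) basis loses precision (the matrix and k below are arbitrary choices for the demonstration):

    import numpy as np

    np.random.seed(0)
    n, k = 200, 12
    A = np.random.rand(n, n)
    x = np.random.rand(n)

    W = np.empty((n, k + 1))
    W[:, 0] = x
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]        # monomial Krylov basis [x, Ax, ..., A^k x]

    # The columns all align with the dominant eigenvector, so the basis is
    # numerically rank-deficient; this is the precision loss noted above, and
    # the reason CA-GMRES switches to Newton or Chebyshev polynomial bases.
    print(np.linalg.cond(W))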

Speed ups of GMRES on 8-core Intel Clovertown [MHDY09]

94

Requires co-tuning kernels

95

CA-BiCGStab – With Residual Replacement (RR) a la Van der Vorst and Ye

Replacement iterations (count):
   Naive:      74 (1)
   Monomial:   [7, 15, 24, 31, …, 92, 97, 103] (17)
   Newton:     [67, 98] (2)
   Chebyshev:  68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
   • Explicit indices or nonzero entries cause most communication, along with vectors
   • Ex: with stencils (all implicit), all communication is for vectors

                              Indices:
                              Explicit (O(nnz))      Implicit (o(nnz))
   Nonzero entries:
   Explicit (O(nnz))          CSR and variations     Vision, climate, AMR, …
   Implicit (o(nnz))          Graph Laplacian        Stencils

• Operations
   • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
   • Number of columns in x
   • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
   • Return all vectors or just last one
• Cotuning and/or interleaving
   • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
   • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
   – Many different algorithms reorganized
      • More underway, more to be done
   – Need to recognize stable variants more easily
   – Preconditioning
      • Hierarchically Semiseparable Matrices
   – Autotuning and synthesis
      • pOSKI for SpMV – available at bebop.cs.berkeley.edu
      • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
   – Live broadcast in Spring 2014
      • www.cs.berkeley.edu/~demmel
   – On-line version planned in Spring 2014
      • www.xsede.org
      • Free supercomputer accounts to do homework
      • University credit with local instructors
   – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
   – Not even on your multicore laptop
• Floating point addition is nonassociative, so summation order is not reproducible (see the small example below)
• First release of the ReproBLAS
   – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
   – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
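
A two-line Python illustration of the nonassociativity that ReproBLAS is designed to hide (the particular values are arbitrary):

    # Summing the same three numbers in two different orders gives two
    # different floating point results, so a parallel reduction whose tree
    # shape changes from run to run need not reproduce its own answer.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0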

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
     • Conventional: O(k) moves of data from slow to fast memory
     • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
     • Conventional: O(k log p) messages (k SpMV calls, dot products)
     • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure: the vectors x, A·x, A^2·x, A^3·x are drawn as rows over columns 1, 2, 3, 4, …, 32, showing which entries of the previous vector each entry depends on.]

• Sequential Algorithm: Steps 1–4 sweep left to right, computing all k vectors for one block of columns (plus the overlap they depend on) before moving to the next, so data moves from slow to fast memory O(1) times instead of O(k).

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a block of columns
  – Each processor communicates once with neighbors
  – Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
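
A minimal sequential sketch (ours, not the authors' implementation) of this kernel for a tridiagonal A: each block of size b is computed from a "ghost zone" of k extra entries on each side, trading a little redundant arithmetic near block boundaries for a single pass over A and x.

    import numpy as np

    def tridiag_apply(lower, diag, upper, v):
        """y = A v for a tridiagonal A given by its sub-, main- and super-diagonals."""
        y = diag[:len(v)] * v
        y[1:] += lower[:len(v) - 1] * v[:-1]
        y[:-1] += upper[:len(v) - 1] * v[1:]
        return y

    def matrix_powers_kernel(lower, diag, upper, x, k, b):
        """Return rows [Ax, A^2 x, ..., A^k x], computed one block of size b at a time."""
        n = len(x)
        out = np.zeros((k, n))
        for lo in range(0, n, b):
            hi = min(lo + b, n)
            glo, ghi = max(0, lo - k), min(n, hi + k)     # ghost zone of width k
            v = x[glo:ghi].copy()
            for j in range(k):
                # entries within distance j of an artificial ghost boundary are wrong
                # (redundant work), but the target range [lo, hi) stays exact for j <= k
                v = tridiag_apply(lower[glo:ghi], diag[glo:ghi], upper[glo:ghi], v)
                out[j, lo:hi] = v[lo - glo:hi - glo]
        return out

    # quick check against k plain SpMVs on the n = 32, k = 3 example
    n, k, b = 32, 3, 8
    rng = np.random.default_rng(0)
    lower, diag, upper = rng.standard_normal(n - 1), rng.standard_normal(n), rng.standard_normal(n - 1)
    x = rng.standard_normal(n)
    out = matrix_powers_kernel(lower, diag, upper, x, k, b)
    v = x.copy()
    for j in range(k):
        v = tridiag_apply(lower, diag, upper, v)
        assert np.allclose(out[j], v)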

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^kb} minimizing || Ax – b ||_2

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)                 … SpMV
      MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt, updates v(i), H
   endfor
   solve LSQ problem with H

Communication-avoiding GMRES:
   W = [ v, Av, A^2v, …, A^kv ]
   [Q,R] = TSQR(W)                   … "Tall Skinny QR"
   build H from R
   solve LSQ problem with H

• Oops – W from power method, precision lost!

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k
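
One restart cycle of the communication-avoiding variant, as a compact sketch (ours): the s-step basis W is built by the matrix powers kernel and orthogonalized by a single QR, with numpy's QR standing in for TSQR. It uses the monomial basis and assumes W has full rank, so, as the next lines note, it is only the starting point; stable CA-GMRES switches to a better polynomial basis.

    import numpy as np

    def ca_gmres_cycle(A, b, x0, s):
        """One restart cycle of CA-GMRES with a monomial basis (illustration only)."""
        r = b - A @ x0
        v = r / np.linalg.norm(r)
        # Matrix powers kernel: W = [v, Av, ..., A^s v] -- the only step that touches
        # A and vectors of length n, replacing s SpMV + MGS rounds of standard GMRES.
        W = np.empty((len(b), s + 1))
        W[:, 0] = v
        for j in range(s):
            W[:, j + 1] = A @ W[:, j]
        Q, R = np.linalg.qr(W)                       # stand-in for TSQR(W)
        # Since A @ W[:, :s] = W[:, 1:], we get A @ Q[:, :s] = Q @ H with the
        # (s+1) x s upper Hessenberg H built from R alone:
        H = R[:, 1:] @ np.linalg.inv(R[:s, :s])
        g = Q.T @ r                                  # = ||r|| e1 up to QR sign choices
        y = np.linalg.lstsq(H, g, rcond=None)[0]     # small LSQ problem, as in GMRES
        return x0 + Q[:, :s] @ y

    # e.g. x1 = ca_gmres_cycle(A, b, np.zeros_like(b), s=4)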

"Monomial" basis [Ax, …, A^kx] fails to converge

A different polynomial basis [p1(A)x, …, pk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

                      Naive     Monomial                                 Newton          Chebyshev
  Replacement Its.    74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)     [67, 98] (2)    68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication

                            Nonzero entries:
   Indices:                 Explicit (O(nnz))       Implicit (o(nnz))
   Explicit (O(nnz))        CSR and variations      Vision, climate, AMR, …
   Implicit (o(nnz))        Graph Laplacian         Stencils

  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one

• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
     • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
     • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
     • pOSKI for SpMV – available at bebop.cs.berkeley.edu
     • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
     • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
     • www.xsede.org
     • Free supercomputer accounts to do homework
     • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
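
A two-line illustration (not from the slides) of the nonassociativity behind the reproducibility problem: the same three summands, two groupings, two different answers.

    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)   # 0.6000000000000001
    print(a + (b + c))   # 0.6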

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 9: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Summary of CA Algorithmsbull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication for linear algebra problems like Ax=b least squares Ax = λx SVD etc

bull New algorithms that attain these lower boundsbull Being added to libraries ScaLAPACK PLASMA

MAGMAbull Large speed-ups possible

bull Autotuning to find optimal implementationbull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= +

C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= +

C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= +

C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}
    for k = 1 to n/b
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on b-by-b blocks}
    {write block C(i,j) back to slow memory}

(Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j), on b-by-b blocks)

20

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}       … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}     … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}     … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on b-by-b blocks}
    {write block C(i,j) back to slow memory}   … b^2 × (n/b)^2 = n^2 writes

(Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j), on b-by-b blocks)

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic - Faster!

Does blocked matmul attain lower bound?

• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 2n^3/(M/3)^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )

• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

21
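A minimal NumPy sketch (not from the slides) of the blocked algorithm above, with the block size chosen from an assumed fast-memory capacity M (in words) so that three b-by-b blocks fit, 3b^2 ≤ M:

    import numpy as np

    def blocked_matmul(A, B, M_words=3 * 64 * 64):
        n = A.shape[0]
        b = max(1, int((M_words / 3) ** 0.5))   # largest b with 3*b^2 <= M
        C = np.zeros((n, n))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    # one b-by-b block multiply; only 3 blocks are "live" at a time
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B), A @ B)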

How hard is hand-tuning matmul, anyway?

22

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

How hard is hand-tuning matmul, anyway?

23

Recursive Matrix Multiplication (RMM) (1/2)

• For simplicity: square matrices with n = 2^m

• C = [[C11, C12], [C21, C22]] = A · B = [[A11, A12], [A21, A22]] · [[B11, B12], [B21, B22]]
    = [[A11·B11 + A12·B21, A11·B12 + A12·B22],
       [A21·B11 + A22·B21, A21·B12 + A22·B22]]

• True when each Aij, etc., is 1x1 or n/2 x n/2

24

func C = RMM (A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

Recursive Matrix Multiplication (RMM) (2/2)

25

func C = RMM (A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM( . , . , n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM( . , . , n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not panacea
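A minimal sketch of this cache-oblivious recursion in NumPy (an assumption: n a power of 2; the base case is cut off at a small block rather than 1x1 so the interpreter overhead stays tolerable):

    import numpy as np

    def rmm(A, B, cutoff=64):
        n = A.shape[0]
        if n <= cutoff:                      # base case: small enough to multiply directly
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        C = np.empty((n, n))
        C[:h, :h] = rmm(A11, B11, cutoff) + rmm(A12, B21, cutoff)
        C[:h, h:] = rmm(A11, B12, cutoff) + rmm(A12, B22, cutoff)
        C[h:, :h] = rmm(A21, B11, cutoff) + rmm(A22, B21, cutoff)
        C[h:, h:] = rmm(A21, B12, cutoff) + rmm(A22, B22, cutoff)
        return C

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(rmm(A, B), A @ B)

The recursion never needs to know M: once a subproblem is small enough to fit in any level of the hierarchy, all of its work stays there, which is the "cache oblivious" property noted above.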

CARMA Performance: Shared Memory

(Plot: square case, m = k = n on a log scale; CARMA (single/double) vs MKL (single/double), with single- and double-precision peak lines.)

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

(Plot: inner-product-shaped case, m = n = 64 on a log scale; CARMA (single/double) vs MKL (single/double).)

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster? L3 Cache Misses

Shared Memory Inner Product (m = n = 64; k = 524,288)

(Plot: CARMA incurs 97% and 86% fewer L3 cache misses than MKL in the two precisions shown.)

Parallel MatMul with 2D Processor Layout

• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij

(Figure: C = A * B, with each of C, A, B laid out on the same 4 x 4 grid P00 … P33.)

30

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages   = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/ lawn96 and lawn100
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

31

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

• C(i, j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be square processor grid

(Figure: C(i,j) accumulates the product of column panel A(i,k) and row panel B(k,j).)

32

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^(1/2)
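A minimal serial sketch (an assumption: this is not the PBLAS code, just NumPy) of SUMMA's outer-product structure: at step k every processor would receive a column panel of A and a row panel of B and accumulate their product into its local block of C.

    import numpy as np

    def summa_like(A, B, b=32):
        n = A.shape[0]
        C = np.zeros((n, n))
        for k in range(0, n, b):
            Acol = A[:, k:k+b]      # column panel broadcast along processor rows
            Brow = B[k:k+b, :]      # row panel broadcast along processor columns
            C += Acol @ Brow        # each processor updates its own C block
        return C

    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(summa_like(A, B), A @ B)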

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. Eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD

Can we do Better?

Can we do better?

• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul": uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
• Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

(Figure: a (P/c)^(1/2) x (P/c)^(1/2) x c processor grid. Example: P = 32, c = 2.)

2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)

(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)

(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis, so P(i,j,0) owns C(i,j)

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

(Plot annotations: "12x faster", "2.7x faster".)

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ]
        = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) }
        = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
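A small sketch of the timing/energy model above (all parameter values below are made-up placeholders, not measurements), showing that increasing the replication factor c cuts time by c while leaving energy unchanged:

    def model(P, c, n, M, m, gT, bT, aT, gE, bE, aE, dE, eE):
        flops = n**3 / (c * P)
        T = flops * (gT + bT / M**0.5 + aT / (m * M**0.5))
        E = c * P * (flops * (gE + bE / M**0.5 + aE / (m * M**0.5)) + dE * M * T + eE * T)
        return T, E

    params = dict(n=4096, M=1 << 20, m=1 << 10, gT=1e-9, bT=1e-8, aT=1e-6,
                  gE=1e-9, bE=1e-8, aE=1e-6, dE=1e-12, eE=1e-3)
    T1, E1 = model(P=64, c=1, **params)
    T2, E2 = model(P=64, c=2, **params)   # doubling the processor count via c = 2
    print(T1 / T2, E2 / E1)               # ~2.0 (perfect time scaling), ~1.0 (constant energy)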

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n^3 and minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal Algorithm for nxn matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms… (see the sketch below)
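A tiny sketch of the optimal work split above (the per-processor parameters are invented for illustration): each processor gets flops in proportion to 1/ξi, which equalizes all the finish times Ti.

    def optimal_split(n, procs):
        # procs: list of (gamma_i, beta_i, alpha_i, M_i) per processor
        xi = [g + b / M**0.5 + a / M**1.5 for (g, b, a, M) in procs]
        inv_sum = sum(1.0 / x for x in xi)
        F = [n**3 * (1.0 / x) / inv_sum for x in xi]   # Fi = n^3 (1/xi_i) / sum_j (1/xi_j)
        T = n**3 / inv_sum                              # common finish time
        return F, T

    procs = [(1e-9, 1e-8, 1e-6, 1 << 20),   # a fast processor with large memory
             (4e-9, 2e-8, 2e-6, 1 << 18)]   # a slower one with less memory
    F, T = optimal_split(4096, procs)
    print(F, T)   # every processor finishes at the same time T = Fi * xi_i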

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)*B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

(Figure: C(i,j,k) = Σm A(i,j,m)*B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric.)

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)*B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

TSQR: QR of a Tall, Skinny matrix

45

W = [ W0 ]   [ Q00·R00 ]   [ Q00             ] [ R00 ]
    [ W1 ] = [ Q10·R10 ] = [     Q10         ] [ R10 ]
    [ W2 ]   [ Q20·R20 ]   [         Q20     ] [ R20 ]
    [ W3 ]   [ Q30·R30 ]   [             Q30 ] [ R30 ]

[ R00 ]   [ Q01·R01 ]   [ Q01     ] [ R01 ]
[ R10 ] = [         ] = [         ] [     ]
[ R20 ]   [ Q11·R11 ]   [     Q11 ] [ R11 ]
[ R30 ]

[ R01 ] = Q02·R02
[ R11 ]

TSQR: QR of a Tall, Skinny matrix

46

(Same reduction as above: local QRs of W0…W3 give R00…R30; pairwise QRs combine them into R01 and R11; a final QR gives R02.)

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
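A minimal sketch of the binary reduction pictured above (NumPy only, run serially here; the real TSQR runs the pairwise QRs on different processors or cache blocks): local QRs of each row block, then pairwise QRs of stacked R factors until one R remains.

    import numpy as np

    def tsqr_R(W, nblocks=4):
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(b, mode='r') for b in blocks]      # R00, R10, R20, R30
        while len(Rs) > 1:                                     # pairwise combine
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]                                           # the final R (R02 above)

    W = np.random.rand(10000, 50)
    R = tsqr_R(W)
    R_ref = np.linalg.qr(W, mode='r')
    # agrees with the R from a standard QR up to a sign choice per row
    assert np.allclose(np.abs(R), np.abs(R_ref))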

TSQR: An Architecture-Dependent Algorithm

Parallel:   W = [W0; W1; W2; W3] → local QRs give R00, R10, R20, R30 → pairwise combine into R01, R11 → final R02

Sequential: W = [W0; W1; W2; W3] → R00 → fold in W1 to get R01 → fold in W2 to get R02 → fold in W3 to get R03

Dual Core:  a hybrid tree mixing the two patterns (R00, R01, R11, R02, R03)

Can choose reduction tree dynamically

Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?

TSQR Performance Results

• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

48

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

49

W_{n x b} = [ W1 ]   [ P1·L1·U1 ]    Choose b pivot rows of W1, call them W1'
            [ W2 ] = [ P2·L2·U2 ]    Choose b pivot rows of W2, call them W2'
            [ W3 ]   [ P3·L3·U3 ]    Choose b pivot rows of W3, call them W3'
            [ W4 ]   [ P4·L4·U4 ]    Choose b pivot rows of W4, call them W4'

[ W1' ]   [ P12·L12·U12 ]    Choose b pivot rows, call them W12'
[ W2' ] =
[ W3' ]   [ P34·L34·U34 ]    Choose b pivot rows, call them W34'
[ W4' ]

[ W12' ] = P1234·L1234·U1234    Choose b pivot rows
[ W34' ]

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term

• Thm: As numerically stable as Partial Pivoting on a larger matrix
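A small sketch of the tournament above (not the CALU code; NumPy/SciPy, serial): each block proposes b pivot-row candidates via ordinary Gaussian elimination with partial pivoting, and candidates are merged pairwise with further GEPP until b winners remain.

    import numpy as np
    from scipy.linalg import lu

    def gepp_rows(block, b):
        """Return the b rows of `block` selected by LU with partial pivoting."""
        P, L, U = lu(block)            # block = P @ L @ U, so P.T @ block = L @ U
        return (P.T @ block)[:b]       # first b rows after the pivoting permutation

    def tournament_pivot_rows(W, b, nblocks=4):
        candidates = [gepp_rows(blk, b) for blk in np.array_split(W, nblocks, axis=0)]
        while len(candidates) > 1:     # pairwise tournament on stacked candidates
            candidates = [gepp_rows(np.vstack(candidates[i:i+2]), b)
                          for i in range(0, len(candidates), 2)]
        return candidates[0]           # b pivot rows to move to the top of W

    W = np.random.rand(4096, 8)
    print(tournament_pivot_rows(W, b=8).shape)   # (8, 8)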

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

(Plot: x-axis log2(p); y-axis log2(n^2/p) = log2(memory_per_proc); up to 29x faster.)

Other CA algorithms

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A*P = Q*R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading rxr submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:       #words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:    #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   #words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

vs.

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step

In practice, how to best interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA - SBR

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.

Many tuning parameters: right choices reduce #words_moved by factor M/bw, not just M^(1/2)


Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), mink(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

61

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11,D12],[D21,D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 * D12
    D21 = D21 * D11
    D22 = D21 * D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 * D21
    D12 = D12 * D22
    D11 = D12 * D21
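A small serial sketch (NumPy; not the 2.5D code) of the min-plus semiring used above, where D = A*B means D(i,j) = min(D(i,j), min_k A(i,k)+B(k,j)), checked against plain Floyd-Warshall:

    import numpy as np

    def minplus(D, A, B):
        # element [i,k,j] of the sum tensor is A[i,k] + B[k,j]; take min over k
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def floyd_warshall(A):
        D = A.copy()
        for k in range(D.shape[0]):
            D = np.minimum(D, D[:, [k]] + D[[k], :])   # D(i,j) = min(D(i,j), D(i,k)+D(k,j))
        return D

    n = 64
    A = np.random.rand(n, n) + 1.0
    np.fill_diagonal(A, 0.0)
    D_sq = A.copy()
    for _ in range(int(np.ceil(np.log2(n)))):   # repeated min-plus squaring
        D_sq = minplus(D_sq, D_sq, D_sq)
    assert np.allclose(floyd_warshall(A), D_sq)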

Performance of 2.5D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

(Plot annotations: "6.2x speedup", "2x speedup".)

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable, new one:
• Ex: A*B, both diagonal: no communication in parallel case
• Ex: A*B, both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

  #Words_moved = Ω( min( dn/P^(1/2), d^2·n/P ) )

  – Proof exploits fact that reuse of entries of C = A*B unlikely
• Contrast general lower bound: #Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)          … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
        Δ = 1  0  1    A
            0  1  1    B
            1  1  0    C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: #words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
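A quick numerical check of this LP (using SciPy's linprog; the solver call is an illustration, not part of the slides): maximize 1^T x subject to Δ x ≤ 1, x ≥ 0, for the matmul Δ.

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)
    res = linprog(c=[-1, -1, -1], A_ub=Delta, b_ub=[1, 1, 1])  # minimize -(xi+xj+xk)
    print(res.x, -res.fun)   # -> [0.5 0.5 0.5], e = 1.5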

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
        Δ = 1  0    F
            1  0    P(i)
            0  1    P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: #words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
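A minimal sketch of the blocked n-body loop that realizes these M x M tiles (the 1-D inverse-distance "force" and the block size b standing in for M are toy choices): load a block of targets, stream blocks of sources past it, do all b x b interactions.

    import numpy as np

    def blocked_forces(P, b=256):
        n = len(P)
        F = np.zeros(n)
        for i in range(0, n, b):          # block of targets held in fast memory
            Pi = P[i:i+b]
            for j in range(0, n, b):      # block of sources streamed in
                Pj = P[j:j+b]
                d = Pi[:, None] - Pj[None, :]
                if i == j:
                    np.fill_diagonal(d, np.inf)   # skip self-interaction
                F[i:i+b] += (1.0 / d).sum(axis=1)  # toy inverse-distance "force"
        return F

    P = np.random.rand(2048)
    print(blocked_forces(P)[:3])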

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
        Δ = 1  0  1  0  0  1    A1
            1  1  0  1  0  0    A2
            0  1  1  0  1  0    A3
            0  0  1  1  0  1    A3, A4
            0  1  0  0  0  1    A5
            1  0  0  1  1  0    A6

• Solve LP for x = [x1,…,x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: #words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
       Access locations indexed by (i,j), (i,k), (k,j)
• General case
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k
       Access locations indexed by "projections", e.g.
         φC (i1,i2,…,ik) = (i1+2*i3-i7)
         φA (i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φC (S), φA (S), … ?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that

  #words_moved = Ω( #iterations / M^(e-1) )

  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej·rank(φj(H)) for all subgroups H < Z^k

• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S subset of Z^k, group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1,…,sm,

  |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has same solution, is decidable (but expensive, so far)
  – Thm (better news): Easy to write LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages - optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
  (Figure, repeated as an animation in the slides: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32; computing each entry of A^i·x needs its neighbors in A^(i-1)·x.)
• Works for any "well-partitioned" A

• Sequential Algorithm (Steps 1–4 in the slides): process one block of columns at a time, computing all k levels for that block before moving on, so the matrix and vectors move from slow to fast memory only once.

• Parallel Algorithm (Proc 1 … Proc 4 in the slides): each processor computes all k levels for its block, working on an (overlapping) trapezoid of the dependency diagram, so it communicates with its neighbors only once instead of k times.

Same idea works for general sparse matrices:

Simple block-row partitioning → (hyper)graph partitioning
Top-to-bottom processing → Traveling Salesman Problem
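A minimal sketch of the kernel's output (scipy.sparse, computed serially here; real implementations block the work and add ghost entries precisely so that A is not re-read k times): compute [x, Ax, A^2x, …, A^kx] for a sparse tridiagonal A.

    import numpy as np
    import scipy.sparse as sp

    n, k = 32, 3
    A = sp.diags([np.ones(n - 1), 2 * np.ones(n), np.ones(n - 1)], [-1, 0, 1], format='csr')
    x = np.random.rand(n)

    V = [x]
    for _ in range(k):
        V.append(A @ V[-1])        # k SpMVs; the CA version computes these block by block
    V = np.column_stack(V)          # n x (k+1) basis [x, Ax, ..., A^k x]
    print(V.shape)                  # (32, 4)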

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

92

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k
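A small serial sketch (NumPy; not the tuned CA-GMRES code) of the front end above: build the s-step Krylov basis W with one matrix powers call, then orthogonalize it all at once with a QR (TSQR in the real algorithm) instead of k rounds of MGS.

    import numpy as np

    def ca_gmres_basis(A, b, k):
        v = b / np.linalg.norm(b)
        W = [v]
        for _ in range(k):
            W.append(A @ W[-1])              # monomial basis: may lose precision,
        W = np.column_stack(W)               # hence the better bases noted just below
        Q, R = np.linalg.qr(W)               # one blocked orthogonalization
        return Q, R

    n = 200
    A = np.diag(2 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    Q, R = ca_gmres_basis(A, np.random.rand(n), k=8)
    print(Q.shape, R.shape)   # (200, 9) (9, 9)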

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

                     Naive     Monomial                               Newton        Chebyshev
  Replacement Its.   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)  68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit) all communication is for vectors

                                Indices: Explicit (O(nnz))   Indices: Implicit (o(nnz))
  Nonzeros: Explicit (O(nnz))   CSR and variations           Vision, climate, AMR, …
  Nonzeros: Implicit (o(nnz))   Graph Laplacian              Stencils

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, summation order not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
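A tiny illustration (standard IEEE double precision) of the nonassociativity that makes reproducibility hard: the same numbers summed in a different order give different bits.

    left_to_right = (1.0 + 1e-16) + 1e-16   # the two tiny terms are absorbed one at a time
    grouped       = 1.0 + (1e-16 + 1e-16)   # the tiny terms combine before hitting 1.0
    print(left_to_right == grouped)          # False
    print(left_to_right, grouped)            # 1.0 vs 1.0000000000000002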

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)


Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= +

C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= +

C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= +

C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

How hard is hand-tuning matmul anyway

22

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( dn/P^(1/2), d^2 n/P ) )
  – Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2 n / (P M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
  for i = 1:n
    for j = 1:n
      for k = 1:n
        C(i,j) += A(i,k) * B(k,j)

• "Blocked" code:
  for i1 = 1:b:n
    for j1 = 1:b:n
      for k1 = 1:b:n
        for i2 = 0:b-1
          for j2 = 0:b-1
            for k2 = 0:b-1
              i = i1+i2;  j = j1+j2;  k = k1+k2
              C(i,j) += A(i,k) * B(k,j)      … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω( n^3 / M^(1/2) )
• Where does 1/2 come from?

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

         i  j  k
  Δ = [  1  0  1 ]   A
      [  0  1  1 ]   B
      [  1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: words_moved = Ω( n^3 / M^(e-1) ) = Ω( n^3 / M^(1/2) ). Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
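This little LP can be solved mechanically; a sketch using scipy.optimize.linprog (the maximization is written as minimizing the negated objective):

  import numpy as np
  from scipy.optimize import linprog

  Delta = np.array([[1, 0, 1],    # A(i,k)
                    [0, 1, 1],    # B(k,j)
                    [1, 1, 0]])   # C(i,j)
  # maximize 1^T x  s.t.  Delta x <= 1, x >= 0   (linprog minimizes, so negate c)
  res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
  print(res.x, -res.fun)          # expect x = [0.5, 0.5, 0.5] and e = 1.5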

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

         i  j
  Δ = [  1  0 ]   F
      [  1  0 ]   P(i)
      [  0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: words_moved = Ω( n^2 / M^(e-1) ) = Ω( n^2 / M^1 ). Attained by block sizes M^xi, M^xj = M^1, M^1
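A serial sketch of the tiling this suggests: tiles of M particles in i and j, so each tile of P is read once per O(M^2) interactions. Here force() and the tile size M are placeholders (and force must handle the i = j case, e.g. by returning 0), so this is illustrative only.

  import numpy as np

  def blocked_forces(P, M, force):
      # F(i) += sum_j force(P(i), P(j)), computed tile by tile:
      # load one tile of targets and one tile of sources, do all M*M interactions.
      n = len(P)
      F = np.zeros_like(P)
      for i0 in range(0, n, M):
          Pi = P[i0:i0+M]              # "fast memory" copy of the target tile
          Fi = np.zeros_like(Pi)
          for j0 in range(0, n, M):
              Pj = P[j0:j0+M]          # "fast memory" copy of the source tile
              for a in range(len(Pi)):
                  for b in range(len(Pj)):
                      Fi[a] += force(Pi[a], Pj[b])
          F[i0:i0+M] = Fi
      return F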

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

         i1 i2 i3 i4 i5 i6
  Δ = [  1  0  1  0  0  1 ]   A1
      [  1  1  0  1  0  0 ]   A2
      [  0  1  1  0  1  0 ]   A3
      [  0  0  1  1  0  1 ]   A3, A4
      [  0  0  1  1  0  1 ]   A5
      [  1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: words_moved = Ω( n^6 / M^(e-1) ) = Ω( n^6 / M^(8/7) ). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3
      Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k
      Access locations indexed by "projections", e.g.
        φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
        φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φ_j, then there is an e ≥ 1 such that
    words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σ_j e_j subject to
    rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
  • Given S = subset of Z^k and group homomorphisms φ_1, φ_2, …, bound |S| in terms of |φ_1(S)|, |φ_2(S)|, …, |φ_m(S)|
  • Thm (Christ/Tao/Carbery/Bennett): Given s_1, …, s_m:   |S| ≤ Π_j |φ_j(S)|^(s_j)

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write LP down explicitly in many cases of interest (e.g., when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3; works for any "well-partitioned" A
[Diagram: the entries of x, A·x, A^2·x, A^3·x over indices 1, 2, 3, 4, …, 32, showing which entries of x each computed entry depends on]

• Sequential algorithm: Steps 1–4 sweep over the blocks one at a time, so the matrix and vectors are read from slow memory once
• Parallel algorithm: Proc 1 – Proc 4 each own a block of columns; each processor communicates once with its neighbors and then works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
  simple block-row partitioning → (hyper)graph partitioning
  top-to-bottom processing → Traveling Salesman Problem
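A serial sketch of the blocked computation for a banded (here tridiagonal) A: to get rows s:t of all k vectors it is enough to read a window of x and of A extending k rows past the block on each side, which is the only "communication" the parallel version needs. Hypothetical code, dense A for brevity.

  import numpy as np

  def matrix_powers_block(A, x, k, s, t):
      # Rows s:t of [A@x, A^2@x, ..., A^k@x] from a window of half-width k
      # around the block (valid for tridiagonal / narrow-band A).
      n = A.shape[0]
      lo, hi = max(0, s - k), min(n, t + k)
      Aw = A[lo:hi, lo:hi]            # local piece of A
      v = x[lo:hi].copy()             # local piece of x ("ghost zone" of width k)
      out = []
      for _ in range(k):
          v = Aw @ v                  # rows s:t of v stay exact at every step
          out.append(v[s - lo:t - lo].copy())
      return out

  # check against k ordinary SpMVs
  n, k = 32, 3
  A = np.diag(2.0*np.ones(n)) + np.diag(-np.ones(n-1), 1) + np.diag(-np.ones(n-1), -1)
  x = np.random.rand(n)
  blk = matrix_powers_block(A, x, k, 8, 16)
  ref = x.copy()
  for j in range(k):
      ref = A @ ref
      assert np.allclose(blk[j], ref[8:16])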

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                  … SpMV
    MGS(w, v(0), …, v(i-1))         … Modified Gram-Schmidt; update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                   … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: words moved decreases by a factor of k
Parallel case: messages decreases by a factor of k

• Oops – W from power method, precision lost!
  "Monomial" basis [Ax, …, A^kx] fails to converge; a different polynomial basis [p1(A)x, …, pk(A)x] does converge
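A minimal NumPy sketch of the CA kernel (monomial basis), plus a quick way to see the "precision lost" remark: the condition number of W grows rapidly with k, which is why Newton or Chebyshev basis polynomials p_i(A) are preferred. The matrix below is a toy stand-in, not from the talk.

  import numpy as np

  def krylov_basis(A, v, k):
      # W = [v, Av, A^2 v, ..., A^k v]   (one matrix-powers kernel)
      W = [v / np.linalg.norm(v)]
      for _ in range(k):
          W.append(A @ W[-1])
      return np.column_stack(W)

  n, k = 200, 8
  A = np.diag(np.linspace(1.0, 100.0, n))   # toy SPD matrix
  v = np.random.rand(n)
  W = krylov_basis(A, v, k)
  R = np.linalg.qr(W, mode='r')             # in CA-GMRES this QR is done by TSQR
  print(np.linalg.cond(W))                  # grows quickly with k: basis nearly dependent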

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab with Residual Replacement (RR), à la Van der Vorst and Ye:

  Basis:              Naive     Monomial                               Newton         Chebyshev
  Replacement its.:   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

                               Nonzero entries
                               Explicit (O(nnz))      Implicit (o(nnz))
  Indices Explicit (O(nnz)):   CSR and variations     Vision, climate, AMR, …
  Indices Implicit (o(nnz)):   Graph Laplacian        Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations:
  • [x, Ax, A^2x, …, A^kx ] or [x, p1(A)x, p2(A)x, …, pk(A)x ]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx ] and [y, ATy, (AT)^2y, …, (AT)^ky ], or [y, ATAy, (ATA)^2y, …, (ATA)^ky ]
  • Return all vectors or just the last one
• Cotuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx ] and { TSQR(W) or W^T·W or … }
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014: www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014: www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100-page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative; summation order not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
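A two-line illustration of the nonassociativity that ReproBLAS works around; the printed values are standard IEEE double-precision behavior in Python.

  # floating point addition is not associative:
  print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
  print(0.1 + (0.2 + 0.3))   # 0.6
  # so a parallel sum's answer depends on the reduction tree and data layout,
  # unless a reproducible summation (as in ReproBLAS) is used.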

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 11: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Lower bound for all "direct" linear algebra

• Let M = "fast" memory size (per processor)

    words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
    messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )
    (equivalently, messages_sent ≥ words_moved / largest_message_size)

• Parallel case: assume either load or memory balanced
• Holds for:
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g., A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g., Floyd-Warshall)

SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz

Can we attain these lower bounds?

• Do conventional dense algorithms, as implemented in LAPACK and ScaLAPACK, attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply

16

Naïve Matrix Multiply

  {implements C = C + A*B}
  for i = 1 to n
    {read row i of A into fast memory}             … n^2 reads altogether
    for j = 1 to n
      {read C(i,j) into fast memory}               … n^2 reads altogether
      {read column j of B into fast memory}        … n^3 reads altogether
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}           … n^2 writes altogether

[Figure: C(i,j) = C(i,j) + A(i,:) · B(:,j)]

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.

  for i = 1 to n/b
    for j = 1 to n/b
      {read block C(i,j) into fast memory}         … b^2 × (n/b)^2 = n^2 reads
      for k = 1 to n/b
        {read block A(i,k) into fast memory}       … b^2 × (n/b)^3 = n^3/b reads
        {read block B(k,j) into fast memory}       … b^2 × (n/b)^3 = n^3/b reads
        C(i,j) = C(i,j) + A(i,k) * B(k,j)          {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}     … b^2 × (n/b)^2 = n^2 writes

[Figure: C(i,j) = C(i,j) + A(i,k) · B(k,j), b-by-b blocks]

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
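The same tiling in runnable (NumPy) form, a sketch only, with b playing the role of the block size chosen so that 3b^2 ≤ M:

  import numpy as np

  def blocked_matmul(A, B, C, b):
      # C += A*B, processed in b-by-b blocks; three blocks at a time form the
      # working set that is meant to fit in fast memory.
      n = A.shape[0]
      for i in range(0, n, b):
          for j in range(0, n, b):
              Cij = C[i:i+b, j:j+b]                            # read block C(i,j)
              for k in range(0, n, b):
                  Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]     # block multiply
      return C                                                 # blocks written back

  n, b = 256, 32
  A, B = np.random.rand(n, n), np.random.rand(n, n)
  C = np.zeros((n, n))
  assert np.allclose(blocked_matmul(A, B, C, b), A @ B)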

Does blocked matmul attain the lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so reads/writes ≥ 2n^3/(M/3)^(1/2) + 2n^2
• Attains lower bound = Ω( #flops / M^(1/2) )
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

21

How hard is hand-tuning matmul, anyway?

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results

[Plots: per-team performance vs naïve, blocked, and vendor (ACML) matmul]

Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = [C11 C12; C21 C22] = A·B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]
• True when each Aij etc. is 1x1 or n/2 x n/2

  func C = RMM(A, B, n)
    if n = 1
      C = A * B
    else
      C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
      C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
      C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
      C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
    return

Recursive Matrix Multiplication (RMM) (2/2)

A(n) = # arithmetic operations in RMM(·,·,n)
     = 8·A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3                          … same operations as usual, in a different order

W(n) = # words moved between fast and slow memory by RMM(·,·,n)
     = 8·W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 )      … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not a panacea
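A cache-oblivious sketch of RMM in NumPy; the cutoff n_small is only there to keep Python recursion overhead sane and is not part of the algorithm.

  import numpy as np

  def rmm(A, B, n_small=64):
      # Recursive matmul: same 2n^3 flops, recursive blocking => cache oblivious.
      n = A.shape[0]
      if n <= n_small:
          return A @ B
      h = n // 2
      C = np.empty((n, n), dtype=A.dtype)
      C[:h, :h] = rmm(A[:h, :h], B[:h, :h], n_small) + rmm(A[:h, h:], B[h:, :h], n_small)
      C[:h, h:] = rmm(A[:h, :h], B[:h, h:], n_small) + rmm(A[:h, h:], B[h:, h:], n_small)
      C[h:, :h] = rmm(A[h:, :h], B[:h, :h], n_small) + rmm(A[h:, h:], B[h:, :h], n_small)
      C[h:, h:] = rmm(A[h:, :h], B[:h, h:], n_small) + rmm(A[h:, h:], B[h:, h:], n_small)
      return C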

CARMA Performance: Shared Memory
(Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA)

[Plot] Square: m = k = n (log/linear scales) – CARMA vs MKL, single and double precision, with single- and double-precision peak lines
[Plot] Inner Product: m = n = 64 – CARMA vs MKL, single and double precision

Why is CARMA Faster? L3 Cache Misses
Shared Memory Inner Product (m = n = 64, k = 524,288)
[Plot] Annotations: 97% fewer misses, 86% fewer misses

Parallel MatMul with 2D Processor Layout

• P processors in a P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij

[Diagram: C = A · B, with each matrix laid out on the 4 x 4 processor grid P00 … P33]

30

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    • #messages    = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/ (lawn96.ps, lawn100.ps)
• Comparison to Cannon's Algorithm:
  – Cannon attains lower bound
  – But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

SUMMA – n x n matmul on a P^(1/2) x P^(1/2) grid

• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σ_k A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be a square processor grid
[Diagram: C(i,j) accumulates the product of column panel A(i,k) and row panel B(k,j)]

  For k = 0 to n/b-1
    for all i = 1 to P^(1/2)
      owner of A(i,k) broadcasts it to whole processor row (using binary tree)
    for all j = 1 to P^(1/2)
      owner of B(k,j) broadcasts it to whole processor column (using binary tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol * Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^(1/2)
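A serial NumPy "simulation" of the loop above, a sketch in which the processor grid and the broadcasts are only implied by the indexing; pgrid and b are assumed to divide n for brevity.

  import numpy as np

  def summa(A, B, pgrid, b):
      # C(i,j) += sum over width-b panels k of A(i,k)*B(k,j) on a pgrid x pgrid grid.
      n = A.shape[0]
      s = n // pgrid                       # each "processor" owns an s x s block
      C = np.zeros_like(A)
      for k in range(0, n, b):
          Acol = A[:, k:k+b]               # owners broadcast A(i,k) along processor rows
          Brow = B[k:k+b, :]               # owners broadcast B(k,j) along processor columns
          for i in range(pgrid):
              for j in range(pgrid):       # every processor: C_my += Acol * Brow
                  C[i*s:(i+1)*s, j*s:(j+1)*s] += Acol[i*s:(i+1)*s, :] @ Brow[:, j*s:(j+1)*s]
      return C

  n = 64
  A, B = np.random.rand(n, n), np.random.rand(n, n)
  assert np.allclose(summa(A, B, pgrid=4, b=8), A @ B)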

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum memory per processor = M = O(n^2/P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except the nonsymmetric eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, symmetric and nonsymmetric eigenproblems, SVD

Can we do better?

Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
    • Not always that much memory available…

2.5D Matrix Multiplication

• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
  (Example: P = 32, c = 2)

• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)
  (1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
  (2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)*B(m,j)
  (3) Sum-reduce partial sums Σ_m A(i,m)*B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)

2.5D Matmul on BG/P, 16K nodes / 64K cores (c = 16 copies)

[Plot] Annotations: 12x faster, 2.7x faster

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for the same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of the P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n^3 and minimize T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal Algorithm for n x n matmul:
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

C(i,j,k) = Σm A(i,j,m)·B(m,k)
[Diagram: A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

TSQR: QR of a Tall, Skinny matrix

[Diagram] W = [W0; W1; W2; W3]. Each block gets a local QR: Wi = Qi0·Ri0.
Stack the R factors and reduce pairwise: [R00; R10] = Q01·R01 and [R20; R30] = Q11·R11,
then [R01; R11] = Q02·R02.

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
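A flat sketch of one reduction level of this tree in NumPy; the Q factors are kept implicit as in the slide, and R agrees with the directly computed factor only up to signs of its rows.

  import numpy as np

  def tsqr_r(W, nblocks=4):
      # Local QR of each row block (independent, no communication) ...
      Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks, axis=0)]
      # ... then one QR of the stacked, small R factors (the reduction tree).
      return np.linalg.qr(np.vstack(Rs), mode='r')

  W = np.random.rand(10000, 10)
  R_tsqr = tsqr_r(W)
  R_ref = np.linalg.qr(W, mode='r')
  assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))   # equal up to row signs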

TSQR: An Architecture-Dependent Algorithm

[Diagram] The same reduction of W = [W0; W1; W2; W3] can use different reduction trees:
  Parallel: binary tree (R00, R10, R20, R30 → R01, R11 → R02)
  Sequential: flat tree, folding in one block at a time (R00 → R01 → R02 → R03)
  Dual core: a mixed tree

Can choose the reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core

TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al.)
  – Cloud: ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"

[Diagram] W (n x b) = [W1; W2; W3; W4]. Factor each block Wi = Pi·Li·Ui and choose b pivot rows of each, call them Wi'.
Stack [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, call them W12' and W34'.
Finally [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows.

• Go back to W and use these b pivot rows
• Move them to the top, do LU without pivoting
• Extra work, but a lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot] Axes: log2(p) and log2(n^2/p) = log2(memory_per_proc); annotation: up to 29x faster

Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting

Communication Lower Bounds for Strassen-like matmul algorithms

• Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )
• Strassen's O(n^lg7) matmul:    words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
• Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)

• BFS step vs. DFS step:
  – BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  – DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step
• In practice, how best to interleave BFS and DFS is a "tuning parameter"
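For reference, the seven multiplies that CAPS schedules (BFS: all seven in parallel; DFS: one after another) are those of the usual Strassen recursion. A serial NumPy sketch, assuming n is a power of two:

  import numpy as np

  def strassen(A, B, n_small=64):
      n = A.shape[0]
      if n <= n_small:
          return A @ B                                 # base case: classical multiply
      h = n // 2
      A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
      B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
      M1 = strassen(A11 + A22, B11 + B22, n_small)     # the 7 recursive multiplies
      M2 = strassen(A21 + A22, B11, n_small)           # (scheduled by BFS/DFS steps in CAPS)
      M3 = strassen(A11, B12 - B22, n_small)
      M4 = strassen(A22, B21 - B11, n_small)
      M5 = strassen(A11 + A12, B22, n_small)
      M6 = strassen(A21 - A11, B11 + B12, n_small)
      M7 = strassen(A12 - A22, B21 + B22, n_small)
      C = np.empty((n, n), dtype=A.dtype)
      C[:h, :h] = M1 + M4 - M5 + M7
      C[:h, h:] = M3 + M5
      C[h:, :h] = M2 + M4
      C[h:, h:] = M1 - M2 + M3 + M6
      return C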


Page 12: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

12

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all "direct" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)

14

• Let M = "fast" memory size (per processor)

  words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )

  messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) )

• Parallel case: assume either load or memory balanced

SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz

Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress
• Case study: Matrix Multiply
15

16

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) = C(i,j) + row i of A times column j of B]

17

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

[Figure: C(i,j) = C(i,j) + row i of A times column j of B]

18

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}          … n^2 reads altogether
  for j = 1 to n
    {read C(i,j) into fast memory}            … n^2 reads altogether
    {read column j of B into fast memory}     … n^3 reads altogether
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}        … n^2 writes altogether

[Figure: C(i,j) = C(i,j) + row i of A times column j of B]

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}
    for k = 1 to n/b
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)    {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

[Figure: block C(i,j) updated by b-by-b blocks A(i,k) and B(k,j)]

20

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}       … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}     … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}     … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)        {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}   … b^2 × (n/b)^2 = n^2 writes

[Figure: block C(i,j) updated by b-by-b blocks A(i,k) and B(k,j)]

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic - Faster!
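The tiled loop nest above translates almost directly into code. Below is a minimal NumPy sketch of the blocked algorithm (the function name, sizes and the choice of b are illustrative, not from the slides); each C block is updated by b-by-b blocks of A and B, so only three blocks need to live in fast memory at a time.

import numpy as np

def blocked_matmul(A, B, C, b):
    """C += A*B using b-by-b blocks (sketch of the tiled algorithm above)."""
    n = A.shape[0]
    assert n % b == 0, "for simplicity assume b divides n"
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]            # "read block C(i,j) into fast memory"
            for k in range(0, n, b):
                # "read" blocks A(i,k), B(k,j) and do a b-by-b matrix multiply
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
            # Cij is a view into C, so the update is already "written back"
    return C

n, b = 256, 32                               # illustrative sizes
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = np.zeros((n, n))
blocked_matmul(A, B, C, b)
print(np.allclose(C, A @ B))                 # True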

Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then reads/writes = 2n^3/b + 2n^2

• Make b as large as possible: 3b^2 ≤ M, so reads/writes ≥ 2n^3/(M/3)^(1/2) + 2n^2

• Attains lower bound = Ω (#flops / M^(1/2) )

• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

21

How hard is hand-tuning matmul, anyway?

22

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m

• C = [ C11 C12 ] = A · B = [ A11 A12 ] · [ B11 B12 ]
      [ C21 C22 ]           [ A21 A22 ]   [ B21 B22 ]

    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
      [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]

• True when each Aij etc. 1x1 or n/2 x n/2

24

func C = RMM (A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

Recursive Matrix Multiplication (RMM) (2/2)

25

func C = RMM (A, B, n)
  if n = 1, C = A * B, else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

A(n) = # arithmetic operations in RMM( . , . , n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM( . , . , n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 ) … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not panacea
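For concreteness, here is a small NumPy sketch of RMM exactly as written above, assuming n is a power of 2 (a toy illustration only, not a tuned cache-oblivious implementation like CARMA):

import numpy as np

def RMM(A, B, n):
    """Recursive matrix multiply, following the RMM pseudocode above."""
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty((n, n))
    C[:h, :h] = RMM(A11, B11, h) + RMM(A12, B21, h)
    C[:h, h:] = RMM(A11, B12, h) + RMM(A12, B22, h)
    C[h:, :h] = RMM(A21, B11, h) + RMM(A22, B21, h)
    C[h:, h:] = RMM(A21, B12, h) + RMM(A22, B22, h)
    return C

n = 32                                     # power of 2, as assumed above
A, B = np.random.rand(n, n), np.random.rand(n, n)
print(np.allclose(RMM(A, B, n), A @ B))    # True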

CARMA Performance: Shared Memory

[Figure: Square case (m = k = n); performance of MKL vs CARMA, single and double precision, with single/double peak shown; log x-axis, linear y-axis]

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Figure: Inner-product-shaped case (m = n = 64); performance of MKL vs CARMA, single and double precision; log x-axis, linear y-axis]

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster? L3 Cache Misses

[Figure: Shared Memory Inner Product (m = n = 64, k = 524,288); CARMA has 97% fewer misses (double) and 86% fewer misses (single)]

Parallel MatMul with 2D Processor Layout

• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P=16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij

[Figure: 4x4 grids of P00 … P33 for C, A and B]

C = A * B

30

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
• Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
  • words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
  • messages = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn96,100.ps
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

31

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

• C(i, j) is n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is n/P^(1/2) x b submatrix of A
• B(k,j) is b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be square processor grid

[Figure: C(i,j) accumulates products of block column A(i,k) and block row B(k,j)]

32

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

[Figure: C(i,j) accumulates products of block column A(i,k) and block row B(k,j); Acol and Brow are the received panels]

For k = 0 to n/b-1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^(1/2)
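What the k-loop above computes is simply a sum of rank-b panel products; the parallelism comes from broadcasting each A panel along processor rows and each B panel along columns. A serial NumPy sketch of the same panel structure (the panel width b and sizes are illustrative) is:

import numpy as np

def summa_panels(A, B, b):
    """Serial sketch of SUMMA's k-loop: C accumulates rank-b panel products.
    In the parallel algorithm each processor holds one block of C and receives
    the broadcast panels Acol = A(i,k) and Brow = B(k,j)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k+b]        # panel broadcast along processor rows
        Brow = B[k:k+b, :]        # panel broadcast along processor columns
        C += Acol @ Brow          # C_myproc = C_myproc + Acol * Brow
    return C

n, b = 192, 16
A, B = np.random.rand(n, n), np.random.rand(n, n)
print(np.allclose(summa_panels(A, B, b), A @ B))   # True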

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

• Does ScaLAPACK attain these bounds?
  • For words_moved: mostly, except nonsym. Eigenproblem
  • For messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD

Can we do Better?

Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul": uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
• Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: processor grid of dimensions (P/c)^(1/2) x (P/c)^(1/2) x c]

Example: P = 32, c = 2

2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)

(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)

(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)
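A serial NumPy sketch of steps (2)–(3), with the c layers simulated by splitting the summation index into c chunks (illustrative only; the real algorithm distributes these chunks over the replicated processor layers and sum-reduces along the k-axis):

import numpy as np

def matmul_2_5d_layers(A, B, c):
    """Simulate the c layers of 2.5D matmul: layer k does 1/c-th of the
    SUMMA summation; partial sums are then reduced (here: summed) over layers."""
    n = A.shape[0]
    chunk = n // c
    partial = []
    for k in range(c):                       # step (2): each layer's share
        lo, hi = k * chunk, (k + 1) * chunk if k < c - 1 else n
        partial.append(A[:, lo:hi] @ B[lo:hi, :])
    return sum(partial)                      # step (3): sum-reduce along k-axis

n, c = 128, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
print(np.allclose(matmul_2_5d_layers(A, B, c), A @ B))   # True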

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

[Figure: speedup annotations of 12x faster and 2.7x faster over 2D]

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c: total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ]
        = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec, for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) }
        = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
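To make the model concrete, here is a small helper that evaluates T(cP) and E(cP) from the formulas above; all machine parameters below are made-up placeholders, not measurements. Doubling c should halve T and leave E unchanged.

def model_time_energy(n, P, c, M, m,
                      gT, bT, aT,            # secs per flop / word / message
                      gE, bE, aE, dE, eE):   # joules per flop/word/message, per word-sec, per sec
    """Evaluate the strong-scaling model above on cP processors (placeholder params)."""
    flops = n**3 / (c * P)                             # flops per processor
    per_flop_cost = lambda g, b, a: g + b / M**0.5 + a / (m * M**0.5)
    T = flops * per_flop_cost(gT, bT, aT)              # T(cP) = T(P)/c
    E = c * P * (flops * per_flop_cost(gE, bE, aE) + dE * M * T + eE * T)
    return T, E

# Illustrative (made-up) parameters; M = 3n^2/P, the minimal memory per processor.
args = dict(n=4096, P=64, M=3 * 4096**2 / 64, m=1024,
            gT=1e-9, bT=1e-8, aT=1e-6, gE=1e-9, bE=1e-8, aE=1e-6, dE=1e-10, eE=1e-3)
for c in (1, 2, 4):
    T, E = model_time_energy(c=c, **args)
    print(c, T, E)    # T halves as c doubles, E stays constant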

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n^3 and minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal Algorithm for nxn matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
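A tiny sketch of the optimal work split, with made-up per-processor parameters (γi, βi, αi, Mi); it just evaluates ξi and the closed-form Fi and T from the slide:

def optimal_split(n, gammas, betas, alphas, mems):
    """Fi = n^3 * (1/xi_i) / sum_j(1/xi_j),  T = n^3 / sum_j(1/xi_j)."""
    xis = [g + b / M**0.5 + a / M**1.5
           for g, b, a, M in zip(gammas, betas, alphas, mems)]
    inv_sum = sum(1.0 / xi for xi in xis)
    F = [n**3 * (1.0 / xi) / inv_sum for xi in xis]
    T = n**3 / inv_sum
    return F, T

# Two hypothetical processors: a fast one with a big memory, and a slow one.
F, T = optimal_split(n=1024,
                     gammas=[1e-10, 1e-9], betas=[1e-9, 1e-8],
                     alphas=[1e-6, 1e-6], mems=[1e8, 1e7])
print(F, T)   # the faster processor gets the larger Fi, and Fi*xi_i == T for every i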

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)*B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σm A(i,j,m)*B(m,k), with A 3-fold symm., B 2-fold symm., C 2-fold symm.]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)*B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

TSQR: QR of a Tall, Skinny matrix

45

W = [ W0 ; W1 ; W2 ; W3 ]
  = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

TSQR: QR of a Tall, Skinny matrix

46

W = [ W0 ; W1 ; W2 ; W3 ]
  = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ; R20 ; R30 ] = [ Q01·R01 ; Q11·R11 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
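A minimal NumPy sketch of this reduction, run serially (the number of blocks and sizes are illustrative). Each local QR replaces a block of W by its R factor, and stacking pairs of R factors reproduces the tree above; the final R02 matches the R factor of a direct QR of W up to row signs.

import numpy as np

def tsqr_R(W, nblocks=4):
    """Return the R factor of a tall-skinny W via a binary reduction tree of QRs."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wi)[1] for Wi in blocks]   # leaf QRs: W_i = Q_i0 * R_i0
    while len(Rs) > 1:                            # combine pairs up the tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1] for i in range(0, len(Rs), 2)]
    return Rs[0]                                  # e.g. R02 for 4 blocks

m, n = 10000, 10
W = np.random.rand(m, n)
R_tree = tsqr_R(W)
R_ref = np.linalg.qr(W)[1]
# R of a full-column-rank matrix is unique up to the signs of its rows
print(np.allclose(np.abs(R_tree), np.abs(R_ref)))   # True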

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees on W = [ W0 ; W1 ; W2 ; W3 ]
  Parallel:   binary tree – { R00, R10, R20, R30 } → { R01, R11 } → R02
  Sequential: flat tree – R00 → R01 → R02 → R03
  Dual Core:  hybrid tree combining the two]

Can choose reduction tree dynamically

Multicore / Multisocket / Multirack / Multisite / Out-of-core

TSQR Performance Results
• Parallel
  – Intel Clovertown
    – Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
48

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

49

W(nxb) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1' ; W2' ] = P12·L12·U12    Choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34    Choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234    Choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term

• Thm: As numerically stable as Partial Pivoting on a larger matrix
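A small SciPy sketch of the tournament, serially simulated and with illustrative sizes (the helper names are mine, and partial-pivoted LU via scipy.linalg.lu stands in for the local panel factorizations); it returns the b winning row indices, which the real TSLU then moves to the top of W before an unpivoted LU.

import numpy as np
from scipy.linalg import lu

def pivot_rows(row_ids, W, b):
    """Pick b pivot rows (indices into W) from the given candidate rows,
    using Gaussian elimination with partial pivoting on that panel."""
    P, L, U = lu(W[row_ids])                  # panel = P @ L @ U
    local_order = np.argmax(P, axis=0)        # which local row became the j-th pivot
    return [int(row_ids[i]) for i in local_order[:b]]

def tournament_pivoting(W, b, nblocks=4):
    """Sketch of TSLU's tournament pivoting: local picks, then pairwise play-offs."""
    groups = np.array_split(np.arange(W.shape[0]), nblocks)
    cand = [pivot_rows(list(g), W, b) for g in groups]        # W1', W2', ...
    while len(cand) > 1:                                      # reduction tree
        cand = [pivot_rows(cand[i] + cand[i + 1], W, b)
                for i in range(0, len(cand), 2)]
    return cand[0]                                            # b pivot rows of W

n, b = 1000, 8
W = np.random.rand(n, b)
print(sorted(tournament_pivoting(W, b)))   # then: move these rows to the top, LU without pivoting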

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters
Source: DOE Exascale Workshop

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination:
2D CA-LU vs ScaLAPACK-LU

[Figure: heat map of predicted speedup; x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); up to 29x faster]

Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A*P = Q*R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading rxr submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
  words_moved = Ω( M·(n/M^(1/2))^3 / P )

Strassen's O(n^lg7) matmul:
  words_moved = Ω( M·(n/M^(1/2))^lg7 / P )

Strassen-like O(n^ω) matmul:
  words_moved = Ω( M·(n/M^(1/2))^ω / P )

vs.

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  If EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

Communication Avoiding Parallel Strassen (CAPS)

In practice, how to best interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA - SBR

[Animation: Conventional vs Communication-Avoiding band reduction]

Many tuning parameters: right choices reduce words_moved by factor M/bw, not just M^(1/2)

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), mink(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11,D12],[D21,D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊗ D12
    D21 = D21 ⊗ D11
    D22 = D21 ⊗ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊗ D21
    D12 = D12 ⊗ D22
    D11 = D12 ⊗ D21

61
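A small NumPy sketch of the divide-and-conquer APSP as written above, where ⊗ is the accumulating min-plus product (purely illustrative, run serially with no 2.5D distribution), checked against Floyd-Warshall:

import numpy as np

def minplus(D, A, B):
    """D(i,j) = min(D(i,j), min_k A(i,k)+B(k,j)) – the semiring 'multiply' above."""
    return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

def dc_apsp(A):
    """Kleene's algorithm: divide-and-conquer all-pairs shortest paths."""
    n = A.shape[0]
    if n == 1:
        return A.copy()
    h = n // 2
    D = A.copy()
    D[:h, :h] = dc_apsp(D[:h, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
    D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
    D[h:, h:] = dc_apsp(D[h:, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
    D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
    return D

# Compare against Floyd-Warshall on a small random weighted graph.
n = 16
A = np.random.rand(n, n) * 10
np.fill_diagonal(A, 0)
D_fw = A.copy()
for k in range(n):
    D_fw = np.minimum(D_fw, D_fw[:, [k]] + D_fw[[k], :])
print(np.allclose(dc_apsp(A), D_fw))   # True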

Performance of 2.5D APSP using Kleene

62

[Figure: Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix becomes dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
63

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable, new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both are Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), iid
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation):
  Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits fact that reuse of entries of C = A*B unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i=i1+i2, j=j1+j2, k=k1+k2
      C(i,j) += A(i,k)*B(k,j)       … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

          i  j  k
        [ 1  0  1 ]   A
    Δ = [ 0  1  1 ]   B
        [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T·x = 3/2 = e
• Thm: words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
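This LP is tiny and can be checked directly; a sketch with scipy.optimize.linprog (maximizing 1^T·x by minimizing its negative, subject to Δ·x ≤ 1):

import numpy as np
from scipy.optimize import linprog

# Rows of Δ: subscripts of A(i,k), B(k,j), C(i,j) in terms of (i, j, k).
Delta = np.array([[1, 0, 1],    # A
                  [0, 1, 1],    # B
                  [1, 1, 0]])   # C

# max 1^T x  s.t.  Δ x <= 1  (linprog minimizes, so negate the objective)
res = linprog(c=[-1, -1, -1], A_ub=Delta, b_ub=[1, 1, 1], bounds=[(0, None)] * 3)
print(res.x, -res.fun)   # x = [0.5, 0.5, 0.5], e = 1.5, so words_moved = Ω(n^3 / M^(1/2))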

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

          i  j
        [ 1  0 ]   F
    Δ = [ 1  0 ]   P(i)
        [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [1, 1], 1^T·x = 2 = e
• Thm: words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles

[Figure: 11.8x speedup]

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

          i1 i2 i3 i4 i5 i6
        [ 1  0  1  0  0  1 ]   A1
        [ 1  1  0  1  0  0 ]   A2
    Δ = [ 0  1  1  0  1  0 ]   A3
        [ 0  0  1  1  0  1 ]   A3, A4
        [ 0  1  0  0  0  1 ]   A5
        [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1,…,x6]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = e
• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φC(i1,i2,…,ik) = (i1+2*i3-i7)
       φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S given bounds on #points in its images φC(S), φA(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that

    words_moved = Ω( #iterations / M^(e-1) )

  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej·rank(φj(H)) for all subgroups H < Z^k

• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S subset of Z^k, group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1,…,sm

    |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has same solution, is decidable (but expensive so far)
  – Thm (better news): Easy to write LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
    • Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages - optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
78

Communication Avoiding Kernels:
The Matrix Powers Kernel: [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure: dependency diagram for computing x, A·x, A^2·x, A^3·x on entries 1 2 3 4 … 32]

[Animation: the same dependency diagram for x, A·x, A^2·x, A^3·x (A tridiagonal, n=32, k=3) is built up over several slides]

Communication Avoiding Kernels:
The Matrix Powers Kernel: [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Sequential Algorithm
• Example: A tridiagonal, n=32, k=3

[Animation over Steps 1, 2, 3, 4: the sequential algorithm computes all k vectors for one block of columns at a time, moving left to right across the dependency diagram]

Communication Avoiding Kernels:
The Matrix Powers Kernel: [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Parallel Algorithm
• Example: A tridiagonal, n=32, k=3
• Each processor communicates once with neighbors
• Each processor works on (overlapping) trapezoid

[Figure: entries 1 … 32 divided among Proc 1, Proc 2, Proc 3, Proc 4; each processor owns an overlapping trapezoid of the x, A·x, A^2·x, A^3·x dependency diagram]
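A toy NumPy sketch of the parallel idea for the tridiagonal example: to compute k = 3 steps for its own block of rows, each processor only needs its block of x plus k ghost entries on each side, fetched once up front (the block sizes, helper name and dense tridiagonal representation are just for illustration).

import numpy as np

def local_matrix_powers(A, x, lo, hi, k):
    """Compute rows lo:hi of A x, A^2 x, ..., A^k x using one up-front fetch of
    k ghost rows/entries on each side (A tridiagonal => bandwidth 1)."""
    n = len(x)
    glo, ghi = max(0, lo - k), min(n, hi + k)       # trapezoid of dependencies
    xloc = x[glo:ghi].copy()                        # the only communication
    Aloc = A[glo:ghi, glo:ghi]
    out = []
    for step in range(1, k + 1):
        xloc = Aloc @ xloc                          # entries near the ghost edge go stale,
        out.append(xloc[lo - glo:hi - glo].copy())  # but rows lo:hi stay correct for k steps
    return out        # [ (A x)[lo:hi], (A^2 x)[lo:hi], (A^3 x)[lo:hi] ]

n, k = 32, 3
A = np.diag(np.random.rand(n)) + np.diag(np.random.rand(n - 1), 1) + np.diag(np.random.rand(n - 1), -1)
x = np.random.rand(n)
blocks = [(0, 8), (8, 16), (16, 24), (24, 32)]                 # "Proc 1..4"
per_proc = [local_matrix_powers(A, x, lo, hi, k) for lo, hi in blocks]
for step in range(k):
    got = np.concatenate([out[step] for out in per_proc])
    ref = np.linalg.matrix_power(A, step + 1) @ x
    print(step + 1, np.allclose(got, ref))                     # True for every step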

Same idea works for general sparse matrices

Communication Avoiding Kernels:
The Matrix Powers Kernel: [Ax, A^2x, …, A^kx]

Simple block-row partitioning → (hyper)graph partitioning
Top-to-bottom processing → Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i=1 to k
    w = A · v(i-1)               … SpMV
    MGS(w, v(0), …, v(i-1))      … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!
92

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k
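A minimal sketch of the communication-avoiding kernel above: build the Krylov basis W with one matrix-powers sweep, then orthogonalize it all at once with a single QR instead of k rounds of MGS (illustrative only; a real CA-GMRES also builds H from R, runs TSQR in parallel, and – as the next slide notes – uses a better-conditioned polynomial basis than the monomial one).

import numpy as np

def ca_gmres_basis(A, v, k):
    """W = [v, Av, A^2 v, ..., A^k v]; Q,R = QR(W) replaces k MGS steps."""
    cols = [v / np.linalg.norm(v)]
    for _ in range(k):
        cols.append(A @ cols[-1])            # the matrix-powers kernel
    W = np.column_stack(cols)                # n x (k+1), tall and skinny
    Q, R = np.linalg.qr(W)                   # in parallel this would be TSQR
    return Q, R

n, k = 1000, 8
A = np.diag(np.arange(1, n + 1).astype(float))   # made-up test matrix
b = np.random.rand(n)
Q, R = ca_gmres_basis(A, b, k)
print(np.allclose(Q.T @ Q, np.eye(k + 1)))       # orthonormal basis of the Krylov space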

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

                    Naive     Monomial                              Newton        Chebyshev
Replacement Its.    74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)  [67, 98] (2)  68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit) all communication for vectors

                                   Indices
                                   Explicit (O(nnz))     Implicit (o(nnz))
  Nonzero    Explicit (O(nnz))     CSR and variations    Vision, climate, AMR, …
  entries    Implicit (o(nnz))     Graph Laplacian       Stencils

• Operations
  • [x, Ax, A^2x, …, A^kx ] or [x, p1(A)x, p2(A)x, …, pk(A)x ]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx ] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky ], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky ]
  • return all vectors or just last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx ] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, summation order not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
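The non-reproducibility is easy to trigger: summing the same data in a different order changes the rounded result, which is exactly what different processor counts or reduction trees do. A tiny illustration (not the ReproBLAS algorithm):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6) * 1e8         # cancellation magnifies rounding

s_forward  = sum(x.tolist())                 # left-to-right summation
s_reversed = sum(x[::-1].tolist())           # same data, opposite order
s_pairwise = float(np.sum(x))                # NumPy uses pairwise summation

print(s_forward == s_reversed)               # usually False: addition is not associative
print(s_forward, s_reversed, s_pairwise)     # three slightly different answers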

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 13: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

13

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquodirectrdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

14

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= +

C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= +

C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= +

C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

How hard is hand-tuning matmul anyway

22

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan ’79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
– Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω( w³/M^(1/2) ), etc.
• Thm (George ’73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
– w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
– Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
   Words_moved = Ω( min( dn/P^(1/2), d²n/P ) )
– Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: Words_moved = Ω( d²n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
– Algorithms:
   • LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
   • All-pairs-shortest-path, …
   • Both 2D (c=1) and 2.5D (c>1)
   • But only bandwidth may decrease with c>1, not latency
   • Sparse matrices
– Platforms:
   • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
– Software:
   • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
– Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
– Qbox, with LLNL, IBM, on IBM BG/Q
– Q-Chem, work in progress

Outline
• “Direct” Linear Algebra
   • Lower bounds on communication
   • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra

Recall optimal sequential Matmul
• Naïve code:
   for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
• “Blocked” code:
   for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
         i = i1+i2,  j = j1+j2,  k = k1+k2
         C(i,j) += A(i,k)*B(k,j)      … b x b matmul in the inner loops
• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from?
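A minimal runnable version of the blocked loop nest above (a NumPy sketch with illustrative names, not the tuned kernels discussed in the talk); b is chosen so that roughly three b-by-b tiles fit in a fast memory of M words.

    import numpy as np

    def blocked_matmul(A, B, C, M):
        # C += A*B over b-by-b tiles, with 3*b^2 <= M so that one tile of each
        # of A, B, C fits in fast memory at a time
        n = C.shape[0]
        b = max(1, int((M / 3) ** 0.5))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    n, M = 256, 3 * 32 * 32                      # fast memory holding three 32x32 tiles
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    C = blocked_matmul(A, B, np.zeros((n, n)), M)
    assert np.allclose(C, A @ B)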

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

             i  j  k
           [ 1  0  1 ]   A
     Δ =   [ 0  1  1 ]   B
           [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx  s.t.  Δ·x ≤ 1
– Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = e
• Thm: words_moved = Ω(n³/M^(e-1)) = Ω(n³/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
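The LP above is small enough to check mechanically; this is an illustrative SciPy snippet (mine, not from the talk) that maximizes 1ᵀx subject to Δ·x ≤ 1, x ≥ 0 for the matmul index matrix.

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],     # A(i,k)
                      [0, 1, 1],     # B(k,j)
                      [1, 1, 0]])    # C(i,j)
    res = linprog(c=[-1, -1, -1],    # maximize sum(x) == minimize -sum(x)
                  A_ub=Delta, b_ub=[1, 1, 1], bounds=[(0, None)] * 3)
    print(res.x, -res.fun)           # x = [0.5, 0.5, 0.5], e = 1.5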

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

             i  j
           [ 1  0 ]   F
     Δ =   [ 1  0 ]   P(i)
           [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx  s.t.  Δ·x ≤ 1
– Result: x = [1, 1], 1ᵀx = 2 = e
• Thm: words_moved = Ω(n²/M^(e-1)) = Ω(n²/M¹)
  Attained by block sizes M^xi, M^xj = M¹, M¹
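And the corresponding blocked loop for the n-body case, as a hedged sketch (the toy force kernel and names are mine): both the i-tile and the j-tile of particles are sized ~M, per the LP solution x = [1, 1].

    import numpy as np

    def force(pi, pj):
        d = pj - pi
        return d / (d * d + 1e-12)        # toy softened force, zero when i == j

    def blocked_nbody(P, M):
        n = len(P)
        F = np.zeros(n)
        b = max(1, int(M))                # block size M^1 in each of i and j
        for i0 in range(0, n, b):
            for j0 in range(0, n, b):
                Pi = P[i0:i0+b, None]     # one tile of targets ...
                Pj = P[None, j0:j0+b]     # ... against one tile of sources
                F[i0:i0+b] += force(Pi, Pj).sum(axis=1)
        return F

    P = np.random.rand(1024)
    F = blocked_nbody(P, M=128)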

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
     A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
     A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

             i1 i2 i3 i4 i5 i6
           [ 1  0  1  0  0  1 ]   A1
           [ 1  1  0  1  0  0 ]   A2
     Δ =   [ 0  1  1  0  1  0 ]   A3
           [ 0  0  1  1  0  1 ]   A3, A4
           [ 0  1  0  0  0  1 ]   A5
           [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1,…,x6]ᵀ: max 1ᵀx  s.t.  Δ·x ≤ 1
– Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = e
• Thm: words_moved = Ω(n⁶/M^(e-1)) = Ω(n⁶/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
   for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z³
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
   for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by “projections”, e.g.
        φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
        φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound
• Thm: Given a program with array refs given by projections φⱼ, then there is an e ≥ 1 such that
   words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
   minimize e = Σⱼ eⱼ subject to
   rank(H) ≤ Σⱼ eⱼ·rank(φⱼ(H)) for all subgroups H < Z^k
• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1,…,sm,
   |S| ≤ Πⱼ |φⱼ(S)|^sⱼ
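As a worked instance (my own illustration, under the usual segment argument): for matmul the projections are φ_C(i,j,k) = (i,j), φ_A(i,j,k) = (i,k), φ_B(i,j,k) = (k,j), and the exponents s = (1/2, 1/2, 1/2) satisfy the theorem, giving the Loomis–Whitney special case. Applied to one segment of the execution that moves O(M) words, so each array image has O(M) points:

    % HBL / Loomis-Whitney bound for one segment of matmul
    |S| \le |\phi_C(S)|^{1/2}\,|\phi_A(S)|^{1/2}\,|\phi_B(S)|^{1/2} \le O(M)^{3/2}

    % there are n^3 iterations in all, so the number of full segments is at
    % least about n^3 / O(M^{3/2}), and each full segment moves M words:
    \mathrm{words\_moved} \;\ge\; \Omega\!\Big(\frac{n^3}{M^{3/2}}\Big)\cdot M
                          \;=\; \Omega\!\Big(\frac{n^3}{M^{1/2}}\Big)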

Is this bound attainable? (1/2)
• But first: Can we write it down?
– One inequality per subgroup H < Z^d, but still finitely many?
– Thm (bad news): Writing down all inequalities in LP reduces to Hilbert’s 10th problem over Q
   • Could be undecidable: open question
– Thm (good news): Another LP has same solution, is decidable (but expensive so far)
– Thm (better news): Easy to write LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
– Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, “random code”, join, …
• Conjecture: always attainable (modulo dependencies); work in progress

Ongoing Work
• Identify more decidable cases
– Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend “perfect scaling” results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

Outline
• “Direct” Linear Algebra
   • Lower bounds on communication
   • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra

Avoiding Communication in Iterative Linear Algebra
• k steps of iterative solver for sparse Ax=b or Ax=λx
– Does k SpMVs with A and starting vector
– Many such “Krylov Subspace Methods”
• Goal: minimize communication
– Assume matrix “well-partitioned”
– Serial implementation:
   • Conventional: O(k) moves of data from slow to fast memory
   • New: O(1) moves of data – optimal
– Parallel implementation on p processors:
   • Conventional: O(k log p) messages (k SpMV calls, dot prods)
   • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
– Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]

[Figure: entries of x, A·x, A²·x, A³·x over mesh points 1, 2, 3, 4, …, 32]

• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Works for any “well-partitioned” A


Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]

[Figure: the same tridiagonal example, built up block by block in Steps 1, 2, 3, 4]

• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Sequential Algorithm: Steps 1–4, one block of the mesh at a time
• Example: A tridiagonal, n=32, k=3

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]

[Figure: the same example partitioned across Proc 1, Proc 2, Proc 3, Proc 4]

• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Parallel Algorithm
• Example: A tridiagonal, n=32, k=3
• Each processor communicates once with neighbors
• Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem
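A hedged serial sketch of the kernel for the tridiagonal example (names and structure are mine): one block owner pulls k layers of “ghost” entries of x up front and then computes its rows of [Ax, A²x, …, Aᵏx] with no further communication, at the price of some redundant flops near the block boundary.

    import numpy as np

    def local_matrix_powers(x, lo, hi, k, a, b, c):
        # Rows [lo, hi) of [A x, A^2 x, ..., A^k x] for A = tridiag(a, b, c),
        # using only x[lo-k : hi+k]; entries contaminated by missing neighbours
        # stay within k of the ghost boundary, i.e. outside the owned range.
        n = len(x)
        g_lo, g_hi = max(0, lo - k), min(n, hi + k)
        v = x[g_lo:g_hi].copy()
        out = []
        for _ in range(k):
            w = b[g_lo:g_hi] * v
            w[1:] += a[g_lo + 1:g_hi] * v[:-1]
            w[:-1] += c[g_lo:g_hi - 1] * v[1:]
            out.append(w[lo - g_lo:hi - g_lo].copy())
            v = w
        return out

    n, k = 32, 3
    a, b, c = np.random.rand(n), np.random.rand(n), np.random.rand(n)
    A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
    x = np.random.rand(n)
    blocks = local_matrix_powers(x, 8, 16, k, a, b, c)   # e.g. "Proc 2" owns rows 8..15
    assert np.allclose(blocks[2], (np.linalg.matrix_power(A, 3) @ x)[8:16])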

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax-b||₂

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)                … SpMV
      MGS(w, v(0), …, v(i-1))       … Modified Gram-Schmidt
      update v(i), H
   endfor
   solve LSQ problem with H

Communication-avoiding GMRES:
   W = [ v, Av, A²v, …, Aᵏv ]
   [Q,R] = TSQR(W)                  … “Tall Skinny QR”
   build H from R
   solve LSQ problem with H

• Oops – W from power method, precision lost!
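For reference, a hedged NumPy sketch of the TSQR reduction used above (a flat binary tree over row blocks; the real implementations choose the reduction tree to match the machine):

    import numpy as np

    def tsqr_R(W, nblocks=4):
        # R factor of a tall-skinny W via a tree of small local QRs
        Rs = [np.linalg.qr(blk, mode='r') for blk in np.array_split(W, nblocks)]
        while len(Rs) > 1:                     # pairwise combine, level by level
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(10000, 10)              # e.g. the Krylov basis [v, Av, ..., A^k v]
    R = tsqr_R(W)
    Rref = np.linalg.qr(W, mode='r')
    assert np.allclose(np.abs(R), np.abs(Rref))   # R is unique up to signs of its rows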

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

“Monomial” basis [Ax, …, Aᵏx] fails to converge

A different polynomial basis [p1(A)x, …, pk(A)x] does converge

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab with Residual Replacement (RR), à la Van der Vorst and Ye

                    Naive     Monomial                                Newton         Chebyshev
Replacement Its.:   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication:

                                        Indices: Explicit (O(nnz))   Indices: Implicit (o(nnz))
   Nonzero entries: Explicit (O(nnz))   CSR and variations           Vision, climate, AMR, …
   Nonzero entries: Implicit (o(nnz))   Graph Laplacian              Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: With stencils (all implicit), all communication is for vectors
• Operations:
   • [x, Ax, A²x, …, Aᵏx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
   • Number of columns in x
   • [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
   • Return all vectors or just last one
• Co-tuning and/or interleaving:
   • W = [x, Ax, A²x, …, Aᵏx] and TSQR(W), or WᵀW, or …
   • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
– Many different algorithms reorganized
   • More underway, more to be done
– Need to recognize stable variants more easily
– Preconditioning
   • Hierarchically Semiseparable Matrices
– Autotuning and synthesis
   • pOSKI for SpMV – available at bebop.cs.berkeley.edu
   • Different kinds of “sparse matrices”

For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley’s Parallel Computing Course
– Live broadcast in Spring 2014
   • www.cs.berkeley.edu/~demmel
– On-line version planned in Spring 2014
   • www.xsede.org
   • Free supercomputer accounts to do homework
   • University credit with local instructors
– 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation
• Do you get the same answer if you run the same program twice with the same input?
– Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible
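A one-line illustration (mine, not from the slides) of the nonassociativity behind this:

    vals = [0.1, 0.2, 0.3]
    print((vals[0] + vals[1]) + vals[2])   # 0.6000000000000001
    print(vals[0] + (vals[1] + vals[2]))   # 0.6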

• First release of the ReproBLAS
– Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
– Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC’13 later this week

Summary

Don’t Communic…

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 15: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformationsbull Only a few sparse algorithms so farbull Lots of work in progressbull Case study Matrix Multiply15

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= +

C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= +

C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= +

C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}      … b^2 × (n/b)^2 = n^2 reads
    for k = 1 to n/b
      {read block A(i,k) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
      {read block B(k,j) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
      C(i,j) = C(i,j) + A(i,k) * B(k,j)       {do a matrix multiply on b-by-b blocks}
    {write block C(i,j) back to slow memory}  … b^2 × (n/b)^2 = n^2 writes

[Diagram: b-by-b block C(i,j) is updated from block row i of A and block column j of B]

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!

Does blocked matmul attain lower bound?

• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 2n^3/(M/3)^(1/2) + 2n^2
• Attains lower bound = Ω(#flops / M^(1/2))
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

21
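To make the tiling concrete, here is a minimal NumPy sketch of the blocked algorithm above, choosing b so that three b-by-b blocks fit in an assumed fast-memory capacity M (given in words); the function name, matrix size and value of M are illustrative, not from the slides.

import numpy as np

def blocked_matmul(A, B, C, M):
    """C += A*B with b-by-b tiles, b chosen so that 3*b^2 <= M (M = fast-memory words)."""
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))            # largest block size with 3 blocks resident
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]              # "read block C(i,j)" (view, updated in place)
            for k in range(0, n, b):
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]   # multiply on blocks
    return C

n, M = 512, 3 * 64**2                          # pretend fast memory holds three 64x64 blocks
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = blocked_matmul(A, B, np.zeros((n, n)), M)
assert np.allclose(C, A @ B)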

How hard is hand-tuning matmul, anyway?

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

How hard is hand-tuning matmul, anyway?

23

Recursive Matrix Multiplication (RMM) (1/2)

• For simplicity: square matrices with n = 2^m
• C = [[C11, C12], [C21, C22]] = A · B = [[A11, A12], [A21, A22]] · [[B11, B12], [B21, B22]]
    = [[A11·B11 + A12·B21, A11·B12 + A12·B22],
       [A21·B11 + A22·B21, A21·B12 + A22·B22]]
• True when each Aij etc. is 1x1 or n/2 x n/2

func C = RMM (A, B, n)
  if n = 1, C = A * B
  else
    C11 = RMM (A11, B11, n/2) + RMM (A12, B21, n/2)
    C12 = RMM (A11, B12, n/2) + RMM (A12, B22, n/2)
    C21 = RMM (A21, B11, n/2) + RMM (A22, B21, n/2)
    C22 = RMM (A21, B12, n/2) + RMM (A22, B22, n/2)
  return

Recursive Matrix Multiplication (RMM) (2/2)

func C = RMM (A, B, n) … as above

A(n) = # arithmetic operations in RMM( . , . , n)
     = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
     = 2n^3                          … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM( . , . , n)
     = 8 · W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
     = O( n^3 / M^(1/2) + n^2 )      … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not panacea
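A cache-oblivious NumPy sketch of the RMM recursion above; the base-case cutoff is a made-up tuning choice (a real implementation would recurse further or switch to a tuned kernel).

import numpy as np

def rmm(A, B, cutoff=64):
    """Cache-oblivious recursive matmul: split each operand into 2x2 blocks."""
    n = A.shape[0]
    if n <= cutoff:                   # small enough: fall back to a direct multiply
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty_like(A)
    C[:h, :h] = rmm(A11, B11, cutoff) + rmm(A12, B21, cutoff)
    C[:h, h:] = rmm(A11, B12, cutoff) + rmm(A12, B22, cutoff)
    C[h:, :h] = rmm(A21, B11, cutoff) + rmm(A22, B21, cutoff)
    C[h:, h:] = rmm(A21, B12, cutoff) + rmm(A22, B22, cutoff)
    return C

n = 2**9                              # n = 2^m, as on the slide
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(rmm(A, B), A @ B)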

CARMA Performance: Shared Memory

[Plot: Square case, m = k = n (log-linear axes); curves for MKL (single/double), CARMA (single/double), and single/double peak. Machine: Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA]

CARMA Performance: Shared Memory

[Plot: Inner Product case, m = n = 64 (log-linear axes); curves for MKL (single/double) and CARMA (single/double). Machine: Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA]

Why is CARMA Faster? L3 Cache Misses

[Plot: Shared Memory Inner Product (m = n = 64, k = 524,288); CARMA has 97% fewer and 86% fewer L3 misses than MKL]

Parallel MatMul with 2D Processor Layout

• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P = 16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij

[Diagram: C = A * B, with each of C, A, B laid out on the same 4x4 grid P00 … P33]

30

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    #words_moved = Ω( #flops / M^(1/2) ) = Ω( (n^3/P) / (n^2/P)^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( #flops / M^(3/2) ) = Ω( (n^3/P) / (n^2/P)^(3/2) ) = Ω( P^(1/2) )
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/ (lawn 96, lawn 100)
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

31

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

• C(i,j) is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is an n/P^(1/2) x b submatrix of A
• B(k,j) is a b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be square processor grid

[Diagram: C(i,j) accumulates the product of block column A(i,k) and block row B(k,j)]

32

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

[Diagram: C(i,j) accumulates the product of block column A(i,k), received into Acol, and block row B(k,j), received into Brow]

For k = 0 to n/b - 1
  for all i = 1 to P^(1/2)
    owner of A(i,k) broadcasts it to whole processor row (using binary tree)
  for all j = 1 to P^(1/2)
    owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^(1/2)
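A serial NumPy simulation of the SUMMA loop above on a virtual √P x √P grid, with the broadcasts replaced by direct reads of the owner's panel, just to make the blocking explicit; the grid size and panel width b are illustrative choices.

import numpy as np

def summa_simulated(A, B, p_sqrt=4, b=32):
    """Serial simulation of SUMMA on a virtual p_sqrt x p_sqrt processor grid."""
    n = A.shape[0]
    blk = n // p_sqrt                                  # local tile size n / sqrt(P)
    C = np.zeros_like(A)
    for k in range(0, n, b):                           # one panel per "broadcast" step
        Acol = A[:, k:k+b]                             # column panel A(:,k), row-broadcast
        Brow = B[k:k+b, :]                             # row panel B(k,:), column-broadcast
        for i in range(p_sqrt):                        # every processor (i,j) updates its tile
            for j in range(p_sqrt):
                C[i*blk:(i+1)*blk, j*blk:(j+1)*blk] += \
                    Acol[i*blk:(i+1)*blk, :] @ Brow[:, j*blk:(j+1)*blk]
    return C

n = 256
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(summa_simulated(A, B), A @ B)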

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. Eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD

Can we do better?

• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul": uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
• Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid    (Example: P = 32, c = 2)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)

2.5D Matmul on BG/P, 16K nodes / 64K cores

[Plot: 2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies; speedup callouts of 12x and 2.7x over 2D]

Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec, for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
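A small calculator for the timing and energy models above; only the formulas come from the slides, and every parameter value below is a made-up placeholder.

def time_model(n, P, M, m, gammaT, betaT, alphaT):
    """T(P) = (n^3/P) * [gammaT + betaT/M^(1/2) + alphaT/(m*M^(1/2))], per the slide."""
    return (n**3 / P) * (gammaT + betaT / M**0.5 + alphaT / (m * M**0.5))

def energy_model(n, P, M, m, gammaE, betaE, alphaE, deltaE, epsE, T):
    """E(P) = P * [ (n^3/P)*(gammaE + betaE/M^(1/2) + alphaE/(m*M^(1/2))) + deltaE*M*T + epsE*T ]."""
    per_proc = (n**3 / P) * (gammaE + betaE / M**0.5 + alphaE / (m * M**0.5))
    return P * (per_proc + deltaE * M * T + epsE * T)

# Illustrative check of perfect strong scaling: replace P by c*P, keep M per processor fixed.
n, P, c, m = 10_000, 64, 4, 1_000
M = 3 * n**2 / P                       # start from minimal memory, P*M = 3n^2
T1 = time_model(n, P,   M, m, 1e-9, 1e-8, 1e-6)
Tc = time_model(n, c*P, M, m, 1e-9, 1e-8, 1e-6)
E1 = energy_model(n, P,   M, m, 1e-9, 1e-8, 1e-6, 1e-10, 1e-3, T1)
Ec = energy_model(n, c*P, M, m, 1e-9, 1e-8, 1e-6, 1e-10, 1e-3, Tc)
print(Tc * c / T1, Ec / E1)            # both ~1.0: T(cP) = T(P)/c and E(cP) = E(P)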

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n^3 and minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)   (see the sketch below)
• Optimal Algorithm for nxn matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
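A sketch of the closed-form work assignment above; the machine parameters are hypothetical, only the formula for Fi and T comes from the slide.

def optimal_work_split(gammas, betas, alphas, memories, n):
    """F_i = n^3 * (1/xi_i) / sum_j (1/xi_j), with xi_i = gamma_i + beta_i/M_i^0.5 + alpha_i/M_i^1.5."""
    xis = [g + b / M**0.5 + a / M**1.5
           for g, b, a, M in zip(gammas, betas, alphas, memories)]
    inv_sum = sum(1.0 / xi for xi in xis)
    F = [n**3 * (1.0 / xi) / inv_sum for xi in xis]
    T = n**3 / inv_sum                      # every processor finishes at the same time F_i*xi_i = T
    return F, T

# Hypothetical machine: one fast processor with a big memory, one slow one.
F, T = optimal_work_split(gammas=[1e-10, 1e-9], betas=[1e-9, 1e-8],
                          alphas=[1e-6, 1e-6], memories=[1e7, 1e6], n=4096)
print([f / sum(F) for f in F], T)           # share of flops per processor, predicted time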

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)*B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

[Diagram: C(i,j,k) = Σm A(i,j,m)*B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

TSQR: QR of a Tall, Skinny matrix

(Writing [X; Y] for blocks stacked vertically:)

W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
                     = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
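A two-level TSQR sketch in NumPy following the tree above; it returns only the final R factor and keeps the local Q factors implicit, unlike a full implementation.

import numpy as np

def tsqr(W, nblocks=4):
    """Tree QR for a tall-skinny W: factor row blocks, then re-factor stacked R's."""
    blocks = np.array_split(W, nblocks)                               # W0 .. W3
    Rs = [np.linalg.qr(Wi, mode='reduced')[1] for Wi in blocks]       # R00 .. R30
    Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='reduced')[1]       # R01, R11
          for i in range(0, len(Rs), 2)]
    return np.linalg.qr(np.vstack(Rs), mode='reduced')[1]             # R02

m, n = 100_000, 10                       # tall and skinny
W = np.random.rand(m, n)
R_tree = tsqr(W)
R_ref = np.linalg.qr(W, mode='reduced')[1]
# R is unique only up to the signs of its rows, so compare absolute values.
assert np.allclose(np.abs(R_tree), np.abs(R_ref))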

TSQR: An Architecture-Dependent Algorithm

Parallel:   W = [W0; W1; W2; W3]; binary reduction tree: (R00, R10, R20, R30) → (R01, R11) → R02

Sequential: W = [W0; W1; W2; W3]; flat tree, folding one block in at a time: R00 → R01 → R02 → R03

Dual Core:  W = [W0; W1; W2; W3]; hybrid of the two trees, two blocks reduced at a time

Can choose reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?

TSQR Performance Results

• Parallel
  – Intel Clovertown
    – Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: Use reduction tree, to do "Tournament Pivoting"

W_nxb = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[W1'; W2'] = P12·L12·U12,  choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34,  choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234,  choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
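A sketch of the tournament itself, using ordinary partial-pivoted LU (scipy.linalg.lu) as the per-block selection rule; this only illustrates the reduction tree and row selection, it is not the CALU implementation.

import numpy as np
from scipy.linalg import lu

def choose_pivot_rows(block, b):
    """Return indices (within the block) of the first b pivot rows chosen by GEPP."""
    p, l, u = lu(block)                      # block = p @ l @ u, partial pivoting
    order = np.argmax(p, axis=0)             # order[i] = original row placed in pivot position i
    return order[:b]

def tournament_pivots(W, b, nblocks=4):
    """Pick b candidate rows per block, then merge pairs of candidate sets up the tree."""
    splits = np.array_split(np.arange(W.shape[0]), nblocks)
    candidates = [idx[choose_pivot_rows(W[idx], b)] for idx in splits]   # leaves
    while len(candidates) > 1:                                           # tree reduction
        merged = []
        for i in range(0, len(candidates), 2):
            idx = np.concatenate(candidates[i:i+2])
            merged.append(idx[choose_pivot_rows(W[idx], b)])
        candidates = merged
    return candidates[0]      # b rows to move to the top before unpivoted LU on all of W

W = np.random.rand(4096, 32)
print(tournament_pivots(W, b=32))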

LU Speedups from Tournament Pivoting and 2.5D

[Plot: LU speedups from tournament pivoting and 2.5D]

2.5D vs 2D LU, With and Without Pivoting

[Plot: 2.5D vs 2D LU, with and without pivoting]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: speedup vs log2(p) (horizontal axis) and log2(n^2/p) = log2(memory_per_proc) (vertical axis); up to 29x faster]

Other CA algorithms

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A*P = Q*R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach ``reveals the rank'' of A in the sense that the leading rxr submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:       #words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:    #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   #words_moved = Ω( M·(n/M^(1/2))^ω / P )

vs.

• Runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory   (BFS step)
• Runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory   (DFS step)

Communication Avoiding Parallel Strassen (CAPS):
  if EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

[Plot: strong scaling; speedups 24%–184% over previous Strassen-based algorithms]

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA - SBR

[Animation: Conventional touches all data 4 times; Communication-Avoiding touches all data once]

Many tuning parameters: right choices reduce #words_moved by factor M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), mink(A(i,k)+B(k,j))) by D = A ⊗ B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊗ D12
    D21 = D21 ⊗ D11
    D22 = D21 ⊗ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊗ D21
    D12 = D12 ⊗ D22
    D11 = D12 ⊗ D21
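A NumPy sketch of Kleene's recursion above over the min-plus semiring (the ⊗ of the slide, which folds the old D into the result), checked against plain Floyd-Warshall; the graph size and weights are illustrative.

import numpy as np

def minplus(D, A, B):
    """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)) -- the slide's D = A (x) B."""
    return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

def dc_apsp(A):
    """Divide-and-conquer APSP (Kleene's algorithm) on a dense distance matrix."""
    n = A.shape[0]
    if n == 1:
        return A
    h = n // 2
    D = A.copy()
    D[:h, :h] = dc_apsp(D[:h, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
    D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
    D[h:, h:] = dc_apsp(D[h:, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
    D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
    return D

# Compare against plain Floyd-Warshall on a small random complete graph.
n = 64
A = np.random.rand(n, n)
np.fill_diagonal(A, 0.0)
D_fw = A.copy()
for k in range(n):
    D_fw = np.minimum(D_fw, D_fw[:, [k]] + D_fw[[k], :])
assert np.allclose(dc_apsp(A), D_fw)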

Performance of 2.5D APSP using Kleene

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); callouts: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: #Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable, new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both are Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), iid
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω( min( dn/P^(1/2), d^2 n/P ) )
  – Proof exploits fact that reuse of entries of C = A*B unlikely
• Contrast general lower bound: #Words_moved = Ω( d^2 n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
      i = i1+i2, j = j1+j2, k = k1+k2
      C(i,j) += A(i,k)*B(k,j)          … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

           i  j  k
      Δ = [1  0  1]   A
          [0  1  1]   B
          [1  1  0]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: #words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
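The LP on this slide can be solved mechanically; a sketch using scipy.optimize.linprog, with the matmul Δ above and the n-body Δ from the next slide as inputs.

import numpy as np
from scipy.optimize import linprog

def hbl_exponent(delta):
    """Solve max 1^T x s.t. delta @ x <= 1, x >= 0; returns (x, e = 1^T x).
    e is the exponent in words_moved = Omega(#iterations / M^(e-1))."""
    m, d = delta.shape
    res = linprog(c=-np.ones(d), A_ub=delta, b_ub=np.ones(m))   # default bounds: x >= 0
    return res.x, -res.fun

# Matmul: rows of delta are the index sets of A(i,k), B(k,j), C(i,j).
delta_matmul = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [1, 1, 0]])
print(hbl_exponent(delta_matmul))   # x ~ [0.5, 0.5, 0.5], e = 1.5 -> Omega(n^3 / M^(1/2))

# Direct n-body (next slide): rows for F(i), P(i), P(j).
delta_nbody = np.array([[1, 0],
                        [1, 0],
                        [0, 1]])
print(hbl_exponent(delta_nbody))    # x ~ [1, 1], e = 2 -> Omega(n^2 / M)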

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

           i  j
      Δ = [1  0]   F
          [1  0]   P(i)
          [0  1]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: #words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

[Plot: 11.8x speedup from the communication-avoiding (replicated) n-body algorithm]

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

           i1 i2 i3 i4 i5 i6
      Δ = [1  0  1  0  0  1]   A1
          [1  1  0  1  0  0]   A2
          [0  1  1  0  1  0]   A3
          [0  0  1  1  0  1]   A3, A4
          [0  1  0  0  0  1]   A5
          [1  0  0  1  1  0]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: #words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3
      Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k
      Access locations indexed by "projections", e.g.
        φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
        φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φ_j, then there is an e ≥ 1 such that
    #words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σ_j e_j subject to
    rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S subset of Z^k, group homomorphisms φ_1, φ_2, …, bound |S| in terms of |φ_1(S)|, |φ_2(S)|, …, |φ_m(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s_1, …, s_m
    |S| ≤ Π_j |φ_j(S)|^(s_j)

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has same solution, is decidable (but expensive so far)
  – Thm (better news): Easy to write LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A⋅x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Diagram: entries 1 2 3 4 … 32 of x, A·x, A^2·x, A^3·x, with the dependencies between levels]

• Sequential Algorithm: Step 1, Step 2, Step 3, Step 4 sweep across the matrix, each step computing a block of all k levels before moving on
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  – Each processor communicates once with neighbors
  – Each processor works on (overlapping) trapezoid
• Same idea works for general sparse matrices
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
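For reference, a naive version of the kernel (k separate SpMVs); the communication-avoiding variants sketched above compute the same vectors [Ax, …, A^k x] while reading A (and communicating) only O(1) times. The tridiagonal test matrix mirrors the slide's example.

import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    """Naive matrix powers kernel: returns rows [A x, A^2 x, ..., A^k x]."""
    V = np.empty((k, x.size))
    v = x
    for i in range(k):
        v = A @ v            # one SpMV per level; CA versions fuse these passes
        V[i] = v
    return V

n, k = 32, 3
A = sp.diags([np.ones(n-1), 2*np.ones(n), np.ones(n-1)], offsets=[-1, 0, 1], format='csr')
x = np.random.rand(n)
V = matrix_powers(A, x, k)   # rows are A·x, A^2·x, A^3·x
print(V.shape)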

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                   … SpMV
    MGS(w, v(0), …, v(i-1))          … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q, R] = TSQR(W)                   … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• Oops – W from power method, precision lost!

"Monomial" basis [Ax, …, A^kx] fails to converge
Different polynomial basis [p1(A)x, …, pk(A)x] does converge

93
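A toy NumPy illustration of the "precision lost" bullet above: the monomial Krylov basis W becomes extremely ill-conditioned as k grows, which is why CA-GMRES switches to Newton or Chebyshev polynomial bases. The diagonal test matrix is a made-up example, and plain QR stands in for TSQR.

import numpy as np

def krylov_basis(A, v, k):
    """Monomial Krylov basis W = [v, Av, ..., A^k v] (the matrix-powers part of CA-GMRES)."""
    W = np.empty((v.size, k + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]
    return W

n, k = 1000, 20
A = np.diag(np.linspace(1, 100, n))       # toy matrix with a spread-out spectrum
v = np.random.rand(n)
W = krylov_basis(A, v, k)
print(np.linalg.cond(W))                  # huge: the monomial basis loses precision
Q, R = np.linalg.qr(W)                    # in CA-GMRES this step is done by TSQR
print(np.linalg.cond(Q))                  # ~1 after orthogonalization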

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye:

                   Naive     Monomial                                Newton        Chebyshev
  Replacement Its  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit) all communication is for vectors

                          Nonzero entries:
  Indices:                Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))       CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))       Graph Laplacian        Stencils

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, summation order not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
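A tiny illustration of why reproducibility is an issue: floating point summation depends on ordering. The values and seed below are arbitrary; ReproBLAS addresses this with reproducible reductions.

import random

random.seed(0)
xs = [random.uniform(-1, 1) * 10**random.randint(-8, 8) for _ in range(100_000)]

left_to_right = sum(xs)
shuffled = list(xs)
random.shuffle(shuffled)              # a different "reduction order"
reordered = sum(shuffled)

print(left_to_right == reordered)     # usually False
print(left_to_right - reordered)      # small but nonzero difference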

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

Page 16: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

16

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n for j = 1 to n

for k = 1 to n C(ij) = C(ij) + A(ik) B(kj)

= +

C(ij) A(i)

B(j)C(ij)

17

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory for j = 1 to n read C(ij) into fast memory read column j of B into fast memory for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory

= +

C(ij) A(i)

B(j)C(ij)

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= +

C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

How hard is hand-tuning matmul anyway

22

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

              i1 i2 i3 i4 i5 i6
               1  0  1  0  0  1    A1
               1  1  0  1  0  0    A2
        Δ =    0  1  1  0  1  0    A3
               0  0  1  1  0  1    A3, A4
               0  1  0  0  0  1    A5
               1  0  0  1  1  0    A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
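(Not from the slides: the same mechanical check as for matmul, again assuming SciPy; the optimal value should be 15/7 ≈ 2.14, though the solver may return any optimal x.)

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                      [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                      [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                      [0, 0, 1, 1, 0, 1],   # A3, A4 (i3,i4,i6)
                      [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                      [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)
    res = linprog(c=-np.ones(6), A_ub=Delta, b_ub=np.ones(6),
                  bounds=(0, None), method="highs")
    print(-res.fun)                          # expect 15/7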

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2·i3-i7) = func(A(i2+3·i4, i1, i2, i1+i2, …), B(pnt(3·i4)), …)
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2·i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3·i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φ_j, then there is an e ≥ 1 such that

    words_moved = Ω(#iterations / M^(e-1))

  where e is the value of a linear program:
    minimize e = Σ_j e_j subject to
    rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
  Given S a subset of Z^k and group homomorphisms φ_1, φ_2, …, bound |S| in terms of |φ_1(S)|, |φ_2(S)|, …, |φ_m(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s_1, …, s_m,

    |S| ≤ Π_j |φ_j(S)|^(s_j)
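(Not from the slides: a worked instance showing how the inequality yields the familiar matmul bound via the standard segment argument; the factor 2M assumes a segment can touch at most the M words already in fast memory plus the M words it moves.)

    % Matmul special case (Loomis-Whitney): with phi_A, phi_B, phi_C as above and each s_j = 1/2,
    \[
      |S| \;\le\; |\phi_A(S)|^{1/2}\,|\phi_B(S)|^{1/2}\,|\phi_C(S)|^{1/2} \;\le\; (2M)^{3/2}
    \]
    % per segment of execution that moves M words.  With n^3 iterations in total,
    % at least n^3/(2M)^{3/2} segments are needed, so
    \[
      \text{words\_moved} \;\ge\; M \cdot \frac{n^3}{(2M)^{3/2}} \;=\; \Omega\!\left(\frac{n^3}{M^{1/2}}\right),
    \]
    % matching e - 1 = 3/2 - 1 = 1/2 from the LP above.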

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many?
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of the indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

78

Communication Avoiding Kernels:
The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A
  [Figure, built up over a sequence of animation slides: rows x, A·x, A^2·x, A^3·x over columns 1 2 3 4 … 32, showing which entries each level depends on]

• Sequential Algorithm
  – Computes the levels one block of columns at a time, in Steps 1, 2, 3, 4
  [Figure: the same diagram partitioned into Step 1, Step 2, Step 3, Step 4]

• Parallel Algorithm
  – Each processor communicates once with neighbors
  – Each processor works on an (overlapping) trapezoid
  [Figure: the same diagram partitioned among Proc 1, Proc 2, Proc 3, Proc 4]

• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
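(Not from the slides: a minimal sketch of what the kernel computes, here the naive version with k separate SpMVs; the communication-avoiding reorganization instead gives each processor, or each cache-sized block, the overlapping "trapezoid" of rows and vector entries it needs for all k levels, so A and x are read only once. SciPy is assumed.)

    import numpy as np
    import scipy.sparse as sp

    def matrix_powers(A, x, k):
        # Naive kernel: k sparse matrix-vector products, reading A (and x) k times.
        V = [x]
        for _ in range(k):
            V.append(A @ V[-1])
        return V[1:]                       # [A·x, A^2·x, ..., A^k·x]

    # Toy example matching the slides: tridiagonal A, n = 32, k = 3.
    n, k = 32, 3
    A = sp.diags([np.ones(n - 1), 2 * np.ones(n), np.ones(n - 1)], [-1, 0, 1], format="csr")
    x = np.ones(n)
    Ax, A2x, A3x = matrix_powers(A, x, k)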

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax-b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

92

• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k

• "Monomial" basis [Ax, …, A^kx] fails to converge
• Different polynomial basis [p1(A)x, …, pk(A)x] does converge

93
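(Not from the slides: a sketch of the basis construction at the heart of CA-GMRES; the Newton shifts would come from eigenvalue estimates and are just an input here, and plain np.linalg.qr stands in for TSQR, which computes the same Q and R.)

    import numpy as np

    def krylov_basis(A, v, k, shifts=None):
        # Build W = [v, p1(A)v, ..., pk(A)v] in one matrix-powers-style sweep.
        # shifts=None gives the monomial basis [v, Av, ..., A^k v], which behaves like
        # the power method: columns line up with the dominant eigenvector and W becomes
        # ill-conditioned ("precision lost").  Nonzero shifts give a Newton basis,
        # (A - s1·I)v, (A - s2·I)(A - s1·I)v, ..., which stays better conditioned.
        W = [v / np.linalg.norm(v)]
        for i in range(k):
            w = A @ W[-1]
            if shifts is not None:
                w = w - shifts[i] * W[-1]
            W.append(w)
        W = np.column_stack(W)
        Q, R = np.linalg.qr(W)             # stand-in for TSQR; H is then built from R
        return Q, R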

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

                      Naive     Monomial                                Newton         Chebyshev
  Replacement Its.:   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication:

                                          Indices
                                  Explicit (O(nnz))       Implicit (o(nnz))
  Nonzero   Explicit (O(nnz)):    CSR and variations      Vision, climate, AMR, …
  entries   Implicit (o(nnz)):    Graph Laplacian         Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: With stencils (all implicit), all communication is for vectors

• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one

• Cotuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
  – www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
• www.xsede.org
  – Free supercomputer accounts to do homework
  – University credit with local instructors
  – 3-day short course every August
• ~100-page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative; summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
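(Not from the slides: a two-line illustration of the nonassociativity ReproBLAS is designed around, using plain Python doubles.)

    terms = [1e16, 1.0, -1e16]
    print(sum(terms))               # 0.0 : 1e16 + 1.0 rounds back to 1e16 before the cancellation
    print(sum([1e16, -1e16, 1.0]))  # 1.0 : the large terms cancel first, so the 1.0 survives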

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 18: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

18

Naiumlve Matrix Multiply

implements C = C + ABfor i = 1 to n read row i of A into fast memory hellip n2 reads altogether for j = 1 to n read C(ij) into fast memory hellip n2 reads altogether read column j of B into fast memory hellip n3 reads altogether for k = 1 to n C(ij) = C(ij) + A(ik) B(kj) write C(ij) back to slow memory hellip n2 writes altogether

= +

C(ij) A(i)

B(j)C(ij)

n3 + 3n2 readswrites altogether ndash dominates 2n3 arithmetic

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

How hard is hand-tuning matmul anyway

22

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if there is more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i, j, k) performs C(i, j) = C(i, j) + A(i, k)·B(k, j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
• Not always that much memory available…

2.5D Matrix Multiplication

• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid. Example: P = 32, c = 2.]

2.5D Matrix Multiplication

• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid (axes i, j, k)

Initially P(i, j, 0) owns A(i, j) and B(i, j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)

(1) P(i, j, 0) broadcasts A(i, j) and B(i, j) to P(i, j, k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i, m)·B(m, j)
(3) Sum-reduce the partial sums Σm A(i, m)·B(m, j) along the k-axis, so that P(i, j, 0) owns C(i, j)
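A minimal serial sketch of the replication idea (each of the c layers handles a disjoint 1/c-th of the k-range, and summing the layers' partial products stands in for the sum-reduce along the k-axis; `c` here is just an illustrative parameter):

```python
import numpy as np

def matmul_25d_layers(A, B, c=2):
    """Split the inner (k) dimension into c chunks, one per replication layer.
    Each layer computes a partial product; summing the layers plays the role
    of the sum-reduce along the k-axis in the 2.5D algorithm."""
    n = A.shape[1]
    bounds = np.linspace(0, n, c + 1, dtype=int)
    partials = [A[:, lo:hi] @ B[lo:hi, :] for lo, hi in zip(bounds[:-1], bounds[1:])]
    return sum(partials)                      # the "reduction" step

A = np.random.rand(96, 96); B = np.random.rand(96, 96)
assert np.allclose(matmul_25d_layers(A, B, c=4), A @ B)
```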

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11 (SC'11 paper by Solomonik, Bhatele, D.)

[Plot annotations: 12x faster, 2.7x faster]

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with the minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word moved, per message of size m
  – T(cP) = (n^3/(cP)) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for the same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec for leakage, etc.
  – E(cP) = cP · { (n^3/(cP)) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm) if starting with 1 copy of the inputs
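A small sketch of the two model formulas as code (the machine parameters below are placeholders, not measurements; it just shows that, under the model, T scales perfectly as processors are added and E stays constant):

```python
def T_model(P, c, n, M, m, gamma_T, beta_T, alpha_T):
    """Modeled runtime on c*P processors with c-fold replication; M is per-processor memory."""
    return (n**3 / (c * P)) * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

def E_model(P, c, n, M, m, gamma_E, beta_E, alpha_E, delta_E, eps_E,
            gamma_T, beta_T, alpha_T):
    """Modeled energy: c*P processors, each spending flop/word/message joules,
    plus memory-capacity and leakage terms proportional to the runtime."""
    T = T_model(P, c, n, M, m, gamma_T, beta_T, alpha_T)
    per_proc = (n**3 / (c * P)) * (gamma_E + beta_E / M**0.5 + alpha_E / (m * M**0.5))
    return c * P * (per_proc + delta_E * M * T + eps_E * T)

# Placeholder parameters (illustrative only):
args = dict(n=2**14, M=2**26, m=2**13, gamma_T=1e-11, beta_T=1e-9, alpha_T=1e-6)
print(T_model(P=1024, c=1, **args) / T_model(P=1024, c=4, **args))   # -> 4.0: perfect strong scaling
```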

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of the P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so that Σi Fi = n^3, minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)

• Optimal algorithm for n x n matmul:
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i so that they add up to Fi flops (see the sketch below)
• Works for Strassen, other algorithms…
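A minimal sketch of the closed-form work split above (the per-processor parameters are made-up numbers, only to show how Fi is computed from ξi):

```python
def optimal_work_split(n, gammas, betas, alphas, Ms):
    """Return F_i = n^3 * (1/xi_i) / sum_j (1/xi_j): the flop counts that
    equalize the per-processor times T_i = F_i * xi_i, and the common time T."""
    xis = [g + b / M**0.5 + a / M**1.5 for g, b, a, M in zip(gammas, betas, alphas, Ms)]
    inv = [1.0 / x for x in xis]
    total_flops = float(n) ** 3
    F = [total_flops * w / sum(inv) for w in inv]
    T = total_flops / sum(inv)
    return F, T

# Made-up parameters for 3 heterogeneous processors:
F, T = optimal_work_split(n=4096,
                          gammas=[1e-11, 2e-11, 5e-11],
                          betas=[1e-9, 1e-9, 2e-9],
                          alphas=[1e-6, 1e-6, 1e-6],
                          Ms=[2**27, 2**26, 2**25])
print([f / sum(F) for f in F], T)   # work fractions and the common finish time
```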

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σm A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric.]
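For reference, the contraction in the figure written out in code (a plain dense einsum, ignoring the symmetry savings that CTF exploits):

```python
import numpy as np

# C(i,j,k) = sum_m A(i,j,m) * B(m,k), as a dense contraction
i, j, k, m = 8, 8, 8, 16
A = np.random.rand(i, j, m)
B = np.random.rand(m, k)
C = np.einsum('ijm,mk->ijk', A, B)
assert C.shape == (i, j, k)
```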

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

TSQR: QR of a Tall, Skinny matrix

45, 46

W = [ W0; W1; W2; W3 ]                        (W split into 4 row blocks)
  = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]    (local QR of each block)
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10 ] = Q01·R01,   [ R20; R30 ] = Q11·R11
so [ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
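A minimal sketch of the binary reduction tree above, using NumPy's QR for the local factorizations (it returns only the final R and checks it against a direct QR; assembling the implicit Q factors is omitted for brevity):

```python
import numpy as np

def tsqr_R(W, nblocks=4):
    """TSQR reduction tree for the R factor of a tall, skinny W:
    local QR of each row block, then pairwise QR of stacked R's."""
    Rs = [np.linalg.qr(Wi, mode='reduced')[1] for Wi in np.array_split(W, nblocks)]
    while len(Rs) > 1:                                   # binary reduction tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='reduced')[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.rand(10_000, 50)
R_tsqr = tsqr_R(W)
R_ref = np.linalg.qr(W, mode='reduced')[1]
# R is unique only up to the signs of its rows, so compare |R| entrywise
assert np.allclose(np.abs(R_tsqr), np.abs(R_ref), atol=1e-8)
```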

TSQR: An Architecture-Dependent Algorithm

Parallel:    W = [ W0; W1; W2; W3 ] → { R00, R10, R20, R30 } → { R01, R11 } → R02
             (binary reduction tree across processors)

Sequential:  W = [ W0; W1; W2; W3 ] → R00 → R01 → R02 → R03
             (flat tree: fold in one block at a time)

Dual Core:   W = [ W0; W1; W2; W3 ] → two flat trees, one per core, merged at the end

Can choose the reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core

TSQR Performance Results
• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

48

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"

49

W (n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1'; W2' ] = P12·L12·U12   → choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34   → choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234   → choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to the top, do LU without pivoting
• Extra work, but a lower-order term

• Thm: As numerically stable as Partial Pivoting on a larger matrix
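A rough sketch of one tournament round in Python (a flat gather rather than the binary tree drawn above, and a hand-rolled partial-pivoting pass instead of a tuned LU, purely to illustrate how candidate pivot rows are selected and then re-ranked):

```python
import numpy as np

def select_pivot_rows(W, b):
    """Run Gaussian elimination with partial pivoting on a copy of W and
    return the indices (into W) of the b rows chosen as pivots."""
    A = np.array(W, dtype=float)
    rows = np.arange(A.shape[0])
    chosen = []
    for k in range(b):
        p = k + np.argmax(np.abs(A[k:, k]))       # largest entry in column k
        A[[k, p]] = A[[p, k]]                     # swap rows k and p
        rows[[k, p]] = rows[[p, k]]
        chosen.append(rows[k])
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])   # eliminate below
    return np.array(chosen)

def tournament_pivot_rows(W, b, blocks=4):
    """One flat tournament: pick b candidates from each block of rows, gather
    the candidates, then pick the b winners from the gathered set."""
    cuts = np.array_split(np.arange(W.shape[0]), blocks)
    candidates = np.concatenate([idx[select_pivot_rows(W[idx], b)] for idx in cuts])
    return candidates[select_pivot_rows(W[candidates], b)]

W = np.random.rand(1000, 8)
print(sorted(tournament_pivot_rows(W, b=8)))   # global indices of chosen pivot rows
```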

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Plot: speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc). Up to 29x faster.]

Other CA algorithms
• Need for pivoting arises beyond LU, e.g. in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:        words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg 7) matmul:    words_moved = Ω( M·(n/M^(1/2))^(lg 7) / P )
Strassen-like O(n^ω) matmul:    words_moved = Ω( M·(n/M^(1/2))^ω / P )

vs

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice, how best to interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for the SVD of a band matrix
• Use alone, or as the second phase when A is dense
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA - SBR

[Figure: Conventional vs Communication-Avoiding SBR bulge-chasing.]

Many tuning parameters; the right choices reduce words_moved by a factor of M/bw, not just M^(1/2)

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder the outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), mink( A(i,k)+B(k,j) ) ) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21

61
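A small NumPy rendering of the semiring product used above (D = A*B meaning D(i,j) = min(D(i,j), min_k A(i,k)+B(k,j))), checked against plain Floyd-Warshall on a random weighted graph:

```python
import numpy as np

def minplus(D, A, B):
    """Semiring 'matmul': D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) )."""
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def floyd_warshall(A):
    D = A.copy()
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
    return D

n = 32
A = np.random.rand(n, n) * 10
np.fill_diagonal(A, 0.0)

# Repeated squaring in the min-plus semiring also yields all-pairs shortest paths
D = A.copy()
for _ in range(int(np.ceil(np.log2(n)))):
    D = minplus(D, D, D)
assert np.allclose(D, floyd_warshall(A))
```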

Performance of 2.5D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives the optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

63

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

64

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
    for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?

b x b matmul
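A runnable rendering of the blocked loop nest (here M is a stand-in for the fast-memory size in words, and b is chosen so that three b x b blocks fit):

```python
import numpy as np

def blocked_matmul(A, B, M=3 * 64 * 64):
    """Blocked matmul with block size b ~ sqrt(M/3), so one block each of
    A, B and C fits in a fast memory of M words."""
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

A = np.random.rand(200, 200); B = np.random.rand(200, 200)
assert np.allclose(blocked_matmul(A, B), A @ B)
```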

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record the array indices in a matrix Δ:

            i  j  k
       A  [ 1  0  1 ]
  Δ =  B  [ 0  1  1 ]
       C  [ 1  1  0 ]

• Solve LP for x = [xi, xj, xk]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T·x = 3/2 = e
• Thm: words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
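The little LP can be checked directly, e.g. with scipy (a sketch; linprog minimizes, so the objective is negated):

```python
import numpy as np
from scipy.optimize import linprog

# Rows of Delta = which loop indices (i, j, k) each array reference touches
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

# max 1^T x  s.t.  Delta x <= 1, x >= 0   (negate c because linprog minimizes)
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=[(0, None)] * 3)
print(res.x, -res.fun)   # -> [0.5 0.5 0.5], e = 1.5, so words_moved = Ω(n^3 / M^(e-1))
```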

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record the array indices in a matrix Δ:

               i  j
       F     [ 1  0 ]
  Δ =  P(i)  [ 1  0 ]
       P(j)  [ 0  1 ]

• Solve LP for x = [xi, xj]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [1, 1], 1^T·x = 2 = e
• Thm: words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record the array indices in a matrix Δ:

                i1 i2 i3 i4 i5 i6
       A1     [ 1  0  1  0  0  1 ]
       A2     [ 1  1  0  1  0  0 ]
  Δ =  A3     [ 0  1  1  0  1  0 ]
       A3,A4  [ 0  0  1  1  0  1 ]
       A5     [ 0  1  0  0  0  1 ]
       A6     [ 1  0  0  1  1  0 ]

• Solve LP for x = [x1, …, x6]^T: max 1^T·x s.t. Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = e
• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that

    words_moved = Ω( #iterations / M^(e-1) )

  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej·rank(φj(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S a subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:

    |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when the subscripts are subsets of the indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on the loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of the indices, the solution x of the dual LP gives the optimal tile sizes M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A

[Figure, built up over several animation frames: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32, showing the dependencies of each computed entry.]

• Sequential Algorithm: Steps 1, 2, 3, 4 sweep across the blocks of columns from left to right, one block at a time.

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  – Each processor communicates once with its neighbors
  – Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices
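A minimal sketch of the kernel's output (just the monomial basis via repeated SpMV with scipy.sparse; the communication-avoiding part — gathering each processor's overlapping trapezoid of dependencies up front — is not shown):

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    """Return the n x (k+1) matrix [x, Ax, A^2 x, ..., A^k x]."""
    V = [x]
    for _ in range(k):
        V.append(A @ V[-1])
    return np.column_stack(V)

n, k = 32, 3
A = sp.diags([np.ones(n - 1), 2 * np.ones(n), np.ones(n - 1)], [-1, 0, 1], format='csr')
x = np.random.rand(n)
W = matrix_powers(A, x, k)
print(W.shape)   # (32, 4)
```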

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

Simple block-row partitioning → (hyper)graph partitioning
Top-to-bottom processing → Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
    for i = 1 to k
      w = A · v(i-1)              … SpMV
      MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt
      update v(i), H
    endfor
    solve LSQ problem with H

Communication-avoiding GMRES:
    W = [ v, Av, A^2v, …, A^kv ]
    [Q, R] = TSQR(W)              … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• Oops – W from power method, precision lost!

92

"Monomial" basis [Ax, …, A^kx] fails to converge

A different polynomial basis [p1(A)x, …, pk(A)x] does converge

93
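As an assumed illustration of the basis change (not the tuned implementation): the Newton basis replaces A^j·x with products of shifted matrices, where in practice the shifts θ_j would be Ritz value estimates; the shifts below are stand-ins chosen only to show the recurrence.

```python
import numpy as np

def newton_basis(A, x, k, shifts):
    """Build [p_0(A)x, ..., p_k(A)x] with p_0 = 1 and
    p_{j+1}(A)x = (A - shifts[j] * I) p_j(A)x, a better-conditioned
    alternative to the monomial basis [x, Ax, ..., A^k x]."""
    V = [x]
    for j in range(k):
        V.append(A @ V[-1] - shifts[j] * V[-1])
    return np.column_stack(V)

n, k = 100, 10
A = np.diag(np.linspace(1, 10, n))            # toy matrix with known spectrum
x = np.random.rand(n)
theta = np.linspace(1, 10, k)                 # stand-in shifts (Ritz values in practice)
W_newton = newton_basis(A, x, k, theta)
W_monomial = newton_basis(A, x, k, np.zeros(k))
# the Newton basis is typically far better conditioned than the monomial basis
print(np.linalg.cond(W_monomial), np.linalg.cond(W_newton))
```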

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

                   Naive      Monomial                               Newton        Chebyshev
Replacement Its.   74 (1)     [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)  68 (1)

With Residual Replacement (RR), a la Van der Vorst and Ye

Tuning space for Krylov Methods

                                Indices
                                Explicit (O(nnz))       Implicit (o(nnz))
Nonzero    Explicit (O(nnz))    CSR and variations      Vision, climate, AMR, …
entries    Implicit (o(nnz))    Graph Laplacian         Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and TSQR(W) or W^TW or …
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
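A tiny demonstration of the underlying issue (summing the same data in two different orders gives different floating-point results; a fixed summation order, or an exactly rounded method such as math.fsum, restores reproducibility):

```python
import math
import random

random.seed(0)
data = [random.uniform(-1, 1) * 10**random.randint(-8, 8) for _ in range(100_000)]

s_forward = sum(data)
s_reverse = sum(reversed(data))
print(s_forward == s_reverse)      # often False: FP addition is nonassociative
print(s_forward - s_reverse)       # small but typically nonzero discrepancy
print(math.fsum(data) == math.fsum(reversed(data)))   # True: exact summation is order-independent
```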

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 19: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

19

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory for k = 1 to nb read block A(ik) into fast memory read block B(kj) into fast memory C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

20

Blocked (Tiled) Matrix Multiply

Consider ABC to be nb-by-nb matrices of b-by-b subblocks where b is called the block size assume 3 b-by-b blocks fit in fast memory for i = 1 to nb

for j = 1 to nb read block C(ij) into fast memory hellip b2 times (nb)2 = n2 reads for k = 1 to nb read block A(ik) into fast memory hellip b2 times (nb)3 = n3b reads read block B(kj) into fast memory hellip b2 times (nb)3 = n3b reads C(ij) = C(ij) + A(ik) B(kj) do a matrix multiply on blocks write block C(ij) back to slow memory hellip b2 times (nb)2 = n2 writes

= +

C(ij) C(ij) A(ik)

B(kj)b-by-bblock

2n3b + 2n2 readswrites ltlt 2n3 arithmetic - Faster

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

How hard is hand-tuning matmul anyway

22

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3 (figure: rows x, A·x, A²·x, A³·x over columns 1, 2, 3, 4, …, 32)
• Works for any "well-partitioned" A

• Sequential Algorithm: sweep the columns in four blocks, Step 1 through Step 4, so each block of A and x is read from slow memory only once

• Parallel Algorithm: columns split across Proc 1, Proc 2, Proc 3, Proc 4
  • Each processor communicates once with neighbors
  • Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
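To make the overlap idea concrete, here is a minimal NumPy sketch (an illustration, not the tutorial's actual Akx code): for a banded A, each block of rows of [Ax, A^2x, …, A^kx] is computed from a "ghost" region of x that is k·bandwidth rows wider than the block, so the block's data is touched once instead of k times. The function name, the halo parameter, and the dense storage of A are assumptions made for brevity:

  import numpy as np

  def matrix_powers_block(A, x, k, lo, hi, halo=1):
      # Compute rows lo:hi of [A@x, A^2@x, ..., A^k@x] reading only
      # x[ext_lo:ext_hi] and the matching square block of A (one pass).
      # halo = bandwidth of A (1 for tridiagonal).
      n = A.shape[0]
      ext_lo, ext_hi = max(0, lo - k * halo), min(n, hi + k * halo)
      v = x[ext_lo:ext_hi].copy()              # owned rows plus ghost rows
      out = []
      for j in range(1, k + 1):
          v = A[ext_lo:ext_hi, ext_lo:ext_hi] @ v
          # after j products, rows within j*halo of an interior block edge are
          # stale, but lo:hi stays inside the still-correct region because the
          # ghost width is k*halo (this is the "redundant computation" price)
          out.append(v[lo - ext_lo: hi - ext_lo].copy())
      return out                               # out[j-1] == (A^j @ x)[lo:hi]

  # check against the straightforward k-pass computation
  n, k = 32, 3
  A = np.diag(np.full(n, 2.0)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
  x = np.random.rand(n)
  blocks = matrix_powers_block(A, x, k, lo=8, hi=16)
  assert np.allclose(blocks[2], (np.linalg.matrix_power(A, 3) @ x)[8:16])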

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^kb} minimizing || Ax-b ||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt; update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A²v, …, A^kv ]
  [Q,R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!                          92

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge

93
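A quick numerical illustration of why the monomial basis loses precision while a better polynomial basis does not (a sketch: the diagonal test matrix and the Chebyshev-point shifts are assumptions; in CA-GMRES the shifts come from Ritz-value estimates):

  import numpy as np

  n, k = 64, 8
  A = np.diag(np.linspace(1.0, 100.0, n))      # SPD test matrix, known spectrum
  v = np.ones(n) / np.sqrt(n)

  # Monomial basis [v, Av, ..., A^k v]
  W_mono = np.column_stack([np.linalg.matrix_power(A, j) @ v for j in range(k + 1)])

  # Newton-like basis [v, (A-θ1 I)v, (A-θ2 I)(A-θ1 I)v, ...] with shifts spread
  # over the spectrum (Chebyshev points on [1, 100], an assumed choice)
  shifts = 50.5 + 49.5 * np.cos((2 * np.arange(k) + 1) * np.pi / (2 * k))
  cols = [v]
  for theta in shifts:
      cols.append(A @ cols[-1] - theta * cols[-1])
  W_newt = np.column_stack(cols)

  print("cond(monomial basis) =", np.linalg.cond(W_mono))   # grows like (λmax/λmin)^k
  print("cond(Newton basis)   =", np.linalg.cond(W_newt))   # far smaller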

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye:

                     Naive     Monomial                                Newton         Chebyshev
  Replacement Its.   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication:

                           Nonzero entries
  Indices                  Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))        CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))        Graph Laplacian        Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: With stencils (all implicit), all communication is for vectors

• Operations
  • [x, Ax, A²x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, A^kx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)^ky], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)^ky]
  • Return all vectors or just the last one

• Cotuning and/or interleaving
  • W = [x, Ax, A²x, …, A^kx] and {TSQR(W) or WᵀW or …}
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
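A three-line illustration of the nonassociativity that makes parallel sums irreproducible (the values are arbitrary):

  x, y, z = 0.1, 0.2, 0.3
  print((x + y) + z == x + (y + z))   # False
  print((x + y) + z, x + (y + z))     # 0.6000000000000001  0.6
  # A parallel reduction changes the grouping with the number of processors
  # and the shape of the reduction tree, hence the need for reproducible summation.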

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 20: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

20

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory

  for i = 1 to n/b
    for j = 1 to n/b
      read block C(i,j) into fast memory          … b² × (n/b)² = n² reads
      for k = 1 to n/b
        read block A(i,k) into fast memory        … b² × (n/b)³ = n³/b reads
        read block B(k,j) into fast memory        … b² × (n/b)³ = n³/b reads
        C(i,j) = C(i,j) + A(i,k) * B(k,j)         … do a matrix multiply on blocks
      write block C(i,j) back to slow memory      … b² × (n/b)² = n² writes

(figure: C(i,j) = C(i,j) + A(i,k) · B(k,j), with b-by-b blocks)

2n³/b + 2n² reads/writes << 2n³ arithmetic – Faster!
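For reference, a direct NumPy transcription of the blocked algorithm above (a sketch; the inner block product stands in for the "matrix multiply on blocks" done in fast memory, and ragged edge blocks are handled by the slicing):

  import numpy as np

  def blocked_matmul(A, B, b):
      # C = A @ B using b-by-b tiles; if 3*b*b floats fit in fast memory,
      # this moves O(n**3 / b) words instead of O(n**3).
      n = A.shape[0]
      C = np.zeros((n, n))
      for i in range(0, n, b):
          for j in range(0, n, b):
              for k in range(0, n, b):
                  C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
      return C

  n = 256
  A, B = np.random.rand(n, n), np.random.rand(n, n)
  assert np.allclose(blocked_matmul(A, B, 32), A @ B)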

Does blocked matmul attain lower bound?

• Recall: if 3 b-by-b blocks fit in fast memory of size M, then reads/writes = 2n³/b + 2n²
• Make b as large as possible: 3b² ≤ M, so reads/writes ≥ 2n³/(M/3)^(1/2) + 2n²
• Attains lower bound = Ω(#flops / M^(1/2))
• But what if we don't know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?

21

How hard is hand-tuning matmul, anyway?

22

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given "blocked" code to start with (7x faster than naïve)
• Still hard to get close to vendor tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results

How hard is hand-tuning matmul, anyway?

23

Recursive Matrix Multiplication (RMM) (1/2)

• For simplicity: square matrices with n = 2^m

• C = [C11 C12; C21 C22] = A · B = [A11 A12; A21 A22] · [B11 B12; B21 B22]
    = [A11·B11 + A12·B21   A11·B12 + A12·B22;
       A21·B11 + A22·B21   A21·B12 + A22·B22]

• True when each Aij etc. is 1x1 or n/2 x n/2

24

  func C = RMM(A, B, n)
    if n = 1, C = A * B, else
      C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
      C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
      C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
      C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
    return

Recursive Matrix Multiplication (RMM) (2/2)

25

  func C = RMM(A, B, n)
    if n = 1, C = A * B, else
      C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
      C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
      C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
      C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2)
    return

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8·A(n/2) + 4(n/2)² if n > 1, else 1
     = 2n³                          … same operations as usual, in different order

W(n) = # words moved between fast, slow memory by RMM(·, ·, n)
     = 8·W(n/2) + 12(n/2)² if 3n² > M, else 3n²
     = O(n³/M^(1/2) + n²)           … same as blocked matmul

"Cache oblivious": works for memory hierarchies, but not panacea
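The same recursion in runnable form (a sketch; the cutoff of 64 is an arbitrary base-case size, and n is assumed to be a power of 2):

  import numpy as np

  def rmm(A, B):
      # Recursive (cache-oblivious) matmul: same 2n^3 flops as usual,
      # words moved O(n^3/sqrt(M) + n^2) without knowing the cache size M.
      n = A.shape[0]
      if n <= 64:                        # base case
          return A @ B
      h = n // 2
      A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
      B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
      C = np.empty((n, n))
      C[:h, :h] = rmm(A11, B11) + rmm(A12, B21)
      C[:h, h:] = rmm(A11, B12) + rmm(A12, B22)
      C[h:, :h] = rmm(A21, B11) + rmm(A22, B21)
      C[h:, h:] = rmm(A21, B12) + rmm(A22, B22)
      return C

  n = 512
  A, B = np.random.rand(n, n), np.random.rand(n, n)
  assert np.allclose(rmm(A, B), A @ B)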

CARMA Performance: Shared Memory

Square: m = k = n  (plot: MKL vs CARMA, single and double precision, against single/double peak; problem size on a log scale, performance linear)

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

Inner Product: m = n = 64  (plot: MKL vs CARMA, single and double precision; k on a log scale, performance linear)

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster? L3 Cache Misses

Shared Memory Inner Product (m = n = 64; k = 524,288)

97% fewer misses; 86% fewer misses

Parallel MatMul with 2D Processor Layout

• P processors in P^(1/2) x P^(1/2) grid
  – Processors communicate along rows, columns
• Each processor owns n/P^(1/2) x n/P^(1/2) submatrices of A, B, C
• Example: P=16 processors, numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij, and Cij

(figure: the 4 x 4 grid P00 … P33, shown once each for C, A, B in C = A * B)

30

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    • Assume fast memory size M = O(n²/P) per processor – 1 copy of data
    • #words_moved = Ω(#flops / M^(1/2)) = Ω((n³/P) / (n²/P)^(1/2)) = Ω(n²/P^(1/2))
    • #messages    = Ω(#flops / M^(3/2)) = Ω((n³/P) / (n²/P)^(3/2)) = Ω(P^(1/2))
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn{96,100}.ps
• Comparison to Cannon's Algorithm
  – Cannon attains lower bound
  – But Cannon harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory

31

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

• C(i,j) is n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A(i,k) is n/P^(1/2) x b submatrix of A
• B(k,j) is b x n/P^(1/2) submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be square processor grid

(figure: block row i of A, block column j of B, block C(i,j))

32

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

  For k = 0 to n/b - 1
    for all i = 1 to P^(1/2)
      owner of A(i,k) broadcasts it to whole processor row (using binary tree)
    for all j = 1 to P^(1/2)
      owner of B(k,j) broadcasts it to whole processor column (using bin. tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol * Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^(1/2)
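A serial NumPy emulation of SUMMA's structure (no MPI here; Acol and Brow play the roles of the row and column broadcasts, and the panel width b is the knob behind the latency bound above):

  import numpy as np

  def summa_serial(A, B, b):
      # C is built as a sum of rank-b updates; in the parallel algorithm each
      # update is one row broadcast of A(:, k:k+b) and one column broadcast of
      # B(k:k+b, :), followed by a local C_myproc += Acol * Brow.
      n = A.shape[0]
      C = np.zeros((n, n))
      for k in range(0, n, b):
          Acol = A[:, k:k+b]
          Brow = B[k:k+b, :]
          C += Acol @ Brow
      return C

  n, b = 256, 16
  A, B = np.random.rand(n, n), np.random.rand(n, n)
  assert np.allclose(summa_serial(A, B, b), A @ B)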

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n²/P)
• Recall lower bounds:
  #words_moved = Ω( (n³/P) / M^(1/2) ) = Ω( n²/P^(1/2) )
  #messages    = Ω( (n³/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. Eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD

Can we do Better?

Can we do better?

• Aren't we already optimal?
• Why assume M = O(n²/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n²/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n²/P^(2/3)) is P^(1/3) times the minimum
• Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit cn²/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

(figure: the (P/c)^(1/2) x (P/c)^(1/2) x c processor grid)

Example: P = 32, c = 2

2.5D Matrix Multiplication

• Assume can fit cn²/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

12x faster

2.7x faster

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n³/(cP) · [ γT + βT/M^(1/2) + αT/(mM^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP · { n³/(cP) · [ γE + βE/M^(1/2) + αE/(mM^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
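To see the "perfect" part concretely, here is the timing model evaluated in a few lines (the γ, β, α values and the memory size are placeholders, not measurements):

  gamma_T, beta_T, alpha_T = 1e-11, 1e-9, 1e-6    # sec per flop, word, message (assumed)

  def T(n, P0, c, M, m=None):
      # T(c*P0) = n^3/(c*P0) * [gamma_T + beta_T/sqrt(M) + alpha_T/(m*sqrt(M))]
      m = M ** 0.5 if m is None else m
      return n**3 / (c * P0) * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

  n, M = 2**15, 2**24                  # matrix dimension, words per processor
  P0 = 3 * n**2 // M                   # minimal processor count: P0 * M = 3n^2
  print(T(n, P0, 4, M) / T(n, P0, 1, M))   # 0.25: 4x the processors, 1/4 the time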

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n³ and minimizing T = maxi Ti
  – Answer: Fi = n³·(1/ξi)/Σj(1/ξj) and T = n³/Σj(1/ξj)
• Optimal Algorithm for nxn matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
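The closed-form work split is easy to compute; a small sketch with made-up processor parameters (the numbers and the two-processor setup are assumptions):

  import numpy as np

  def split_work(n, gammas, betas, alphas, Ms):
      # xi_i = gamma_i + beta_i/sqrt(M_i) + alpha_i/M_i^(3/2)
      # F_i  = n^3 * (1/xi_i) / sum_j (1/xi_j),  T = n^3 / sum_j (1/xi_j)
      g, b, a, M = map(np.asarray, (gammas, betas, alphas, Ms))
      xi = g + b / np.sqrt(M) + a / M**1.5
      inv = 1.0 / xi
      F = n**3 * inv / inv.sum()
      T = n**3 / inv.sum()
      assert np.allclose(F * xi, T)          # every processor finishes together
      return F, T

  # two processor types: a fast one with more memory, a slower one with less
  F, T = split_work(4096, gammas=[1e-11, 4e-11], betas=[1e-9, 2e-9],
                    alphas=[1e-6, 1e-6], Ms=[2**26, 2**24])
  print(F, T)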

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)*B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

(figure: C(i,j,k) = Σm A(i,j,m)*B(m,k), with A 3-fold symm., B 2-fold symm., C 2-fold symm.)

TSQR: QR of a Tall, Skinny matrix

45 / 46

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
                       = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
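A compact NumPy sketch of the reduction just described (a sketch: only the final R is formed here; keeping the per-level Q factors, as the slide's output list does, is what lets you apply Q later):

  import numpy as np

  def tsqr(W, nblocks=4):
      # Local QR on each row block, then QR of stacked R factors, pairwise,
      # until one R remains (binary reduction tree).
      blocks = np.array_split(W, nblocks, axis=0)
      Rs = [np.linalg.qr(blk, mode='r') for blk in blocks]          # leaf QRs
      while len(Rs) > 1:                                            # tree levels
          Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
                for i in range(0, len(Rs), 2)]
      return Rs[0]

  W = np.random.rand(10000, 50)
  R_tsqr = tsqr(W)
  R_ref = np.linalg.qr(W, mode='r')
  # R is unique up to the sign of each row
  assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))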

TSQR: An Architecture-Dependent Algorithm

Parallel:    W = [W0; W1; W2; W3] → local QRs give R00, R10, R20, R30 → combine pairwise to R01, R11 → combine to R02

Sequential:  W = [W0; W1; W2; W3] → R00 → fold in W1 to get R01 → fold in W2 to get R02 → fold in W3 to get R03

Dual Core:   W = [W0; W1; W2; W3] → a hybrid of the two trees (two sequential chains, combined pairwise, ending in R03)

Can choose reduction tree dynamically

Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?

TSQR Performance Results

• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

48

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

49

W (n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1'; W2'; W3'; W4' ] = [ P12·L12·U12; P34·L34·U34 ]
  Choose b pivot rows, call them W12'
  Choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234
  Choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
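A small sketch of the pivot-row tournament, using LU with partial pivoting as the local "choose b pivot rows" step (SciPy's lu is assumed as the block factorization; the reduction-tree shape and the random test matrix are illustrative choices):

  import numpy as np
  from scipy.linalg import lu

  def pivot_rows(block, b):
      # Return the b rows of `block` selected as pivot rows by GEPP on the block.
      P, L, U = lu(block)                                  # block = P @ L @ U
      rows = [int(np.argmax(P[:, j])) for j in range(b)]   # rows moved to the top
      return block[rows, :]

  def tournament_pivoting(W, b, nblocks=4):
      # Pick b candidate rows per block, then play them off pairwise until
      # b winners remain; these become the pivot rows for LU of all of W.
      cands = [pivot_rows(blk, b) for blk in np.array_split(W, nblocks, axis=0)]
      while len(cands) > 1:
          cands = [pivot_rows(np.vstack(cands[i:i+2]), b)
                   for i in range(0, len(cands), 2)]
      return cands[0]

  W = np.random.rand(4096, 8)
  print(tournament_pivoting(W, b=8).shape)   # (8, 8): the selected pivot rows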

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

(plot: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc))

Up to 29x faster

Other CA algorithms

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A*P = Q*R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading rxr submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n²/P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:        words_moved = Ω( M·(n/M^(1/2))³ / P )
Strassen's O(n^lg7) matmul:    words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  vs.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

  CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQᵀ = T, where
  – A = Aᵀ is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!

Conventional vs CA - SBR

  Conventional:             touch all data 4 times
  Communication-Avoiding:   touch all data once

Many tuning parameters; right choices reduce #words_moved by factor M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), mink(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12]; [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊗ D12
    D21 = D21 ⊗ D11
    D22 = D21 ⊗ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊗ D21
    D12 = D12 ⊗ D22
    D11 = D12 ⊗ D21

61
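The same blocked Kleene recursion in runnable form, with ⊗ the min-plus product (a sketch for dense distance matrices with zero diagonal and n a power of 2; the elementwise min against the existing D22 and D11 blocks is kept explicitly, as in Kleene's algorithm, and the result is checked against Floyd-Warshall):

  import numpy as np

  def minplus(D1, D2):
      # (D1 ⊗ D2)[i,j] = min_k D1[i,k] + D2[k,j]
      return (D1[:, :, None] + D2[None, :, :]).min(axis=1)

  def dc_apsp(A):
      n = A.shape[0]
      if n == 1:
          return A.copy()
      h = n // 2
      D = A.copy()
      D[:h, :h] = dc_apsp(D[:h, :h])
      D[:h, h:] = minplus(D[:h, :h], D[:h, h:])
      D[h:, :h] = minplus(D[h:, :h], D[:h, :h])
      D[h:, h:] = np.minimum(D[h:, h:], minplus(D[h:, :h], D[:h, h:]))
      D[h:, h:] = dc_apsp(D[h:, h:])
      D[h:, :h] = minplus(D[h:, h:], D[h:, :h])
      D[:h, h:] = minplus(D[:h, h:], D[h:, h:])
      D[:h, :h] = np.minimum(D[:h, :h], minplus(D[:h, h:], D[h:, :h]))
      return D

  def floyd_warshall(A):
      D = A.copy()
      for k in range(D.shape[0]):
          D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
      return D

  n = 64
  A = np.random.rand(n, n) * 10
  np.fill_diagonal(A, 0.0)
  assert np.allclose(dc_apsp(A), floyd_warshall(A))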

Performance of 2.5D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

6.2x speedup

2x speedup

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: #Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

63

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω( min( dn/P^(1/2), d²n/P ) )
  – Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: #Words_moved = Ω( d²n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)            … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
    Δ =   [ 1  0  1 ]   A
          [ 0  1  1 ]   B
          [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = e
• Thm: #words_moved = Ω(n³/M^(e-1)) = Ω(n³/M^(1/2)). Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
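The LP is tiny and can be checked directly (a sketch using scipy.optimize.linprog; linprog minimizes, so the objective is negated):

  import numpy as np
  from scipy.optimize import linprog

  Delta = np.array([[1, 0, 1],     # A(i,k) touches indices i, k
                    [0, 1, 1],     # B(k,j) touches indices j, k
                    [1, 1, 0]])    # C(i,j) touches indices i, j
  res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
  print(res.x, res.x.sum())        # [0.5 0.5 0.5], e = 1.5 -> Ω(n³/M^(e-1))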

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
    Δ =   [ 1  0 ]   F
          [ 1  0 ]   P(i)
          [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = e
• Thm: #words_moved = Ω(n²/M^(e-1)) = Ω(n²/M¹). Attained by block sizes M^xi, M^xj = M¹, M¹

N-Body Speedups on IBM-BG/P (Intrepid): 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
    Δ =   [ 1  0  1  0  0  1 ]   A1
          [ 1  1  0  1  0  0 ]   A2
          [ 0  1  1  0  1  0 ]   A3
          [ 0  0  1  1  0  1 ]   A3, A4
          [ 0  1  0  0  0  1 ]   A5
          [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = e
• Thm: #words_moved = Ω(n⁶/M^(e-1)) = Ω(n⁶/M^(8/7)). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z³, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k,
    access locations indexed by "projections", e.g.
      φC(i1,i2,…,ik) = (i1+2*i3-i7)
      φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S given bounds on #points in its images φC(S), φA(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that
    #words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej·rank(φj(H)) for all subgroups H < Z^k
• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
  • Given S ⊆ Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:
    |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has same solution, is decidable (but expensive so far)
  – Thm (better news): Easy to write LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Page 21: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Does blocked matmul attain lower boundbull Recall if 3 b-by-b blocks fit in fast memory of

size M then readswrites = 2n3b + 2n2

bull Make b as large as possible 3b2 le M so readswrites ge 2n3(M3)12 + 2n2

bull Attains lower bound = Ω (flops M12 )

bull But what if we donrsquot know M bull Or if there are multiple levels of fast memorybull How do we write the algorithm

21

How hard is hand-tuning matmul anyway

22

bull Results of 22 student teams trying to tune matrix-multiply in CS267 Spr09bull Students given ldquoblockedrdquo code to start with (7x faster than naiumlve)

bull Still hard to get close to vendor tuned performance (ACML) (another 6x)bull For more discussion see wwwcsberkeleyedu~volkovcs267sp09hw1results

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

Many tuning parameters; the right choices reduce words_moved by a factor of M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies OK, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
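A small runnable sketch of the recursion above over the (min,+) semiring, checked against the classical Floyd-Warshall loop; it assumes a dense nonnegative weight matrix with zero diagonal and only shows the data flow, not the 2.5D parallelization.

```python
import numpy as np

def minplus(C, A, B):
    # C(i,j) = min( C(i,j), min_k A(i,k) + B(k,j) ): the (min,+) "multiply"
    return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(D):
    # Divide-and-conquer APSP (Kleene's algorithm); D modified in place.
    n = D.shape[0]
    if n == 1:
        return D
    h = n // 2
    D11, D12 = D[:h, :h], D[:h, h:]
    D21, D22 = D[h:, :h], D[h:, h:]
    dc_apsp(D11)
    D12[:] = minplus(D12, D11, D12)
    D21[:] = minplus(D21, D21, D11)
    D22[:] = minplus(D22, D21, D12)
    dc_apsp(D22)
    D21[:] = minplus(D21, D22, D21)
    D12[:] = minplus(D12, D12, D22)
    D11[:] = minplus(D11, D12, D21)
    return D

# check against the classical Floyd-Warshall triple loop
rng = np.random.default_rng(0)
n = 8
W = rng.uniform(1.0, 10.0, (n, n))
np.fill_diagonal(W, 0.0)
D = dc_apsp(W.copy())
R = W.copy()
for k in range(n):
    R = np.minimum(R, R[:, k:k + 1] + R[k:k + 1, :])
assert np.allclose(D, R)
```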

Performance of 2.5D APSP using Kleene

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Figure: strong-scaling plot annotated with a 6.2x speedup and a 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-paths, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naive code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)        … innermost work is a b x b matmul

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
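A runnable NumPy version of the blocked loop nest above (tile-level multiplies written with @ for brevity); the name blocked_matmul is just for this sketch.

```python
import numpy as np

def blocked_matmul(A, B, b):
    # Tiled C = A @ B; with b ~ (M/3)^(1/2) the three b-by-b tiles fit in fast
    # memory of size M words, attaining words_moved = Theta(n^3 / M^(1/2)).
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((96, 96)), rng.standard_normal((96, 96))
assert np.allclose(blocked_matmul(A, B, 32), A @ B)
```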

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record the array indices in matrix Δ:

          i  j  k
    Δ = [ 1  0  1 ]   A
        [ 0  1  1 ]   B
        [ 1  1  0 ]   C

• Solve the LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: words_moved = Ω(n^3 / M^(e-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
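The LP on this slide is small enough to check directly; a hedged example using SciPy's linprog (an assumption for illustration, not part of the original material):

```python
import numpy as np
from scipy.optimize import linprog

Delta = np.array([[1, 0, 1],    # A is indexed by (i, k)
                  [0, 1, 1],    # B is indexed by (k, j)
                  [1, 1, 0]])   # C is indexed by (i, j)
# maximize 1^T x  s.t.  Delta x <= 1, x >= 0   (linprog minimizes, so negate)
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x, -res.fun)          # expected: [0.5 0.5 0.5] and e = 1.5
```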

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record the array indices in matrix Δ:

          i  j
    Δ = [ 1  0 ]   F
        [ 1  0 ]   P(i)
        [ 0  1 ]   P(j)

• Solve the LP for x = [xi, xj]^T: max 1^T x s.t. Δ·x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: words_moved = Ω(n^2 / M^(e-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
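A sketch of the blocking this result suggests (both loops tiled, block size on the order of the fast memory M); force, and the assumption force(p, p) == 0, are placeholders for illustration.

```python
import numpy as np

def blocked_nbody(P, b, force):
    # Tile both loops with block size b ~ M: each pair of tiles is loaded once
    # and reused b times, moving Theta(n^2 / M) words in total.
    n = P.shape[0]
    F = np.zeros_like(P)
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            Pi, Pj = P[i0:i0 + b], P[j0:j0 + b]   # two tiles in fast memory
            for ii, p in enumerate(Pi):
                F[i0 + ii] += sum(force(p, q) for q in Pj)
    return F
```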

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record the array indices in matrix Δ:

          i1 i2 i3 i4 i5 i6
    Δ = [ 1  0  1  0  0  1 ]   A1
        [ 1  1  0  1  0  0 ]   A2
        [ 0  1  1  0  1  0 ]   A3
        [ 0  0  1  1  0  1 ]   A3, A4
        [ 0  1  0  0  0  1 ]   A5
        [ 1  0  0  1  1  0 ]   A6

• Solve the LP for x = [x1, …, x6]^T: max 1^T x s.t. Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: words_moved = Ω(n^6 / M^(e-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3,
     access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array references given by projections φ_j, there is an e ≥ 1 such that
    words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σ_j e_j subject to
    rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
  given S ⊆ Z^k and group homomorphisms φ_1, φ_2, …, bound |S| in terms of |φ_1(S)|, |φ_2(S)|, …, |φ_m(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s_1, …, s_m satisfying the rank conditions above,
    |S| ≤ Π_j |φ_j(S)|^(s_j)
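As a worked instance (my addition, not on the slide): specializing to matmul, where φ_A(i,j,k) = (i,k), φ_B(i,j,k) = (k,j), φ_C(i,j,k) = (i,j), the exponents s_A = s_B = s_C = 1/2 satisfy the rank conditions and the theorem reduces to the Loomis–Whitney inequality:

```latex
|S| \;\le\; |\phi_A(S)|^{1/2}\, |\phi_B(S)|^{1/2}\, |\phi_C(S)|^{1/2}
```

Since each image has size O(M) in any segment of the execution that moves O(M) words, |S| = O(M^(3/2)) per segment, which recovers words_moved = Ω(n^3 / M^(1/2)) as on the matmul slide above.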

Is this bound attainable? (1/2)

• But first: can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many
  – Thm (bad news): Writing down all the inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g., when the subscripts are subsets of the indices)
  – Also easy to get upper/lower bounds on e
    • Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on the loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of the indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
  [Figure: entries 1, 2, 3, 4, …, 32 of x, A·x, A^2·x, A^3·x, filled in level by level]
• Works for any "well-partitioned" A

• Sequential Algorithm: Steps 1, 2, 3, 4 – process one block at a time, computing its entries of all k vectors before moving on
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  – Each processor communicates once with its neighbors
  – Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
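A serial sketch of the per-processor computation in the parallel algorithm above, for the tridiagonal example (A stored densely here only for brevity; in practice only the needed sparse block row is fetched). The function name and interface are illustrative assumptions.

```python
import numpy as np

def matrix_powers_block(A, x, k, lo, hi):
    # One processor's share of [Ax, A^2x, ..., A^kx] for a tridiagonal A:
    # read the owned block of x plus k ghost entries per side once, then
    # compute locally with no further communication (the overlapping trapezoid).
    n = len(x)
    g_lo, g_hi = max(0, lo - k), min(n, hi + k)
    v = x[g_lo:g_hi].copy()
    A_loc = A[g_lo:g_hi, g_lo:g_hi]     # in practice a sparse block, fetched once
    out = []
    for _ in range(k):
        v = A_loc @ v                   # local (Sp)MV on the widened block
        out.append(v[lo - g_lo: hi - g_lo].copy())
    return out                          # out[j] equals (A^(j+1) x)[lo:hi]

# check one processor's trapezoid against the global computation (n=32, k=3)
n, k = 32, 3
rng = np.random.default_rng(1)
A = np.diag(rng.standard_normal(n)) \
    + np.diag(rng.standard_normal(n - 1), 1) \
    + np.diag(rng.standard_normal(n - 1), -1)
x = rng.standard_normal(n)
lo, hi = 8, 16                          # block owned by "Proc 2"
local = matrix_powers_block(A, x, k, lo, hi)
ref = x.copy()
for j in range(k):
    ref = A @ ref
    assert np.allclose(local[j], ref[lo:hi])
```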

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt: update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• Oops – W from the power method, precision lost!

"Monomial" basis [Ax, …, A^kx] fails to converge

A different polynomial basis [p1(A)x, …, pk(A)x] does converge
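A small illustration (my addition) of why the monomial basis loses precision: its columns line up with the dominant eigenvector, so the basis matrix W rapidly becomes ill-conditioned and an orthogonal factor can no longer be recovered accurately in double precision.

```python
import numpy as np

n, kmax = 1000, 12
rng = np.random.default_rng(0)
A = np.diag(np.logspace(0, 4, n))            # simple SPD test matrix
v = rng.standard_normal(n)
W = [v / np.linalg.norm(v)]
for k in range(1, kmax + 1):
    w = A @ W[-1]
    W.append(w / np.linalg.norm(w))          # next (normalized) monomial vector
    print(k, np.linalg.cond(np.column_stack(W)))   # grows rapidly with k
```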

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab, with Residual Replacement (RR) a la Van der Vorst and Ye

  Basis:             Naive     Monomial                              Newton        Chebyshev
  Replacement Its:   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)  [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication:

                                      Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit (O(nnz)):  CSR and variations           Vision, climate, AMR, …
  Nonzero entries implicit (o(nnz)):  Graph Laplacian              Stencils

  – Explicit indices or nonzero entries cause most of the communication, along with the vectors
  – Ex: With stencils (all implicit), all communication is for the vectors

• Operations:
  – [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^kx] and [y, A^T y, (A^T)^2 y, …, (A^T)^k y], or [y, A^T A y, (A^T A)^2 y, …, (A^T A)^k y]
  – Return all vectors or just the last one
• Cotuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^T W or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so the summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1: independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
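Two tiny illustrations (my addition) of the nonassociativity mentioned above: regrouping three doubles changes the result, and summing the same list in a different order can change the last bits.

```python
import random

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)        # 0.6000000000000001
print(a + (b + c))        # 0.6

xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
s1 = sum(xs)
random.shuffle(xs)
s2 = sum(xs)
print(s1 == s2)           # usually False: different order, different rounding
```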

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

Page 23: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

How hard is hand-tuning matmul anyway

23

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

• Assume nxn matrices on P processors
• Minimum Memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. Eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, Sym. and Nonsym. eigenproblems, SVD

Can we do Better

• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n^2/P^(2/3))
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
    • Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
    • Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Diagram: (P/c)^(1/2) x (P/c)^(1/2) x c processor grid. Example: P = 32, c = 2.]

2.5D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

12x faster

27x faster

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: PM = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ]
          = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP · { n^3/(cP) [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) }
          = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
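
A toy evaluation of the timing model above (the constants and the choice m = M are made-up placeholders, not measured machine parameters):

    # T(cP) = n^3/(cP) [ gamma_T + beta_T/M^(1/2) + alpha_T/(m*M^(1/2)) ] = T(P)/c
    gamma_T, beta_T, alpha_T = 1e-11, 1e-9, 1e-6     # secs per flop, per word, per message
    n, P = 10**4, 64
    M = 3 * n**2 / P                                 # memory per processor: P*M = 3n^2
    m = M                                            # assume maximal message size (an assumption)

    def T(c):
        return (n**3 / (c * P)) * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

    print(T(1) / T(2), T(1) / T(4))                  # -> 2.0 4.0: perfect strong scaling in time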

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi γi + Fi βi / Mi^(1/2) + Fi αi / Mi^(3/2) = Fi [ γi + βi / Mi^(1/2) + αi / Mi^(3/2) ] = Fi ξi
  – Choose Fi so Σi Fi = n^3, minimizing T = maxi Ti
  – Answer: Fi = n^3 (1/ξi) / Σj (1/ξj) and T = n^3 / Σj (1/ξj)
• Optimal Algorithm for nxn matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
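
A small NumPy sketch of the optimal work assignment above, with made-up parameters for three unequal processors (the values of γi, βi, αi, Mi here are placeholders, not real machine data):

    import numpy as np

    n = 1024
    gamma = np.array([1e-11, 2e-11, 4e-11])   # secs/flop
    beta  = np.array([1e-9,  1e-9,  2e-9])    # secs/word
    alpha = np.array([1e-6,  2e-6,  1e-6])    # secs/message
    M     = np.array([1e6,   5e5,   1e6])     # fast memory sizes

    xi = gamma + beta / np.sqrt(M) + alpha / M**1.5   # effective secs per flop on processor i
    F  = n**3 * (1.0 / xi) / np.sum(1.0 / xi)         # flops assigned to processor i
    T  = n**3 / np.sum(1.0 / xi)                      # resulting runtime
    assert np.allclose(F * xi, T)                     # every processor finishes at the same time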

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n) * B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Diagram: C(i,j,k) = Σm A(i,j,m) * B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric.]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n) * B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

TSQR QR of a Tall Skinny matrix

45

W = [ W0 ]   [ Q00·R00 ]   [ Q00                ]   [ R00 ]
    [ W1 ] = [ Q10·R10 ] = [      Q10           ] · [ R10 ]
    [ W2 ]   [ Q20·R20 ]   [           Q20      ]   [ R20 ]
    [ W3 ]   [ Q30·R30 ]   [                Q30 ]   [ R30 ]

[ R00 ]   [ Q01·R01 ]   [ Q01      ]   [ R01 ]
[ R10 ] = [         ] = [          ] · [     ]
[ R20 ]   [ Q11·R11 ]   [      Q11 ]   [ R11 ]
[ R30 ]

[ R01 ] = Q02·R02
[ R11 ]

TSQR QR of a Tall Skinny matrix

46

W = [ W0 ]   [ Q00·R00 ]   [ Q00                ]   [ R00 ]
    [ W1 ] = [ Q10·R10 ] = [      Q10           ] · [ R10 ]
    [ W2 ]   [ Q20·R20 ]   [           Q20      ]   [ R20 ]
    [ W3 ]   [ Q30·R30 ]   [                Q30 ]   [ R30 ]

[ R00 ]   [ Q01·R01 ]   [ Q01      ]   [ R01 ]
[ R10 ] = [         ] = [          ] · [     ]
[ R20 ]   [ Q11·R11 ]   [      Q11 ]   [ R11 ]
[ R30 ]

[ R01 ] = Q02·R02
[ R11 ]

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
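
A NumPy sketch of the R-factor reduction above (illustrative: it splits W into 4 block rows, takes local QRs, and combines the R factors pairwise up a binary tree; a full TSQR would also keep the small Q factors in order to reconstruct Q):

    import numpy as np

    def tsqr_R(W, nblocks=4):
        Rs = [np.linalg.qr(Wi)[1] for Wi in np.array_split(W, nblocks)]   # R00, R10, R20, R30
        while len(Rs) > 1:                                                # R01, R11 -> R02
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1] for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(4000, 10)
    R1 = tsqr_R(W)
    R2 = np.linalg.qr(W)[1]
    assert np.allclose(np.abs(R1), np.abs(R2))   # R is unique up to the signs of its rows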

TSQR An Architecture-Dependent Algorithm

[Figure: TSQR reduction trees]
  Parallel: W = [W0; W1; W2; W3] → local QRs give R00, R10, R20, R30 → pairwise combines give R01, R11 → final R02
  Sequential: W = [W0; W1; W2; W3] → R00 → R01 → R02 → R03, folding in one block at a time
  Dual Core: a hybrid of the two trees (R00, R01, R11, R02, R11, R03)

Can choose reduction tree dynamically

Multicore / Multisocket / Multirack / Multisite / Out-of-core

TSQR Performance Results
• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

49

W(nxb) = [ W1 ]   [ P1·L1·U1 ]     Choose b pivot rows of W1, call them W1'
         [ W2 ] = [ P2·L2·U2 ]     Choose b pivot rows of W2, call them W2'
         [ W3 ]   [ P3·L3·U3 ]     Choose b pivot rows of W3, call them W3'
         [ W4 ]   [ P4·L4·U4 ]     Choose b pivot rows of W4, call them W4'

[ W1' ]   [ P12·L12·U12 ]          Choose b pivot rows, call them W12'
[ W2' ] = [             ]
[ W3' ]   [ P34·L34·U34 ]          Choose b pivot rows, call them W34'
[ W4' ]

[ W12' ] = P1234·L1234·U1234       Choose b pivot rows
[ W34' ]

• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
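
A sketch of tournament pivoting using SciPy's LU (the helper names here are ours, not tutorial code): each block of rows nominates b candidate pivot rows via ordinary partial pivoting, and candidates play off pairwise up the reduction tree until b winners remain.

    import numpy as np
    from scipy.linalg import lu

    def gepp_pivot_rows(block, b):
        # Rows that partial-pivoting LU would move into the top b positions of this block.
        P, L, U = lu(block)                    # block = P @ L @ U
        perm = np.argmax(P, axis=0)            # row perm[j] of block becomes row j of L @ U
        return perm[:b]

    def tslu_pivots(W, b, nblocks=4):
        cands = [idx[gepp_pivot_rows(W[idx], b)]
                 for idx in np.array_split(np.arange(W.shape[0]), nblocks)]
        while len(cands) > 1:                  # pairwise "playoffs" up the reduction tree
            nxt = []
            for i in range(0, len(cands), 2):
                idx = np.concatenate(cands[i:i+2])
                nxt.append(idx[gepp_pivot_rows(W[idx], b)])
            cands = nxt
        return cands[0]                        # b pivot rows to move to the top of W

    W = np.random.rand(1000, 8)
    rows = tslu_pivots(W, b=8)                 # then do LU without pivoting on the permuted W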

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heat map: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc). Up to 29x faster.]

Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading rxr submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^3 / P )

Strassen's O(n^lg7) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )

Strassen-like O(n^ω) matmul:
  #words_moved = Ω( M·(n/M^(1/2))^ω / P )

vs

Runs all 7 multiplies in parallel
Each on P/7 processors
Needs 7/4 as much memory

Runs all 7 multiplies sequentially
Each on all P processors
Needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step, else DFS step, end if

Communication Avoiding Parallel Strassen (CAPS)

In practice, how to best interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA - SBR

[Figure: Conventional (touches all data 4 times) vs Communication-Avoiding (touches all data once).]

Many tuning parameters; right choices reduce #words_moved by a factor of M/bw, not just M^(1/2)


Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), mink( A(i,k)+B(k,j) ) ) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 * D12
    D21 = D21 * D11
    D22 = D21 * D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 * D21
    D12 = D12 * D22
    D11 = D12 * D21

61
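
A NumPy sketch of the min-plus product and DC-APSP above (illustrative; n is assumed even at every level, and ∞ marks a missing edge):

    import numpy as np

    def mp(D, A, B):
        # D = "A*B": D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) )
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return A.copy()
        h = n // 2
        D11, D12 = A[:h, :h].copy(), A[:h, h:].copy()
        D21, D22 = A[h:, :h].copy(), A[h:, h:].copy()
        D11 = dc_apsp(D11)
        D12 = mp(D12, D11, D12); D21 = mp(D21, D21, D11); D22 = mp(D22, D21, D12)
        D22 = dc_apsp(D22)
        D21 = mp(D21, D22, D21); D12 = mp(D12, D12, D22); D11 = mp(D11, D12, D21)
        return np.block([[D11, D12], [D21, D22]])

    # check against Floyd-Warshall on a small random graph
    n = 8
    A = np.where(np.random.rand(n, n) < 0.5, np.random.rand(n, n) * 10, np.inf)
    np.fill_diagonal(A, 0)
    D = A.copy()
    for k in range(n):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    assert np.allclose(dc_apsp(A), D)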

Performance of 2.5D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

63

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both are Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: #Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)            … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: #words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
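
The blocked loop nest above, as a runnable NumPy sketch (b plays the role of M^(1/2); block sizes that don't divide n evenly are handled by slicing):

    import numpy as np

    def blocked_matmul(A, B, b):
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):          # one b x b matmul per (i1, j1, k1)
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    n, b = 12, 4
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, b), A @ B)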

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

          i  j  k
        [ 1  0  1 ]   A
    Δ = [ 0  1  1 ]   B
        [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: #words_moved = Ω(n^3 / M^(e-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
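
The LP above is small enough to solve directly; a sketch with SciPy's linprog (maximizing 1^T x by minimizing -1^T x) recovers e = 3/2:

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],     # A(i,k)
                      [0, 1, 1],     # B(k,j)
                      [1, 1, 0]])    # C(i,j)
    res = linprog(c=[-1, -1, -1], A_ub=Delta, b_ub=[1, 1, 1], bounds=(0, None))
    print(res.x, -res.fun)           # [0.5 0.5 0.5] 1.5  ->  e = 3/2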

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

          i  j
        [ 1  0 ]   F
    Δ = [ 1  0 ]   P(i)
        [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: #words_moved = Ω(n^2 / M^(e-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
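
A NumPy sketch of the blocked n-body loop the theorem suggests (the 1/d "force" is a toy stand-in for force(P(i), P(j)); blocks of size b ~ M are reused for all b^2 interactions):

    import numpy as np

    def nbody_blocked(P, b):
        n = len(P)
        F = np.zeros(n)
        for i0 in range(0, n, b):
            for j0 in range(0, n, b):
                d = P[i0:i0+b, None] - P[None, j0:j0+b]
                with np.errstate(divide='ignore'):
                    f = np.where(d != 0.0, 1.0 / d, 0.0)   # skip self-interactions
                F[i0:i0+b] += f.sum(axis=1)
        return F

    P = np.random.rand(64)
    assert np.allclose(nbody_blocked(P, 8), nbody_blocked(P, 64))   # same answer for any block size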

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

          i1 i2 i3 i4 i5 i6
        [ 1  0  1  0  0  1 ]   A1
        [ 1  1  0  1  0  0 ]   A2
    Δ = [ 0  1  1  0  1  0 ]   A3
        [ 0  0  1  1  0  1 ]   A3, A4
        [ 0  1  0  0  0  1 ]   A5
        [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: #words_moved = Ω(n^6 / M^(e-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
       Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Z^k
       Access locations indexed by "projections", e.g.
         φC(i1,i2,…,ik) = (i1+2*i3-i7)
         φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φC(S), φA(S), … ?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, there is an e ≥ 1 such that
    #words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej · rank(φj(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
  • Given S ⊆ Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:
    |S| ≤ Πj |φj(S)|^sj

Is this bound attainable (12)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable (22)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure, repeated across several animation slides: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32, showing which entries each computed entry depends on.]

• Sequential Algorithm: Steps 1, 2, 3, 4 – process one block of columns at a time

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  • Each processor communicates once with neighbors
  • Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
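
A NumPy sketch of the parallel matrix powers kernel for the tridiagonal example above (illustrative: each "processor" grabs k ghost rows of x on each side in one step, then computes its piece of x, Ax, …, A^kx with no further communication, at the price of some redundant flops near the block edges):

    import numpy as np

    def matrix_powers(A, x, k, nprocs=4):
        n = len(x)
        V = np.zeros((k + 1, n))                       # rows: x, Ax, A^2x, ..., A^kx
        V[0] = x
        s = n // nprocs
        for p in range(nprocs):
            lo, hi = p * s, (p + 1) * s
            glo, ghi = max(0, lo - k), min(n, hi + k)  # ghost (overlap) region: one message
            xg, Ag = x[glo:ghi].copy(), A[glo:ghi, glo:ghi]
            for j in range(1, k + 1):
                xg = Ag @ xg                           # redundant work near the edges
                V[j, lo:hi] = xg[lo - glo:hi - glo]
        return V

    n, k = 32, 3
    A = (np.diag(np.random.rand(n)) + np.diag(np.random.rand(n - 1), 1)
         + np.diag(np.random.rand(n - 1), -1))         # tridiagonal, n = 32
    x = np.random.rand(n)
    V = matrix_powers(A, x, k)
    assert np.allclose(V[3], np.linalg.matrix_power(A, 3) @ x)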

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax - b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                  … SpMV
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                   … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

92

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

                    Naive      Monomial                               Newton         Chebyshev
Replacement Its.    74 (1)     [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)

With Residual Replacement (RR) à la Van der Vorst and Ye

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors

                                      Indices
                                      Explicit (O(nnz))      Implicit (o(nnz))
  Nonzero   Explicit (O(nnz))         CSR and variations     Vision, climate, AMR, …
  entries   Implicit (o(nnz))         Graph Laplacian        Stencils

• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
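
A two-line illustration of the nonassociativity that ReproBLAS is designed to hide (runs in any Python interpreter):

    x = [1.0, 1e16, -1e16]
    print(sum(x))                  # left to right: (1.0 + 1e16) - 1e16 == 0.0
    print(x[0] + (x[1] + x[2]))    # regrouped:      1.0 + (1e16 - 1e16) == 1.0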

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 24: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Recursive Matrix Multiplication (RMM) (12)bull For simplicity square matrices with n = 2m

bull C = = A middot B = middot middot

=

bull True when each Aij etc 1x1 or n2 x n2

24

A11 A12

A21 A22

B11 B12

B21 B22

C11 C12

C21 C22

A11middotB11 + A12middotB21 A11middotB12 + A12middotB22

A21middotB11 + A22middotB21 A21middotB12 + A22middotB22

func C = RMM (A B n) if n = 1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

Recursive Matrix Multiplication (RMM) (22)

25

func C = RMM (A B n) if n=1 C = A B else C11 = RMM (A11 B11 n2) + RMM (A12 B21 n2) C12 = RMM (A11 B12 n2) + RMM (A12 B22 n2) C21 = RMM (A21 B11 n2) + RMM (A22 B21 n2) C22 = RMM (A21 B12 n2) + RMM (A22 B22 n2) return

A(n) = arithmetic operations in RMM( n)

= 8 middot A(n2) + 4(n2)2 if n gt 1 else 1

= 2n3 hellip same operations as usual in different order

W(n) = words moved between fast slow memory by RMM( n)

= 8 middot W(n2) + 12(n2)2 if 3n2 gt M else 3n2

= O( n3 M12 + n2 ) hellip same as blocked matmul

ldquoCache obliviousrdquo works for memory hierarchies but not panacea

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye: replacement iterations (count in parentheses), by basis:

  Naive:      74 (1)
  Monomial:   [7, 15, 24, 31, …, 92, 97, 103]  (17)
  Newton:     [67, 98]  (2)
  Chebyshev:  68 (1)

Tuning space for Krylov Methods

                                Nonzero entries
                                Explicit (O(nnz))      Implicit (o(nnz))
  Indices  Explicit (O(nnz))    CSR and variations     Vision, climate, AMR, …
           Implicit (o(nnz))    Graph Laplacian        Stencils

• Classification of sparse operators for avoiding communication
• Explicit indices or nonzero entries cause most communication, along with the vectors
• Ex: with stencils (all implicit), all communication is for the vectors (see the sketch after this list)

• Operations:
  – [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  – Return all vectors or just the last one
• Co-tuning and/or interleaving:
  – W = [x, Ax, A²x, …, Aᵏx] and TSQR(W), or WᵀW, or …
  – Ditto, but throw away W
• Preconditioned versions
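To make the explicit/implicit classification concrete, here is the same 1D Laplacian applied both ways in a small NumPy/SciPy sketch (not part of the tutorial's software): stored explicitly as CSR, its indices and values occupy O(nnz) memory and must be read and communicated; applied implicitly as a stencil, only the vector moves.

```python
import numpy as np
from scipy.sparse import diags

n = 8
x = np.arange(1.0, n + 1.0)

# Explicit indices and values: CSR storage, O(nnz) data to read/communicate.
A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n)).tocsr()
y_explicit = A @ x

# Implicit indices and values: the same operator as a matrix-free stencil,
# o(nnz) storage -- all remaining communication is for the vectors.
def laplacian_stencil(x):
    y = 2.0 * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return y

assert np.allclose(y_explicit, laplacian_stencil(x))
```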

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100-page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible (illustrated below)
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
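The nonassociativity claim is easy to check directly; the snippet below is plain Python (not the ReproBLAS) and simply shows that reordering the additions changes the computed sum, which is why a sum's value can depend on the number of processors, the data layout, or the reduction tree:

```python
import random

# Floating point addition is not associative:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))        # False: 0.6000000000000001 vs 0.6

# So the result of a reduction depends on the summation order:
random.seed(0)
xs = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
      for _ in range(100_000)]
print(sum(xs) == sum(reversed(xs)))      # typically False
print(abs(sum(xs) - sum(reversed(xs))))  # small but (typically) nonzero
```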

Summary

Don't Communic…


Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

Page 26: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n,
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
         [  1  0  1  0  0  1 ]   A1
         [  1  1  0  1  0  0 ]   A2
    Δ =  [  0  1  1  0  1  0 ]   A3
         [  0  0  1  1  0  1 ]   A3, A4
         [  0  1  0  0  0  1 ]   A5
         [  1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x s.t. Δ x ≤ 1
– Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e

• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7)). Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
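Not from the slides: a small check of the recipe — record Δ, then solve the LP max 1^T x s.t. Δx ≤ 1 — using scipy, recovering e = 3/2 for matmul and e = 15/7 for the random code above (scipy.optimize.linprog minimizes, so the objective is negated).

  import numpy as np
  from scipy.optimize import linprog

  def hbl_exponent(delta):
      """Solve max 1^T x  s.t.  delta @ x <= 1, x >= 0; return e = 1^T x."""
      m, k = delta.shape
      res = linprog(c=-np.ones(k), A_ub=delta, b_ub=np.ones(m), bounds=[(0, None)] * k)
      return -res.fun

  delta_matmul = np.array([[1, 0, 1],    # A(i,k)
                           [0, 1, 1],    # B(k,j)
                           [1, 1, 0]])   # C(i,j)
  delta_random = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                           [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                           [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                           [0, 0, 1, 1, 0, 1],   # A3, A4(i3,i4,i6)
                           [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                           [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)
  print(hbl_exponent(delta_matmul))   # 1.5        -> words_moved = Ω(n^3 / M^(1/2))
  print(hbl_exponent(delta_random))   # 15/7 ≈ 2.14 -> words_moved = Ω(n^6 / M^(8/7))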

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3,
     access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4,
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
       φC(i1,i2,…,ik) = (i1+2*i3-i7)
       φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …

• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φC(S), φA(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that

    words_moved = Ω(#iterations / M^(e-1))

  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej · rank(φj(H)) for all subgroups H < Z^k

• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:

• Given S subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|

• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,

    |S| ≤ Πj |φj(S)|^sj
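An illustrative instance (not spelled out on the slide): for matmul, S = {(i,j,k) : 1 ≤ i,j,k ≤ n} and the three projections each drop one index, so |S| = n^3 and |φA(S)| = |φB(S)| = |φC(S)| = n^2; taking s1 = s2 = s3 = 1/2 gives n^3 ≤ (n^2)^(1/2) · (n^2)^(1/2) · (n^2)^(1/2) = n^3, so the inequality is tight and Σj sj = 3/2 matches the exponent e from the LP above.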

Is this bound attainable? (1/2)

• But first: Can we write it down?
– One inequality per subgroup H < Z^d, but still finitely many
– Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
  • Could be undecidable: open question

– Thm (good news): Another LP has the same solution and is decidable (but expensive so far)

– Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)

– Also easy to get upper/lower bounds on e

• Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of the indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …

• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
– Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
– Does k SpMVs with A and starting vector
– Many such "Krylov Subspace Methods"

• Goal: minimize communication
– Assume matrix "well-partitioned"
– Serial implementation:
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data – optimal
– Parallel implementation on p processors:
  • Conventional: O(k log p) messages (k SpMV calls, dot products)
  • New: O(log p) messages – optimal

• Lots of speedup possible (modeled and measured)
– Price: some redundant computation 78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2·x, …, A^k·x]

• Replace k iterations of y = A·x with [Ax, A^2·x, …, A^k·x]
• Example: A tridiagonal, n=32, k=3 (figure: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32; the animation highlights which entries each step needs)
• Works for any "well-partitioned" A

• Sequential Algorithm: Steps 1, 2, 3, 4 – process one block of the domain at a time, computing all k levels for that block before moving on

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
– Each processor communicates once with neighbors
– Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
– Simple block-row partitioning → (hyper)graph partitioning
– Top-to-bottom processing → Traveling Salesman Problem
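Not from the slides: a small numpy sketch of this idea for the tridiagonal example, assuming the stencil y[i] = x[i-1] + 2*x[i] + x[i+1] with zeros outside the array (the particular stencil values are an arbitrary choice for illustration). Each of the p blocks reads its k ghost entries once, then computes its pieces of A·x, …, A^k·x with no further communication, at the price of some redundant work in the ghost regions.

  import numpy as np

  def tridiag_mv(v):
      """y[i] = v[i-1] + 2*v[i] + v[i+1], with zeros outside the array."""
      y = 2.0 * v
      y[1:] += v[:-1]
      y[:-1] += v[1:]
      return y

  def powers_reference(x, k):
      """[A·x, ..., A^k·x] the conventional way: k full sweeps over x."""
      out, v = [], x
      for _ in range(k):
          v = tridiag_mv(v)
          out.append(v)
      return out

  def powers_ca(x, k, p):
      """Communication-avoiding version: one ghost exchange, then local work only."""
      n = len(x)
      b = n // p                                 # assume p divides n for simplicity
      results = [np.empty(n) for _ in range(k)]
      for blk in range(p):
          lo, hi = blk * b, (blk + 1) * b
          elo, ehi = max(lo - k, 0), min(hi + k, n)   # the single "message": k ghosts per side
          v = x[elo:ehi].copy()
          for j in range(k):
              v = tridiag_mv(v)                  # redundant work near the ghost edges
              results[j][lo:hi] = v[lo - elo : hi - elo]
      return results

  x = np.random.rand(32)
  assert all(np.allclose(a, b) for a, b in zip(powers_reference(x, 3), powers_ca(x, 3, 4)))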

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:

  W = [ v, Av, A^2·v, …, A^k·v ]
  [Q, R] = TSQR(W)              … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H
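Not from the slides: a toy, single-node sketch of the TSQR reduction in numpy — QR each block row, stack the small R factors, QR again. The parallel version applies the same R-combining step up a reduction tree; here we only check that the final R matches a direct QR of W (up to the sign of each row).

  import numpy as np

  def tsqr_R(W, p=4):
      """One level of TSQR: local QR of p block rows, then QR of the stacked R factors."""
      Rs = [np.linalg.qr(block, mode='r') for block in np.array_split(W, p)]
      return np.linalg.qr(np.vstack(Rs), mode='r')

  def fix_signs(R):
      """Normalize the row-sign ambiguity of an R factor so two R's can be compared."""
      s = np.sign(np.diag(R))
      s[s == 0] = 1.0
      return R * s[:, None]

  W = np.random.rand(100000, 10)                 # tall and skinny
  assert np.allclose(fix_signs(tsqr_R(W)), fix_signs(np.linalg.qr(W, mode='r')))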

• Oops – W from power method, precision lost! 92

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^k·x] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge
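Not from the slides: a small numpy experiment illustrating why. For a matrix with spectrum in [0, 4] (a diagonal stand-in here), the monomial basis becomes numerically rank-deficient very quickly, while a Chebyshev basis shifted to the same interval stays far better conditioned; the interval endpoints are assumed known for the shift.

  import numpy as np

  n, k = 200, 12
  A = np.diag(np.linspace(0.0, 4.0, n))     # stand-in matrix with spectrum in [0, 4]
  x = np.random.rand(n)

  # Monomial basis [x, Ax, ..., A^k x]
  Wm = [x]
  for _ in range(k):
      Wm.append(A @ Wm[-1])
  Wm = np.column_stack(Wm)

  # Chebyshev basis on [0, 4]: three-term recurrence in t = (A - 2I)/2
  t = lambda v: (A @ v - 2.0 * v) / 2.0
  Wc = [x, t(x)]
  for _ in range(k - 1):
      Wc.append(2.0 * t(Wc[-1]) - Wc[-2])
  Wc = np.column_stack(Wc)

  print(np.linalg.cond(Wm))   # huge: columns align with the dominant eigenvector
  print(np.linalg.cond(Wc))   # orders of magnitude smaller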

93

Speedups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye – replacement iterations, by basis:

  Basis:              Naive    Monomial                                Newton         Chebyshev
  Replacement its.:   74 (1)   [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

                               Indices:
  Nonzero entries:             Explicit (O(nnz))       Implicit (o(nnz))
  Explicit (O(nnz))            CSR and variations      Vision, climate, AMR, …
  Implicit (o(nnz))            Graph Laplacian         Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors

• Operations:
  • [x, Ax, A^2·x, …, A^k·x] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2·x, …, A^k·x] and [y, A^T·y, (A^T)^2·y, …, (A^T)^k·y], or [y, A^T·A·y, (A^T·A)^2·y, …, (A^T·A)^k·y]
  • Return all vectors or just the last one

• Co-tuning and/or interleaving
  • W = [x, Ax, A^2·x, …, A^k·x] and {TSQR(W) or W^T·W or …}
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice

• Lots of other progress, open problems
– Many different algorithms reorganized
  • More underway, more to be done
– Need to recognize stable variants more easily
– Preconditioning
  • Hierarchically Semiseparable Matrices
– Autotuning and synthesis
  • pOSKI for SpMV – available at bebop.cs.berkeley.edu
  • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
– Live broadcast in Spring 2014
  • www.cs.berkeley.edu/~demmel
– On-line version planned in Spring 2014
  • www.xsede.org
  • Free supercomputer accounts to do homework
  • University credit with local instructors
– 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
– Not even on your multicore laptop!

• Floating point addition is nonassociative; summation order is not reproducible
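Not from the slides: a tiny illustration — summing the same array left-to-right and right-to-left in double precision typically gives two different answers.

  import numpy as np

  x = (np.random.rand(1_000_000) - 0.5) * 1e8   # values of mixed sign and large magnitude
  fwd = 0.0
  for v in x:
      fwd += v
  bwd = 0.0
  for v in x[::-1]:
      bwd += v
  print(fwd == bwd, fwd - bwd)   # typically: False, and a small nonzero difference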

• First release of the ReproBLAS
– Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
– Sequential and distributed memory (MPI)

• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 28: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Why is CARMA FasterL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Parallel MatMul with 2D Processor Layout

bull P processors in P12 x P12 gridndash Processors communicate along rows columns

bull Each processor owns nP12 x nP12 submatrices of ABCbull Example P=16 processors numbered from P00 to P33

ndash Processor Pij owns submatrices Aij Bij and Cij

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

P00 P01 P02 P03

P10 P11 P12 P13

P20 P21 P22 P23

P30 P31 P32 P33

C = A B

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:      words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:   words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:  words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs DFS step:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:  if EnoughMemory and P ≥ 7  then BFS step  else DFS step  end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"
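For reference, the seven products that a BFS or DFS step schedules are just the classical Strassen recursion; below is a plain sequential sketch in Python/numpy (the cutoff value and the power-of-two sizes are illustrative assumptions — CAPS itself is the parallel schedule of these recursive calls, not this code).

import numpy as np

def strassen(A, B, cutoff=64):
    # Recursive Strassen matmul sketch (n a power of 2 times the cutoff).
    # CAPS runs the 7 recursive products either all in parallel on P/7
    # processors each (BFS step) or one after another on all P (DFS step).
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11,       cutoff)
    M3 = strassen(A11,       B12 - B22, cutoff)
    M4 = strassen(A22,       B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22,       cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)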

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA-SBR

Conventional:            touch all data 4 times
Communication-Avoiding:  touch all data once

Many tuning parameters; right choices reduce words_moved by a factor of M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
       D = A
       Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
       D11 = DC-APSP(D11, n/2)
       D12 = D11 ⊗ D12
       D21 = D21 ⊗ D11
       D22 = D21 ⊗ D12
       D22 = DC-APSP(D22, n/2)
       D21 = D22 ⊗ D21
       D12 = D12 ⊗ D22
       D11 = D12 ⊗ D21
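A tiny executable illustration of this semiring in Python/numpy (the function names and the repeated-squaring driver are illustrative assumptions, not the 2.5D code): the "multiply" below is exactly the D = A⊗B operation above, and squaring it O(log n) times reproduces Floyd-Warshall.

import numpy as np

INF = np.inf

def minplus(D, A, B):
    # D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) ) -- the semiring "matmul"
    # above; the same 2.5D blocking as classical matmul applies.
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def floyd_warshall(A):
    D = A.copy()
    for k in range(len(D)):
        D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
    return D

# small random APSP instance
rng = np.random.default_rng(1)
n = 8
A = np.where(rng.random((n, n)) < 0.4, rng.random((n, n)), INF)
np.fill_diagonal(A, 0.0)

# repeated squaring in the (min,+) semiring reaches the APSP solution in O(log n) products
D = A.copy()
for _ in range(int(np.ceil(np.log2(n)))):
    D = minplus(D, D, D)
assert np.allclose(D, floyd_warshall(A))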

Performance of 2.5D APSP using Kleene's algorithm

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
[Plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω( w^3 / M^(1/2) ), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n
       C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n,  for j1 = 1:b:n,  for k1 = 1:b:n
       for i2 = 0:b-1,  for j2 = 0:b-1,  for k2 = 0:b-1
          i = i1+i2,  j = j1+j2,  k = k1+k2
          C(i,j) += A(i,k)*B(k,j)        … innermost loops do a b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3 / M^(1/2))
• Where does 1/2 come from?
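A direct transcription of the blocked loop nest into Python/numpy (illustrative only; the tile size b ≈ sqrt(M/3), so that the three b x b tiles fit in fast memory, is an assumption — the slide's b = M^(1/2) drops the constant):

import numpy as np

def blocked_matmul(A, B, b):
    # Each (i1, j1, k1) iteration does a b x b matmul on tiles that fit
    # in fast memory of size M (pick b so ~3 b^2 <= M).
    n = A.shape[0]
    C = np.zeros_like(A)
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C

A = np.random.rand(128, 128); B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B, 32), A @ B)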

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
          [ 1  0  1 ]   A
      Δ = [ 0  1  1 ]   B
          [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: words_moved = Ω(n^3 / M^(e-1)) = Ω(n^3 / M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
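As a sanity check, the small LP above can be solved directly; this sketch uses scipy.optimize.linprog (a convenience assumption, not part of the tutorial) and recovers x = [1/2, 1/2, 1/2] and e = 3/2 for matmul. Swapping in the Δ of the N-body or "random code" examples below gives e = 2 and e = 15/7 the same way.

import numpy as np
from scipy.optimize import linprog

def hbl_exponent(delta):
    # Solve  max 1^T x  s.t.  delta @ x <= 1,  x >= 0
    # (linprog minimizes, so negate the objective).
    m, k = delta.shape
    res = linprog(c=-np.ones(k), A_ub=delta, b_ub=np.ones(m), bounds=(0, None))
    return res.x, -res.fun          # optimal x and e = 1^T x

# Matmul: rows index A, B, C; columns index i, j, k
delta_matmul = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [1, 1, 0]])
x, e = hbl_exponent(delta_matmul)
print(x, e)   # -> [0.5 0.5 0.5], 1.5, so words_moved = Ω(n^3 / M^(e-1)) = Ω(n^3 / M^(1/2))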

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
          [ 1  0 ]   F
      Δ = [ 1  0 ]   P(i)
          [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: words_moved = Ω(n^2 / M^(e-1)) = Ω(n^2 / M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
     A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
     A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

           i1 i2 i3 i4 i5 i6
          [ 1  0  1  0  0  1 ]   A1
          [ 1  1  0  1  0  0 ]   A2
      Δ = [ 0  1  1  0  1  0 ]   A3
          [ 0  0  1  1  0  1 ]   A3, A4
          [ 0  1  0  0  0  1 ]   A5
          [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T:  max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: words_moved = Ω(n^6 / M^(e-1)) = Ω(n^6 / M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
     for i=1:n, for j=1:n, for k=1:n
        C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
     for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
        C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
        D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
        φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
        φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), … ?

General Communication Bound

• Thm: Given a program with array refs given by projections φ_j, then there is an e ≥ 1 such that

     words_moved = Ω( #iterations / M^(e-1) )

  where e is the value of a linear program:
     minimize e = Σ_j e_j subject to
     rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k

• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
  – Given S subset of Z^k and group homomorphisms φ_1, φ_2, …, bound |S| in terms of |φ_1(S)|, |φ_2(S)|, …, |φ_m(S)|
  – Thm (Christ/Tao/Carbery/Bennett): Given s_1, …, s_m:  |S| ≤ Π_j |φ_j(S)|^(s_j)

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive, so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A2x, …, Akx]

• Replace k iterations of y = A·x with [Ax, A2x, …, Akx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure: rows x, A·x, A2·x, A3·x over columns 1 2 3 4 … 32, shown step by step in the animation]

• Sequential Algorithm: Steps 1, 2, 3, 4 – process one block of columns at a time, computing all k levels for that block before moving on
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  – Each processor communicates once with neighbors
  – Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
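Below is a minimal sequential sketch (Python/numpy; the function name and dense storage are illustrative assumptions) of the ghost-zone idea for the tridiagonal example: the owner of rows lo:hi gathers a halo of width k once, then computes its pieces of Ax, …, A^k x with no further communication.

import numpy as np

def matrix_powers_block(A, x, k, lo, hi):
    # Rows lo:hi of [A x, A^2 x, ..., A^k x] for a tridiagonal A,
    # using one gather of a width-k halo (the overlapping trapezoid).
    n = len(x)
    a, b = max(0, lo - k), min(n, hi + k)   # extended (ghosted) index range
    A_loc = A[a:b, a:b]                     # local rows/cols of A plus halo
    v = x[a:b].copy()
    out = np.empty((k, hi - lo))
    for j in range(k):
        v = A_loc @ v                       # entries far enough from the halo edge stay exact
        out[j] = v[lo - a:hi - a]
    return out

# check against the naive k SpMVs on a random tridiagonal A (n=32, k=3 as in the figure)
n, k = 32, 3
rng = np.random.default_rng(2)
A = np.diag(rng.random(n)) + np.diag(rng.random(n - 1), 1) + np.diag(rng.random(n - 1), -1)
x = rng.random(n)
naive = [np.linalg.matrix_power(A, j) @ x for j in range(1, k + 1)]
for lo in range(0, n, 8):                   # 4 "processors", 8 rows each
    blk = matrix_powers_block(A, x, k, lo, lo + 8)
    for j in range(k):
        assert np.allclose(blk[j], naive[j][lo:lo + 8])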

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)                  … SpMV
      MGS(w, v(0), …, v(i-1))         … update v(i), H
   endfor
   solve LSQ problem with H

Communication-avoiding GMRES:
   W = [ v, Av, A2v, …, Akv ]
   [Q,R] = TSQR(W)                    … "Tall Skinny QR"
   build H from R
   solve LSQ problem with H

• Sequential case: #words_moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
• Oops – W from power method, precision lost!
  – "Monomial" basis [Ax, …, Akx] fails to converge
  – Different polynomial basis [p1(A)x, …, pk(A)x] does converge
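The precision-loss remark can be seen directly from the conditioning of the basis matrix W; the small sketch below (Python/numpy, illustrative only — not the tutorial's CA-GMRES code) watches cond(W) grow as columns of the monomial basis are appended.

import numpy as np

# Why the "monomial" basis W = [v, Av, ..., A^k v] loses precision: its columns
# line up with the dominant eigenvector (the power method), so cond(W) blows up
# with k.  Illustrative numbers only.
rng = np.random.default_rng(3)
n = 200
A = np.diag(np.linspace(1.0, 10.0, n))      # simple SPD test matrix
v = rng.standard_normal(n)

cols = [v / np.linalg.norm(v)]
for k in range(1, 13):
    w = A @ cols[-1]
    cols.append(w / np.linalg.norm(w))
    if k % 4 == 0:
        print(k, np.linalg.cond(np.column_stack(cols)))
# cond(W) grows rapidly with k; CA-GMRES therefore builds a better-conditioned
# Newton or Chebyshev basis [p1(A)v, ..., pk(A)v] before the TSQR step.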

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab with Residual Replacement (RR) a la Van der Vorst and Ye

  Basis:             Naive     Monomial                              Newton        Chebyshev
  Replacement Its.:  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)  [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: With stencils (all implicit), all communication is for vectors

                                       Nonzero entries
                               Explicit (O(nnz))     Implicit (o(nnz))
  Indices  Explicit (O(nnz)):  CSR and variations    Vision, climate, AMR, …
  Indices  Implicit (o(nnz)):  Graph Laplacian       Stencils

• Operations
  – [x, Ax, A2x, …, Akx]  or  [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A2x, …, Akx] and [y, ATy, (AT)2y, …, (AT)ky], or [y, ATAy, (ATA)2y, …, (ATA)ky]
  – Return all vectors or just the last one?
• Cotuning and/or interleaving
  – W = [x, Ax, A2x, …, Akx] and {TSQR(W) or W^T·W or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative; summation order not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
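A two-line illustration of the nonassociativity point (Python; illustrative only, not the ReproBLAS algorithm, which uses binned/pre-rounded sums): the same numbers summed in a different order usually give a different result, while an exactly rounded sum is order-independent.

import math
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(1_000_000) * 10.0 ** rng.integers(-8, 8, 1_000_000)
xp = rng.permutation(x)            # same data, different order (think: different
                                   # processor count or reduction tree)

print(np.sum(x) == np.sum(xp))     # usually False: floating-point + is not associative
print(math.fsum(x) == math.fsum(xp))  # True: an exactly rounded sum is order-independent,
                                      # the kind of guarantee ReproBLAS provides with a
                                      # cheaper, pre-rounding-based algorithm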

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

30

SUMMA Algorithm

bull SUMMA = Scalable Universal Matrix Multiply ndash Attains lower bounds

bull Assume fast memory size M = O(n2P) per processor ndash 1 copy of databull words_moved = Ω( flops M12 ) = Ω( (n3P) (n2P)12 ) = Ω( n2 P12 )bull messages = Ω( flops M32 ) = Ω( (n3P) (n2P)32 ) = Ω( P12 )

ndash Can accommodate any processor grid matrix dimensions amp layout

ndash Used in practice in PBLAS = Parallel BLASbull wwwnetliborglapacklawnslawn96100ps

bull Comparison to Cannonrsquos Algorithmndash Cannon attains lower boundndash But Cannon harder to generalize to other grids dimensions layouts and Cannon may use more

memory

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors, and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n³, minimizing T = maxi Ti
  – Answer: Fi = n³·(1/ξi)/Σj(1/ξj) and T = n³/Σj(1/ξj)
• Optimal algorithm for n x n matmul:
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
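A small NumPy sketch of the optimal work assignment above (the two processor types and their parameters are made up for illustration): each processor gets F_i ∝ 1/ξ_i flops, so all finish at the same time T.

import numpy as np

def optimal_work_split(n, gamma, beta, alpha, M):
    """Assign F_i flops of an n^3-flop computation to heterogeneous processor i:
    xi_i = gamma_i + beta_i/M_i^(1/2) + alpha_i/M_i^(3/2) is its effective sec/flop,
    F_i = n^3 * (1/xi_i) / sum_j (1/xi_j), and T = n^3 / sum_j (1/xi_j)."""
    gamma, beta, alpha, M = map(np.asarray, (gamma, beta, alpha, M))
    xi = gamma + beta / np.sqrt(M) + alpha / M**1.5
    F = n**3 * (1.0 / xi) / np.sum(1.0 / xi)
    T = n**3 / np.sum(1.0 / xi)
    return F, T, xi

# Two made-up processor types: a faster/larger-memory one and a slower/smaller one.
F, T, xi = optimal_work_split(n=1024,
                              gamma=[1e-9, 4e-9], beta=[1e-8, 2e-8],
                              alpha=[1e-6, 1e-6], M=[2**20, 2**18])
assert np.isclose(F.sum(), 1024**3) and np.allclose(F * xi, T)  # balanced finish times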

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σm A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
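As a small illustration of this kind of contraction (array shapes are arbitrary example values, and the symmetry check at the end only shows what a 2-fold symmetry means — CTF's actual packed storage is not reproduced here):

import numpy as np

# C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k), written as an einsum contraction.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 5, 6, 6))     # indices i, j, m, n
B = rng.standard_normal((6, 6, 8))        # indices m, n, k
C = np.einsum('ijmn,mnk->ijk', A, B)      # result has shape (4, 5, 8)

# A 2-fold symmetry such as B(m,n,k) = B(n,m,k) means only m <= n is needed;
# frameworks like CTF exploit this, here we only symmetrize and check the property.
B_sym = 0.5 * (B + B.transpose(1, 0, 2))
assert np.allclose(B_sym, B_sym.transpose(1, 0, 2)) and C.shape == (4, 5, 8)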

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

TSQR: QR of a Tall, Skinny matrix

45

W = [ W0; W1; W2; W3 ]
  = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

TSQR: QR of a Tall, Skinny matrix

46

W = [ W0; W1; W2; W3 ]
  = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ] = diag(Q01, Q11) · [ R01; R11 ]

[ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
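A minimal NumPy sketch of the TSQR reduction tree above, run in a single process (block count and matrix sizes are example choices; a real implementation would perform each local QR on its own processor and keep the implicit Q factors):

import numpy as np

def tsqr(W, nblocks):
    """TSQR: QR of a tall-skinny W via local QRs plus a binary reduction tree.
    Returns only the final R factor; the implicit Q is the product of the
    block-diagonal Q factors from every level of the tree."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(b)[1] for b in blocks]       # leaf QRs: W_i = Q_i0 * R_i0
    while len(Rs) > 1:                              # pairwise combine up the tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(3)
W = rng.standard_normal((4096, 50))                 # tall and skinny
R_tree = tsqr(W, nblocks=4)
R_ref = np.linalg.qr(W)[1]
# R is unique up to the signs of its rows, so compare absolute values.
assert np.allclose(np.abs(R_tree), np.abs(R_ref), atol=1e-8)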

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees for TSQR on different architectures —
   Parallel:   W = [W0; W1; W2; W3] → (R00, R10, R20, R30) → (R01, R11) → R02
   Sequential: W = [W0; W1; W2; W3] → R00 → R01 → R02 → R03
   Dual Core:  a hybrid of the two trees above]

Can choose reduction tree dynamically
Multicore, Multisocket, Multirack, Multisite, Out-of-core

TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

49

W (n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
   Choose b pivot rows of W1, call them W1'
   Choose b pivot rows of W2, call them W2'
   Choose b pivot rows of W3, call them W3'
   Choose b pivot rows of W4, call them W4'

[ W1'; W2'; W3'; W4' ] = [ P12·L12·U12; P34·L34·U34 ]
   Choose b pivot rows, call them W12'
   Choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234
   Choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
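A small sketch of the row-selection step of tournament pivoting using SciPy's LU with partial pivoting (block counts and sizes are illustrative; only the selection of the b pivot rows is shown — the final LU of W with those rows moved to the top is omitted):

import numpy as np
from scipy.linalg import lu

def pivot_rows(block_rows, W, b):
    """Return the b global row indices chosen by LU with partial pivoting
    applied to the rows of W listed in block_rows."""
    P, L, U = lu(W[block_rows, :])          # W[block_rows] = P @ L @ U
    order = np.argmax(P, axis=0)            # row landing in position i is order[i]
    return [block_rows[j] for j in order[:b]]

def tournament_pivoting(W, b, nblocks):
    """Select b pivot rows of W via a reduction tree of partial-pivoting LUs."""
    groups = [list(g) for g in np.array_split(np.arange(W.shape[0]), nblocks)]
    winners = [pivot_rows(g, W, b) for g in groups]            # leaves of the tree
    while len(winners) > 1:                                    # pairwise rounds
        winners = [pivot_rows(winners[i] + winners[i + 1], W, b)
                   for i in range(0, len(winners), 2)]
    return winners[0]

rng = np.random.default_rng(4)
W = rng.standard_normal((64, 4))
print("selected pivot rows:", tournament_pivoting(W, b=4, nblocks=4))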

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Plot: speedup heatmap; x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc); up to 29x faster]

Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n² / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:        words_moved = Ω( M·(n/M^(1/2))³ / P )
Strassen's O(n^lg7) matmul:    words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

BFS vs DFS:
   BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
   DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:  if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice, how to best interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQᵀ = T, where
  – A = Aᵀ is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA-SBR

Conventional: touch all data 4 times        Communication-Avoiding: touch all data once

Many tuning parameters; right choices reduce words_moved by a factor of M/bw, not just M^(1/2)


Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

     for k = 1:n
        for i = 1:n
           for j = 1:n
              D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

61

     D = DC-APSP(A, n)
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 ⊗ D12
        D21 = D21 ⊗ D11
        D22 = D21 ⊗ D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 ⊗ D21
        D12 = D12 ⊗ D22
        D11 = D12 ⊗ D21
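A dense, single-process NumPy sketch of the min-plus product and the DC-APSP recursion above (⊗ here is the accumulating min-plus update; graph size and density are arbitrary test values):

import numpy as np

def minplus(D, A, B):
    """Accumulating min-plus product: D(i,j) = min(D(i,j), min_k A(i,k)+B(k,j))."""
    return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

def dc_apsp(A):
    """Kleene-style divide-and-conquer all-pairs shortest paths."""
    n = A.shape[0]
    if n == 1:
        return np.minimum(A, 0.0)           # zero-length path to itself
    h = n // 2
    D = A.copy()
    D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
    D11[:] = dc_apsp(D11)
    D12[:] = minplus(D12, D11, D12)
    D21[:] = minplus(D21, D21, D11)
    D22[:] = minplus(D22, D21, D12)
    D22[:] = dc_apsp(D22)
    D21[:] = minplus(D21, D22, D21)
    D12[:] = minplus(D12, D12, D22)
    D11[:] = minplus(D11, D12, D21)
    return D

# Check against Floyd-Warshall on a random graph (inf = no edge).
rng = np.random.default_rng(5)
n = 8
A = np.where(rng.random((n, n)) < 0.4, rng.random((n, n)), np.inf)
np.fill_diagonal(A, 0.0)
D_fw = A.copy()
for k in range(n):
    D_fw = np.minimum(D_fw, D_fw[:, [k]] + D_fw[[k], :])
assert np.allclose(dc_apsp(A.copy()), D_fw)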

Performance of 2.5D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

63

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
     Words_moved = Ω( min( d·n/P^(1/2), d²·n/P ) )
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d²·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDLᵀ, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
     for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)

• "Blocked" code:
     for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
        for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
           i = i1+i2, j = j1+j2, k = k1+k2
           C(i,j) += A(i,k)·B(k,j)            … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from?

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)
• Record array indices in matrix Δ:

            i  j  k
         [  1  0  1  ]   A
    Δ =  [  0  1  1  ]   B
         [  1  1  0  ]   C

• Solve LP for x = [xi, xj, xk]ᵀ: max 1ᵀx  s.t.  Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]ᵀ, 1ᵀx = 3/2 = e
• Thm: words_moved = Ω(n³/M^(e-1)) = Ω(n³/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
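A quick check of this LP with SciPy (purely illustrative; linprog minimizes, so the objective is negated):

import numpy as np
from scipy.optimize import linprog

# Rows of Delta: which loop indices (i, j, k) appear in each array reference.
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

# maximize 1^T x  s.t.  Delta x <= 1, x >= 0   (linprog minimizes -1^T x)
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
x = res.x                       # -> [0.5, 0.5, 0.5]
e = x.sum()                     # -> 1.5, so words_moved = Omega(n^3 / M^(e-1))
print(x, e)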

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
         [  1  0  ]   F
    Δ =  [  1  0  ]   P(i)
         [  0  1  ]   P(j)

• Solve LP for x = [xi, xj]ᵀ: max 1ᵀx  s.t.  Δ·x ≤ 1
  – Result: x = [1, 1], 1ᵀx = 2 = e
• Thm: words_moved = Ω(n²/M^(e-1)) = Ω(n²/M¹)
  Attained by block sizes M^xi, M^xj = M¹, M¹

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
     A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
     A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
         [  1  0  1  0  0  1  ]   A1
         [  1  1  0  1  0  0  ]   A2
    Δ =  [  0  1  1  0  1  0  ]   A3
         [  0  0  1  1  0  1  ]   A3, A4
         [  0  1  0  0  0  1  ]   A5
         [  1  0  0  1  1  0  ]   A6

• Solve LP for x = [x1, …, x6]ᵀ: max 1ᵀx  s.t.  Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1ᵀx = 15/7 = e
• Thm: words_moved = Ω(n⁶/M^(e-1)) = Ω(n⁶/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
     for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)
  => for (i,j,k) in S = subset of Z³, access locations indexed by (i,j), (i,k), (k,j)
• General case:
     for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
        C(i1+2·i3-i7) = func( A(i2+3·i4, i1, i2, i1+i2, …), B(pnt(3·i4)), … )
        D(something else) = func(something else)
        …
  => for (i1,i2,…,ik) in S = subset of Zᵏ,
     access locations indexed by "projections", e.g.
        φC(i1,i2,…,ik) = (i1+2·i3-i7)
        φA(i1,i2,…,ik) = (i2+3·i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φC(S), φA(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that
     words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
     minimize e = Σj ej subject to
     rank(H) ≤ Σj ej·rank(φj(H)) for all subgroups H < Zᵏ
• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S = subset of Zᵏ, group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:
     |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Zᵈ, but still finitely many
  – Thm (bad news): Writing down all inequalities in LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has same solution, is decidable (but expensive so far)
  – Thm (better news): Easy to write LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]

• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A

[Figure, repeated over several animation frames: the mesh points 1, 2, 3, 4, …, 32 and the vectors x, A·x, A²·x, A³·x, showing which entries each value depends on]

• Sequential Algorithm: Steps 1, 2, 3, 4 — process one block of the mesh at a time, computing all k vectors for that block before moving on
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  – Each processor communicates once with neighbors
  – Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
   Simple block-row partitioning → (hyper)graph partitioning
   Top-to-bottom processing → Traveling Salesman Problem
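A minimal sketch of the parallel matrix powers kernel for the tridiagonal example, simulated in one process: each block owner receives k ghost values per side once, then computes its rows of Ax, A²x, A³x with no further communication. The helper and its sizes are assumptions for illustration, not the tutorial's implementation.

import numpy as np

def matrix_powers_block(A_band, x, lo, hi, k, n):
    """Compute rows [lo, hi) of A^j x for j = 1..k for a tridiagonal A, given the
    owned entries of x plus k ghost entries on each side (received once). Entries
    near the ghost boundary go stale as j grows, but never reach the owned rows."""
    glo, ghi = max(lo - k, 0), min(hi + k, n)     # owned range plus ghost zones
    v = x[glo:ghi].copy()                         # the single communication step
    out = []
    for _ in range(k):
        w = np.zeros_like(v)
        for r in range(len(v)):                   # local tridiagonal SpMV
            g = glo + r
            w[r] = A_band[1, g] * v[r]
            if g - 1 >= glo:
                w[r] += A_band[0, g] * v[r - 1]
            if g + 1 < ghi:
                w[r] += A_band[2, g] * v[r + 1]
        v = w
        out.append(v[lo - glo:hi - glo].copy())   # owned rows of A^j x are exact
    return out

# Tridiagonal A with n = 32, k = 3, and 4 simulated "processors".
n, k, p = 32, 3, 4
rng = np.random.default_rng(6)
A_band = rng.standard_normal((3, n))              # (sub, main, super) diagonals
A = np.diag(A_band[1]) + np.diag(A_band[0][1:], -1) + np.diag(A_band[2][:-1], 1)
x = rng.standard_normal(n)
blk = n // p
pieces = [matrix_powers_block(A_band, x, i * blk, (i + 1) * blk, k, n) for i in range(p)]
for j in range(1, k + 1):                         # stitch the owned pieces together
    Ajx = np.concatenate([pieces[i][j - 1] for i in range(p)])
    assert np.allclose(Ajx, np.linalg.matrix_power(A, j) @ x)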

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing || Ax-b ||₂

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)             … SpMV
      MGS(w, v(0), …, v(i-1))    … Modified Gram-Schmidt
      update v(i), H
   endfor
   solve LSQ problem with H

Communication-avoiding GMRES:
   W = [ v, Av, A²v, …, Aᵏv ]
   [Q, R] = TSQR(W)              … "Tall Skinny QR"
   build H from R
   solve LSQ problem with H

• Oops – W from power method, precision lost

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, Aᵏx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge

93
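A small numerical illustration of why the monomial Krylov basis loses precision while a better polynomial basis (here Chebyshev, built from assumed spectral bounds) does not; the matrix, spectrum, and k are arbitrary example choices.

import numpy as np

rng = np.random.default_rng(7)
n, k = 200, 12
eigs = np.linspace(1.0, 100.0, n)                  # assumed spectrum in [1, 100]
Qo, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Qo @ np.diag(eigs) @ Qo.T
v = rng.standard_normal(n)
v /= np.linalg.norm(v)

# Monomial basis [Av, A^2 v, ..., A^k v], columns normalized.
W_mono, w = [], v
for _ in range(k):
    w = A @ w
    w = w / np.linalg.norm(w)
    W_mono.append(w)

# Chebyshev basis [T_1(B)v, ..., T_k(B)v], with B = (A - cI)/d mapped onto [-1, 1].
c, d = (100.0 + 1.0) / 2, (100.0 - 1.0) / 2
Bv = lambda y: (A @ y - c * y) / d
t_prev, t_cur, W_cheb = v, Bv(v), []
for _ in range(k):
    W_cheb.append(t_cur / np.linalg.norm(t_cur))
    t_prev, t_cur = t_cur, 2 * Bv(t_cur) - t_prev

# The monomial basis becomes severely ill-conditioned (precision lost when it is
# orthogonalized); the Chebyshev basis stays orders of magnitude better conditioned.
print("cond(monomial)  = %.2e" % np.linalg.cond(np.column_stack(W_mono)))
print("cond(chebyshev) = %.2e" % np.linalg.cond(np.column_stack(W_cheb)))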

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

                    Naive      Monomial                                 Newton         Chebyshev
Replacement Its.    74 (1)     [7, 15, 24, 31, …, 92, 97, 103] (17)     [67, 98] (2)   68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

                            Nonzero entries explicit (O(nnz))   Nonzero entries implicit (o(nnz))
Indices explicit (O(nnz))   CSR and variations                  Vision, climate, AMR, …
Indices implicit (o(nnz))   Graph Laplacian                     Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A²x, …, Aᵏx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  • Return all vectors or just last one
• Cotuning and/or interleaving
  • W = [x, Ax, A²x, …, Aᵏx] and TSQR(W) or WᵀW or …
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop
• Floating point addition is nonassociative; summation order not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
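A one-minute illustration of the non-associativity point above (values are arbitrary):

import numpy as np

# Grouping changes the result: floating point addition is not associative.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))        # False

# So different reduction orders (serial vs pairwise/tree) give different sums.
rng = np.random.default_rng(8)
x = rng.standard_normal(10**6) * 10.0 ** rng.integers(-8, 8, 10**6)
serial = sum(x.tolist())              # left-to-right summation order
tree = float(np.sum(x))               # NumPy's pairwise (tree-like) summation
print(serial == tree, serial - tree)  # typically False, with a small nonzero gap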

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 31: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

31

SUMMA ndash n x n matmul on P12 x P12 grid

bull C(i j) is nP12 x nP12 submatrix of C on processor Pijbull A(ik) is nP12 x b submatrix of Abull B(kj) is b x nP12 submatrix of B bull C(ij) = C(ij) + Sk A(ik)B(kj)

bull summation over submatricesbull Need not be square processor grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

32

SUMMAndash n x n matmul on P12 x P12 grid

=i

j

A(ik)

k

k

B(kj)

C(ij)

For k=0 to nb-1

for all i = 1 to P12

owner of A(ik) broadcasts it to whole processor row (using binary tree)

for all j = 1 to P12

owner of B(kj) broadcasts it to whole processor column (using bin tree)

Receive A(ik) into Acol

Receive B(kj) into Brow

C_myproc = C_myproc + Acol Brow

Brow

Acol

bull Attains bandwidth lower boundbull Attains latency lower bound if b near maximum nP12

Summary of dense parallel algorithms attaining communication lower bounds

bull Assume nxn matrices on P processors bull Minimum Memory per processor = M = O(n2 P)bull Recall lower bounds

words_moved = ( (n3 P) M12 ) = ( n2 P12 ) messages = ( (n3 P) M32 ) = ( P12 )

bull Does ScaLAPACK attain these boundsbull For words_moved mostly except nonsym Eigenproblembull For messages asymptotically worse except Cholesky

bull New algorithms attain all bounds up to polylog(P) factorsbull Cholesky LU QR Sym and Nonsym eigenproblems SVD

Can we do Better

Can we do betterbull Arenrsquot we already optimalbull Why assume M = O(n2P) ie minimal

ndash Lower bound still true if more memoryndash Can we attain itndash Special case ldquo3D Matmulrdquo uses M = O(n2P23)

bull Dekel Nassimi Sahni [81] Bernsten [89] Agarwal Chandra Snir [90] Johnson [93] Agarwal Balle Gustavson Joshi Palkar [95]

bull Processors arranged in P13 x P13 x P13 gridbull Processor (ijk) performs C(ij) = C(ij) + A(ik)B(kj) where

each submatrix is nP13 x nP13

ndash M = O(n2P23) is P13 times the minimiumbull Not always that much memory availablehellip

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

Page 32: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

32

SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid

[Figure omitted: block C(i,j) accumulates A(i,k)·B(k,j); the panel A(i,k) is broadcast along processor row i into Acol, and the panel B(k,j) along processor column j into Brow]

    For k = 0 to n/b - 1
      for all i = 1 to P^(1/2)
        owner of A(i,k) broadcasts it to whole processor row (using binary tree)
      for all j = 1 to P^(1/2)
        owner of B(k,j) broadcasts it to whole processor column (using binary tree)
      Receive A(i,k) into Acol
      Receive B(k,j) into Brow
      C_myproc = C_myproc + Acol * Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum, n/P^(1/2)
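A serial sketch of the SUMMA structure (illustrative only; the function and variable names are assumed, and the broadcasts become plain slices when run on one process): C is built from rank-b updates C += Acol·Brow.

    import numpy as np

    def summa_serial(A, B, b):
        # Emulates the SUMMA loop: for each panel k, "broadcast" A(:,k) along
        # processor rows and B(k,:) along processor columns, then do the local
        # rank-b update. Run serially here, so the broadcasts are just slices.
        n = A.shape[0]
        C = np.zeros_like(A)
        for k in range(0, n, b):
            Acol = A[:, k:k + b]    # what each processor row would receive
            Brow = B[k:k + b, :]    # what each processor column would receive
            C += Acol @ Brow        # C_myproc = C_myproc + Acol * Brow
        return C

    n, b = 64, 8
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(summa_serial(A, B, b), A @ B)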

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
    words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages   = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )
• Does ScaLAPACK attain these bounds?
  – For words_moved: mostly, except nonsymmetric eigenproblem
  – For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds up to polylog(P) factors
  – Cholesky, LU, QR, symmetric and nonsymmetric eigenproblems, SVD

Can we do better?

Can we do better?
• Aren't we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bound still true if more memory
  – Can we attain it?
  – Special case: "3D Matmul" uses M = O(n^2/P^(2/3))
• Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
• Processors arranged in P^(1/3) x P^(1/3) x P^(1/3) grid
• Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)·B(k,j), where each submatrix is n/P^(1/3) x n/P^(1/3)
  – M = O(n^2/P^(2/3)) is P^(1/3) times the minimum
• Not always that much memory available…

2.5D Matrix Multiplication

• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure omitted: a (P/c)^(1/2) x (P/c)^(1/2) x c processor grid; example: P = 32, c = 2]

2.5D Matrix Multiplication

• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid (axes i, j, k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along the k-axis so that P(i,j,0) owns C(i,j)
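A serial sketch of the 2.5D idea (illustrative; function and variable names are assumed): each of the c layers multiplies a 1/c slice of the inner dimension, and the partial products are sum-reduced at the end, mirroring steps (2) and (3) above.

    import numpy as np

    def matmul_2_5d_layers(A, B, c):
        n = A.shape[0]
        w = n // c                    # slice of the inner (k) dimension handled by layer m
        partials = [A[:, m * w:(m + 1) * w] @ B[m * w:(m + 1) * w, :] for m in range(c)]
        return sum(partials)          # the sum-reduction along the k-axis of the grid

    n, c = 60, 3
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(matmul_2_5d_layers(A, B, c), A @ B)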

2.5D Matmul on BG/P, 16K nodes / 64K cores

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

[Plot omitted; annotations: 12x faster, 2.7x faster]

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
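A small calculator for the model above (parameter names and values are assumed, not from the tutorial): it checks that multiplying the processor count by c divides the running time by c and leaves total energy unchanged, as long as every processor keeps using its full memory M.

    def time_model(n, P, c, M, m, gamma_t, beta_t, alpha_t):
        flops_per_proc = n ** 3 / (c * P)
        return flops_per_proc * (gamma_t + beta_t / M ** 0.5 + alpha_t / (m * M ** 0.5))

    def energy_model(n, P, c, M, m, gamma_e, beta_e, alpha_e, delta_e, eps_e,
                     gamma_t, beta_t, alpha_t):
        T = time_model(n, P, c, M, m, gamma_t, beta_t, alpha_t)
        per_proc = (n ** 3 / (c * P)) * (gamma_e + beta_e / M ** 0.5 + alpha_e / (m * M ** 0.5)) \
                   + delta_e * M * T + eps_e * T
        return c * P * per_proc       # total over all cP processors

    base = dict(n=4096, P=64, M=3 * 4096 ** 2 / 64, m=1024,
                gamma_t=1e-11, beta_t=1e-9, alpha_t=1e-6)
    extra = dict(gamma_e=1e-10, beta_e=1e-8, alpha_e=1e-5, delta_e=1e-12, eps_e=1e-3)
    print(time_model(c=1, **base) / time_model(c=2, **base))                       # ~2.0
    print(energy_model(c=1, **base, **extra) / energy_model(c=2, **base, **extra))  # ~1.0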

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  – What is the minimum energy required for a computation?
  – Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  – The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  – Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n^3 and minimize T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal Algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
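A sketch of this heterogeneous load balance (illustrative; parameter names and values are assumed): give processor i a flop count F_i proportional to 1/ξ_i, its effective seconds per flop including the communication terms, so that every processor finishes at the same time T.

    def heterogeneous_split(n, gammas, betas, alphas, mems):
        xis = [g + b / M ** 0.5 + a / M ** 1.5 for g, b, a, M in zip(gammas, betas, alphas, mems)]
        total_rate = sum(1.0 / xi for xi in xis)
        F = [n ** 3 * (1.0 / xi) / total_rate for xi in xis]   # flops assigned to processor i
        T = n ** 3 / total_rate                                # common finish time, F_i * xi_i
        return F, T

    F, T = heterogeneous_split(n=1024,
                               gammas=[1e-11, 4e-11], betas=[1e-9, 1e-9],
                               alphas=[1e-6, 2e-6], mems=[1e8, 5e7])
    print(T, sum(F))   # sum(F) equals n^3, and every F_i * xi_i equals T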

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

C(i,j,k) = Σm A(i,j,m)·B(m,k)

[Figure omitted: A has 3-fold symmetry, B has 2-fold symmetry, C has 2-fold symmetry]

TSQR: QR of a Tall, Skinny matrix

(";" stacks blocks vertically)

W = [ W0; W1; W2; W3 ]
  = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]              ... local QR of each block row
  = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]

[ R00; R10; R20; R30 ] = [ Q01·R01; Q11·R11 ] = diag(Q01, Q11) · [ R01; R11 ]   ... pairwise QR of stacked R factors

[ R01; R11 ] = Q02·R02                                  ... final QR at the root of the tree

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
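A small numpy sketch of the parallel TSQR tree above (illustrative; not the tutorial's code): local QRs of the four block rows, pairwise QRs of the stacked R factors, and a final QR at the root. Only the R factor is formed explicitly here; the Q factors stay implicit in the tree.

    import numpy as np

    def tsqr_r_factor(W, nblocks=4):
        blocks = np.array_split(W, nblocks, axis=0)
        R0 = [np.linalg.qr(Wi)[1] for Wi in blocks]                  # R00, R10, R20, R30
        R1 = [np.linalg.qr(np.vstack(R0[i:i + 2]))[1]                # R01, R11
              for i in range(0, nblocks, 2)]
        return np.linalg.qr(np.vstack(R1))[1]                        # R02

    W = np.random.rand(10000, 50)
    R_tsqr = tsqr_r_factor(W)
    R_ref = np.linalg.qr(W)[1]
    # The two R factors agree up to the signs of their rows.
    assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))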

TSQR: An Architecture-Dependent Algorithm

[Figures omitted; roughly:]
• Parallel: W = [W0; W1; W2; W3] → local QRs give R00, R10, R20, R30 → pairwise combine to R01, R11 → combine to R02
• Sequential: W = [W0; W1; W2; W3] → R00 → fold in W1 to get R01 → fold in W2 to get R02 → fold in W3 to get R03
• Dual Core: each core runs a sequential chain over its half of the blocks, and the two chains of partial R factors are merged (a hybrid of the trees above)

Can choose reduction tree dynamically
• Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?

TSQR Performance Results

• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

48

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

49

W (n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1'; W2'; W3'; W4' ] = [ P12·L12·U12; P34·L34·U34 ]
  Choose b pivot rows, call them W12'
  Choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234
  Choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
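A sketch of tournament pivoting on rows (illustrative; helper names are assumed, and SciPy's dense LU stands in for the local partial pivoting): each block nominates b candidate pivot rows, and pairs of candidate sets play off until b winners remain; those rows would then be moved to the top of W for LU without pivoting.

    import numpy as np
    from scipy.linalg import lu

    def gepp_pivot_rows(rows, W, b):
        """Global indices of the b rows partial pivoting picks within W[rows, :]."""
        P, L, U = lu(W[rows, :])               # W[rows, :] = P @ L @ U
        local = np.argmax(P, axis=0)[:b]       # rows moved into the first b pivot positions
        return [rows[i] for i in local]

    def tournament_pivots(W, b, nblocks=4):
        candidates = [gepp_pivot_rows(blk, W, b)
                      for blk in np.array_split(np.arange(W.shape[0]), nblocks)]
        while len(candidates) > 1:             # pairwise rounds of the tournament
            candidates = [gepp_pivot_rows(candidates[i] + candidates[i + 1], W, b)
                          for i in range(0, len(candidates), 2)]
        return candidates[0]

    W = np.random.rand(400, 8)
    print(tournament_pivots(W, b=8))           # the b pivot rows chosen by the tournament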

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heat map omitted: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); up to 29x faster]

Other CA algorithms

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:    words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

vs.

• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

Communication Avoiding Parallel Strassen (CAPS):

    CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%-184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA-SBR
  – Conventional: touch all data 4 times
  – Communication-Avoiding: touch all data once
  – Many tuning parameters: right choices reduce words_moved by a factor of M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

61

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21

  (here "*" is the min-plus product defined above, which also takes the min with the old value)
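A runnable sketch of the recursion above (illustrative, not the tutorial's code), with the min-plus product taking the elementwise min with the old block, matching the definition of D = A*B given earlier.

    import numpy as np

    def minplus(D, A, B):
        """Elementwise min of D and the min-plus product of A and B."""
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return np.minimum(A, 0.0)
        D, h = A.copy(), n // 2
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    # Check against Floyd-Warshall on a small random graph (inf = no edge).
    n, rng = 8, np.random.default_rng(0)
    A = np.where(rng.random((n, n)) < 0.4, rng.random((n, n)), np.inf)
    np.fill_diagonal(A, 0.0)
    F = A.copy()
    for k in range(n):
        F = np.minimum(F, F[:, [k]] + F[[k], :])
    assert np.allclose(dc_apsp(A), F)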

Performance of 2.5D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Plot omitted; annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω( w^3 / M^(1/2) ), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

63

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable, new one?
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:

    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:

    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)          ... one b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω( n^3 / M^(1/2) )
• Where does 1/2 come from?
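A runnable sketch of the blocked loop nest (illustrative): with b x b blocks, three blocks fit in a fast memory of size M when b ≈ (M/3)^(1/2), which is where the M^(1/2) in the bound comes from.

    import numpy as np

    def blocked_matmul(A, B, b):
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # one b x b matmul on blocks held in fast memory
                    C[i1:i1 + b, j1:j1 + b] += A[i1:i1 + b, k1:k1 + b] @ B[k1:k1 + b, j1:j1 + b]
        return C

    n, b = 96, 16
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, b), A @ B)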

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

              i  j  k
        Δ = [ 1  0  1 ]   A
            [ 0  1  1 ]   B
            [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T·x = 3/2 = e
• Thm: words_moved = Ω( n^3 / M^(e-1) ) = Ω( n^3 / M^(1/2) )
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
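The LP above is small enough to check directly (illustrative, using SciPy): maximizing x_i + x_j + x_k subject to Δ·x ≤ 1 recovers x = [1/2, 1/2, 1/2] and e = 3/2.

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)
    res = linprog(c=[-1, -1, -1], A_ub=Delta, b_ub=[1, 1, 1], bounds=(0, None))
    print(res.x, -res.fun)          # [0.5 0.5 0.5] and e = 1.5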

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

              i  j
        Δ = [ 1  0 ]   F
            [ 1  0 ]   P(i)
            [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [1, 1], 1^T·x = 2 = e
• Thm: words_moved = Ω( n^2 / M^(e-1) ) = Ω( n^2 / M^1 )
  Attained by block sizes M^xi, M^xj = M^1, M^1

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

[Plot omitted; annotation: 11.8x speedup]

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

              i1 i2 i3 i4 i5 i6
        Δ = [ 1  0  1  0  0  1 ]   A1
            [ 1  1  0  1  0  0 ]   A2
            [ 0  1  1  0  1  0 ]   A3
            [ 0  0  1  1  0  1 ]   A3, A4
            [ 0  1  0  0  0  1 ]   A5
            [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = e
• Thm: words_moved = Ω( n^6 / M^(e-1) ) = Ω( n^6 / M^(8/7) )
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3,
      access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k,
      access locations indexed by "projections", e.g.
        φC(i1,i2,…,ik) = (i1+2*i3-i7)
        φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φC(S), φA(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that
    words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej·rank(φj(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
  – Given S a subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
  – Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the rank conditions above,
      |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many?
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A2x, …, Akx]

• Replace k iterations of y = A·x with [Ax, A2x, …, Akx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure omitted: the entries of x, A·x, A2·x, A3·x over indices 1, 2, 3, 4, …, 32, with each level depending on neighboring entries one level down]

• Sequential Algorithm: Steps 1–4 process one block of columns at a time, computing all k levels for that block before moving on, so A and the vectors move between slow and fast memory only once
• Parallel Algorithm: Proc 1–Proc 4 each own a block of columns
  – Each processor communicates once with neighbors
  – Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
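A runnable sketch of the parallel matrix powers kernel, simplified to a tridiagonal A (illustrative; function and variable names are assumed): each "processor" owns a contiguous block of indices, fetches k ghost entries from each neighbor once, and then computes its rows of A·x, …, A^k·x with no further communication.

    import numpy as np

    def matrix_powers_blocked(A, x, k, nblocks):
        """[A@x, ..., A^k@x], computed block-by-block with k-deep ghost zones.
        Assumes A is tridiagonal, so k ghost rows per side are enough."""
        n = len(x)
        V = np.zeros((k, n))
        bounds = np.linspace(0, n, nblocks + 1).astype(int)
        for p in range(nblocks):
            lo, hi = bounds[p], bounds[p + 1]            # rows owned by "processor" p
            elo, ehi = max(0, lo - k), min(n, hi + k)    # ghost-extended range, fetched once
            A_loc, y = A[elo:ehi, elo:ehi], x[elo:ehi].copy()
            for j in range(k):                           # no communication below this line
                y = A_loc @ y
                V[j, lo:hi] = y[lo - elo:hi - elo]       # owned entries are exact at level j
        return V

    n, k = 32, 3
    A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    x = np.random.rand(n)
    V = matrix_powers_blocked(A, x, k, nblocks=4)
    assert all(np.allclose(V[j], np.linalg.matrix_power(A, j + 1) @ x) for j in range(k))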

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax - b ||_2

Standard GMRES:

    for i = 1 to k
      w = A · v(i-1)                  ... SpMV
      MGS(w, v(0), …, v(i-1))         ... Modified Gram-Schmidt
      update v(i), H
    endfor
    solve LSQ problem with H

Communication-avoiding GMRES:

    W = [ v, Av, A2v, …, Akv ]        ... matrix powers kernel
    [Q,R] = TSQR(W)                   ... "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

• Oops – W from power method, precision lost!

92

"Monomial" basis [Ax, …, A^k x] fails to converge
A different polynomial basis [p1(A)x, …, pk(A)x] does converge
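An illustrative numerical aside (not from the tutorial) on why the monomial basis loses precision: for a symmetric A with spectrum inside an assumed interval [lmin, lmax], the columns of [x, Ax, …, A^k x] become nearly dependent, while a shifted-and-scaled Chebyshev basis p_j(A)x stays far better conditioned.

    import numpy as np

    n, k = 200, 12
    A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    lmin, lmax = 0.0, 4.0                    # spectrum of this A lies inside [0, 4]
    x = np.random.rand(n)

    W_mono = [x]                             # monomial basis [x, Ax, A^2 x, ...]
    for _ in range(k):
        W_mono.append(A @ W_mono[-1])

    c, d = (lmax + lmin) / 2.0, (lmax - lmin) / 2.0
    def Bop(v):                              # shifted/scaled A, spectrum mapped into [-1, 1]
        return (A @ v - c * v) / d
    W_cheb = [x, Bop(x)]                     # Chebyshev basis via the three-term recurrence
    for _ in range(k - 1):
        W_cheb.append(2.0 * Bop(W_cheb[-1]) - W_cheb[-2])

    print(np.linalg.cond(np.column_stack(W_mono)))   # large: basis is nearly rank-deficient
    print(np.linalg.cond(np.column_stack(W_cheb)))   # orders of magnitude smaller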

93

Speedups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

With Residual Replacement (RR) a la Van der Vorst and Ye:

                       Naive     Monomial                                 Newton         Chebyshev
  Replacement its.:    74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)     [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: With stencils (all implicit), all communication is for vectors

                                     Nonzero entries
                                     Explicit (O(nnz))       Implicit (o(nnz))
  Indices   Explicit (O(nnz))        CSR and variations      Vision, climate, AMR, …
            Implicit (o(nnz))        Graph Laplacian         Stencils

• Operations
  – [x, Ax, A2x, …, Akx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A2x, …, Akx] and [y, A^T y, (A^T)2 y, …, (A^T)k y], or [y, A^T A y, (A^T A)2 y, …, (A^T A)k y]
  – Return all vectors or just the last one
• Cotuning and/or interleaving
  – W = [x, Ax, A2x, …, Akx] and {TSQR(W) or W^T·W or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative: summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

Page 35: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

25D Matrix Multiplication

bull Assume can fit cn2P data per processor cgt1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K cores

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Limit c le P13 (3D algorithm) if starting with 1 copy of inputs

Perfect Strong Scaling ndash in Time and Energy (22)

bull Perfect scaling extends to N-body Strassen hellipbull We can use these models to answer many questions including

bull What is the minimum energy required for a computationbull Given a maximum allowed runtime T what is the minimum

energy E needed to achieve itbull Given a maximum energy budget E what is the minimum runtime

T that we can attainbull The ratio P = ET gives us the average power required to run the

algorithm Can we minimize the average power consumedbull Given an algorithm problem size number of processors and target

energy efficiency (GFLOPSW) can we determine a set of architectural parameters to describe a conforming computer architecture

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

  – 3-day short course every August
• ~100-page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!

• Floating point addition is nonassociative; summation order is not reproducible (see the small demo after this list)

• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)

• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
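A tiny experiment illustrates the nonassociativity the slide refers to; the exact discrepancies depend on the data, but with a hundred thousand random summands some nonzero difference is the typical outcome (plain NumPy here, not the ReproBLAS itself):

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.standard_normal(10**5)

    s_forward = 0.0
    for xi in x:                    # strict left-to-right summation
        s_forward += xi

    s_reverse = 0.0
    for xi in x[::-1]:              # same numbers, opposite order
        s_reverse += xi

    s_pairwise = float(np.sum(x))   # NumPy's pairwise reduction tree

    print(s_forward - s_reverse)    # usually nonzero: rounding depends on order
    print(s_forward - s_pairwise)   # a different reduction tree, a third answer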

Summary

Don't Communic…


Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 38: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11; SC'11 paper by Solomonik, Bhatele, D.

12x faster

2.7x faster

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^(1/3) (3D algorithm), if starting with 1 copy of inputs
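A minimal sketch (not from the slides) of the timing and energy model above; all parameter values are placeholders, only the formulas follow the model.

```python
# Hypothetical parameters; only the formulas mirror the model above.
def T(P, n, M, m, gamma_T, beta_T, alpha_T):
    """Modeled runtime on P processors, each with memory M, messages of size m."""
    return n**3 / P * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

def E(P, n, M, m, gT, bT, aT, gE, bE, aE, dE, eE):
    """Modeled energy: per-processor flop/word/message terms plus memory and leakage."""
    t = T(P, n, M, m, gT, bT, aT)
    per_proc = n**3 / P * (gE + bE / M**0.5 + aE / (m * M**0.5)) + dE * M * t + eE * t
    return P * per_proc

# Perfect strong scaling: T(c*P) = T(P)/c and E(c*P) = E(P), as long as c <= P**(1/3).
```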

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[ γi + βi/Mi^(1/2) + αi/Mi^(3/2) ] = Fi·ξi
  – Choose Fi so Σi Fi = n^3, minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal Algorithm for n×n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
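A small sketch (assumed per-processor parameters, not from the slides) of the closed-form work split above:

```python
import numpy as np

def heterogeneous_split(n, gamma, beta, alpha, M):
    """Return (F_i, T): flops per processor and the balanced runtime from the slide."""
    gamma, beta, alpha, M = map(np.asarray, (gamma, beta, alpha, M))
    xi = gamma + beta / M**0.5 + alpha / M**1.5   # effective sec/flop of processor i
    inv = 1.0 / xi
    F = n**3 * inv / inv.sum()                    # F_i = n^3 (1/xi_i) / sum_j (1/xi_j)
    T = n**3 / inv.sum()                          # T   = n^3 / sum_j (1/xi_j)
    return F, T
```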

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

C(i,j,k) = Σm A(i,j,m)·B(m,k)

A: 3-fold symm      B: 2-fold symm      C: 2-fold symm

Application to Tensor Contractions (2)

• Same contraction and symmetries as above; in addition:
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
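As a small illustration (not from the slides), the contraction above can be written with numpy's einsum; sizes and tensors are toy placeholders, while real CC tensors are far larger and block-distributed (as in CTF).

```python
import numpy as np

I = J = K = M = N = 8                     # toy sizes
A = np.random.rand(I, J, M, N)
B = np.random.rand(M, N, K)

# C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k)
C = np.einsum('ijmn,mnk->ijk', A, B)

# A d-fold symmetry such as B(m,n,k) = B(n,m,k) means only ~1/d of the entries are
# independent; frameworks like CTF exploit this to save flops and memory.
```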

TSQR: QR of a Tall Skinny matrix

W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ] = Q01·R01 and [ R20 ; R30 ] = Q11·R11, so
[ R00 ; R10 ; R20 ; R30 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

TSQR: QR of a Tall Skinny matrix (2)

Same reduction, with the final factorization collected:

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
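A minimal sequential sketch of the TSQR reduction tree (R factors only; a real implementation also keeps the local Q factors implicitly). The block count and sizes are arbitrary choices, and numpy's qr stands in for the tuned local factorizations.

```python
import numpy as np

def tsqr_R(W, nblocks=4):
    """R factor of a tall-skinny W via a TSQR-style binary reduction tree."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]     # local QR of each block row
    while len(Rs) > 1:                                      # pairwise combine up the tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r') for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.rand(10000, 50)
R_tsqr = tsqr_R(W)
R_ref = np.linalg.qr(W, mode='r')
assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))   # R agrees up to signs of its rows
```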

TSQR: An Architecture-Dependent Algorithm

Parallel:    W = [ W0 ; W1 ; W2 ; W3 ] → local QRs give R00, R10, R20, R30 → pairwise combine into R01, R11 → combine into R02

Sequential:  W = [ W0 ; W1 ; W2 ; W3 ] → R00 from W0, fold in W1 → R01, fold in W2 → R02, fold in W3 → R03

Dual Core:   a hybrid of the two trees above; each core reduces its blocks and the partial R factors are merged as they become available, ending in R03

Can choose reduction tree dynamically

Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?

TSQR Performance Results

• Parallel
  – Intel Clovertown
    • Up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi
    • Up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"

W (n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'

[ W1' ; W2' ] = P12·L12·U12 → choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34 → choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234 → choose b pivot rows

• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but lower order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
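A toy sequential illustration (not the slides' parallel implementation) of one tournament: LU with partial pivoting on each block proposes b candidate pivot rows, and winners are combined pairwise up the tree. Block counts and sizes are assumptions.

```python
import numpy as np
from scipy.linalg import lu

def candidate_rows(W, b):
    """Indices (into W) of the b pivot rows chosen by LU with partial pivoting."""
    P, L, U = lu(W)                        # W = P @ L @ U
    return np.argmax(P, axis=0)[:b]        # column i of P marks the i-th pivot row

def tournament_pivot_rows(W, b, nblocks=4):
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    winners = [blk[candidate_rows(W[blk], b)] for blk in blocks]     # local selections
    while len(winners) > 1:                                          # pairwise playoffs
        merged = []
        for i in range(0, len(winners), 2):
            idx = np.concatenate(winners[i:i+2])
            merged.append(idx[candidate_rows(W[idx], b)])
        winners = merged
    return winners[0]   # b global pivot rows; move them to top, then LU without pivoting
```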

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

(Heat map; horizontal axis log2(p), vertical axis log2(n^2/p) = log2(memory_per_proc).)

Up to 29x faster

Other CA algorithms

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting

Communication Lower Bounds for Strassen-like matmul algorithms

Classical O(n^3) matmul:        words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:     words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:    words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

vs.

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

Communication Avoiding Parallel Strassen (CAPS)

CAPS:
  if EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA - SBR

Conventional: touch all data 4 times.   Communication-Avoiding: touch all data once.
Many tuning parameters; right choices reduce #words_moved by factor M/bw, not just M^(1/2)


Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

  D = DC-APSP(A, n)
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊗ D12
    D21 = D21 ⊗ D11
    D22 = D21 ⊗ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊗ D21
    D12 = D12 ⊗ D22
    D11 = D12 ⊗ D21
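A small sequential sketch of the recursion above, using the ⊗ operation as defined on the slide (min-plus product folded into the destination). The base-case size and dense numpy storage are the example's own choices; D should hold edge weights with 0 on the diagonal and inf where there is no edge.

```python
import numpy as np

def otimes(D, A, B):
    """D = A (x) B as defined above: D(i,j) = min(D(i,j), min_k A(i,k)+B(k,j))."""
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(D):
    n = D.shape[0]
    if n <= 32:                          # small base case: plain Floyd-Warshall
        D = D.copy()
        for k in range(n):
            D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
        return D
    h = n // 2
    D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
    D11 = dc_apsp(D11)
    D12 = otimes(D12, D11, D12)
    D21 = otimes(D21, D21, D11)
    D22 = otimes(D22, D21, D12)
    D22 = dc_apsp(D22)
    D21 = otimes(D21, D22, D21)
    D12 = otimes(D12, D12, D22)
    D11 = otimes(D11, D12, D21)
    return np.block([[D11, D12], [D21, D22]])
```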

Performance of 2.5D APSP using Kleene

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

6.2x speedup

2x speedup

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms
    • LDLT, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)          … b x b matmul on blocks

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?
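A plain-Python sketch of the blocked loop nest above; the block size b stands in for M^(1/2), and numpy is used only for storage and the inner b x b products.

```python
import numpy as np

def blocked_matmul(A, B, b):
    """C += A*B one b x b block at a time; with b ~ sqrt(M/3), each block triple fits in fast memory."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # one b x b matmul: reads/writes ~3*b^2 words, does 2*b^3 flops
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C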

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i   j   k
        A [ 1   0   1 ]
    Δ = B [ 0   1   1 ]
        C [ 1   1   0 ]

• Solve LP for x = [xi, xj, xk]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T·x = 3/2 = e
• Thm: #words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
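The small LP above can be checked directly; a sketch using scipy (linprog minimizes, so the objective is negated):

```python
import numpy as np
from scipy.optimize import linprog

# max 1^T x  s.t.  Delta @ x <= 1, x >= 0   (rows of Delta: index patterns of A, B, C)
Delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)
res = linprog(c=[-1, -1, -1], A_ub=Delta, b_ub=[1, 1, 1], bounds=(0, None))
x = res.x        # -> [0.5, 0.5, 0.5]
e = -res.fun     # -> 1.5, so words_moved = Omega(n^3 / M^(e-1)) = Omega(n^3 / M^(1/2))
```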

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

               i   j
        F    [ 1   0 ]
    Δ = P(i) [ 1   0 ]
        P(j) [ 0   1 ]

• Solve LP for x = [xi, xj]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [1, 1], 1^T·x = 2 = e
• Thm: #words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
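Illustrative only: a tiled version of the loop above with a toy scalar force (a placeholder, not a physical kernel). Each j-tile of ~b values is loaded once and reused against a whole i-tile, which is the M-fold reuse the bound allows when b ~ M.

```python
import numpy as np

def tiled_forces(P, b, force=lambda pi, pj: pj - pi):   # toy force, placeholder
    n = len(P)
    F = np.zeros(n)
    for i0 in range(0, n, b):          # F(i), P(i) tile stays in fast memory
        for j0 in range(0, n, b):      # stream P(j) tiles past it
            Pi = P[i0:i0+b, None]      # shape (bi, 1)
            Pj = P[None, j0:j0+b]      # shape (1, bj)
            F[i0:i0+b] += force(Pi, Pj).sum(axis=1)
    return F
```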

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

                i1  i2  i3  i4  i5  i6
      A1      [  1   0   1   0   0   1 ]
      A2      [  1   1   0   1   0   0 ]
  Δ = A3      [  0   1   1   0   1   0 ]
      A3, A4  [  0   0   1   1   0   1 ]
      A5      [  0   1   0   0   0   1 ]
      A6      [  1   0   0   1   1   0 ]

• Solve LP for x = [x1, …, x6]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = e
• Thm: #words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3
    Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else), …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k
    Access locations indexed by "projections", e.g.
      φC(i1,i2,…,ik) = (i1+2*i3-i7)
      φA(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φC(S), φA(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that
    #words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σj ej  subject to
    rank(H) ≤ Σj ej·rank(φj(H))  for all subgroups H < Z^k
• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S ⊆ Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:
    |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has same solution, is decidable (but expensive so far)
  – Thm (better news): Easy to write LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain lower bound – can we prove this?
• Incorporate into compilers

Outline

• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k-steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

(Figure: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32; each entry of the next power depends only on its neighbors in the previous one.)

• Sequential Algorithm: Steps 1, 2, 3, 4 – process one block of columns at a time, computing all k powers for that block (plus the needed boundary entries) before moving on, so the matrix and vectors move between slow and fast memory O(1) times instead of O(k).

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a block of columns; each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of the computation.

Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
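A toy sequential sketch of the kernel for the tridiagonal example, with k ghost entries of overlap on each side of every block (enough for a matrix with one off-diagonal); the block width and the scipy sparse storage are the example's own choices.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers_tridiag(A, x, k, block=8):
    """Compute [A@x, A^2@x, ..., A^k@x], touching A and x one column block at a time."""
    n = len(x)
    out = np.zeros((k, n))
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        glo, ghi = max(lo - k, 0), min(hi + k, n)     # block plus k ghost entries per side
        Aloc = A[glo:ghi, glo:ghi]                    # local piece of A (read once)
        y = x[glo:ghi].copy()
        for j in range(k):
            y = Aloc @ y                              # entries in [lo, hi) are exact for j < k
            out[j, lo:hi] = y[lo - glo:hi - glo]
    return out

n, k = 32, 3                                          # as in the slides
A = sp.diags([np.ones(n-1), 2*np.ones(n), np.ones(n-1)], [-1, 0, 1], format='csr')
x = np.random.rand(n)
V = matrix_powers_tridiag(A, x, k)
assert np.allclose(V[2], A @ (A @ (A @ x)))
```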

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^kb} minimizing || Ax-b ||2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                  … SpMV
    MGS(w, v(0), …, v(i-1))         … Modified Gram-Schmidt: update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                   … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge
Different polynomial basis [p1(A)x, …, pk(A)x] does converge
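A quick numerical illustration (not from the slides) of why the monomial basis loses precision: the condition number of W = [v, Av, …, A^kv] grows rapidly with k, so an orthogonal basis cannot be recovered accurately from it. The diagonal model matrix and its eigenvalue range are assumptions of this example.

```python
import numpy as np

n, kmax = 200, 12
rng = np.random.default_rng(0)
lam = np.linspace(0.01, 1.0, n)       # eigenvalues of a model SPD matrix
A = np.diag(lam)                      # diagonal stands in for a sparse SPD A
v = rng.standard_normal(n)

W = [v / np.linalg.norm(v)]
for _ in range(kmax):
    W.append(A @ W[-1])               # monomial (power-method) basis
W = np.column_stack(W)
print(np.linalg.cond(W))              # typically enormous for modest k
# A Newton or Chebyshev basis p_j(A)v, built from eigenvalue estimates, stays far
# better conditioned, which is what makes CA-GMRES converge.
```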

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab

                   Naive     Monomial                                Newton         Chebyshev
Replacement Its.   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)   68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

                                    Nonzero entries
Indices                   Explicit (O(nnz))       Implicit (o(nnz))
Explicit (O(nnz))         CSR and variations      Vision, climate, AMR, …
Implicit (o(nnz))         Graph Laplacian         Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
• www.xsede.org
  – Free supercomputer accounts to do homework
  – University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
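A tiny demonstration (not part of ReproBLAS) that summation order changes the floating point result, which is exactly what a reproducible BLAS has to control for:

```python
import math
import random

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(100000)]

s_forward = sum(xs)
s_backward = sum(reversed(xs))
print(s_forward == s_backward)        # often False: rounding depends on the order
print(abs(s_forward - s_backward))    # small but nonzero difference

# math.fsum is exactly rounded, hence independent of summation order:
print(math.fsum(xs) == math.fsum(reversed(xs)))   # True
```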

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 40: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Perfect Strong Scaling – in Time and Energy (2/2)

• Perfect scaling extends to N-body, Strassen, …
• We can use these models to answer many questions, including:
  – What is the minimum energy required for a computation?
  – Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  – The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  – Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so that Σi Fi = n^3 and T = maxi Ti is minimized
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)   (sketch below)
• Optimal algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i so they add up to Fi flops
• Works for Strassen, other algorithms…
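A minimal sketch of the load-balancing rule above (Python; the machine parameters below are made-up illustrative values, not measurements):

    # xi_i = gamma_i + beta_i/M_i**0.5 + alpha_i/M_i**1.5   (amortized sec/flop on proc i)
    # F_i  = n**3 * (1/xi_i) / sum_j(1/xi_j),   T = n**3 / sum_j(1/xi_j)
    def assign_work(n, gammas, betas, alphas, mems):
        xis = [g + b / m**0.5 + a / m**1.5
               for g, b, a, m in zip(gammas, betas, alphas, mems)]
        inv = [1.0 / xi for xi in xis]
        F = [n**3 * w / sum(inv) for w in inv]   # flops assigned to each processor
        T = n**3 / sum(inv)                      # resulting balanced runtime
        return F, T

    # Example: two fast processors with small memories, one slow processor with a large memory.
    F, T = assign_work(n=1024,
                       gammas=[1e-11, 1e-11, 4e-11],
                       betas=[1e-9, 1e-9, 2e-9],
                       alphas=[1e-6, 1e-6, 1e-6],
                       mems=[1e6, 1e6, 1e8])
    print([round(f) for f in F], T)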

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

C(i,j,k) = Σm A(i,j,m)·B(m,k)

[Figure: A has 3-fold symmetry, B has 2-fold symmetry, C has 2-fold symmetry]

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)   (see the sketch below)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger equation
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
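For reference, the contraction in the first bullet written out with numpy (this only shows the arithmetic being contracted; CTF distributes the same computation and exploits the symmetries, which plain einsum does not):

    import numpy as np

    # C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k); the dimensions are arbitrary toy sizes.
    ni, nj, nk, nm, nn = 4, 5, 6, 3, 2
    A = np.random.rand(ni, nj, nm, nn)
    B = np.random.rand(nm, nn, nk)
    C = np.einsum('ijmn,mnk->ijk', A, B)
    print(C.shape)   # (4, 5, 6)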

TSQR: QR of a Tall Skinny matrix

45

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]

[ R00; R10 ] = Q01·R01,    [ R20; R30 ] = Q11·R11

[ R01; R11 ] = Q02·R02

TSQR: QR of a Tall Skinny matrix

46

W = [ W0; W1; W2; W3 ] = [ Q00·R00; Q10·R10; Q20·R20; Q30·R30 ]

[ R00; R10 ] = Q01·R01,    [ R20; R30 ] = Q11·R11

[ R01; R11 ] = Q02·R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02
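A sequential numpy model of this 4-block binary reduction tree, just to make the data flow concrete (it only tracks the R factors, as in the figure; a real TSQR would also keep the implicit Q factors and run the leaves on different processors):

    import numpy as np

    def tsqr_4blocks(W):
        # Level 0: local QR of each row-block (only the R factors are kept here)
        W0, W1, W2, W3 = np.array_split(W, 4, axis=0)
        R00 = np.linalg.qr(W0, mode='r')
        R10 = np.linalg.qr(W1, mode='r')
        R20 = np.linalg.qr(W2, mode='r')
        R30 = np.linalg.qr(W3, mode='r')
        # Level 1: QR of stacked pairs of R factors
        R01 = np.linalg.qr(np.vstack([R00, R10]), mode='r')
        R11 = np.linalg.qr(np.vstack([R20, R30]), mode='r')
        # Level 2: one more small QR gives the R factor of all of W
        return np.linalg.qr(np.vstack([R01, R11]), mode='r')

    W = np.random.rand(1000, 10)                 # tall and skinny
    R_tsqr = np.abs(tsqr_4blocks(W))             # R is unique only up to row signs
    R_ref  = np.abs(np.linalg.qr(W, mode='r'))
    print(np.allclose(R_tsqr, R_ref))            # True (up to roundoff)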

TSQR: An Architecture-Dependent Algorithm

Parallel (binary reduction tree):
  W = [ W0; W1; W2; W3 ] → { R00, R10, R20, R30 } → { R01, R11 } → R02

Sequential (flat reduction tree):
  W = [ W0; W1; W2; W3 ] → R00 → R01 → R02 → R03

Dual Core (hybrid tree): local reductions on each core, combined into R03

Can choose reduction tree dynamically
(Multicore, Multisocket, Multirack, Multisite, Out-of-core)

TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C 2050 / Fermi: up to 13x (110,592 x 100)
  – Grid – 4x on 4 cities (Dongarra et al)
  – Cloud – ~2 map-reduces (Gleich and Benson)
• Sequential
  – “Infinite speedup” for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
48

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: use reduction tree to do “Tournament Pivoting”

49

W (n x b) = [ W1; W2; W3; W4 ],  with  Wi = Pi·Li·Ui
Choose b pivot rows of each Wi, call them Wi′

[ W1′; W2′ ] = P12·L12·U12  →  choose b pivot rows, call them W12′
[ W3′; W4′ ] = P34·L34·U34  →  choose b pivot rows, call them W34′

[ W12′; W34′ ] = P1234·L1234·U1234  →  choose b pivot rows

• Go back to W and use these b pivot rows   (toy sketch below)
• Move them to top, do LU without pivoting
• Extra work, but lower order term

• Thm: As numerically stable as Partial Pivoting on a larger matrix
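A toy sketch of the selection step (Python/numpy; pivot_rows and tournament are hypothetical helper names, and for brevity the four blocks' winners are merged in one final round rather than pairwise as in the figure above):

    import numpy as np

    def pivot_rows(W, b):
        # Indices of b pivot rows of W, chosen by LU with partial pivoting.
        A = W.astype(float)
        rows = list(range(A.shape[0]))
        for k in range(b):
            p = k + int(np.argmax(np.abs(A[k:, k])))   # partial pivoting on column k
            A[[k, p]] = A[[p, k]]
            rows[k], rows[p] = rows[p], rows[k]
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return rows[:b]

    def tournament(W, blocks, b):
        # Each block proposes b pivot rows; then pick b pivot rows of the stacked winners.
        cand = [lo + r for lo, hi in blocks for r in pivot_rows(W[lo:hi], b)]
        return [cand[r] for r in pivot_rows(W[cand], b)]

    W = np.random.rand(64, 4)
    print(tournament(W, blocks=[(0, 16), (16, 32), (32, 48), (48, 64)], b=4))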

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Plot: horizontal axis log2(p), vertical axis log2(n^2/p) = log2(memory_per_proc)]

Up to 29x faster

Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach “reveals the rank” of A in the sense that the leading r x r submatrix of R has singular values “near” the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
54

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be “regular” and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:    words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

vs

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  if EnoughMemory and P ≥ 7 then
    BFS step
  else
    DFS step
  end if

Communication Avoiding Parallel Strassen (CAPS)

In practice, how best to interleave BFS and DFS is a “tuning parameter”

57

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA - SBR

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.

Many tuning parameters; right choices reduce words_moved by a factor of M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 19x
  – Best sequential speedup vs ACML: 85x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then run the triple loop below
• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), mink( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies OK, 2.5D works, just a different semiring
• Kleene's Algorithm:

61

Floyd-Warshall:
  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

DC-APSP (D = DC-APSP(A, n)):
  D = A
  Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
  D11 = DC-APSP(D11, n/2)
  D12 = D11 ⊗ D12
  D21 = D21 ⊗ D11
  D22 = D21 ⊗ D12
  D22 = DC-APSP(D22, n/2)
  D21 = D22 ⊗ D21
  D12 = D12 ⊗ D22
  D11 = D12 ⊗ D21
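A runnable, deliberately tiny version of the two algorithms above (Python/numpy), with ⊗ spelled out as a min-plus product; a real 2.5D implementation would distribute exactly these block multiplies:

    import numpy as np

    def minplus(A, B):
        # C(i,j) = min_k A(i,k) + B(k,j): "matmul" in the (min, +) semiring.
        return np.min(A[:, :, None] + B[None, :, :], axis=1)

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D11, D12 = D[:h, :h], D[:h, h:]
        D21, D22 = D[h:, :h], D[h:, h:]
        D11 = dc_apsp(D11)
        D12 = np.minimum(D12, minplus(D11, D12))
        D21 = np.minimum(D21, minplus(D21, D11))
        D22 = np.minimum(D22, minplus(D21, D12))
        D22 = dc_apsp(D22)
        D21 = np.minimum(D21, minplus(D22, D21))
        D12 = np.minimum(D12, minplus(D12, D22))
        D11 = np.minimum(D11, minplus(D12, D21))
        return np.block([[D11, D12], [D21, D22]])

    INF = 1e9
    A = np.array([[0, 3, INF, 7],
                  [8, 0, 2, INF],
                  [5, INF, 0, 1],
                  [2, INF, INF, 0]], dtype=float)
    print(dc_apsp(A))   # e.g. entry (0,3) becomes 6.0 via the path 0 -> 1 -> 2 -> 3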

Performance of 2.5D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Plot annotations: 6.2x speedup; 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
63

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• “Direct” Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
    for i = 1:n
      for j = 1:n
        for k = 1:n
          C(i,j) += A(i,k) * B(k,j)

• “Blocked” code:
    for i1 = 1:b:n
     for j1 = 1:b:n
      for k1 = 1:b:n
       for i2 = 0:b-1
        for j2 = 0:b-1
         for k2 = 0:b-1
          i = i1+i2;  j = j1+j2;  k = k1+k2
          C(i,j) += A(i,k) * B(k,j)

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does the 1/2 come from?   (runnable version below)

b x b matmul
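The blocked loop nest above written out as runnable Python, mirroring the pseudocode (pure loops on lists; b is assumed to divide n):

    def blocked_matmul(A, B, C, n, b):
        # One b x b block of C at a time; with b ~ M**0.5 the three blocks of
        # A, B and C involved at any moment fit in fast memory together.
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    for i in range(i1, i1 + b):
                        for j in range(j1, j1 + b):
                            s = C[i][j]
                            for k in range(k1, k1 + b):
                                s += A[i][k] * B[k][j]
                            C[i][j] = s
        return C

    n, b = 8, 4
    A = [[1.0] * n for _ in range(n)]
    B = [[2.0] * n for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    blocked_matmul(A, B, C, n, b)
    print(C[0][0])   # 16.0 = sum over k of 1.0 * 2.0 for k = 0..7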

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

          i  j  k
    Δ = [ 1  0  1 ]   A
        [ 0  1  1 ]   B
        [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e   (checked below)
• Thm: words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2)), attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
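The little LP can be checked mechanically (Python/scipy; Δ and the objective are exactly the ones above, negated because linprog minimizes):

    import numpy as np
    from scipy.optimize import linprog

    Delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)
    res = linprog(c=[-1, -1, -1], A_ub=Delta, b_ub=[1, 1, 1],
                  bounds=[(0, None)] * 3)
    print(res.x, -res.fun)   # [0.5 0.5 0.5]  1.5, i.e. e = 3/2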

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

          i  j
    Δ = [ 1  0 ]   F
        [ 1  0 ]   P(i)
        [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1), attained by block sizes M^xi, M^xj = M^1, M^1
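The corresponding blocked n-body loop (a sketch: force() is a stand-in for a real kernel, and the tile size plays the role of M^1 above):

    def blocked_forces(P, F, force, M):
        # Tiles of M particles in both i and j: each pair of tiles does M*M
        # interactions for O(M) words moved, matching the M^1 block sizes above.
        n = len(P)
        for i0 in range(0, n, M):
            for j0 in range(0, n, M):
                for i in range(i0, min(i0 + M, n)):
                    for j in range(j0, min(j0 + M, n)):
                        F[i] += force(P[i], P[j])
        return F

    P = [float(i) for i in range(16)]
    F = [0.0] * 16
    blocked_forces(P, F, force=lambda a, b: b - a, M=4)
    print(F[0])   # 120.0 = sum over j of (P[j] - P[0])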

N-Body Speedups on IBM-BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

          i1 i2 i3 i4 i5 i6
    Δ = [ 1  0  1  0  0  1 ]   A1
        [ 1  1  0  1  0  0 ]   A2
        [ 0  1  1  0  1  0 ]   A3
        [ 0  0  1  1  0  1 ]   A3, A4
        [ 0  1  0  0  0  1 ]   A5
        [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x s.t. Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7)), attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3
    Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else)
      …
  ⇒ for (i1,i2,…,ik) in S = subset of Z^k
    Access locations indexed by “projections”, e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
      φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that
    words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σj ej subject to
    rank(H) ≤ Σj ej·rank(φj(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S = subset of Z^k, group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:
    |S| ≤ Πj |φj(S)|^sj
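In LaTeX form, the inequality being used is the one below; the matmul specialization (added here only as a sanity check, with s_A = s_B = s_C = 1/2) is the classical Loomis–Whitney inequality and reproduces the Ω(n^3/M^(1/2)) bound:

    % Discrete HBL inequality (Christ/Tao/Carbery/Bennett): if
    %   rank(H) <= \sum_j s_j \, rank(\phi_j(H))  for all subgroups H \le \mathbb{Z}^k,
    % then
    |S| \;\le\; \prod_j |\phi_j(S)|^{\,s_j}.

    % Matmul: \phi_A(i,j,k) = (i,k), \phi_B(i,j,k) = (k,j), \phi_C(i,j,k) = (i,j),
    % and s_A = s_B = s_C = 1/2 satisfy the rank condition, giving Loomis--Whitney:
    %   |S| \le |\phi_A(S)|^{1/2} |\phi_B(S)|^{1/2} |\phi_C(S)|^{1/2} = O(M^{3/2})
    % multiply-adds per O(M) words moved.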

Is this bound attainable? (1/2)

• But first: can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution, and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: Linear algebra, n-body, “random code”, join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend “perfect scaling” results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• “Direct” Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such “Krylov Subspace Methods”
• Goal: minimize communication
  – Assume matrix “well-partitioned”
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2 x, …, A^k x]

• Replace k iterations of y = A·x with [Ax, A^2 x, …, A^k x]
• Example: A tridiagonal, n=32, k=3
• Works for any “well-partitioned” A

[Figure: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32]

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2 x, …, A^k x]

• Replace k iterations of y = A·x with [Ax, A^2 x, …, A^k x]
• Sequential Algorithm
• Example: A tridiagonal, n=32, k=3

[Figure: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32; blocks processed left to right in Steps 1, 2, 3, 4]

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2 x, …, A^k x]

• Replace k iterations of y = A·x with [Ax, A^2 x, …, A^k x]
• Parallel Algorithm
• Example: A tridiagonal, n=32, k=3
• Each processor communicates once with neighbors
• Each processor works on an (overlapping) trapezoid

[Figure: rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32, partitioned among Proc 1, Proc 2, Proc 3, Proc 4]

Same idea works for general sparse matrices

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2 x, …, A^k x]

• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem
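A sketch of the parallel idea for the tridiagonal example (Python/numpy; the particular stencil tridiag(-1, 2, -1) and the helper names are just for illustration): each processor reads its block of x plus k ghost entries on each side once, then computes its rows of Ax, A^2 x, A^3 x with no further communication, redundantly recomputing the overlap.

    import numpy as np

    def apply_tridiag(v):
        # y = A v for A = tridiag(-1, 2, -1), treating entries beyond the ends as zero.
        y = 2.0 * v
        y[1:]  -= v[:-1]
        y[:-1] -= v[1:]
        return y

    def local_matrix_powers(x, lo, hi, k, apply_A):
        # Rows lo:hi of A^j x for j = 1..k, from a single widened read of x.
        g_lo, g_hi = max(lo - k, 0), min(hi + k, len(x))   # the one "communication"
        v = x[g_lo:g_hi].copy()
        out = []
        for _ in range(k):
            v = apply_A(v)                     # polluted boundary entries are discarded
            out.append(v[lo - g_lo: hi - g_lo].copy())
        return out

    n, k = 32, 3
    x = np.random.rand(n)
    loc = local_matrix_powers(x, 8, 16, k, apply_tridiag)   # "Proc 2" owns rows 8..15
    ref = x.copy()
    for _ in range(k):
        ref = apply_tridiag(ref)
    print(np.allclose(loc[-1], ref[8:16]))     # True: exact despite no further communication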

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2 v, …, A^k v ]
  [Q, R] = TSQR(W)                 … “Tall Skinny QR”
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!
92


Page 42: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
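
For readers following along in code, here is a minimal sketch (not CTF) of the contraction itself in numpy; the array names and sizes are illustrative placeholders, and no symmetry or data distribution is exploited – that is exactly what CTF adds on top of this.

  # Minimal sketch: C(i,j,k) = sum_{m,n} A(i,j,m,n)*B(m,n,k) via numpy's einsum.
  # Dense contraction only; symmetry savings and 2.5D distribution are not modeled.
  import numpy as np

  ni, nj, nk, nm, nn = 8, 8, 8, 6, 6          # illustrative sizes
  A = np.random.rand(ni, nj, nm, nn)
  B = np.random.rand(nm, nn, nk)

  C = np.einsum('ijmn,mnk->ijk', A, B)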

TSQR: QR of a Tall, Skinny matrix

45-46

W = [ W0; W1; W2; W3 ]  (block rows)

Step 1: factor each block:       Wi = Qi0·Ri0, i = 0…3,
        so W = diag(Q00, Q10, Q20, Q30) · [ R00; R10; R20; R30 ]
Step 2: stack pairs and factor:  [ R00; R10 ] = Q01·R01,   [ R20; R30 ] = Q11·R11,
        i.e. [ R00; R10; R20; R30 ] = diag(Q01, Q11) · [ R01; R11 ]
Step 3: one more level:          [ R01; R11 ] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 };  R02 is the R factor of W
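
A minimal Python/numpy sketch of this reduction, run serially: each leaf QR would live on a different processor and each tree level would be one round of messages. The function name tsqr_r and the block count are illustrative, and np.linalg.qr stands in for the tuned local factorization.

  import numpy as np

  def tsqr_r(W, nblocks=4):
      """Return the R factor of the tall-skinny matrix W via a binary reduction tree."""
      blocks = np.array_split(W, nblocks, axis=0)
      Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]      # leaf QRs (one per "processor")
      while len(Rs) > 1:                                      # combine pairwise up the tree
          Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
                for i in range(0, len(Rs), 2)]
      return Rs[0]

  W = np.random.rand(10000, 10)
  R = tsqr_r(W)
  # Agrees with a direct QR of W up to row signs:
  print(np.allclose(np.abs(R), np.abs(np.linalg.qr(W, mode='r'))))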

TSQR: An Architecture-Dependent Algorithm

[Figure: three reduction trees over W = [W0; W1; W2; W3]]
  Parallel:   binary tree – (R00, R10, R20, R30) → (R01, R11) → R02
  Sequential: flat tree – R00 → R01 → R02 → R03
  Dual Core:  hybrid of the two

Can choose reduction tree dynamically
  Multicore / Multisocket / Multirack / Multisite / Out-of-core

TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 core, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al)
  – Cloud: ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

48

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

49

W (n x b) = [ W1; W2; W3; W4 ] = [ P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1'; W2' ] = P12·L12·U12   – choose b pivot rows, call them W12'
[ W3'; W4' ] = P34·L34·U34   – choose b pivot rows, call them W34'

[ W12'; W34' ] = P1234·L1234·U1234   – choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to top, do LU without pivoting
• Extra work, but lower-order term

• Thm: As numerically stable as Partial Pivoting on a larger matrix
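
A minimal serial sketch of the tournament, not the CALU code referenced here: each block nominates b candidate pivot rows via ordinary LU with partial pivoting (scipy.linalg.lu stands in for the local factorization), and the winners play off pairwise up the tree. Function names and the block count are illustrative.

  import numpy as np
  from scipy.linalg import lu

  def candidate_rows(block_rows, b):
      """Indices (into block_rows) of the b pivot rows chosen by GEPP on this block."""
      P, L, U = lu(block_rows)                 # block_rows = P @ L @ U
      perm = np.argmax(P, axis=0)              # row placed in pivot position k is perm[k]
      return perm[:b]

  def tournament_pivots(W, b, nblocks=4):
      """Global indices of b pivot rows of W selected by a binary-tree tournament."""
      groups = np.array_split(np.arange(W.shape[0]), nblocks)
      cands = [g[candidate_rows(W[g], b)] for g in groups]     # leaves
      while len(cands) > 1:                                    # pairwise play-offs
          nxt = []
          for i in range(0, len(cands), 2):
              merged = np.concatenate(cands[i:i+2])
              nxt.append(merged[candidate_rows(W[merged], b)])
          cands = nxt
      return cands[0]

  W = np.random.rand(1000, 8)
  pivots = tournament_pivots(W, b=8)
  print(pivots)   # the b rows that would be moved to the top before unpivoted LU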

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters
Source: DOE Exascale Workshop

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
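
As a rough illustration only, the tutorial's running-time model (time = #flops·time_per_flop + #words/bandwidth + #messages·latency) can be evaluated with the interconnect numbers above; the flop rate and the example operation counts below are made-up placeholders, used just to show how the three terms compare.

  # Sketch of the alpha-beta-gamma cost model with the exascale parameters above.
  gamma = 1e-12        # assumed time per flop on a node (placeholder)
  beta  = 8 / 100e9    # seconds per 8-byte word at 100 GB/sec interconnect bandwidth
  alpha = 1e-6         # interconnect latency: 1 microsec

  def model_time(flops, words, messages):
      return flops * gamma + words * beta + messages * alpha

  # e.g. a phase doing 1e8 flops while exchanging 1e6 words in 1e3 messages:
  print(model_time(1e8, 1e6, 1e3))   # the message-latency term dominates here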

Exascale predicted speedups for Gaussian Elimination:
2D CA-LU vs ScaLAPACK-LU

[Contour plot: horizontal axis log2(p), vertical axis log2(n^2/p) = log2(memory_per_proc)]

Up to 29x faster

Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A –
    Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of tournament selects best b columns from two groups of b columns,
      either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r
      submatrix of R has singular values "near" the largest r singular values of A;
      ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting

54

Communication Lower Bounds for Strassen-like matmul algorithms

Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:    words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)

BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
                         vs.
DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS:
  if EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if

In practice, how to best interleave BFS and DFS is a "tuning parameter"

57

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%-184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy communication lower bound theorem for applying
  orthogonal transformations
  – It can communicate even less

Conventional vs CA-SBR

Conventional: touch all data 4 times.   Communication-Avoiding: touch all data once.

Many tuning parameters; right choices reduce words_moved by a factor of M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k)+B(k,j) ) ) by D = A*B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

61

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
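
A minimal serial numpy sketch of the min-plus ("tropical") product and this recursion follows. Two np.minimum calls keep previously known paths in the D22 and D11 updates, which the slide's compact semiring notation leaves implicit; the 2.5D version is about how this same recursion is laid out across processors, which is not modeled here.

  import numpy as np

  def minplus(A, B):
      """C(i,j) = min_k A(i,k) + B(k,j)."""
      return np.min(A[:, :, None] + B[None, :, :], axis=1)

  def dc_apsp(D):
      n = D.shape[0]
      if n == 1:
          return D
      h = n // 2
      D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
      D11 = dc_apsp(D11)
      D12 = minplus(D11, D12)
      D21 = minplus(D21, D11)
      D22 = np.minimum(D22, minplus(D21, D12))
      D22 = dc_apsp(D22)
      D21 = minplus(D22, D21)
      D12 = minplus(D12, D22)
      D11 = np.minimum(D11, minplus(D12, D21))
      return np.block([[D11, D12], [D21, D22]])

  INF = np.inf                      # adjacency matrix: 0 diagonal, inf = no edge
  A = np.array([[0, 3, INF, 7],
                [8, 0, 2, INF],
                [5, INF, 0, 1],
                [2, INF, INF, 0]], dtype=float)
  print(dc_apsp(A))                 # all-pairs shortest path distances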

Performance of 2.5D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

[Plot annotations: 62x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w
  vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid,
  similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

63

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to
  processors is sparsity-pattern-independent (but zero entries need not be
  communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for
  Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions
  most likely to minimize cost

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
    for i = 1:n
      for j = 1:n
        for k = 1:n
          C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n
      for j1 = 1:b:n
        for k1 = 1:b:n
          for i2 = 0:b-1
            for j2 = 0:b-1
              for k2 = 0:b-1
                i = i1+i2;  j = j1+j2;  k = k1+k2
                C(i,j) += A(i,k)*B(k,j)        … b x b matmul

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?
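
A minimal Python/numpy sketch of the blocked loop nest above, with the block size chosen as b ≈ sqrt(M/3) for a hypothetical fast-memory size M in words (illustrative only; real codes call tuned BLAS for the tile update).

  import numpy as np

  def blocked_matmul(A, B, M=3 * 64 * 64):
      n = A.shape[0]
      b = max(1, int((M / 3) ** 0.5))      # three b x b tiles (A, B, C) must fit in M words
      C = np.zeros((n, n))
      for i1 in range(0, n, b):
          for j1 in range(0, n, b):
              for k1 in range(0, n, b):
                  # one b x b (or smaller, at the edges) tile update
                  C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
      return C

  n = 256
  A, B = np.random.rand(n, n), np.random.rand(n, n)
  print(np.allclose(blocked_matmul(A, B), A @ B))   # True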

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n: C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

             i  j  k
           [ 1  0  1 ]   A
    Δ   =  [ 0  1  1 ]   B
           [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T,  1^T·x = 3/2 = e
• Thm: words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
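
The LP above is small enough to solve directly; here is a sketch with scipy.optimize.linprog (linprog minimizes, so the objective is negated). It reproduces e = 3/2 for matmul; swapping in the Δ of the n-body or "random code" slides should give e = 2 or e = 15/7.

  import numpy as np
  from scipy.optimize import linprog

  delta = np.array([[1, 0, 1],    # A(i,k)
                    [0, 1, 1],    # B(k,j)
                    [1, 1, 0]])   # C(i,j)

  # maximize 1^T x  s.t.  delta @ x <= 1,  x >= 0
  res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=(0, None))
  print(res.x, -res.fun)          # ~[0.5 0.5 0.5]  1.5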

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n: F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

             i  j
           [ 1  0 ]   F
    Δ   =  [ 1  0 ]   P(i)
           [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [1, 1],  1^T·x = 2 = e
• Thm: words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
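
A minimal sketch of the blocked direct n-body loop that the bound above points to: process particles in tiles of size b ≈ M so that a tile of P(i) and a tile of P(j) fit in fast memory and each loaded tile is reused for b^2 interactions. The force() here is a placeholder pairwise kernel, not a physical model.

  import numpy as np

  def force(pi, pj):
      d = pj - pi
      return d / (np.abs(d) ** 3 + 1e-9)          # placeholder 1D "force"

  def blocked_nbody(P, M=1024):
      n = len(P)
      b = min(n, M)                               # tile size ~ fast-memory size
      F = np.zeros(n)
      for i1 in range(0, n, b):
          for j1 in range(0, n, b):               # one b x b tile of particle pairs
              Pi = P[i1:i1+b]
              Pj = P[j1:j1+b]
              F[i1:i1+b] += force(Pi[:, None], Pj[None, :]).sum(axis=1)
      return F

  P = np.random.rand(4096)
  F = blocked_nbody(P)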

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

[Plot annotation: 11.8x speedup]

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

             i1 i2 i3 i4 i5 i6
           [ 1  0  1  0  0  1 ]   A1
           [ 1  1  0  1  0  0 ]   A2
    Δ   =  [ 0  1  1  0  1  0 ]   A3
           [ 0  0  1  1  0  1 ]   A3, A4
           [ 0  0  1  1  0  1 ]   A5
           [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7],  1^T·x = 15/7 = e
• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
       Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Z^k
       Access locations indexed by "projections", e.g.
         φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
         φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its
  images φ_C(S), φ_A(S), … ?

General Communication Bound

• Thm: Given a program with array refs given by projections φj, then there is an
  e ≥ 1 such that
      words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
      minimize  e = Σj ej
      subject to  rank(H) ≤ Σj ej·rank(φj(H))  for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
  – Given S subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of
    |φ1(S)|, |φ2(S)|, …, |φm(S)|
  – Thm (Christ/Tao/Carbery/Bennett): Given s1,…,sm:
        |S| ≤ Πj |φj(S)|^sj

Is this bound attainable? (1/2)

• But first: can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many
  – Thm (bad news): Writing down all inequalities in LP reduces to Hilbert's 10th
    problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution, is decidable (but expensive so far)
  – Thm (better news): Easy to write LP down explicitly in many cases of interest
    (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP
  gives optimal tile sizes: M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

78

Communication Avoiding Kernels:
The Matrix Powers Kernel: [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
  [Figure, shown over several animation builds: rows x, A·x, A^2·x, A^3·x over
   columns 1, 2, 3, 4, …, 32, highlighting which entries each computed entry depends on]
• Works for any "well-partitioned" A
• Sequential Algorithm: Steps 1, 2, 3, 4 – march across the matrix one block of
  columns at a time, computing all k levels for that block before moving on
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  – Each processor communicates once with neighbors
  – Each processor works on (overlapping) trapezoid
• Same idea works for general sparse matrices:
  Simple block-row partitioning → (hyper)graph partitioning
  Top-to-bottom processing → Traveling Salesman Problem
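
A minimal serial emulation of the parallel idea for the tridiagonal example: each "processor" owns a contiguous block of indices, fetches k ghost entries of x from each neighbor once, and then computes its pieces of A·x, …, A^k·x with no further communication, redundantly recomputing values in the overlap. Block count, k and the constant-coefficient operator are illustrative.

  import numpy as np

  def local_powers(x_ext, coeffs, k):
      """Apply a constant-coefficient tridiagonal operator k times to a ghosted block."""
      lo, d, up = coeffs
      vecs = [x_ext]
      for _ in range(k):
          v = vecs[-1]
          w = d * v
          w[1:] += lo * v[:-1]       # subdiagonal contribution
          w[:-1] += up * v[1:]       # superdiagonal contribution
          vecs.append(w)
      return vecs

  n, k, p = 32, 3, 4
  x = np.random.rand(n)
  lo, d, up = -1.0, 2.0, -1.0        # like a 1D Laplacian stencil
  blocks = np.array_split(np.arange(n), p)

  Akx = np.zeros((k + 1, n))
  for blk in blocks:
      s, e = blk[0], blk[-1] + 1
      gs, ge = max(0, s - k), min(n, e + k)        # k ghost entries on each side
      vecs = local_powers(x[gs:ge].copy(), (lo, d, up), k)
      for j in range(k + 1):
          Akx[j, s:e] = vecs[j][s - gs:e - gs]     # keep only the owned part

  # check against an explicit matrix power
  A = np.diag(np.full(n, d)) + np.diag(np.full(n-1, lo), -1) + np.diag(np.full(n-1, up), 1)
  print(np.allclose(Akx[3], np.linalg.matrix_power(A, 3) @ x))   # True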

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                  … SpMV
    MGS(w, v(0), …, v(i-1))         … Modified Gram-Schmidt: update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2 v, …, A^k v ]
  [Q,R] = TSQR(W)                   … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

92

Sequential case: #words_moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^k x] fails to converge

A different polynomial basis [p1(A)x, …, pk(A)x] does converge

93
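
A minimal numpy sketch of why the monomial basis loses precision: the columns of W = [x, Ax, …, A^k x] line up with the dominant eigenvector, so cond(W) grows rapidly with k, while a shifted (Newton-style) basis stays far better conditioned. The test matrix and the equally spaced shifts are crude illustrative stand-ins for the Leja/Ritz shifts used in practice.

  import numpy as np

  n, k = 200, 12
  rng = np.random.default_rng(0)
  A = np.diag(np.linspace(1.0, 10.0, n))          # simple SPD test matrix
  x = rng.standard_normal(n)

  W_mono = np.empty((n, k + 1))
  W_newt = np.empty((n, k + 1))
  W_mono[:, 0] = W_newt[:, 0] = x
  shifts = np.linspace(1.0, 10.0, k)              # crude stand-ins for Leja/Ritz shifts
  for j in range(k):
      W_mono[:, j + 1] = A @ W_mono[:, j]                              # A^j x
      W_newt[:, j + 1] = A @ W_newt[:, j] - shifts[j] * W_newt[:, j]   # (A - s_j I) applied

  print(np.linalg.cond(W_mono))   # huge (monomial basis nearly rank deficient)
  print(np.linalg.cond(W_newt))   # orders of magnitude smaller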

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab, with Residual Replacement (RR) a la Van der Vorst and Ye

Basis:               Naive     Monomial                               Newton        Chebyshev
Replacement its.:    74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)  68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: With stencils (all implicit), all communication is for vectors

                                       Nonzero entries
                              Explicit (O(nnz))     Implicit (o(nnz))
  Indices  Explicit (O(nnz))  CSR and variations    Vision, climate, AMR, …
  Indices  Implicit (o(nnz))  Graph Laplacian       Stencils

• Operations
  • [x, Ax, A^2x, …, A^kx]  or  [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky],
    or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Cotuning and/or interleaving
  • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, summation order not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data
    layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
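
A minimal illustration of the nonassociativity point (not the ReproBLAS algorithm): summing the same numbers in a different order gives a different floating-point result, which is why a parallel reduction's answer can change from run to run.

  import random

  random.seed(0)
  data = [random.uniform(-1, 1) * 10.0 ** random.randint(-8, 8) for _ in range(100000)]

  left_to_right = sum(data)
  right_to_left = sum(reversed(data))
  print(left_to_right == right_to_left)          # typically False
  print(left_to_right - right_to_left)           # small but nonzero difference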

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 43: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 44: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

  for k = 1:n
     for i = 1:n
        for j = 1:n
           D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k ( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies OK, 2.5D works, just different semiring
• Kleene's Algorithm (see the sketch below):

  D = DC-APSP(A, n)
     D = A
     Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
     D11 = DC-APSP(D11, n/2)
     D12 = D11 ⊗ D12
     D21 = D21 ⊗ D11
     D22 = D21 ⊗ D12
     D22 = DC-APSP(D22, n/2)
     D21 = D22 ⊗ D21
     D12 = D12 ⊗ D22
     D11 = D12 ⊗ D21
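
A minimal NumPy sketch of the same recursion over the (min, +) semiring, not the tuned 2.5D code: the updates of D22 and D11 are written out with an explicit elementwise min against the existing block (implicit in the semiring notation above), and the input is assumed to have zeros on the diagonal and +inf for missing edges.

    import numpy as np

    def minplus(A, B):
        # (A ⊗ B)(i,j) = min_k A(i,k) + B(k,j): "matmul" over the (min, +) semiring
        return np.min(A[:, :, None] + B[None, :, :], axis=1)

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D11, D12 = D[:m, :m].copy(), D[:m, m:].copy()
        D21, D22 = D[m:, :m].copy(), D[m:, m:].copy()
        D11 = dc_apsp(D11)
        D12 = minplus(D11, D12)
        D21 = minplus(D21, D11)
        D22 = np.minimum(D22, minplus(D21, D12))   # ⊕ = elementwise min
        D22 = dc_apsp(D22)
        D21 = minplus(D22, D21)
        D12 = minplus(D12, D22)
        D11 = np.minimum(D11, minplus(D12, D21))
        return np.block([[D11, D12], [D21, D22]])

    # Tiny check on a 3-vertex directed graph (0->1 w3, 1->2 w1, 2->0 w2):
    INF = np.inf
    G = np.array([[0.0, 3.0, INF], [INF, 0.0, 1.0], [2.0, INF, 0.0]])
    print(dc_apsp(G))   # shortest-path distances, e.g. d(0,2) = 4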

Performance of 2.5D APSP using Kleene

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Plot annotations: 6.2x speedup; 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
     Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:
     for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code (the innermost three loops do a b x b matmul; a NumPy version appears below):
     for i1 = 1:b:n
      for j1 = 1:b:n
       for k1 = 1:b:n
        for i2 = 0:b-1
         for j2 = 0:b-1
          for k2 = 0:b-1
             i = i1+i2, j = j1+j2, k = k1+k2
             C(i,j) += A(i,k)*B(k,j)

• Thm: Picking b = M^(1/2) attains lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?
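
A small NumPy sketch of the blocked loop nest above, treating each b x b block multiply as one call; taking b ≈ (M/3)^(1/2) is an assumption of this sketch so that one tile each of A, B and C fits in a fast memory of M words at once.

    import numpy as np

    def blocked_matmul(A, B, M):
        n = A.shape[0]
        b = max(1, int((M / 3) ** 0.5))      # tile size ~ M^(1/2), up to a constant
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    # one b x b (smaller at the edges) block multiply
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    # e.g. blocked_matmul(np.random.rand(256, 256), np.random.rand(256, 256), M=3*64*64)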

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

              i  j  k
      Δ  =  [ 1  0  1 ]   A
            [ 0  1  1 ]   B
            [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T:  max 1^T·x  s.t.  Δ·x ≤ 1  (solved below with an off-the-shelf LP solver)
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T·x = 3/2 = e
• Thm: words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2)); attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
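
The LP on this slide is small enough to solve directly; a sketch using scipy.optimize.linprog (a tooling choice of this write-up, not part of the original slides), which returns x = [1/2, 1/2, 1/2] and e = 3/2 as claimed.

    import numpy as np
    from scipy.optimize import linprog

    # Rows of Δ = array references A(i,k), B(k,j), C(i,j); columns = loop indices i, j, k.
    Delta = np.array([[1, 0, 1],
                      [0, 1, 1],
                      [1, 1, 0]])

    # maximize 1^T x subject to Δ·x <= 1, x >= 0  (linprog minimizes, so negate the objective)
    res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
    x, e = res.x, -res.fun
    print(x, e)   # [0.5 0.5 0.5] 1.5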

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

              i  j
      Δ  =  [ 1  0 ]   F
            [ 1  0 ]   P(i)
            [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [1, 1], 1^T·x = 2 = e
• Thm: words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1); attained by block sizes M^xi, M^xj = M^1, M^1 (a tiled sketch follows below)
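
A minimal sketch of the tiling this predicts for the direct n-body loop: block both i and j so that a block of targets and a block of sources sit in fast memory together; the toy pairwise term below is only a stand-in for force(P(i), P(j)).

    import numpy as np

    def blocked_nbody(P, M):
        n = len(P)
        b = max(1, M // 2)          # ~M^1 positions of each of P(i), P(j) per pass
        F = np.zeros_like(P)
        for i0 in range(0, n, b):
            Pi = P[i0:i0 + b]
            for j0 in range(0, n, b):
                Pj = P[j0:j0 + b]
                # toy "force": sum over the source block of (P(i) - P(j))
                F[i0:i0 + b] += (Pi[:, None] - Pj[None, :]).sum(axis=1)
        return F

    # e.g. blocked_nbody(np.random.rand(1024), M=256)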

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
     A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
     A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

              i1 i2 i3 i4 i5 i6
      Δ  =  [ 1  0  1  0  0  1 ]   A1
            [ 1  1  0  1  0  0 ]   A2
            [ 0  1  1  0  1  0 ]   A3
            [ 0  0  1  1  0  1 ]   A3, A4
            [ 0  1  0  0  0  1 ]   A5
            [ 1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T:  max 1^T·x  s.t.  Δ·x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T·x = 15/7 = e
• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7)); attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
     for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
     for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
        C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
        D(something else) = func(something else)
        …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
        φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
        φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φ_j, then there is an e ≥ 1 such that

     words_moved = Ω( #iterations / M^(e-1) )

  where e is the value of a linear program:
     minimize e = Σ_j e_j subject to
     rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S = subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm:

     |S| ≤ Π_j |φ_j(S)|^(s_j)

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive, so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k·log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^k x]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^k x]
• Example: A tridiagonal, n=32, k=3; works for any "well-partitioned" A

  [Figure: entries of x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32, showing which entries of x each computed entry depends on]

• Sequential Algorithm: Steps 1, 2, 3, 4 sweep across the rows, each computing its chunk of all of [Ax, A^2x, A^3x] before moving on, so the data moves between slow and fast memory only once
• Parallel Algorithm: Procs 1, 2, 3, 4 each work on an (overlapping) trapezoid and communicate once with neighbors (a small code sketch follows below)
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
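
A minimal NumPy sketch of the idea for the tridiagonal example: one chunk owner widens its row range by k (the overlapping trapezoid, i.e. a ghost region), does k local SpMVs with no further communication, and keeps only its own rows, which stay exact because errors from the truncated boundary propagate at most one row per step. Function and variable names are illustrative, not from any library.

    import numpy as np

    def matrix_powers_chunk(A, x, k, lo, hi):
        # Compute rows lo:hi of [A·x, A^2·x, ..., A^k·x] for a tridiagonal A,
        # using only a ghost region of width k on each side of the owned rows.
        n = len(x)
        g_lo, g_hi = max(0, lo - k), min(n, hi + k)
        v = x[g_lo:g_hi].copy()
        owned = slice(lo - g_lo, hi - g_lo)
        out = []
        for _ in range(k):
            v = A[g_lo:g_hi, g_lo:g_hi] @ v      # local SpMV on the widened block
            out.append(v[owned].copy())          # only the owned rows are kept (and exact)
        return out

    # Check against plain repeated SpMV on a random tridiagonal A (n=32, k=3):
    n, k = 32, 3
    A = (np.diag(np.random.rand(n)) + np.diag(np.random.rand(n - 1), 1)
         + np.diag(np.random.rand(n - 1), -1))
    x = np.random.rand(n)
    chunk = matrix_powers_chunk(A, x, k, 8, 16)
    assert np.allclose(chunk[-1], (np.linalg.matrix_power(A, k) @ x)[8:16])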

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax-b||_2

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)                  … SpMV
      MGS(w, v(0), …, v(i-1))         … Modified Gram-Schmidt
      update v(i), H
   endfor
   solve LSQ problem with H

Communication-avoiding GMRES (a basis-building sketch follows below):
   W = [ v, A·v, A^2·v, …, A^k·v ]
   [Q, R] = TSQR(W)                   … "Tall Skinny QR"
   build H from R
   solve LSQ problem with H

Sequential case: # words moved decreases by a factor of k
Parallel case: # messages decreases by a factor of k

• Oops – W from power method, precision lost!
• "Monomial" basis [Ax, …, A^k x] fails to converge
• Different polynomial basis [p1(A)x, …, pk(A)x] does converge
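
A small NumPy sketch of just the basis-building step above, covering both the monomial basis and a shifted ("Newton") polynomial basis like the last bullet refers to; np.linalg.qr stands in for TSQR, and building H from R plus the least-squares solve are omitted.

    import numpy as np

    def krylov_basis(A, v, k, shifts=None):
        # W = [v, p1(A)v, ..., pk(A)v]; shifts=None gives the monomial basis
        # [v, Av, ..., A^k v], while shifts (e.g. Ritz values) give a Newton basis
        # p_j(A)v = (A - θ_j I) p_{j-1}(A) v, which is far better conditioned.
        W = np.empty((len(v), k + 1))
        W[:, 0] = v / np.linalg.norm(v)
        for j in range(k):
            w = A @ W[:, j]
            if shifts is not None:
                w -= shifts[j] * W[:, j]
            W[:, j + 1] = w
        return W

    # One CA-GMRES "outer iteration" then orthogonalizes W all at once:
    # Q, R = np.linalg.qr(krylov_basis(A, b, k))   # TSQR in the real algorithm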

Speedups of GMRES on 8-core Intel Clovertown

[MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab

With Residual Replacement (RR), a la Van der Vorst and Ye:

                      Naive     Monomial                                 Newton          Chebyshev
  Replacement Its.    74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)     [67, 98] (2)    68 (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication
  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors

                                       Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Nonzero entries explicit (O(nnz)):   CSR and variations           Vision, climate, AMR, …
  Nonzero entries implicit (o(nnz)):   Graph Laplacian              Stencils

• Operations:
  – [x, Ax, A^2x, …, A^k x] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^k x] and [y, A^T y, (A^T)^2 y, …, (A^T)^k y], or [y, A^T A y, (A^T A)^2 y, …, (A^T A)^k y]
  – Return all vectors or just the last one
• Cotuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^k x] and {TSQR(W) or W^T·W or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
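
A two-line illustration (not from the slides) of why summation order matters: the same three doubles summed in two orders give different answers, which is exactly what a different reduction tree or processor count can do.

    # Floating point addition is not associative:
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0   (b + c rounds back to -1e16)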

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

TSQR QR of a Tall Skinny matrix

45

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 46: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

TSQR QR of a Tall Skinny matrix

46

W =

Q00 R00

Q10 R10

Q20 R20

Q30 R30

W0

W1

W2

W3

Q00

Q10

Q20

Q30

= =

R00

R10

R20

R30

R00

R10

R20

R30

=Q01 R01

Q11 R11

Q01

Q11

= R01

R11

R01

R11

= Q02 R02

Output = Q00 Q10 Q20 Q30 Q01 Q11 Q02 R02

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

Sequential

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2 x, …, A^k x]
• Replace k iterations of y = A·x with [Ax, A^2 x, …, A^k x]
• Example: A tridiagonal, n=32, k=3 (figure: columns 1, 2, 3, 4, …, 32; rows x, A·x, A^2·x, A^3·x)
• Works for any "well-partitioned" A
• Sequential algorithm: process one block of columns at a time (Steps 1, 2, 3, 4), computing all k levels of that block before moving on, so the matrix and vectors are moved from slow to fast memory O(1) times instead of O(k)
• Parallel algorithm (Proc 1, Proc 2, Proc 3, Proc 4): each processor communicates once with its neighbors, then works on an (overlapping) trapezoid of the figure
• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
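A serial sketch of the idea for the tridiagonal example above (mine; the function name matrix_powers_tridiag and the block size are invented for illustration): each block of output entries is computed from its piece of x extended by k "ghost" entries on each side, so that part of x and A is read only once while all k levels are produced.

    import numpy as np

    def matrix_powers_tridiag(diag, off, x, k, block=8):
        # Compute V = [A x; A^2 x; ...; A^k x] for symmetric tridiagonal A
        # (main diagonal `diag`, off-diagonal `off`), one block of entries at a time.
        n = len(x)
        V = np.zeros((k, n))
        for s in range(0, n, block):
            e = min(s + block, n)
            lo, hi = max(0, s - k), min(n, e + k)   # window = block + ghost zone of width k
            v = x[lo:hi].copy()                     # the only read of this part of x
            for p in range(k):
                w = diag[lo:hi] * v
                w[:-1] += off[lo:hi-1] * v[1:]      # superdiagonal contribution
                w[1:]  += off[lo:hi-1] * v[:-1]     # subdiagonal contribution
                # Entries near an interior window edge go stale, but the window
                # extends k past [s, e), so the block itself stays exact.
                V[p, s:e] = w[s-lo:e-lo]
                v = w
        return V

    n, k = 32, 3
    diag, off, x = np.random.rand(n), np.random.rand(n - 1), np.random.rand(n)
    A = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
    V = matrix_powers_tridiag(diag, off, x, k)
    ref = np.vstack([np.linalg.matrix_power(A, p) @ x for p in range(1, k + 1)])
    assert np.allclose(V, ref)

In the parallel version, each processor owns a block and the single round of neighbor communication supplies the ghost entries; the redundant work on the ghost zone is the "price: some redundant computation" noted earlier.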

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax-b||_2

Standard GMRES:
    for i = 1 to k
      w = A·v(i-1)                 … SpMV
      MGS(w, v(0), …, v(i-1))      … Modified Gram-Schmidt: update v(i), H
    endfor
    solve LSQ problem with H

Communication-avoiding GMRES:
    W = [v, Av, A^2 v, …, A^k v]
    [Q, R] = TSQR(W)               … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

• Sequential case: #words moved decreases by a factor of k
• Parallel case: #messages decreases by a factor of k
• Oops – W from power method, precision lost!
  – "Monomial" basis [Ax, …, A^k x] fails to converge
  – A different polynomial basis [p1(A)x, …, pk(A)x] does converge
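The loss of precision is easy to see in the basis itself. The sketch below (mine, not the tutorial's code; the test matrix and the equally spaced shifts are made up, standing in for Leja-ordered Ritz values) builds the monomial basis and a shifted "Newton" basis for the same Krylov subspace and compares their condition numbers; on this example the shifted basis should be far better conditioned, which is why CA-GMRES switches bases before the TSQR step.

    import numpy as np

    def monomial_basis(A, v, k):
        # W = [v, Av, A^2 v, ..., A^k v]: the power-method basis; its columns
        # align with the dominant eigenvector, so W becomes ill-conditioned.
        W = [v]
        for _ in range(k):
            W.append(A @ W[-1])
        return np.column_stack(W)

    def newton_basis(A, v, k, shifts):
        # W = [v, (A - s1 I)v, (A - s2 I)(A - s1 I)v, ...]: a polynomial basis
        # p_j(A)v with shifts spread over the spectrum.
        W = [v]
        for s in shifts[:k]:
            W.append(A @ W[-1] - s * W[-1])
        return np.column_stack(W)

    n, k = 200, 12
    rng = np.random.default_rng(0)
    A = np.diag(np.linspace(1, 100, n)) + 0.01 * rng.standard_normal((n, n))
    v = rng.standard_normal(n); v /= np.linalg.norm(v)

    Wm = monomial_basis(A, v, k)
    Wn = newton_basis(A, v, k, shifts=np.linspace(1, 100, k))
    print("cond(monomial W) = %.2e" % np.linalg.cond(Wm))
    print("cond(Newton   W) = %.2e" % np.linalg.cond(Wn))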


Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]


Requires Co-tuning Kernels


CA-BiCGStab, with Residual Replacement (RR) à la Van der Vorst and Ye

    Basis:            Naive    Monomial                              Newton        Chebyshev
    Replacement its.: 74 (1)   [7, 15, 24, 31, …, 92, 97, 103] (17)  [67, 98] (2)  68 (1)

Tuning space for Krylov Methods
• Classification of sparse operators for avoiding communication (rows: nonzero entries; columns: indices):

                                  Indices explicit (O(nnz))   Indices implicit (o(nnz))
    Entries explicit (O(nnz)):    CSR and variations          Vision, climate, AMR, …
    Entries implicit (o(nnz)):    Graph Laplacian             Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: with stencils (all implicit), all communication is for vectors
• Operations:
  • [x, Ax, A^2 x, …, A^k x] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2 x, …, A^k x] and [y, A^T y, (A^T)^2 y, …, (A^T)^k y], or [y, A^T A y, (A^T A)^2 y, …, (A^T A)^k y]
  • Return all vectors or just the last one
• Cotuning and/or interleaving:
  • W = [x, Ax, A^2 x, …, A^k x] and {TSQR(W) or W^T W or …}
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014: www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014: www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100-page survey article nearly done…

Reproducible Floating Point Computation
• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so the summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
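A three-line illustration of the underlying problem (my example, not part of ReproBLAS): the same numbers summed in a different order give a different floating point result, so any change in data layout or reduction tree changes the answer.

    a = [1.0, 1e100, 1.0, -1e100]
    print(sum(a))                                   # 0.0: each 1.0 is absorbed by 1e100
    print(sum(reversed(a)))                         # 1.0: different order, different result (exact answer is 2.0)
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False: addition is not associative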

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

Page 47: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

TSQR: An Architecture-Dependent Algorithm
• Parallel (binary reduction tree): W = [W0; W1; W2; W3]; local QRs give R00, R10, R20, R30; combine pairs into R01, R11; combine those into R02
• Sequential (flat tree): W = [W0; W1; W2; W3]; factor W0 into R00, fold in W1 to get R01, fold in W2 to get R02, fold in W3 to get R03
• Dual Core: a hybrid of the two trees (R00, R01, R11, R02, R11, R03, …)
• Can choose reduction tree dynamically
• Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
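A sequential sketch of the two reduction trees on four row-blocks (mine; numpy's QR stands in for the local factorizations, and four equal blocks are assumed):

    import numpy as np

    def tsqr_flat(blocks):
        # Sequential ("flat tree") TSQR: fold one block at a time into the running R.
        R = np.linalg.qr(blocks[0], mode='r')
        for W in blocks[1:]:
            R = np.linalg.qr(np.vstack([R, W]), mode='r')
        return R

    def tsqr_binary(blocks):
        # Parallel-style ("binary tree") TSQR: local QRs, then pairwise combines.
        Rs = [np.linalg.qr(W, mode='r') for W in blocks]
        while len(Rs) > 1:
            Rs = [np.linalg.qr(np.vstack(pair), mode='r')
                  for pair in zip(Rs[0::2], Rs[1::2])]
        return Rs[0]

    rng = np.random.default_rng(1)
    W = rng.standard_normal((4000, 10))
    blocks = np.split(W, 4)                      # W0, W1, W2, W3
    R_ref = np.linalg.qr(W, mode='r')
    # R is unique up to the sign of each row, so compare absolute values.
    assert np.allclose(abs(tsqr_flat(blocks)), abs(R_ref))
    assert np.allclose(abs(tsqr_binary(blocks)), abs(R_ref))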

TSQR Performance Results
• Parallel
  – Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  – Grid: 4x on 4 cities (Dongarra et al)
  – Cloud: ~2 map-reduces (Gleich and Benson)
• Sequential
  – "Infinite speedup" for out-of-core on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Building block for QR of a general matrix
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others

Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

Using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"

    W (n x b) = [W1; W2; W3; W4]
    • Factor each block Wi = Pi·Li·Ui and choose b pivot rows of Wi, call them Wi'
    • Combine pairs: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows from each, call them W12' and W34'
    • Combine again: [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows

• Go back to W and use these b pivot rows
  – Move them to the top, do LU without pivoting
  – Extra work, but a lower-order term
• Thm: As numerically stable as Partial Pivoting on a larger matrix
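A rough serial sketch of one such tournament (mine; the helper names best_rows and tournament_pivot_rows are invented, scipy's LU with partial pivoting serves as the local selector, and the number of blocks is assumed to be a power of two):

    import numpy as np
    from scipy.linalg import lu

    def best_rows(block_rows, W, b):
        # Run LU with partial pivoting on W[block_rows, :] and return the global
        # indices of the b rows chosen as pivots (column j of P marks the row used at step j).
        P, L, U = lu(W[block_rows, :])
        chosen_local = np.argmax(P, axis=0)[:b]
        return [block_rows[i] for i in chosen_local]

    def tournament_pivot_rows(W, b, num_blocks=4):
        groups = np.array_split(np.arange(W.shape[0]), num_blocks)
        # Round 1: b candidate rows from each block.
        candidates = [best_rows(list(g), W, b) for g in groups]
        # Later rounds: merge pairs of candidate sets until b rows remain.
        while len(candidates) > 1:
            candidates = [best_rows(c1 + c2, W, b)
                          for c1, c2 in zip(candidates[0::2], candidates[1::2])]
        return candidates[0]

    rng = np.random.default_rng(2)
    W = rng.standard_normal((4000, 8))
    rows = tournament_pivot_rows(W, b=8)
    # These b rows are moved to the top of W; LU then proceeds without pivoting.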

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
(plot axes: log2(p) vs log2(n^2/p) = log2(memory_per_proc)); up to 29x faster

Other CA algorithms
• Need for pivoting arises beyond LU, e.g. in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
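A sketch of that column tournament (mine; QR with column pivoting from scipy is used as the "usual approach" within each group, the helper names are invented, and the number of groups is assumed to be a power of two):

    import numpy as np
    from scipy.linalg import qr

    def best_cols(A, cols, b):
        # Pick b columns from A[:, cols] by QR with column pivoting.
        _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
        return [cols[j] for j in piv[:b]]

    def tournament_pivot_cols(A, b, num_groups=4):
        groups = np.array_split(np.arange(A.shape[1]), num_groups)
        cand = [best_cols(A, list(g), b) for g in groups]
        while len(cand) > 1:
            cand = [best_cols(A, c1 + c2, b)
                    for c1, c2 in zip(cand[0::2], cand[1::2])]
        return cand[0]     # b column indices approximately spanning the column space of A

    rng = np.random.default_rng(3)
    A = rng.standard_normal((500, 64)) @ rng.standard_normal((64, 64))
    cols = tournament_pivot_cols(A, b=8)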

Communication Lower Bounds for Strassen-like matmul algorithms

    Classical O(n^3) matmul:       words_moved = Ω(M·(n/M^(1/2))^3 / P)
    Strassen's O(n^lg7) matmul:    words_moved = Ω(M·(n/M^(1/2))^lg7 / P)
    Strassen-like O(n^ω) matmul:   words_moved = Ω(M·(n/M^(1/2))^ω / P)

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: the DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs DFS step:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

    CAPS:
      if EnoughMemory and P ≥ 7
        then BFS step
        else DFS step
      end if

• In practice, how best to interleave BFS and DFS is a "tuning parameter"
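As a reminder of the seven products being scheduled, here is a plain sequential Strassen sketch (mine, not CAPS itself; it assumes square matrices with power-of-two dimension). In CAPS, each recursion level below is executed either as a BFS step (the seven products in parallel on P/7 processor groups) or as a DFS step (the seven products one after another on all P processors), chosen by the memory test above.

    import numpy as np

    def strassen(A, B, cutoff=64):
        n = A.shape[0]
        if n <= cutoff:
            return A @ B
        k = n // 2
        A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
        B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
        # The 7 recursive products (a BFS step would launch these in parallel).
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:k, :k] = M1 + M4 - M5 + M7
        C[:k, k:] = M3 + M5
        C[k:, :k] = M2 + M4
        C[k:, k:] = M1 - M2 + M3 + M6
        return C

    A, B = np.random.rand(256, 256), np.random.rand(256, 256)
    assert np.allclose(strassen(A, B), A @ B)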

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
• Speedups: 24%-184% (over previous Strassen-based algorithms)
• Invited to appear as Research Highlight in CACM

Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as the second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!

Conventional vs CA-SBR (animations)
• Conventional: touches all data 4 times
• Communication-Avoiding: touches all data once
• Many tuning parameters; the right choices reduce words_moved by a factor of M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n):
      D = A; partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
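A small runnable rendering of this recursion (mine; ⊗ is implemented with the min-with-old-value semantics of the slide's D = A⊗B, the helper names are invented, and nonnegative edge weights with a zero diagonal are assumed), checked against Floyd-Warshall:

    import numpy as np

    def semiring_update(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))  -- the slide's "D = A (x) B"
        return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n <= 1:
            return A.copy()
        m = n // 2
        D = A.copy()
        D[:m, :m] = dc_apsp(D[:m, :m])
        D[:m, m:] = semiring_update(D[:m, m:], D[:m, :m], D[:m, m:])
        D[m:, :m] = semiring_update(D[m:, :m], D[m:, :m], D[:m, :m])
        D[m:, m:] = semiring_update(D[m:, m:], D[m:, :m], D[:m, m:])
        D[m:, m:] = dc_apsp(D[m:, m:])
        D[m:, :m] = semiring_update(D[m:, :m], D[m:, m:], D[m:, :m])
        D[:m, m:] = semiring_update(D[:m, m:], D[:m, m:], D[m:, m:])
        D[:m, :m] = semiring_update(D[:m, :m], D[:m, m:], D[m:, :m])
        return D

    n = 16
    rng = np.random.default_rng(4)
    A = rng.uniform(1, 10, (n, n))
    np.fill_diagonal(A, 0.0)
    F = A.copy()                      # reference: Floyd-Warshall
    for k in range(n):
        F = np.minimum(F, F[:, [k]] + F[[k], :])
    assert np.allclose(dc_apsp(A), F)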

Performance of 2.5D APSP using Kleene
• Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
• Plot annotations: 6.2x speedup and 2x speedup

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives the optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    words_moved = Ω(min(d·n/P^(1/2), d^2·n/P))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on
  – Algorithms:
    • LDL^T, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul
• Naïve code:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)·B(k,j)
• "Blocked" code (b x b matmul on each tile):
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)·B(k,j)
• Thm: Picking b = M^(1/2) attains the lower bound words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?
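A runnable rendering of the blocked pseudocode (mine; the tiles are written as array slices rather than the explicit i2, j2, k2 loops, and in practice b is (M/3)^(1/2) up to a constant so that three b x b tiles fit in fast memory at once):

    import numpy as np

    def blocked_matmul(A, B, b):
        # C += A*B with b x b tiles; each (i1, j1, k1) iteration touches three
        # b x b tiles, so choosing b ~ M^(1/2) attains words_moved = Theta(n^3 / M^(1/2)).
        n = A.shape[0]
        C = np.zeros((n, n))
        for i1 in range(0, n, b):
            for j1 in range(0, n, b):
                for k1 in range(0, n, b):
                    C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
        return C

    n, b = 256, 32
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(blocked_matmul(A, B, b), A @ B)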

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 48: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

TSQR Performance Resultsbull Parallel

ndash Intel Clovertownndash Up to 8x speedup (8 core dual socket 10M x 10)

ndash Pentium III cluster Dolphin Interconnect MPICHbull Up to 67x speedup (16 procs 100K x 200)

ndash BlueGeneLbull Up to 4x speedup (32 procs 1M x 50)

ndash Tesla C 2050 Fermibull Up to 13x (110592 x 100)

ndash Grid ndash 4x on 4 cities (Dongarra et al)ndash Cloud ndash ~2 map-reduces (Gleich and Benson)

bull Sequential ndash ldquoInfinite speeduprdquo for out-of-Core on PowerPC laptop

bull As little as 2x slowdown vs (predicted) infinite DRAMbull LAPACK with virtual memory never finished

bull SVD costs about the samebull Building block for QR of a general matrixbull Joint work with Grigori Hoemmen Langou Anderson Ballard Keutzer

others48

Data from Grey Ballard Mark Hoemmen Laura Grigori Julien Langou Jack Dongarra Michael Anderson

Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

49

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

bull Go back to W and use these b pivot rows bull Move them to top do LU without pivotingbull Extra work but lower order term

bull Thm As numerically stable as Partial Pivoting on a larger matrix

LU Speedups from Tournament Pivoting and 25D

25D vs 2D LUWith and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

Page 49: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Using similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n x b) = [ W1 ; W2 ; W3 ; W4 ] = [ P1·L1·U1 ; P2·L2·U2 ; P3·L3·U3 ; P4·L4·U4 ]
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1' ; W2' ] = P12·L12·U12  →  choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34  →  choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234  →  choose b pivot rows

• Go back to W and use these b pivot rows
• Move them to the top, do LU without pivoting
• Extra work, but a lower order term

• Thm: As numerically stable as Partial Pivoting on a larger matrix
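A minimal serial sketch of the pivot-row tournament above, assuming numpy/scipy (the function names are mine, and a real TSLU runs the rounds across processors and reuses the local LU factors):

```python
import numpy as np
from scipy.linalg import lu

def best_b_rows(W, b):
    # One "game": LU with partial pivoting on the stacked candidate rows;
    # the first b columns of P mark which rows were chosen as pivots.
    P, L, U = lu(W)                      # W = P @ L @ U
    return np.argmax(P[:, :b], axis=0)

def tournament_pivot_rows(W, b):
    # Pair up groups of b candidate rows, keep the b winners of each pair,
    # and repeat until one group of b pivot rows remains.
    n = W.shape[0]
    groups = [np.arange(i, min(i + b, n)) for i in range(0, n, b)]
    while len(groups) > 1:
        merged = []
        for j in range(0, len(groups), 2):
            idx = np.concatenate(groups[j:j + 2])
            merged.append(idx[best_b_rows(W[idx], b)])
        groups = merged
    return groups[0]                     # global indices of the b pivot rows
```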

LU Speedups from Tournament Pivoting and 2.5D

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1,024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heatmap: log2(p) on the horizontal axis, log2(n^2/p) = log2(memory_per_proc) on the vertical axis; up to 29x faster]

Other CA algorithms

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting
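A hedged sketch of the column tournament (mirroring the TSLU row tournament earlier), assuming scipy; here each "game" is simply one QR with column pivoting rather than the Gu/Eisenstat variant:

```python
import numpy as np
from scipy.linalg import qr

def best_b_columns(A, b):
    # One "game": QR with column pivoting on the candidate columns.
    Q, R, piv = qr(A, mode='economic', pivoting=True)
    return piv[:b]

def tournament_pivot_columns(A, b):
    # Pair up groups of b candidate columns, keep the b winners of each
    # pair, and repeat; the survivors lead A·P = Q·R in an RRQR.
    n = A.shape[1]
    groups = [np.arange(j, min(j + b, n)) for j in range(0, n, b)]
    while len(groups) > 1:
        merged = []
        for j in range(0, len(groups), 2):
            idx = np.concatenate(groups[j:j + 2])
            merged.append(idx[best_b_columns(A[:, idx], b)])
        groups = merged
    return groups[0]
```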

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

  Classical O(n^3) matmul:      words_moved = Ω( M (n/M^(1/2))^3 / P )
  Strassen's O(n^lg 7) matmul:  words_moved = Ω( M (n/M^(1/2))^(lg 7) / P )
  Strassen-like O(n^ω) matmul:  words_moved = Ω( M (n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs DFS step:
  – BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  – DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

In practice, how best to interleave BFS and DFS is a "tuning parameter"

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as a Research Highlight in CACM

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once
• Many tuning parameters; the right choices reduce words_moved by a factor of M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder the outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
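A minimal sequential sketch of the recursion above, assuming numpy (the ⊗ here is a dense min-plus product that also folds in the existing D, and there is no 2.5D distribution in this toy version):

```python
import numpy as np

def min_plus(D, A, B):
    # D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) )
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(D):
    # Kleene's divide-and-conquer APSP, updating D in place.
    n = D.shape[0]
    if n <= 1:
        return D
    m = n // 2
    D11, D12 = D[:m, :m], D[:m, m:]
    D21, D22 = D[m:, :m], D[m:, m:]
    dc_apsp(D11)
    D12[:] = min_plus(D12, D11, D12)
    D21[:] = min_plus(D21, D21, D11)
    D22[:] = min_plus(D22, D21, D12)
    dc_apsp(D22)
    D21[:] = min_plus(D21, D22, D21)
    D12[:] = min_plus(D12, D12, D22)
    D11[:] = min_plus(D11, D12, D21)
    return D

# Usage: adjacency matrix with np.inf for missing edges, 0 on the diagonal.
A = np.array([[0, 3, np.inf], [np.inf, 0, 1], [2, np.inf, 0]], float)
print(dc_apsp(A.copy()))
```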

Performance of 2.5D APSP using Kleene

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)

[Plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton, Rose, Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives the optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( dn/P^(1/2), d^2·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A*B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Summary of Direct Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naive code:
    for i = 1:n, for j = 1:n, for k = 1:n, C(i,j) += A(i,k)*B(k,j)

• "Blocked" code:
    for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n
      for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1
        i = i1+i2, j = j1+j2, k = k1+k2
        C(i,j) += A(i,k)*B(k,j)        … b x b matmul

• Thm: Picking b = M^(1/2) attains the lower bound: words_moved = Ω(n^3/M^(1/2))
• Where does 1/2 come from?
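A minimal numpy sketch of the blocked code, with b chosen as roughly sqrt(M/3) so that one b x b tile each of A, B and C fits in a fast memory of M words (the slide's b = M^(1/2) absorbs the constant):

```python
import numpy as np

def blocked_matmul(A, B, M):
    n = A.shape[0]
    b = max(1, int((M // 3) ** 0.5))      # three b x b tiles fit in fast memory
    C = np.zeros((n, n))
    for i1 in range(0, n, b):
        for j1 in range(0, n, b):
            for k1 in range(0, n, b):
                # one b x b tile update; across the whole loop nest each matrix
                # is re-read about n/b times, i.e. ~n^3/b = O(n^3/M^(1/2)) words moved
                C[i1:i1+b, j1:j1+b] += A[i1:i1+b, k1:k1+b] @ B[k1:k1+b, j1:j1+b]
    return C
```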

New Thm applied to Matmul

• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
      Δ = [ 1  0  1 ]   A
          [ 0  1  1 ]   B
          [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: words_moved = Ω(n^3/M^(e-1)) = Ω(n^3/M^(1/2))
  Attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
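A small sketch, assuming scipy, of solving exactly this LP numerically (for matmul one can of course read off x = [1/2, 1/2, 1/2] by hand):

```python
import numpy as np
from scipy.optimize import linprog

# Rows of Delta: which loop indices (i, j, k) each array reference touches.
delta = np.array([[1, 0, 1],    # A(i,k)
                  [0, 1, 1],    # B(k,j)
                  [1, 1, 0]])   # C(i,j)

# maximize 1^T x  s.t.  Delta x <= 1, x >= 0   (linprog minimizes, so negate c)
res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=(0, None))
print(res.x)        # [0.5, 0.5, 0.5]
print(res.x.sum())  # e = 1.5, so words_moved = Omega(n^3 / M^(e-1)) = Omega(n^3 / M^(1/2))
```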

New Thm applied to Direct N-Body

• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record array indices in matrix Δ:

            i  j
      Δ = [ 1  0 ]   F
          [ 1  0 ]   P(i)
          [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: words_moved = Ω(n^2/M^(e-1)) = Ω(n^2/M^1)
  Attained by block sizes M^xi, M^xj = M^1, M^1
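A minimal sketch of the corresponding blocked n-body loop (1D positions and a made-up softened force), assuming numpy: each block of M targets loaded into fast memory is reused against M sources at a time, which is what the M^1 x M^1 block sizes mean.

```python
import numpy as np

def blocked_forces(P, M, force):
    n = len(P)
    F = np.zeros(n)
    b = max(1, M)
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            Pi = P[i0:i0+b]          # resident block of targets
            Pj = P[j0:j0+b]          # streamed block of sources
            F[i0:i0+b] += force(Pi[:, None], Pj[None, :]).sum(axis=1)
    return F

# Example: a softened inverse-square force between 1D positions
force = lambda pi, pj: (pj - pi) / (np.abs(pj - pi) ** 3 + 1e-6)
F = blocked_forces(np.random.rand(1000), M=128, force=force)
```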

N-Body Speedups on IBM-BGP (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code

• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
      Δ = [  1  0  1  0  0  1 ]   A1
          [  1  1  0  1  0  0 ]   A2
          [  0  1  1  0  1  0 ]   A3
          [  0  0  1  1  0  1 ]   A3, A4
          [  0  1  0  0  0  1 ]   A5
          [  1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: words_moved = Ω(n^6/M^(e-1)) = Ω(n^6/M^(8/7))
  Attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)

Approach to generalizing lower bounds

• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3, access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1 = 1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else)
      …
  => for (i1,i2,…,ik) in S = subset of Z^k,
     access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array refs given by projections φ_j, there is an e ≥ 1 such that
    words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
    minimize e = Σ_j e_j  subject to
    rank(H) ≤ Σ_j e_j · rank(φ_j(H))  for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett
• Given S ⊆ Z^k and group homomorphisms φ_1, φ_2, …, bound |S| in terms of |φ_1(S)|, |φ_2(S)|, …, |φ_m(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s_1, …, s_m:
    |S| ≤ Π_j |φ_j(S)|^(s_j)
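For concreteness, a worked special case (my example): for matmul the projections are φ_A(i,j,k) = (i,k), φ_B(i,j,k) = (k,j), φ_C(i,j,k) = (i,j), the exponents s_j = 1/2 are feasible, and the theorem reduces to the Loomis–Whitney inequality:

```latex
\[
  |S| \;\le\; |\phi_A(S)|^{1/2}\,|\phi_B(S)|^{1/2}\,|\phi_C(S)|^{1/2},
  \qquad\text{e.g. } S=[n]^3:\quad n^3 \le (n^2)^{1/2}(n^2)^{1/2}(n^2)^{1/2}.
\]
```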

Is this bound attainable? (1/2)

• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g., when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of the indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", database join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline

• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2 x, …, A^k x]

• Replace k iterations of y = A·x with [Ax, A^2 x, …, A^k x]
• Example: A tridiagonal, n=32, k=3
  [Figure: entries 1, 2, 3, 4, …, 32 of x, A·x, A^2·x, A^3·x, built up level by level]
• Works for any "well-partitioned" A

• Sequential Algorithm: Step 1, Step 2, Step 3, Step 4 (one block at a time)

• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  – Each processor communicates once with neighbors
  – Each processor works on an (overlapping) trapezoid

• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
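A minimal serial sketch of the kernel for the tridiagonal example, assuming numpy; the p "processors" are just a loop here, and the one-time ghost fetch of k extra entries per side stands in for the single message to each neighbor.

```python
import numpy as np

def tridiag_apply(a, b, c, v):
    # y = A v for tridiagonal A with sub/main/super diagonals a, b, c.
    y = b * v
    y[1:] += a[1:] * v[:-1]
    y[:-1] += c[:-1] * v[1:]
    return y

def matrix_powers_tridiag(a, b, c, x, k, p=4):
    # Each "processor" fetches its rows of x plus k ghost entries per side once,
    # then computes locally with no further communication.  Entries within
    # distance j of an artificial window boundary are garbage at level j, but
    # the owned range sits at distance >= k, so it stays exact.
    n = len(x)
    V = np.zeros((k, n))
    chunk = (n + p - 1) // p
    for proc in range(p):
        lo, hi = proc * chunk, min((proc + 1) * chunk, n)
        glo, ghi = max(lo - k, 0), min(hi + k, n)      # one-time ghost fetch
        w = x[glo:ghi].copy()
        for j in range(k):
            w = tridiag_apply(a[glo:ghi], b[glo:ghi], c[glo:ghi], w)
            V[j, lo:hi] = w[lo - glo: hi - glo]
    return V            # rows are A·x, A^2·x, ..., A^k·x
```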

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax - b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)              … SpMV
    MGS(w, v(0), …, v(i-1))     … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A^2 v, …, A^k v ]
  [Q,R] = TSQR(W)               … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H
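A hedged sketch of the TSQR building block used above, assuming numpy (one flat reduction instead of a tree, and only the R factor is formed; a real implementation also constructs or applies Q):

```python
import numpy as np

def tsqr_r(W, nblocks=4):
    # Local QR of each row block (no communication between blocks), then QR of
    # the stacked small R factors; gives the same R (up to signs) as np.linalg.qr(W).
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]
    return np.linalg.qr(np.vstack(Rs), mode='r')
```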

Page 50: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

LU Speedups from Tournament Pivoting and 2.5D

Page 51: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

2.5D vs 2D LU, With and Without Pivoting

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  => for (i,j,k) in S = subset of Z^3
     Access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else), …
  => for (i1,i2,…,ik) in S = subset of Z^k
     Access locations indexed by "projections", e.g.
       φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
       φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on #points in its images φ_C(S), φ_A(S), …?

General Communication Bound
• Thm: Given a program with array refs given by projections φj, then there is an e ≥ 1 such that
      words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
      minimize e = Σj ej subject to
      rank(H) ≤ Σj ej · rank(φj(H)) for all subgroups H < Z^k
• Proof depends on recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
• Given S subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm satisfying the rank conditions above,
      |S| ≤ Πj |φj(S)|^sj
  (a worked matmul instance follows)
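For concreteness, a worked instance of the bound for matmul (a standard Loomis-Whitney-style argument, sketched here rather than taken from the slide): the projections are φ_A(i,j,k) = (i,k), φ_B(i,j,k) = (k,j), φ_C(i,j,k) = (i,j), and s_A = s_B = s_C = 1/2 satisfy the rank conditions, so
    |S| ≤ |φ_A(S)|^(1/2) · |φ_B(S)|^(1/2) · |φ_C(S)|^(1/2).
During a segment of execution that moves O(M) words, each image has size O(M), so at most O(M^(3/2)) loop iterations fit in one segment; the n^3 total iterations then force Ω(n^3/M^(3/2)) segments, hence Ω(n^3/M^(1/2)) words moved, matching e - 1 = 1/2.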

Is this bound attainable? (1/2)
• But first: Can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g., when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of the indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: Linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies): work in progress

Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra
• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]
• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A

[Figure, repeated over the animation steps: dependency diagram with rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32]

• Sequential Algorithm: compute the entries chunk by chunk in Steps 1, 2, 3, 4, so each block of A and x moves from slow to fast memory only once
• Parallel Algorithm: Proc 1, Proc 2, Proc 3, Proc 4
  – Each processor works on an (overlapping) trapezoid
  – Each processor communicates once with neighbors
  (a code sketch follows)
• Same idea works for general sparse matrices
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
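A minimal sketch of the parallel matrix powers idea for this tridiagonal example, in Python/SciPy; the block size and the "one processor per row block" framing are illustrative. Each block fetches k ghost rows per side once and then computes its piece of all k vectors with no further communication.

    import numpy as np
    import scipy.sparse as sp

    def matrix_powers_block(A, x, k, lo, hi):
        """Compute rows lo:hi of [A x, A^2 x, ..., A^k x] for a tridiagonal A
        using only the 'trapezoid' of data near rows lo:hi: k ghost rows on
        each side suffice, so the block is fetched/communicated once."""
        g_lo, g_hi = max(0, lo - k), min(A.shape[0], hi + k)   # ghost zone
        A_loc = A[g_lo:g_hi, g_lo:g_hi]                        # local block of A
        v = x[g_lo:g_hi].copy()
        out = []
        for _ in range(k):
            v = A_loc @ v                                      # local SpMV; interior rows stay exact
            out.append(v[lo - g_lo : hi - g_lo].copy())
        return out                                             # rows lo:hi of each A^j x

    n, k = 32, 3
    A = sp.diags([np.ones(n-1), 2*np.ones(n), np.ones(n-1)], [-1, 0, 1], format="csr")
    x = np.random.rand(n)
    # four "processors", each owning 8 contiguous rows
    chunks = [matrix_powers_block(A, x, k, lo, lo + 8) for lo in range(0, n, 8)]
    A3x = np.concatenate([c[k - 1] for c in chunks])
    assert np.allclose(A3x, A @ (A @ (A @ x)))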

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax-b ||_2

Standard GMRES:
    for i = 1 to k
      w = A · v(i-1)                   … SpMV
      MGS(w, v(0), …, v(i-1))          … Modified Gram-Schmidt
      update v(i), H
    endfor
    solve LSQ problem with H

Communication-avoiding GMRES (TSQR sketched below):
    W = [ v, Av, A^2v, …, A^kv ]
    [Q, R] = TSQR(W)                   … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

• Oops – W from power method, precision lost!
• Sequential case: # words moved decreases by a factor of k
• Parallel case: # messages decreases by a factor of k
• "Monomial" basis [Ax, …, A^kx] fails to converge
• Different polynomial basis [p1(A)x, …, pk(A)x] does converge
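A minimal two-level TSQR sketch; NumPy's qr stands in for the per-processor factorizations, and the number of row blocks is an illustrative stand-in for the processor count.

    import numpy as np

    def tsqr(W, nblocks=4):
        """Two-level Tall-Skinny QR: factor each row block locally, then do one
        small QR of the stacked R factors (one reduction instead of one
        message round per column). Returns R; Q stays implicit in the local factors."""
        blocks = np.array_split(W, nblocks, axis=0)      # row blocks, one per "processor"
        local = [np.linalg.qr(B) for B in blocks]        # local QRs, no communication
        R_stack = np.vstack([R for (_, R) in local])     # small nblocks*n x n matrix
        _, R = np.linalg.qr(R_stack)                     # one small QR at the "root"
        return R

    W = np.random.rand(1000, 5)
    R = tsqr(W)
    R_ref = np.linalg.qr(W)[1]
    # R matches a direct QR of W up to row signs
    assert np.allclose(np.abs(R), np.abs(R_ref))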

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Requires co-tuning kernels

CA-BiCGStab
With Residual Replacement (RR) a la Van der Vorst and Ye:

    Basis:             Naive     Monomial                              Newton         Chebyshev
    Replacement Its.:  74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)  [67, 98] (2)   68 (1)

Tuning space for Krylov Methods
• Classifications of sparse operators for avoiding communication:

                                   Indices
                                   Explicit (O(nnz))      Implicit (o(nnz))
    Nonzero   Explicit (O(nnz))    CSR and variations     Vision, climate, AMR, …
    entries   Implicit (o(nnz))    Graph Laplacian        Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: With stencils (all implicit), all communication is for vectors
• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one
• Cotuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx] and TSQR(W) or W^TW or …
  • Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation
• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so summation order is not reproducible (illustrated below)
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
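A quick illustration of the nonassociativity point; the values are chosen only to make the rounding visible.

    # Floating point addition is not associative: the grouping (and hence the
    # parallel reduction tree) changes the rounded result.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0 -- c is absorbed when added to b first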

Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)


Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Contour plot of modeled speedup vs log2(p) and log2(n^2/p) = log2(memory_per_proc)]
Up to 29x faster

Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting (see the sketch after this list):
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
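A minimal sketch of the column tournament, using SciPy's pivoted QR as the local "pick the best b columns" step; the binary reduction shape and the local selector are illustrative choices rather than the exact Gu/Eisenstat procedure.

    import numpy as np
    from scipy.linalg import qr

    def tournament_pivot(A, b):
        """Select b candidate pivot columns of A by a binary tournament:
        local pivoted QR picks the best b columns of each pair of groups,
        winners play winners, until one group of b columns remains."""
        n = A.shape[1]
        groups = [list(range(i, min(i + b, n))) for i in range(0, n, b)]
        while len(groups) > 1:
            nxt = []
            for i in range(0, len(groups), 2):
                cols = groups[i] + (groups[i + 1] if i + 1 < len(groups) else [])
                _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
                nxt.append([cols[j] for j in piv[:b]])   # keep the b "winning" columns
            groups = nxt
        return groups[0]

    A = np.random.rand(100, 64)
    print(tournament_pivot(A, b=8))   # indices of 8 selected pivot columns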

Communication Lower Bounds for Strassen-like matmul algorithms

    Classical O(n^3) matmul:        words_moved = Ω( M·(n/M^(1/2))^3 / P )
    Strassen's O(n^(lg 7)) matmul:  words_moved = Ω( M·(n/M^(1/2))^(lg 7) / P )
    Strassen-like O(n^ω) matmul:    words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

    CAPS:
      if EnoughMemory and P ≥ 7 then
        BFS step
      else
        DFS step
      end if

• In practice, how to best interleave BFS and DFS is a "tuning parameter" (a serial sketch of the recursion follows)
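A serial Python sketch of the seven-multiply recursion that CAPS schedules; the BFS/DFS distinction is about how the seven subproducts are mapped to processors and memory, which this sequential version only notes in comments.

    import numpy as np

    def strassen(A, B, cutoff=64):
        """Strassen's 7-multiply recursion (serial sketch). CAPS chooses, at each
        level, whether to run the 7 products in parallel on P/7 processors each
        (BFS step, more memory) or one after another on all P processors
        (DFS step, less memory); here they simply run sequentially."""
        n = A.shape[0]
        if n <= cutoff or n % 2:              # fall back to classical matmul
            return A @ B
        m = n // 2
        A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
        B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty((n, n))
        C[:m, :m] = M1 + M4 - M5 + M7
        C[:m, m:] = M3 + M5
        C[m:, :m] = M2 + M4
        C[m:, m:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(256, 256); B = np.random.rand(256, 256)
    assert np.allclose(strassen(A, B), A @ B)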

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%-184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM

Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A → Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs CA-SBR
[Side-by-side animations of the two band-reduction schemes]
• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once
• Many tuning parameters: right choices reduce words_moved by a factor of M/bw, not just M^(1/2)

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k)+B(k,j) ) ) by D = A⊗B
  (here ⊗ is the min-plus product, accumulating into the left-hand side the way C += A·B does in the usual semiring)
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (an executable sketch follows):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
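To pin down the semiring, a small executable rendering of the min-plus product and the recursion above; np.inf encodes "no edge", the diagonal of the input is assumed to be 0, and matrix sizes need not be powers of 2.

    import numpy as np

    def minplus(C, A, B):
        """C = min(C, A (min,+) B): the semiring analogue of C += A @ B."""
        return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        """Divide-and-conquer APSP (Kleene's algorithm) on a distance matrix D."""
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D = D.copy()
        D[:m, :m] = dc_apsp(D[:m, :m])
        D[:m, m:] = minplus(D[:m, m:], D[:m, :m], D[:m, m:])
        D[m:, :m] = minplus(D[m:, :m], D[m:, :m], D[:m, :m])
        D[m:, m:] = minplus(D[m:, m:], D[m:, :m], D[:m, m:])
        D[m:, m:] = dc_apsp(D[m:, m:])
        D[m:, :m] = minplus(D[m:, :m], D[m:, m:], D[m:, :m])
        D[:m, m:] = minplus(D[:m, m:], D[:m, m:], D[m:, m:])
        D[:m, :m] = minplus(D[:m, :m], D[:m, m:], D[m:, :m])
        return D

    INF = np.inf
    D0 = np.array([[0.0, 3.0, INF],
                   [INF, 0.0, 1.0],
                   [2.0, INF, 0.0]])
    print(dc_apsp(D0))   # all-pairs shortest path distances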

Performance of 2.5D APSP using Kleene
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
[Plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 53: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log 2 (

n2 p)

=

log 2 (

mem

ory_

per_

proc

)

Up to 29xfaster

Other CA algorithmsbull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 54

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

Page 54: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Other CA algorithms
• Need for pivoting arises beyond LU, e.g. in QR:
  – Choose permutation P so that the leading columns of AP = QR span the column space of A – Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting (a baseline is sketched in code after this slide):
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use tournament pivoting instead:
    • Each round of the tournament selects the best b columns from two groups of b columns, using either the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting
                                                            54
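
As a concrete baseline (not the CA tournament-pivoting algorithm itself), the conventional "longest column first" idea is just column-pivoted QR. A short SciPy sketch, with an illustrative low-rank test matrix of my own choosing:

    import numpy as np
    from scipy.linalg import qr

    # Build a 100 x 50 matrix of numerical rank ~8 (illustrative assumption).
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 50))

    # Column-pivoted QR: A[:, piv] = Q R with |R[0,0]| >= |R[1,1]| >= ...
    Q, R, piv = qr(A, mode='economic', pivoting=True)
    print(np.abs(np.diag(R))[:10])   # magnitudes drop sharply after the 8th column,
                                     # "revealing" the numerical rank

Tournament pivoting replaces the column-at-a-time choice with a reduction tree over blocks of b columns, which is what makes the communication lower bound attainable.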

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n²/P^(2/ω)
• Best Paper Prize (SPAA'11): Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?

Classical O(n³) matmul:        words_moved = Ω( M·(n/M^(1/2))³ / P )
Strassen's O(n^(lg 7)) matmul: words_moved = Ω( M·(n/M^(1/2))^(lg 7) / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs. DFS step:
  – BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  – DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if enough memory and P ≥ 7, then BFS step, else DFS step.
In practice, how best to interleave BFS and DFS is a "tuning parameter".

57
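
For reference, here is the recursion that CAPS schedules: a plain sequential Strassen sketch in NumPy (the cutoff value and the power-of-two size are my assumptions). A BFS step assigns these seven recursive products to seven processor groups; a DFS step runs them one after another on all processors.

    import numpy as np

    def strassen(A, B, cutoff=64):
        # Sequential Strassen recursion; assumes square matrices with n a power of 2.
        n = A.shape[0]
        if n <= cutoff:
            return A @ B                      # classical matmul at the base case
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # The 7 recursive products (these are what BFS/DFS steps distribute).
        M1 = strassen(A11 + A22, B11 + B22, cutoff)
        M2 = strassen(A21 + A22, B11, cutoff)
        M3 = strassen(A11, B12 - B22, cutoff)
        M4 = strassen(A22, B21 - B11, cutoff)
        M5 = strassen(A11 + A12, B22, cutoff)
        M6 = strassen(A21 - A11, B11 + B12, cutoff)
        M7 = strassen(A12 - A22, B21 + B22, cutoff)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C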

Performance benchmarking, strong scaling plot: Franklin (Cray XT4), n = 94080.
Speedups: 24%-184% over previous Strassen-based algorithms.
Invited to appear as a Research Highlight in CACM.

Symmetric Band Reduction

• Grey Ballard and Nick Knight
• A → QAQ^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for the SVD of a band matrix
• Use alone, or as the second phase when A is dense:
  – Dense → Banded → Tridiagonal
• Implemented in LAPACK's sytrd
• Algorithm does not satisfy the communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less

Conventional vs. CA-SBR

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.
Many tuning parameters; the right choices reduce words_moved by a factor of M/bw, not just M^(1/2).

Speedups of Symmetric Band Reduction vs. LAPACK's DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Paths (APSP) using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But we can't reorder the outer loop for 2.5D, so we need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A*B
  – Dependencies are ok, 2.5D works, just over a different semiring (a NumPy sketch follows)
• Kleene's Algorithm (divide-and-conquer APSP):

    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
                                                            61
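
A small NumPy sketch of the semiring operations above, on dense distance matrices with np.inf for "no edge" (the dense n x n x n intermediate in minplus is only reasonable for small n; that simplification is mine):

    import numpy as np

    def minplus(A, B):
        # C[i,j] = min_k A[i,k] + B[k,j] -- the "D = A*B" product used above.
        return np.min(A[:, :, None] + B[None, :, :], axis=1)

    def floyd_warshall(A):
        # The triple loop above, with the two inner loops vectorized.
        D = A.astype(float).copy()
        for k in range(D.shape[0]):
            D = np.minimum(D, D[:, [k]] + D[[k], :])
        return D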

Performance of 2.5D APSP using Kleene

62

Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotated speedups on the plot: 6.2x and 2x.

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)
                                                            63

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable, so there is a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω( min( dn/P^(1/2), d²n/P ) )
  – Proof exploits the fact that reuse of entries of C = A*B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d²n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

64

Summary of Direct Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work on:
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, eigenproblems, …
    • All-pairs-shortest-paths, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into ScaLAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry/DFT applications:
  – Aquarius, with ANL and UT Austin, on IBM BG/Q and Cray XC30
  – Qbox, with LLNL and IBM, on IBM BG/Q
  – Q-Chem, work in progress

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Recall optimal sequential Matmul

• Naïve code:

    for i = 1:n
      for j = 1:n
        for k = 1:n
          C(i,j) += A(i,k) * B(k,j)

• "Blocked" code, i.e. a b x b matmul on tiles (a runnable sketch follows):

    for i1 = 1:b:n
      for j1 = 1:b:n
        for k1 = 1:b:n
          for i2 = 0:b-1
            for j2 = 0:b-1
              for k2 = 0:b-1
                i = i1+i2; j = j1+j2; k = k1+k2
                C(i,j) += A(i,k) * B(k,j)

• Thm: Picking b = M^(1/2) attains the lower bound words_moved = Ω(n³/M^(1/2))
• Where does 1/2 come from?
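
A runnable NumPy version of the blocked loop nest; the block size b is the caller's choice, and picking b on the order of (M/3)^(1/2), so that one tile each of A, B, and C fits in fast memory, is the point of the theorem (the square-matrix assumption below is mine):

    import numpy as np

    def blocked_matmul(A, B, b):
        n = A.shape[0]               # assumes square n x n matrices
        C = np.zeros_like(A)
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    # One b x b tile update: touches 3 tiles of size b^2.
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C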

New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record the array indices in a matrix Δ (rows = array references, columns = loop indices i, j, k):

        i  j  k
        1  0  1   A
    Δ = 0  1  1   B
        1  1  0   C

• Solve the LP for x = [xi, xj, xk]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e (checked mechanically in the sketch below)
• Thm: words_moved = Ω(n³/M^(e-1)) = Ω(n³/M^(1/2)); attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2)
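
The LP can be checked mechanically; a sketch with scipy.optimize.linprog (which minimizes, so the objective is negated):

    import numpy as np
    from scipy.optimize import linprog

    delta = np.array([[1, 0, 1],    # A(i,k)
                      [0, 1, 1],    # B(k,j)
                      [1, 1, 0]])   # C(i,j)
    res = linprog(c=-np.ones(3), A_ub=delta, b_ub=np.ones(3), bounds=(0, None))
    x = res.x                # [0.5, 0.5, 0.5]
    e = x.sum()              # 1.5, so words_moved = Ω(n^3 / M^(1/2))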

New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force( P(i), P(j) )
• Record the array indices in matrix Δ:

        i  j
        1  0   F
    Δ = 1  0   P(i)
        0  1   P(j)

• Solve the LP for x = [xi, xj]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: words_moved = Ω(n²/M^(e-1)) = Ω(n²/M^1); attained by block sizes M^xi, M^xj = M^1, M^1 (a tiled sketch follows)
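
A tiled version of the force loop that realizes the M-by-M blocking; a pure-Python sketch in which `force` and the block size b are whatever the application supplies:

    def blocked_forces(P, force, b):
        # Process b x b blocks of (i, j) pairs so each pass reuses the b
        # particles loaded for the i-block against the b particles of the j-block.
        n = len(P)
        F = [0.0] * n
        for i0 in range(0, n, b):
            for j0 in range(0, n, b):
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        F[i] += force(P[i], P[j])
        return F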

N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles

11.8x speedup

K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1( A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6) )
    A5(i2,i6)    += func2( A6(i1,i4,i5), A3(i3,i4,i6) )
• Record the array indices in matrix Δ:

        i1 i2 i3 i4 i5 i6
         1  0  1  0  0  1   A1
         1  1  0  1  0  0   A2
    Δ =  0  1  1  0  1  0   A3
         0  0  1  1  0  1   A3, A4
         0  1  0  0  0  1   A5
         1  0  0  1  1  0   A6

• Solve the LP for x = [x1, …, x6]^T: max 1^T x  s.t.  Δ x ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e (checked in the sketch below)
• Thm: words_moved = Ω(n⁶/M^(e-1)) = Ω(n⁶/M^(8/7)); attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7)
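
The same linprog check as for matmul, now with the six-row Δ above (a sketch; the printed optimum should match the value e = 15/7 quoted on the slide):

    import numpy as np
    from scipy.optimize import linprog

    delta = np.array([[1, 0, 1, 0, 0, 1],   # A1(i1,i3,i6)
                      [1, 1, 0, 1, 0, 0],   # A2(i1,i2,i4)
                      [0, 1, 1, 0, 1, 0],   # A3(i2,i3,i5)
                      [0, 0, 1, 1, 0, 1],   # A3, A4 (i3,i4,i6)
                      [0, 1, 0, 0, 0, 1],   # A5(i2,i6)
                      [1, 0, 0, 1, 1, 0]])  # A6(i1,i4,i5)
    res = linprog(c=-np.ones(6), A_ub=delta, b_ub=np.ones(6), bounds=(0, None))
    print(-res.fun)   # 15/7 ≈ 2.1429, so words_moved = Ω(n^6 / M^(8/7))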

Approach to generalizing lower bounds
• Matmul:

    for i=1:n, for j=1:n, for k=1:n
      C(i,j) += A(i,k)*B(k,j)

  => for (i,j,k) in S = subset of Z³, access locations indexed by (i,j), (i,k), (k,j)
• General case:

    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func( A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), … )
      D(something else) = func(something else)
      …

  => for (i1,i2,…,ik) in S = subset of Z^k, access locations indexed by "projections", e.g.
      φ_C(i1,i2,…,ik) = (i1+2*i3-i7)
      φ_A(i1,i2,…,ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Can we bound #loop_iterations = #points in S, given bounds on the #points in its images φ_C(S), φ_A(S), …?

General Communication Bound

• Thm: Given a program with array references given by projections φ_j, there is an e ≥ 1 such that
      words_moved = Ω( #iterations / M^(e-1) )
  where e is the value of a linear program:
      minimize e = Σ_j e_j subject to
      rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ/Tao/Carbery/Bennett:
  – Given S a subset of Z^k and group homomorphisms φ_1, φ_2, …, bound |S| in terms of |φ_1(S)|, |φ_2(S)|, …, |φ_m(S)|
• Thm (Christ/Tao/Carbery/Bennett): Given s_1, …, s_m,
      |S| ≤ Π_j |φ_j(S)|^(s_j)

Is this bound attainable? (1/2)

• But first: can we even write it down?
  – One inequality per subgroup H < Z^d, but still finitely many
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest (e.g. when the subscripts are subsets of the indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)

• Depends on the loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of the indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, …
• Ex: linear algebra, n-body, "random code", database join, …
• Conjecture: always attainable (modulo dependencies); work in progress

Ongoing Work

• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  – Lower bounds on communication
  – New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
                                                            78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, A^k x]

• Replace k iterations of y = A·x with [Ax, A²x, …, A^k x] (a naive sketch follows below)
• Example: A tridiagonal, n = 32, k = 3; the slides animate building the rows x, A·x, A²·x, A³·x over the index range 1, 2, 3, 4, …, 32
• Works for any "well-partitioned" A
• Sequential algorithm: sweep through the dependency region in Steps 1, 2, 3, 4, so each block of A is moved from slow to fast memory only once
• Parallel algorithm: Proc 1, Proc 2, Proc 3, Proc 4 each own a block of rows
  – Each processor communicates once with its neighbors
  – Each processor works on an (overlapping) trapezoid of the dependency region
• The same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
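
The kernel's interface is just this naive NumPy sketch; the CA implementations return the same vectors but use the blocked schedule described above, so that A is read once (sequential) or with a single neighbor exchange (parallel):

    import numpy as np

    def matrix_powers(A, x, k):
        V = []
        v = x
        for _ in range(k):
            v = A @ v            # k separate SpMVs: the communication we want to avoid
            V.append(v)
        return V                 # [A·x, A²·x, ..., A^k·x]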

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax - b ||_2

Standard GMRES:

    for i = 1 to k
      w = A · v(i-1)                 … SpMV
      MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt; update v(i), H
    endfor
    solve LSQ problem with H

Communication-avoiding GMRES (a TSQR sketch follows after this slide):

    W = [ v, Av, A²v, …, A^k v ]
    [Q,R] = TSQR(W)                  … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

Sequential case: #words moved decreases by a factor of k.
Parallel case: #messages decreases by a factor of k.

• Oops – W comes from the power method, so precision is lost!
                                                            92
• The "monomial" basis [Ax, …, A^k x] fails to converge; a different polynomial basis [p1(A)x, …, pk(A)x] does converge.
                                                            93
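
A two-level TSQR sketch with NumPy, of the kind used in the CA-GMRES pseudocode above; the block count here is arbitrary, and a real implementation uses a reduction tree across processors or cache blocks:

    import numpy as np

    def tsqr_r(W, nblocks=4):
        # Local QR of each row block, then one QR of the stacked small R factors.
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(b, mode='r') for b in blocks]
        return np.linalg.qr(np.vstack(Rs), mode='r')   # equals R of W up to signs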

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab with Residual Replacement (RR), à la Van der Vorst and Ye:

    Basis:               Naive     Monomial                               Newton         Chebyshev
    Replacement its:     74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)   [67, 98] (2)   68 (1)

Tuning space for Krylov Methods

• Classification of sparse operators for avoiding communication (rows: nonzero entries; columns: indices):

                                   Indices explicit (O(nnz))   Indices implicit (o(nnz))
    Entries explicit (O(nnz)):     CSR and variations          Vision, climate, AMR, …
    Entries implicit (o(nnz)):     Graph Laplacian             Stencils

  – Explicit indices or nonzero entries cause most of the communication, along with the vectors
  – Ex: With stencils (all implicit), all communication is for the vectors
• Operations:
  – [x, Ax, A²x, …, A^k x] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, A^k x] and [y, A^T y, (A^T)²y, …, (A^T)^k y], or [y, A^T A y, (A^T A)²y, …, (A^T A)^k y]
  – Return all vectors or just the last one
• Co-tuning and/or interleaving:
  – W = [x, Ax, A²x, …, A^k x] and TSQR(W), or W^T W, or …
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014: www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014: www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100-page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so the summation order makes results non-reproducible (see the two-line example below).
• First release of the ReproBLAS:
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
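
A two-line illustration of the nonassociativity that ReproBLAS works around (the values are chosen so the rounding is visible):

    xs = [1.0, 1e16, -1e16]
    print(sum(xs))             # 0.0: the 1.0 is absorbed by 1e16 before the cancellation
    print(sum(reversed(xs)))   # 1.0: the cancellation happens first, so the 1.0 survives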

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 55: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz appeared in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 56: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

In practice how to best interleaveBFS and DFS isa ldquotuning parameterrdquo

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]

• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A

[Figure: the vectors x, A·x, A²·x, A³·x drawn over column indices 1, 2, 3, 4, …, 32; the animation builds up their entries one region at a time.]

• Sequential Algorithm: compute the blocks one after another (Step 1, Step 2, Step 3, Step 4)
• Parallel Algorithm: the columns are split across Proc 1, Proc 2, Proc 3, Proc 4
   – Each processor communicates once with neighbors
   – Each processor works on an (overlapping) trapezoid

Same idea works for general sparse matrices:
   – Simple block-row partitioning → (hyper)graph partitioning
   – Top-to-bottom processing → Traveling Salesman Problem
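A minimal NumPy/SciPy sketch (not from the slides) of what the monomial matrix powers basis computes for the tridiagonal example above; the blocking and ghost-zone bookkeeping that makes the real kernel communication-avoiding is omitted, and the matrix entries are illustrative.

    import numpy as np
    import scipy.sparse as sp

    n, k = 32, 3                       # sizes from the example above
    A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format='csr')
    x = np.ones(n)

    V = np.empty((k + 1, n))           # rows: x, A·x, A²·x, A³·x
    V[0] = x
    for j in range(1, k + 1):          # k SpMVs; the CA kernel reorganizes these so
        V[j] = A @ V[j - 1]            # each block of rows is read from slow memory once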

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing || Ax − b ||₂

• Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)                … SpMV
      MGS(w, v(0), …, v(i-1))       … Modified Gram-Schmidt; update v(i), H
   endfor
   solve LSQ problem with H

• Communication-avoiding GMRES (a small TSQR sketch follows below):
   W = [ v, Av, A²v, …, Aᵏv ]
   [Q,R] = TSQR(W)                  … "Tall Skinny QR"
   build H from R
   solve LSQ problem with H
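As a concrete illustration of the TSQR step, here is a hedged one-level (flat reduction tree) sketch in NumPy; num_blocks and the random test matrix are illustrative, and a production TSQR would run the local QRs on different processors or cache blocks and combine the R factors along a tree.

    import numpy as np

    def tsqr_R(W, num_blocks=4):
        # One reduction level: local QR per row block, then QR of the stacked R's.
        blocks = np.array_split(W, num_blocks, axis=0)
        R_stack = np.vstack([np.linalg.qr(B, mode='r') for B in blocks])
        return np.linalg.qr(R_stack, mode='r')

    W = np.random.rand(1 << 12, 8)                           # tall and skinny
    R_direct = np.linalg.qr(W, mode='r')
    assert np.allclose(np.abs(tsqr_R(W)), np.abs(R_direct))  # agree up to row signs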

• Oops – W from the power method: precision lost!

92

• Sequential case: # words moved decreases by a factor of k
• Parallel case: # messages decreases by a factor of k

• "Monomial" basis [Ax, …, Aᵏx] fails to converge

• A different polynomial basis [p₁(A)x, …, pₖ(A)x] does converge (sketch below)

93
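A hedged sketch of one such basis (a Newton-type basis vⱼ₊₁ = (A − θⱼI)·vⱼ, scaled): the shifts below are placeholders chosen only for illustration, whereas a real CA-GMRES/CA-BiCGStab code would use (Leja-ordered) Ritz-value estimates.

    import numpy as np
    import scipy.sparse as sp

    def newton_basis(A, x, shifts):
        V = [x / np.linalg.norm(x)]
        for theta in shifts:                        # v_{j+1} = (A - theta_j I) v_j
            v = A @ V[-1] - theta * V[-1]
            V.append(v / np.linalg.norm(v))         # scale to limit growth/decay
        return np.column_stack(V)

    n, k = 32, 3
    A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format='csr')
    W = newton_basis(A, np.ones(n), shifts=[0.5, 2.0, 3.5])   # placeholder shifts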

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

   Basis              Naive     Monomial                                Newton          Chebyshev
   Replacement Its.   74 (1)    [7, 15, 24, 31, …, 92, 97, 103] (17)    [67, 98] (2)    68 (1)

With Residual Replacement (RR) à la Van der Vorst and Ye

Tuning space for Krylov Methods

                          Nonzero entries:
   Indices               Explicit (O(nnz))      Implicit (o(nnz))
   Explicit (O(nnz))     CSR and variations     Vision, climate, AMR, …
   Implicit (o(nnz))     Graph Laplacian        Stencils

• Classifications of sparse operators for avoiding communication
• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: With stencils (all implicit), all communication is for vectors (see the sketch below)
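As a small illustration of two corners of this table (not from the slides), the sketch below applies the same 1-D Laplacian both as an explicit CSR matrix, whose index and value arrays must be moved, and as an implicit stencil with no stored matrix; sizes and values are illustrative.

    import numpy as np
    import scipy.sparse as sp

    n = 32
    x = np.random.rand(n)

    # Explicit indices and entries: CSR stores O(nnz) index/value data
    A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format='csr')
    y_csr = A @ x

    # Implicit: the same operator as a stencil -- only the vector is touched
    y_sten = 2.0 * x
    y_sten[:-1] -= x[1:]       # -1 on the superdiagonal
    y_sten[1:]  -= x[:-1]      # -1 on the subdiagonal

    assert np.allclose(y_csr, y_sten)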

• Operations:
   – [x, Ax, A²x, …, Aᵏx] or [x, p₁(A)x, p₂(A)x, …, pₖ(A)x]
   – Number of columns in x
   – [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
   – Return all vectors or just the last one?

• Cotuning and/or interleaving:
   – W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
   – Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice

• Lots of other progress, open problems:
   – Many different algorithms reorganized
      • More underway, more to be done
   – Need to recognize stable variants more easily
   – Preconditioning
      • Hierarchically Semiseparable Matrices
   – Autotuning and synthesis
      • pOSKI for SpMV – available at bebop.cs.berkeley.edu
      • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
   – Live broadcast in Spring 2014
      • www.cs.berkeley.edu/~demmel
   – On-line version planned in Spring 2014
      • www.xsede.org
      • Free supercomputer accounts to do homework
      • University credit with local instructors
   – 3-day short course every August
• ~100-page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
   – Not even on your multicore laptop!

• Floating point addition is nonassociative, so the summation order is not reproducible (a small example follows below)

• First release of the ReproBLAS
   – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
   – Sequential and distributed memory (MPI)

• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
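A tiny stand-alone demonstration (standard Python floats, not ReproBLAS itself) of why the summation order matters:

    a, b, c = 1.0e16, 1.0, 1.0
    print((a + b) + c)    # 1e+16                  -- each 1.0 is rounded away
    print(a + (b + c))    # 1.0000000000000002e+16 -- same summands, different grouping

    import random
    data = [random.uniform(-1.0, 1.0) for _ in range(100000)]
    print(sum(data) == sum(sorted(data)))   # typically False: order changes the result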

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 57: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

57

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 58: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Symmetric Band Reduction

bull Grey Ballard and Nick Knightbull A QAQT = T where

ndash A=AT is bandedndash T tridiagonalndash Similar idea for SVD of a band matrix

bull Use alone or as second phase when A is densendash Dense Banded Tridiagonal

bull Implemented in LAPACKrsquos sytrdbull Algorithm does not satisfy communication lower bound

theorem for applying orthogonal transformationsndash It can communicate even less

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 59: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Conventional vs CA - SBR

Conventional Communication-Avoiding

Many tuning parametersRight choices reduce words_moved by factor Mbw not just M12

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs LAPACKrsquos DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Sequential Algorithm, shown as Steps 1 through 4
• Example: A tridiagonal, n = 32, k = 3

[Figure: the same tridiagonal picture, built up over Steps 1, 2, 3, 4]

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Parallel Algorithm
• Example: A tridiagonal, n = 32, k = 3
• Each processor communicates once with neighbors
• Each processor works on an (overlapping) trapezoid (see the sketch below)

[Figure: the same tridiagonal picture, partitioned among Proc 1, Proc 2, Proc 3, Proc 4]
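A minimal sketch of that trapezoid idea for this tridiagonal example (NumPy; the function name and the constant stencil coefficients are illustrative assumptions, not code from the talk). Each block owner fetches k ghost values from each neighbor once, then computes its rows of Ax, A^2x, …, A^kx with no further communication, at the price of some redundant work near the block boundaries.

```python
import numpy as np

def matrix_powers_block(x_local, left_ghost, right_ghost, k, a=1.0, b=-2.0, c=1.0):
    """Local rows of [A·x, A^2·x, ..., A^k·x] for a tridiagonal A with constant
    sub/main/super-diagonal entries (a, b, c), for an interior block of rows.
    left_ghost and right_ghost hold the k neighboring entries of x on each side,
    obtained with one message per neighbor."""
    v = np.concatenate([left_ghost, x_local, right_ghost])  # length n_local + 2k
    n_local = len(x_local)
    out = []
    for step in range(1, k + 1):
        w = np.zeros_like(v)
        w[1:-1] = a * v[:-2] + b * v[1:-1] + c * v[2:]      # apply the stencil locally
        v = w                    # entries within `step` of the ends are now stale ...
        out.append(v[k:k + n_local].copy())   # ... but the owned middle entries stay exact
    return out                   # [local part of A·x, ..., local part of A^k·x]
```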

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Same idea works for general sparse matrices
• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k·b} minimizing ||Ax − b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)               … SpMV
    MGS(w, v(0), …, v(i-1))      … Modified Gram–Schmidt; update v(i), H
  endfor
  solve least squares problem with H

Communication-avoiding GMRES

  W = [v, Av, A^2v, …, A^kv]
  [Q, R] = TSQR(W)               … "Tall Skinny QR"
  build H from R
  solve least squares problem with H
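A minimal sketch of the TSQR step (NumPy; a sequential, flat one-level stand-in for the reduction-tree versions discussed earlier, with tsqr_r as an illustrative name): each leaf does a local QR of its row block and only the small R factors are combined, so the tall matrix W itself never has to be revisited or communicated.

```python
import numpy as np

def tsqr_r(W, num_blocks=4):
    """R factor of a tall-skinny W via local QRs of row blocks plus one small QR."""
    blocks = np.array_split(W, num_blocks, axis=0)              # independent row blocks
    stacked_R = np.vstack([np.linalg.qr(B, mode='r') for B in blocks])
    return np.linalg.qr(stacked_R, mode='r')                    # combine the small R's

# In CA-GMRES above: W = [v, A·v, ..., A^k·v]; R = tsqr_r(W); build H from R.
```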

• Oops – W from power method, precision lost!

Sequential case: #words moved decreases by a factor of k
Parallel case: #messages decreases by a factor of k

"Monomial" basis [Ax, …, A^kx] fails to converge

A different polynomial basis [p_1(A)x, …, p_k(A)x] does converge
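A minimal sketch (NumPy) of building the basis with shifted, Newton-style polynomials instead of monomials; the function name and the particular shifts are illustrative placeholders (in practice the shifts come from Ritz value estimates of A), not a prescription from the talk.

```python
import numpy as np

def newton_basis(A, x, shifts):
    """Return [x, p_1(A)x, ..., p_k(A)x] with p_i(A) = (A - shifts[i-1]·I)···(A - shifts[0]·I)."""
    V = [x]
    for theta in shifts:
        V.append(A @ V[-1] - theta * V[-1])   # multiply the newest vector by (A - theta·I)
    return np.column_stack(V)

# The monomial basis is the special case shifts = [0, 0, ..., 0]; well-chosen nonzero
# shifts keep the columns far better conditioned as k grows.
```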

93

Speedups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

  Basis        Replacement its.
  Naive        74 (1)
  Monomial     [7, 15, 24, 31, …, 92, 97, 103] (17)
  Newton       [67, 98] (2)
  Chebyshev    68 (1)

With Residual Replacement (RR), à la Van der Vorst and Ye

Tuning space for Krylov Methods

                              Indices
  Nonzero entries             Explicit (O(nnz))        Implicit (o(nnz))
  Explicit (O(nnz))           CSR and variations       Vision, climate, AMR, …
  Implicit (o(nnz))           Graph Laplacian          Stencils

• Classifications of sparse operators for avoiding communication
• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: With stencils (all implicit), all communication is for vectors

• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p_1(A)x, p_2(A)x, …, p_k(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^T·y, (A^T)^2·y, …, (A^T)^k·y], or [y, A^T·A·y, (A^T·A)^2·y, …, (A^T·A)^k·y]
  • Return all vectors or just the last one

• Co-tuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx] and TSQR(W), or W^T·W, or …
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New: lower bounds, optimal algorithms, big speedups in theory and practice

• Lots of other progress, open problems
  – Many different algorithms reorganized

  • More underway, more to be done

  – Need to recognize stable variants more easily
  – Preconditioning

    • Hierarchically Semiseparable Matrices

  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
• www.xsede.org
  • Free supercomputer accounts to do homework
  • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so the summation order makes the result not reproducible (see the tiny example below)
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
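A tiny, self-contained illustration (plain Python doubles) of why the order of a summation changes the answer:

```python
vals = [1e16, 1.0, -1e16, 1.0]
left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]   # 1.0 is absorbed by 1e16 -> result 1.0
reordered     = (vals[0] + vals[2]) + (vals[1] + vals[3])   # large terms cancel first -> result 2.0
print(left_to_right, reordered)                             # different answers from the same data
```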

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

61

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 62: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Performance of 25D APSP using Kleene

62

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 63

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

64

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

[A sequence of animation slides (79–91 in the deck) steps through a figure whose rows are labeled x, A·x, A^2·x, A^3·x and whose columns are labeled 1, 2, 3, 4, …, 32; only the text content is kept here.]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A

• Sequential Algorithm: the blocks are processed in Steps 1, 2, 3, 4, computing the reachable entries of all k vectors in one pass over A and x

• Parallel Algorithm (Proc 1, Proc 2, Proc 3, Proc 4):
  • Each processor communicates once with neighbors
  • Each processor works on (overlapping) trapezoid

Same idea works for general sparse matrices:
  • Simple block-row partitioning → (hyper)graph partitioning
  • Top-to-bottom processing → Traveling Salesman Problem
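To make the ghost-zone idea concrete, here is a small sketch (mine, not from the tutorial) in numpy; it assumes a banded A stored densely and a hypothetical helper local_matrix_powers, in which each of 4 "processors" reads its overlapping trapezoid of x once and then computes its rows of A·x, A^2·x, A^3·x with no further communication.

import numpy as np

def local_matrix_powers(A, x, lo, hi, k, bw=1):
    """One processor's part of the matrix powers kernel [A·x, ..., A^k·x]
    for a banded A (bandwidth bw), owning rows lo..hi-1 of every output.
    It reads the ghost region once (rows lo-k*bw .. hi+k*bw-1) and then
    computes locally, mimicking the one-exchange parallel algorithm."""
    n = A.shape[0]
    g_lo, g_hi = max(0, lo - k * bw), min(n, hi + k * bw)
    v = x[g_lo:g_hi].copy()                  # ghosted copy of x
    out = []
    for j in range(1, k + 1):
        # rows of A^j x still computable from the current ghosted vector
        c_lo, c_hi = max(0, lo - (k - j) * bw), min(n, hi + (k - j) * bw)
        v = A[c_lo:c_hi, g_lo:g_hi] @ v
        g_lo, g_hi = c_lo, c_hi
        out.append((c_lo, v.copy()))
    # return only the owned rows lo..hi-1 of each power
    return [w[lo - c_lo: lo - c_lo + (hi - lo)] for c_lo, w in out]

# Tiny check against the straightforward computation (tridiagonal A, n=32, k=3).
n, k = 32, 3
rng = np.random.default_rng(0)
A = np.diag(rng.random(n)) + np.diag(rng.random(n - 1), 1) + np.diag(rng.random(n - 1), -1)
x = rng.random(n)
blocks = [local_matrix_powers(A, x, p * 8, (p + 1) * 8, k) for p in range(4)]
for j in range(k):
    assembled = np.concatenate([b[j] for b in blocks])
    assert np.allclose(assembled, np.linalg.matrix_power(A, j + 1) @ x)
print("matrix powers kernel blocks match A^j x")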

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax − b||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                 … SpMV
    MGS(w, v(0), …, v(i-1))        … Modified Gram-Schmidt; update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES

  W = [ v, Av, A^2v, …, A^kv ]
  [Q,R] = TSQR(W)                  … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!
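TSQR is the communication-avoiding kernel invoked above. A minimal flat-tree sketch of it (my own numpy illustration, not the tuned TSQR of the tutorial; tsqr and nblocks are names I introduce here) factors each row block locally and then does one QR of the stacked small R factors.

import numpy as np

def tsqr(W, nblocks=4):
    """Flat-tree Tall Skinny QR sketch: local QR of each row block, then one
    QR of the stacked n x n R factors. Returns R only; the implicit Q is the
    product of the block Q's with the final Q."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]   # local QRs
    return np.linalg.qr(np.vstack(Rs), mode='r')       # one reduction step

# Check: R from TSQR matches R from a direct QR up to row signs.
rng = np.random.default_rng(2)
W = rng.standard_normal((10_000, 6))
assert np.allclose(np.abs(tsqr(W)), np.abs(np.linalg.qr(W, mode='r')))
print("TSQR R factor matches direct QR (up to signs)")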

Sequential case: # words moved decreases by a factor of k. Parallel case: # messages decreases by a factor of k.

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p1(A)x, …, pk(A)x] does converge
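The basis issue is easy to see numerically. Below is a small sketch (my own, using an assumed 1D-Laplacian-like test matrix with spectrum in (0, 4)) comparing the condition number of the s-step basis W built from the monomial basis with one built from a Chebyshev basis shifted and scaled to the spectrum; the ill-conditioned monomial W is what makes the QR of W lose the information GMRES needs.

import numpy as np

n, k = 200, 12
A = np.diag(2 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
rng = np.random.default_rng(1)
x = rng.standard_normal(n)
x /= np.linalg.norm(x)

# Monomial basis [x, Ax, ..., A^k x]
Wm = np.empty((n, k + 1))
Wm[:, 0] = x
for j in range(1, k + 1):
    Wm[:, j] = A @ Wm[:, j - 1]

# Chebyshev basis on [0, 4]: T_0 = x, T_1 = ((A - c I)/h) x,
# T_{j+1} = 2 ((A - c I)/h) T_j - T_{j-1}, with center c = 2, half-width h = 2.
c, h = 2.0, 2.0
Wc = np.empty((n, k + 1))
Wc[:, 0] = x
Wc[:, 1] = (A @ x - c * x) / h
for j in range(1, k):
    Wc[:, j + 1] = 2 * (A @ Wc[:, j] - c * Wc[:, j]) / h - Wc[:, j - 1]

# The monomial basis is typically orders of magnitude worse conditioned.
print("cond(monomial W)  =", np.linalg.cond(Wm))
print("cond(Chebyshev W) =", np.linalg.cond(Wc))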

93

Speedups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Replacement iterations, with Residual Replacement (RR) à la Van der Vorst and Ye:
  Naive:      74 (1)
  Monomial:   [7, 15, 24, 31, …, 92, 97, 103] (17)
  Newton:     [67, 98] (2)
  Chebyshev:  68 (1)

Tuning space for Krylov Methods

                            Nonzero entries:
  Indices:                  Explicit (O(nnz))       Implicit (o(nnz))
  Explicit (O(nnz))         CSR and variations      Vision, climate, AMR, …
  Implicit (o(nnz))         Graph Laplacian         Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors

• Operations:
  • [x, Ax, A^2x, …, A^kx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  • Return all vectors or just the last one

• Co-tuning and/or interleaving:
  • W = [x, Ax, A^2x, …, A^kx] and TSQR(W) or W^T·W or …
  • Ditto, but throw away W

• Preconditioned versions

Summary of Iterative Linear Algebra

• New: Lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative; summation order not reproducible (see the sketch after this list)
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
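A two-line illustration of that nonassociativity (my own example, not from the ReproBLAS release): the same four summands produce different sums under different reduction orders, which is exactly the order-dependence that reproducible summation has to remove.

# Floating-point addition is not associative, so a parallel sum depends on
# the reduction tree even though the data are identical.
vals = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]   # 1.0: the 1.0 added to 1e16 is lost
pairwise      = (vals[0] + vals[2]) + (vals[1] + vals[3])   # 2.0: exact in this order

print(left_to_right, pairwise)   # 1.0 2.0 -- same data, different sums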

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Summary of Direct Linear Algebrabull New lower bounds optimal algorithms big speedups in theory and practicebull Lots of ongoing work on

ndash Algorithms bull LDLT QR with pivoting other pivoting schemes eigenproblems hellip bull All-pairs-shortest-path hellipbull Both 2D (c=1) and 25D (cgt1) bull But only bandwidth may decrease with cgt1 not latencybull Sparse matrices

ndash Platforms bull Multicore cluster GPU cloud heterogeneous low-energy hellip

ndash Software bull Integration into ScaLAPACK PLASMA MAGMAhellip

bull Integration of CTF into quantum chemistryDFT applications ndash Aquarius with ANL UT Austin on IBM BGQ Cray XC30ndash Qbox with LLNL IBM on IBM BGQndash Q-Chem work in progress

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Recall optimal sequential Matmul

bull Naiumlve code for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj)

bull ldquoBlockedrdquo code for i1 = 1bn for j1 = 1bn for k1 = 1bn for i2 = 0b-1 for j2 = 0b-1 for k2 = 0b-1 i=i1+i2 j = j1+j2 k = k1+k2 C(ij)+=A(ik)B(kj)

bull Thm Picking b = M12 attains lower bound words_moved = Ω(n3M12)bull Where does 12 come from

b x b matmul

New Thm applied to Matmulbull for i=1n for j=1n for k=1n C(ij) += A(ik)B(kj)bull Record array indices in matrix Δ

bull Solve LP for x = [xixjxk]T max 1Tx st Δ x le 1ndash Result x = [12 12 12]T 1Tx = 32 = e

bull Thm words_moved = Ω(n3Me-1)= Ω(n3M12) Attained by block sizes MxiMxjMxk = M12M12M12

i j k1 0 1 A

Δ = 0 1 1 B1 1 0 C

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

Page 69: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

New Thm applied to Direct N-Bodybull for i=1n for j=1n F(i) += force( P(i) P(j) )bull Record array indices in matrix Δ

bull Solve LP for x = [xixj]T max 1Tx st Δ x le 1ndash Result x = [11] 1Tx = 2 = e

bull Thm words_moved = Ω(n2Me-1)= Ω(n2M1) Attained by block sizes MxiMxj = M1M1

i j1 0 F

Δ = 1 0 P(i)0 1 P(j)

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Replace k iterations of y = A·x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A

(Figure, repeated across several animation frames in the original slides: the dependency diagram for the tridiagonal example, with rows x, A·x, A^2·x, A^3·x over columns 1, 2, 3, 4, …, 32.)

• Sequential Algorithm: the columns are processed block by block, in Step 1, Step 2, Step 3, Step 4; each block is carried through all k levels (A·x, A^2·x, A^3·x) before the next block is touched, so each piece of A and x is read from slow memory only once.
• Parallel Algorithm: the columns are split across Proc 1, Proc 2, Proc 3, Proc 4; each processor works on an (overlapping) trapezoid of the diagram and communicates once with its neighbors. A minimal code sketch of this blocking follows.
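To make the trapezoid idea concrete, here is a minimal NumPy sketch; it is not code from the tutorial, and the function names and the (1, 2, 1) tridiagonal stencil are illustrative assumptions. Each block owner takes its slice of x plus k ghost entries on each side once, then computes its piece of A·x, …, A^k·x with no further communication.

```python
import numpy as np

def tridiag_matvec(x):
    """y = A*x for an illustrative tridiagonal A with stencil (1, 2, 1)."""
    y = 2.0 * x
    y[:-1] += x[1:]
    y[1:] += x[:-1]
    return y

def matrix_powers(x, k, n_blocks):
    """Compute [A*x, A^2*x, ..., A^k*x] one block of columns at a time.

    Each block reads its part of x plus k 'ghost' entries on each side once
    (one exchange with each neighbor in the parallel case) and then computes
    all k levels locally; the overlap costs redundant flops, not extra messages.
    """
    n = len(x)
    bounds = np.linspace(0, n, n_blocks + 1, dtype=int)
    out = np.zeros((k, n))
    for b in range(n_blocks):
        lo, hi = bounds[b], bounds[b + 1]
        glo, ghi = max(0, lo - k), min(n, hi + k)   # block plus ghost region
        v = x[glo:ghi].copy()
        for j in range(k):
            v = tridiag_matvec(v)                   # entries near the cut go stale,
            out[j, lo:hi] = v[lo - glo:hi - glo]    # but only outside [lo, hi)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(32)                     # n = 32, k = 3 as in the example
    V = matrix_powers(x, k=3, n_blocks=4)
    ref, v = [], x                                  # reference: 3 plain SpMVs
    for _ in range(3):
        v = tridiag_matvec(v)
        ref.append(v)
    print(np.allclose(V, np.array(ref)))            # True
```

In a distributed-memory setting the ghost copy is the single neighbor exchange; in the sequential setting it is the single pass over each block of A and x.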

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax - b||_2

• Standard GMRES:
    for i = 1 to k
       w = A · v(i-1)               … SpMV
       MGS(w, v(0), …, v(i-1))      … Modified Gram-Schmidt
       update v(i), H
    endfor
    solve LSQ problem with H

• Communication-avoiding GMRES:
    W = [ v, Av, A^2v, …, A^kv ]
    [Q, R] = TSQR(W)                … "Tall Skinny QR"
    build H from R
    solve LSQ problem with H

• Oops – W from power method, precision lost!

Sequential case: #words moved decreases by a factor of k.
Parallel case: #messages decreases by a factor of k.
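Below is a minimal NumPy sketch, an illustration rather than the tutorial's CA-GMRES, of the structural difference: the standard loop interleaves one SpMV with one MGS sweep (a chain of dot products, i.e. latency-bound reductions) per step, while the CA variant builds the whole basis W with one matrix-powers call and orthogonalizes it with a single block QR. Here `np.linalg.qr` stands in for TSQR, the test matrix is made up, and building H from R is omitted.

```python
import numpy as np

def arnoldi_mgs(A, b, k):
    """Standard GMRES kernel: one SpMV plus one MGS sweep per iteration."""
    n = len(b)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = b / np.linalg.norm(b)
    for i in range(k):
        w = A @ Q[:, i]                 # SpMV
        for j in range(i + 1):          # MGS: i+1 dot products (messages)
            H[j, i] = Q[:, j] @ w
            w -= H[j, i] * Q[:, j]
        H[i + 1, i] = np.linalg.norm(w)
        Q[:, i + 1] = w / H[i + 1, i]
    return Q, H

def ca_gmres_basis(A, b, k):
    """CA-GMRES kernel: one matrix-powers call, then one block QR (TSQR stand-in)."""
    W = np.empty((len(b), k + 1))
    W[:, 0] = b / np.linalg.norm(b)
    for j in range(1, k + 1):           # matrix powers kernel [v, Av, ..., A^k v]
        W[:, j] = A @ W[:, j - 1]
    Q, R = np.linalg.qr(W)              # one reduction instead of k MGS sweeps
    return Q, R                         # (H would be built from R; omitted)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, k = 500, 6
    A = np.diag(np.linspace(1.0, 2.0, n)) + 0.01 * rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    Q1, _ = arnoldi_mgs(A, b, k)
    Q2, _ = ca_gmres_basis(A, b, k)
    for Q in (Q1, Q2):                  # both close to machine precision here
        print(np.max(np.abs(Q.T @ Q - np.eye(k + 1))))
```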

• "Monomial" basis [Ax, …, A^k x] fails to converge
• A different polynomial basis [p_1(A)x, …, p_k(A)x] does converge
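The convergence failure is already visible in the conditioning of the basis. The sketch below is illustrative only (the diagonal test matrix and its spectrum are assumptions): it compares the condition number of the monomial basis with a scaled-and-shifted Chebyshev basis [p_1(A)x, …, p_k(A)x] built from the three-term recurrence, one of the bases named above.

```python
import numpy as np

def monomial_basis(A, x, k):
    """W = [x, Ax, A^2 x, ..., A^k x]."""
    W = np.empty((len(x), k + 1))
    W[:, 0] = x
    for j in range(1, k + 1):
        W[:, j] = A @ W[:, j - 1]
    return W

def chebyshev_basis(A, x, k, a, b):
    """W = [T_0(B)x, ..., T_k(B)x] with B = (2A - (a+b)I)/(b-a); [a, b] encloses the spectrum."""
    W = np.empty((len(x), k + 1))
    W[:, 0] = x
    Bx = lambda v: (2.0 * (A @ v) - (a + b) * v) / (b - a)
    if k >= 1:
        W[:, 1] = Bx(x)
    for j in range(2, k + 1):
        W[:, j] = 2.0 * Bx(W[:, j - 1]) - W[:, j - 2]
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, k = 400, 12
    eigs = np.linspace(0.1, 10.0, n)          # assumed spectrum
    A = np.diag(eigs)
    x = rng.standard_normal(n)
    for name, W in (("monomial ", monomial_basis(A, x, k)),
                    ("Chebyshev", chebyshev_basis(A, x, k, 0.1, 10.0))):
        W = W / np.linalg.norm(W, axis=0)     # column scaling alone does not help much
        print(name, f"condition number ~ {np.linalg.cond(W):.1e}")
    # The monomial columns all line up with the dominant eigenvector, so that
    # basis becomes numerically rank-deficient; the Chebyshev basis stays far
    # better conditioned, which is what lets the CA variants keep converging.
```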

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
(performance figure in the original slides)

Requires Co-tuning Kernels
(performance figure in the original slides)

CA-BiCGStab

With Residual Replacement (RR), a la Van der Vorst and Ye – replacement iterations (and their number) for each basis:
  Naive:     74 (1)
  Monomial:  [7, 15, 24, 31, …, 92, 97, 103] (17)
  Newton:    [67, 98] (2)
  Chebyshev: 68 (1)

Tuning space for Krylov Methods

• Classification of sparse operators for avoiding communication – nonzero entries and indices are each either explicit (O(nnz) storage) or implicit (o(nnz) storage):

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz))    CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz))    Graph Laplacian             Stencils

• Explicit indices or nonzero entries cause most communication, along with vectors
• Ex: With stencils (all implicit), all communication is for vectors

• Operations:
  – [x, Ax, A^2x, …, A^kx] or [x, p_1(A)x, p_2(A)x, …, p_k(A)x]
  – Number of columns in x
  – [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
  – Return all vectors or just the last one
• Cotuning and/or interleaving:
  – W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
  – Ditto, but throw away W
• Preconditioned versions
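As a concrete illustration of the table's distinction, the sketch below contrasts a CSR matvec, where indices and values are both explicit and O(nnz) data has to move, with a 3-point-stencil matvec, where indices and values are implicit and only the vectors move. The CSR layout and the stencil are assumptions made for this example, not code from the tutorial.

```python
import numpy as np

def csr_matvec(indptr, indices, data, x):
    """Explicit indices and explicit values: O(nnz) index + value data to move."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for p in range(indptr[i], indptr[i + 1]):
            y[i] += data[p] * x[indices[p]]
    return y

def stencil_matvec(x):
    """3-point stencil (-1, 2, -1): indices and values implicit (o(nnz) storage);
    only the vector itself has to move."""
    y = 2.0 * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return y

if __name__ == "__main__":
    n = 8
    cols, vals, counts = [], [], np.zeros(n + 1, dtype=int)
    for i in range(n):                      # build the same operator explicitly in CSR
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                cols.append(j)
                vals.append(v)
                counts[i + 1] += 1
    indptr = np.cumsum(counts)
    x = np.arange(1.0, n + 1.0)
    print(np.allclose(csr_matvec(indptr, cols, vals, x), stencil_matvec(x)))  # True
```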

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100-page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop
• Floating point addition is nonassociative; summation order not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1: independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
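A tiny self-contained illustration of why the order matters and that order-independent summation is possible; this is not the ReproBLAS algorithm (which uses a faster binned pre-rounding scheme), and `math.fsum` is shown only as a simple correctly rounded, order-independent reference.

```python
import math
import numpy as np

if __name__ == "__main__":
    # Floating-point addition is not associative:
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))        # False

    # So a sum depends on the reduction order, e.g. on how data is split
    # across "processors":
    rng = np.random.default_rng(3)
    xs = rng.standard_normal(1_000_000) * 1e8
    serial = float(np.sum(xs))
    chunked = float(np.sum(xs.reshape(1000, -1).sum(axis=1)))   # 1000-way reduction tree
    print(serial == chunked, serial - chunked)            # typically False, a few ulps apart

    # A correctly rounded sum is order-independent:
    print(math.fsum(xs) == math.fsum(xs[::-1]))           # True
```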

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Page 70: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

N-Body Speedups on IBM-BGP (Intrepid)8K cores 32K particles

118x speedup

K Yelick E Georganas M Driscoll P Koanantakool E Solomonik

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 71: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

New Thm applied to Random Codebull for i1=1n for i2=1n hellip for i6=1n A1(i1i3i6) += func1(A2(i1i2i4)A3(i2i3i5)A4(i3i4i6)) A5(i2i6) += func2(A6(i1i4i5)A3(i3i4i6))bull Record array indices in matrix Δ

bull Solve LP for x = [x1hellipx7]T max 1Tx st Δ x le 1ndash Result x = [273717273747] 1Tx = 157 = e

bull Thm words_moved = Ω(n6Me-1)= Ω(n6M87) Attained by block sizes M27M37M17M27M37M47

i1 i2 i3 i4 i5 i61 0 1 0 0 1 A1

1 1 0 1 0 0 A2

Δ = 0 1 1 0 1 0 A3

0 0 1 1 0 1 A3A4

0 0 1 1 0 1 A5

1 0 0 1 1 0 A6

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 72: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Approach to generalizing lower boundsbull Matmul for i=1n for j=1n for k=1n C(ij)+=A(ik)B(kj) =gt for (ijk) in S = subset of Z3

Access locations indexed by (ij) (ik) (kj)bull General case for i1=1n for i2 = i1m hellip for ik = i3i4 C(i1+2i3-i7) = func(A(i2+3i4i1i2i1+i2hellip)B(pnt(3i4))hellip) D(something else) = func(something else) hellip =gt for (i1i2hellipik) in S = subset of Zk

Access locations indexed by ldquoprojectionsrdquo eg φC (i1i2hellipik) = (i1+2i3-i7)

φA (i1i2hellipik) = (i2+3i4i1i2i1+i2hellip) hellip

bull Can we bound loop_iterations points in S given bounds on points in its images φC (S) φA (S) hellip

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 73: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

General Communication Bound

bull Thm Given a program with array refs given by projections φj then there is an e ge 1 such that

words_moved = Ω (iterationsMe-1) where e is the the value of a linear program minimize e = Σj ej subject to

rank(H) le Σj ejrank(φj(H)) for all subgroups H lt Zk

bull Proof depends on recent result in pure mathematics by ChristTaoCarberyBennett

bull Given S subset of Zk group homomorphisms φ1 φ2 hellip bound |S| in terms of |φ1(S)| |φ2(S)| hellip |φm(S)|

bull Thm (ChristTaoCarberyBennett) Given s1hellipsm

|S| le Πj |φj(S)|sj

Is this bound attainable (12)

bull But first Can we write it downndash One inequality per subgroup H lt Zd but still finitely manyndash Thm (bad news) Writing down all inequalities in LP reduces

to Hilbertrsquos 10th problem over Q bull Could be undecidable open question

ndash Thm (good news) Another LP has same solution is decidable (but expensive so far)

ndash Thm (better news) Easy to write LP down explicitly in many cases of interest (eg when subscript are subsets of indices)

ndash Also easy to get upperlower bounds on e

bull Tarski-decidable to get superset of constraints (may get sHBL too large)

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods
• Classification of sparse operators for avoiding communication:

                               Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Nonzero entries explicit:    CSR and variations           Vision, climate, AMR, …
  Nonzero entries implicit:    Graph Laplacian              Stencils

  – Explicit indices or nonzero entries cause most communication, along with vectors
  – Ex: with stencils (all implicit), all communication is for vectors
• Operations:
  – [x, Ax, A²x, …, Aᵏx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  – Return all vectors or just the last one
• Cotuning and/or interleaving:
  – W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  – Ditto, but throw away W
• Preconditioned versions

Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems:
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details
• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August
• ~100 page survey article nearly done…

Reproducible Floating Point Computation
• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative, so the summation order is not reproducible (see the sketch below)
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
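
A tiny Python illustration of the nonassociativity point. This is not how ReproBLAS works internally; math.fsum is only a convenient stand-in for "an answer that does not depend on summation order".

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    x = (rng.standard_normal(100_000) * 10.0 ** rng.integers(-8, 9, 100_000)).tolist()

    print(sum(x) == sum(reversed(x)))              # usually False: order changes the rounded sum
    print(math.fsum(x) == math.fsum(reversed(x)))  # True: an exactly rounded sum is order-independent
    # ReproBLAS gets order-independence with ordinary floating point and a small fixed
    # number of accumulator "bins" per sum, so it also works across reduction trees and MPI.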

Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers…)

Page 74: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Is this bound attainable? (1/2)
• But first: can we write it down?
  – One inequality per subgroup H < Zᵈ, but still finitely many!
  – Thm (bad news): writing down all the inequalities in the LP reduces to Hilbert's 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): easy to write the LP down explicitly in many cases of interest (e.g., when subscripts are subsets of indices)
  – Also easy to get upper/lower bounds on e
• Tarski-decidable to get a superset of the constraints (may get s_HBL too large)

Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: when all subscripts are subsets of the indices, the solution x of the dual LP gives optimal tile sizes M^x1, M^x2, … (a worked matmul example follows below)
• Ex: linear algebra, n-body, "random code", join, …
• Conjecture: always attainable (modulo dependencies) – work in progress
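
As a worked instance of the theorem above (illustrative only, and assuming SciPy is available), the dual LP for 3-nested-loop matrix multiply C(i,j) += A(i,k)·B(k,j) is small enough to write down and solve directly; the optimal exponents come out to 1/2 per index, i.e. M^(1/2) × M^(1/2) × M^(1/2) tiles and the familiar Ω(n³ / M^(1/2)) words-moved lower bound.

    from scipy.optimize import linprog

    # Matmul: each array and the loop indices appearing in its subscript
    subscripts = {"C": ("i", "j"), "A": ("i", "k"), "B": ("k", "j")}
    indices = ("i", "j", "k")

    # maximize x_i + x_j + x_k  subject to  sum of x over each array's subscripts <= 1,  x >= 0
    A_ub = [[1.0 if v in subs else 0.0 for v in indices] for subs in subscripts.values()]
    res = linprog(c=[-1.0] * 3, A_ub=A_ub, b_ub=[1.0] * 3, bounds=[(0, None)] * 3)

    print("tile exponents x =", res.x)   # [0.5, 0.5, 0.5]: tile each loop index by M^0.5
    print("s_HBL =", -res.fun)           # 1.5: words moved = Omega(#iterations / M^(s_HBL - 1))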

Ongoing Work
• Identify more decidable cases
  – Works for any 3 nested loops, or 3 different subscripts
• Automate generation of approximate LPs
• Extend "perfect scaling" results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound – can we prove this?
• Incorporate into compilers

Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g., n-body)
• Ditto for "Iterative" Linear Algebra

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation (a toy cost model is sketched below)
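
To see roughly where the savings come from, here is a toy back-of-the-envelope model in Python using the time ≈ #messages·latency + #words·(1/bandwidth) + #flops·time_per_flop decomposition from the start of the tutorial. Every machine parameter is a made-up placeholder and the message-count formulas are deliberately crude; the only point is that the latency term shrinks by roughly a factor of k while the bandwidth term stays put.

    from math import ceil, log2

    alpha, beta = 1e-6, 1e-9      # placeholder latency (s/message) and inverse bandwidth (s/word)

    def t_conventional(k, p, words_per_step):
        # k separate steps: each does an SpMV halo exchange plus a dot-product reduction,
        # so roughly O(k log p) messages in total
        messages = k * (1 + ceil(log2(p)))
        return messages * alpha + k * words_per_step * beta

    def t_comm_avoiding(k, p, words_per_step):
        # one matrix-powers exchange plus one TSQR-style reduction: roughly O(log p) messages
        messages = 1 + ceil(log2(p))
        return messages * alpha + k * words_per_step * beta   # ignores the small redundant work

    k, p, words = 8, 4096, 10_000
    print("conventional   :", t_conventional(k, p, words))
    print("comm.-avoiding :", t_comm_avoiding(k, p, words))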

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 75: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Is this bound attainable (22)

bull Depends on loop dependenciesbull Best case none or reductions (matmul)bull Thm When all subscripts are subsets of indices

the solution x of the dual LP gives optimal tile sizes Mx1 Mx2 hellip

bull Ex Linear algebra n-body ldquorandom coderdquo join hellipbull Conjecture always attainable (modulo

dependencies) work in progress

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 76: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Ongoing Work

bull Identify more decidable casesndash Works for any 3 nested loops or 3 different subscripts

bull Automate generation of approximate LPsbull Extend ldquoperfect scalingrdquo results for time and

energy by using extra memorybull Have yet to find a case where we cannot attain

lower bound ndash can we prove thisbull Incorporate into compilers

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 77: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Outlinebull ldquoDirectrdquo Linear Algebra

bull Lower bounds on communication bull New algorithms that attain these lower bounds

bull Ditto for programs accessing arrays (eg n-body)bull Ditto for ldquoIterativerdquo Linear Algebra

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 78: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Goal minimize communicationndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computation 78

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, A^k x]

[Figure: dependency diagram for the running example – rows x, A·x, A²·x, A³·x over columns 1, 2, 3, 4, …, 32; the slides animate how the diagram is covered]

• Replace k iterations of y = A·x with [Ax, A²x, …, A^k x]
• Example: A tridiagonal, n = 32, k = 3
• Works for any "well-partitioned" A

• Sequential Algorithm: Steps 1, 2, 3, 4 sweep the diagram one block of columns at a time, computing all k powers for a block before moving to the next, so each piece of A and x is read from slow memory only once
• Parallel Algorithm (Proc 1, Proc 2, Proc 3, Proc 4):
  – Each processor works on an (overlapping) trapezoid of the diagram
  – Each processor communicates once with its neighbors

• Same idea works for general sparse matrices:
  – Simple block-row partitioning → (hyper)graph partitioning
  – Top-to-bottom processing → Traveling Salesman Problem
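
To make the kernel concrete, here is a minimal Python/NumPy sketch of the tridiagonal running example: a naive version that does k separate SpMVs, and a blocked version that carries a ghost region of width k so each block of A and x is touched only once, mimicking the sequential algorithm above. The function names (tridiag_matvec, matrix_powers_naive, matrix_powers_blocked) and the block size are illustrative choices, not code from the tutorial.

import numpy as np

def tridiag_matvec(lower, diag, upper, x):
    """y = A*x for tridiagonal A given by its three diagonals."""
    y = diag * x
    y[1:] += lower[1:] * x[:-1]      # sub-diagonal contribution
    y[:-1] += upper[:-1] * x[1:]     # super-diagonal contribution
    return y

def matrix_powers_naive(lower, diag, upper, x, k):
    """Reference: k repeated SpMVs, touching all of A (and x) k times."""
    V = [x]
    for _ in range(k):
        V.append(tridiag_matvec(lower, diag, upper, V[-1]))
    return np.array(V[1:])           # rows are A*x, A^2*x, ..., A^k*x

def matrix_powers_blocked(lower, diag, upper, x, k, block=8):
    """Process one block of columns at a time, bringing in a ghost region of
    width k on each side so all k powers for the block can be computed without
    revisiting A (the communication-avoiding reorganization, in miniature)."""
    n = x.size
    V = np.zeros((k, n))
    for start in range(0, n, block):
        end = min(start + block, n)
        lo, hi = max(0, start - k), min(n, end + k)   # ghost region of width k
        xb = x[lo:hi].copy()
        for j in range(k):
            xb = tridiag_matvec(lower[lo:hi], diag[lo:hi], upper[lo:hi], xb)
            V[j, start:end] = xb[start - lo:end - lo]  # owned entries are exact
    return V

if __name__ == "__main__":
    n, k = 32, 3
    rng = np.random.default_rng(0)
    lower, diag, upper = rng.standard_normal((3, n))
    x = rng.standard_normal(n)
    assert np.allclose(matrix_powers_naive(lower, diag, upper, x, k),
                       matrix_powers_blocked(lower, diag, upper, x, k))

The assertion checks that the blocked sweep reproduces the k repeated SpMVs; in the parallel picture the same ghost region is what each processor exchanges with its neighbors, once, before filling in its trapezoid redundantly.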

Minimizing Communication of GMRES to solve Ax=b

• GMRES: find x in span{b, Ab, …, A^k b} minimizing || Ax − b ||_2

Standard GMRES:
  for i = 1 to k
    w = A · v(i−1)                … SpMV
    MGS(w, v(0), …, v(i−1))       … Modified Gram-Schmidt, updates v(i), H
  endfor
  solve least-squares problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A²v, …, A^k v ]
  [Q, R] = TSQR(W)                … "Tall Skinny QR"
  build H from R
  solve least-squares problem with H

• Sequential case: # words moved decreases by a factor of k
• Parallel case: # messages decreases by a factor of k
• Oops – W from the power method, precision lost!
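
A rough NumPy sketch of one such cycle is below, under simplifying assumptions: x0 = 0, dense A, the monomial basis, and np.linalg.qr standing in for TSQR. ca_gmres_cycle is an illustrative name, not code from the tutorial.

import numpy as np

def ca_gmres_cycle(A, b, k):
    # k SpMVs, no inner products: this replaces the k MGS steps that make
    # standard GMRES communicate at every iteration.
    v = b / np.linalg.norm(b)
    W = [v]
    for _ in range(k):
        W.append(A @ W[-1])
    W = np.column_stack(W)                    # n x (k+1), the "tall skinny" matrix

    # One QR factorization of W (TSQR in the parallel algorithm, O(log p) messages).
    Q, R = np.linalg.qr(W)

    # Build H from R: since A @ W[:, :k] = W[:, 1:] for the monomial basis,
    #   A @ Q[:, :k] = W[:, 1:] @ inv(R[:k, :k]) = Q @ (R[:, 1:] @ inv(R[:k, :k])) = Q @ H.
    H = R[:, 1:] @ np.linalg.inv(R[:k, :k])

    # Solve the small least-squares problem in the Q basis, as in standard GMRES.
    rhs = np.linalg.norm(b) * R[:, 0]         # b expressed in the columns of Q
    y, *_ = np.linalg.lstsq(H, rhs, rcond=None)
    return Q[:, :k] @ y                       # approximate solution (x0 = 0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 200, 4
    A = np.diag(np.linspace(1.0, 10.0, n)) + 0.01 * rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    x = ca_gmres_cycle(A, b, k)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # should be well below 1

After the k SpMVs and the single QR, H is rebuilt from the small matrix R with no further touches of A or the long vectors. The precision loss noted above comes from W itself: for larger k the monomial columns become nearly parallel, which motivates the next slide.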

• "Monomial" basis [Ax, …, A^k x] fails to converge
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge
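
For instance, a Newton-type basis builds p_j(A)x from shifted products instead of raw powers. A minimal sketch follows; the choice of shifts (for example, Leja-ordered Ritz values of A) is an assumption typical of this line of work, not something specified on the slide.

import numpy as np

def newton_basis(A, x, shifts):
    """Return [p_1(A)x, ..., p_k(A)x] with p_{j+1}(t) = (t - shifts[j]) p_j(t), p_0 = 1."""
    V = [x]
    for theta in shifts:
        V.append(A @ V[-1] - theta * V[-1])   # one SpMV plus an AXPY per basis vector
    return np.column_stack(V[1:])

The columns span the same Krylov subspace as the monomial basis but stay far better conditioned, which is what restores convergence; a Chebyshev basis uses a three-term recurrence in the same spirit.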

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires Co-tuning Kernels

CA-BiCGStab with Residual Replacement (RR) a la Van der Vorst and Ye: replacement iterations (count in parentheses) for each basis

  Naive:      74 (1)
  Monomial:   [7, 15, 24, 31, …, 92, 97, 103] (17)
  Newton:     [67, 98] (2)
  Chebyshev:  68 (1)

Tuning space for Krylov Methods

                                   Indices:
                                   Explicit (O(nnz))     Implicit (o(nnz))
  Nonzero    Explicit (O(nnz))     CSR and variations    Vision, climate, AMR, …
  entries    Implicit (o(nnz))     Graph Laplacian       Stencils

• Classifications of sparse operators for avoiding communication
  – Explicit indices or nonzero entries cause most communication, along with the vectors
  – Ex: With stencils (all implicit), all communication is for the vectors

• Operations
  – [x, Ax, A²x, …, A^k x] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  – Number of columns in x
  – [x, Ax, A²x, …, A^k x] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)^k y], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)^k y]
  – Return all vectors or just the last one
• Co-tuning and/or interleaving (see the sketch after this list)
  – W = [x, Ax, A²x, …, A^k x] and TSQR(W) or WᵀW or …
  – Ditto, but throw away W
• Preconditioned versions
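
As a concrete, purely illustrative reading of the "interleaving" option, the sketch below streams over row blocks of W = [x, Ax, …, A^k x] for the tridiagonal running example and folds each block straight into the Gram matrix WᵀW, so W is never stored or re-read in full. Function names and the block size are assumptions, not tutorial code.

import numpy as np

def tridiag_matvec(lower, diag, upper, x):
    y = diag * x
    y[1:] += lower[1:] * x[:-1]
    y[:-1] += upper[:-1] * x[1:]
    return y

def krylov_gram_streaming(lower, diag, upper, x, k, block=8):
    """Accumulate G = W^T W for W = [x, Ax, ..., A^k x] one row block at a time."""
    n = x.size
    G = np.zeros((k + 1, k + 1))
    for start in range(0, n, block):
        end = min(start + block, n)
        lo, hi = max(0, start - k), min(n, end + k)        # ghost zone of width k
        cols = [x[lo:hi].copy()]
        for _ in range(k):                                  # local matrix powers
            cols.append(tridiag_matvec(lower[lo:hi], diag[lo:hi],
                                       upper[lo:hi], cols[-1]))
        Wb = np.column_stack([c[start - lo:end - lo] for c in cols])
        G += Wb.T @ Wb                                      # fold block into W^T W
    return G

if __name__ == "__main__":
    n, k = 32, 3
    rng = np.random.default_rng(1)
    lower, diag, upper = rng.standard_normal((3, n))
    x = rng.standard_normal(n)
    W = [x]                                                 # reference: explicit W
    for _ in range(k):
        W.append(tridiag_matvec(lower, diag, upper, W[-1]))
    W = np.column_stack(W)
    assert np.allclose(W.T @ W, krylov_gram_streaming(lower, diag, upper, x, k))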

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
    • More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014: www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
• www.xsede.org
  – Free supercomputer accounts to do homework
  – University credit with local instructors
  – 3-day short course every August
• ~100-page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating-point addition is nonassociative, so the summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
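
A small illustration of the nonassociativity point (plain Python, not ReproBLAS):

import math, random

vals = [1e16, 1.0, -1e16, 1.0] * 1000     # exact sum is 2000.0

print(sum(vals))                          # left-to-right order: 1.0 on a typical IEEE-754 machine

random.seed(0)
random.shuffle(vals)
print(sum(vals))                          # another order: typically a different answer

# An exactly rounded (hence order-independent) reference sum, in the spirit of
# the guarantee a reproducible summation must provide:
print(math.fsum(vals))                    # 2000.0 for any permutation of vals

The ReproBLAS release above provides this kind of order-independence for BLAS 1 operations, regardless of data order, layout, processor count, or reduction tree, as listed in the bullets.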

Summary

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)

Don't Communic…

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 79: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3bull Works for any ldquowell-partitionedrdquo A

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 80: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 81: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 82: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 83: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]

bull Example A tridiagonal n=32 k=3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 84: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A2x, …, Akx]

• Replace k iterations of y = A·x with [Ax, A2x, …, Akx]
• Example: A tridiagonal, n = 32, k = 3
  (figure: entries 1, 2, 3, 4, …, 32 of the vectors x, A·x, A2·x, A3·x)

• Sequential Algorithm: sweep through the entries in Steps 1, 2, 3, 4, computing the reachable parts of A·x, A2·x, A3·x for each block before moving on, so each block of A and x is read from slow memory only once

• Parallel Algorithm (Proc 1, Proc 2, Proc 3, Proc 4):
  • Each processor communicates once with neighbors
  • Each processor works on an (overlapping) trapezoid of the entries
  (a minimal code sketch of this blocking idea follows below)

• Same idea works for general sparse matrices:
  • Simple block-row partitioning → (hyper)graph partitioning
  • Top-to-bottom processing → Traveling Salesman Problem
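A minimal Python/NumPy sketch of the blocking idea above, for the tridiagonal example only. It is an illustration under stated assumptions, not the tutorial's implementation: the function names, the ghost-region bookkeeping, and the choice of 4 blocks standing in for 4 processors are all made up here.

import numpy as np

def tridiag_apply(v):
    # y = A·v for the tridiagonal stencil A = tridiag(-1, 2, -1)
    y = 2.0 * v
    y[:-1] -= v[1:]
    y[1:] -= v[:-1]
    return y

def matrix_powers_kernel(x, k=3, n_blocks=4):
    # Illustrative sketch: return [A·x, A^2·x, ..., A^k·x], computed block by block.
    # Each "processor" (block of rows) reads its part of x plus k ghost entries
    # on each side once, then computes all k products locally.
    n = len(x)
    out = [np.zeros(n) for _ in range(k)]
    block = (n + n_blocks - 1) // n_blocks
    for b in range(n_blocks):
        lo, hi = b * block, min((b + 1) * block, n)
        glo, ghi = max(lo - k, 0), min(hi + k, n)     # ghost region: the one exchange
        local = x[glo:ghi].copy()
        for j in range(k):
            local = tridiag_apply(local)              # the trapezoid shrinks each step,
            out[j][lo:hi] = local[lo - glo:hi - glo]  # but the owned entries stay exact
    return out

x = np.ones(32)
Ax, A2x, A3x = matrix_powers_kernel(x, k=3, n_blocks=4)   # n = 32, k = 3 as on the slide

Each block touches x (and, for a general matrix, its rows of A) once, with k extra ghost entries per side; in a real distributed-memory code that ghost copy is the single neighbor message per processor.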

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Akb} minimizing || Ax - b ||2

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)                … SpMV
    MGS(w, v(0), …, v(i-1))       … Modified Gram-Schmidt; update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [ v, Av, A2v, …, Akv ]
  [Q, R] = TSQR(W)                … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W from power method, precision lost!

• Sequential case: words moved decreases by a factor of k
• Parallel case: messages decreases by a factor of k
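A rough Python/NumPy sketch of the contrast above. It is illustrative only, not the tutorial's CA-GMRES: np.linalg.qr stands in for TSQR, and the Hessenberg bookkeeping and basis scaling a real solver needs are omitted. The standard loop pays communication at every iteration (one SpMV plus the dot products of Modified Gram-Schmidt); the communication-avoiding version builds W = [v, Av, …, Akv] first and orthogonalizes once.

import numpy as np

def standard_arnoldi_basis(A, v, k):
    # k SpMVs, each followed by Modified Gram-Schmidt against all previous
    # vectors: dot products (i.e. communication) at every iteration.
    V = [v / np.linalg.norm(v)]
    for _ in range(k):
        w = A @ V[-1]                              # SpMV
        for q in V:                                # MGS
            w -= (q @ w) * q
        V.append(w / np.linalg.norm(w))
    return np.column_stack(V)

def ca_basis(A, v, k):
    # Matrix powers kernel, then one QR of the tall-skinny block W.
    W = [v]
    for _ in range(k):
        W.append(A @ W[-1])                        # W = [v, Av, A^2·v, ..., A^k·v]
    Q, R = np.linalg.qr(np.column_stack(W))        # TSQR plays this role in CA-GMRES
    return Q, R                                    # H is then built from R

n, k = 100, 8
A = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
v = np.random.default_rng(0).standard_normal(n)
Q, R = ca_basis(A, v, k)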

• "Monomial" basis [Ax, …, Akx] fails to converge
• A different polynomial basis [p1(A)x, …, pk(A)x] does converge
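For instance, a Newton-type basis uses the same access pattern as the powers kernel, just with scalar shifts folded in, so the same communication-avoiding machinery applies. In this illustrative sketch the shifts are simply a rough spread over the example matrix's spectrum; in practice they would be (Leja-ordered) Ritz-value estimates of A's eigenvalues.

import numpy as np

def newton_basis(A, x, shifts):
    # [p1(A)x, ..., pk(A)x] with p_j(z) = (z - theta_{j-1}) · p_{j-1}(z), p_0(z) = 1.
    P = []
    p = x.copy()
    for theta in shifts:                 # illustrative shifts; Ritz values in practice
        p = A @ p - theta * p            # one SpMV plus an axpy per basis vector
        P.append(p.copy())
    return np.column_stack(P)

n = 100
A = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
x = np.random.default_rng(1).standard_normal(n)
W = newton_basis(A, x, shifts=np.linspace(0.1, 3.9, 8))   # eigenvalues of this A lie in (0, 4)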

Speed ups of GMRES on 8-core Intel Clovertown  [MHDY09]
• Requires co-tuning the kernels

CA-BiCGStab, with Residual Replacement (RR) à la Van der Vorst and Ye:

    Basis        Replacement Its.
    Naive        74  (1)
    Monomial     [7, 15, 24, 31, …, 92, 97, 103]  (17)
    Newton       [67, 98]  (2)
    Chebyshev    68  (1)

Tuning space for Krylov Methods

• Classifications of sparse operators for avoiding communication:

                          Nonzero entries:
    Indices:              Explicit (O(nnz))       Implicit (o(nnz))
    Explicit (O(nnz))     CSR and variations      Vision, climate, AMR, …
    Implicit (o(nnz))     Graph Laplacian         Stencils

  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: with stencils (all implicit) all communication is for vectors (see the small CSR-vs-stencil sketch below)

• Operations
  • [x, Ax, A2x, …, Akx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A2x, …, Akx] and [y, ATy, (AT)2y, …, (AT)ky], or [y, ATAy, (ATA)2y, …, (ATA)ky]
  • Return all vectors or just the last one?

• Cotuning and/or interleaving
  • W = [x, Ax, A2x, …, Akx] and {TSQR(W) or WTW or …}
  • Ditto, but throw away W

• Preconditioned versions
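A small sketch of the distinction in the table above (the formats and names are illustrative, not the tutorial's code): CSR carries explicit index and value arrays, so O(nnz) matrix data moves with every product, while a stencil encodes both indices and entries implicitly in code, so only the vectors move.

import numpy as np

def spmv_csr(indptr, indices, data, x):
    # Explicit indices and explicit nonzero entries: O(nnz) matrix data.
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for p in range(indptr[i], indptr[i + 1]):
            y[i] += data[p] * x[indices[p]]
    return y

def spmv_stencil(x):
    # Implicit indices and entries: 1D 3-point stencil, A = tridiag(-1, 2, -1).
    # No matrix data is stored or moved; all communication is for the vectors.
    y = 2.0 * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return y

# Build the same tridiagonal matrix in CSR form and check the two agree.
n = 8
indptr, indices, data = [0], [], []
for i in range(n):
    for j, a in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
        if 0 <= j < n:
            indices.append(j)
            data.append(a)
    indptr.append(len(indices))
x = np.arange(1.0, n + 1)
assert np.allclose(spmv_csr(indptr, indices, data, x), spmv_stencil(x))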

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice

• Lots of other progress, open problems:
  – Many different algorithms reorganized; more underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"

For more details

• bebop.cs.berkeley.edu

• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2014
    • www.cs.berkeley.edu/~demmel
  – On-line version planned in Spring 2014
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • University credit with local instructors
  – 3-day short course every August

• ~100 page survey article nearly done…

Reproducible Floating Point Computation

• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!

• Floating point addition is nonassociative, so the summation order is not reproducible (see the small example below)

• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)

• bebop.cs.berkeley.edu/reproblas
• Workshop at SC'13 later this week
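A three-line Python reminder of that nonassociativity (a generic example, unrelated to the internals of the ReproBLAS):

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                    # 0.6000000000000001
print(a + (b + c))                    # 0.6
print((a + b) + c == a + (b + c))     # False: summation order changes the rounded result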

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 85: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 86: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 87: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 88: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx] bull Sequential Algorithm

bull Example A tridiagonal n=32 k=3

Step 1 Step 2 Step 3 Step 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 89: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor communicates once with neighbors

Proc 1 Proc 2 Proc 3 Proc 4

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 90: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

1 2 3 4 hellip hellip 32

x

Amiddotx

A2middotx

A3middotx

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

bull Replace k iterations of y = Ax with [Ax A2x hellip Akx]bull Parallel Algorithm

bull Example A tridiagonal n=32 k=3bull Each processor works on (overlapping) trapezoid

Proc 1 Proc 2 Proc 3 Proc 4

Same idea works for general sparse matrices

Communication Avoiding KernelsThe Matrix Powers Kernel [Ax A2x hellip Akx]

Simple block-row partitioning (hyper)graph partitioning

Top-to-bottom processing Traveling Salesman Problem

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

Page 91: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Same idea works for general sparse matrices.

Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]

• Simple block-row partitioning → (hyper)graph partitioning
• Top-to-bottom processing → Traveling Salesman Problem
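A minimal NumPy/SciPy sketch of what the kernel produces; this is the naive version, with one SpMV and hence one round of communication per vector, not the blocked, communication-avoiding reorganization referred to above:

    import numpy as np
    import scipy.sparse as sp

    def matrix_powers(A, x, k):
        # Columns of V are [x, Ax, A^2 x, ..., A^k x].
        V = np.empty((x.shape[0], k + 1))
        V[:, 0] = x
        for j in range(1, k + 1):
            V[:, j] = A @ V[:, j - 1]   # one SpMV per column: k rounds of communication
        return V

    # Example: 1D Poisson (tridiagonal stencil) matrix, k = 4
    n, k = 8, 4
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    V = matrix_powers(A, np.ones(n), k)

The communication-avoiding version partitions the rows into blocks, with overlapping "ghost" regions chosen by the (hyper)graph partitioner, so that each block can produce its rows of all k vectors after a single exchange of boundary data.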

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax - b||_2

Standard GMRES:
    for i = 1 to k
        w = A · v(i-1)                 … SpMV
        MGS(w, v(0), …, v(i-1))        … orthogonalize, update v(i), H
    endfor
    solve least-squares (LSQ) problem with H

Communication-avoiding GMRES:
    W = [v, Av, A^2v, …, A^kv]
    [Q, R] = TSQR(W)                   … "Tall Skinny QR"
    build H from R
    solve least-squares (LSQ) problem with H

• Oops – W from power method, precision lost!

Sequential case: # words moved decreases by a factor of k
Parallel case: # messages decreases by a factor of k
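A toy dense NumPy sketch of the contrast; np.linalg.qr stands in for a real parallel TSQR, the matrix is a small illustrative example, and the step that rebuilds H from R is omitted:

    import numpy as np

    def standard_arnoldi(A, b, k):
        # One SpMV plus one global orthogonalization (MGS) per iteration: k synchronizations.
        n = b.shape[0]
        V = np.zeros((n, k + 1))
        H = np.zeros((k + 1, k))
        V[:, 0] = b / np.linalg.norm(b)
        for i in range(k):
            w = A @ V[:, i]                      # SpMV
            for j in range(i + 1):               # Modified Gram-Schmidt
                H[j, i] = V[:, j] @ w
                w -= H[j, i] * V[:, j]
            H[i + 1, i] = np.linalg.norm(w)
            V[:, i + 1] = w / H[i + 1, i]
        return V, H

    def ca_basis_then_qr(A, b, k):
        # CA flavor: build the whole Krylov basis first (matrix powers kernel),
        # then orthogonalize once with a single QR (stand-in for TSQR).
        # Numerically this is the fragile monomial basis flagged above.
        W = np.zeros((b.shape[0], k + 1))
        W[:, 0] = b
        for j in range(1, k + 1):
            W[:, j] = A @ W[:, j - 1]
        Q, R = np.linalg.qr(W)                   # one reduction instead of k
        return Q, R

    A = np.diag(np.arange(1.0, 101.0))           # toy 100 x 100 matrix
    b = np.ones(100)
    V, H = standard_arnoldi(A, b, 8)
    Q, R = ca_basis_then_qr(A, b, 8)

The point of the reorganization is where the synchronization happens: the standard loop pays for a global reduction at every iteration, while the CA version pays once, at the cost of the basis-conditioning problem addressed on the next slide.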

"Monomial" basis [Ax, …, A^kx] fails to converge

Different polynomial basis [p_1(A)x, …, p_k(A)x] does converge
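A small NumPy sketch of why the basis choice matters; the shift values below are crude illustrative placeholders, not the Leja-ordered Ritz-value estimates a real CA-Krylov implementation would use:

    import numpy as np

    def shifted_basis(A, x, k, shifts):
        # Columns are [p_0(A)x, ..., p_k(A)x] with p_{j+1}(z) = (z - shifts[j]) * p_j(z).
        # With all shifts equal to zero this reduces to the monomial basis.
        W = np.zeros((x.shape[0], k + 1))
        W[:, 0] = x
        for j in range(k):
            W[:, j + 1] = A @ W[:, j] - shifts[j] * W[:, j]
        return W

    A = np.diag(np.linspace(1.0, 100.0, 200))    # toy matrix with spectrum in [1, 100]
    x = np.ones(200)
    k = 10
    monomial = shifted_basis(A, x, k, np.zeros(k))
    newton = shifted_basis(A, x, k, np.linspace(1.0, 100.0, k))  # shifts spread over the spectrum

    print(np.linalg.cond(monomial), np.linalg.cond(newton))
    # The monomial basis is typically many orders of magnitude worse conditioned,
    # which is why it loses precision and can stall convergence.

A Chebyshev basis does the same job with a three-term recurrence scaled to an interval enclosing the spectrum.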

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Requires Co-tuning Kernels


CA-BiCGStab

Residual-replacement iterations (count) for each variant:
    Naive:     74 (1)
    Monomial:  [7, 15, 24, 31, …, 92, 97, 103] (17)
    Newton:    [67, 98] (2)
    Chebyshev: 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

    Nonzero entries \ Indices   Explicit (O(nnz))     Implicit (o(nnz))
    Explicit (O(nnz))           CSR and variations    Vision, climate, AMR, …
    Implicit (o(nnz))           Graph Laplacian       Stencils

• Classifications of sparse operators for avoiding communication
    • Explicit indices or nonzero entries cause most communication, along with vectors
    • Ex: with stencils (all implicit), all communication is for vectors
• Operations
    • [x, Ax, A^2x, …, A^kx] or [x, p_1(A)x, p_2(A)x, …, p_k(A)x]
    • Number of columns in x
    • [x, Ax, A^2x, …, A^kx] and [y, A^Ty, (A^T)^2y, …, (A^T)^ky], or [y, A^TAy, (A^TA)^2y, …, (A^TA)^ky]
    • Return all vectors or just the last one
• Co-tuning and/or interleaving (a toy TSQR sketch follows this list)
    • W = [x, Ax, A^2x, …, A^kx] and {TSQR(W) or W^TW or …}
    • Ditto, but throw away W
• Preconditioned versions
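Since TSQR(W) appears throughout, here is a toy two-level NumPy sketch of the idea; np.linalg.qr plays the role of the local factorizations, and a real TSQR would perform the second step as a tree reduction across processors rather than one stacked QR:

    import numpy as np

    def tsqr_two_level(W, p=4):
        # QR of a tall-skinny W via p independent block QRs (no communication),
        # followed by one small QR of the stacked R factors (the reduction step).
        m, n = W.shape
        blocks = np.array_split(W, p, axis=0)
        Qs, Rs = zip(*(np.linalg.qr(B) for B in blocks))
        Q2, R = np.linalg.qr(np.vstack(Rs))          # (p*n) x n stacked R factors
        # Reassemble the global Q; often only R is needed (e.g. to build H in CA-GMRES).
        Q = np.vstack([Qs[i] @ Q2[i * n:(i + 1) * n, :] for i in range(p)])
        return Q, R

    W = np.random.randn(10000, 8)
    Q, R = tsqr_two_level(W)
    print(np.allclose(Q @ R, W), np.allclose(Q.T @ Q, np.eye(8)))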

Summary of Iterative Linear Algebra

• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
    – Many different algorithms reorganized
        • More underway, more to be done
    – Need to recognize stable variants more easily
    – Preconditioning
        • Hierarchically Semiseparable Matrices
    – Autotuning and synthesis
        • pOSKI for SpMV – available at bebop.cs.berkeley.edu
        • Different kinds of "sparse matrices"

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 92: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Minimizing Communication of GMRES to solve Ax=bbull GMRES find x in spanbAbhellipAkb minimizing || Ax-b ||2

Standard GMRES for i=1 to k w = A middot v(i-1) hellip SpMV MGS(w v(0)hellipv(i-1)) update v(i) H endfor solve LSQ problem with H

Communication-avoiding GMRES

W = [ v Av A2v hellip Akv ] [QR] = TSQR(W) hellip ldquoTall Skinny QRrdquo build H from R solve LSQ problem with H

bullOops ndash W from power method precision lost92

Sequential case words moved decreases by a factor of kParallel case messages decreases by a factor of k

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 93: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

ldquoMonomialrdquo basis [AxhellipAkx] fails to converge

Different polynomial basis [p1(A)xhellippk(A)x] does converge

93

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 94: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Speed ups of GMRES on 8-core Intel Clovertown

[MHDY09]

94

Requires Co-tuning Kernels

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 95: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

95

CA-BiCGStab

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 96: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Naive Monomial Newton Chebyshev

Replacement Its 74 (1) [7 15 24 31 hellip 92 97 103] (17)

[67 98] (2) 68 (1)

With Residual Replacement (RR) a la Van der Vorst and Ye

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 97: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Tuning space for Krylov Methods

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

bull Classifications of sparse operators for avoiding communicationbull Explicit indices or nonzero entries cause most communication along with vectorsbull Ex With stencils (all implicit) all communication for vectors

bull Operationsbull [x Ax A2xhellip Akx ] or [x p1(A)x p2(A)x hellip pk(A)x ]bull Number of columns in xbull [x Ax A2xhellip Akx ] and [y ATy (AT)2yhellip (AT)ky ] or [y ATAy (ATA)2yhellip (ATA)ky ] bull return all vectors or just last one

bull Cotuning andor interleavingbull W = [x Ax A2xhellip Akx ] and TSQR(W) or WTW or hellip bull Ditto but throw away W

bull Preconditioned versions

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 98: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Summary of Iterative Linear Algebra

bull New Lower bounds optimal algorithms big speedups in theory and practice

bull Lots of other progress open problemsndash Many different algorithms reorganized

bull More underway more to be done

ndash Need to recognize stable variants more easilyndash Preconditioning

bull Hierarchically Semiseparable Matrices

ndash Autotuning and synthesisbull pOSKI for SpMV ndash available at bebopcsberkeleyedubull Different kinds of ldquosparse matricesrdquo

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 99: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

For more details

bull Bebopcsberkeleyedubull CS267 ndash Berkeleyrsquos Parallel Computing Course

ndash Live broadcast in Spring 2014bull wwwcsberkeleyedu~demmel

ndash On-line version planned in Spring 2014bull wwwxsedeorgbull Free supercomputer accounts to do homeworkbull University credit with local instructors

ndash 3-day short course every Augustbull ~100 page survey article nearly donehellip

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

  • Introduction to Communication-Avoiding Algorithms wwwcsberk
  • Why avoid communication (12)
  • Why avoid communication (23)
  • Why Minimize Communication (22)
  • Why Minimize Communication (22) (2)
  • Goals
  • Slide 7
  • Collaborators and Supporters
  • Summary of CA Algorithms
  • Outline
  • Outline (2)
  • Lower bound for all ldquodirectrdquo linear algebra
  • Lower bound for all ldquodirectrdquo linear algebra (2)
  • Lower bound for all ldquodirectrdquo linear algebra (3)
  • Can we attain these lower bounds
  • Naiumlve Matrix Multiply
  • Naiumlve Matrix Multiply (2)
  • Naiumlve Matrix Multiply (3)
  • Blocked (Tiled) Matrix Multiply
  • Blocked (Tiled) Matrix Multiply (2)
  • Does blocked matmul attain lower bound
  • How hard is hand-tuning matmul anyway
  • How hard is hand-tuning matmul anyway (2)
  • Recursive Matrix Multiplication (RMM) (12)
  • Recursive Matrix Multiplication (RMM) (22)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster
  • Parallel MatMul with 2D Processor Layout
  • SUMMA Algorithm
  • SUMMA ndash n x n matmul on P12 x P12 grid
  • SUMMAndash n x n matmul on P12 x P12 grid
  • Summary of dense parallel algorithms attaining communication l
  • Can we do better
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • TSQR QR of a Tall Skinny matrix
  • TSQR QR of a Tall Skinny matrix (2)
  • TSQR An Architecture-Dependent Algorithm
  • TSQR Performance Results
  • Using similar idea for TSLU as TSQR Use reduction tree to do
  • LU Speedups from Tournament Pivoting and 25D
  • 25D vs 2D LU With and Without Pivoting
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • Other CA algorithms
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 57
  • Symmetric Band Reduction
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs LAPACKrsquos DSBTRD
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Summary of Direct Linear Algebra
  • Outline (3)
  • Recall optimal sequential Matmul
  • New Thm applied to Matmul
  • New Thm applied to Direct N-Body
  • N-Body Speedups on IBM-BGP (Intrepid) 8K cores 32K particles
  • New Thm applied to Random Code
  • Approach to generalizing lower bounds
  • General Communication Bound
  • Is this bound attainable (12)
  • Is this bound attainable (22)
  • Ongoing Work
  • Outline (4)
  • Avoiding Communication in Iterative Linear Algebra
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Slide 90
  • Slide 91
  • Minimizing Communication of GMRES to solve Ax=b
  • Slide 93
  • Speed ups of GMRES on 8-core Intel Clovertown
  • Slide 95
  • Slide 96
  • Slide 97
  • Tuning space for Krylov Methods
  • Summary of Iterative Linear Algebra
  • For more details
  • Reproducible Floating Point Computation
  • Summary
Page 100: Introduction to  Communication-Avoiding Algorithms cs.berkeley /~ demmel /SC13_tutorial

Reproducible Floating Point Computation

bull Do you get the same answer if you run the same program twice with the same inputndash Not even on your multicore laptop

bull Floating point addition is nonassociative summation order not reproducible

bull First release of the ReproBLASndash Reproducible BLAS 1 independent of data order number of

processors data layout reduction tree hellipndash Sequential and distributed memory (MPI)

bull bebopcsberkeleyedureproblasbull Workshop at SCrsquo13 later this week

Summary

Donrsquot Communichellip

102

Time to redesign all linear algebra n-bodyhellip algorithms and software

(and compilershellip)

Page 101: Introduction to Communication-Avoiding Algorithms www.cs.berkeley.edu/~demmel/SC13_tutorial

Summary

Don't Communic…

102

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers…)
