Transcript of MS4: Minimizing Communication in Numerical Algorithms, Part I of II
MS4: Minimizing Communication in Numerical Algorithms, Part I of II
Organizers: Oded Schwartz (Hebrew University of Jerusalem) and Erin Carson (New York University)
Talks:
1. Communication-Avoiding Krylov Subspace Methods in Theory and Practice (Erin Carson)
2. Enlarged GMRES for Reducing Communication When Solving Reservoir Simulation Problems (Hussam Al Daas, Laura Grigori, Pascal Henon, Philippe Ricoux)
3. CA-SVM: Communication-Avoiding Support Vector Machines on Distributed Systems (Yang You)
4. Outline of a New Roadmap to Permissive Communication and Applications That Can Benefit (James A. Edwards and Uzi Vishkin)
Communication-Avoiding Krylov Subspace Methods
in Theory and Practice
Erin Carson
Courant Institute @ NYU
April 12, 2016
SIAM PP 16
What is communication?
• Algorithms have two costs: computation and communication
• Communication: moving data between levels of the memory hierarchy (sequential), or between processors (parallel)
• On today's computers, communication is expensive and computation is cheap, in terms of both time and energy!
[Diagram: sequential model (CPU ↔ cache ↔ DRAM) vs. parallel model (multiple CPU+DRAM nodes connected by a network)]
Future exascale systems

                                  | Petascale Systems (2009) | Predicted Exascale Systems | Factor Improvement
System Peak                       | 2 × 10^15 flop/s         | 10^18 flop/s               | ~1000
Node Memory Bandwidth             | 25 GB/s                  | 0.4–4 TB/s                 | ~10–100
Total Node Interconnect Bandwidth | 3.5 GB/s                 | 100–400 GB/s               | ~100
Memory Latency                    | 100 ns                   | 50 ns                      | ~1
Interconnect Latency              | 1 µs                     | 0.5 µs                     | ~1

*Sources: P. Beckman (ANL), J. Shalf (LBL), and D. Unat (LBL)

• Gaps between communication and computation cost are only growing larger on future systems
• Avoiding communication will be essential for applications at exascale!
Minimize communication to save energy
Source: John Shalf, LBL
Work in CA algorithms
• For both dense and sparse linear algebra…
• More recently, extending communication-avoiding ideas to machine learning and optimization domains
• Prove lower bounds on the communication cost of an algorithm
• Design new algorithms and implementations that meet these bounds
Lots of speedupsβ¦
• Up to 12x faster for 2.5D matmul on 64K-core IBM BG/P
• Up to 3x faster for tensor contractions on 2K-core Cray XE/6
• Up to 6.2x faster for All-Pairs-Shortest-Path on 24K-core Cray CE6
• Up to 2.1x faster for 2.5D LU on 64K-core IBM BG/P
• Up to 11.8x faster for direct N-body on 32K-core IBM BG/P
• Up to 13x faster for Tall-Skinny QR on an NVIDIA Tesla C2050 (Fermi) GPU
• Up to 6.7x faster for symeig(band A) on a 10-core Intel Westmere
• Up to 2x faster for 2.5D Strassen on 38K-core Cray XT4
• Up to 4.2x faster for the miniGMG benchmark bottom solver, using CA-BICGSTAB (2.5x for the overall solve); 2.5x / 1.5x for a combustion simulation code
• Up to 42x for parallel direct 3-body
These and many more recent papers available at bebop.cs.berkeley.edu
Krylov solvers: limited by communication
In terms of linear algebra operations, each iteration consists of:
• Sparse matrix-vector multiplication (SpMV): "add a dimension to 𝒦_m"
  • Parallel: communicate vector entries with neighbors
  • Sequential: read A and the vectors from slow memory
• Orthogonalization (with respect to some 𝒦_m): inner products
  • Parallel: global reduction (All-Reduce)
  • Sequential: multiple reads/writes to slow memory
Dependencies between these communication-bound kernels in each iteration limit performance!
The classical Lanczos method

Given: initial vector v₁ with ‖v₁‖₂ = 1
u₁ = A v₁
for m = 1, 2, …, until convergence do
    αₘ = vₘᵀ uₘ                        ← inner product
    wₘ = uₘ − αₘ vₘ
    βₘ₊₁ = ‖wₘ‖₂                       ← inner product
    vₘ₊₁ = wₘ / βₘ₊₁
    uₘ₊₁ = A vₘ₊₁ − βₘ₊₁ vₘ            ← SpMV
end for
Communication-Avoiding KSMs
• Idea: compute blocks of s iterations at once
  • Communicate every s iterations instead of every iteration
  • Reduces communication cost by a factor of O(s)!
  • (latency in parallel; latency and bandwidth in sequential)
• An idea rediscovered many times…
  • First related work: s-dimensional steepest descent, least squares
    • Khabaza ('63), Forsythe ('68), Marchuk and Kuznecov ('68)
  • Flurry of work on s-step Krylov methods in the '80s/early '90s: see, e.g., Van Rosendale, 1983; Chronopoulos and Gear, 1989
    • Goals: increasing parallelism, avoiding I/O, increasing "convergence rate"
  • Resurgence of interest in recent years due to growing problem sizes and the growing relative cost of communication
Communication-Avoiding KSMs: CA-Lanczos
• Main idea: unroll the iteration loop by a factor of s; split it into an outer loop (index k) and an inner loop (index j)
• Key observation: starting at some iteration m ≡ sk + j,
    v_{sk+j}, u_{sk+j} ∈ 𝒦_{s+1}(A, v_{sk+1}) + 𝒦_{s+1}(A, u_{sk+1})   for j ∈ {1, …, s+1}
For each block of s steps:
• Compute a "basis matrix" 𝒴_k such that
    span(𝒴_k) = 𝒦_{s+1}(A, v_{sk+1}) + 𝒦_{s+1}(A, u_{sk+1})
  • O(s) SpMVs; requires reading A / communicating vectors only once, via the CA "matrix powers kernel"
• Orthogonalize: compute the Gram matrix G_k = 𝒴_kᵀ 𝒴_k
  • One global reduction
• Perform s iterations of updates for the length-n vectors by updating their O(s) coordinates in the basis 𝒴_k
  • Local computations: no communication!
The CA-Lanczos method

Given: initial vector v₁ with ‖v₁‖₂ = 1
u₁ = A v₁
for k = 0, 1, …, until convergence do
    Compute 𝒴_k; compute G_k = 𝒴_kᵀ 𝒴_k
    Let v′_{k,1} = e₁, u′_{k,1} = e_{s+2}
    for j = 1, …, s do
        α_{sk+j} = v′_{k,j}ᵀ G_k u′_{k,j}
        w′_{k,j} = u′_{k,j} − α_{sk+j} v′_{k,j}
        β_{sk+j+1} = (w′_{k,j}ᵀ G_k w′_{k,j})^{1/2}
        v′_{k,j+1} = w′_{k,j} / β_{sk+j+1}
        u′_{k,j+1} = B_k v′_{k,j+1} − β_{sk+j+1} v′_{k,j}
    end for
    Compute v_{sk+s+1} = 𝒴_k v′_{k,s+1}, u_{sk+s+1} = 𝒴_k u′_{k,s+1}
end for
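The key observation underlying the algorithm (that an entire block of s iterates stays inside one pair of Krylov subspaces) can be verified numerically. The sketch below (helper names `krylov_basis` and `check_s_step_span` are mine, and this is a check of the span property, not the full CA-Lanczos algorithm) runs one block of classical Lanczos steps and measures the distance of each iterate from span(𝒴):

```python
import numpy as np

def krylov_basis(A, x, dim):
    """Monomial basis [x, A x, ..., A^(dim-1) x] as columns."""
    cols = [x]
    for _ in range(dim - 1):
        cols.append(A @ cols[-1])
    return np.column_stack(cols)

def check_s_step_span(A, v1, s):
    """Run one block of s classical Lanczos steps and return the largest
    distance of any iterate v, u from the column space of
    [K_{s+1}(A, v_1), K_{s+1}(A, u_1)] -- ~0 up to roundoff."""
    v = v1 / np.linalg.norm(v1)
    u = A @ v
    Y = np.column_stack([krylov_basis(A, v, s + 1),
                         krylov_basis(A, u, s + 1)])
    v_prev = np.zeros_like(v)
    max_dist = 0.0
    for _ in range(s):                       # inner iterations j = 1, ..., s
        alpha = v @ u
        w = u - alpha * v
        beta = np.linalg.norm(w)
        v_prev, v = v, w / beta
        u = A @ v - beta * v_prev
        for x in (v, u):                     # both iterates must stay in span(Y)
            coeff = np.linalg.lstsq(Y, x, rcond=None)[0]
            max_dist = max(max_dist, np.linalg.norm(Y @ coeff - x))
    return max_dist
```

Since u₁ = A v₁, the sum of the two subspaces equals 𝒦_{s+2}(A, v₁), so every iterate in the block has an exact representation in the 2(s+1)-column basis.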
Complexity comparison

              | Flops            | Words Moved             | Messages
              | SpMV    | Orth.  | SpMV       | Orth.      | SpMV | Orth.
Classical CG  | s·n/p   | s·n/p  | s·√(n/p)   | s·log₂ p   | s    | s·log₂ p
CA-CG         | s·n/p   | s²·n/p | s·√(n/p)   | s²·log₂ p  | 1    | log₂ p

All values in the table are meant in the big-O sense (i.e., lower-order terms and constants are not included).
Example of parallel (per-processor) complexity for s iterations of classical Lanczos vs. CA-Lanczos for a 2D 9-point stencil (assuming each of p processors owns n/p rows of the matrix and s ≤ √(n/p)).
From theory to practice
• CA-KSMs are mathematically equivalent to classical KSMs
• But they can behave much differently in finite precision!
  • Roundoff error bounds generally grow with increasing s
• Two effects of roundoff error:
  1. Decrease in accuracy → tradeoff: increasing the blocking factor s past a certain point, attainable accuracy is limited
  2. Delay of convergence → tradeoff: increasing the blocking factor s past a certain point, no speedup is expected

[Sketches: time per iteration vs. s; number of iterations vs. s]
Optimizing iterative method runtime
• Want to minimize the total time of the iterative solver:
  Runtime = (time/iteration) × (# iterations)
• Time per iteration is determined by the matrix/preconditioner structure, machine parameters, basis size, etc.
• Number of iterations depends on the numerical properties of the matrix/preconditioner and the basis size

[Sketches: (time per iteration vs. s) × (number of iterations vs. s) = (total time vs. s)]
Optimizing iterative method runtime
• Want to minimize the total time of the iterative solver:
  Runtime = (time/iteration) × (# iterations)
• Speed per iteration is determined by the matrix/preconditioner structure, machine parameters, basis size, etc.
• Number of iterations depends on the numerical properties of the matrix/preconditioner and the basis size
• Traditional auto-tuners tune kernels (e.g., SpMV and QR) to optimize speed per iteration
  • This misses a big part of the equation!
• Goal: combine offline auto-tuning with online techniques for achieving the desired accuracy and a good convergence rate
  • Requires a better understanding of the behavior of iterative methods in finite precision
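The runtime tradeoff can be made concrete with a toy cost model. All constants below are made-up illustrative numbers, not measurements from the talk: amortizing latency over s shrinks the time per iteration, while s-dependent extra flops and finite-precision convergence delay inflate the iteration count, so the modeled optimum lies at an interior s.

```python
# A toy cost model (illustrative constants, not measured) for
# Runtime = (time/iteration) x (# iterations) as a function of s.
def total_time(s, n_iter_base=1000, t_flop=1e-6, t_latency=1e-4,
               flop_growth=0.2, iter_penalty=0.02):
    """Modeled CA solver runtime for blocking factor s."""
    # latency amortized over s iterations; extra orthogonalization flops grow with s
    time_per_iter = t_flop * (1 + flop_growth * s) + t_latency / s
    # assumed finite-precision convergence delay, growing with s
    n_iter = n_iter_base * (1 + iter_penalty * s)
    return time_per_iter * n_iter

# the modeled optimum is neither s = 1 nor the largest s
best_s = min(range(1, 33), key=total_time)
```

The shape of the two factors mirrors the sketches on the slide: time per iteration decreases with s, number of iterations increases with s, and their product has an interior minimum.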
Main results
• Bounds on the accuracy and convergence rate of CA methods can be written as the corresponding bounds for the classical methods times an amplification factor
• The amplification factor depends on the condition numbers of the s-step bases 𝒴_k computed in each outer iteration
• These bounds can be used to design techniques that improve accuracy and convergence rate while still avoiding communication
Attainable accuracy of (CA-)CG
• Results for CG (Greenbaum, van der Vorst and Ye, others):
    ‖b − A x_m − r_m‖ ≤ ε Σ_{i=0}^{m} (1 + 2N) (‖A‖ ‖x_i‖ + ‖r_i‖)
• Results for CA-CG:
    ‖b − A x_m − r_m‖ ≤ ε c Γ_k Σ_{i=0}^{m} (1 + 2N) (‖A‖ ‖x_i‖ + ‖r_i‖),
  where Γ_k = max_{ℓ≤k} ‖𝒴_ℓ⁺‖₂ · ‖𝒴_ℓ‖₂
• The bound can be used for designing a "residual replacement" strategy for CA-CG (based on van der Vorst and Ye, 1999)
  • In tests, CA-CG accuracy improved by up to 7 orders of magnitude for little additional cost
Convergence and accuracy of CA-Lanczos
• Chris Paige's results for classical Lanczos: loss of orthogonality ⟺ eigenvalue convergence, provided ε₀ ≤ 1/12
• This (and other results of Paige) also holds for CA-Lanczos if:
    2ε(n + 11s + 15) Γ² ≤ 1/12,   where Γ = max_{ℓ≤k} ‖𝒴_ℓ⁺‖₂ · ‖𝒴_ℓ‖₂
  • i.e., max_{ℓ≤k} ‖𝒴_ℓ⁺‖₂ · ‖𝒴_ℓ‖₂ ≤ (24ε(n + 11s + 15))^{−1/2}
• We can approximate this constraint by:
    κ(𝒴_k) ≤ 1/√(sε)
…and use it to design a better algorithm!
Dynamic basis size
• Auto-tune to find the best s based on the machine and sparsity structure; use this as s_max
• In each outer iteration, select the largest s ≤ s_max such that κ(𝒴_k) ≤ 1/√(sε)
• Benefit: maintains an acceptable convergence rate regardless of the user's choice of s
• Cost: incremental condition-number estimation in each outer iteration; potentially wasted SpMVs in each outer iteration
s values used = (6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 4, 4, 5, 5)
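The selection rule can be sketched as follows. This is a simplified stand-in (function name `choose_s` is mine): it grows a monomial basis one column at a time and uses an exact SVD-based condition number, where a real implementation would use the incremental condition-number estimation mentioned above.

```python
# A sketch of the dynamic basis-size rule: grow the monomial basis
# column by column and stop before kappa(Y) exceeds 1/sqrt(s*eps).
# Exact cond() here stands in for incremental estimation.
import numpy as np

def choose_s(A, v, s_max, eps=np.finfo(float).eps):
    """Return (s, Y): the largest s <= s_max whose s+1-column monomial
    basis Y = [v, Av, ..., A^s v] satisfies kappa(Y) <= 1/sqrt(s*eps)."""
    Y = (v / np.linalg.norm(v))[:, None]
    s = 0
    for j in range(1, s_max + 1):
        Y_try = np.column_stack([Y, A @ Y[:, -1]])
        if np.linalg.cond(Y_try) > 1.0 / np.sqrt(j * eps):
            break                      # basis too ill-conditioned: keep s = j - 1
        Y, s = Y_try, j
    return s, Y
```

The accepted basis always satisfies the conditioning constraint, at the price of one possibly wasted SpMV when the check fails.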
MS12: Minimizing Communication in Numerical Algorithms, Part II of II
1. Communication-Efficient Evaluation of Matrix Polynomials (Sivan A. Toledo)
2. Communication-Optimal Loop Nests (Nicholas Knight)
3. Write-Avoiding Algorithms (Harsha Vardhan Simhadri)
4. Lower Bound Techniques for Communication in the Memory Hierarchy (Gianfranco Bilardi)
3:20 PM – 5:00 PM, Room: Salle des theses
Thank you!
Email: [email protected]
Website: http://cims.nyu.edu/~erinc/
Matlab code: https://github.com/eccarson/ca-ksms
Residual Replacement Strategy
• Improve accuracy by replacing the updated residual r_m by the true residual b − A x_m in certain iterations
• Related work for classical CG: van der Vorst and Ye (1999)
• Based on a derived bound on the deviation of the residuals, we can devise a residual replacement strategy for CA-CG and CA-BICG
• Choose when to replace r_m with b − A x_m to meet two constraints:
  1. ‖b − A x_m − r_m‖ is small
  2. The convergence rate is maintained (avoid large perturbations to the finite precision CG recurrence)
• The implementation has negligible cost → the residual replacement strategy allows both speed and accuracy!
Pseudo-code for residual replacement with group update for CA-CG:

if d_{sk+j−1} ≤ ε̂ ‖r_{sk+j−1}‖ and d_{sk+j} > ε̂ ‖r_{sk+j}‖ and d_{sk+j} > 1.1 d_init
    z = z + 𝒴_k x′_{k,j} + x_{sk}        ← group update of approximate solution
    x_{sk+j} = 0
    r_{sk+j} = b − A z                    ← set updated residual to true residual
    d_init = d_{sk+j} = ε((1 + 2N′) ‖A‖ ‖z‖ + ‖r_{sk+j}‖)
    p_{sk+j} = 𝒴_k p′_{k,j}
    break from inner loop and begin new outer loop
end
Residual Replacement for CA-CG
• Use a computable bound for ‖b − A x_{sk+j} − r_{sk+j}‖ to update d_{sk+j}, an estimate of the error in computing r_{sk+j}, in each iteration
• Set the threshold ε̂ ≈ √ε; replace whenever d_{sk+j}/‖r_{sk+j}‖ reaches the threshold
• The pseudo-code above performs the replacement together with a group update of the approximate solution, setting the updated residual to the true residual
• In each iteration, the error estimate d_{sk+j} is updated via a computable bound
A Computable Bound for CA-CG

d_{sk+j} ≡ d_{sk+j−1}
    + ε ( ‖A‖ ‖x_{sk+j}‖ + (2 + 2N′) ‖A‖ ‖|𝒴_k| |x′_{k,j}|‖ + N′ ‖|𝒴_k| |r′_{k,j}|‖ )
    + { ε ( (4 + N′) ‖A‖ ‖|𝒴_k| |x′_{k,j}|‖ + ‖|𝒴_k| |B_k| |x′_{k,j}|‖ + ‖|𝒴_k| |r′_{k,j}|‖ ),  if j = s
        0,  otherwise }

where N′ = max(N, 2s + 1).

• ‖A‖ is estimated only once; most terms cost O(sn) flops per s iterations with no communication; the remaining terms cost O(s²) flops per s iterations, with ≤ 1 reduction per s iterations to compute |𝒴_k|ᵀ |𝒴_k|
• The extra computation is all lower-order terms; communication is increased by at most a factor of 2!
[Plots: CA-CG convergence, s = 8 and s = 16. Model problem: 2D Poisson, n = 262K, nnz = 1.3M, cond(A) ≈ 10⁴. Legend: CG true/updated; CA-CG true/updated with monomial, Newton, and Chebyshev bases.]
A better basis choice allows higher s values, but loss of accuracy can still be seen.
[Plots: CA-CG convergence, s = 12. "consph" matrix (3D FEM) from the UFL Sparse Matrix Collection: n = 8.3 × 10⁴, nnz = 6.0 × 10⁶, κ(A) = 9.7 × 10³, ‖A‖₂ = 9.7. Legend: CG true/updated; CA-CG true/updated with monomial, Newton, and Chebyshev bases.]
For real problems, loss of accuracy becomes evident even with better bases.
[Plots: CA-CG convergence, s = 4, 8, and 16. Model problem: 2D Poisson (5-pt stencil), n = 262K, nnz = 1.3M, cond(A) ≈ 10⁴. Legend: CG true/updated; CA-CG true/updated with monomial, Newton, and Chebyshev bases.]
[Plots: CA-CG convergence with residual replacement, s = 4, 8, and 16. Model problem: 2D Poisson (5-pt stencil), n = 262K, nnz = 1.3M, cond(A) ≈ 10⁴. Legend: CG-RR true/updated; CA-CG-RR true/updated with monomial, Newton, and Chebyshev bases.]
Residual replacement can improve accuracy by orders of magnitude for negligible cost. Maximum number of replacement steps (extra communication steps) for any test: 8.
[Plots: CA-CG convergence with residual replacement, s = 12. "consph" matrix (3D FEM) from the UFL Sparse Matrix Collection: n = 8.3 × 10⁴, nnz = 6.0 × 10⁶, κ(A) = 9.7 × 10³, ‖A‖₂ = 9.7. Legend: CG-RR true/updated; CA-CG-RR true/updated with monomial, Newton, and Chebyshev bases.]
Maximum number of replacements: 4.
[Plot: coarse-grid Krylov solver on NERSC's Hopper (Cray XE6). Weak scaling, 4³ points per process (0 slope ideal). Time (seconds) vs. processes (6 threads each, up to 4096): bottom solver time (total) vs. MPI_AllReduce time (total), i.e., solver time vs. communication time.]
Solver performance and scalability are limited by communication!
The Matrix Powers Kernel (Demmel et al., 2007)

Avoids communication:
• In serial, by exploiting temporal locality:
  • Reading A, reading vectors
• In parallel, by doing only 1 "expand" phase (instead of s)
  • Requires a sufficiently low "surface-to-volume" ratio

[Tridiagonal example: sequential and parallel computation of v, Av, A²v, A³v; black = local elements, red = 1-level dependencies, green = 2-level dependencies, blue = 3-level dependencies]

Also works for general graphs!
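The "expand" phase can be sketched as a dependency computation on the graph of A: to compute s powers applied to its local rows with no further communication, a processor must first fetch every row within graph distance s. The helper name `reachable_rows` is mine; a real implementation works on distributed sparse structures, not a dense NumPy matrix.

```python
# A sketch of the matrix powers kernel's dependency ("expand") computation:
# rows within graph distance s of the local rows are needed up front.
import numpy as np

def reachable_rows(A, local_rows, s):
    """Set of row indices needed to compute s SpMVs restricted to local_rows."""
    needed = set(local_rows)
    frontier = set(local_rows)
    for _ in range(s):                   # one BFS level per power of A
        nxt = set()
        for i in frontier:
            nxt.update(np.nonzero(A[i])[0].tolist())
        frontier = nxt - needed          # newly discovered dependencies
        needed |= nxt
    return needed
```

For a tridiagonal matrix (the example above), the needed set grows by one row on each side per power, which is the low surface-to-volume regime where the kernel pays off.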
Choosing a Polynomial Basis
• Recall: in each outer loop of CA-CG, we compute bases for some Krylov subspaces, 𝒦_s(A, v) = span{v, Av, …, A^{s−1}v}
• Simple loop unrolling gives the monomial basis [v, Av, A²v, A³v, …]
  • Its condition number can grow exponentially with s
    • Condition number = ratio of the largest to smallest singular values, σ_max/σ_min
  • It was recognized early on that this negatively affects convergence (Leland, 1989)
• Improve the basis condition number to improve convergence: use different polynomials to compute a basis for the same subspace
• Two choices based on spectral information that usually lead to well-conditioned bases:
  • Newton polynomials
  • Chebyshev polynomials
History of s-step Krylov Methods

• 1983 — Van Rosendale: CG
• 1988 — Walker: GMRES
• 1989 — Chronopoulos and Gear: CG (first termed "s-step methods"); Leland: CG
• 1990–1992 — Chronopoulos: MINRES, GCR, Orthomin; Chronopoulos and Kim: Orthomin, GMRES; Kim and Chronopoulos: Arnoldi, Symm. Lanczos; Chronopoulos and Kim: Nonsymm. Lanczos; Joubert and Carey: GMRES; Bai, Hu, and Reichel: GMRES; Erhel: GMRES; de Sturler: GMRES
• 1995 — Toledo: CG; de Sturler and van der Vorst: GMRES
• 2001 — Chronopoulos and Kinkaid: Orthodir
Recent Years…

• 2010 — Hoemmen: Arnoldi, GMRES, Lanczos, CG (first termed "CA" methods; first TSQR and general matrix powers kernel)
• 2011–2014 — Carson, Knight, and Demmel: BICG, CGS, BICGSTAB (first CA-BICGSTAB method); Grigori, Moufawad, Nataf: CG; Feuerriegel and Bücker: Lanczos, BICG, QMR; Ballard, Carson, Demmel, Hoemmen, Knight, Schwartz: Arnoldi, GMRES, Nonsymm. Lanczos; Carson and Demmel: 2-term Lanczos; Carson and Demmel: CG-RR, BICG-RR (first theoretical results on finite precision behavior)
The Amplification Term Γ

• Roundoff errors in the CA variant follow the same pattern as in the classical variant, but amplified by a factor of Γ or Γ²
• This theoretically confirms observations on the importance of basis conditioning (dating back to the late '80s)
• A loose bound for the amplification term:
    Γ ≤ max_{ℓ≤k} ‖𝒴_ℓ⁺‖₂ · ‖𝒴_ℓ‖₂ ≤ (2s + 1) · max_{ℓ≤k} κ(𝒴_ℓ)
• What we really need: ‖|𝒴| |y′|‖₂ ≤ Γ ‖𝒴 y′‖₂ to hold for the computed basis 𝒴 and coordinate vector y′ in every bound
• A tighter bound on Γ is possible; it requires some light bookkeeping
• Example:
    Γ_{k,j} ≡ max over x ∈ {w′_{k,j}, u′_{k,j}, v′_{k,j}, v′_{k,j−1}} of ‖|𝒴_k| |x|‖₂ / ‖𝒴_k x‖₂
More Current Work
• 2.5D symmetric eigensolver (Solomonik et al.)
• Write-Avoiding algorithms (talk by Harsha Vardhan Simhadri in the afternoon session)
• CA sparse RRLU (Grigori, Cayrols, Demmel)
• CA parallel sparse-dense matrix-matrix multiplication (Koanantakool et al.)
• Lower bounds for general programs that access arrays (talk by Nick Knight in the afternoon session)
• CA support vector machines (talk by Yang You)
• CA-RRQR (Demmel, Grigori, Gu, Xiang)
• CA-SBR (Ballard, Demmel, Knight)
Dynamic basis size
• Auto-tune to find the best s based on the machine and the matrix sparsity structure; use this as s_max
• In each outer iteration, select the largest s ≤ s_max such that κ(𝒴_k) ≤ 1/√(sε)
• Benefit: maintains an acceptable convergence rate regardless of the user's choice of s
• Cost: incremental condition-number estimation in each outer iteration; potentially wasted SpMVs in each outer iteration
s values used = (6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 4)
Finite precision Lanczos process (A is n × n with at most N nonzeros per row):

    A V̂_m = V̂_m T̂_m + β̂_{m+1} v̂_{m+1} e_mᵀ + δV̂_m

where V̂_m = [v̂₁, …, v̂_m], δV̂_m = [δv̂₁, …, δv̂_m], and T̂_m is the symmetric tridiagonal matrix with diagonal entries α̂₁, …, α̂_m and off-diagonal entries β̂₂, …, β̂_m.
(CA-)Lanczos Convergence Analysis

Classical Lanczos (Paige, 1976): for i ∈ {1, …, m},
    ‖δv̂_i‖₂ ≤ ε₁ σ
    β̂_{i+1} |v̂_iᵀ v̂_{i+1}| ≤ 2 ε₀ σ
    |v̂_{i+1}ᵀ v̂_{i+1} − 1| ≤ ε₀/2
    |β̂²_{i+1} + α̂²_i + β̂²_i − ‖A v̂_i‖₂²| ≤ 4i (3ε₀ + ε₁) σ²
where σ ≡ ‖A‖₂ and θσ ≡ ‖|A|‖₂, with
    ε₀ = O(εn),   ε₁ = O(εNθ).

CA-Lanczos (C., 2015): the same bounds hold with
    ε₀ = O(εnΓ²),   ε₁ = O(εNθΓ),
where Γ = max_{ℓ≤k} ‖𝒴_ℓ⁺‖₂ · ‖𝒴_ℓ‖₂.
Paige's Results for Classical Lanczos (1980)
• Using bounds on local rounding errors in Lanczos, Paige showed that:
1. The computed eigenvalues always lie between the extreme eigenvalues of A, to within a small multiple of machine precision.
2. At least one small interval containing an eigenvalue of A is found by the nth iteration.
3. The algorithm behaves numerically like Lanczos with full reorthogonalization until a very close eigenvalue approximation is found.
4. The loss of orthogonality among basis vectors follows a rigorous pattern and implies that some computed eigenvalues have converged.
Do the same statements hold for CA-Lanczos?
Paige's Lanczos Convergence Analysis

Classic Lanczos rounding error result of Paige (1976), for the finite precision recurrence
    A V̂_m = V̂_m T̂_m + β̂_{m+1} v̂_{m+1} e_mᵀ + δV̂_m
(notation as above): for i ∈ {1, …, m},
    ‖δv̂_i‖₂ ≤ ε₁ σ
    β̂_{i+1} |v̂_iᵀ v̂_{i+1}| ≤ 2 ε₀ σ
    |v̂_{i+1}ᵀ v̂_{i+1} − 1| ≤ ε₀/2
    |β̂²_{i+1} + α̂²_i + β̂²_i − ‖A v̂_i‖₂²| ≤ 4i (3ε₀ + ε₁) σ²
where σ ≡ ‖A‖₂, θσ ≡ ‖|A|‖₂, ε₀ ≡ 2ε(n + 4), and ε₁ ≡ 2ε(Nθ + 7), i.e., ε₀ = O(εn) and ε₁ = O(εNθ).
These results form the basis for Paige's influential results in (Paige, 1980).
CA-Lanczos Convergence Analysis

For CA-Lanczos, the same bounds hold for i ∈ {1, …, m = sk + j}:
    ‖δv̂_i‖₂ ≤ ε₁ σ
    β̂_{i+1} |v̂_iᵀ v̂_{i+1}| ≤ 2 ε₀ σ
    |v̂_{i+1}ᵀ v̂_{i+1} − 1| ≤ ε₀/2
    |β̂²_{i+1} + α̂²_i + β̂²_i − ‖A v̂_i‖₂²| ≤ 4i (3ε₀ + ε₁) σ²
with
    ε₀ ≡ 2ε(n + 11s + 15) Γ² = O(εnΓ²)   (vs. O(εn) for Lanczos)
    ε₁ ≡ 2ε((N + 2s + 5)θ + (4s + 9)τ + 10s + 16) Γ = O(εNθΓ)   (vs. O(εNθ) for Lanczos)
where σ ≡ ‖A‖₂, θσ ≡ ‖|A|‖₂, τσ ≡ max_{ℓ≤k} ‖|B_ℓ|‖₂, and
    Γ ≡ max_{ℓ≤k} ‖𝒴_ℓ⁺‖₂ · ‖|𝒴_ℓ|‖₂ ≤ (2s + 1) · max_{ℓ≤k} κ(𝒴_ℓ).
Residual Replacement Strategy
• van der Vorst and Ye (1999): improve accuracy by replacing the updated residual r_m by the true residual b − A x_m in certain iterations
• Choose when to replace r_m with b − A x_m to meet two constraints:
  1. ‖b − A x_m − r_m‖ is small
  2. The convergence rate is maintained (avoid large perturbations to the finite precision CG recurrence)
• Requires monitoring an estimate of the deviation of the residuals
• We can use the same strategy for CA-CG
• The implementation has negligible cost → the residual replacement strategy can allow both speed and accuracy!
[Plots: CA-CG convergence, s = 4, 8, and 16, without and with residual replacement. Model problem: 2D Poisson (5-pt stencil), n = 262K, nnz = 1.3M, cond(A) ≈ 10⁴. Legends: CG(-RR) true/updated; CA-CG(-RR) true/updated with monomial, Newton, and Chebyshev bases.]
Residual replacement can improve accuracy by orders of magnitude for negligible cost. Maximum number of replacement steps (extra communication steps) for any test: 8.
[Plot captions: Hopper, 4 MPI processes per node; CG is the PETSc solver; 2D Poisson on 512², 1024², and 2048² grids, and on 16², 32², and 64² grids per process.]
Communication-Avoiding Krylov Method Speedups
• Recent results: CA-BICGSTAB used as a geometric multigrid (GMG) bottom-solve (Williams, Carson, et al., IPDPS '14)
• Plot: net time spent on different operations over one GMG bottom solve using 24,576 cores; 64³ points/core on the fine grid, 4³ points/core on the coarse grid
• Hopper at NERSC (Cray XE6): 4 six-core Opteron chips per node, Gemini network, 3D torus
• CA-BICGSTAB with s = 4
• 3D Helmholtz equation:
    a α u − b ∇·(β ∇u) = f,   with α = β = 1.0, a = b = 0.9
• 4.2x speedup in the Krylov solve; 2.5x in the overall GMG solve
• Implemented in BoxLib; applied to low-Mach-number combustion and 3D N-body dark matter simulation applications
Benchmark timing breakdown
• Plot: net time spent across all bottom solves at 24,576 cores, for BICGSTAB and CA-BICGSTAB with s = 4
• 11.2x reduction in MPI_AllReduce time (red)
  • BICGSTAB requires 6s more MPI_AllReduces than CA-BICGSTAB
  • Less than the theoretical 24x, since messages in CA-BICGSTAB are larger and not always latency-limited
• P2P (blue) communication doubles for CA-BICGSTAB
  • The basis computation requires twice as many SpMVs (P2P) per iteration as BICGSTAB

[Plot: breakdown of bottom solver time (seconds) for BICGSTAB vs. CA-BICGSTAB: MPI (collectives), MPI (P2P), BLAS3, BLAS1, applyOp, residual]
Representation of Matrix Structures

[Diagram (Hoemmen, 2010, Fig. 2.5): a 2 × 2 classification by representation of matrix structure vs. representation of matrix values, each either explicit or implicit. Examples shown: stencil with variable coefficients, stencil with constant coefficients, Laplacian matrix of a graph, general sparse matrix.]