MS58: Approaches to Reducing Communication in Krylov Subspace Methods
Organizers: Laura Grigori (INRIA) and Erin Carson (NYU)
Talks:
1. The s-Step Lanczos Method and its Behavior in Finite Precision (Erin Carson, James W. Demmel)
2. Enlarged Krylov Subspace Methods for Reducing Communication (Sophie M. Moufawad, Laura Grigori, Frederic Nataf)
3. Preconditioning Communication-Avoiding Krylov Methods (Siva Rajamanickam, Ichitaro Yamazaki, Andrey Prokopenko, Erik G. Boman, Michael Heroux, Jack J. Dongarra)
4. Sparse Approximate Inverse Preconditioners for Communication-Avoiding Bicgstab Solvers (Maryam MehriDehnavi, Erin Carson, Nicholas Knight, James W. Demmel, David Fernandez)
The s-Step Lanczos Method and its Behavior in Finite Precision
Erin Carson, NYU
James Demmel, UC Berkeley
SIAM LA ‘15
October 30, 2015
Why Avoid “Communication”?
• Algorithms have two costs: computation and communication
• Communication: moving data between levels of the memory hierarchy (sequential) or between processors (parallel)
• On today’s computers, communication is expensive, computation is cheap, in terms of both time and energy!
[Figure: sequential model (CPU, cache, DRAM) and parallel model (several CPUs, each with its own DRAM)]
Future Exascale Systems
                                   Petascale Systems (2009)   Predicted Exascale Systems*   Factor Improvement
System Peak                        2 ⋅ 10^15 flops            10^18 flops                   ~1000
Node Memory Bandwidth              25 GB/s                    0.4-4 TB/s                    ~10-100
Total Node Interconnect Bandwidth  3.5 GB/s                   100-400 GB/s                  ~100
Memory Latency                     100 ns                     50 ns                         ~1
Interconnect Latency               1 μs                       0.5 μs                        ~1
*Sources: from P. Beckman (ANL), J. Shalf (LBL), and D. Unat (LBL)
• Gaps between communication/computation cost only growing larger in future systems
• Avoiding communication will be essential for applications at exascale!
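The improvement factors in the table can be checked by simple division. The sketch below copies the table values; the dictionary layout and the function name `improvement` are illustrative choices, not part of the talk. The exact ratios differ from the table's last column, which reads as order-of-magnitude rounding.

```python
# Values copied from the table; ranges are stored as (low, high).
# Bandwidths are in GB/s and latencies in ns so that each row uses one unit.
TABLE = {
    "system peak (flops)":           ((2e15, 2e15),     (1e18, 1e18)),
    "node memory bandwidth (GB/s)":  ((25.0, 25.0),     (400.0, 4000.0)),
    "interconnect bandwidth (GB/s)": ((3.5, 3.5),       (100.0, 400.0)),
    "memory latency (ns)":           ((100.0, 100.0),   (50.0, 50.0)),
    "interconnect latency (ns)":     ((1000.0, 1000.0), (500.0, 500.0)),
}

def improvement(name):
    """Return the (low, high) improvement factor for one table row."""
    (peta_lo, peta_hi), (exa_lo, exa_hi) = TABLE[name]
    if "latency" in name:
        # For latency, smaller is better, so the factor is old / new.
        return (peta_lo / exa_hi, peta_hi / exa_lo)
    return (exa_lo / peta_hi, exa_hi / peta_lo)

for name in TABLE:
    lo, hi = improvement(name)
    print(f"{name:32s} {lo:8.1f} to {hi:8.1f}")
```

Peak flops improve by a factor in the hundreds while both latencies improve by only a factor of 2, which is the gap the slide points to.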
Krylov Subspace Methods
• General class of iterative solvers: used for linear systems, eigenvalue problems, singular value problems, least squares, etc.
• Projection process onto the expanding Krylov subspace
  $\mathcal{K}_m(A, r_0) = \mathrm{span}\{r_0, A r_0, A^2 r_0, \ldots, A^{m-1} r_0\}$
• In each iteration,
  • Add a dimension to the Krylov subspace $\mathcal{K}_m$
  • Orthogonalize (with respect to some $\mathcal{L}_m$)
• Examples: Lanczos/Conjugate Gradient (CG), Arnoldi/Generalized Minimum Residual (GMRES), Biconjugate Gradient (BICG), BICGSTAB, GKL, LSQR, etc.
[Figure: projection diagram with $r_0$, $A\delta$, $r_{\text{new}}$, and the subspace $\mathcal{L}$]
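As a minimal sketch of "adding a dimension" to the Krylov subspace, the code below builds the matrix whose columns are $r_0, A r_0, \ldots, A^{m-1} r_0$. The function name `krylov_basis` and the small 1D Laplacian test matrix are assumptions made for illustration; practical solvers orthogonalize as they go instead of storing raw powers.

```python
import numpy as np

def krylov_basis(A, r0, m):
    """Return the n-by-m matrix [r0, A r0, ..., A^(m-1) r0]."""
    K = np.empty((r0.shape[0], m))
    K[:, 0] = r0
    for j in range(1, m):
        K[:, j] = A @ K[:, j - 1]  # one matrix-vector product per new dimension
    return K

# Hypothetical test data: 1D Laplacian of size 6 and a vector of ones.
n = 6
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
r0 = np.ones(n)
K = krylov_basis(A, r0, 3)
print(np.linalg.matrix_rank(K))  # dimension of the subspace spanned so far
```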
Krylov Solvers: Limited by Communication
In terms of communication:
• “Add a dimension to $\mathcal{K}_m$” → Sparse Matrix-Vector Multiplication (SpMV)
  • Parallel: communicate vector entries with neighbors
  • Sequential: read $A$/vectors from slow memory
• “Orthogonalize (with respect to some $\mathcal{L}_m$)” → Inner products
  • Parallel: global reduction (All-Reduce)
  • Sequential: multiple reads/writes to slow memory
• Dependencies between communication-bound kernels in each iteration limit performance!
[Figure: each iteration alternates SpMV and orthogonalize]
The Classical Lanczos Method
Given: initial vector $v_1$ with $\|v_1\|_2 = 1$
$u_1 = A v_1$
for $i = 1, 2, \ldots$, until convergence do
    $\alpha_i = v_i^T u_i$    (inner product)
    $w_i = u_i - \alpha_i v_i$
    $\beta_{i+1} = \|w_i\|_2$    (inner product)
    $v_{i+1} = w_i / \beta_{i+1}$
    $u_{i+1} = A v_{i+1} - \beta_{i+1} v_i$    (SpMV)
end for
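The recurrence above translates almost line for line into NumPy. This is a sketch under stated assumptions: a dense symmetric `A`, no reorthogonalization, and a fixed step count `m` in place of a convergence test. The function name, the 0-based storage of $\beta$, and the diagonal test matrix are illustrative choices.

```python
import numpy as np

def lanczos(A, v1, m):
    """Run m steps of the classical Lanczos recurrence on symmetric A.

    Returns V (n x (m+1)), alpha (length m), and beta (length m), where
    beta[i] holds the slide's beta_{i+2} (0-based storage of 1-based indices).
    """
    n = v1.shape[0]
    V = np.zeros((n, m + 1))
    alpha = np.zeros(m)
    beta = np.zeros(m)
    V[:, 0] = v1 / np.linalg.norm(v1)
    u = A @ V[:, 0]                                  # SpMV
    for i in range(m):
        alpha[i] = V[:, i] @ u                       # inner product
        w = u - alpha[i] * V[:, i]
        beta[i] = np.linalg.norm(w)                  # inner product
        if beta[i] == 0.0:                           # invariant subspace found
            return V[:, : i + 1], alpha[: i + 1], beta[:i]
        V[:, i + 1] = w / beta[i]
        u = A @ V[:, i + 1] - beta[i] * V[:, i]      # SpMV
    return V, alpha, beta

# Hypothetical test problem: diagonal matrix, random starting vector.
rng = np.random.default_rng(0)
n, m = 50, 10
A = np.diag(np.linspace(1.0, 10.0, n))
V, alpha, beta = lanczos(A, rng.standard_normal(n), m)

# Check the three-term recurrence A V_m = V_m T_m + beta_{m+1} v_{m+1} e_m^T.
T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
R = A @ V[:, :m] - V[:, :m] @ T
R[:, -1] -= beta[-1] * V[:, m]
residual = np.linalg.norm(R)
print(residual)
```

Each pass through the loop performs one SpMV and two inner products, which are the communication-bound kernels named on the previous slide.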
Communication-Avoiding KSMs
• Idea: Compute blocks of $s$ iterations at once
  • Communicate every $s$ iterations instead of every iteration
  • Reduces communication cost by a factor of $O(s)$!
    • (latency in parallel, latency and bandwidth in sequential)
• An idea rediscovered many times…
  • First related work: s-dimensional steepest descent - Khabaza (‘63), Forsythe (‘68), Marchuk and Kuznecov (‘68)
  • Flurry of work on s-step Krylov methods in ‘80s/early ‘90s: see, e.g., Van Rosendale, 1983; Chronopoulos and Gear, 1989
  • Goals: increasing parallelism, avoiding I/O, increasing “convergence rate”
• Resurgence of interest in recent years due to growing problem sizes; growing relative cost of communication
Communication-Avoiding KSMs: CA-Lanczos
• Main idea: Unroll iteration loop by a factor of $s$; split iteration loop into an outer loop ($k$) and an inner loop ($j$)
• Key observation: starting at some iteration $i \equiv sk + j$,
  $v_{sk+j}, u_{sk+j} \in \mathcal{K}_{s+1}(A, v_{sk+1}) + \mathcal{K}_{s+1}(A, u_{sk+1})$ for $j \in \{1, \ldots, s+1\}$
Outer loop $k$: Communication step
Expand solution space $s$ dimensions at once
• Compute “basis matrix” $\mathcal{Y}_k$ with columns spanning $\mathcal{K}_{s+1}(A, v_{sk+1}) + \mathcal{K}_{s+1}(A, u_{sk+1})$
• Requires reading $A$/communicating vectors only once
  • Using “matrix powers kernel”
Orthogonalize all at once
• Compute/store block of inner products between basis vectors in Gram matrix: $\mathcal{G}_k = \mathcal{Y}_k^T \mathcal{Y}_k$
• Communication cost of one global reduction
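The outer-loop step can be sketched as follows. The talk does not fix a particular basis here, so the monomial basis $[v, Av, \ldots, A^s v, u, Au, \ldots, A^s u]$ is an assumption, as are the function name `basis_and_gram` and the test data. This sequential version only shows what is computed; a real matrix powers kernel would organize the same products so that data is communicated once per block.

```python
import numpy as np

def basis_and_gram(A, v, u, s):
    """Return the monomial basis Y (n x (2s+2)) and its Gram matrix G = Y^T Y.

    Columns 0..s span K_{s+1}(A, v); columns s+1..2s+1 span K_{s+1}(A, u).
    """
    n = v.shape[0]
    Y = np.empty((n, 2 * s + 2))
    Y[:, 0] = v
    Y[:, s + 1] = u
    for j in range(1, s + 1):
        Y[:, j] = A @ Y[:, j - 1]          # next power applied to v
        Y[:, s + 1 + j] = A @ Y[:, s + j]  # next power applied to u
    G = Y.T @ Y                            # one block of inner products
    return Y, G

# Hypothetical test data.
rng = np.random.default_rng(0)
n, s = 20, 3
A = np.diag(np.linspace(0.5, 1.5, n))
v = rng.standard_normal(n)
u = rng.standard_normal(n)
Y, G = basis_and_gram(A, v, u, s)
print(Y.shape, G.shape)
```

In a distributed setting, forming `G` is the single global reduction per outer iteration that the slide refers to.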
Communication-Avoiding KSMs: CA-Lanczos
Inner loop: Computation steps, no communication!
Perform $s$ iterations of updates
• Using $\mathcal{Y}_k$ and $\mathcal{G}_k$, this requires no communication!
• Represent $n$-vectors by their $O(s)$ coordinates in $\mathcal{Y}_k$:
  $v_{sk+j} = \mathcal{Y}_k v'_{k,j}$, $u_{sk+j} = \mathcal{Y}_k u'_{k,j}$, $w_{sk+j} = \mathcal{Y}_k w'_{k,j}$
• SpMV: $A v_{i+1}$ ($n \times n$ matrix times $n$-vector) → $\mathcal{B}_k v'_{k,j+1}$ ($O(s) \times O(s)$ matrix times $O(s)$-vector)
• Inner product: $v_i^T u_i$ → $v'^T_{k,j} \mathcal{G}_k u'_{k,j}$
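The two substitutions can be checked numerically. The sketch below assumes a monomial basis, for which $\mathcal{B}_k$ is a shift matrix; the talk does not specify the basis, and the variable names and test data are illustrative. The SpMV identity $A \mathcal{Y}_k x = \mathcal{Y}_k \mathcal{B}_k x$ holds only when $x$ puts no weight on the highest-degree column of each block.

```python
import numpy as np

s, n = 3, 20
# Hypothetical test matrix: scaled 1D Laplacian, so powers of A stay bounded.
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / 4.0
rng = np.random.default_rng(0)
v = rng.standard_normal(n)
u = rng.standard_normal(n)

# Monomial basis Y = [v, Av, ..., A^s v, u, Au, ..., A^s u] (assumed choice).
Y = np.empty((n, 2 * s + 2))
Y[:, 0] = v
Y[:, s + 1] = u
for j in range(1, s + 1):
    Y[:, j] = A @ Y[:, j - 1]
    Y[:, s + 1 + j] = A @ Y[:, s + j]
G = Y.T @ Y

# For this basis, A @ Y[:, j] == Y[:, j + 1] inside each block, so B is a shift.
B = np.zeros((2 * s + 2, 2 * s + 2))
for j in range(s):
    B[j + 1, j] = 1.0
    B[s + 2 + j, s + 1 + j] = 1.0

# Coordinate vectors; x has zero weight on the last column of each block.
x = rng.standard_normal(2 * s + 2)
x[s] = 0.0
x[2 * s + 1] = 0.0
y = rng.standard_normal(2 * s + 2)

spmv_error = np.linalg.norm(A @ (Y @ x) - Y @ (B @ x))   # A(Yx) vs Y(Bx)
dot_error = abs((Y @ x) @ (Y @ y) - x @ G @ y)           # (Yx)^T(Yy) vs x^T G y
print(spmv_error, dot_error)
```

Both errors are at the level of rounding, so the length-$n$ operations can be replaced by operations on vectors of length $2s+2$.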
The CA-Lanczos Method
Given: initial vector $v_1$ with $\|v_1\|_2 = 1$
$u_1 = A v_1$
for $k = 0, 1, \ldots$, until convergence do
    Compute $\mathcal{Y}_k$ (via CA Matrix Powers Kernel), compute $\mathcal{G}_k = \mathcal{Y}_k^T \mathcal{Y}_k$ (global reduction to compute $\mathcal{G}_k$)
    Let $v'_{k,1} = e_1$, $u'_{k,1} = e_{s+2}$
    for $j = 1, \ldots, s$ do    (local computations: no communication!)
        $\alpha_{sk+j} = v'^T_{k,j} \mathcal{G}_k u'_{k,j}$
        $w'_{k,j} = u'_{k,j} - \alpha_{sk+j} v'_{k,j}$
        $\beta_{sk+j+1} = (w'^T_{k,j} \mathcal{G}_k w'_{k,j})^{1/2}$
        $v'_{k,j+1} = w'_{k,j} / \beta_{sk+j+1}$
        $u'_{k,j+1} = \mathcal{B}_k v'_{k,j+1} - \beta_{sk+j+1} v'_{k,j}$
    end for
    Compute $v_{sk+s+1} = \mathcal{Y}_k v'_{k,s+1}$, $u_{sk+s+1} = \mathcal{Y}_k u'_{k,s+1}$
end for
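A compact sequential sketch of the algorithm above follows. It assumes a monomial basis (so $\mathcal{B}_k$ is a fixed shift matrix), a dense symmetric `A`, and a fixed number of outer iterations instead of a convergence test; none of these choices come from the talk, and the function name `ca_lanczos` is hypothetical. The monomial basis becomes ill-conditioned quickly as $s$ grows, so this is meant for small $s$ only.

```python
import numpy as np

def ca_lanczos(A, v1, s, num_outer):
    """Return the Lanczos coefficients (alphas, betas) from s-step Lanczos."""
    n = v1.shape[0]
    dim = 2 * s + 2
    v = v1 / np.linalg.norm(v1)
    u = A @ v
    # Shift matrix for the monomial basis: A @ Y[:, j] == Y[:, j + 1] per block.
    B = np.zeros((dim, dim))
    for j in range(s):
        B[j + 1, j] = 1.0
        B[s + 2 + j, s + 1 + j] = 1.0
    alphas, betas = [], []
    for _ in range(num_outer):
        # Outer loop (communication step): basis matrix and Gram matrix.
        Y = np.empty((n, dim))
        Y[:, 0] = v
        Y[:, s + 1] = u
        for j in range(1, s + 1):
            Y[:, j] = A @ Y[:, j - 1]
            Y[:, s + 1 + j] = A @ Y[:, s + j]
        G = Y.T @ Y
        # Inner loop: only vectors of length 2s+2 are touched.
        vc = np.zeros(dim)
        vc[0] = 1.0            # v'_{k,1} = e_1
        uc = np.zeros(dim)
        uc[s + 1] = 1.0        # u'_{k,1} = e_{s+2}
        for _ in range(s):
            alpha = vc @ G @ uc
            wc = uc - alpha * vc
            beta = np.sqrt(wc @ G @ wc)
            vc_next = wc / beta
            uc = B @ vc_next - beta * vc
            vc = vc_next
            alphas.append(alpha)
            betas.append(beta)
        # Recover the length-n vectors that start the next block.
        v = Y @ vc
        u = Y @ uc
    return np.array(alphas), np.array(betas)

# Hypothetical well-conditioned test problem.
rng = np.random.default_rng(0)
n, s = 30, 3
A = np.diag(np.linspace(0.5, 1.5, n))
v1 = rng.standard_normal(n)
alphas, betas = ca_lanczos(A, v1, s, num_outer=2)
print(alphas)
```

For this small, well-conditioned problem the coefficients should agree with those of the classical recurrence to near machine precision; the later slides explain why that agreement can degrade as $s$ grows.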
Complexity Comparison
Example of parallel (per processor) complexity for $s$ iterations of Classical Lanczos vs. CA-Lanczos for a 2D 9-point stencil:
(Assuming each of $p$ processors owns $n/p$ rows of the matrix and $s \le n/p$)

               Flops                  Words Moved                     Messages
               SpMV      Orth.        SpMV            Orth.           SpMV    Orth.
Classical CG   s·n/p     s·n/p        s·sqrt(n/p)     s·log2(p)       s       s·log2(p)
CA-CG          s·n/p     s^2·n/p      s·sqrt(n/p)     s^2·log2(p)     1       log2(p)

All values in the table meant in the Big-O sense (i.e., lower order terms and constants not included)
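To make the table concrete, the sketch below evaluates its leading-order terms for one set of parameters. The words-moved entry for SpMV is read here as $s\sqrt{n/p}$, which is an assumption about the flattened table; the function name and the sample values of `n`, `p`, and `s` are hypothetical, and constants are dropped exactly as in the table.

```python
import math

def per_processor_costs(n, p, s):
    """Leading-order cost terms for s iterations, as (SpMV, Orth.) pairs."""
    classical = {
        "flops": (s * n / p, s * n / p),
        "words": (s * math.sqrt(n / p), s * math.log2(p)),
        "messages": (s, s * math.log2(p)),
    }
    ca = {
        "flops": (s * n / p, s**2 * n / p),
        "words": (s * math.sqrt(n / p), s**2 * math.log2(p)),
        "messages": (1, math.log2(p)),
    }
    return classical, ca

# Hypothetical parameters: about one million unknowns on 1024 processors.
classical, ca = per_processor_costs(n=2**20, p=2**10, s=8)
for key in ("flops", "words", "messages"):
    print(key, classical[key], ca[key])
```

With these parameters the message count drops by a factor of $s$ in both kernels, while orthogonalization flops and words grow by a factor of $s$, which is the trade the method makes.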
From Theory to Practice
• Parameter 𝑠 is limited by machine parameters and matrix sparsity structure
• We can auto-tune to find the best 𝑠 based on these properties
• That is, find the 𝑠 that gives the lowest time per iteration
• In practice, we care not only about time per iteration, but also about the number of iterations
Runtime = (time/iteration) x (# iterations)
• We also need to consider how convergence rate and accuracy are affected by choice of 𝑠!
From Theory to Practice
• CA-KSMs are mathematically equivalent to classical KSMs
• But can behave much differently in finite precision!
• Roundoff error bounds generally grow with increasing $s$
• Two effects of roundoff error:
  1. Decrease in accuracy → Tradeoff: increasing blocking factor $s$ past a certain point: accuracy limited
  2. Delay of convergence → Tradeoff: increasing blocking factor $s$ past a certain point: no speedup expected
Runtime = (time/iteration) x (# iterations)
Paige’s Results for Classical Lanczos
• Using bounds on local rounding errors in Lanczos, Paige showed that
1. The computed Ritz values always lie between the extreme eigenvalues of 𝐴 to within a small multiple of machine precision.
2. At least one small interval containing an eigenvalue of 𝐴 is found by the 𝑛th iteration.
3. The algorithm behaves numerically like Lanczos with full reorthogonalization until a very close eigenvalue approximation is found.
4. The loss of orthogonality among basis vectors follows a rigorous pattern and implies that some Ritz values have converged.
Do the same statements hold for CA-Lanczos?
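Paige’s Lanczos Convergence Analysis
Finite precision Lanczos process: ($A$ is $n \times n$ with at most $N$ nonzeros per row)
$A V_m = V_m T_m + \beta_{m+1} v_{m+1} e_m^T + \delta V_m$
$V_m = [v_1, \ldots, v_m]$, $\delta V_m = [\delta v_1, \ldots, \delta v_m]$,
$T_m = \begin{bmatrix} \alpha_1 & \beta_2 & & \\ \beta_2 & \ddots & \ddots & \\ & \ddots & \ddots & \beta_m \\ & & \beta_m & \alpha_m \end{bmatrix}$
for $i \in \{1, \ldots, m\}$,
$\|\delta v_i\|_2 \le \varepsilon_1 \sigma$
$\beta_{i+1} |v_i^T v_{i+1}| \le 2 \varepsilon_0 \sigma$
$|v_{i+1}^T v_{i+1} - 1| \le \varepsilon_0 / 2$
$|\beta_{i+1}^2 + \alpha_i^2 + \beta_i^2 - \|A v_i\|_2^2| \le 4i(3\varepsilon_0 + \varepsilon_1)\sigma^2$
where $\sigma \equiv \|A\|_2$, and $\theta\sigma \equiv \||A|\|_2$
Classic Lanczos (Paige, 1976):
$\varepsilon_0 = O(\varepsilon n)$
$\varepsilon_1 = O(\varepsilon N \theta)$
CA-Lanczos:
$\varepsilon_0 = O(\varepsilon n \Gamma^2)$
$\varepsilon_1 = O(\varepsilon N \theta \Gamma)$
$\Gamma \le \max_{\ell \le k} \|\mathcal{Y}_\ell^+\|_2 \cdot \||\mathcal{Y}_\ell|\|_2 \le (2s+1) \cdot \max_{\ell \le k} \kappa(\mathcal{Y}_\ell)$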
The Amplification Term Γ
• Roundoff errors in CA variant follow same pattern as classical variant, but amplified by factor of $\Gamma$ or $\Gamma^2$
• Theoretically confirms empirical observations on importance of basis conditioning (dating back to late ‘80s)
• A loose bound for the amplification term:
  $\Gamma \le \max_{\ell \le k} \|\mathcal{Y}_\ell^+\|_2 \cdot \||\mathcal{Y}_\ell|\|_2 \le (2s+1) \cdot \max_{\ell \le k} \kappa(\mathcal{Y}_\ell)$
• What we really need: $\||\mathcal{Y}||y'|\|_2 \le \Gamma \|\mathcal{Y} y'\|_2$ to hold for the computed basis $\mathcal{Y}$ and coordinate vector $y'$ in every bound.
• Tighter bound on $\Gamma$ possible; requires some light bookkeeping
• Example: for bounds on $\beta_{i+1} |v_i^T v_{i+1}|$ and $|v_{i+1}^T v_{i+1} - 1|$, we can use the definition
  $\Gamma_{k,j} \equiv \max_{x \in \{w'_{k,j},\, u'_{k,j},\, v'_{k,j},\, v'_{k,j-1}\}} \dfrac{\||\mathcal{Y}_k||x|\|_2}{\|\mathcal{Y}_k x\|_2}$
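Both versions of the amplification term are cheap to evaluate for a given basis. The sketch below computes the ratio $\||\mathcal{Y}||x|\|_2 / \|\mathcal{Y}x\|_2$ over a set of coordinate vectors and the loose bound $\|\mathcal{Y}^+\|_2 \cdot \||\mathcal{Y}|\|_2$ for a single basis. The function names, the single-block monomial basis, and the sample coordinate vectors are assumptions made for illustration.

```python
import numpy as np

def gamma_kj(Y, coords):
    """Largest ratio || |Y| |x| ||_2 / || Y x ||_2 over the given vectors x."""
    abs_Y = np.abs(Y)
    return max(
        np.linalg.norm(abs_Y @ np.abs(x)) / np.linalg.norm(Y @ x) for x in coords
    )

def gamma_loose(Y):
    """Loose bound || Y^+ ||_2 * || |Y| ||_2 for one basis matrix."""
    return np.linalg.norm(np.linalg.pinv(Y), 2) * np.linalg.norm(np.abs(Y), 2)

# Hypothetical basis: [v, Av, A^2 v, A^3 v] for a well-conditioned diagonal A.
rng = np.random.default_rng(0)
n, s = 40, 3
A = np.diag(np.linspace(0.5, 1.5, n))
Y = np.empty((n, s + 1))
Y[:, 0] = rng.standard_normal(n)
for j in range(1, s + 1):
    Y[:, j] = A @ Y[:, j - 1]

# Sample coordinate vectors; the second one involves cancellation (v - Av).
coords = [np.array([1.0, 0.0, 0.0, 0.0]), np.array([1.0, -1.0, 0.0, 0.0])]
g = gamma_kj(Y, coords)
g_loose = gamma_loose(Y)
print(g, g_loose, np.linalg.cond(Y))
```

The tighter quantity is at least 1 and never exceeds the loose bound when the basis has full column rank, and it grows when the represented vector is small relative to the basis columns that combine to form it.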
[Figure: 2D Poisson, $n = 256$, random starting vector; panels for $s = 4$, $s = 8$, and $s = 12$. Each panel plots the computed values of $|v_{i+1}^T v_{i+1} - 1|$ and $\beta_{i+1} |v_i^T v_{i+1}|$ against their bounds ($\varepsilon_0/2$ and $2\varepsilon_0\sigma$), together with the amplification factor $\Gamma^2$.]
Results for CA-Lanczos
• Back to our question: Do Paige’s results, e.g., loss of orthogonality → eigenvalue convergence, hold for CA-Lanczos?
• The answer is YES! …but
• Only if:
  • $\varepsilon_0 \equiv 2\varepsilon(n + 11s + 15)\Gamma^2 \le \frac{1}{12}$
  • i.e., $\Gamma \le (24\varepsilon(n + 11s + 15))^{-1/2} = O((n\varepsilon)^{-1/2})$
• Otherwise, e.g., can lose orthogonality due to computation with (numerically) rank-deficient basis
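The condition on $\Gamma$ is easy to evaluate for given $n$ and $s$. The sketch below takes $\varepsilon$ to be the double-precision unit roundoff $2^{-53}$, which is an assumption (the slide does not say which convention is meant); the function name is hypothetical, and $n = 100$ with $s \in \{2, 4, 12\}$ matches the diagonal test problem shown in the plots.

```python
def gamma_threshold(n, s, eps=2.0**-53):
    """Largest Gamma allowed by 2*eps*(n + 11*s + 15)*Gamma**2 <= 1/12."""
    return (24.0 * eps * (n + 11 * s + 15)) ** -0.5

n = 100
thresholds = {s: gamma_threshold(n, s) for s in (2, 4, 12)}
for s, bound in thresholds.items():
    print(f"s = {s:2d}: Gamma must stay below about {bound:.3e}")
```

The threshold shrinks only slowly with $s$, so in practice the condition is violated because $\Gamma$ itself grows with the conditioning of the basis, not because the bound tightens.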
• Take-away: we can use this bound on $\Gamma$ to design a better algorithm!
  • Mixed precision, selective reorthogonalization, dynamic basis size, etc.
[Figure: diagonal matrix with $n = 100$ and evenly spaced eigenvalues between $\lambda_{\min} = 0.1$ and $\lambda_{\max} = 100$; random starting vector; panels for $s = 2$, $s = 4$, and $s = 12$. Top plots: computed $\Gamma^2$ against the threshold $(24\varepsilon(n + 11s + 15))^{-1}$. Bottom plots: computed Ritz values, true eigenvalues, and bounds on the range of computed Ritz values.]
[Figure: diagonal matrix with $n = 100$ and evenly spaced eigenvalues between $\lambda_{\min} = 0.1$ and $\lambda_{\max} = 100$; random starting vector. The plots compare $\max_i |z_i^{(m)T} v_{m+1}|$ (measure of loss of orthogonality) with $\min_i \beta_{m+1} \eta_{m,i}^{(m)}$ (measure of Ritz value convergence).]
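The two plotted quantities can be computed from the output of a Lanczos run. The sketch below uses classical Lanczos without reorthogonalization on the diagonal test problem described above. Taking $z_i^{(m)} = V_m q_i^{(m)}$ for the eigenvectors $q_i^{(m)}$ of $T_m$, taking $\eta_{m,i}^{(m)}$ as the last component of $q_i^{(m)}$, and using its absolute value are assumptions about the notation, and the function names are hypothetical.

```python
import numpy as np

def lanczos(A, v1, m):
    """Classical Lanczos, no reorthogonalization and no breakdown handling."""
    n = v1.shape[0]
    V = np.zeros((n, m + 1))
    alpha = np.zeros(m)
    beta = np.zeros(m)  # beta[i] holds beta_{i+2} in 1-based notation
    V[:, 0] = v1 / np.linalg.norm(v1)
    u = A @ V[:, 0]
    for i in range(m):
        alpha[i] = V[:, i] @ u
        w = u - alpha[i] * V[:, i]
        beta[i] = np.linalg.norm(w)
        V[:, i + 1] = w / beta[i]
        u = A @ V[:, i + 1] - beta[i] * V[:, i]
    return V, alpha, beta

def paige_measures(V, alpha, beta):
    """Return (loss of orthogonality, Ritz value convergence) at step m."""
    m = alpha.shape[0]
    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    _, Q = np.linalg.eigh(T)                    # columns are eigenvectors of T_m
    Z = V[:, :m] @ Q                            # Ritz vectors z_i
    loss = np.max(np.abs(Z.T @ V[:, m]))        # max_i |z_i^T v_{m+1}|
    conv = np.min(beta[-1] * np.abs(Q[-1, :]))  # min_i beta_{m+1} |eta_{m,i}|
    return loss, conv

# Test problem from the slides: n = 100, eigenvalues evenly spaced in [0.1, 100].
rng = np.random.default_rng(0)
A = np.diag(np.linspace(0.1, 100.0, 100))
v1 = rng.standard_normal(100)
for m in (10, 20, 40):
    loss, conv = paige_measures(*lanczos(A, v1, m))
    print(m, loss, conv)
```

Paige's analysis predicts that the first quantity grows as the second one shrinks, which is the relationship the plots are meant to show.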
Extending the results of Greenbaum (1989):
Eigenvalue approximations generated at each step by a perturbed Lanczos recurrence for $A$ are equal to those generated by exact Lanczos applied to a larger matrix whose eigenvalues lie within intervals about the eigenvalues of $A$.
[Figure: interval about an eigenvalue $\lambda$]
• Classical Lanczos: intervals of size $O(\varepsilon n^3 \|A\|)$
• CA-Lanczos: intervals of size $O(\varepsilon n^3 \|A\| \Gamma^2)$
Ongoing work…
Future Directions
• New Algorithms/Applications
  • Application of communication-avoiding ideas and solvers to new computational science domains
  • Design of new high-performance preconditioners
• Improving Usability
  • Automating parameter selection via “numerical auto-tuning”
• Finite-Precision Analysis
  • Bounds on stability and convergence for other Krylov methods (particularly in the nonsymmetric case)
  • Extension of “Backwards-like” error analyses
Broad research agenda: Design methods for large-scale problems that optimize performance subject to application-specific numerical constraints
Thank you!
contact: www.cims.nyu.edu/~erinc/