MS58: Approaches to Reducing Communication in Krylov ...erinc/ppt/Carson_SIAMLA15.pdf ·...

MS58: Approaches to Reducing Communication in Krylov Subspace Methods

Organizers: Laura Grigori (INRIA) and Erin Carson (NYU)

Talks:

1. The s-Step Lanczos Method and its Behavior in Finite Precision (Erin Carson, James W. Demmel)

2. Enlarged Krylov Subspace Methods for Reducing Communication (Sophie M. Moufawad, Laura Grigori, Frederic Nataf)

3. Preconditioning Communication-Avoiding Krylov Methods (Siva Rajamanickam, Ichitaro Yamazaki, Andrey Prokopenko, Erik G. Boman, Michael Heroux, Jack J. Dongarra)

4. Sparse Approximate Inverse Preconditioners for Communication-Avoiding Bicgstab Solvers (Maryam MehriDehnavi, Erin Carson, Nicholas Knight, James W. Demmel, David Fernandez)

1

The s-Step Lanczos Method and its Behavior in Finite

Precision

Erin Carson, NYU

James Demmel, UC Berkeley

SIAM LA ‘15

October 30, 2015

Why Avoid “Communication”?

• Algorithms have two costs: computation and communication

• Communication : moving data between levels of memory hierarchy (sequential), between processors (parallel)

• On today’s computers, communication is expensive, computation is cheap, in terms of both time and energy!

2

Sequential Parallel

CPU Cache

CPU DRAM

DRAM

CPU DRAM

CPU DRAM

CPU DRAM

Future Exascale Systems

PetascaleSystems (2009)

Predicted ExascaleSystems*

Factor Improvement

System Peak 2 ⋅ 1015 flops 1018 flops ~1000

Node Memory Bandwidth

25 GB/s 0.4-4 TB/s ~10-100

Total Node Interconnect Bandwidth

3.5 GB/s 100-400 GB/s ~100

Memory Latency 100 ns 50 ns ~1

Interconnect Latency 1 𝜇s 0.5 𝜇s ~1

3

*Sources: from P. Beckman (ANL), J. Shalf (LBL), and D. Unat (LBL)




Factor Improvement



25 GB/s 0.4-4 TB/s ~10-100


3.5 GB/s 100-400 GB/s ~100



3

• Gaps between communication/computation cost only growing larger in future systems





Factor Improvement



25 GB/s 0.4-4 TB/s ~10-100


3.5 GB/s 100-400 GB/s ~100



3

• Gaps between communication/computation cost only growing larger in future systems


• Avoiding communication will be essential for applications at exascale!

Krylov Subspace Methods

4

• In each iteration,

• Add a dimension to the Krylov subspace 𝒦𝑚

• Orthogonalize (with respect to some ℒ𝑚)

• Examples: Lanczos/Conjugate Gradient (CG), Arnoldi/Generalized Minimum Residual (GMRES), Biconjugate Gradient (BICG), BICGSTAB, GKL, LSQR, etc.

• Projection process onto the expanding Krylov subspace

𝒦𝑚 𝐴, 𝑟0 = span 𝑟0, 𝐴𝑟0, 𝐴2𝑟0, … , 𝐴𝑚−1𝑟0

• General class of iterative solvers: used for linear systems, eigenvalue problems, singular value problems, least squares, etc.

ℒ

𝑟new

𝐴𝛿

𝑟0

0

Krylov Solvers: Limited by CommunicationIn terms of communication:

5


5

“Add a dimension to 𝒦𝑚” Sparse Matrix-Vector Multiplication (SpMV)• Parallel: comm. vector entries w/ neighbors• Sequential: read 𝐴/vectors from slow memory

×


5


“Orthogonalize (with respect to some ℒ𝑚)” Inner products

Parallel: global reduction (All-Reduce)Sequential: multiple reads/writes to slow memory

×

×


Dependencies between communication-bound kernels in each iteration limit performance!

SpMV

orthogonalize

5


“Orthogonalize (with respect to some ℒ𝑚)” Inner products

Parallel: global reduction (All-Reduce)Sequential: multiple reads/writes to slow memory

×

×

Given: initial vector 𝑣1 with 𝑣1 2= 1

𝑢1 = 𝐴𝑣1

for 𝑖 = 1, 2, … , until convergence do

𝛼𝑖 = 𝑣𝑖𝑇𝑢𝑖

𝑤𝑖 = 𝑢𝑖 − 𝛼𝑖𝑣𝑖

𝛽𝑖+1 = 𝑤𝑖 2

𝑣𝑖+1 = 𝑤𝑖/𝛽𝑖+1

𝑢𝑖+1 = 𝐴𝑣𝑖+1 − 𝛽𝑖+1𝑣𝑖

end for

6

The Classical Lanczos Method


𝑢1 = 𝐴𝑣1




𝛽𝑖+1 = 𝑤𝑖 2



end for

6

SpMV



𝑢1 = 𝐴𝑣1




𝛽𝑖+1 = 𝑤𝑖 2



end for


6

SpMV

Inner products

Communication-Avoiding KSMs

7

• Idea: Compute blocks of 𝑠 iterations at once

• Communicate every 𝑠 iterations instead of every iteration

• Reduces communication cost by 𝑶(𝒔)!

• (latency in parallel, latency and bandwidth in sequential)


7





• An idea rediscovered many times…• First related work: s-dimensional steepest descent - Khabaza

(‘63), Forsythe (‘68), Marchuk and Kuznecov (‘68): • Flurry of work on s-step Krylov methods in ‘80s/early ‘90s: see,

e.g., Van Rosendale, 1983; Chronopoulos and Gear, 1989• Goals: increasing parallelism, avoiding I/O, increasing

“convergence rate”


7





• An idea rediscovered many times…• First related work: s-dimensional steepest descent - Khabaza

(‘63), Forsythe (‘68), Marchuk and Kuznecov (‘68): • Flurry of work on s-step Krylov methods in ‘80s/early ‘90s: see,

e.g., Van Rosendale, 1983; Chronopoulos and Gear, 1989• Goals: increasing parallelism, avoiding I/O, increasing

“convergence rate”

• Resurgence of interest in recent years due to growing problem sizes; growing relative cost of communication

Communication-Avoiding KSMs: CA-Lanczos

8

• Main idea: Unroll iteration loop by a factor of 𝑠; split iteration loop into an outer loop (k) and an inner loop (j)

• Key observation: starting at some iteration 𝑖 ≡ 𝑠𝑘 + 𝑗,

𝑣𝑠𝑘+𝑗 , 𝑢𝑠𝑘+𝑗 ∈ 𝒦𝑠+1 𝐴, 𝑣𝑠𝑘+1 + 𝒦𝑠+1 𝐴, 𝑢𝑠𝑘+1 for 𝑗 ∈ 1, … , 𝑠 + 1


8

Expand solution space 𝒔 dimensions at once• Compute “basis matrix” 𝒴𝑘 with columns spanning

𝒦𝑠+1 𝐴, 𝑣𝑠𝑘+1 + 𝒦𝑠+1 𝐴, 𝑢𝑠𝑘+1

• Requires reading 𝑨/communicating vectors only once• Using “matrix powers kernel”

Orthogonalize all at once• Compute/store block of inner products between basis vectors in

Gram matrix:

𝒢𝑘 = 𝒴𝑘𝑇𝒴𝑘

• Communication cost of one global reduction

Outer loop 𝒌: Communication step

• Main idea: Unroll iteration loop by a factor of 𝑠; split iteration loop into an outer loop (k) and an inner loop (j)

• Key observation: starting at some iteration 𝑖 ≡ 𝑠𝑘 + 𝑗,

𝑣𝑠𝑘+𝑗 , 𝑢𝑠𝑘+𝑗 ∈ 𝒦𝑠+1 𝐴, 𝑣𝑠𝑘+1 + 𝒦𝑠+1 𝐴, 𝑢𝑠𝑘+1 for 𝑗 ∈ 1, … , 𝑠 + 1

9

Perform 𝑠 iterations of updates• Using 𝒴𝑘 and 𝒢𝑘, this requires no communication!• Represent 𝑛-vectors by their 𝑂 𝑠 coordinates in 𝒴𝑘:

𝑣𝑠𝑘+𝑗 = 𝒴𝑘𝑣𝑘,𝑗′ , 𝑢𝑠𝑘+𝑗 = 𝒴𝑘𝑢𝑘,𝑗

′ , 𝑤𝑠𝑘+𝑗 = 𝒴𝑘𝑤𝑗′

Inner loop:Computation

steps, no communication!


9


𝐴𝑣𝑖+1

×𝑛

𝑛






9

→

ℬ𝑘𝑣𝑘,𝑗+1′

𝑂(𝑠)

𝑂(𝑠)

×


𝐴𝑣𝑖+1

×𝑛

𝑛






9

→


𝑂(𝑠)

𝑂(𝑠)

×


𝑣𝑖𝑇𝑢𝑖

×

𝐴𝑣𝑖+1

×𝑛

𝑛






9

→

→


𝑂(𝑠)

𝑂(𝑠)

×

× ×

𝑣𝑘,𝑖′𝑇 𝒢𝑘𝑢𝑘,𝑖

′


𝑣𝑖𝑇𝑢𝑖

×

𝐴𝑣𝑖+1

×𝑛

𝑛







𝑢1 = 𝐴𝑣1

for 𝑘 = 0, 1, … , until convergence doCompute 𝒴𝑘 , compute 𝒢𝑘 = 𝒴𝑘

𝑇𝒴𝑘

Let 𝑣𝑘,1′ = 𝑒1, 𝑢𝑘,1

′ = 𝑒𝑠+2

for 𝑗 = 1, … , 𝑠 do

𝛼𝑠𝑘+𝑗 = 𝑣𝑘,𝑗′𝑇 𝒢𝑘𝑢𝑘,𝑗

′

𝑤𝑘,𝑗′ = 𝑢𝑘,𝑗

′ − 𝛼𝑠𝑘+𝑗𝑣𝑘,𝑗′

𝛽𝑠𝑘+𝑗+1 = 𝑤𝑘,𝑗′𝑇 𝒢𝑘𝑤𝑘,𝑗

′ 1/2

𝑣𝑘,𝑗+1′ = 𝑤𝑘,𝑗

′ / 𝛽𝑠𝑘+𝑗+1

𝑢𝑘,𝑗+1′ = ℬ𝑘𝑣𝑘,𝑗+1

′ − 𝛽𝑠𝑘+𝑗+1𝑣𝑘,𝑗′

end forCompute 𝑣𝑠𝑘+𝑠+1 = 𝒴𝑘𝑣𝑘,𝑠+1

′ , 𝑢𝑠𝑘+𝑠+1 = 𝒴𝑘𝑢𝑘,𝑠+1′

end for

10

The CA-Lanczos Method

via CA Matrix Powers Kernel

Global reduction

to compute 𝒢𝑘

10


𝑢1 = 𝐴𝑣1


𝑇𝒴𝑘


′ = 𝑒𝑠+2

for 𝑗 = 1, … , 𝑠 do


′




′ 1/2


′ / 𝛽𝑠𝑘+𝑗+1




′ , 𝑢𝑠𝑘+𝑠+1 = 𝒴𝑘𝑢𝑘,𝑠+1′

end for


via CA Matrix Powers Kernel

Global reduction

to compute 𝒢𝑘

10

Local computations: no communication!


𝑢1 = 𝐴𝑣1


𝑇𝒴𝑘


′ = 𝑒𝑠+2

for 𝑗 = 1, … , 𝑠 do


′




′ 1/2


′ / 𝛽𝑠𝑘+𝑗+1




′ , 𝑢𝑠𝑘+𝑠+1 = 𝒴𝑘𝑢𝑘,𝑠+1′

end for


Complexity Comparison

11

Flops Words Moved Messages

SpMV Orth. SpMV Orth. SpMV Orth.

Classical CG

𝑠𝑛

𝑝

𝑠𝑛

𝑝 𝑠 𝑛 𝑝 𝑠 log2 𝑝 𝑠 𝑠 log2 𝑝

CA-CG𝑠𝑛

𝑝𝑠2𝑛

𝑝𝑠 𝑛 𝑝 𝑠2 log2 𝑝 1 log2 𝑝

Example of parallel (per processor) complexity for 𝑠 iterations of Classical Lanczos vs. CA-Lanczos for a 2D 9-point stencil:

(Assuming each of 𝑝 processors owns 𝑛/𝑝 rows of the matrix and 𝑠 ≤ 𝑛/𝑝)

All values in the table meant in the Big-O sense (i.e., lower order terms and constants not included)

Complexity Comparison

11

Flops Words Moved Messages

SpMV Orth. SpMV Orth. SpMV Orth.

Classical CG

𝑠𝑛

𝑝

𝑠𝑛

𝑝 𝑠 𝑛 𝑝 𝑠 log2 𝑝 𝑠 𝑠 log2 𝑝

CA-CG𝑠𝑛

𝑝𝑠2𝑛

𝑝𝑠 𝑛 𝑝 𝑠2 log2 𝑝 1 log2 𝑝

All values in the table meant in the Big-O sense (i.e., lower order terms and constants not included)

Example of parallel (per processor) complexity for 𝑠 iterations of Classical Lanczos vs. CA-Lanczos for a 2D 9-point stencil:

(Assuming each of 𝑝 processors owns 𝑛/𝑝 rows of the matrix and 𝑠 ≤ 𝑛/𝑝)

From Theory to Practice

• Parameter 𝑠 is limited by machine parameters and matrix sparsity structure

• We can auto-tune to find the best 𝑠 based on these properties

• That is, find 𝑠 that gives the fastest speed per iteration

12





• In practice, we don’t just care about speed per iteration, but also the number of iterations

Runtime = (time/iteration) x (# iterations)

12





• In practice, we don’t just care about speed per iteration, but also the number of iterations


• We also need to consider how convergence rate and accuracy are affected by choice of 𝑠!

12


13

• CA-KSMs are mathematically equivalent to classical KSMs


13


• But can behave much differently in finite precision!


13


• Roundoff error bounds generally grow with increasing 𝑠



13




• Two effects of roundoff error:


13





1. Decrease in accuracy → Tradeoff: increasing blocking factor 𝑠 past a certain point: accuracy limited


13






2. Delay of convergence → Tradeoff: increasing blocking factor 𝑠 past a certain point: no speedup expected


13







2. Delay of convergence → Tradeoff: increasing blocking factor 𝑠 past a certain point: no speedup expected

Paige’s Results for Classical Lanczos

• Using bounds on local rounding errors in Lanczos, Paige showed that

1. The computed Ritz values always lie between the extreme eigenvalues of 𝐴 to within a small multiple of machine precision.

2. At least one small interval containing an eigenvalue of 𝐴 is found by the 𝑛th iteration.

3. The algorithm behaves numerically like Lanczos with full reorthogonalization until a very close eigenvalue approximation is found.

4. The loss of orthogonality among basis vectors follows a rigorous pattern and implies that some Ritz values have converged.

14

Paige’s Results for Classical Lanczos

• Using bounds on local rounding errors in Lanczos, Paige showed that

1. The computed Ritz values always lie between the extreme eigenvalues of 𝐴 to within a small multiple of machine precision.

2. At least one small interval containing an eigenvalue of 𝐴 is found by the 𝑛th iteration.

3. The algorithm behaves numerically like Lanczos with full reorthogonalization until a very close eigenvalue approximation is found.

4. The loss of orthogonality among basis vectors follows a rigorous pattern and implies that some Ritz values have converged.

Do the same statements hold for CA-Lanczos?

14

Finite precision Lanczos process: (𝐴 is 𝑛 × 𝑛 with at most 𝑁 nonzeros per row)

𝐴 𝑉𝑚 = 𝑉𝑚 𝑇𝑚 + 𝛽𝑚+1 𝑣𝑚+1𝑒𝑚

𝑇 + 𝛿 𝑉𝑚

𝑉𝑚 = 𝑣1, … , 𝑣𝑚 , 𝛿 𝑉𝑚 = 𝛿 𝑣1, … , 𝛿 𝑣𝑚 , 𝑇𝑚 =

𝛼1 𝛽2

𝛽2 ⋱ ⋱

⋱ ⋱ 𝛽𝑚

𝛽𝑚 𝛼𝑚

Paige’s Lanczos Convergence Analysis

15





𝛼1 𝛽2

𝛽2 ⋱ ⋱

⋱ ⋱ 𝛽𝑚

𝛽𝑚 𝛼𝑚


for 𝑖 ∈ {1, … , 𝑚},𝛿 𝑣𝑖 2 ≤ 휀1𝜎

𝛽𝑖+1 𝑣𝑖𝑇 𝑣𝑖+1 ≤ 2휀0𝜎

𝑣𝑖+1𝑇 𝑣𝑖+1 − 1 ≤ 휀0 2

𝛽𝑖+12 + 𝛼𝑖

2 + 𝛽𝑖2 − 𝐴 𝑣𝑖 2

2 ≤ 4𝑖 3휀0 + 휀1 𝜎2

15

where 𝜎 ≡ 𝐴 2, and𝜃𝜎 ≡ 𝐴 2





𝛼1 𝛽2

𝛽2 ⋱ ⋱

⋱ ⋱ 𝛽𝑚

𝛽𝑚 𝛼𝑚


Classic Lanczos (Paige, 1976):

for 𝑖 ∈ {1, … , 𝑚},𝛿 𝑣𝑖 2 ≤ 휀1𝜎


𝑣𝑖+1𝑇 𝑣𝑖+1 − 1 ≤ 휀0 2


2 + 𝛽𝑖2 − 𝐴 𝑣𝑖 2

2 ≤ 4𝑖 3휀0 + 휀1 𝜎2

15


휀0 = 𝑂 휀𝑛

휀1 = 𝑂 휀𝑁𝜃





𝛼1 𝛽2

𝛽2 ⋱ ⋱

⋱ ⋱ 𝛽𝑚

𝛽𝑚 𝛼𝑚



for 𝑖 ∈ {1, … , 𝑚},𝛿 𝑣𝑖 2 ≤ 휀1𝜎


𝑣𝑖+1𝑇 𝑣𝑖+1 − 1 ≤ 휀0 2


2 + 𝛽𝑖2 − 𝐴 𝑣𝑖 2

2 ≤ 4𝑖 3휀0 + 휀1 𝜎2

15


CA-Lanczos:

휀0 = 𝑂 휀𝑛


휀0 = 𝑂 휀𝑛𝚪𝟐

휀1 = 𝑂 휀𝑁𝜃𝚪





𝛼1 𝛽2

𝛽2 ⋱ ⋱

⋱ ⋱ 𝛽𝑚

𝛽𝑚 𝛼𝑚



for 𝑖 ∈ {1, … , 𝑚},𝛿 𝑣𝑖 2 ≤ 휀1𝜎


𝑣𝑖+1𝑇 𝑣𝑖+1 − 1 ≤ 휀0 2


2 + 𝛽𝑖2 − 𝐴 𝑣𝑖 2

2 ≤ 4𝑖 3휀0 + 휀1 𝜎2

15


CA-Lanczos:

휀0 = 𝑂 휀𝑛


휀0 = 𝑂 휀𝑛𝚪𝟐

휀1 = 𝑂 휀𝑁𝜃𝚪

Γ ≤ maxℓ≤𝑘

𝒴ℓ+

2 ∙ 𝒴ℓ 2 ≤ 2𝑠+1 ∙ maxℓ≤𝑘

𝜅 𝒴ℓ

• Roundoff errors in CA variant follow same pattern as classical variant, but amplified by factor of Γ or Γ2

• Theoretically confirms empirical observations on importance of basis conditioning (dating back to late ‘80s)

• A loose bound for the amplification term:

Γ ≤ maxℓ≤𝑘

𝒴ℓ+

2 ∙ 𝒴ℓ 2 ≤ 2𝑠+1 ∙ maxℓ≤𝑘

𝜅 𝒴ℓ

• What we really need: 𝒴 |𝑦′| 2 ≤ Γ 𝒴𝑦′ 2 to hold for the computed basis 𝒴and coordinate vector 𝑦′ in every bound.

• Tighter bound on 𝚪 possible; requires some light bookkeeping

• Example: for bounds on 𝛽𝑖+1 𝑣𝑖𝑇 𝑣𝑖+1 and 𝑣𝑖+1

𝑇 𝑣𝑖+1 − 1 , we can use the definition

Γ𝑘,𝑗 ≡ max𝑥∈{ 𝑤𝑘,𝑗

′ , 𝑢𝑘,𝑗′ , 𝑣𝑘,𝑗

′ , 𝑣𝑘,𝑗−1′ }

𝒴𝑘 𝑥2

𝒴𝑘𝑥2

The Amplification Term Γ

16

Problem: 2D Poisson, 𝑛 = 256, random starting vector

𝑣𝑖+1𝑇 𝑣𝑖+1 − 1 ≤ 휀0 2


Computed value

Bound Amplification factor Γ2

𝒔 = 𝟒


Computed value

Bound Amplification factor Γ

𝒔 = 𝟖

𝑣𝑖+1𝑇 𝑣𝑖+1 − 1 ≤ 휀0 2


𝒔 = 𝟏𝟐


Computed value

Bound Amplification factor Γ2

𝑣𝑖+1𝑇 𝑣𝑖+1 − 1 ≤ 휀0 2


Results for CA-Lanczos

18

• Back to our question: Do Paige’s results, e.g.,loss of orthogonality eigenvalue convergence

hold for CA-Lanczos?

• The answer is YES!


18

…but



• Only if:

• 휀0 ≡ 2휀 𝑛+11𝑠+15 Γ2 ≤1

12

• i.e., Γ ≤ 24𝜖 𝑛 + 11𝑠 + 15− 1 2

= 𝑂 𝑛𝜖 −1/2

• Otherwise, e.g., can lose orthogonality due to computation with (numerically) rank-deficient basis



18

…but



• Only if:

• 휀0 ≡ 2휀 𝑛+11𝑠+15 Γ2 ≤1

12

• i.e., Γ ≤ 24𝜖 𝑛 + 11𝑠 + 15− 1 2

= 𝑂 𝑛𝜖 −1/2

• Otherwise, e.g., can lose orthogonality due to computation with (numerically) rank-deficient basis



18

…but



• Take-away: we can use this bound on Γ to design a better algorithm!• Mixed precision, selective reorthogonalization, dynamic basis size, etc.

Problem: Diagonal matrix with 𝑛 = 100 with evenly spaced eigenvalues between 𝜆𝑚𝑖𝑛 = 0.1and 𝜆𝑚𝑎𝑥 = 100; random starting vector

Top plots:

Computed Γ2

24(𝜖(𝑛 + 11𝑠 + 15) −1

Bottom Plots:

𝒔 = 𝟐

Computed Ritz values True eigenvalues

Bounds on range of computed Ritz values

𝒔 = 𝟒

Bottom Plots:


Top plots:

Computed Γ2

24(𝜖(𝑛 + 11𝑠 + 15) −1



𝒔 = 𝟏𝟐

Bottom Plots:


Top plots:

Computed Γ2

24(𝜖(𝑛 + 11𝑠 + 15) −1



Problem: Diagonal matrix with 𝑛 = 100 with evenly spaced eigenvalues between 𝜆𝑚𝑖𝑛 = 0.1 and 𝜆𝑚𝑎𝑥 = 100; random starting vector

max𝑖

|𝑧𝑖𝑚 𝑇

𝑣𝑚+1|

min𝑖

𝛽𝑚+1𝜂𝑚,𝑖(𝑚)


max𝑖

|𝑧𝑖𝑚 𝑇

𝑣𝑚+1|

min𝑖


Measure of loss of orthogonality

Measure of Ritz value convergence


max𝑖

|𝑧𝑖𝑚 𝑇

𝑣𝑚+1|

min𝑖


Extending the results of Greenbaum (1989):

21

Eigenvalue approximations generated at each step by a perturbed Lanczos recurrence for 𝐴 are equal to those generated by exact Lanczos applied a larger matrix whose eigenvalues lie within intervals about the eigenvalues of 𝐴.

𝜆


21


𝜆

𝑂(𝜖𝑛3 𝐴 )

Classical Lanczos


21


𝜆


𝑂(𝜖𝑛3 𝐴 𝚪𝟐)

Classical Lanczos

CA-Lanczos



21

𝜆


𝑂(𝜖𝑛3 𝐴 𝚪𝟐)

Classical Lanczos

CA-Lanczos


Ongoing work…

21


Future Directions

22

• New Algorithms/Applications• Application of communication-avoiding ideas and solvers to new

computational science domains

• Design of new high-performance preconditioners

• Improving Usability• Automating parameter selection via “numerical auto-tuning”

• Finite-Precision Analysis• Bounds on stability and convergence for other Krylov methods

(particularly in the nonsymmetric case)

• Extension of “Backwards-like” error analyses

Broad research agenda: Design methods for large-scale problems that optimize performance subject to application-specific numerical constraints

Thank you!

contact: [email protected]://www.cims.nyu.edu/~erinc/

MS58: Approaches to Reducing Communication in Krylov ...erinc/ppt/Carson_SIAMLA15.pdf ·...

Documents

Transcript of MS58: Approaches to Reducing Communication in Krylov ...erinc/ppt/Carson_SIAMLA15.pdf ·...