Dense Linear Algebra for Massively Parallel Computers
Jack Dongarra
University of Tennessee / Oak Ridge National Laboratory / University of Manchester
Horizon Maths
Dense Linear Algebra
• Look at the largest machines today.
• Look at where things are headed.
• What's the current thinking on effective use.
• Applies to DLA, but the ideas also extend beyond that.
November 2012: The TOP10

| Rank | Site | Computer | Country | Cores | Rmax [Pflop/s] | % of Peak | Power [MW] | MFlops/Watt |
|------|------|----------|---------|-------|----------------|-----------|------------|-------------|
| 1 | DOE / OS Oak Ridge Nat Lab | Titan, Cray XK7 (16c) + Nvidia Kepler GPU (14c) + custom | USA | 560,640 | 17.6 | 66 | 8.3 | 2120 |
| 2 | DOE / NNSA Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 16.3 | 81 | 7.9 | 2063 |
| 3 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827 |
| 4 | DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + custom | USA | 786,432 | 8.16 | 81 | 3.95 | 2066 |
| 5 | Forschungszentrum Juelich | JuQUEEN, BlueGene/Q (16c) + custom | Germany | 393,216 | 4.14 | 82 | 1.97 | 2102 |
| 6 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 90* | 3.42 | 848 |
| 7 | Texas Advanced Computing Center | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 2.66 | 67 | 3.3 | 806 |
| 8 | Nat. SuperComputer Center in Tianjin | Tianhe-1A, NUDT Intel (6c) + Nvidia Fermi GPU (14c) + custom | China | 186,368 | 2.57 | 55 | 4.04 | 636 |
| 9 | CINECA | Fermi, BlueGene/Q (16c) + custom | Italy | 163,840 | 1.73 | 82 | 0.822 | 2105 |
| 10 | IBM | DARPA Trial System, Power7 (8c) + custom | USA | 63,360 | 1.51 | 78 | 0.358 | 422 |
| 500 | Slovak Academy Sci | IBM Power 7 | Slovak Rep | | | | | |
Energy Cost Challenge
• At ~$1M per MW, energy costs are substantial
  - 10 Pflop/s in 2011 uses ~10 MW
  - 1 Eflop/s in 2018 would be > 100 MW
• DOE target: 1 Eflop/s around 2020-2022 at 20 MW
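The arithmetic behind these bullets can be sketched in a few lines (a hypothetical helper, not from the talk):

```python
# Power needed to sustain a given flop rate at a given efficiency.
def power_mw(flops, gflops_per_watt):
    watts = flops / (gflops_per_watt * 1e9)
    return watts / 1e6

# At ~1 Gflop/s per watt (2011-era efficiency), 10 Pflop/s costs ~10 MW ...
print(power_mw(10e15, 1))   # 10.0 MW
# ... and 1 Eflop/s would cost 1000 MW at that efficiency; hitting the
# 20 MW DOE target requires roughly 50 Gflops/W.
print(power_mw(1e18, 50))   # 20.0 MW
```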
Potential System Architecture

| Systems | 2012 (Titan) | 2022 | Difference Today & 2022 |
|---------|--------------|------|-------------------------|
| System peak | 27 Pflop/s | 1 Eflop/s | O(100) |
| Power | 8.3 MW (2 Gflops/W) | ~20 MW (50 Gflops/W) | |
| System memory | 710 TB (38*18,688) | 32-64 PB | O(10) |
| Node performance | 1,452 GF/s (1,311+141) | 1.2 or 15 TF/s | O(10) - O(100) |
| Node memory BW | 232 GB/s (52+180) | 2-4 TB/s | O(1000) |
| Node concurrency | 16 CPU cores + 2,688 CUDA cores | O(1k) or 10k | O(100) - O(1000) |
| Total node interconnect BW | 8 GB/s | 200-400 GB/s | O(10) |
| System size (nodes) | 18,688 | O(100,000) or O(1M) | O(100) - O(1000) |
| Total concurrency | 50 M | O(billion) | O(1,000) |
| MTTF | unknown | O(<1 day) | O(10) |
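The parenthetical efficiency figures can be cross-checked against the TOP10 numbers; note that Titan's 2 Gflops/W follows from its Rmax of 17.6 Pflop/s, not its 27 Pflop/s peak (the helper below is our own illustration):

```python
# Achieved efficiency from a flop rate and a power draw in MW.
def gflops_per_watt(flops, power_mw):
    return (flops / 1e9) / (power_mw * 1e6)

print(round(gflops_per_watt(17.6e15, 8.3), 1))  # 2.1 -- Titan, from Rmax
print(round(gflops_per_watt(1e18, 20.0), 1))    # 50.0 -- the 2022 target
```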
Potential System Architecture with a cap of $200M and 20 MW

| Systems | 2012 (Titan) | 2020-2022 | Difference Today & 2022 |
|---------|--------------|-----------|-------------------------|
| System peak | 27 Pflop/s | 1 Eflop/s | O(100) |
| Power | 8.3 MW (2 Gflops/W) | ~20 MW (50 Gflops/W) | O(10) |
| System memory | 710 TB (38*18,688) | 32-64 PB | O(100) |
| Node performance | 1,452 GF/s (1,311+141) | 1.2 or 15 TF/s | O(10) |
| Node memory BW | 232 GB/s (52+180) | 2-4 TB/s | O(10) |
| Node concurrency | 16 CPU cores + 2,688 CUDA cores | O(1k) or 10k | O(100) - O(10) |
| Total node interconnect BW | 8 GB/s | 200-400 GB/s | O(100) |
| System size (nodes) | 18,688 | O(100,000) or O(1M) | O(10) - O(100) |
| Total concurrency | 50 M | O(billion) | O(100) |
| MTTF | unknown | O(<1 day) | O(?) |
Major Changes to Algorithms/Software
• Must rethink the design of our applications, algorithms and software
  - Manycore and hybrid architectures are disruptive technology
  - Similar to what happened with cluster computing and message passing
• Data movement is expensive and flops are cheap
  - Many methods are not optimized this way
Critical Issues at Peta & Exascale for Algorithm and Software Design
• Synchronization-reducing algorithms
  - Break the fork-join model
• Communication-reducing algorithms
  - Communication carries a high energy cost
  - Use methods which attain the lower bound on communication
• Mixed precision methods
  - 2x speed of ops and 2x speed for data movement
• Autotuning
  - Today's machines are too complicated; build "smarts" into software to adapt to the hardware
• Fault resilient algorithms
  - Implement algorithms that can recover from failures/bit flips
• Reproducibility of results
  - Today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
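As an illustration of the mixed-precision bullet, here is a minimal numpy sketch (our own, not PLASMA/MAGMA code) of iterative refinement: the O(n³) solve runs in single precision, and a few cheap double-precision residual corrections recover full accuracy. A production code would factor the matrix once and reuse the factors; `np.linalg.solve` is called repeatedly here only for brevity.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=4):
    A32 = A.astype(np.float32)
    # O(n^3) work in fast single precision
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                   # residual in double
        d = np.linalg.solve(A32, r.astype(np.float32))  # correction in single
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
x_true = rng.standard_normal(n)
b = A @ x_true
x = mixed_precision_solve(A, b)
print(np.max(np.abs(x - x_true)) < 1e-9)  # True: refined to ~double accuracy
```

This only converges when the single-precision factorization is accurate enough relative to the conditioning of A, which is why the sketch uses a well-conditioned matrix.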
A New Generation of DLA Software
Software/algorithms follow hardware evolution in time:
• LINPACK (70's), vector operations. Rely on: Level-1 BLAS operations.
• LAPACK (80's), blocking, cache friendly. Rely on: Level-3 BLAS operations.
• ScaLAPACK (90's), distributed memory. Rely on: PBLAS, message passing.
• PLASMA, new algorithms (many-core friendly). Rely on: a DAG/scheduler, block data layout, some extra kernels.
• MAGMA, hybrid algorithms (heterogeneity friendly). Rely on: a hybrid scheduler, hybrid kernels.
Accelerating Dense Linear Algebra with GPUs
[Figure: LU factorization in double precision (DP) for solving a dense linear system. GFlop/s (0 to 240) vs. matrix size (1,024 to 9,088): GPU (MAGMA) vs. CPU (LAPACK).]
The slide compares the hardware behind the two curves: a GPU system (Fermi, with its CUDA cores, plus an Intel host) against a CPU-only system (multi-socket AMD ISTANBUL), listing the DP peak, system cost ($), and power* (W) of each. Bottom line: roughly the same peak and delivered performance at a fraction of the price and power.
* Computation consumed power rate (total system rate minus idle rate), measured with a Kill A Watt meter.
Parallelization of QR Factorization
Parallelize the update:
• Easy and done in any reasonable software.
• This is the 2/3 n³ term in the FLOPs count.
• Can be done "efficiently" with LAPACK + multithreaded BLAS.
[Figure: panel factorization (dgeqf2 + dlarft) produces V and R; the update of the remaining submatrix applies the block reflector (dlarfb), which is mostly dgemm. Fork & join parallelism, bulk synchronous processing.]
Synchronization (in LAPACK LU)
• Fork-join, bulk synchronous processing
[Figure: execution trace of fork-join LU, with parallel update phases separated by synchronization points where cores sit idle]
• Instead: allowing for delayed update and out-of-order, asynchronous, dataflow execution
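A serial numpy sketch of the fork-join pattern being criticized (illustrative, not LAPACK's actual dgetrf): each panel factorization must fully complete, and the trailing update (the part that forks across cores) must fully join, before the next panel can begin.

```python
import numpy as np

def blocked_lu(A, nb):
    """In-place blocked LU without pivoting; returns (L, U)."""
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1. panel factorization: sequential bottleneck on the critical path
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # 2. triangular solve for the U block row
            L_kk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L_kk, A[k:e, e:])
            # 3. trailing-matrix update: the "fork" every core must join
            #    before the next iteration's panel may start
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return np.tril(A, -1) + np.eye(n), np.triu(A)

rng = np.random.default_rng(1)
A0 = rng.standard_normal((12, 12)) + 12 * np.eye(12)  # safe without pivoting
L, U = blocked_lu(A0.copy(), 4)
print(np.allclose(L @ U, A0))  # True
```

The dataflow alternative breaks steps 1-3 into per-tile tasks so that, for example, a panel column can be factored as soon as its own tiles are updated, without waiting for the whole trailing matrix.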
PLASMA/MAGMA: Parallel Linear Algebra s/w for Multicore/Hybrid Architectures
• Objectives
  - High utilization of each core
  - Scaling to large number of cores
  - Synchronization-reducing algorithms
• Methodology
  - Dynamic DAG scheduling (QUARK)
  - Explicit parallelism
  - Implicit communication
  - Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling
[Figure: execution traces over time for a 6 x 6 tile Cholesky, contrasting fork-join parallelism with DAG-scheduled parallelism, plus the task DAG of POTRF, TRSM, SYRK, and GEMM tasks]
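The DAG on this slide comes from the tile Cholesky algorithm, which can be sketched serially in a few lines (our own numpy illustration; in PLASMA each call below becomes a task whose tile dependencies the QUARK runtime tracks):

```python
import numpy as np

def tile_cholesky(A, nb):
    """Right-looking tile Cholesky; nb must divide A's order. Returns L."""
    t = A.shape[0] // nb
    T = lambda i, j: A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]   # view of tile (i, j)
    for k in range(t):
        T(k, k)[:] = np.linalg.cholesky(T(k, k))                   # POTRF
        for i in range(k + 1, t):
            # TRSM: A_ik <- A_ik L_kk^{-T} (explicit inverse for brevity)
            T(i, k)[:] = T(i, k) @ np.linalg.inv(T(k, k)).T
        for i in range(k + 1, t):
            T(i, i)[:] -= T(i, k) @ T(i, k).T                      # SYRK
            for j in range(k + 1, i):
                T(i, j)[:] -= T(i, k) @ T(j, k).T                  # GEMM
    return np.tril(A)

rng = np.random.default_rng(2)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)        # symmetric positive definite
L = tile_cholesky(A.copy(), 2)
print(np.allclose(L @ L.T, A))     # True
```

Each tile-level call reads and writes specific tiles only, which is exactly what lets a runtime execute independent TRSM/SYRK/GEMM tasks out of order.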
Pipelining: Cholesky Inversion
3 steps: factor (POTRF), invert L (TRTRI), multiply L's (LAUUM)
[Figure: execution traces of POTRF, TRTRI, and LAUUM on a tiled matrix across many cores; pipelining the three steps through the DAG shortens the critical path versus running them in sequence.]
Communication Avoiding Algorithms
• Goal: algorithms that communicate as little as possible
• J. Demmel, L. Grigori, and company have been working on algorithms that obtain a provable minimum of communication
• Direct methods (BLAS, LU, QR, SVD, other decompositions)
  - Communication lower bounds for all these problems
  - Algorithms that attain them (all dense linear algebra, some sparse)
• Iterative methods: Krylov subspace methods for Ax=b, Ax=λx
  - Communication lower bounds, and algorithms that attain them (depending on sparsity structure)
• For QR factorization they can show: [lower bound given on slide]
Communication Avoiding QR Example
A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The Third Conference on Hypercube Concurrent Computers and Applications, Pasadena, CA, Jan. 1988. ACM / Penn State.
[Figure: the domains D0-D3 of a tall matrix each run a local Domain_Tile_QR, producing R0-R3; the R factors are then merged pairwise up a binary tree (R0+R1 and R2+R3, then R0+R2) into the final R.]
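The reduction in the figure is easy to sketch with numpy (our own illustration of the TSQR/CAQR idea, not the talk's code): factor each domain locally, then merge the small R factors pairwise up a tree, so only R-sized blocks ever move between domains.

```python
import numpy as np

def tsqr_r(A, domains=4):
    """R factor of a tall-skinny A via a binary reduction tree of QRs."""
    Rs = [np.linalg.qr(D, mode='r') for D in np.array_split(A, domains)]
    while len(Rs) > 1:  # merge R factors pairwise: the tree in the figure
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(3)
A = rng.standard_normal((4000, 50))                 # tall and skinny
R_tree = tsqr_r(A)
R_flat = np.linalg.qr(A, mode='r')
# R is unique only up to the sign of each row, so compare magnitudes
print(np.allclose(np.abs(R_tree), np.abs(R_flat)))  # True
```

Each merge works on small nb x nb triangles instead of the full columns, which is what drives the communication volume down to the lower bound.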
Communication Avoiding QR Example
Quad-socket, quad-core machine (16 cores total): Intel Xeon EMT64 at ?? GHz. Theoretical peak is ~?? Gflops with 16 cores; matrix size ?? by ??.