HiCMA: Hierarchical Computations on Manycore Architectures
Hatem Ltaief
Extreme Computing Research Center
King Abdullah University of Science and Technology
Thuwal, Saudi Arabia
NVIDIA GTC - San Jose
April 5th, 2016
Outline
Motivations
QR-based Dynamically Weighted Halley for SVD
Level 3 BLAS
H-Matrices
HiCMA in a nutshell
Students/Collaborators/Support
Academic:
- Extreme Computing Research Center @ KAUST: W. Boukaram, A. Charara, G. Chavez, D. Keyes, D. Sukkari and G. Turkiyyah
- Tokyo Institute of Technology: R. Yokota
- Institut Polytechnique de Bordeaux - INRIA Bordeaux: M. Faverge
- Innovative Computing Laboratory - UTK
Industry:
Hardware/Software Trends
- Flops are free
- On/off-chip network bandwidth is limited
- Increasing gap between flops and bandwidth
- Data movement is the most energy-consuming operation
- Synchronization-reducing
- Communication-reducing
Going hierarchical all the way down the software stack
- Recursive formulation (increases data locality)
- Old concept!
- Tree structure (depth-first vs. breadth-first tree traversal)
- Reduce vertical/horizontal data motion
- Without compromising concurrency
- Trade-off between data reuse and parallelism
Standard SVD solver
One stage reduction:
Figure: Computational stages of the standard SVD algorithm: (a) bidiagonal reduction, (b) bidiagonal SVD solver and (c) back transformation.
Two-stage SVD solver
Two-stage reduction:
Figure: Reduction of a general dense matrix to bidiagonal form using a two-stage approach.
QDWH-SVD[1,2] solver
Three computational stages:
- Polar decomposition A = U_p H: iterative procedure using the matrix-inversion-free formulation based on QR/Cholesky factorization
- Symmetric eigensolver H = V Σ V^T to calculate the singular values and the right singular vectors
- Matrix-matrix multiplication U = U_p V to calculate the left singular vectors
[1] Y. Nakatsukasa and N. J. Higham, Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD, SISC, 35 (2013), pp. A1325-A1349.
[2] D. Sukkari, H. Ltaief and D. Keyes, A High Performance QDWH-SVD Solver Using Hardware Accelerators, submitted to Trans. on Math. Soft., 2015.
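The way the three stages combine into an SVD can be written out in a few lines. This is a short derivation from the definitions on the slide, assuming A is a real square matrix (the slide does not state the shape); the positive semidefiniteness of H is the standard property of the polar factor:

```latex
\begin{aligned}
A &= U_p H, & U_p^{\top} U_p &= I,\quad H = H^{\top} \succeq 0 && \text{(polar decomposition)}\\
H &= V \Sigma V^{\top}, & V^{\top} V &= I,\quad \Sigma = \operatorname{diag}(\sigma_1,\dots,\sigma_n),\ \sigma_i \ge 0 && \text{(symmetric eigensolver)}\\
A &= U_p V \Sigma V^{\top} = U \Sigma V^{\top}, & U &= U_p V,\quad U^{\top} U = I && \text{(matrix-matrix multiplication)}
\end{aligned}
```

Because H is symmetric positive semidefinite, its eigenvalues are nonnegative and coincide with the singular values of A, and U is orthogonal as a product of two orthogonal matrices.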
Divide-and-Conquer
Figure: The recursive QDWH-SVD algorithm. The matrix A_{i,j} corresponds to the submatrix indexed j at the i-th level of recursion.
Performance results
Figure: Performance comparisons of MAGMA-QDWH-SVD (GPU) against Intel MKL (CPU): (a) ill-conditioned matrix (MKL-DGESVD, MKL-DGESDD and MAGMA-QDWH-SVD; 2.3x annotated on the plot), (b) well-conditioned matrix (MKL-DGESVD and MAGMA-QDWH-SVD; 3.5x annotated on the plot). Axes: log time (s) vs. matrix size, from 1024 to 15360.
Performance results
Figure: Performance comparisons against existing MAGMA SVD solvers (GPU): (a) ill-conditioned matrix (MAGMA-DGESVD, MAGMA-DGESDD and MAGMA-QDWH-SVD; 18% annotated on the plot), (b) well-conditioned matrix (MAGMA-DGESVD and MAGMA-QDWH-SVD; 2.1x annotated on the plot). Axes: log time (s) vs. matrix size, from 1024 to 15360.
Recursive formulation
- Usually used for Level 2 BLAS algorithms (e.g., panel factorization)
- Increase data locality
- Run at cache-level speed
- Again, not new, and the literature is quite rich
- And it does pay off for Level 3 BLAS too!
Triangular matrix-matrix multiplication (TRMM)
Figure: Illustrating a Right-Lower-NonTranspose-NonUnit recursive TRMM, splitting along the vertical direction (B is split into column blocks Bl and Br of widths N1 and N2). Operations are performed according to their numbering: 1-RecTRMM, 2-GEMM, 3-RecTRMM.
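The numbered steps in the figure can be sketched in plain Python. This is a minimal CPU illustration of the recursion for the in-place product B := B * A with A lower triangular, not the KBLAS GPU implementation; the function names, the `base` cutoff and the block names in the comments (A11, A21, A22) are assumptions made for this example.

```python
def naive_right_lower_trmm(B, A):
    """Reference result: return B * A, reading only the lower triangle of A."""
    m, n = len(B), len(A)
    return [[sum(B[i][k] * A[k][j] for k in range(j, n)) for j in range(n)]
            for i in range(m)]


def rec_trmm(B, A, lo=0, hi=None, base=2):
    """In place: B[:, lo:hi] := B[:, lo:hi] * A[lo:hi, lo:hi] (A lower triangular)."""
    if hi is None:
        hi = len(A)
    m = len(B)
    if hi - lo <= base:
        # Leaf: small triangular multiply. Columns are updated left to right,
        # so every B[i][k] read (k >= j) still holds its original value.
        for i in range(m):
            for j in range(lo, hi):
                B[i][j] = sum(B[i][k] * A[k][j] for k in range(j, hi))
        return
    mid = (lo + hi) // 2
    # Step 1 (RecTRMM): Bl := Bl * A11, the top-left triangle.
    rec_trmm(B, A, lo, mid, base)
    # Step 2 (GEMM): Bl += Br * A21, the rectangular block below A11.
    for i in range(m):
        for j in range(lo, mid):
            B[i][j] += sum(B[i][k] * A[k][j] for k in range(mid, hi))
    # Step 3 (RecTRMM): Br := Br * A22, the bottom-right triangle.
    rec_trmm(B, A, mid, hi, base)


if __name__ == "__main__":
    A = [[i + j + 1 if j <= i else 0 for j in range(5)] for i in range(5)]
    B = [[3 * i - j + 2 for j in range(5)] for i in range(3)]
    reference = naive_right_lower_trmm(B, A)
    rec_trmm(B, A)
    print(B == reference)
```

The order matters for the in-place update: Br must still hold its original values when the GEMM in step 2 reads it, which is why the second recursive call comes last. Most of the work ends up in the GEMM step, which is the motivation for the recursive formulation.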
Triangular matrix-matrix multiplication
Figure: A hypothetical tree representing the operations executed by the recursive algorithm, with GEMM operations at the inner nodes and TRMM operations at the leaves. Operations are to be executed by traversing the tree in depth-first order.
Performance results on NVIDIA GPUs
Figure: Performance comparisons of KBLAS DTRMM against that of in-place (IP) and out-of-place (OOP) cuBLAS DTRMM (Integration to CUDA 8.0). Axes: performance (GFlop/s), from 0 to 1200, vs. matrix dimension (x1024), from 1 to 14. Series: Theo-Peak, DGEMM, cuBLAS_DTRMM (OOP), KBLAS_DTRMM (IP), cuBLAS_DTRMM (IP).
Performance results on higher DLA computations using GPUs
Figure: Performance speedup of matrix inversion in the MAGMA library (DPOTRI) using KBLAS DTRMM vs. using cuBLAS DTRMM. Axes: performance (GFlop/s), from 0 to 1600, vs. matrix dimension (x1024), from 1 to 14. Series: Theo-Peak, DGEMM, DPOTRI + KBLAS_TRMM, DPOTRI + cuBLAS_TRMM.
Low Rank Approximation using H-Matrices
- Introduced by E. Tyrtyshnikov and revisited later by W. Hackbusch [1,2]
- R = U X V^T
[1] W. Hackbusch, A Sparse Matrix Arithmetic based on H-Matrices (Part I), Computing, 62(2), pp. 89-108, 1999.
[2] W. Hackbusch and B. Khoromskij, A Sparse H-Matrix Arithmetic (Part II): Application to multi-dimensional problems, Computing, 64(1), pp. 21-47, 2000.
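A block stored in the factored form R = U X V^T can be applied to a vector without ever forming R. The sketch below illustrates this in plain Python, assuming U is m-by-k, X is k-by-k and V is n-by-k (the slide does not give the shapes); all function names are hypothetical.

```python
def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def matvec(M, x):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def matmul(A, B):
    """Dense matrix-matrix product, used here only to build the reference R."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def low_rank_matvec(U, X, V, x):
    """y = U (X (V^T x)): about 2k(m + n + k) flops instead of 2mn for dense R."""
    t = matvec(transpose(V), x)  # k entries
    t = matvec(X, t)             # still k entries
    return matvec(U, t)          # m entries


if __name__ == "__main__":
    U = [[1, 2], [3, 4], [5, 6]]             # 3 x 2
    X = [[2, 0], [1, 3]]                     # 2 x 2
    V = [[1, 0], [0, 1], [1, 1], [2, -1]]    # 4 x 2
    x = [1, 2, 3, 4]
    R = matmul(matmul(U, X), transpose(V))   # dense 3 x 4 block, for comparison
    print(low_rank_matvec(U, X, V, x) == matvec(R, x))
```

The saving comes from evaluating the product right to left, so that every intermediate result has only k entries.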
Nice Properties of H-Matrices
- Memory footprint saving: from O(n^2) to O(k n log(n))
- Near-linear (log-linear) arithmetic complexity:
  - MVM: from O(n^2) to O(k n log(n))
  - MMM and A^{-1}: from O(n^3) to O(k^2 n log^2(n))
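To give a feel for the orders of magnitude, the leading terms can be evaluated for one problem size. The values n = 2^20 and k = 8, the base-2 logarithm, double-precision storage and the omission of all constant factors are assumptions chosen for illustration, not figures from the talk.

```python
import math


def dense_entries(n):
    """Leading-order storage of a dense n-by-n matrix, in matrix entries."""
    return n * n


def hmatrix_entries(n, k):
    """Leading-order storage k * n * log2(n), with constants ignored."""
    return k * n * math.log2(n)


n, k = 2**20, 8
dense_bytes = 8 * dense_entries(n)       # 8 bytes per double
h_bytes = 8 * hmatrix_entries(n, k)

print(f"dense:    {dense_bytes / 2**40:.1f} TiB")
print(f"H-matrix: {h_bytes / 2**30:.2f} GiB (leading term only)")
print(f"ratio:    {dense_bytes / h_bytes:.0f}x")
```

Under these assumptions the dense matrix needs 8 TiB while the leading term of the H-matrix estimate is 1.25 GiB; real footprints depend on the hidden constants and on the block partition.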
Examples of H-Matrices
Figure: Example of H-matrix approximation for BEM.
Examples of H-Matrices
Figure: Examples of H-matrix approximation for a covariance matrix.
Tree Structure
H-MVM
Implementation Details
Dense MVM:
- Calculate the products V^T x for the leaves in the tree (batch MVM operation)
Upsweep:
- Sweep up the column basis tree, calculating the products of the inner nodes from the products of their children (block SpMV, BSR)
Mult:
- It is also a block SpMV (BSR) per level of the tree
Downsweep:
- Transpose operation of the upsweep phase (block SpMV, BSR)
Pipelining:
- Overlapping computations is possible within the Dense MVM / Upsweep / Mult phases. The downsweep, however, requires a sync point!
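The first two phases can be illustrated on a tiny two-leaf tree. The sketch below assumes a nested column basis in which the parent basis is built from the leaf bases through small transfer matrices F_c, so that the parent product is the sum over children of F_c^T times the child product; the transfer matrices, the function names and the serial Python loops are assumptions for illustration, whereas the talk describes batched and block-sparse GPU kernels.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def leaf_products(V_leaves, x_parts):
    """Dense MVM phase: xhat_leaf = V_leaf^T x_leaf for every leaf (batchable)."""
    return [matvec(transpose(V), x) for V, x in zip(V_leaves, x_parts)]


def upsweep_node(F_children, xhat_children):
    """Upsweep step: xhat_parent = sum over children c of F_c^T xhat_c."""
    k_parent = len(F_children[0][0])
    out = [0] * k_parent
    for F, xhat in zip(F_children, xhat_children):
        contribution = matvec(transpose(F), xhat)
        out = [a + b for a, b in zip(out, contribution)]
    return out


def explicit_parent_basis(V_leaves, F_children):
    """Stack V_c F_c over the children; needed only to check the upsweep."""
    rows = []
    for V, F in zip(V_leaves, F_children):
        rows.extend(matmul(V, F))
    return rows


if __name__ == "__main__":
    V1, V2 = [[1, 2], [0, 1]], [[2, 0], [1, 1]]   # leaf bases
    F1, F2 = [[1, 0], [1, 1]], [[0, 1], [2, 1]]   # hypothetical transfer matrices
    x = [1, 2, 3, 4]
    xhat = leaf_products([V1, V2], [x[:2], x[2:]])
    V_parent = explicit_parent_basis([V1, V2], [F1, F2])
    print(upsweep_node([F1, F2], xhat) == matvec(transpose(V_parent), x))
```

The parent product is obtained from k-sized vectors only, without touching the explicit parent basis, which is what makes the sweep cheap at the upper levels of the tree.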
Performance Results
Figure: Performance (GB/s) of H-MVM using k = 8 and n_min = 32.
Advanced Hierarchical Matrix Operations
Context:
- Very small sizes!
- Batch operation executions at each level of the tree
- (Usually) fixed sizes
- Recursive formulation, stressing register usage
- State-of-the-art implementations not well optimized for this scope or not supported
- NVIDIA K40 GPU (single GPU)
Advanced Hierarchical Matrix Operations
H-Matrix compression:
- Batch QR factorizations (square and tall-and-skinny)
- Batch SVD
H-Matrix computations:
- Level 3 BLAS: SYRK, TRSM
- Factorizations: POTRF
- Solves: POTRS, POSV
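As a reminder of what the factorization and solve routines compute, here is a minimal serial sketch of POTRF, POTRS and POSV applied over a batch of small symmetric positive definite systems. It uses plain Python lists and is only meant to show the semantics; it is not the KBLAS, MAGMA or cuBLAS interface, and the function names simply borrow the LAPACK routine names.

```python
import math


def potrf(A):
    """Lower Cholesky factor L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][p] ** 2 for p in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def potrs(L, b):
    """Solve L L^T x = b by forward substitution, then backward substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][p] * y[p] for p in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[p][i] * x[p] for p in range(i + 1, n))) / L[i][i]
    return x


def posv_batch(As, bs):
    """POSV over a batch: factor (POTRF), then solve (POTRS), one system at a time."""
    return [potrs(potrf(A), b) for A, b in zip(As, bs)]


if __name__ == "__main__":
    As = [[[4.0, 2.0], [2.0, 3.0]], [[9.0, 3.0], [3.0, 5.0]]]
    bs = [[2.0, 1.0], [12.0, 8.0]]
    for x in posv_batch(As, bs):
        print(x)
```

On a GPU the systems in a batch are independent and of (usually) fixed size, so the loop in `posv_batch` is the part that maps onto a batched kernel.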
Performance Results (preliminary)
Figure: Batch QR (square matrix). Axes: performance (GFlop/s, log2 scale), from 0.125 to 128, vs. matrix size, from 8 to 256. Series: KBLAS_10000, KBLAS_1000, CUBLAS_10000, CUBLAS_1000, MAGMA_10000, MAGMA_1000.
Performance Results (preliminary)
Figure: DSYRK_Batch. Axes: performance (GFlop/s, log2 scale), from 1 to 512, vs. matrix size, from 8 to 512. Series: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.
Performance Results (preliminary)
Figure: DTRSM_Batch. Axes: performance (GFlop/s, log2 scale), from 0.125 to 512, vs. matrix size, from 8 to 512. Series: KBLAS_10240, KBLAS_1024, MAGMA_IP_10240, MAGMA_IP_1024, CUBLAS_10240, CUBLAS_1024.
Performance Results (preliminary)
Figure: DPOTRF_Batch. Axes: performance (GFlop/s, log2 scale), from 1 to 256, vs. matrix size, from 8 to 256. Series: KBLAS_1024, KBLAS_10240, MAGMA_1024, MAGMA_10240.
Performance Results (preliminary)
Figure: DPOTRS_Batch. Axes: performance (GFlop/s, log2 scale), from 0.25 to 512, vs. matrix size, from 8 to 512. Series: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.
Performance Results (preliminary)
Figure: DPOSV_Batch. Axes: performance (GFlop/s, log2 scale), from 0.25 to 512, vs. matrix size, from 8 to 512. Series: KBLAS_10240, KBLAS_1024, MAGMA_10240, MAGMA_1024.
HiCMA’s Scope
The hierarchical computations for manycore architectures library aims to:
- Develop high performance numerical solvers: dense / data-sparse (H)
- Increase data reuse thanks to a recursive/hierarchical formulation
- Exploit high level of concurrency
- Perform asynchronous execution
- Target various architectures:
  - Shared/distributed-memory
  - Accelerators/co-processors
  - ARM
HiCMA Software Stack
HiCMA’s Backbone
HiCMA’s Horsepower
HiCMA’s MoC