Performance of PETSc GPU Implementation with Sparse
Matrix Storage Schemes
Pramod Kumbhar
August 19, 2011
MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2011
Abstract
PETSc is a scalable solver library developed at Argonne National Laboratory (ANL). It is widely used for solving systems of equations arising from the discretisation of partial differential equations (PDEs). GPU support has recently been added to PETSc to exploit the performance of GPUs. This support is quite new and currently only available in the PETSc development release. The goal of this MSc project is to evaluate the performance of the current GPU implementation, especially the iterative solvers, on the HECToR GPU cluster. In the current implementation, a new sub-class of matrix was added which stores the matrix in Compressed Sparse Row (CSR) format. We have extended the current PETSc GPU implementation to improve its performance using different sparse matrix storage schemes like ELL, Diagonal and Hybrid.
For structured matrices, the current GPU implementation shows a 4x speedup compared to an Intel Xeon quad-core CPU. For multi-GPU applications, the speedup starts decreasing due to high communication costs on the HECToR GPU cluster. Our implementation with the new storage schemes shows a 50% performance improvement on sparse matrix-vector operations. For structured matrices, the new implementation shows a 7x speedup and significantly improves the performance of vector operations on the GPU.
Contents
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Related Work
1.4 Contributions and Outline
1.5 Change in Project Plan
Chapter 2 Background
2.1 GPGPU
2.2 GPU Programming Models
2.2.1 CUDA
2.3 CUSP and Thrust
2.3.1 Thrust
2.3.2 CUSP
2.4 PDEs: Source of Sparse Matrices
2.5 Iterative Methods for Sparse Linear Systems
Chapter 3 PETSc GPU Implementation
3.1 PETSc
3.1.1 PETSc Kernels
3.1.2 PETSc Components
3.1.3 PETSc Object Design
3.2 PETSc GPU Implementation
3.2.1 Sequential Implementation
3.2.2 Parallel Implementation
3.3 Applications Running with PETSc GPU Support
Chapter 4 Sparse Matrices
4.1 Sparse Matrix Representation
4.2 Sparse Matrix Storage Schemes
4.2.1 Coordinate List
4.2.2 Compressed Sparse Row
4.2.3 Diagonal
4.2.4 ELL or Padded ITPACK
4.2.5 Hybrid
4.2.6 Jagged Diagonal Storage (JDS)
4.2.7 Skyline or Variable Band
4.3 Performance of Storage Schemes
Chapter 5 Implementation of Sparse Storage Support in PETSc
5.1 Design Approach
5.2 Implementation Details
5.2.1 New Matrix Types for GPU
5.2.2 PETSc Mat Object
5.2.3 New User-Level API
5.2.4 PETSc Mat Objects on GPU
5.2.5 Conversion of PETSc MatAIJ to CUSP CSR
5.2.6 Conversion of PETSc MatAIJ to CUSP DIA/ELL/HYB/COO
5.2.7 Matrix-Vector Multiplication for Different Sparse Formats
5.2.8 Other Important Notes
5.3 Sample Use Case and Validation
Chapter 6 Wrapper Codes and Benchmarks
6.1 Testing Codes
6.2 Matrix Market to PETSc Binary Format
6.3 Benchmarking Codes
6.4 Benchmarking Approach
Chapter 7 Performance Analysis
7.1 Benchmarking System
7.2 Single GPU Performance
7.2.1 Structured Matrices
7.2.2 Semi-Structured Matrices
7.2.3 Unstructured Matrices
7.3 Multi-GPU Performance
7.4 Comparing Multi-GPU Performance with HECToR
7.5 CUSP Matrix Conversion Cost
Chapter 8 Discussion
8.1 Challenges for Multi-GPU Parallelisation
8.1.1 CPU-GPU and GPU-GDRAM Memory Transfer
8.1.2 GPU-GPU Communication
8.2 Future Work
Chapter 9 Conclusion
Bibliography
List of Figures
Figure 1.1: Main stages involved in a single iteration of the Fluidity framework [7]
Figure 1.2: Profiling results of Burgers' equation 1-D model problem with mesh spacing of 0.002 and domain [-10, 10]
Figure 3.1: PETSc Kernels (implementation: petsc/src/sys)
Figure 3.2: PETSc Library Organisation [28]
Figure 3.4: PETSc Objects and Application Level Interface
Figure 3.5: VecAXPY implementation in PETSc using CUSP & CuBLAS [11]
Figure 3.6: Parallel Matrix with on-diagonal and off-diagonal elements for two MPI processes
Figure 3.7: Parallel Matrix-Vector multiplication in PETSc GPU implementation [11]
Figure 4.1: MxN: 4,690,002x4,690,002 NNZ: 20,316,253 id: 1398
Figure 4.2: MxN: 1,391,349x1,391,349 NNZ: 64,531,701 id: 2541
Figure 4.3: MxN: 3,542,400x3,542,400 NNZ: 96,845,792 id: 1902
Figure 4.4: MxN: 4,802,000x4,802,000 NNZ: 85,362,744 id: 2496
Figure 4.5: MxN: 16,614x16,614 NNZ: 1,096,948 id: 409
Figure 4.6: MxN: 999,999x999,999 NNZ: 4,995,991 id: 1883
Figure 4.7: MxN: 1,489,752x1,489,752 NNZ: 10,319,760 id: 2267
Figure 4.8: MxN: 1,971,281x1,971,281 NNZ: 5,533,214 id: 374
Figure 4.9: MxN: 161,070x161,070 NNZ: 8,185,136 id: 2336
Figure 4.10: MxN: 120,216x120,216 NNZ: 3,121,160 id: 2228
Figure 5.1: PETSc Objects creation in current GPU implementation
Figure 5.2: PETSc Object creation and new sparse matrix support in new GPU implementation
Figure 5.3: New User level API registration with Mat class (petsc-src/mat/interface/matreg.c)
Figure 5.4: Modified Mat_SeqAIJCUSP class with ELL, DIA, HYB and COO storage support using the CUSP library
Figure 5.5: Converting PETSc AIJ Matrix to CUSP CSR matrix
Figure 5.6: Transparent conversion between different sparse formats with CUSP
Figure 5.7: Converting PETSc MatAIJ to CUSP ELL matrix (algorithmic details)
Figure 5.8: Converting PETSc MatAIJ to CUSP ELL format (algorithmic implementation)
Figure 5.9: Sparse Matrix-Vector operation support for different matrix formats using CUSP
Figure 5.10: Simple example of KSP with the use of new sparse matrix format
Figure 5.11: Convergence of KSP CG solver with different sparse matrix formats on CPU & GPUs for a simple example of a 2-D Laplacian from PETSc
Figure 6.1: Converting Matrix Market format to PETSc binary format (algorithmic implementation)
Figure 7.1: HECToR GPGPU Testbed System consisting of NVIDIA and AMD GPUs connected by an Infiniband network
Figure 7.2: Tesla C2050/C2070 Specification [58]
Figure 7.3: Total execution time with different sparse matrix formats on GPU (using GMRES method)
Figure 7.4: Performance with different sparse matrix formats on GPU (using GMRES method)
Figure 7.5: Execution time of SpMV with different sparse matrix formats
Figure 7.6: Execution time of SpMV+VecMDot+VecMAXPY with different sparse matrix formats on GPU
Figure 7.7: Performance on CPU with CSR, GPU with CSR and GPU with DIA
Figure 7.8: Achieved speedup compared to Intel Xeon quad-core
Figure 7.9: Execution time of different sparse matrix formats for semi-structured matrix on GPU
Figure 7.10: Performance for Semi-Structured Matrices
Figure 7.11: Sparse Matrix-Vector execution time for different sparse matrix formats
Figure 7.12: Unstructured matrix of size 503,712x503,712 with 36,816,170 nonzero elements
Figure 7.13: Total execution time on GPU with CSR and HYB format
Figure 7.14: Performance of CSR and HYB on the GPU
Figure 7.15: Execution time of SpMV on CPU (CSR), GPU (CSR) and GPU (HYB)
Figure 7.16: Performance on HECToR GPU cluster with CSR and DIA matrix format
Figure 7.17: Performance with CSR and DIA matrix formats with different numbers of GPUs
Figure 7.18: Execution time for SpMV using CSR and DIA matrix formats on HECToR GPU
Figure 7.19: Performance comparison between HECToR GPU systems
Figure 8.1: Overall system architecture considering bandwidth of different sub-systems (pre Sandy-Bridge architecture)
Figure 8.2: HECToR GPU: Infiniband network with switched fibre topology (schematic layout)
Figure 8.3: Speedup using the default Block Jacobi preconditioner on CPU and GPU with CSR, ELL
Figure 8.4: Diagonal matrix with few independent nonzero numbers
Figure 8.5: User-implemented SpMV in PETSc using MatShell (design)
Acknowledgements
I am very grateful to Dr. Michele Weiland and Dr. Chris Maynard for their advice
and supervision during this dissertation. I would also like to thank Dr. Lawrence
Mitchell for providing valuable advice during project discussions. I am also indebted
to my friends and family for their continued support during my study.
Chapter 1
Introduction
1.1 Background
PETSc (Portable, Extensible Toolkit for Scientific Computation) is an open source, scalable solver library developed over the past twenty years at Argonne National Laboratory (ANL). It is used for solving systems of equations arising from the discretisation of partial differential equations (PDEs). Developing parallel, nontrivial PDE solvers that scale over thousands of processors on high-end computing systems is still a difficult task and takes a lot of time. PETSc is designed to ease this task and reduce the development time. It provides parallel algorithms, debugging support and a low-overhead profiling interface that help in the development of large and complex applications. PETSc has been used to solve large linear systems with 500B unknowns on supercomputers like Jaguar and Jugene with more than 200K processors [1]. It is used in the modelling of many scientific applications in the areas of geosciences, computational fluid dynamics, weather modelling, seismology, surface water flow, polymer injection modelling etc. We will discuss one such application of PETSc, called Fluidity, that we have analysed.
Fluidity is an open source, general purpose computational fluid dynamics framework [2] developed by the Applied Modelling and Computational Group (AMCG) at Imperial College London. This framework is used in many scientific simulations in the areas of fluid dynamics, ocean modelling, atmospheric modelling etc. It solves the Navier-Stokes equations [3] on arbitrary unstructured, adaptive meshes using finite element methods. While solving this system, we impose a grid on the problem domain to calculate the numerical solution of the PDEs. The accuracy and computational cost of the solution depend on the grid spacing. To compute accurate solutions, one has to use a finer grid; but this increases the computational cost. Hence the Adaptive Mesh Refinement (AMR) technique is used to reduce the computational costs. AMR uses a coarse grid at the start of the simulation and, as the solution progresses, it identifies areas of interest (i.e. parts of the grid which exhibit a large change in the solution) where the grid needs to be refined. These methods are discussed in more detail in [4], [5], [6].
Figure 1.1 shows the main stages involved in simulations using the Fluidity
framework:
During simulation, a new mesh may need to be generated by using AMR techniques
to maintain the accuracy of the solution. During the assembly stage, a system of
simultaneous equations is assembled using the finite element mesh. In the solver
stage, the system of equations assembled in the assembly stage is solved using
iterative methods. Fluidity uses iterative solvers from PETSc to solve large sparse systems. The sparse matrices that arise are positive definite and non-symmetric, and hence the Generalised Minimal Residual (GMRES) algorithm is normally used [7]. The update stage involves updating the solution variables, calculating new timesteps and estimating the error. Finally, the current solution can be written to disk. All these stages are discussed in more detail in [7] and [8]. Currently, Fluidity uses various
libraries like MPI (Message Passing Interface), ParMETIS and PETSc to support
parallelisation [9].
1.2 Motivation
Despite various parallelisation and optimisation techniques, simulations of complex phenomena like tidal modelling or tsunami simulation take from hours to a few days on modern supercomputers. For Fluidity, the main computationally expensive stages are the assembly phase and the solver phase. The initial idea of the project was to improve the performance of the Fluidity framework using directive-based GPU programming models like HMPP (Hybrid Multicore Parallel Programming) and PGI Accelerators. For initial profiling and performance optimisation, we decided to use a 1-D non-linear model problem of Burgers' equation [10]. The Burgers equation is a fundamental PDE which occurs in various applications of fluid dynamics and takes the form

$$\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2}$$

where $u$ is the velocity and $\nu$ is the viscosity coefficient. This is basically the 1-D Navier-Stokes equation (discussed in Section 2.4) without the pressure and volume force terms. Figure 1.2 shows the profiling result of this problem on a single Intel Xeon processor.
Figure 1.1: Main stages involved in a single iteration of the Fluidity framework [7] (Mesh Generation → Assembly Phase → Solver Phase → Update Solution → Output Solution, repeated)
The profiling results show that most of the execution time (84%) is spent in the PETSc solver library. For this model problem, when we increase the resolution (i.e. use a finer mesh spacing), the assembly phase time increases proportionally and the solver time increases exponentially. As this is a simple 1-D example, the assembly time is relatively small. But the condition number of the matrix is very high, and hence the solver takes a long time to converge. Hence, we decided to deviate from our original plan and
Graphics Processing Units (GPUs) are becoming more popular due to their
performance to cost ratio and potential performance gains compared to CPUs. To
improve the performance of the solver phase, we decided to use the newly implemented
GPU support in PETSc. We also identified potential performance improvements in PETSc from using different sparse matrix storage schemes, which are better suited to GPUs.
1.3 Related Work
In the last year, basic GPU support has been added to PETSc, which is currently available in the PETSc development release [11]. To the best of our knowledge, there are no published benchmarking results for this GPU implementation. Also, the current implementation only supports the CSR (Compressed Sparse Row) matrix storage format and there is no development effort to support other matrix storage schemes. Nathan Bell and Michael Garland from NVIDIA Research have published performance results [12] of sparse matrix-vector operations on GPUs using different sparse matrix storage schemes. These results show that storage schemes like DIA (Diagonal), ELL (ELLPACK) and HYB (Hybrid) are well suited for GPUs. Compared to the CSR storage scheme, the DIA and ELL formats can achieve 4-6x speedup.

Figure 1.2: Profiling results of Burgers' equation 1-D model problem with mesh spacing of 0.002 and domain [-10, 10]
There are two main goals of this MSc project. The first goal is to improve the performance of the current PETSc GPU implementation using the DIA, ELL and HYB sparse matrix storage formats. The second goal is to evaluate the performance of the PETSc GPU implementation on the HECToR GPU cluster. Specifically, we are looking at the performance of Krylov subspace solvers for solving large sparse linear systems.
1.4 Contributions and Outline
The contributions of this project report are as follows:
- To discuss the performance of the CUSP and Thrust libraries with large sparse matrices from real-world applications;
- To present an initial implementation to support different sparse matrix storage schemes in PETSc using CUSP and Thrust;
- To evaluate the performance of the PETSc GPU implementation for solving large sparse linear systems;
- To evaluate the performance benefits of the new implementation on single- and multi-GPU applications;
- To compare the overall performance of the PETSc GPU implementation on the HECToR GPU cluster and the HECToR (Phase 2b) system.
Chapter 2 presents background information, which includes GPGPU, the CUDA programming model, the CUSP & Thrust libraries, PDEs and iterative methods for sparse linear algebra. Chapter 3 discusses the design of the PETSc library and the implementation of GPU support in PETSc. Chapter 4 presents different sparse matrix storage schemes suitable for vector processors which are available in the CUSP library; we have also evaluated the performance of CUSP with large sparse matrices from real-world applications. Chapter 5 presents our initial implementation to support sparse matrix storage schemes in PETSc using the CUSP library. Chapter 6 discusses the wrapper codes developed for matrix conversion, performance analysis and benchmarking. Chapter 7 presents performance results of the PETSc GPU implementation with different sparse matrix formats on the HECToR GPU cluster as well as the main HECToR (Phase 2b) system. Chapter 8 discusses the challenges of multi-GPU parallelisation encountered during the performance analysis and outlines the future work in this area. Chapter 9 presents the conclusion of this project and summarises the results.
1.5 Change in Project Plan
During the project preparation phase (Semester II), the idea of the MSc project was to extend the HMPP programming model for the C++ language. Specifically, our aim was to implement a generic meta-programming framework for HMPP using C++ templates. HMPP is now an open standard developed by CAPS enterprise and Pathscale Inc., and it provides a directive-based GPU programming model similar to PGI Accelerators. For this MSc project, an external organisation was expected to provide the ENZO compiler suite with HMPP C++ support by May 2011. But this compiler was not available until the first week of June 2011 due to the complexity of the C++ compiler implementation, so we decided to change our project plan. With the great support of Dr. Michele Weiland and Dr. Chris Maynard, we were quickly able to work out an alternative project plan and
decided to work on the Fluidity project, especially the PETSc GPU implementation. This change affected the planned schedule of the project, but with the continuous advice and support of my supervisors, I was able to complete this project successfully.
Chapter 2
Background
2.1 GPGPU
GPUs (Graphics Processing Units) have a distinct architecture specifically designed for high floating-point throughput and fine-grained concurrency. In the past, GPUs were mostly used for improving the performance of graphics operations like pixel shading, texture mapping and rendering. But in the last few years, GPUs have been effectively used to speed up the performance of non-graphics applications from different areas of science like computational fluid dynamics, molecular dynamics, medical imaging, climate modelling etc. The term GPGPU (General Purpose Computing on GPUs) normally refers to the use of GPUs for accelerating non-graphics applications traditionally executed on CPUs.
The primary reason for the popularity of GPUs in the area of scientific computing is their performance-to-cost ratio. For example, the NVIDIA Tesla C2070 GPU has 448 cores capable of achieving a theoretical peak performance of 515 GFlops, which is 50 times more than the Intel Xeon (E5620) quad-core processor. But if we compare the prices, the Tesla GPU is only about five times more expensive than the Intel Xeon processor. Various applications ported to GPUs show significant performance benefits (10-50x speedup) compared to CPUs. More details about these applications can be found in [13].
2.2 GPU Programming Models
Programming models like ARB, OpenGL, Direct3D and Cg were commonly used for developing graphics applications, but these programming models are not a good fit for developing HPC applications. Research in GPU technology helped the understanding of how to use GPUs for general purpose computing. Various programming models like CUDA, OpenCL, AMD Stream and PGI directives are available for programming these special purpose devices, and CUDA is the most popular programming model among them.
2.2.1 CUDA
NVIDIA introduced the CUDA programming model, which enables a large developer community to exploit GPUs for general purpose computing. The programming
interfaces are exposed through the C, C++ and Fortran languages, and through third-party wrappers for other languages like Python, Ruby etc. A CUDA application consists of code that runs on the CPU as well as the GPU. The compute-intensive functions of the program that execute on the GPU are called kernels. The nvcc compiler translates this kernel code to PTX assembly code, which is executed on the GPU. More details about the CUDA programming model can be found in [14].
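As a minimal sketch (our own illustration, not code from the dissertation; the kernel and array names are invented), a CUDA kernel and its launch look as follows:

#include <cuda_runtime.h>

/* kernel: each thread adds one pair of elements */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)                                      /* guard against overrun */
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    /* ... fill a and b, e.g. with cudaMemcpy from host arrays ... */
    int threads = 256;                              /* threads per block */
    int blocks = (n + threads - 1) / threads;       /* blocks to cover n */
    vecAdd<<<blocks, threads>>>(a, b, c, n);        /* kernel launch */
    cudaDeviceSynchronize();                        /* wait for completion */
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

The host code only configures and launches the kernel; the GPU then executes one lightweight thread per array element.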
CUDA Architecture: We will discuss the CUDA architecture considering a Tesla 10-series device. A Tesla C1060 GPU device consists of 30 multithreaded streaming multiprocessors (SMs) and each SM consists of 8 streaming processors (SPs), two special function units, on-chip shared memory and an instruction unit. Figure 2.1 shows the organisation of SMs, SPs, registers and shared memory in a Tesla device. The SM creates, manages and schedules groups of 32 threads in batches called warps. A single SM has hardware resources that can hold the state of three warps at a time [15]. For the C1060 devices there can therefore be 23,040 threads (30 SMs * 8 SPs * 32 threads * 3 warps) available for execution. Out of these, only 960 threads (30 SMs * 32 threads) can be executed concurrently at a given time. All threads within a warp execute in SIMT (Single Instruction Multiple Threads) fashion.
Figure 2.1: CUDA Architecture and Memory Hierarchy (Adapted from [15] and [56] )
There are different memory types: register, shared, local, global and caches. Each of these types has different sizes, latencies, bandwidths and performance characteristics. Each SM has on-chip registers and shared memory; these memories are small in size and have very low latency. Local and global memory are the largest in size and have very high latency: a data access from global or local memory is very costly and requires 400-500 cycles. Texture and constant memory have similar latency, but they can be automatically cached by the hardware and hence can be used effectively if the kernel exhibits temporal locality. L1 and L2 caches are introduced in the newer Fermi architecture, giving benefits similar to CPU caches. More detailed descriptions of the memory organisation and performance can be found in [16].
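To make these memory spaces concrete, the following sketch (our own illustration under the architecture described above, not code from the dissertation) shows how the CUDA qualifiers map onto the hierarchy:

__constant__ float coeff[16];          /* constant memory: cached, read-only;
                                          set from the host with cudaMemcpyToSymbol */

__global__ void scale(const float *in, float *out, int n)
{
    __shared__ float tile[256];        /* shared memory: on-chip, per thread block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];     /* global memory read: ~400-500 cycles */
        __syncthreads();
        float v = tile[threadIdx.x];   /* local scalar: held in a register */
        out[i] = v * coeff[0];         /* write back to global memory */
    }
}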
Whenever a CUDA kernel is launched on a GPU, thousands of threads are created, which are organised into a grid. The grid is a 1-D or 2-D array of thread blocks and each thread block is a 1-D, 2-D or 3-D array of threads. The thread blocks are assigned to available SMs. All threads within a thread block execute in a time-multiplexed fashion on a single SM. The grid and block dimensions largely depend on the hardware resource requirements of the executing kernel. More information on this can be found in [14].
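For example (a sketch with illustrative dimensions), a 2-D grid of 2-D thread blocks is configured with the dim3 type, and each thread recovers its coordinates from the block and thread indices:

__global__ void fill2d(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   /* column index */
    int y = blockIdx.y * blockDim.y + threadIdx.y;   /* row index */
    if (x < width && y < height)
        img[y * width + x] = 1.0f;
}

void launch(float *img, int width, int height)
{
    dim3 block(16, 16);                              /* 256 threads per 2-D block */
    dim3 grid((width + block.x - 1) / block.x,       /* enough blocks in x */
              (height + block.y - 1) / block.y);     /* enough blocks in y */
    fill2d<<<grid, block>>>(img, width, height);
}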
2.3 CUSP and Thrust
CUSP and Thrust are open source C++ template libraries developed using CUDA that provide high-level interfaces for GPU programming. We have used these libraries to implement the sparse matrix storage scheme support in PETSc.
2.3.1 Thrust
Thrust is an open source template library [17] developed on top of CUDA. The main advantage of Thrust is that it provides a high-level interface for GPU programming and enables rapid development of complex HPC applications. Another important benefit is that Thrust, being a C++ template library, supports the generic programming and Object Oriented (OO) paradigms. The three main components of Thrust are: Containers, Iterators and Algorithms.
Containers: A container can store a collection of objects. Containers are usually implemented as template objects so that they can be used with different data types; for example, common data structures like linked lists, stacks, queues, heaps and arrays are implemented as containers. In Thrust, there are two main containers: thrust::host_vector and thrust::device_vector. A host_vector and a device_vector represent an array of elements in CPU (host) and GPU (device) memory respectively. The major benefit of containers is that they handle memory management for the underlying objects. For example, whenever we create a host_vector, it automatically allocates memory on the CPU. Similarly, the device_vector container handles memory allocation and deallocation on the GPU. Whenever we assign a host_vector to a device_vector, Thrust automatically makes a
vector copy from CPU to GPU memory. So lower-level APIs like cudaMalloc, cudaMemcpy, cudaFree etc. are completely hidden from application developers.
Iterators: An Iterator is a generalisation of pointers in C and can be thought of as an object in C++ which can point to other objects. Iterators are usually used for traversing over container objects and, being similar to C pointers, support pointer arithmetic. There are different types of Thrust iterators like input, output, constant, permutation or transform iterators [17]. For example, an input iterator provides the functionality of accessing the value of a container object, but we cannot change the value of that object. It is possible to write generic algorithms by using templates parameterised by iterators.
Algorithms: Thrust implements more than sixty basic algorithms like merge sort, radix sort, inclusive scan, reduce and parallel prefix. These algorithms are implemented as templated objects so that they can work with all basic data types. With the help of iterators, the algorithmic implementation does not have to worry about the underlying object type or object access methods: algorithms do not access the container data directly, but use iterators to access the underlying data elements. For example, there is a single implementation of the radix sort for all data types; depending on the data type, an iterator provides a way to access the data elements.
The mechanism of using containers, iterators and algorithms together can be explained with the following simple example:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main()
{
    /* allocate storage for one million numbers using a host container */
    thrust::host_vector<float> vec_h(1000000);

    /* generate one million numbers on the host using iterators */
    thrust::generate(vec_h.begin(), vec_h.end(), rand);

    /* transparent copy of the host vector to a device vector */
    thrust::device_vector<float> vec_d = vec_h;

    /* use of a Thrust algorithm: passing iterators as parameters */
    thrust::sort(vec_d.begin(), vec_d.end());

    /* transparent copy from device back to host memory */
    thrust::copy(vec_d.begin(), vec_d.end(), vec_h.begin());
    return 0;
}

Figure 2.3: Simple example to sort one million float elements on the GPU using Thrust

In this example, we first create a host container to store one million float elements. We then fill this vector with random values using the thrust::generate method; the vec_h.begin() and vec_h.end() calls provide iterators pointing to the start and end of the vec_h container respectively. When we assign the host container to the device container, Thrust automatically allocates memory on the GPU using cudaMalloc() and calls cudaMemcpy() to make a host-to-device memory copy. We then use the thrust::sort method to invoke the default sorting algorithm (Merrill's radix sort [18]) on the GPU. Finally, we use the thrust::copy method to copy the vector data back from GPU to CPU memory.
2.3.2 CUSP
CUSP is also an open source C++ template library [19] developed on top of CUDA, but it specifically targets sparse linear algebra and sparse matrix computations. Similar to Thrust, this library provides a high-level programming interface and internally uses the functionality of Thrust and CUBLAS. CUSP provides the following five sparse matrix storage schemes:
- Compressed Sparse Row (CSR)
- Coordinate (COO)
- ELLPACK (ELL)
- Diagonal (DIA)
- Hybrid (HYB)
We will discuss these storage formats in detail in Section 4.2. CUSP provides an easy interface for building different sparse matrix formats and a transparent conversion between these formats. This is explained in the following example:

#include <cusp/coo_matrix.h>
#include <cusp/ell_matrix.h>
#include <cusp/gallery/poisson.h>

int main()
{
    /* sparse matrix in COO format on the host */
    cusp::coo_matrix<int, float, cusp::host_memory> coo_mat;

    /* matrix corresponding to a 2-D Poisson problem on a 15x15 mesh */
    cusp::gallery::poisson5pt(coo_mat, 15, 15);

    /* sparse matrix in ELL format on the device */
    cusp::ell_matrix<int, float, cusp::device_memory> ell_mat;

    /* performs memory allocation on the device, conversion from COO
       to ELL, and the copy of the matrix data to device memory */
    ell_mat = coo_mat;
    return 0;
}

Figure 2.4: Sparse matrix construction and transparent conversion using CUSP

In this example, we create a sparse matrix object in COO format. CUSP provides the cusp::gallery interface for generating sample matrices for a Poisson or diffusion problem on a 2-D mesh. When we assign the COO matrix object on the host to the ELL matrix object on the device, CUSP automatically allocates memory on the GPU, performs the COO-to-ELL conversion, and copies the matrix data from CPU to GPU. We discuss this mechanism further in Section 5.2.
In addition to sparse matrix storage and operations, CUSP provides the following features:
- A file I/O interface for reading and writing large sparse matrices to/from Matrix Market files.
- Krylov subspace solvers like Conjugate Gradient (CG), multi-mass Conjugate Gradient (CG-M), Biconjugate Gradient (BiCG) and Generalised Minimal Residual (GMRES) on GPUs (see the solver sketch after this list).
- Preconditioners like Algebraic Multigrid (AMG), Diagonal and Approximate Inverse (AINV).
We have used the file I/O interface for converting matrices stored in Matrix Market format (an ASCII format) to the PETSc binary format. For implementing sparse storage schemes in PETSc, we have used CUSP and Thrust extensively. We have also developed a small benchmark to measure the performance of CUSP linear solvers on GPUs.
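As an illustration, a minimal sketch of such a benchmark is shown below; the mesh size, iteration limit and tolerance are illustrative values rather than the settings used in our runs, and the sketch assumes the CUSP solver and monitor interfaces described above:

    #include <cusp/csr_matrix.h>
    #include <cusp/array1d.h>
    #include <cusp/gallery/poisson.h>
    #include <cusp/krylov/cg.h>
    #include <cusp/monitor.h>

    int main()
    {
        /* assemble a 2-D Poisson test matrix directly in device memory */
        cusp::csr_matrix<int, float, cusp::device_memory> A;
        cusp::gallery::poisson5pt(A, 512, 512);

        /* right-hand side b = 1 and initial guess x = 0 */
        cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0.0f);
        cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1.0f);

        /* stop after 1000 iterations or at a relative residual of 1e-6 */
        cusp::default_monitor<float> monitor(b, 1000, 1e-6);

        /* run the Conjugate-Gradient solver entirely on the GPU */
        cusp::krylov::cg(A, x, b, monitor);

        return 0;
    }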
2.4 PDEs: Source of Sparse Matrices

Partial Differential Equations (PDEs) provide a mathematical model for many scientific and engineering applications. These equations relate partial derivatives of physical quantities like force, velocity, momentum and temperature. In fluid dynamics, the Navier-Stokes equations [20] are a set of nonlinear PDEs, which can be used to describe the flow of incompressible fluids as

\[ \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla P + \nu \nabla^2 \mathbf{u}, \qquad \nabla \cdot \mathbf{u} = 0 \]

where $\mathbf{u}$ is the flow velocity, $\nu$ is the viscosity, $P$ is the pressure, $\rho$ is the density of the fluid and $\nabla$ is the vector differential operator. Most commonly, we solve these PDEs by approximating them with equations with a finite number of unknowns. This process of approximation is called discretisation. There are two commonly used techniques, Finite Difference Methods (FDM) and Finite Element Methods (FEM), explained in [21].
We will illustrate the process of discretisation by using the common example of a PDE that appears in many engineering areas, i.e. Poisson's equation:

\[ -\nabla^2 u = -\left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) = f(x, y) \]

where $u$ is a real-valued function of the two space variables $x, y$ in a domain $\Omega$. Consider a simple problem where we want to find a function $u$ such that

\[ -\nabla^2 u = 1 \]

in the solution domain $\Omega$ and $u = 0$ on the boundary $\partial\Omega$. To find the numerical approximation of $u$, we discretise the PDE using finite differences and sub-divide the domain into a grid of points $(x_i, y_j) = (ih, jh)$, where $i, j = 0, 1, 2, 3, \dots, N+1$. We solve for the unknowns $u_{i,j} \approx u(x_i, y_j)$ at the interior points. In this case, the grid spacing is given by $h = 1/(N+1)$.

On the 2-D grid, we can write the discretised equations (using a forward difference for the first derivative and a backward difference for the second derivative) as

\[ \nabla^2 u_{i,j} \approx \frac{u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{i,j}}{h^2} \]

The right-hand side of the above expression is called a five-point stencil, because every point on the lattice is averaged with its four nearest neighbours, as shown in Figure 2.5. A finite difference approximation to the above equation is then given by

\[ 4u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1} = h^2 f_{i,j} \]

This results in $N^2$ linear equations with $N^2$ unknowns $u_{i,j}$. The resulting matrix $A$ of the linear system is very large, sparse and has a banded structure. For example, for $N = 4$, the matrix shown in Figure 2.6 is of order 16x16 and only contains 25% non-zero elements.

Figure 2.5: 2-D grid and five-point stencil

Figure 2.6: Sparse matrix for the 2-D Poisson problem with N = 4 (16x16, 25% non-zero elements)
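To make the stencil concrete, the following hedged sketch (plain C; a matrix-free illustration rather than any code used in this project) applies the five-point operator to a grid vector stored row by row, with homogeneous boundary values:

    /* y = A*u for the N^2 x N^2 five-point Poisson matrix, without forming A;
       u[i*N + j] corresponds to u_{i,j} and boundary values are zero */
    void apply_poisson5pt(int N, const double *u, double *y)
    {
        int i, j;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                double s = 4.0 * u[i*N + j];
                if (i > 0)     s -= u[(i-1)*N + j];  /* neighbour u_{i-1,j} */
                if (i < N - 1) s -= u[(i+1)*N + j];  /* neighbour u_{i+1,j} */
                if (j > 0)     s -= u[i*N + j - 1];  /* neighbour u_{i,j-1} */
                if (j < N - 1) s -= u[i*N + j + 1];  /* neighbour u_{i,j+1} */
                y[i*N + j] = s;   /* equals h^2 f_{i,j} when u solves the system */
            }
        }
    }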
There are different storage schemes available to store these sparse matrices. Some
formats like ELL or Diagonal are better suited for GPUs. We will discuss these formats
in Section 4.2, considering their performance on GPUs.
2.5 Iterative Methods for Sparse Linear Systems

Iterative methods are commonly used for solving large linear systems. These methods try to find the solution of a linear system of equations Ax = b by generating a sequence of improving approximate solutions (here, iterative means the repetitive application of operations to improve the approximate solution). These methods use an initial guess as the first approximate solution and then improve this solution over successive iterations. There are two main classes of iterative methods: Stationary Iterative Methods and Krylov Subspace Methods. The Jacobi, Gauss-Seidel and Successive Over-Relaxation (SOR) methods are examples of stationary methods; they are easy to implement and analyse, but their convergence is not guaranteed for all classes of matrices.
Krylov Subspace Methods are a class of iterative methods which are considered the most important iterative techniques currently available for solving linear and non-linear systems of equations. These methods are widely adopted because they are efficient and reliable. Examples of Krylov Subspace Methods are Conjugate-Gradient, Biconjugate-Gradient and GMRES (Generalized Minimal Residual). These methods are based on the Krylov subspace. The m-order Krylov subspace is defined as

\[ \mathcal{K}_m(A, b) = \operatorname{span}\{\, b,\; Ab,\; A^2 b,\; \dots,\; A^{m-1} b \,\} \]

where $A$ is an $n \times n$ matrix and $b$ is a vector of length $n$. Research in Krylov subspace techniques has produced various new methods. A detailed explanation of all of these methods is
beyond the scope of this project. We will discuss one such Krylov subspace solver, i.e. GMRES, which we have used in our performance analysis example.
GMRES Method: GMRES is an iterative method which approximates the solution by the vector in the Krylov subspace with minimal residual [22]. GMRES approximates the solution $x_m$ by minimising the Euclidean norm of the residual $\|Ax_m - b\|$ over the Krylov subspace. This method is designed to solve non-symmetric linear systems. The most popular form of GMRES is based on the Gram-Schmidt orthogonalisation process. The Gram-Schmidt process takes a set of linearly independent vectors $S = \{v_1, \dots, v_k\}$ in Euclidean space and computes a set of orthogonal vectors $Q = \{q_1, \dots, q_k\}$ which spans the same subspace as $S$.
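As a hedged sketch of this process (plain C, written for illustration; not taken from any GMRES implementation), classical Gram-Schmidt can be written as:

    /* dot product of two vectors of length n */
    static double dot(int n, const double *a, const double *b)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    /* classical Gram-Schmidt: orthogonalise k linearly independent vectors of
       length n; v and q are k x n arrays stored row by row, and
       q_j = v_j - sum_{i<j} ((v_j . q_i) / (q_i . q_i)) q_i */
    void gram_schmidt(int k, int n, const double *v, double *q)
    {
        int i, j, jj;
        for (j = 0; j < k; j++) {
            for (i = 0; i < n; i++) q[j*n + i] = v[j*n + i];
            for (jj = 0; jj < j; jj++) {
                double c = dot(n, &v[j*n], &q[jj*n]) / dot(n, &q[jj*n], &q[jj*n]);
                for (i = 0; i < n; i++) q[j*n + i] -= c * q[jj*n + i];
            }
        }
    }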
Chapter 3
PETSc GPU Implementation
3.1 PETSc

PETSc is a scalable solver library, which has been used in the development of a large
number of HPC applications [26]. It provides infrastructure for rapid prototyping and
algorithmic design, which eases the development of scientific applications while
maintaining the scalability on large numbers of processors. The design of PETSc
allows transparent use of different linear/non-linear solvers and preconditioners in the
applications. The programming interface is provided through C, C++, FORTRAN and
Python.
In this section we will discuss the PETSc design and architecture, which will help in understanding our later implementation of support for different sparse matrix storage schemes.
3.1.1 PETSc Kernels

PETSc kernels are basic sets of services on top of which the scalable solver library is
built. These kernels are shown in Figure 3.1.
These kernels have a modular structure and are
designed to maintain portability across different
architectures and platforms. For example, instead
of float or integer data types, PETSc provides
new data types like PetscInt, PetscScalar or
PetscMPIInt. These data types are internally
mapped to corresponding int, float, float64 or
double data types supported on the underlying
platform. For our implementation, if we want to
add new memory management routines, we can
implement those in the corresponding kernel and
make them available to applications and other
kernels. These PETSc kernels are explained
in more detail in [27].

Figure 3.1: PETSc Kernels
3.1.2 PETSc Components

PETSc is developed using object-oriented paradigms and its architecture allows the easy integration of new features from external developer communities. PETSc consists of the various sub-components listed below:

- Vectors
- Matrices
- Distributed Arrays
- Preconditioners
- Krylov Subspace Solvers
- Non-linear Solvers
- Index Sets
- Timesteppers
PETSc allows easy customisation and extensions to these components. For example,
we can implement a new matrix subclass or preconditioner that can be transparently
used by all KSP solvers without any modification. The algorithmic implementation is separated from the parallel library layer, which allows code reusability and the easy addition of new solvers, preconditioners and data structures. Figure 3.2 shows the
organisation of different PETSc libraries and the levels of abstraction at which they
are exposed.
Figure 3.2: PETSc Library Organisation [54]
PETSc internally uses a number of libraries like BLAS, ParMetis, MPI and HDF5 to
provide the infrastructure required for large HPC applications. PETSc gives users much flexibility to choose among different libraries for different classes of applications, but most of the functionality of the underlying libraries is hidden from application developers by the parallel library layer.
3.1.3 PETSc Object Design

In PETSc, classes like Vector, Matrix and Distributed Array represent data objects.
These objects define various methods for data manipulation in sequential or parallel
implementations. The internal representation of an object, i.e. the data structure, is not
exposed to applications and is only available through exposed APIs. This is shown in
Figure 3.3. For example, the Vector class can be used for representing the right hand
side of a linear system Ax = b or discrete solutions of PDEs, and stores values in a simple array format similar to the C or FORTRAN array convention. This class defines various
methods for vector operations like the dot product, the vector norm, scaling, scatter or
gather operations. For parallel applications, PETSc automatically distributes these
vector elements within the communicators and uses the functionality of the underlying
MPI library to perform collective or point-to-point MPI operations.
In the parallel implementation, the Matrix or Preconditioner objects do not have
access to the internal data structure directly. Instead, they just call exposed APIs
through the PETSc interface and the internal object representation manages
communication within an MPI communicator. For example, for parallel vectors, a VecScatter object is created internally to manage data communication across MPI processes. The VecScatterBegin() and VecScatterEnd() routines are used to perform vector scatter operations across the communicator. To access internal Vector data, the application uses subroutines like VecGetArray(). Only Preconditioner (PC) objects are implemented in a data-structure-specific way, so they access and manipulate Vector or Matrix data structures directly.
Figure 3.3: PETSc Objects and Application Level Interface
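As an illustration of this data-access pattern, the hedged sketch below (a hypothetical helper, with error handling abbreviated) scales the local entries of a vector through the exposed API instead of touching the internal data structure:

    #include <petscvec.h>

    /* scale every local entry of a vector by 2 via the exposed access API */
    PetscErrorCode ScaleByTwo(Vec v)
    {
        PetscErrorCode ierr;
        PetscScalar    *data;
        PetscInt       i, n;

        ierr = VecGetLocalSize(v, &n); CHKERRQ(ierr);
        ierr = VecGetArray(v, &data); CHKERRQ(ierr);     /* borrow the internal array */
        for (i = 0; i < n; i++) data[i] *= 2.0;
        ierr = VecRestoreArray(v, &data); CHKERRQ(ierr); /* hand it back to PETSc */
        return 0;
    }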
3.2 PETSc GPU Implementation

Recently, GPU support has been added to the PETSc solver library. Currently it is under development and available in the PETSc development release [11]. The initial implementation allows for transparent use of GPUs without modifying the existing
application source code. Instead of writing completely new CUDA code, PETSc uses
the open source CUSP and Thrust libraries discussed in Section 2.3. This helps to keep
the GPU implementation separate from the existing PETSc code.
We will discuss this new implementation in more detail, as our development work is an extension of it.
3.2.1 Sequential Implementation

The current implementation assumes that every MPI process has access to a single GPU. A new GPU-specific Vector class called VecCUSP has been implemented. It uses
CUBLAS, CUSP, as well as Thrust library routines to perform vector operations on a
GPU. The idea behind using these libraries is to use already developed, fine-tuned
CUDA implementations with PETSc instead of developing new ones. The PETSc
implementation acts as an interface between PETSc data structures and external CUDA
libraries, i.e. Thrust and CUSP.
Whenever we execute a program with GPU support, two copies of any vector are
created, one on the CPU and another on the GPU. In the existing Vec class, a new flag
is added called valid_GPU_array. This flag has the following four possible values and
corresponding meaning:
PETSC_CUSP_UNALLOCATED : Object is not yet allocated on GPU
PETSC_CUSP_CPU : Object is allocated and a valid copy is available on CPU only
PETSC_CUSP_GPU : Object is allocated and a valid copy is available on GPU only
PETSC_CUSP_BOTH : Object is allocated and valid copies are on CPU & GPU (both)
Initially this flag has the value PETSC_CUSP_UNALLOCATED. When an application creates a Vector object, the VecCUSPCopyToGPU() subroutine creates a new vector copy on the GPU and sets the valid_GPU_array flag to PETSC_CUSP_BOTH, indicating that both copies are now valid and contain recent values. Now all vector operations can be performed on the GPU. Whenever the VecCUSPCopyToGPU() function gets called, it makes a copy to the GPU only if the vector object has been modified on the CPU, i.e. the value of the valid_GPU_array flag has changed. Memory copies between host and device are managed through the subroutines VecCUDACopyToGPU() and VecCUDACopyFromGPU(). For example, when an application calls VecGetArrayRead() to access vector data, internally it first calls VecCUDACopyFromGPU() to copy recent vector values from the GPU and then sets valid_GPU_array to PETSC_CUSP_BOTH, indicating that both copies are now valid and contain recent values. This mechanism can be illustrated by the implementation of the simple vector operation AXPY, i.e. y = alpha*x + y:
    VecAXPY()
    {
        /* copy vector from CPU to GPU: if modified */
        ierr = VecCUDACopyToGPU(xin);
        /* copy vector from CPU to GPU: if modified */
        ierr = VecCUDACopyToGPU(yin);
        try {
            /* perform AXPY using the CUSP BLAS routine */
            cusp::blas::axpy(*((Vec_CUDA*)xin->spptr)->GPUarray,
                             *((Vec_CUDA*)yin->spptr)->GPUarray,
                             alpha);
            /* the updated copy is now present on the GPU */
            yin->valid_GPU_array = PETSC_CUDA_GPU;
            /* wait until all threads finish */
            ierr = WaitForGPU();
        }
        catch (char *ex) {
            .........
        }
    }

Figure 3.4: VecAXPY implementation in PETSc using CUSP and CUBLAS [11]

For the above vector operation, the VecCUDACopyToGPU() subroutine allocates memory and copies the vector data to the GPU if the flag value is PETSC_CUSP_UNALLOCATED. If the flag value is PETSC_CUSP_CPU, memory is already allocated on the GPU but the copy on the CPU has recently been modified, so a CPU-to-GPU vector copy is made first. The routine then calls the CUSP/CUBLAS library routine and sets valid_GPU_array to PETSC_CUDA_GPU.
3.2.2 Parallel Implementation

In the parallel implementation, the parallel Vector and Matrix objects are implemented on top of the sequential implementation. The rows of a matrix are partitioned among the processes in a communicator, as shown in Figure 3.5. In the PETSc implementation, a sparse matrix is stored in two parts: the on-diagonal part and the off-diagonal part. The on-diagonal portion of the matrix, say Ad, stores the values of the columns associated with the rows owned by that process. These matrix elements are shown in red. All remaining entries of the off-diagonal portion are stored in another component, say Ao.
Figure 3.5: Parallel matrix with on-diagonal and off-diagonal elements for two MPI processes
The sparse matrix-vector product is calculated in two steps. First, we calculate the product related to the on-diagonal entries of the matrix, i.e. Ad, using the associated entries of the vector x, i.e. xd. Then we calculate the product of the off-diagonal matrix entries and the associated vector entries xo, which is added to the previous result yd:

    yd  = Ad * xd
    yd += Ao * xo
For this operation, the updated entries of the vector xo must be communicated within the communicator. As only a few Ao elements are non-zero, we do not have to communicate all xo entries. This communication is managed through the VecScatter object, which handles parallel gather and scatter operations using non-blocking MPI calls. The VecScatter object stores two arrays of indices: one array stores the global indices of the vector elements that will be received as updated entries from other processes in the communicator; these received vector elements are stored in a local array. The second array stores the mapping between the global index of a vector element and its position in the local array.
Communication starts with the VecScatterBegin() call, which copies data into message buffers. For the GPU implementation, the updated vector entries first get copied from GPU memory to the CPU using the VecCUDACopyFromGPU() function. The communication completes after the VecScatterEnd() call, which waits for the completion of the non-blocking MPI calls posted by VecScatterBegin(). This implementation of the parallel matrix-vector operation is shown below:
    VecScatterBegin(a->Mvctx, xd, hatxo, INSERT_VALUES, SCATTER_FORWARD);
    MatMult(Ad, xd, yd);
    VecScatterEnd(a->Mvctx, xd, hatxo, INSERT_VALUES, SCATTER_FORWARD);
    MatMultAdd(hatAo, hatxo, yd, yd);

Figure 3.6: Parallel matrix-vector multiplication in the PETSc GPU implementation [11]

More information about this implementation can be found in [11] [28].
3.3 Applications Running with PETSc GPU Support

PETSc allows transparent use of GPUs without any changes in the application source code. Most existing PETSc applications can run on GPUs. A new Vector class, i.e. VecCUSP, and a new Matrix class, i.e. MatCUSP, have been added to PETSc, which perform all matrix-vector operations on GPUs. To run an existing application on the GPU, the user has to
set the Vector type to VECCUSP and the Matrix type to MATCUSP using the VecSetType() and MatSetType() routines respectively. The user can also set these Vector and Matrix types using the option database keys -vec_type seqcusp and -mat_type seqaijcusp.
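For illustration, a minimal hedged sketch of doing this programmatically is shown below; the helper name and problem size are hypothetical, and the type names are passed as the option database strings quoted above:

    #include <petscksp.h>

    /* create a sequential vector and matrix that live on the GPU */
    static PetscErrorCode CreateGPUObjects(PetscInt n, Vec *x, Mat *A)
    {
        PetscErrorCode ierr;

        ierr = VecCreate(PETSC_COMM_SELF, x); CHKERRQ(ierr);
        ierr = VecSetSizes(*x, PETSC_DECIDE, n); CHKERRQ(ierr);
        ierr = VecSetType(*x, "seqcusp"); CHKERRQ(ierr);

        ierr = MatCreate(PETSC_COMM_SELF, A); CHKERRQ(ierr);
        ierr = MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, n, n); CHKERRQ(ierr);
        ierr = MatSetType(*A, "seqaijcusp"); CHKERRQ(ierr);
        return 0;
    }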
All of the Krylov Subspace methods except KSPIBCGS (Improved Stabilized version
of BiConjugate Gradient Squared) are supported on the GPU. Currently, Jacobi, AMG
(Algebraic Multigrid) and AINV (Approximate Inverse) preconditioners are supported
on the GPUs.
Chapter 4
Sparse Matrices
As discussed in Section 2.4, the discretisation of PDEs results in large sparse matrices. A matrix with only a few non-zero elements can be considered sparse. In a practical sense, a matrix can be considered sparse if specialised techniques can be used to take advantage of the sparsity and the sparsity pattern of the matrix. Depending on the sparsity pattern, we can divide matrices into two broad categories: structured and unstructured. A matrix with non-zero elements in a specific regular pattern is called a structured sparse matrix; for example, all non-zero elements may lie along a few diagonals of the matrix, or in small dense sub-blocks, which results in regular patterns. The application of FDM or linear FEM on rectangular grids results in structured sparse matrices. On the other hand, irregular meshes result in unstructured sparse matrices with no specific structure or pattern of non-zero elements. Figure 4.1 and Figure 4.2 show examples of structured and unstructured matrices respectively.
Depending on the sparsity pattern, different storage schemes or data structures can be
used. Importantly, the performance of the matrix operations depends on these storage
schemes and processor architecture. This becomes more apparent for vector processors
and GPUs. In this section we will discuss sparse matrix representation and different
storage schemes with their storage efficiency and performance.
Figure 4.1: Example of a structured matrix from a structured problem

Figure 4.2: Example of an unstructured matrix from a bipartite graph
4.1 Sparse Matrix Representation

The structure of sparse matrices can be ideally represented by adjacency graphs. Graph theory techniques have been used effectively for parallelising various iterative methods and implementing preconditioners [29]. A graph $G = (V, E)$ is represented by a set of vertices $V = \{v_1, v_2, \dots, v_n\}$ and a set of edges $E = \{(v_i, v_j)\}$, where $v_i$ and $v_j$ are elements of $V$. In the 2-D plane, the graph $G$ is represented by a set of points which are connected by edges between these points. In the case of the adjacency graph of a sparse matrix, the $n$ vertices in $V$ represent the $n$ unknown variables, and the edges in $E$ represent the binary relation between those vertices: there is an edge from node $i$ to node $j$ when the matrix element $a_{ij} \neq 0$. An adjacency graph can be directed or undirected depending on the symmetry of the non-zeros. When a sparse matrix has a symmetric non-zero pattern (i.e. $a_{ij} \neq 0$ if and only if $a_{ji} \neq 0$), the adjacency graph is undirected; otherwise it is directed.
This adjacency graph representation can be used for parallelisation. In the case of parallelising Gaussian elimination, at a given stage of the elimination we can find the unknowns which are independent of each other from the above binary relation. For example, in the case of a diagonal matrix all unknowns are independent of each other, which is not true for dense matrices. More information about sparse matrix representation and parallelisation strategies can be found in [29].
4.2 Sparse Matrix Storage Schemes

There are two main reasons for having different sparse matrix storage formats: memory requirements and computational efficiency. It may not be feasible to store a large sparse matrix in main memory and, importantly, it is not necessary to store zero matrix elements. Various storage schemes (i.e. data structures) have been proposed to effectively exploit the sparsity and sparsity patterns of matrices. There is no single best storage scheme for all sparse matrices: a few are suitable for matrices with structured sparsity patterns, some are general purpose, and others are storage schemes for matrices with arbitrary non-zero patterns. Each storage scheme has different storage costs, computational costs and performance characteristics. In this section we will discuss various storage schemes and their performance on GPUs.
Figure 4.3: Sparse matrix representation with a directed adjacency graph
4.2.1 Coordinate List

The coordinate list (COO) is a simple and the most flexible storage format, where we store every non-zero element of a matrix with three vectors: data, row and indices. The data vector stores the non-zero elements of the matrix in row-major order. The row and indices vectors explicitly store the associated row and column index of every element in the data vector. This is explained in the following Figure:

    A = | 4 0 1 0 0 |
        | 0 5 0 2 0 |
        | 0 0 6 0 3 |
        | 9 0 0 7 0 |
        | 0 0 0 0 8 |

    data    = [ 4 1 5 2 6 3 9 7 8 ]
    row     = [ 0 0 1 1 2 2 3 3 4 ]
    indices = [ 0 2 1 3 2 4 0 3 4 ]

Figure 4.4: Sparse matrix and corresponding COO storage representation
This is a general purpose and robust storage scheme, which can be used for matrices with arbitrary sparsity patterns. The above example shows that the storage cost of the COO format is proportional to the number of non-zero elements: for an MxN sparse matrix with k non-zero elements, it requires k*(sizeof(scalar) + 2*sizeof(int)) bytes.
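To make the role of the three vectors concrete, a hedged sketch of a sequential sparse matrix-vector product in COO format is shown below (plain C, for illustration only; array names follow the Figure above):

    /* y = A*x for an M-row matrix with nnz non-zeros held in COO format */
    void coo_spmv(int M, int nnz,
                  const double *data, const int *row, const int *indices,
                  const double *x, double *y)
    {
        int n;
        for (n = 0; n < M; n++) y[n] = 0.0;
        /* every non-zero carries its own (row, column) coordinates */
        for (n = 0; n < nnz; n++)
            y[row[n]] += data[n] * x[indices[n]];
    }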
4.2.2 Compressed Sparse Row

Compressed Sparse Row (CSR) is a popular and the most general-purpose storage format. It can be used for storing matrices with arbitrary sparsity patterns, as it makes no assumptions about the structure of the non-zero elements. Like COO, this format also stores only the non-zero elements. These elements are stored using three vectors: data, indices and row_ptr. The data and indices vectors are the same as for the COO format. For an MxN sparse matrix, the row_ptr vector has length M+1 and stores the index at which each row of the matrix starts in the data vector. The last entry of row_ptr corresponds to the number of non-zero elements in the matrix. This storage scheme is explained in the Figure below:

    A = | 4 0 0 1 0 |
        | 0 5 0 2 0 |
        | 0 0 0 0 3 |
        | 7 0 0 6 0 |
        | 0 8 0 0 9 |

    data    = [ 4 1 5 2 3 7 6 8 9 ]
    indices = [ 0 3 1 3 4 0 3 1 4 ]
    row_ptr = [ 0 2 4 5 7 9 ]

Figure 4.5: Sparse matrix and corresponding CSR storage representation
There are some advantages to using CSR over COO. The CSR format takes less storage than COO due to the compression of the row indices explained in the above Figure. Also, with the row_ptr vector we can easily compute the number of non-zero elements in row i as row_ptr[i+1] - row_ptr[i]. In parallel algorithms, the row_ptr values allow fast row slicing operations and fast access to matrix elements using pointer indirection. This is the most commonly used sparse matrix storage scheme on CPUs. For an MxN sparse matrix with k non-zero elements, it requires k*sizeof(scalar) + (k + M + 1)*sizeof(int) bytes.
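The following hedged sketch of a sequential CSR matrix-vector product (plain C, for illustration only) shows how the row_ptr vector gives direct access to the non-zeros of each row:

    /* y = A*x for an M-row matrix held in CSR format */
    void csr_spmv(int M,
                  const double *data, const int *indices, const int *row_ptr,
                  const double *x, double *y)
    {
        int i, jj;
        for (i = 0; i < M; i++) {
            double sum = 0.0;
            /* non-zeros of row i live in data[row_ptr[i] .. row_ptr[i+1]-1] */
            for (jj = row_ptr[i]; jj < row_ptr[i+1]; jj++)
                sum += data[jj] * x[indices[jj]];
            y[i] = sum;
        }
    }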
4.2.3 Diagonal

Applications of stencils to regular grids result in banded sparse matrices, where the non-zero elements are restricted to a few sub-diagonals of the matrix. For these matrices, the diagonal (DIA) format can be used effectively. The DIA format uses only two vectors, data and offsets. The data vector stores the non-zero elements of the sub-diagonals of the matrix. The offsets vector stores the offset of every sub-diagonal from the main diagonal of the matrix. By convention, the main diagonal has offset 0, the diagonals below the main diagonal have negative offsets and those above the main diagonal have positive offsets. This is illustrated with an example in the Figure below:

    A = | 3 0 8 0  0 |        data = | * 3  8 |        offsets = [ -3 0 2 ]
        | 0 4 0 9  0 |               | * 4  9 |
        | 0 0 5 0 10 |               | * 5 10 |
        | 1 0 0 6  0 |               | 1 6  * |
        | 0 2 0 0  7 |               | 2 7  * |

Figure 4.6: Sparse matrix and corresponding DIA storage scheme

Unlike CSR and COO, this storage format stores a few zero elements explicitly. As we can see in the above Figure, the diagonal with offset -3 has only two non-zero elements, but to store it in diagonal format the elements of this diagonal are padded with an arbitrary value, i.e. *. So there is some extra storage overhead associated with this format, but there are larger storage benefits due to the fact that we do not have to store column or row indices explicitly. Usually, the data vector stores the non-zero elements in column-major order, which ensures memory coalescing on GPU devices. We will discuss this in more detail in the performance analysis in Section 4.3. For an MxM square matrix with d sub-diagonals having at least one non-zero element, it requires d*M*sizeof(scalar) + d*sizeof(int) bytes.

This is not a general-purpose storage scheme like CSR and COO. It is very sensitive to
the sparsity pattern and is only useful for matrices with an ordered banded structure. For example, consider the matrix in Figure 4.7. This matrix has a banded structure, but it is not suitable for the DIA storage scheme: the non-zero element structure is in exactly the opposite order of what is ideally suited to the DIA format.
When we store this matrix in the diagonal storage format, we end up storing all of its sub-diagonals, each containing a single non-zero element and four padding elements.
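To make the DIA layout concrete, a hedged sketch of a sequential DIA matrix-vector product is shown below (plain C, for illustration only; each diagonal is assumed stored contiguously in data, and the bounds check skips the padded positions):

    /* y = A*x for an n x n matrix held in DIA format with d stored diagonals;
       data[k*n + i] is the entry of diagonal k in row i */
    void dia_spmv(int n, int d,
                  const double *data, const int *offsets,
                  const double *x, double *y)
    {
        int i, k;
        for (i = 0; i < n; i++) y[i] = 0.0;
        for (k = 0; k < d; k++) {
            for (i = 0; i < n; i++) {
                int j = i + offsets[k];   /* column touched by diagonal k in row i */
                if (j >= 0 && j < n)      /* out-of-range j marks a padded slot */
                    y[i] += data[k*n + i] * x[j];
            }
        }
    }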
4.2.4 ELL or Padded ITPACK

Like DIA, the ELL format is well suited to vector architectures. This format can be used for storing sparse matrices arising from semi-structured meshes where the number of non-zero elements per row is nearly the same. For an MxN sparse matrix with a maximum of k non-zeros per row, we store the matrix in an Mxk dense data array. If a particular row has fewer than k non-zeros, that row is padded with zeros. The indices array stores the column index of every element in the data array. These elements are stored in column-major order. Figure 4.8 illustrates an example of the ELL storage scheme:

Figure 4.8: Sparse matrix and corresponding ELL storage scheme
Compared to DIA, ELL is a more general storage format and it is not necessary to have a banded structure of non-zero elements. But the number of non-zeros must be roughly the same across all rows of the matrix, otherwise we end up padding with large numbers of zero elements. For an MxN sparse matrix with a maximum of NNZ_PER_ROW non-zeros per row, it requires M*NNZ_PER_ROW*(sizeof(scalar) + sizeof(int)) bytes of storage.
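A hedged sketch of a sequential ELL matrix-vector product (plain C, for illustration only; padded slots are assumed to store the value zero) shows why the column-major layout suits GPUs: consecutive iterations of the inner loop touch consecutive rows, which becomes a coalesced access pattern when each row is handled by one GPU thread:

    /* y = A*x for an M-row matrix held in ELL format with at most k non-zeros
       per row; data and indices are M x k arrays stored in column-major order */
    void ell_spmv(int M, int k,
                  const double *data, const int *indices,
                  const double *x, double *y)
    {
        int i, c;
        for (i = 0; i < M; i++) y[i] = 0.0;
        for (c = 0; c < k; c++)          /* c-th stored non-zero of every row */
            for (i = 0; i < M; i++)
                y[i] += data[c*M + i] * x[indices[c*M + i]];
    }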
4.2.5 Hybrid

Although the ELL format is well suited to vector architectures, sparse matrices arising from complex geometries usually do not have the same number of non-zeros per row [12]. As the number of non-zero elements per row starts to vary to a larger extent, we end up storing a large number of padding elements. Consider the example of the sparse matrix shown in Figure 4.9. In this case, except for the first row, all other rows have
    A = | 0 0 0 1 2 |
        | 0 0 3 4 0 |
        | 0 5 6 0 0 |
        | 7 8 0 0 0 |
        | 0 0 0 0 0 |

Figure 4.7: Banded non-zero pattern which is not suitable for the DIA format