Irregular Applications –Sparse Matrix Vector Multiplication
Transcript of Irregular Applications –Sparse Matrix Vector Multiplication
![Page 1: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/1.jpg)
Irregular Applications –Sparse Matrix Vector
Multiplication
CS/EE217 GPU Architecture and Parallel Programming
Lecture 20
![Page 2: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/2.jpg)
SparseMatrix-VectorMul2plica2on
Manyslidesfrompresenta2onby:KaiHe1,SheldonTan1,
EstebanTlelo-Cuautle2,HaiWang3andHeTang3
1UniversityofCalifornia,Riverside2Ins6tuteNa6onalAstrophysics,Op6calandElectrics,Mexico
3UESTC,China
![Page 3: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/3.jpg)
Outline
Ø Background
Ø Reviewofexis6ngmethodsofSparseMatrix-Vector(SpMV)Mul6plica6ononGPU
Ø SegSpMVmethodonGPU
Ø Experimentalresults
Ø Conclusion
![Page 4: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/4.jpg)
Sparse Matrices
l Matrices where most elements are zeros l Come up frequently in many science,
engineering, modeling, computer science, etc.. Problems l Example, solvers for linear systems of equations l Example, graphs (e.g., representation of the web
connectivity) l What impact/drawback do they have on
computations?
![Page 5: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/5.jpg)
SparseMatrixFormat
Structured Unstructured
![Page 6: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/6.jpg)
CompressedSparseRow(CSR)
2 14
2
3
96
8
1
4
3
7
Ø RowPtr 146791113
Ø ColIdx 125123461546
Ø Values 213142639874
![Page 7: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/7.jpg)
CompressedSparseColumn(CSC)
2 14
2
3
96
8
1
4
3
7
Ø ColPtr 146791113
Ø RowIdx 125123461546
Ø Values 219142673834
![Page 8: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/8.jpg)
Characteris2csofSpMVMul2plica2on
Ø Memorybound:Ø FLOP:Bytera6oisverylow
Ø Generallyirregular&unstructuredØ Unlikedensematrixopera6on
![Page 9: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/9.jpg)
Objec2vesofSpMVImplementa2ononGPU
Ø ExposesufficientparallelismØ Developthousandsofindependentthreads
Ø Minimizeexecu6onpathdivergenceØ SIMDu6liza6on
Ø MinimizememoryaccessdivergenceØ Memorycoalescing
![Page 10: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/10.jpg)
Compu2ngSparseMatrix-VectorMul2plica2on
Row laid out in sequence Middle[elemid] = val[elemid] * vec[col[elemid]]
![Page 11: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/11.jpg)
Compu2ngSparseMatrix-VectorMul2plica2on
Row laid out in sequence Middle[elemid] = val[elemid] * vec[col[elemid]]
![Page 12: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/12.jpg)
Compu2ngSparseMatrix-VectorMul2plica2on
Row laid out in sequence Middle[elemid] = val[elemid] * vec[col[elemid]]
![Page 13: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/13.jpg)
Compu2ngSparseMatrix-VectorMul2plica2on
Row laid out in sequence Middle[elemid] = val[elemid] * vec[col[elemid]]
![Page 14: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/14.jpg)
Compu2ngSparseMatrix-VectorMul2plica2on
Row laid out in sequence Middle[elemid] = val[elemid] * vec[col[elemid]]
![Page 15: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/15.jpg)
Sequential loop to implement SpMV
for (int row=0; row < num_rows; row++) { float dot = 0; int row_start = row_ptr[row]; int row_end = row_ptr[row+1];
for(int elem = row_start; elem < row_end; elem++) { dot+= data[elem] * x [col_index[elem]]; } y[row] =dot;
}
![Page 16: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/16.jpg)
Discussion
l How do we assign threads to work?
l How will this parallelize? l Think of memory access patterns l Think of control divergence
![Page 17: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/17.jpg)
Algorithm:gbmv_s
Ø Onethreadperrow
… … … …
![Page 18: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/18.jpg)
gbmv_s
__global__ void SpMV_CSR (int num_rows, float *data, int *col_index, int *row_ptr, float *x, float *y) {
int row=blockIdx.x * blockDim.x + threadIdx.x;
if(row < num_rows) { float dot = 0; int row_start = row_ptr[row]; int row_end = row_ptr[row+1]; for(int elem = row_start; elem < row_end; elem++) { dot+= data[elem] * x [col_index[elem]]; } y[row] =dot; }
}
• Issues? • Poor coalescing
![Page 19: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/19.jpg)
Algorithm:gbmv_w
Ø OnewarpperrowØ Par6almemorycoalescingØ vec[col[elemid]]cannotbecoalescedbecausethecol[elemid]canbe
arbitraryØ Howdoyousizethewarps?
![Page 20: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/20.jpg)
VectorExpansion
1 2 3 3 221 5 4 55 3Col Index
3
7
6
9
1
3
7
1
3
7
6
1
6
7
9
6
1
Original Vector Expanded
Vector
The expansion operation will be done on CPU.
vec_expanded[elemid] = vec[col[elemid]]
![Page 21: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/21.jpg)
Algorithm:PS(Product&Summa2on)
Ø Onethreadpernon-zeroelementØ Memorycoalescing
Matrix in CSR Format
Expanded Vector
Product Kernel
![Page 22: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/22.jpg)
Algorithm:PS(Product&Summa2on)
Product (SMEM)
Multiplication result
Summation Kernel
![Page 23: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/23.jpg)
Algorithm:PS(Product&Summa2on)
Ø Weneedtoreadrowpointerarraytogetrow_beginandrow_endforsumma6on
Ø Thenumberofaddi6ononeachthreadsaredifferentØ Maynothaveenoughsharedmemorywhentherearetoomany
nonzerosinarow
![Page 24: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/24.jpg)
Algorithm:segSpMV
Ø DividethematrixinCSRformatandexpandedvectorintosmallsegmentswithconstantlength
Ø Segmentswithoutenoughelementswillbeappendedzeros.
0 0
Segment length = 2
![Page 25: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/25.jpg)
segSpMVKernel
segSpMV Kernel
0 0Matrix in CSR Format
Expanded Vector
Product (SMEM)
0 0
0 0
![Page 26: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/26.jpg)
segSpMVKernel(Cont.)
segSpMV Kernel
0 0Product (SMEM)
Intermediate result
![Page 27: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/27.jpg)
FinalSumma2on
Ø Finalsumma6onisdoneonCPU
Ø Therewillbeveryfewaddi6onsinfinalsumma6onifwechoosesegmentlengthcarefully
Intermediate result
Multiplication result
![Page 28: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/28.jpg)
Algorithm:segSpMV
Ø Advantages:Ø Canenjoyfullcoalesced
memoryaccess
Ø Lesssynchroniza6onoverheadbecauseallsegmentshavesamelength
Ø Canstoreallintermediateresultsinsharedmemory
CPU part
CPU part
Read matrix in CSR format and original vector
Vector Expansion
Final Summation
SpMV KernelGPU part
![Page 29: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/29.jpg)
ExperimentResults
Matrices size nnz nzperrow Seg_lengthmc2depi 525825 2100225 3.9942 4
mac_econ_fwd500 206500 1273389 6.1665 8
scircuit 170998 958936 5.6079 8
webbase-1M 1000005 6105536 6.1055 8
cop20k_A 121192 1362087 11.2391 16
cant 62451 2034917 32.5842 32
consph 83334 3046907 36.5626 32
pwtk 217918 5926171 27.1945 32
qcd5_4 49152 1916928 39 32
shipsec1 140874 3977139 28.2319 32
pdb1HYS 36417 2190591 60.153 64
rma10 46835 2374001 50.6886 64
![Page 30: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/30.jpg)
ResultsofShortSegment
0 2 4 6 8
10 12 14 16
gbmv_s gbmv_w ps segSpMV_r segSpMV
(ms)
![Page 31: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/31.jpg)
SpeedupofSpMV
0
4
8
12
16
20
gbmv_s gbmv_w ps SpMV_r
![Page 32: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/32.jpg)
ResultsofLongSegment
0
2
4
6
8
10
12
gbmv_s gbmv_w ps segSpMV_r segSpMV
(ms)
![Page 33: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/33.jpg)
Reference
Ø [0]Thispaper:KaiHe,SheldonTan,EstebanTlelo-Cuautle,HaiWanganHeTang,“ANewSegmenta6on-BasedGPUAcceleratedSparseMatrixVectorMul6plica6on”,IEEEMWCAS2014
Ø [1]WangBoXiongDavidandMuShuai,“TamingIrregularEDAApplica6onsonGPUs”,ICCAD2009.
Ø [2]NathanBellandMichaelGarland,“SparseMatrix-VectorMul6plica6ononThroughput-OrientedProcessors”,NVIDIAResearch.
Ø [3]M.Naumov,“ParallelSolu6onofSparseTriangularLinearSystemsinthePrecondi6onedItera6veMethodsontheGPU”,NVIDIATechnicalReport,NVR-2011-001,2011.
![Page 34: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/34.jpg)
Another Alternative; ELL
l ELL representation; alternative to CSR l Start with CSR l PAD each row to the size of the longest row l Transpose it
![Page 35: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/35.jpg)
__global__ void SpMV_ELL(int num_rows, float *data, int *col_index, int num_elem, float *x, float *y) {
int row=blockIdx.x * blockDim.x + threadIdx.x;
if(row < num_rows) { float dot = 0; for(int i = 0; i < num_elem; i++) { dot+= data[row+i*num_rows] * x [col_index[row+i*num_rows]]; } y[row] =dot; }
}
![Page 36: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/36.jpg)
Example/Intuition
l Lets look at Thread 0 l First iteration: 3 * col[0] l Second iteration 1 * col [2] l Third iteration? l Is this the correct result? l What do you think of this implementation?
![Page 37: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/37.jpg)
Problems?
l What if one (or a few) rows are dense? l A lot of padding l Big storage/memory overhead l Shared memory overhead if you try to use it l Can actually become slower than CSR
l What can we do? l In comes COO (Coordinate Format)
![Page 38: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/38.jpg)
Main idea (see 10.4 in the book)
l Start with ELL l Store not only column index, but also row index
l This now allow us to reorder elements l See code above
l What can we gain from doing that?
for (int i = 0; i < num_elem; i++) y[row_index[i]]+= data[i]* x[col_index[i]];
![Page 39: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/39.jpg)
Main idea (cont’d)
l Convert CSR to ELL l As you convert, if there is a long row, you
remove some elements from it and keep them in COO format
l Send ELL (missing the COO elements) to GPU for SpMV_ELL kernel to chew on
l Once the results are back, the CPU adds in the contributions from the missing COO elements l Which should be very few
![Page 40: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/40.jpg)
Other optimizations?
l Sorting rows by length (see 10.5) in the book l Keep track of original row order l Create different kernels/grids for rows of similar size l Reshuffle y at the end to its original position
![Page 41: Irregular Applications –Sparse Matrix Vector Multiplication](https://reader033.fdocuments.net/reader033/viewer/2022041516/62527d0399cecd6ac263f15b/html5/thumbnails/41.jpg)