Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of...
![Page 1: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/1.jpg)
Automatic Performance Tuning of SpMV on GPGPU
Xianyi Zhang
Lab of Parallel Computing
Institute of Software Chinese Academy of Sciences
![Page 2: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/2.jpg)
Outline
Motivation
SpMV Introduction
AMD Stream Computing
GOSpMV Overview
GOSpMV Performance Evaluation
Conclusion & Future Work
![Page 3: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/3.jpg)
Motivation
Sparse Matrix-Vector Multiplication (SpMV): y = y + Ax
An important kernel in scientific applications (PDE solvers, simulation, etc.)
Low performance: irregular memory access pattern
![Page 4: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/4.jpg)
Motivation
GPU: huge computation power
Jason Yang, James Goodman. Symmetric Key Cryptography on Modern Graphics Hardware. http://ati.amd.com/technology/streamcomputing/asiacrypt2007.pdf
![Page 5: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/5.jpg)
SpMV Introduction
CSR (Compressed Sparse Row)
    | 1 0 2 |
A = | 0 4 0 |
    | 0 0 1 |
A_val = [1,2,4,1]
A_col = [0,2,1,2]
A_ptr = [0,2,3,4]
for (i = 0; i < n; i++) {
    value = 0;
    for (j = A_ptr[i]; j < A_ptr[i+1]; j++)
        value = value + A_val[j] * x[A_col[j]];
    y[i] += value;
}
x is accessed indirectly (through A_col), so its access pattern is irregular
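The CSR loop on this slide can be written as a self-contained C function. This is a minimal CPU sketch, not the GOSpMV GPU kernel. The function name `spmv_csr`, the `float` element type (chosen to mirror the single-precision evaluation) and the `int` index type are assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* y = y + A*x for an n-row sparse matrix A stored in CSR.
 * A_ptr has n+1 entries; row i owns entries A_ptr[i] .. A_ptr[i+1]-1
 * of A_val (values) and A_col (column indices).
 * Name and types are illustrative assumptions. */
void spmv_csr(size_t n, const int *A_ptr, const int *A_col,
              const float *A_val, const float *x, float *y)
{
    assert(A_ptr && A_col && A_val && x && y);
    for (size_t i = 0; i < n; i++) {
        float value = 0.0f;
        for (int j = A_ptr[i]; j < A_ptr[i + 1]; j++)
            value += A_val[j] * x[A_col[j]];  /* indirect access to x */
        y[i] += value;
    }
}
```

With the example arrays from the slide, x = [1, 2, 3] and y initialised to zero, the call `spmv_csr(3, A_ptr, A_col, A_val, x, y)` leaves y = [7, 8, 3].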
![Page 6: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/6.jpg)
SpMV Introduction
BCSR (Block Compressed Sparse Row)
Example: BCSR with 2 × 3 blocks
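To make the BCSR layout concrete, here is a minimal CPU sketch of y = y + Ax over r × c blocks. The storage conventions are assumptions, since the slide shows the format only as a figure: blocks are stored back to back in `B_val`, each block row-major; `B_col[k]` is a block column index; the matrix dimensions are padded to multiples of r and c. The names `spmv_bcsr`, `B_ptr`, `B_col` and `B_val` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* y = y + A*x with A stored in BCSR using r x c blocks (assumed layout).
 * nb_rows : number of block rows; y has nb_rows * r entries
 * B_ptr   : block row I owns blocks B_ptr[I] .. B_ptr[I+1]-1
 * B_col   : block column index of each block (first column = B_col[k] * c)
 * B_val   : r*c values per block, row-major, explicit zeros included */
void spmv_bcsr(size_t nb_rows, int r, int c,
               const int *B_ptr, const int *B_col,
               const float *B_val, const float *x, float *y)
{
    assert(r > 0 && c > 0);
    for (size_t I = 0; I < nb_rows; I++) {
        float *yb = y + I * (size_t)r;           /* r outputs of this block row */
        for (int k = B_ptr[I]; k < B_ptr[I + 1]; k++) {
            const float *blk = B_val + (size_t)k * r * c;
            const float *xb = x + (size_t)B_col[k] * c;  /* one index per block */
            for (int ii = 0; ii < r; ii++) {
                float sum = 0.0f;
                for (int jj = 0; jj < c; jj++)
                    sum += blk[ii * c + jj] * xb[jj];
                yb[ii] += sum;
            }
        }
    }
}
```

Compared with CSR, only one column index is read per block and the c entries of x used by a block are contiguous. The cost is that zeros stored to fill a block are also multiplied, which is why the fill ratio matters when choosing r × c.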
![Page 7: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/7.jpg)
AMD Stream Computing
Programming Model
AMD Stream Computing User Guide
![Page 8: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/8.jpg)
AMD Stream Computing
AMD Brook+
AMD Stream Computing User Guide
![Page 9: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/9.jpg)
GOSpMV Overview
GOSpMV Software Architecture
![Page 10: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/10.jpg)
GOSpMV Overview
BCSR SpMV implementation on GPGPU
![Page 11: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/11.jpg)
GOSpMV Overview
Automatic Performance Tuning
![Page 12: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/12.jpg)
GOSpMV Overview
Off-line GPGPU Benchmark
Dense matrices (different sizes)
Every BCSR block size
[Figure: off-line benchmark results. x-axis: nzCount (up to 4000000); y-axis: MFLOPS (0 to 5000); series: BCSR block sizes 1x1, 2x2, 3x3, 4x4]
![Page 13: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/13.jpg)
GOSpMV Overview
Run-Time Evaluation (search for the optimal BCSR block size)
Input: sparse matrix A; GPGPU benchmark data Pdense(block-format, nzd)
Output: the maximum P(A, block-format, σ) and the corresponding optimal BCSR block size
For each BCSR r × c block size,
do
    calculate the fill ratio fE_rc(A, σ) with sample rate σ
    Psp(block-format, nzE_BCSR) = Pdense(block-format, nzd), where nzd is the benchmark point nearest to nzE_BCSR
    P(A, block-format, σ) = Psp(block-format, nzE_BCSR) / fE_rc(A, σ)
done
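The selection loop above can be sketched in C as follows. Several details are assumptions, because the slide does not specify them: the fill ratio is estimated by scanning every `stride`-th block row (a deterministic stand-in for the sample rate σ, roughly σ = 1/stride); nzE_BCSR is read as "estimated stored entries in BCSR" and computed as fill ratio × nnz(A); and the benchmark data is held in a hypothetical `DenseProfile` table. All names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

/* Off-line benchmark data for one block size (hypothetical layout). */
typedef struct {
    int r, c;              /* BCSR block size */
    int n_points;          /* number of benchmark points */
    const double *nzd;     /* stored entries of each dense test matrix */
    const double *mflops;  /* measured Pdense at each nzd */
} DenseProfile;

/* Estimated fill ratio: stored BCSR entries / true nonzeros, computed on
 * every stride-th block row of a CSR matrix. Returns -1.0 on failure. */
double fill_ratio(int n_rows, int n_cols, const int *A_ptr,
                  const int *A_col, int r, int c, int stride)
{
    assert(r > 0 && c > 0 && stride > 0);
    int nb_cols = (n_cols + c - 1) / c;
    int *seen = malloc((size_t)nb_cols * sizeof *seen);
    long blocks = 0, nnz = 0;
    if (!seen) return -1.0;
    for (int b = 0; b < nb_cols; b++) seen[b] = -1;
    for (int I = 0; I * r < n_rows; I += stride) {
        for (int i = I * r; i < (I + 1) * r && i < n_rows; i++) {
            for (int j = A_ptr[i]; j < A_ptr[i + 1]; j++) {
                int bc = A_col[j] / c;
                if (seen[bc] != I) { seen[bc] = I; blocks++; }  /* new block */
                nnz++;
            }
        }
    }
    free(seen);
    return nnz ? (double)blocks * r * c / (double)nnz : 1.0;
}

/* Pdense at the benchmark point whose nzd is nearest to nz. */
static double lookup_pdense(const DenseProfile *p, double nz)
{
    int best = 0;
    for (int k = 1; k < p->n_points; k++) {
        double dk = p->nzd[k] - nz, db = p->nzd[best] - nz;
        if (dk < 0) dk = -dk;
        if (db < 0) db = -db;
        if (dk < db) best = k;
    }
    return p->mflops[best];
}

/* Returns the index of the profile with the highest predicted
 * P = Pdense / fill ratio, or -1 if none could be evaluated. */
int select_block_size(int n_rows, int n_cols, const int *A_ptr,
                      const int *A_col, const DenseProfile *profiles,
                      int n_profiles, int stride, double *best_p)
{
    int best = -1;
    double nnz = (double)A_ptr[n_rows];
    *best_p = 0.0;
    for (int k = 0; k < n_profiles; k++) {
        double f = fill_ratio(n_rows, n_cols, A_ptr, A_col,
                              profiles[k].r, profiles[k].c, stride);
        if (f < 1.0) continue;            /* allocation failure */
        double nz_bcsr = f * nnz;         /* assumed definition of nzE_BCSR */
        double p = lookup_pdense(&profiles[k], nz_bcsr) / f;
        if (p > *best_p) { *best_p = p; best = k; }
    }
    return best;
}
```

Dividing by the fill ratio penalises block sizes that store many explicit zeros: a large block may run faster per stored entry, but only the true nonzeros count as useful work.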
![Page 14: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/14.jpg)
GOSpMV Performance Evaluation
Test box
CPU: Intel Pentium Dual Core E2160/1.8GHz, 2.0GB memory
GPU: AMD Radeon HD 3690 (RV670), theoretical peak: 428.8 GFLOPS (single precision)
AMD Stream SDK v1.1-beta
Ubuntu 8.04, Linux 2.6.24, gcc 4.2.3
Test matrices
8 sparse matrices of different sizes (small, medium, large)
Small (nonzeros < 100,000)
Medium (100,000 <= nonzeros < 1,000,000)
Large (nonzeros >= 1,000,000)
Taken from Matrix Market and the UF Sparse Matrix Collection
![Page 15: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/15.jpg)
GOSpMV Performance Evaluation
Test matrices
![Page 16: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/16.jpg)
GOSpMV Performance Evaluation
AMD Radeon HD 3690 Result
SpMV BCSR on GPGPU (1500 iterations)
[Figure: MFLOPS (0 to 3000) per test matrix; series: BCSR block sizes 1x1, 2x2, 3x3, 4x4 and CPU; matrices: bcsstk17.RSA, bcsstk28.RSA, epb1.rua, fidap037.rua, raefsky2.rb, raefsky3.rb, twotone.rua, venkat01.rb]
![Page 17: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/17.jpg)
GOSpMV Performance Evaluation
Different iteration counts (100, 300, 500, 1000, 1500)
![Page 18: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/18.jpg)
GOSpMV Performance Evaluation
The automatic performance tuning (1500 iterations)
The average speedup: 3.11
![Page 19: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/19.jpg)
Conclusion
GOSpMV performance speedup on AMD Radeon HD 3690
average: 3.11, max: 5.96 (1500 iterations)
GOSpMV is suited for
Medium and large matrices
Iteration count >= 300
Regular matrices (low fill ratio)
In general, GOSpMV selects the better BCSR block size through automatic performance tuning.
![Page 20: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/20.jpg)
Future Work
Double precision
Support for other BCSR block sizes (e.g. 8x8)
New hardware (AMD RV770)
Automatic performance tuning strategy
Re-ordering the matrix
![Page 21: Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences zxy@mail.rdcps.ac.cn.](https://reader031.fdocuments.net/reader031/viewer/2022032414/56649eec5503460f94bfe176/html5/thumbnails/21.jpg)
Thank you!
Q&A