SMASH: Co-Designing Software Compression
and Hardware-Accelerated Indexing
for Efficient Sparse Matrix Operations
Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula,
Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi,
Taha Shahroodi, Juan Gomez Luna, Onur Mutlu

Executive Summary
• Many important workloads heavily make use of sparse matrices
  • Matrices are extremely sparse
  • Compression is essential to avoid storage/computational overheads
• Shortcomings of existing compression formats
  • Expensive discovery of the positions of non-zero elements or
  • Narrow applicability
• SMASH: hardware/software cooperative mechanism for efficient sparse matrix storage and computation
  • Software: Efficient compression scheme using a hierarchy of bitmaps
  • Hardware: Hardware unit interprets the bitmap hierarchy and accelerates indexing
• Performance improvement: 38% and 44% for SpMV and SpMM over the widely used CSR format
• SMASH is highly efficient, low-cost and widely applicable

Presentation Outline
1. Sparse Matrix Basics
2. Compression Formats & Shortcomings
3. SMASH
4. Evaluation
5. Conclusion

Sparse Matrix Operations are Widespread Today
Recommender Systems
• Collaborative Filtering
Graph Analytics
• PageRank
• Breadth First Search
• Betweenness Centrality
Neural Networks
• Sparse DNNs
• Graph Neural Networks

Real World Matrices Have High Sparsity
• 0.0003% non-zero elements
• 2.31% non-zero elements
Sparse matrix compression is essential to enable efficient storage and computation

Limitations of Existing Compression Formats
1. General formats optimize for storage: expensive discovery of the positions of non-zero elements

Widely Used Format: Compressed Sparse Row
• Used in multiple libraries & frameworks: Intel MKL, TACO, Ligra, Polymer
• Stores only the non-zero elements of the sparse matrix
• Provides high compression ratio

Basics of CSR: Indexing Overhead
MATRIX
A =
5 0 0 0
0 0 3 4
0 8 0 0
0 0 0 3
CSR
row_ptr: 0 1 3 4 5
col_ind: 0 2 3 1 3
values:  5 3 4 8 3
• Requires multiple data-dependent memory accesses
• Multiple instructions are needed to discover the position of a non-zero element
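
The indexing overhead can be seen in a minimal Python sketch of SpMV over the CSR arrays of the 4×4 example matrix (row_ptr = 0 1 3 4 5, col_ind = 0 2 3 1 3, values = 5 3 4 8 3). The function name `csr_spmv` and the input vector `x` are illustrative assumptions and do not come from the slides.

```python
# CSR arrays of the 4x4 example matrix.
row_ptr = [0, 1, 3, 4, 5]
col_ind = [0, 2, 3, 1, 3]
values = [5, 3, 4, 8, 3]


def csr_spmv(row_ptr, col_ind, values, x):
    """Compute y = A @ x for a matrix A stored in CSR."""
    num_rows = len(row_ptr) - 1
    y = [0] * num_rows
    for row in range(num_rows):
        # Load 1: row_ptr tells us where this row's non-zeros start and end.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Load 2 depends on load 1: the column of the k-th non-zero.
            col = col_ind[k]
            # Load 3 depends on load 2: the vector is indexed through col.
            y[row] += values[k] * x[col]
    return y


x = [1, 2, 3, 4]  # hypothetical input vector
y = csr_spmv(row_ptr, col_ind, values, x)
print(y)
```

Every multiply-accumulate waits on a chain of loads (row_ptr, then col_ind, then the vector element), which is the data-dependent access pattern named above.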

Indexing Overhead in Sparse Kernels
• Sparse Matrix Vector Multiplication (SpMV): indexing for every non-zero element of the sparse matrix, to multiply it with the corresponding element of the vector.
• Sparse Matrix Matrix Multiplication (SpMM): index matching for every inner product between the two sparse matrices.
Indexing is expensive for major sparse matrix kernels
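
Index matching for one inner product can be sketched as a merge over two sorted index lists. This is a minimal illustration, assuming each sparse row or column is given as a sorted index list plus a value list; the name `sparse_dot` and the sample numbers are hypothetical.

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Inner product of two sparse vectors stored as sorted index/value lists."""
    i = j = 0
    total = 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            # Indices match: this pair contributes to the inner product.
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1  # no partner for idx_a[i]; skip it
        else:
            j += 1  # no partner for idx_b[j]; skip it
    return total


# Row of the first matrix (column indices 0, 1, 2) against a column of the
# second matrix (row indices 0, 1, 3). Only indices 0 and 1 match.
result = sparse_dot([0, 1, 2], [2.0, 3.0, 4.0], [0, 1, 3], [5.0, 6.0, 7.0])
print(result)
```

The comparisons and pointer updates are pure indexing work: only two of the loop iterations here perform a useful multiplication.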

Performing Indexing with Zero Cost
[Figure: speedup of ZERO-COST INDEXING over CSR (CSR = 1.00). SpMatAdd: 2.21, SpMV: 2.13, SpMM: 2.81.]
Reducing the cost of indexing can accelerate sparse matrix operations

Limitations of Existing Compression Formats
1. General formats optimize for storage: expensive discovery of the positions of non-zero elements
2. Specialized formats assume specific matrix structures and patterns (e.g., diagonals): narrow applicability

Our Goal
Design a sparse matrix compression mechanism that:
• Minimizes the indexing overheads
• Can be used across a wide range of sparse matrices and sparse matrix operations
• Enables high compression ratio

Presentation Outline
1. Sparse Matrix Basics
2. Compression Formats & Shortcomings
3. SMASH
  • Software Compression Scheme
  • Hardware Acceleration Unit
  • Cross-layer Interface
4. Evaluation
5. Conclusion

SMASH: Key Idea
Hardware/Software cooperative mechanism:
• Enables highly-efficient sparse matrix compression and computation
• General across a diverse set of sparse matrices and sparse matrix operations
Software: Efficient compression using a Hierarchy of Bitmaps
SMASH ISA
Hardware: Unit that scans bitmaps to accelerate indexing

SMASH: Software Compression Scheme
Software: Efficient compression using a Hierarchy of Bitmaps
• Encodes the positions of non-zero elements using bits to maintain low storage overhead

SMASH: Software Compression Scheme
MATRIX
NZ    ZEROS ZEROS ZEROS
ZEROS NZ    ZEROS ZEROS
ZEROS ZEROS ZEROS ZEROS
ZEROS ZEROS NZ    NZ
BITMAP
1 0 0 0
0 1 0 0
0 0 0 0
0 0 1 1
• Encodes if a block of the matrix contains any non-zero element
• Might contain a high number of zero bits
Idea: Apply the same encoding recursively to compress more effectively
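
A single bitmap level can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: a block is taken to be `block` consecutive elements in row-major order, a non-zero block is stored whole (including its zeros) in a non-zero values array, and the sample matrix and the name `encode_bitmap` are chosen for illustration only.

```python
def encode_bitmap(matrix, block):
    """Return (bitmap, nza): one bit per block of `block` row-major elements."""
    flat = [v for row in matrix for v in row]
    bitmap, nza = [], []
    for start in range(0, len(flat), block):
        chunk = flat[start:start + block]
        nonzero = any(v != 0 for v in chunk)
        bitmap.append(1 if nonzero else 0)
        if nonzero:
            # Assumption: the whole block is stored, zeros included.
            nza.extend(chunk)
    return bitmap, nza


A = [
    [5, 0, 0, 0],
    [0, 0, 3, 4],
    [0, 8, 0, 0],
    [0, 0, 0, 3],
]
bitmap, nza = encode_bitmap(A, block=2)
print(bitmap)  # one bit per pair of elements
print(nza)     # values of the non-zero blocks, back to back
```

For a very sparse matrix most of these bits are zero, which is what motivates applying the same encoding again on top of the bitmap.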

Hierarchy of Bitmaps
[Figure: MATRIX divided into regions, three marked NZ (Non-Zero Region) and the rest Z (Zero Region).]
Bitmap 2: 1 1 (Compression Ratio 2:1)
Bitmap 1: 1 0 0 1 (Compression Ratio 4:1)
Bitmap 0: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
NON-ZERO VALUES ARRAY: NZ NZ NZ
• Storage Efficient
• Fast Indexing
• Hardware Friendly
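
The recursive encoding can be reproduced with a short Python sketch that rebuilds Bitmap 1 and Bitmap 2 from Bitmap 0 (1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1) using the 4:1 and 2:1 compression ratios. The function name `build_hierarchy` is an illustrative assumption, and the sketch keeps every level in full rather than modeling how the levels are laid out in memory.

```python
def build_hierarchy(bitmap0, ratios):
    """Build upper bitmap levels.

    Level i+1 holds one bit per ratios[i] bits of level i; the bit is 1
    if any of the bits it covers is 1.
    """
    levels = [list(bitmap0)]
    for ratio in ratios:
        below = levels[-1]
        above = [
            1 if any(below[start:start + ratio]) else 0
            for start in range(0, len(below), ratio)
        ]
        levels.append(above)
    return levels


bitmap0 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
levels = build_hierarchy(bitmap0, ratios=[4, 2])
for number, level in enumerate(levels):
    print(f"Bitmap {number}: {level}")
```

A zero bit in an upper level means the whole group below it is zero, so a scan can skip that group without looking at the lower level.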

SMASH: Hardware Acceleration Unit
Hardware: Unit that scans bitmaps to accelerate indexing
• Interprets the Bitmap Hierarchy used in software
• Buffers the Bitmap Hierarchy and scans it using bitwise operations
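
Scanning a bitmap with bitwise operations can be illustrated in software. The sketch below is not the hardware design; it only shows one common way to find set bits without testing every position. The name `set_bit_positions` and the choice to map the first bitmap entry to bit 0 are assumptions.

```python
def set_bit_positions(word):
    """Yield the positions of the set bits in `word`, lowest position first."""
    while word:
        lowest = word & -word          # isolate the lowest set bit
        yield lowest.bit_length() - 1  # convert the isolated bit to its position
        word ^= lowest                 # clear that bit and continue


# Pack a 16-entry bitmap into one integer; entry i becomes bit i (assumption).
bits = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
word = sum(bit << position for position, bit in enumerate(bits))
positions = list(set_bit_positions(word))
print(positions)
```

The loop runs once per set bit, so long runs of zeros cost nothing beyond the single `word & -word` step that jumps over them.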

Bitmap Management Unit (BMU)
[Figure: block diagram. The CPU drives the BMU through the SMASH ISA. Labels inside the BMU: BITMAP BUFFERS (SRAM BUFFER 0, SRAM BUFFER 1, SRAM BUFFER 2), Matrix Parameters, Bitmap Parameters, Hardware Logic (SCAN, READ, UPDATE), NON-ZERO INDEX, Group 2, Group 3.]

SMASH: Cross-Layer Interface
Need for a cross-layer interface that enables software to control the BMU
SMASH ISA: MATINFO, BMAPINFO, RDBMAP, RDIND, PBMAP
• Communicate the parameters needed to calculate the index
• Query the BMU to retrieve the index of the next non-zero element
Enables SMASH to flexibly accelerate a diverse range of operations on any sparse matrix
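
The two roles of the interface (passing parameters, then querying for the next index) can be mimicked with a toy software stand-in. This sketch does not model the SMASH ISA instructions or their operands: the class `ToyBMU`, its method names, and the single-level bitmap with row-major blocks are all assumptions made for illustration.

```python
class ToyBMU:
    """Toy software stand-in for the BMU (all names here are hypothetical)."""

    def __init__(self):
        self.num_cols = 0
        self.block = 1
        self.bitmap = []
        self.cursor = 0

    def set_parameters(self, num_cols, block, bitmap):
        """Communicate the parameters needed to calculate an index."""
        self.num_cols = num_cols
        self.block = block
        self.bitmap = list(bitmap)
        self.cursor = 0

    def next_nonzero_index(self):
        """Return (row, col) of the next non-zero block, or None when done."""
        while self.cursor < len(self.bitmap):
            position = self.cursor
            self.cursor += 1
            if self.bitmap[position]:
                # A block covers `block` consecutive row-major elements.
                return divmod(position * self.block, self.num_cols)
        return None


bmu = ToyBMU()
bmu.set_parameters(num_cols=4, block=2, bitmap=[1, 0, 0, 1, 1, 0, 0, 1])
indices = []
index = bmu.next_nonzero_index()
while index is not None:
    indices.append(index)
    index = bmu.next_nonzero_index()
print(indices)
```

The caller never inspects the bitmap itself; it only sets parameters once and then asks for indices, which is the division of labor the interface is meant to provide.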

Methodology
Simulator: ZSim Simulator [1]
Workloads:
• Sparse Matrix Kernels: SpMV & SpMM from TACO [2]
• Graph Applications: PageRank & Betweenness Centrality from Ligra [3]
Input datasets: 15 diverse sparse matrices & 4 graphs from the SuiteSparse Matrix Collection [4]
[1] Sanchez et al. "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems" ISCA'13
[2] Kjolstad et al. "taco: A Tool to Generate Tensor Algebra Kernels" ASE'17
[3] Shun et al. "Ligra: A Lightweight Graph Processing Framework for Shared Memory" PPoPP'13
[4] Davis et al. "The University of Florida Sparse Matrix Collection" TOMS'11

Evaluated Sparse Matrices

Performance Improvement Using SMASH
Speedup over TACO-CSR
            SpMV  SpMM  PageRank  BC
TACO-CSR    1.00  1.00  1.00      1.00
TACO-BCSR   1.04  1.10  1.19      1.26
SMASH       1.38  1.44  1.27      1.31
SMASH provides significant performance improvements over state-of-the-art formats

Number of Executed Instructions
Normalized Number of Instructions over TACO-CSR
            SpMV  SpMM  PageRank  BC
TACO-CSR    1.00  1.00  1.00      1.00
TACO-BCSR   0.89  0.81  0.82      0.78
SMASH       0.63  0.54  0.75      0.70
SMASH significantly reduces the number of executed instructions

Sparsity Sweep for SpMV
[Figure: speedup (y-axis, 0.00 to 1.60) of TACO-CSR, TACO-BCSR, and SMASH, with sparsity increasing along the x-axis.]
SMASH provides speedups regardless of the sparsity of the matrix

Hardware Area Overhead
SMASH configuration
• Support for 4 matrices in the BMU
• 256 bytes / SRAM buffer
• 140 bytes for registers & counters
0.076% area overhead over an Intel Xeon CPU
SMASH incurs negligible area overhead

Other Results in the Paper
• Compression ratio sensitivity analysis
• Distribution of non-zero elements
• Detailed results for SpMM
• Overhead of converting from CSR to SMASH
• Software-only approaches executed on a real machine

Summary & Conclusion
• Many important workloads heavily make use of sparse matrices
  • Matrices are extremely sparse
  • Compression is essential to avoid storage/computational overheads
• Shortcomings of existing compression formats
  • Expensive discovery of the positions of non-zero elements or
  • Narrow applicability
• SMASH: hardware/software cooperative mechanism for efficient sparse matrix storage and computation
  • Software: Efficient compression scheme using a hierarchy of bitmaps
  • Hardware: Hardware unit interprets the bitmap hierarchy and accelerates indexing
• Performance improvement: 38% and 44% for SpMV and SpMM over the widely used CSR format
• SMASH is highly efficient, low-cost and widely applicable
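
SMALL BITMAPS VS ZERO COMPUTATION
[Figure: the same MATRIX, holding the non-zero values A and B, encoded with two Bitmap-0 compression ratios. At the lower ratio the non-zero values array (NZA) holds two 0's next to A and B; at the higher ratio it holds six 0's.]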
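
The trade-off can be worked through with a small Python sketch. It assumes a block is a run of consecutive elements, that every element of a non-zero block is stored in the NZA, and that the two non-zero values sit four elements apart; the row contents and the name `bitmap_cost` are hypothetical.

```python
def bitmap_cost(row, block):
    """Return (bitmap_bits, stored_zeros) when `row` is split into blocks."""
    bitmap_bits = 0
    stored_zeros = 0
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        bitmap_bits += 1
        if any(v != 0 for v in chunk):
            # Zeros inside a non-zero block are stored and later computed on.
            stored_zeros += sum(1 for v in chunk if v == 0)
    return bitmap_bits, stored_zeros


# Two non-zero values (standing in for A and B) in an 8-element row.
row = [7, 0, 0, 0, 9, 0, 0, 0]
small_blocks = bitmap_cost(row, block=2)
large_blocks = bitmap_cost(row, block=4)
print(small_blocks, large_blocks)
```

Doubling the block size halves the bitmap (4 bits to 2 bits) but triples the zeros carried in the NZA (two to six), so smaller bitmaps are paid for with extra computation on zeros.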

Sparse Matrix Matrix Multiplication with CSR
MATRIX A (one row): values A B C, col_ind 0 1 2
MATRIX B (one column): values D E F, row_ind 0 1 3
INDEX MATCHING FOR EVERY INNER PRODUCT

CSR-based SpMV
row_ptr: 0 1 3 4 5
col_ind: 0 2 3 1 3
VECTOR (element positions): 0 1 2 3 4 5
INDIRECT ACCESS TO LOAD THE VECTOR
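
Use Case: SpMV
[Figure: the BMU, holding the Matrix Parameters, the Bitmap Parameters, and bitmap rows (0 1 0 0 1 0; 0 1 0 1 0 0; 0 1 0 1 0 0), produces a ROW INDEX and a COLUMN INDEX. The non-zero blocks (NZ) of the NZA are multiplied with the matching elements of the VECTOR.]
• STREAM THROUGH THE NON-ZERO BLOCKS
• INDEX THE VECTOR USING THE BMU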
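
The two steps of this use case can be sketched in Python with a single-level bitmap. In this sketch a plain loop over the bitmap stands in for the BMU, a block is `block` consecutive row-major elements that never straddles a row, and non-zero blocks are stored back to back in `nza`; these choices, the sample data, and the name `bitmap_spmv` are assumptions.

```python
def bitmap_spmv(bitmap, nza, block, num_rows, num_cols, x):
    """Compute y = A @ x from a block bitmap and a non-zero values array."""
    y = [0] * num_rows
    next_value = 0  # start of the next non-zero block inside nza
    for position, bit in enumerate(bitmap):
        if not bit:
            continue  # zero block: nothing to load or multiply
        # Index calculation that the BMU would perform in hardware.
        row, col = divmod(position * block, num_cols)
        # Stream through the block and index the vector with (row, col).
        for offset in range(block):
            y[row] += nza[next_value + offset] * x[col + offset]
        next_value += block
    return y


# Encoding of the matrix [[5,0,0,0],[0,0,3,4],[0,8,0,0],[0,0,0,3]] with block=2.
bitmap = [1, 0, 0, 1, 1, 0, 0, 1]
nza = [5, 0, 3, 4, 0, 8, 0, 3]
x = [1, 2, 3, 4]  # hypothetical input vector
y = bitmap_spmv(bitmap, nza, block=2, num_rows=4, num_cols=4, x=x)
print(y)
```

Unlike the CSR loop, the position of each value follows from the bit position alone, so no index array has to be loaded before the vector can be accessed.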

Effect of Compression Ratio

Storage Efficiency
[Figure: plot with two labeled regions, "CSR compresses more efficiently" and "SMASH > CSR".]

Distribution of non-zero elements
[Figure: speedup (y-axis, 0.8 to 1.3) for MATRIX 13 as its non-zero elements go from HIGHLY SCATTERED to HIGHLY GATHERED.]

Storage Efficiency
[Figure: TOTAL COMPRESSION RATIO (y-axis, 1 to 10000) of SMASH and CSR, with matrices ordered by INCREASING DENSITY and a GMEAN entry.]
SMASH enables highly-efficient storage of sparse matrices