SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for...

SMASHCo-Designing Software Compression

and Hardware-Accelerated Indexing

for Efficient Sparse Matrix Operations

Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula,

Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi,

Taha Shahroodi, Juan Gomez Luna, Onur Mutlu

2

Executive Summary

• Many important workloads heavily make use of sparse matrices

• Shortcomings of existing compression formats

• Expensive discovery of the positions of non-zero elements or

• Narrow applicability

• SMASH: hardware/software cooperative mechanism for efficient

sparse matrix storage and computation

• Software: Efficient compression scheme using

a hierarchy of bitmaps

• Hardware: Hardware unit interprets the bitmap hierarchy

and accelerates indexing

• Performance improvement: 38% and 44% for SpMV and SpMM

over the widely used CSR format

• SMASH is highly efficient, low-cost and widely applicable

• Matrices are extremely sparse

• Compression is essential to avoid storage/computational overheads

3

1. Sparse Matrix Basics

4. Evaluation

5. Conclusion

Presentation Outline

2. Compression Formats & Shortcomings

3. SMASH

4

Recommender Systems Graph Analytics Neural Networks

• Collaborative Filtering

• PageRank

• Breadth First Search

• Betweenness

Centrality

• Sparse DNNs

• Graph Neural Networks

Sparse Matrix Operations are Widespread Today

5

0.0003%non-zero elements

2.31% non-zero elements

Real World Matrices Have High Sparsity

Sparse matrix compression

is essential to enable

efficient storage and computation

6


4. Evaluation

5. Conclusion



3. SMASH

7

Limitations of Existing Compression Formats

General formats

optimize for storage

Expensive discovery of

the positions

of non-zero elements

1

8

• Used in multiple libraries & frameworks:

Intel MKL, TACO, Ligra, Polymer

• Stores only the non-zero elements

of the sparse matrix

• Provides high compression ratio

Widely Used Format: Compressed Sparse Row

9

0row_ptr

col_ind

values

5 0 0 0

0 0 3 4

0 8 0 0

0 0 0 3

Α =

MATRIX CSR

Basics of CSR: Indexing Overhead

1 3 4 5

0 2 3 1 3

5 3 4 8 3

Requires multiple

data-dependent memory accesses

Multiple instructions are needed to discover

the position of a non-zero element

10

Sparse Matrix Vector Multiplication (SpMV)

Indexing Overhead in Sparse Kernels

Sparse Matrix Matrix Multiplication (SpMM)

Indexing is expensive

for major sparse matrix kernels

Indexing for every non-zero element of the sparse matrix to multiply with the corresponding element of the vector.

Index matching for every inner product between the 2 sparse matrices.

11

Performing Indexing with Zero Cost

Reducing the cost of indexing

can accelerate sparse matrix operations

2.21 2.13

2.81

0.00

0.50

1.00

1.50

2.00

2.50

3.00

SpMatAdd SpMV SpMM

Spe

ed

up

CSR ZERO-COST INDEXING

12

Limitations of Existing Compression Formats

General formats

optimize for storage

Specialized formats assume

specific matrix structures

and patterns (e.g., diagonals)

Expensive discovery of

the positions

of non-zero elements

Narrow

applicability

1

2

13

• Minimizes the indexing overheads

• Can be used across a wide range of

sparse matrices and sparse matrix operations

• Enables high compression ratio

Design a sparse matrix compression mechanism that:

Our Goal

14


4. Evaluation

5. Conclusion



3. SMASH

Software Compression Scheme

Hardware Acceleration Unit

Cross-layer Interface

15

SMASH: Key Idea

Efficient

compression

using a Hierarchy

of Bitmaps

Software

Unit that scans

bitmaps to

accelerate

indexing

Hardware

SMASH ISA

Hardware/Software cooperative mechanism:

• Enables highly-efficient sparse matrix compression and computation

• General across a diverse set of sparse matrices and sparse matrix operations

16


4. Evaluation

5. Conclusion



3. SMASH




17

• Encodes the positions of non-zero elementsusing bits to maintain low storage overhead

SMASH: Software Compression Scheme

Efficient

compression

using a Hierarchy

of Bitmaps

Software

18

1 0 0 0

0 1 0 0

0 0 0 00 0 1 1

Encodes if a block of the matrix contains

any non-zero element

1

BITMAP

0 0 0

0 1 0 0

0 0 0 00 0 1 1

Might contain high number of zero bits

SMASH: Software Compression Scheme

BITMAP:

MATRIX

NZ ZEROS ZEROS ZEROS

ZEROS NZ ZEROS ZEROS

ZEROS ZEROS ZEROS ZEROS

ZEROS ZEROS NZ NZ

Idea: Apply the same encoding recursively

to compress more effectively

19

NON-ZERO VALUES ARRAY

NZ

Z

Z Z

NZ NZ

Z Z Z

Z Z Z

Z Z

Z Z

Bitmap 01 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1

1 0 0 1

1 1

Bitmap 1

Bitmap 2

Compression Ratio 4:1

Compression Ratio 2:1 MATRIX

NZ NZ NZ

Z Zero Region

NZ Non-Zero Region

Hierarchy of Bitmaps

Storage

Efficient

Fast

Indexing

Hardware

Friendly

20


4. Evaluation

5. Conclusion



3. SMASH




21

SMASH: Hardware Acceleration Unit

Unit that scans

bitmaps to

accelerate

indexing

Hardware

• Buffers the Bitmap Hierarchy and scans it using bitwise operations

• Interprets the Bitmap Hierarchy used in software

22

SRAM BUFFER 2

SRAM BUFFER 1

SRAM BUFFER 0

Matrix Parameters

Bitmap Parameters

Hardware LogicNON

ZERO

INDEX

SCAN

READ

UPDATE

Gro

up

2

Gro

up

3

BITMAP

BUFFERS

CPU

Bitmap Management Unit (BMU)

BMU

SMASH ISA

23


4. Evaluation

5. Conclusion



3. SMASH



Cross-Layer Interface

24

MATINFOBMAPINFORDBMAP

RDIND

PBMAP

SMASH: Cross-Layer Interface

Need for a cross-layer interface that

enables software to control the BMU

• Communicate the parameters needed

to calculate the index

• Query the BMU to retrieve the index

of the next non-zero element

Enables SMASH to flexibly accelerate

a diverse range of operations

on any sparse matrix

SMASH ISA

25


4. Evaluation

5. Conclusion



3. SMASH



Cross-Layer Interface

26

Simulator: ZSim Simulator [1]

Workloads:• Sparse Matrix Kernels

SpMV & SpMM from TACO [2]

• Graph Applications

PageRank & Betweenness Centrality from Ligra [3]

Input datasets:15 diverse sparse matrices & 4 graphs

from the Sparse Suite Collection [4]

Methodology

[1] Sanchez et al. “ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems” ISCA‘13[2] Kjolstad et al. “ taco: A Tool to Generate Tensor Algebra Kernels“ ASE’17[3] Shun et al. “Ligra: A Lightweight Graph Processing Framework for Shared Memory” PPoPP’13[4] Davis et al. “The University of Florida Sparse Matrix Collection” TOMS’11

27

Evaluated Sparse Matrices

28

Performance Improvement Using SMASH

SMASH provides significant performance

improvements over state-of-the-art formats

1.00 1.00 1.00 1.001.041.10

1.191.26

1.38 1.44

1.27 1.31

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1.40

1.60

SpMV SpMM PageRank BC

Spe

ed

up

ove

r TA

CO

-CSR

TACO-CSR TACO-BCSR SMASH

29

Number of Executed Instructions

SMASH significantly reduces

the number of executed instructions

1.00 1.00 1.00 1.00

0.890.81 0.82

0.78

0.630.54

0.75 0.70

0.00

0.20

0.40

0.60

0.80

1.00

1.20

SpMV SpMM PageRank BC

No

rmal

ize

d N

um

be

r o

f In

stru

ctio

ns

ove

r TA

CO

-CSR


30

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1.40

1.60

Spe

ed

up

INCREASING SPARSITY→


Sparsity Sweep for SpMV

SMASH provides speedups regardless

of the sparsity of the matrix

31

SMASH configuration

• Support for 4 matrices in the BMU

• 256 bytes / SRAM buffer

• 140 bytes for registers & counters

0.076% area overhead over an Intel Xeon CPU

SMASH incurs negligible area overhead

Hardware Area Overhead

32

• Compression ratio sensitivity analysis

• Distribution of non-zero elements

• Detailed results for SpMM

• Conversion from CSR to SMASH overhead

• Software-only approaches executed

in a real machine

Other Results in the Paper

33


4. Evaluation

5. Conclusion



3. SMASH




34

Summary & Conclusion

• Many important workloads heavily make use of sparse matrices

• Shortcomings of existing compression formats

• Expensive discovery of the positions of non-zero elements or

• Narrow applicability

• SMASH: hardware/software cooperative mechanism for efficient

sparse matrix storage and computation

• Software: Efficient compression scheme using

a hierarchy of bitmaps

• Hardware: Hardware unit interprets the bitmap hierarchy

and accelerates indexing

• Performance improvement: 38% and 44% for SpMV and SpMM

over the widely used CSR format

• SMASH is highly efficient, low-cost and widely applicable

• Matrices are extremely sparse

• Compression is essential to avoid storage/computational overheads

SMASHCo-Designing Software Compression

and Hardware-Accelerated Indexing

for Efficient Sparse Matrix Operations

Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula,

Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi,

Taha Shahroodi, Juan Gomez Luna, Onur Mutlu

36

SMALL BITMAPS VS ZERO COMPUTATION

MATRIX MATRIX

1

Bit

ma

p-0

Bit

ma

p-0

NZANZA

1 A B A B

A BA B

Two 0’sSix 0’s

Compression Ratio

37

MATRIX A MATRIX BΑ DΒ

E

F

C 0 1 2

Α Β C D E F

0 1 3

INDEX MATCHINGFOR EVERY INNER PRODUCT

col_ind row_ind

Sparse Matrix Matrix Multiplication with CSR

38

CSR-based SpMV

VECTOR012345

INDIRECT ACCESS TO LOAD THE VECTOR

0 1 3 4 5row_ptr

0 2 3 1 3col_ind

CSR-based SpMV

39

NZ

NZAVECTOR

BMU

0 1 0 0 1 0

0 1 0 1 0 0

0 1 0 1 0 0

Matrix Parameters

Bitmap Parameters

COLUMNINDEX

ROWINDEX

NZ NZ

NZ NZ NZ

NZ NZ NZ

NZ NZ NZ

STREAM THROUGH THE NON-ZERO BLOCKS

INDEX THE VECTOR USING THE BMU

Use Case: SpMV

40

Effect of Compression Ratio

41

CSR compresses more efficiently SMASH > CSR

Storage Efficiency

42

0.8

0.9

1

1.1

1.2

1.3

Spe

ed

up

HIGHLY SCATTERED → HIGHLY GATHERED

MATRIX 13

Distribution of non-zero elements

43

SMASH enables highly-efficient storage of sparse matrices

1

10

100

1000

10000

GMEAN

TOTA

L C

OM

PR

ESSI

ON

RA

TIO

INCREASING DENSITY →

SMASH CSR

Storage Efficiency

SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for...

Documents

Transcript of SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for...