SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for...

43
SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula, Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi, Taha Shahroodi, Juan Gomez Luna, Onur Mutlu

Transcript of SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for...

Page 1: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

SMASHCo-Designing Software Compression

and Hardware-Accelerated Indexing

for Efficient Sparse Matrix Operations

Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula,

Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi,

Taha Shahroodi, Juan Gomez Luna, Onur Mutlu

Page 2: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

2

Executive Summary

• Many important workloads heavily make use of sparse matrices

• Shortcomings of existing compression formats

• Expensive discovery of the positions of non-zero elements or

• Narrow applicability

• SMASH: hardware/software cooperative mechanism for efficient

sparse matrix storage and computation

• Software: Efficient compression scheme using

a hierarchy of bitmaps

• Hardware: Hardware unit interprets the bitmap hierarchy

and accelerates indexing

• Performance improvement: 38% and 44% for SpMV and SpMM

over the widely used CSR format

• SMASH is highly efficient, low-cost and widely applicable

• Matrices are extremely sparse

• Compression is essential to avoid storage/computational overheads

Page 3: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

3

1. Sparse Matrix Basics

4. Evaluation

5. Conclusion

Presentation Outline

2. Compression Formats & Shortcomings

3. SMASH

Page 4: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

4

Recommender Systems Graph Analytics Neural Networks

• Collaborative Filtering

• PageRank

• Breadth First Search

• Betweenness

Centrality

• Sparse DNNs

• Graph Neural Networks

Sparse Matrix Operations are Widespread Today

Page 5: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

5

0.0003%non-zero elements

2.31% non-zero elements

Real World Matrices Have High Sparsity

Sparse matrix compression

is essential to enable

efficient storage and computation

Page 6: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

6

1. Sparse Matrix Basics

4. Evaluation

5. Conclusion

Presentation Outline

2. Compression Formats & Shortcomings

3. SMASH

Page 7: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

7

Limitations of Existing Compression Formats

General formats

optimize for storage

Expensive discovery of

the positions

of non-zero elements

1

Page 8: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

8

• Used in multiple libraries & frameworks:

Intel MKL, TACO, Ligra, Polymer

• Stores only the non-zero elements

of the sparse matrix

• Provides high compression ratio

Widely Used Format: Compressed Sparse Row

Page 9: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

9

0row_ptr

col_ind

values

5 0 0 0

0 0 3 4

0 8 0 0

0 0 0 3

Α =

MATRIX CSR

Basics of CSR: Indexing Overhead

1 3 4 5

0 2 3 1 3

5 3 4 8 3

Requires multiple

data-dependent memory accesses

Multiple instructions are needed to discover

the position of a non-zero element

Page 10: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

10

Sparse Matrix Vector Multiplication (SpMV)

Indexing Overhead in Sparse Kernels

Sparse Matrix Matrix Multiplication (SpMM)

Indexing is expensive

for major sparse matrix kernels

Indexing for every non-zero element of the sparse matrix to multiply with the corresponding element of the vector.

Index matching for every inner product between the 2 sparse matrices.

Page 11: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

11

Performing Indexing with Zero Cost

Reducing the cost of indexing

can accelerate sparse matrix operations

2.21 2.13

2.81

0.00

0.50

1.00

1.50

2.00

2.50

3.00

SpMatAdd SpMV SpMM

Spe

ed

up

CSR ZERO-COST INDEXING

Page 12: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

12

Limitations of Existing Compression Formats

General formats

optimize for storage

Specialized formats assume

specific matrix structures

and patterns (e.g., diagonals)

Expensive discovery of

the positions

of non-zero elements

Narrow

applicability

1

2

Page 13: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

13

• Minimizes the indexing overheads

• Can be used across a wide range of

sparse matrices and sparse matrix operations

• Enables high compression ratio

Design a sparse matrix compression mechanism that:

Our Goal

Page 14: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

14

1. Sparse Matrix Basics

4. Evaluation

5. Conclusion

Presentation Outline

2. Compression Formats & Shortcomings

3. SMASH

Software Compression Scheme

Hardware Acceleration Unit

Cross-layer Interface

Page 15: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

15

SMASH: Key Idea

Efficient

compression

using a Hierarchy

of Bitmaps

Software

Unit that scans

bitmaps to

accelerate

indexing

Hardware

SMASH ISA

Hardware/Software cooperative mechanism:

• Enables highly-efficient sparse matrix compression and computation

• General across a diverse set of sparse matrices and sparse matrix operations

Page 16: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

16

1. Sparse Matrix Basics

4. Evaluation

5. Conclusion

Presentation Outline

2. Compression Formats & Shortcomings

3. SMASH

Software Compression Scheme

Hardware Acceleration Unit

Cross-layer Interface

Page 17: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

17

• Encodes the positions of non-zero elementsusing bits to maintain low storage overhead

SMASH: Software Compression Scheme

Efficient

compression

using a Hierarchy

of Bitmaps

Software

Page 18: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

18

1 0 0 0

0 1 0 0

0 0 0 00 0 1 1

Encodes if a block of the matrix contains

any non-zero element

1

BITMAP

0 0 0

0 1 0 0

0 0 0 00 0 1 1

Might contain high number of zero bits

SMASH: Software Compression Scheme

BITMAP:

MATRIX

NZ ZEROS ZEROS ZEROS

ZEROS NZ ZEROS ZEROS

ZEROS ZEROS ZEROS ZEROS

ZEROS ZEROS NZ NZ

Idea: Apply the same encoding recursively

to compress more effectively

Page 19: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

19

NON-ZERO VALUES ARRAY

NZ

Z

Z Z

NZ NZ

Z Z Z

Z Z Z

Z Z

Z Z

Bitmap 01 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1

1 0 0 1

1 1

Bitmap 1

Bitmap 2

Compression Ratio 4:1

Compression Ratio 2:1 MATRIX

NZ NZ NZ

Z Zero Region

NZ Non-Zero Region

Hierarchy of Bitmaps

Storage

Efficient

Fast

Indexing

Hardware

Friendly

Page 20: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

20

1. Sparse Matrix Basics

4. Evaluation

5. Conclusion

Presentation Outline

2. Compression Formats & Shortcomings

3. SMASH

Software Compression Scheme

Hardware Acceleration Unit

Cross-layer Interface

Page 21: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

21

SMASH: Hardware Acceleration Unit

Unit that scans

bitmaps to

accelerate

indexing

Hardware

• Buffers the Bitmap Hierarchy and scans it using bitwise operations

• Interprets the Bitmap Hierarchy used in software

Page 22: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

22

SRAM BUFFER 2

SRAM BUFFER 1

SRAM BUFFER 0

Matrix Parameters

Bitmap Parameters

Hardware LogicNON

ZERO

INDEX

SCAN

READ

UPDATE

Gro

up

2

Gro

up

3

BITMAP

BUFFERS

CPU

Bitmap Management Unit (BMU)

BMU

SMASH ISA

Page 23: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

23

1. Sparse Matrix Basics

4. Evaluation

5. Conclusion

Presentation Outline

2. Compression Formats & Shortcomings

3. SMASH

Software Compression Scheme

Hardware Acceleration Unit

Cross-Layer Interface

Page 24: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

24

MATINFOBMAPINFORDBMAP

RDIND

PBMAP

SMASH: Cross-Layer Interface

Need for a cross-layer interface that

enables software to control the BMU

• Communicate the parameters needed

to calculate the index

• Query the BMU to retrieve the index

of the next non-zero element

Enables SMASH to flexibly accelerate

a diverse range of operations

on any sparse matrix

SMASH ISA

Page 25: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

25

1. Sparse Matrix Basics

4. Evaluation

5. Conclusion

Presentation Outline

2. Compression Formats & Shortcomings

3. SMASH

Software Compression Scheme

Hardware Acceleration Unit

Cross-Layer Interface

Page 26: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

26

Simulator: ZSim Simulator [1]

Workloads:• Sparse Matrix Kernels

SpMV & SpMM from TACO [2]

• Graph Applications

PageRank & Betweenness Centrality from Ligra [3]

Input datasets:15 diverse sparse matrices & 4 graphs

from the Sparse Suite Collection [4]

Methodology

[1] Sanchez et al. “ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems” ISCA‘13[2] Kjolstad et al. “ taco: A Tool to Generate Tensor Algebra Kernels“ ASE’17[3] Shun et al. “Ligra: A Lightweight Graph Processing Framework for Shared Memory” PPoPP’13[4] Davis et al. “The University of Florida Sparse Matrix Collection” TOMS’11

Page 27: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

27

Evaluated Sparse Matrices

Page 28: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

28

Performance Improvement Using SMASH

SMASH provides significant performance

improvements over state-of-the-art formats

1.00 1.00 1.00 1.001.041.10

1.191.26

1.38 1.44

1.27 1.31

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1.40

1.60

SpMV SpMM PageRank BC

Spe

ed

up

ove

r TA

CO

-CSR

TACO-CSR TACO-BCSR SMASH

Page 29: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

29

Number of Executed Instructions

SMASH significantly reduces

the number of executed instructions

1.00 1.00 1.00 1.00

0.890.81 0.82

0.78

0.630.54

0.75 0.70

0.00

0.20

0.40

0.60

0.80

1.00

1.20

SpMV SpMM PageRank BC

No

rmal

ize

d N

um

be

r o

f In

stru

ctio

ns

ove

r TA

CO

-CSR

TACO-CSR TACO-BCSR SMASH

Page 30: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

30

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1.40

1.60

Spe

ed

up

INCREASING SPARSITY→

TACO-CSR TACO-BCSR SMASH

Sparsity Sweep for SpMV

SMASH provides speedups regardless

of the sparsity of the matrix

Page 31: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

31

SMASH configuration

• Support for 4 matrices in the BMU

• 256 bytes / SRAM buffer

• 140 bytes for registers & counters

0.076% area overhead over an Intel Xeon CPU

SMASH incurs negligible area overhead

Hardware Area Overhead

Page 32: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

32

• Compression ratio sensitivity analysis

• Distribution of non-zero elements

• Detailed results for SpMM

• Conversion from CSR to SMASH overhead

• Software-only approaches executed

in a real machine

Other Results in the Paper

Page 33: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

33

1. Sparse Matrix Basics

4. Evaluation

5. Conclusion

Presentation Outline

2. Compression Formats & Shortcomings

3. SMASH

Software Compression Scheme

Hardware Acceleration Unit

Cross-layer Interface

Page 34: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

34

Summary & Conclusion

• Many important workloads heavily make use of sparse matrices

• Shortcomings of existing compression formats

• Expensive discovery of the positions of non-zero elements or

• Narrow applicability

• SMASH: hardware/software cooperative mechanism for efficient

sparse matrix storage and computation

• Software: Efficient compression scheme using

a hierarchy of bitmaps

• Hardware: Hardware unit interprets the bitmap hierarchy

and accelerates indexing

• Performance improvement: 38% and 44% for SpMV and SpMM

over the widely used CSR format

• SMASH is highly efficient, low-cost and widely applicable

• Matrices are extremely sparse

• Compression is essential to avoid storage/computational overheads

Page 35: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

SMASHCo-Designing Software Compression

and Hardware-Accelerated Indexing

for Efficient Sparse Matrix Operations

Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula,

Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi,

Taha Shahroodi, Juan Gomez Luna, Onur Mutlu

Page 36: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

36

SMALL BITMAPS VS ZERO COMPUTATION

MATRIX MATRIX

1

Bit

ma

p-0

Bit

ma

p-0

NZANZA

1 A B A B

A BA B

Two 0’sSix 0’s

Compression Ratio

Page 37: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

37

MATRIX A MATRIX BΑ DΒ

E

F

C 0 1 2

Α Β C D E F

0 1 3

INDEX MATCHINGFOR EVERY INNER PRODUCT

col_ind row_ind

Sparse Matrix Matrix Multiplication with CSR

Page 38: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

38

CSR-based SpMV

VECTOR012345

INDIRECT ACCESS TO LOAD THE VECTOR

0 1 3 4 5row_ptr

0 2 3 1 3col_ind

CSR-based SpMV

Page 39: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

39

NZ

NZAVECTOR

BMU

0 1 0 0 1 0

0 1 0 1 0 0

0 1 0 1 0 0

Matrix Parameters

Bitmap Parameters

COLUMNINDEX

ROWINDEX

NZ NZ

NZ NZ NZ

NZ NZ NZ

NZ NZ NZ

STREAM THROUGH THE NON-ZERO BLOCKS

INDEX THE VECTOR USING THE BMU

Use Case: SpMV

Page 40: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

40

Effect of Compression Ratio

Page 41: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

41

CSR compresses more efficiently SMASH > CSR

Storage Efficiency

Page 42: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

42

0.8

0.9

1

1.1

1.2

1.3

Spe

ed

up

HIGHLY SCATTERED → HIGHLY GATHERED

MATRIX 13

Distribution of non-zero elements

Page 43: SMASH - ETH Z...SMASH Co-Designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina

43

SMASH enables highly-efficient storage of sparse matrices

1

10

100

1000

10000

GMEAN

TOTA

L C

OM

PR

ESSI

ON

RA

TIO

INCREASING DENSITY →

SMASH CSR

Storage Efficiency