
UNIVERSITY OF COPENHAGEN · FACULTY OF SCIENCE

PhD Thesis
Weifeng Liu

Parallel and Scalable Sparse Basic Linear Algebra Subprograms

Academic advisor: Brian Vinter

Submitted: 26/10/2015


To Huamin and Renheng

for enduring my absence while working on the PhD and the thesis

To my parents and parents-in-law

for always helping us to get through hard times


Acknowledgments

I am immensely grateful to my supervisor, Professor Brian Vinter, for his support, feedback and advice throughout my PhD years. This thesis would not have been possible without his encouragement and guidance. I would also like to thank Professor Anders Logg for giving me the opportunity to work with him at the Department of Mathematical Sciences of Chalmers University of Technology and University of Gothenburg.

I would like to thank my co-workers in the eScience Center who supported me in every respect during the completion of this thesis: James Avery, Jonas Bardino, Troels Blum, Klaus Birkelund Jensen, Mads Ruben Burgdorff Kristensen, Simon Andreas Frimann Lund, Martin Rehr, Kenneth Skovhede, and Yan Wang.

Special thanks go to my PhD buddy James Avery, and to Huamin Ren (Aalborg University) and Wenliang Wang (Bank of China), for their insightful feedback on my manuscripts before peer review.

I would like to thank Jianbin Fang (Delft University of Technology and National University of Defense Technology), Joseph L. Greathouse (AMD), Shuai Che (AMD), Ruipeng Li (University of Minnesota) and Anders Logg for their insights on my papers before publication. I thank Jianbin Fang, Klaus Birkelund Jensen, Hans Henrik Happe (University of Copenhagen) and Rune Kildetoft (University of Copenhagen) for access to the Intel Xeon and Xeon Phi machines. I thank Bo Shen (Inspur Group Co., Ltd.) for helpful discussions about Xeon Phi programming. I also thank Arash Ashari (The Ohio State University), the Intel MKL team, Joseph L. Greathouse, Mayank Daga (AMD) and Felix Gremse (RWTH Aachen University) for sharing source code, libraries or implementation details of their SpMV and SpGEMM algorithms with me.

I would also like to thank the anonymous reviewers of the conferences SPAA '13, PPoPP '13, IPDPS '14, PPoPP '15 and ICS '15, of the workshops GPGPU-7 and PMAA '14, and of the journals Journal of Parallel and Distributed Computing and Parallel Computing for their invaluable suggestions and comments on my manuscripts.

This research project was supported by the Danish National Advanced TechnologyFoundation (grant number J.nr. 004-2011-3).


Abstract

Sparse basic linear algebra subprograms (BLAS) are fundamental building blocks for numerous scientific computations and graph applications. Compared with Dense BLAS, parallelization of Sparse BLAS routines entails extra challenges due to the irregularity of sparse data structures. This thesis proposes new fundamental algorithms and data structures that accelerate Sparse BLAS routines on modern massively parallel processors: (1) a new heap data structure named ad-heap, for faster heap operations on heterogeneous processors; (2) a new sparse matrix representation named CSR5, for faster sparse matrix-vector multiplication (SpMV) on homogeneous processors such as CPUs, GPUs and Xeon Phi; (3) a new CSR-based SpMV algorithm for a variety of tightly coupled CPU-GPU heterogeneous processors; and (4) a new framework and associated algorithms for sparse matrix-matrix multiplication (SpGEMM) on GPUs and heterogeneous processors.

The thesis compares the proposed methods with state-of-the-art approaches on six homogeneous and five heterogeneous processors from Intel, AMD and nVidia. On a benchmark suite of 38 sparse matrices, the experimental results show that the proposed methods obtain significant performance improvements over the best existing algorithms.
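As a concrete point of reference for the routines above, the sketch below shows a textbook serial SpMV kernel y = Ax over the standard compressed sparse row (CSR) format that CSR5 and the CSR-based algorithm build upon. It is a minimal illustration, with the array names and the tiny 3-by-3 matrix chosen purely for exposition rather than taken from the thesis implementations; the indirect access x[col_idx[j]] and the uneven row lengths are precisely the irregularity that makes parallelizing Sparse BLAS hard.

#include <stdio.h>

/* CSR stores an m-by-n matrix with nnz nonzeros in three arrays:
 *   row_ptr[m+1] : where each row starts in col_idx/val,
 *   col_idx[nnz] : column index of each nonzero,
 *   val[nnz]     : value of each nonzero. */
static void spmv_csr(int m, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        /* Rows are independent, which is the natural source of
         * parallelism; row lengths, however, can differ wildly. */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];  /* indirect, irregular access */
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example with 4 nonzeros:
     *   [ 1 0 2 ]
     *   [ 0 3 0 ]
     *   [ 0 0 4 ] */
    int    row_ptr[] = {0, 2, 3, 4};
    int    col_idx[] = {0, 2, 1, 2};
    double val[]     = {1.0, 2.0, 3.0, 4.0};
    double x[]       = {1.0, 1.0, 1.0};
    double y[3];

    spmv_csr(3, row_ptr, col_idx, val, x, y);
    for (int i = 0; i < 3; i++)
        printf("y[%d] = %g\n", i, y[i]);    /* prints 3, 3 and 4 */
    return 0;
}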


Resumé

The sparse linear algebra library known as BLAS is a fundamental building block in many scientific and graph applications. Compared with ordinary BLAS operations, parallelizing sparse BLAS is especially challenging owing to the lack of structure in the matrices' data. This thesis presents fundamentally new data structures and algorithms for faster sparse BLAS execution on modern, massively parallel processors: (1) a new heap-based data structure, called ad-heap, for faster heap operations on heterogeneous processors, (2) a new representation for sparse data, called CSR5, which enables faster matrix-vector multiplication (SpMV) on homogeneous processors such as CPUs, GPGPUs and Xeon Phi, (3) a new CSR-based SpMV algorithm for a range of tightly coupled CPU-GPGPU heterogeneous processors, and (4) a new framework for sparse matrix-matrix multiplication (SpGEMM) on GPGPUs and heterogeneous processors.

The thesis compares the presented solutions with state-of-the-art solutions on six homogeneous and five heterogeneous processors from Intel, AMD, and nVidia. Based on 38 sparse matrices as benchmarks, the new methods presented in this thesis are shown to give significant improvements over the known solutions.


Contents

1 Introduction
    1.1 Organization of the Thesis
    1.2 Author's Publications

I Foundations

2 Sparsity and Sparse BLAS
    2.1 What Is Sparsity?
        2.1.1 A Simple Example
        2.1.2 Sparse Matrices
    2.2 Where Are Sparse Matrices From?
        2.2.1 Finite Element Methods
        2.2.2 Social Networks
        2.2.3 Sparse Representation of Signals
    2.3 What Are Sparse BLAS?
        2.3.1 Level 1: Sparse Vector Operations
        2.3.2 Level 2: Sparse Matrix-Vector Operations
        2.3.3 Level 3: Sparse Matrix-Matrix Operations
    2.4 Where Does Parallelism Come From?
        2.4.1 Fork: From Matrix to Submatrix
        2.4.2 Join: From Subresult to Result
    2.5 Challenges of Parallel and Scalable Sparse BLAS
        2.5.1 Indirect Addressing
        2.5.2 Selection of Basic Primitives
        2.5.3 Data Decomposition, Load Balancing and Scalability
        2.5.4 Sparse Output of Unknown Size

3 Parallelism in Architectures
    3.1 Overview
    3.2 Multicore CPU
    3.3 Manycore GPU
    3.4 Manycore Coprocessor and CPU
    3.5 Tightly Coupled CPU-GPU Heterogeneous Processor

4 Parallelism in Data Structures and Algorithms
    4.1 Overview
    4.2 Contributions
    4.3 Simple Data-Level Parallelism
        4.3.1 Vector Addition
        4.3.2 Reduction
    4.4 Scan (Prefix Sum)
        4.4.1 An Example Case: Eliminating Unused Entries
        4.4.2 All-Scan
        4.4.3 Segmented Scan
        4.4.4 Segmented Sum Using Inclusive Scan
    4.5 Sorting and Merging
        4.5.1 An Example Case: Permutation
        4.5.2 Bitonic Sort
        4.5.3 Odd-Even Sort
        4.5.4 Ranking Merge
        4.5.5 Merge Path
        4.5.6 A Comparison
    4.6 Ad-Heap and k-Selection
        4.6.1 Implicit d-heaps
        4.6.2 ad-heap design
        4.6.3 Performance Evaluation
        4.6.4 Comparison to Previous Work

II Sparse Matrix Representation

5 Existing Storage Formats
    5.1 Overview
    5.2 Basic Storage Formats
        5.2.1 Diagonal (DIA)
        5.2.2 ELLPACK (ELL)
        5.2.3 Coordinate (COO)
        5.2.4 Compressed Sparse Row (CSR)