CUDA Linear Algebra Library and Next Generation
Yukai Hung (a0934147@gmail.com)
Department of Mathematics, National Taiwan University
Sparse Matrix-Vector Multiplication
Dense approach is wasteful
- unclear how to map the work to parallel processors
- irregular element accesses to global memory
Sparse matrix storage formats, ordered from structured to unstructured:
DIA - diagonal format
ELL - ELLPACK format
CSR - compressed sparse row format
HYB - hybrid format
COO - coordinate format
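As a concrete reference for the formats above, here is a minimal CPU-side sketch of the CSR layout and a serial SpMV loop (the struct and function names are illustrative, not any library's API):

```cpp
#include <vector>

// Compressed sparse row (CSR) storage: nonzero values, their column
// indices, and per-row offsets into those arrays.
struct CsrMatrix {
    int rows;
    std::vector<int>   row_offsets; // size rows + 1
    std::vector<int>   col_indices; // size nnz
    std::vector<float> values;      // size nnz
};

// Build the CSR form of the 3x3 matrix
//   [ 1 0 2 ]
//   [ 0 3 0 ]
//   [ 4 0 5 ]
inline CsrMatrix example_csr() {
    CsrMatrix A;
    A.rows        = 3;
    A.row_offsets = {0, 2, 3, 5}; // row i spans [offsets[i], offsets[i+1])
    A.col_indices = {0, 2, 1, 0, 2};
    A.values      = {1, 2, 3, 4, 5};
    return A;
}

// Reference (serial) SpMV y = A*x; on the GPU each row would be
// handled by one thread (CSR scalar) or one warp (CSR vector).
inline std::vector<float> spmv_csr(const CsrMatrix& A,
                                   const std::vector<float>& x) {
    std::vector<float> y(A.rows, 0.0f);
    for (int i = 0; i < A.rows; ++i)
        for (int j = A.row_offsets[i]; j < A.row_offsets[i + 1]; ++j)
            y[i] += A.values[j] * x[A.col_indices[j]];
    return y;
}
```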
Diagonal format
- diagonals should be mostly populated
- high parallelism: one thread maps to one row
- good parallel efficiency and good memory behavior (global memory coalescing)
Ellpack format
- again assigns one thread to compute one row
- but load imbalance hurts parallel efficiency
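A serial sketch of the ELL layout shows both the padding and the source of the load imbalance (struct and names are illustrative):

```cpp
#include <vector>

// ELLPACK (ELL) storage sketch: every row is padded to the same
// number of entries K, stored column-major so that one thread per
// row reads with coalescing.
struct EllMatrix {
    int rows, max_per_row;          // max_per_row = K
    std::vector<int>   col_indices; // rows * K, -1 marks padding
    std::vector<float> values;      // rows * K
};

// Reference SpMV y = A*x with one (virtual) thread per row; rows
// with fewer than K nonzeros still iterate K times, which is the
// load imbalance mentioned above.
inline std::vector<float> spmv_ell(const EllMatrix& A,
                                   const std::vector<float>& x) {
    std::vector<float> y(A.rows, 0.0f);
    for (int row = 0; row < A.rows; ++row) {
        for (int k = 0; k < A.max_per_row; ++k) {
            int idx = k * A.rows + row; // column-major element
            int col = A.col_indices[idx];
            if (col >= 0)
                y[row] += A.values[idx] * x[col];
        }
    }
    return y;
}
```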
Coordinate format
- insensitive to the sparsity pattern, but slower than ELLPACK
- assigns one thread to one element and combines the results from all elements in a row to produce the output element
Hybrid format
- combines the regular ELLPACK format (for the typical entries) with the flexible COO format (for the exceptional entries)
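The typical/exceptional split can be sketched as follows: the first K entries of each row go into the padded ELL part and any overflow spills into COO (K and the helper below are illustrative, not the library's actual heuristic):

```cpp
#include <algorithm>
#include <vector>

// Count how a hybrid (HYB) matrix would partition its entries
// between the regular ELL part and the exceptional COO part.
struct HybSplit {
    long ell_entries; // stored in the padded ELL part
    long coo_entries; // exceptional entries spilled to COO
};

inline HybSplit split_hyb(const std::vector<int>& nnz_per_row, int K) {
    HybSplit s{0, 0};
    for (int nnz : nnz_per_row) {
        s.ell_entries += std::min(nnz, K);   // typical entries
        s.coo_entries += std::max(nnz - K, 0); // exceptional entries
    }
    return s;
}
```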
Property comparison

Format        Granularity       Coalescing
DIA           thread / row      full
ELL           thread / row      full
CSR (scalar)  thread / row      rare
CSR (vector)  warp / row        partial
COO           thread / nonzero  full
HYB           thread / row      full

(fixed number of nonzeros and variable matrix size)
Sparse matrices for parallel efficiency: ELLPACK format
- one thread per row is efficient for memory access

Sparse matrices for load imbalance: coordinate format
- one thread per element is insensitive to the matrix structure

Conclusion for all structures
- the hybrid structure gives the best performance on average
- irregularity is manageable if the common case is regularized
Performance comparison (charts not reproduced in this transcript).
Linear Algebra Library
CUBLAS: CUDA Basic Linear Algebra Subroutines
- implements the basic linear algebra subroutines at the runtime level
- only available for a single device; not implemented for multiple devices
CUFFT: CUDA Fast Fourier Transform Library
- uses a divide-and-conquer algorithm for the discrete transform
- supports real and complex data, in-place or out-of-place
- supports stream operations for simultaneous execution
- use complex-to-complex transforms in place of real-to-complex
- problem sizes that are powers of two give the best performance
CUDPP: CUDA Data Parallel Primitives Library
- a library of data-parallel algorithm primitives
- parallel prefix-sum, sorting, and data reduction
- stream compaction and random number generation
CUDPP: CUDA Data Parallel Primitives Library
- comparison with a multicore CPU (chart not reproduced)
CULA: GPU-Accelerated Linear Algebra Library
- implements LAPACK functions with interfaces for various languages
- linear system solvers
- least squares solvers
- orthogonal factorizations
- symmetric eigenproblems
- non-symmetric eigenproblems
- singular value decompositions
CULA performance charts (not reproduced): double-precision LU factorization, QR factorization, symmetric eigenvalue problem, and singular value decomposition.
MAGMA: Matrix Algebra on GPU and Multicore Architectures
- an open-source project to develop a dense linear algebra library, similar to the basic linear algebra packages, but for heterogeneous and hybrid architectures with manycore CPU and GPU systems
MAGMA performance charts (not reproduced): double-precision matrix-matrix multiplication, single-precision QR factorization, solving Ax=b by LU factorization, and single-precision Cholesky factorization.
Thrust
- a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL), providing a flexible high-level interface that greatly enhances productivity
#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

int main(int argc, char** argv)
{
    // allocate memory space on the host
    thrust::host_vector<float> hvec(1024);

    // generate random numbers on the host
    thrust::generate(hvec.begin(), hvec.end(), rand);

    // allocate device memory and transfer the data to the device
    thrust::device_vector<float> dvec = hvec;

    // manipulate device values from the host
    dvec[0] = (float)rand() / (float)(RAND_MAX - 1);
    dvec[1] = (float)rand() / (float)(RAND_MAX - 1);

    // sum all data on the device by parallel reduction
    float sum = thrust::reduce(dvec.begin(), dvec.end());

    // sort all data on the device by radix sort
    thrust::sort(dvec.begin(), dvec.end());

    // transfer the final data back to the host
    thrust::copy(dvec.begin(), dvec.end(), hvec.begin());

    return 0;
}
#include <list>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

int main(int argc, char** argv)
{
    // create a list container on the host
    std::list<int> hlist;
    hlist.push_back(13);
    hlist.push_back(27);

    // copy host data from the list into a device vector
    thrust::device_vector<int> dvec(hlist.size());
    thrust::copy(hlist.begin(), hlist.end(), dvec.begin());

    // alternative method to convert from host to device
    // thrust::device_vector<int> dvec(hlist.begin(), hlist.end());

    // obtain a raw pointer to the device memory
    int* dpointer = thrust::raw_pointer_cast(dvec.data());

    // launch a device kernel function (kernel, blocknum, and
    // blocksize are assumed to be defined elsewhere)
    kernel<<<blocknum, blocksize>>>(dpointer, dvec.size());

    // the device memory is freed automatically when dvec goes out
    // of scope; do not call cudaFree on the raw pointer
    return 0;
}
CUSP: Generic Parallel Algorithms for Sparse Matrix Computations
- cusp provides a high-level, flexible interface for manipulating sparse matrices and solving sparse linear systems by iterative methods
- cusp is implemented on top of the Thrust template interface

"Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors", Nathan Bell and Michael Garland, Supercomputing 2009.
Matrix formats
- cusp natively supports several sparse matrix formats
- cusp makes it easy to transfer sparse matrix data between host and device, and to convert between sparse matrix formats

// allocate storage for a CSR matrix on the host
// with 5 rows, 8 columns, and 12 nonzero elements
cusp::csr_matrix<int,float,cusp::host_memory> A(5,8,12);

// allocate device memory and transfer from host to device
cusp::csr_matrix<int,float,cusp::device_memory> B = A;

// convert the CSR matrix format to the HYB matrix format
cusp::hyb_matrix<int,float,cusp::device_memory> C = A;
Algorithms and iterative solvers
- matrix-vector multiplication and transpose
- conjugate gradient and biconjugate gradient stabilized methods

// matrix-vector multiplication
cusp::multiply(A, x, y);

// sparse matrix transpose
cusp::transpose(A, At);

// conjugate gradient
cusp::krylov::cg(A, x, b);

// biconjugate gradient stabilized
cusp::krylov::bicgstab(A, x, b);
#include <cusp/array1d.h>
#include <cusp/hyb_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>
#include <cusp/monitor.h>
#include <cusp/precond/diagonal.h>

int main(int argc, char** argv)
{
    typedef float ValueType;
    typedef cusp::device_memory MemorySpace;

    // create an empty HYB sparse matrix structure
    cusp::hyb_matrix<int, ValueType, MemorySpace> A;

    // load a matrix stored in the matrix market format
    cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");

    // allocate storage for solution x and right-hand side b
    cusp::array1d<ValueType, MemorySpace> x(A.num_rows, 0);
    cusp::array1d<ValueType, MemorySpace> b(A.num_rows, 1);

    // set the iteration and residual stopping criteria
    cusp::verbose_monitor<ValueType> monitor(b, 100, 1e-6);

    // set up the matrix preconditioner
    cusp::precond::diagonal<ValueType, MemorySpace> M(A);

    // solve the linear system with the conjugate gradient method
    cusp::krylov::cg(A, x, b, monitor, M);

    return 0;
}
OpenNL: Open Numerical Library
- efficient sparse matrix data structures
- sparse direct linear solver via SuperLU
- Jacobi and SSOR matrix preconditioners
- iterative builder for sparse least-squares problems
- iterative solvers: conjugate gradient, BiCGSTAB, GMRES
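Since conjugate gradient comes up repeatedly (cusp::krylov::cg, OpenNL, ViennaCL), here is a compact dense CG sketch showing what these solvers implement; illustrative code under the assumption of a symmetric positive definite matrix, not any library's actual implementation:

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

inline Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

inline double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solve A x = b for symmetric positive definite A.
inline Vec conjugate_gradient(const Mat& A, const Vec& b,
                              int max_iters = 100, double tol = 1e-10) {
    Vec x(b.size(), 0.0), r = b, p = r;   // x0 = 0, so r0 = b
    double rr = dot(r, r);
    for (int k = 0; k < max_iters && std::sqrt(rr) > tol; ++k) {
        Vec Ap = matvec(A, p);
        double alpha = rr / dot(p, Ap);   // step length along p
        for (size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;        // update search direction
        rr = rr_new;
        for (size_t i = 0; i < p.size(); ++i)
            p[i] = r[i] + beta * p[i];
    }
    return x;
}
```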
ViennaCL
- a basic linear algebra library for computations on GPUs, based on OpenCL
- supports the basic linear algebra subroutines
- generalized minimal residual method (GMRES)
- direct linear system solver with LU factorization
- sparse conjugate gradient and biconjugate gradient solvers
- incomplete LU preconditioner with threshold

GATLAS: GPU Automatically Tuned Linear Algebra Subroutines
- automatically tunes the level-3 BLAS kernels, based on OpenCL
Next Generation Architecture
The next-generation GPU architecture is called Fermi.
Third generation Streaming Multiprocessor
- 32 cores per SM
- double precision at 50% of single-precision speed (8x faster than GT200)
- dual warp scheduler
- 4 special function units
- 64 KB of RAM, configurable between shared memory and L1 cache
Second generation Parallel Thread Execution
- IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- fused multiply-add (FMA) instructions for both single and double precision
- newly designed 32-bit integer ALU with extended-precision operations
Improved Memory System
- the first GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- an L1 cache per multiprocessor improves bandwidth and reduces latency
- a unified 768 KB L2 cache provides coherent data sharing across all cores
- ECC support
- GDDR5 memory interface, almost 2x faster than GDDR3
GigaThread Hardware Scheduler
- hierarchically manages thousands of simultaneously active threads
- 10x faster application context switching, supporting concurrent kernel execution
- dual DMA engines for simultaneous data transfers, fully overlapped with CPU and GPU processing time
Third generation Streaming Multiprocessor
- fully pipelined integer arithmetic logic unit and floating-point unit
- floating-point arithmetic improved from IEEE 754-1985 to IEEE 754-2008 to support the FMA instruction
- integer ALU improved from 24-bit to 32-bit precision
What is NEW in the floating-point operations?
- fused multiply-add (FMA) instructions are supported for both single and double precision

[Diagram: an ordinary multiply-add computes the product A x B, truncates its extra digits, and then adds C; a fused multiply-add retains all digits of the product and rounds only once, after adding C.]
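The rounding difference can be demonstrated on the host with std::fma, a CPU-side analogue of the GPU's FMA instruction (illustrative function names):

```cpp
#include <cmath>

// Ordinary multiply-add: the product a*b is rounded to double
// before c is added (two roundings).
inline double separate_mul_add(double a, double b, double c) {
    double p = a * b;     // product rounded here
    return p + c;
}

// Fused multiply-add: the full-precision product is retained and
// the result is rounded only once, after the addition.
inline double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

For a = 1 + 2^-27 and c = -(1 + 2^-26), the exact value of a*a + c is 2^-54; the fused version recovers it exactly, while the separate version loses it to the intermediate rounding.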
What is NEW in the floating-point operations?
- subnormal numbers are supported for both single and double precision; these are the small numbers lying between zero and the smallest normalized number of a given floating-point system
- the prior generation flushed subnormal operands and results to zero
- CPUs typically perform subnormal calculations in exception-handling software, taking thousands of cycles, but Fermi handles subnormal calculations in hardware with no additional performance penalty
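The subnormal range mentioned above can be inspected on the host via std::numeric_limits (illustrative helper names):

```cpp
#include <limits>

// Smallest positive subnormal double (2^-1074).
inline double smallest_subnormal() {
    return std::numeric_limits<double>::denorm_min();
}

// Smallest positive normalized double (2^-1022); subnormals fill
// the gap between this value and zero.
inline double smallest_normalized() {
    return std::numeric_limits<double>::min();
}
```

Halving the smallest normalized number stays nonzero only because subnormals exist; hardware that flushes to zero would return 0 instead.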
Third generation Streaming Multiprocessor
- 16 load/store units, allowing source and destination addresses to be calculated for 16 threads per cycle
- 32 single-precision FMA units
- 16 double-precision FMA units
Third generation Streaming Multiprocessor
- double-precision application performance (chart not reproduced)
Third generation Streaming Multiprocessor
- two warp schedulers and instruction dispatch units
- the dual warp schedulers allow two warps to be issued and executed concurrently on the 32 cores
- 64 KB of configurable shared memory and L1 cache
64 KB configurable shared memory and L1 cache
- 48 KB shared memory and 16 KB L1 cache
- 16 KB shared memory and 48 KB L1 cache

Radix sort using shared memory (performance chart not reproduced).
Unified memory address space
- combines the three separate address spaces (local, shared, and global) for load and store instructions
- this feature enables Fermi to support C++-specific programs: virtual functions, function pointers, new and delete for objects, and try/catch
Summary table (not reproduced).
[Diagram of the previous-generation graphics pipeline: host, input assembler, vertex and pixel thread issue, setup/raster/ZCull, work distribution, thread processors (SP) with L1 caches and texture fetch (TF) units, and L2 caches with framebuffer (FB) partitions; the scheduler is the bottleneck.]
Old bottleneck vs. new bottleneck (diagram not reproduced).
Reference
- Mark Harris, http://www.markmark.net/
- Wei-Chao Chen, http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu, http://impact.crhc.illinois.edu/people/current/hwu.php