CUDA Linear Algebra Library and Next Generation
Yukai Hung (a0934147@gmail.com)
Department of Mathematics, National Taiwan University
Sparse Matrix-Vector Multiplication
Dense approach is wasteful
- unclear how to map the work to parallel processors
- irregular element accesses to global memory
Sparse matrix storage formats, ordered from structured to unstructured:
DIA - diagonal format
ELL - ELLPACK format
CSR - compressed sparse row format
HYB - hybrid format
COO - coordinate format
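As a concrete reference for the formats above, here is a minimal CPU-side sketch of the CSR layout and a serial SpMV loop (the struct and function names are illustrative, not any library's API):

```cpp
#include <vector>

// Compressed sparse row (CSR) storage: nonzero values, their column
// indices, and per-row offsets into those arrays.
struct CsrMatrix {
    int rows;
    std::vector<int>   row_offsets; // size rows + 1
    std::vector<int>   col_indices; // size nnz
    std::vector<float> values;      // size nnz
};

// Build the CSR form of the 3x3 matrix
//   [ 1 0 2 ]
//   [ 0 3 0 ]
//   [ 4 0 5 ]
inline CsrMatrix example_csr() {
    CsrMatrix A;
    A.rows        = 3;
    A.row_offsets = {0, 2, 3, 5}; // row i spans [offsets[i], offsets[i+1])
    A.col_indices = {0, 2, 1, 0, 2};
    A.values      = {1, 2, 3, 4, 5};
    return A;
}

// Reference (serial) SpMV y = A*x; on the GPU each row would be
// handled by one thread (CSR scalar) or one warp (CSR vector).
inline std::vector<float> spmv_csr(const CsrMatrix& A,
                                   const std::vector<float>& x) {
    std::vector<float> y(A.rows, 0.0f);
    for (int i = 0; i < A.rows; ++i)
        for (int j = A.row_offsets[i]; j < A.row_offsets[i + 1]; ++j)
            y[i] += A.values[j] * x[A.col_indices[j]];
    return y;
}
```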
Diagonal format
- diagonals should be mostly populated
- high parallelism: one thread maps to one row
- good parallel efficiency and good memory behavior (global memory coalescing)
Ellpack format
- again assigns one thread to compute one row
- but load imbalance hurts parallel efficiency
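A serial sketch of the ELL layout shows both the padding and the source of the load imbalance (struct and names are illustrative):

```cpp
#include <vector>

// ELLPACK (ELL) storage sketch: every row is padded to the same
// number of entries K, stored column-major so that one thread per
// row reads with coalescing.
struct EllMatrix {
    int rows, max_per_row;          // max_per_row = K
    std::vector<int>   col_indices; // rows * K, -1 marks padding
    std::vector<float> values;      // rows * K
};

// Reference SpMV y = A*x with one (virtual) thread per row; rows
// with fewer than K nonzeros still iterate K times, which is the
// load imbalance mentioned above.
inline std::vector<float> spmv_ell(const EllMatrix& A,
                                   const std::vector<float>& x) {
    std::vector<float> y(A.rows, 0.0f);
    for (int row = 0; row < A.rows; ++row) {
        for (int k = 0; k < A.max_per_row; ++k) {
            int idx = k * A.rows + row; // column-major element
            int col = A.col_indices[idx];
            if (col >= 0)
                y[row] += A.values[idx] * x[col];
        }
    }
    return y;
}
```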
Coordinate format
- insensitive to the sparsity pattern, but slower than ELLPACK
- assigns one thread to one element and combines the results from all elements in a row to produce the output element
Hybrid format
- combines the regular ELLPACK format (for the typical entries) with the flexible COO format (for the exceptional entries)
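The typical/exceptional split can be sketched as follows: the first K entries of each row go into the padded ELL part and any overflow spills into COO (K and the helper below are illustrative, not the library's actual heuristic):

```cpp
#include <algorithm>
#include <vector>

// Count how a hybrid (HYB) matrix would partition its entries
// between the regular ELL part and the exceptional COO part.
struct HybSplit {
    long ell_entries; // stored in the padded ELL part
    long coo_entries; // exceptional entries spilled to COO
};

inline HybSplit split_hyb(const std::vector<int>& nnz_per_row, int K) {
    HybSplit s{0, 0};
    for (int nnz : nnz_per_row) {
        s.ell_entries += std::min(nnz, K);   // typical entries
        s.coo_entries += std::max(nnz - K, 0); // exceptional entries
    }
    return s;
}
```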
Property comparison

Format        Granularity       Coalescing
DIA           thread / row      full
ELL           thread / row      full
CSR (scalar)  thread / row      rare
CSR (vector)  warp / row        partial
COO           thread / nonzero  full
HYB           thread / row      full

(fixed number of nonzeros and variable matrix size)
Sparse matrices for parallel efficiency: ELLPACK format
- one thread per row is efficient for memory access

Sparse matrices for load imbalance: coordinate format
- one thread per element is insensitive to the matrix structure

Conclusion for all structures
- the hybrid structure gives the best performance on average
- irregularity is manageable if the common case is regularized
Performance comparison (charts not reproduced in this transcript).
Linear Algebra Library
CUBLAS: CUDA Basic Linear Algebra Subroutines
- implements the basic linear algebra subroutines at the runtime level
- only available for a single device; not implemented for multiple devices
CUFFT: CUDA Fast Fourier Transform Library
- uses a divide-and-conquer algorithm for the discrete transform
- supports real and complex data, in-place or out-of-place
- supports stream operations for simultaneous execution
- use complex-to-complex transforms in place of real-to-complex
- problem sizes that are powers of two give the best performance
CUDPP: CUDA Data Parallel Primitives Library
- a library of data-parallel algorithm primitives
- parallel prefix-sum, sorting, and data reduction
- stream compaction and random number generation
CUDPP: CUDA Data Parallel Primitives Library
- comparison with a multicore CPU (chart not reproduced)
CULA: GPU-Accelerated Linear Algebra Library
- implements LAPACK functions with interfaces for various languages
- linear system solvers
- least squares solvers
- orthogonal factorizations
- symmetric eigenproblems
- non-symmetric eigenproblems
- singular value decompositions
CULA performance charts (not reproduced): double-precision LU factorization, QR factorization, symmetric eigenvalue problem, and singular value decomposition.
MAGMA: Matrix Algebra on GPU and Multicore Architectures
- an open-source project to develop a dense linear algebra library, similar to the basic linear algebra packages, but for heterogeneous and hybrid architectures with manycore CPU and GPU systems
MAGMA performance charts (not reproduced): double-precision matrix-matrix multiplication, single-precision QR factorization, solving Ax=b by LU factorization, and single-precision Cholesky factorization.
Thrust
- a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL), providing a flexible high-level interface that greatly enhances productivity
#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

int main(int argc, char** argv)
{
    // allocate memory space on the host
    thrust::host_vector<float> hvec(1024);

    // generate random numbers on the host
    thrust::generate(hvec.begin(), hvec.end(), rand);

    // allocate device memory and transfer the data to the device
    thrust::device_vector<float> dvec = hvec;

    // manipulate device values from the host
    dvec[0] = (float)rand() / (float)(RAND_MAX - 1);
    dvec[1] = (float)rand() / (float)(RAND_MAX - 1);

    // sum all data on the device by parallel reduction
    float sum = thrust::reduce(dvec.begin(), dvec.end());

    // sort all data on the device by radix sort
    thrust::sort(dvec.begin(), dvec.end());

    // transfer the final data back to the host
    thrust::copy(dvec.begin(), dvec.end(), hvec.begin());

    return 0;
}
#include <list>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

int main(int argc, char** argv)
{
    // create a list container on the host
    std::list<int> hlist;
    hlist.push_back(13);
    hlist.push_back(27);

    // copy host data from the list into a device vector
    thrust::device_vector<int> dvec(hlist.size());
    thrust::copy(hlist.begin(), hlist.end(), dvec.begin());

    // alternative method to convert from host to device
    // thrust::device_vector<int> dvec(hlist.begin(), hlist.end());

    // obtain a raw pointer to the device memory
    int* dpointer = thrust::raw_pointer_cast(dvec.data());

    // launch a device kernel function (kernel, blocknum, and
    // blocksize are assumed to be defined elsewhere)
    kernel<<<blocknum, blocksize>>>(dpointer, dvec.size());

    // the device memory is freed automatically when dvec goes out
    // of scope; do not call cudaFree on the raw pointer
    return 0;
}
CUSP: Generic Parallel Algorithms for Sparse Matrix Computations
- cusp provides a high-level, flexible interface for manipulating sparse matrices and solving sparse linear systems by iterative methods
- cusp is implemented on top of the Thrust template interface

"Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors", Nathan Bell and Michael Garland, Supercomputing 2009.
Matrix formats
- cusp natively supports several sparse matrix formats
- cusp makes it easy to transfer sparse matrix data between host and device, and to convert between sparse matrix formats

// allocate storage for a CSR matrix on the host
// with 5 rows, 8 columns, and 12 nonzero elements
cusp::csr_matrix<int,float,cusp::host_memory> A(5,8,12);

// allocate device memory and transfer from host to device
cusp::csr_matrix<int,float,cusp::device_memory> B = A;

// convert the CSR matrix format to the HYB matrix format
cusp::hyb_matrix<int,float,cusp::device_memory> C = A;
Algorithms and iterative solvers
- matrix-vector multiplication and transpose
- conjugate gradient and biconjugate gradient stabilized methods

// matrix-vector multiplication
cusp::multiply(A, x, y);

// sparse matrix transpose
cusp::transpose(A, At);

// conjugate gradient
cusp::krylov::cg(A, x, b);

// biconjugate gradient stabilized
cusp::krylov::bicgstab(A, x, b);
#include <cusp/array1d.h>
#include <cusp/hyb_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>
#include <cusp/monitor.h>
#include <cusp/precond/diagonal.h>

int main(int argc, char** argv)
{
    typedef float ValueType;
    typedef cusp::device_memory MemorySpace;

    // create an empty HYB sparse matrix structure
    cusp::hyb_matrix<int, ValueType, MemorySpace> A;

    // load a matrix stored in the matrix market format
    cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");

    // allocate storage for solution x and right-hand side b
    cusp::array1d<ValueType, MemorySpace> x(A.num_rows, 0);
    cusp::array1d<ValueType, MemorySpace> b(A.num_rows, 1);

    // set the iteration and residual stopping criteria
    cusp::verbose_monitor<ValueType> monitor(b, 100, 1e-6);

    // set up the matrix preconditioner
    cusp::precond::diagonal<ValueType, MemorySpace> M(A);

    // solve the linear system with the conjugate gradient method
    cusp::krylov::cg(A, x, b, monitor, M);

    return 0;
}
OpenNL: Open Numerical Library
- efficient sparse matrix data structures
- sparse direct linear solver via SuperLU
- Jacobi and SSOR matrix preconditioners
- iterative builder for sparse least-squares problems
- iterative solvers: conjugate gradient, BiCGSTAB, GMRES
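Since conjugate gradient comes up repeatedly (cusp::krylov::cg, OpenNL, ViennaCL), here is a compact dense CG sketch showing what these solvers implement; illustrative code under the assumption of a symmetric positive definite matrix, not any library's actual implementation:

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

inline Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

inline double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solve A x = b for symmetric positive definite A.
inline Vec conjugate_gradient(const Mat& A, const Vec& b,
                              int max_iters = 100, double tol = 1e-10) {
    Vec x(b.size(), 0.0), r = b, p = r;   // x0 = 0, so r0 = b
    double rr = dot(r, r);
    for (int k = 0; k < max_iters && std::sqrt(rr) > tol; ++k) {
        Vec Ap = matvec(A, p);
        double alpha = rr / dot(p, Ap);   // step length along p
        for (size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;        // update search direction
        rr = rr_new;
        for (size_t i = 0; i < p.size(); ++i)
            p[i] = r[i] + beta * p[i];
    }
    return x;
}
```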
ViennaCL
- a basic linear algebra library for computations on GPUs, based on OpenCL
- supports the basic linear algebra subroutines
- generalized minimal residual method (GMRES)
- direct linear system solver with LU factorization
- sparse conjugate gradient and biconjugate gradient solvers
- incomplete LU preconditioner with threshold

GATLAS: GPU Automatically Tuned Linear Algebra Subroutines
- automatically tunes the level-3 BLAS kernels, based on OpenCL
Next Generation Architecture
The next-generation GPU architecture is called Fermi.
Third generation Streaming Multiprocessor
- 32 cores per SM
- double precision at 50% of single-precision speed (8x faster than GT200)
- dual warp scheduler
- 4 special function units
- 64 KB of RAM, configurable between shared memory and L1 cache
Second generation Parallel Thread Execution
- IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- fused multiply-add (FMA) instructions for both single and double precision
- newly designed 32-bit integer ALU with extended-precision operations
Improved Memory System
- the first GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- an L1 cache per multiprocessor improves bandwidth and reduces latency
- a unified 768 KB L2 cache provides coherent data sharing across all cores
- ECC support
- GDDR5 memory interface, almost 2x faster than GDDR3
GigaThread Hardware Scheduler
- hierarchically manages thousands of simultaneously active threads
- 10x faster application context switching, supporting concurrent kernel execution
- dual DMA engines for simultaneous data transfers, fully overlapped with CPU and GPU processing time
Third generation Streaming Multiprocessor
- fully pipelined integer arithmetic logic unit and floating-point unit
- floating-point arithmetic improved from IEEE 754-1985 to IEEE 754-2008 to support the FMA instruction
- integer ALU improved from 24-bit to 32-bit precision
What is NEW in the floating-point operations?
- fused multiply-add (FMA) instructions are supported for both single and double precision

[Diagram: an ordinary multiply-add computes the product A x B, truncates its extra digits, and then adds C; a fused multiply-add retains all digits of the product and rounds only once, after adding C.]
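The rounding difference can be demonstrated on the host with std::fma, a CPU-side analogue of the GPU's FMA instruction (illustrative function names):

```cpp
#include <cmath>

// Ordinary multiply-add: the product a*b is rounded to double
// before c is added (two roundings).
inline double separate_mul_add(double a, double b, double c) {
    double p = a * b;     // product rounded here
    return p + c;
}

// Fused multiply-add: the full-precision product is retained and
// the result is rounded only once, after the addition.
inline double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

For a = 1 + 2^-27 and c = -(1 + 2^-26), the exact value of a*a + c is 2^-54; the fused version recovers it exactly, while the separate version loses it to the intermediate rounding.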
What is NEW in the floating-point operations?
- subnormal numbers are supported for both single and double precision; these are the small numbers lying between zero and the smallest normalized number of a given floating-point system
- the prior generation flushed subnormal operands and results to zero
- CPUs typically perform subnormal calculations in exception-handling software, taking thousands of cycles, but Fermi handles subnormal calculations in hardware with no additional performance penalty
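The subnormal range mentioned above can be inspected on the host via std::numeric_limits (illustrative helper names):

```cpp
#include <limits>

// Smallest positive subnormal double (2^-1074).
inline double smallest_subnormal() {
    return std::numeric_limits<double>::denorm_min();
}

// Smallest positive normalized double (2^-1022); subnormals fill
// the gap between this value and zero.
inline double smallest_normalized() {
    return std::numeric_limits<double>::min();
}
```

Halving the smallest normalized number stays nonzero only because subnormals exist; hardware that flushes to zero would return 0 instead.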
Third generation Streaming Multiprocessor
- 16 load/store units, allowing source and destination addresses to be calculated for 16 threads per cycle
- 32 single-precision FMA units
- 16 double-precision FMA units
Third generation Streaming Multiprocessor
- double-precision application performance (chart not reproduced)
Third generation Streaming Multiprocessor
- two warp schedulers and instruction dispatch units
- the dual warp schedulers allow two warps to be issued and executed concurrently on the 32 cores
- 64 KB of configurable shared memory and L1 cache
64 KB configurable shared memory and L1 cache
- 48 KB shared memory and 16 KB L1 cache
- 16 KB shared memory and 48 KB L1 cache

Radix sort using shared memory (performance chart not reproduced).
Unified memory address space
- combines the three separate address spaces (local, shared, and global) for load and store instructions
- this feature enables Fermi to support C++-specific programs: virtual functions, function pointers, new and delete for objects, and try/catch
Summary table (not reproduced).
[Diagram of the previous-generation graphics pipeline: host, input assembler, vertex and pixel thread issue, setup/raster/ZCull, work distribution, thread processors (SP) with L1 caches and texture fetch (TF) units, and L2 caches with framebuffer (FB) partitions; the scheduler is the bottleneck.]
Old bottleneck vs. new bottleneck (diagram not reproduced).
Reference
- Mark Harris, http://www.markmark.net/
- Wei-Chao Chen, http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu, http://impact.crhc.illinois.edu/people/current/hwu.php