Transcript of Lecture 5: Matrix-Matrix Product (UMIACS, cmsc828e_gpusci/Lecture5.pdf)
Lecture 5
Matrix-Matrix Product
Based on the lecture materials of Hwu (UIUC) and Kirk (NVIDIA)
Matrix Multiplication
• Simple version first – illustrates basic features of memory and thread management in CUDA programs
  – Thread ID usage
  – Memory data transfer API between host and device
  – Analyze performance
• Extend to a version which employs shared memory
Square Matrix Multiplication
• P = M * N of size WIDTH x WIDTH
• Without tiling:
  – One thread calculates one element of P
  – M and N are loaded WIDTH times from global memory

[Figure: matrices M, N, and P, each WIDTH × WIDTH]
Memory Layout of a Matrix

[Figure: a 4×4 matrix M and its linearization in memory]

In row-major (C) order, the 4×4 matrix M is stored linearly as:

M0,0 M1,0 M2,0 M3,0  M0,1 M1,1 M2,1 M3,1  M0,2 M1,2 M2,2 M3,2  M0,3 M1,3 M2,3 M3,3

(Fortran/Matlab use column-major order instead.)

This order will be important to compute the location of an element in the matrix according to thread and block indices.
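In code, this row-major rule is the single index computation used throughout the lecture (a one-line sketch; the variable names row and col are illustrative):

// Row-major (C-order) linearization: the element in row `row`,
// column `col` of a Width x Width matrix stored in a flat array.
int idx = row * Width + col;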
Step 1: Simple Host Version

[Figure: M, N, P, each WIDTH × WIDTH; index i selects a row of M and P, j a column of N and P, and k runs along a row of M and down a column of N]

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
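As a sanity check, a minimal driver for this host version (the 2×2 values below are illustrative, not from the lecture):

#include <stdio.h>

int main(void)
{
    const int Width = 2;
    float M[] = {1, 2, 3, 4};   // 2x2, row-major
    float N[] = {5, 6, 7, 8};
    float P[4];
    MatrixMulOnHost(M, N, P, Width);
    // Expected result: 19 22 / 43 50
    printf("%g %g\n%g %g\n", P[0], P[1], P[2], P[3]);
    return 0;
}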
Step 2: Transfer Data to Device from Host

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    …
    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
Step 3: Output Matrix Data Transfer (Host-side Code)

    // 2. Kernel invocation code – to be shown later
    …
    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
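In production code each CUDA call's return status should be checked; a minimal sketch of that pattern (an addition, not part of the original slides):

#include <stdio.h>
#include <stdlib.h>

// Wrap CUDA runtime calls to catch failures early.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&Md, size));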
Step 4: Kernel Function

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
Step 4: Kernel Function (cont.)

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y*Width + k];
        float Nelement = Nd[k*Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y*Width + threadIdx.x] = Pvalue;
}

[Figure: Md, Nd, Pd; thread (tx, ty) walks along row ty of Md and down column tx of Nd with index k]
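Putting the two Step 4 slides together, the complete single-block kernel reads:

// Matrix multiplication kernel – per thread code (slides 8 and 9 combined)
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue accumulates the Pd element computed by this thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y*Width + k];
        float Nelement = Nd[k*Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y*Width + threadIdx.x] = Pvalue;
}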
Step 5: Kernel Invocation (Host-side Code)

// Setup the execution configuration
dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
First Version: One Thread Block
• One block of threads computes matrix Pd
  – Each thread computes one element of Pd
• Each thread
  – Loads a row of matrix Md
  – Loads a column of matrix Nd
  – Performs one multiply and one addition for each pair of Md and Nd elements
  – Compute-to-off-chip-memory-access ratio is close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block
  – The limit is 512 threads, so WIDTH must be less than 23 (22*22 = 484 ≤ 512)

[Figure: Grid 1 containing a single Block 1; Thread (2, 2) forms one Pd element from a row of Md and a column of Nd, e.g. 3*2 + 2*4 + 5*2 + 4*6 = 48]
Extend to Arbitrary-Sized Square Matrices
• Use more than one block
• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks

[Figure: Md, Nd, Pd with block indices (bx, by) and thread indices (tx, ty); each block covers one TILE_WIDTH-wide tile of Pd]

Note: you still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)! (See the sketch below.)
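A sketch of that loop (an assumption, not from the slides): when WIDTH/TILE_WIDTH exceeds the maximum grid dimension (65535 per dimension on G80), launch the kernel once per grid-sized chunk. MatrixMulKernelOffset is hypothetical – the kernel from this lecture, extended to add the block offsets into its Row/Col computation.

int tiles = Width / TILE_WIDTH;
const int MAX_GRID = 65535;                 // max blocks per grid dimension
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
for (int byOff = 0; byOff < tiles; byOff += MAX_GRID)
    for (int bxOff = 0; bxOff < tiles; bxOff += MAX_GRID) {
        int gx = (tiles - bxOff < MAX_GRID) ? tiles - bxOff : MAX_GRID;
        int gy = (tiles - byOff < MAX_GRID) ? tiles - byOff : MAX_GRID;
        dim3 dimGrid(gx, gy);
        // Hypothetical kernel: adds (bxOff, byOff) to blockIdx internally.
        MatrixMulKernelOffset<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width,
                                                     bxOff, byOff);
    }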
Matrix Multiplication Using Multiple Blocks
• Break up Pd into tiles
• Each block calculates one tile
  – Each thread calculates one element
  – Block size equals tile size

[Figure: Pd partitioned into TILE_WIDTH × TILE_WIDTH sub-matrices; block (bx, by), thread (tx, ty) addresses one element of the sub-matrix Pdsub]
A Small Example

[Figure: a 4×4 Pd with TILE_WIDTH = 2; Block(0,0) computes P0,0 P1,0 P0,1 P1,1; Block(1,0) computes P2,0 P3,0 P2,1 P3,1; Block(0,1) computes P0,2 P1,2 P0,3 P1,3; Block(1,1) computes P2,2 P3,2 P2,3 P3,3]
A Small Example: Multiplication

[Figure: elements Md0,0…Md3,1 and Nd0,0…Nd1,3 combining to produce the 4×4 Pd; each Pd element is the dot product of a row of Md and a column of Nd]
Revised Matrix Multiplication Kernel Using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];

    Pd[Row*Width + Col] = Pvalue;
}
Note how the Row and Column indices are computed.
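As a concrete check (worked from the kernel above, using the small example's TILE_WIDTH = 2): thread (tx, ty) = (1, 0) of block (bx, by) = (1, 1) gets Row = 1*2 + 0 = 2 and Col = 1*2 + 1 = 3, so it computes the element in row 2, column 3 of Pd, i.e. Pd3,2 in the figures' notation.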
Analysis of This Version
• Each thread loads 2*Width elements from global memory and performs that many floating-point operations
  – So this version does one flop per 4-byte memory load
• On a G80 the bandwidth of memory transfer from global memory is ~86 GB/sec, so we are limited to ~21 G floating-point loads per second (86.4 GB/s ÷ 4 B per load ≈ 21.6 G loads/s)
  – The flop rate is also limited to this number
• But the 8800 GTX is supposed to achieve ~340 Gflops
  – We need to use shared memory and do more computations per global memory access
Hardware Implementation: Memory Architecture
• The local, global, constant, and texture spaces are regions of device memory
• Each multiprocessor has:
  – A set of 32-bit registers per processor
  – On-chip shared memory
    • Where the shared memory space resides
  – A read-only constant cache
    • To speed up access to the constant memory space
  – A read-only texture cache
    • To speed up access to the texture memory space

[Figure: device block diagram – multiprocessors 1…N, each with processors 1…M (each holding registers), shared memory, an instruction unit, a constant cache, and a texture cache; global, constant, and texture memories live in off-chip device memory]
More Terminology Review
• device = GPU = set of multiprocessors
• Multiprocessor = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory     Location   Cached           Access       Who
Local      Off-chip   No               Read/write   One thread
Shared     On-chip    N/A (resident)   Read/write   All threads in a block
Global     Off-chip   No               Read/write   All threads + host
Constant   Off-chip   Yes              Read         All threads + host
Texture    Off-chip   Yes              Read         All threads + host
Access Times
• Register – dedicated HW – single cycle
• Shared Memory – dedicated HW – single cycle
• Local Memory – DRAM, no cache – *slow*
• Global Memory – DRAM, no cache – *slow*
• Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Instruction Memory (invisible) – DRAM, cached
Language Extensions: Variable Type Qualifiers
• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
  – Except arrays, which reside in local memory

Declaration                                Memory     Scope    Lifetime
__device__ __local__ int LocalVar;         local      thread   thread
__device__ __shared__ int SharedVar;       shared     block    block
__device__ int GlobalVar;                  global     grid     application
__device__ __constant__ int ConstantVar;   constant   grid     application
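As a sketch, the qualifiers above might appear together in a kernel like this (the names QualifierDemo, out, etc. are illustrative, not from the slides):

__device__ __constant__ int ConstantVar;   // constant memory, grid scope
__device__ int GlobalVar;                  // global memory, grid scope

__global__ void QualifierDemo(int* out)
{
    __shared__ int SharedVar;              // shared memory, block scope
    int LocalVar = threadIdx.x;            // automatic variable -> register
    if (threadIdx.x == 0)
        SharedVar = ConstantVar + GlobalVar;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = SharedVar + LocalVar;
}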
Variable Type Restrictions
• Pointers can only point to memory allocated or declared in global memory:
  – Allocated in the host and passed to the kernel:
    __global__ void KernelFunc(float* ptr)
  – Obtained as the address of a global variable:
    float* ptr = &GlobalVar;
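Both cases together in one hedged sketch (GlobalVar and the kernel body are illustrative):

__device__ float GlobalVar;

__global__ void KernelFunc(float* ptr)   // ptr: cudaMalloc'ed by the host
{
    float* p = &GlobalVar;               // address of a __device__ global
    if (threadIdx.x == 0)
        *p = ptr[0];                     // both pointers refer to global memory
}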
A Common Programming Strategy
• Global memory resides in device memory (DRAM) – much slower access than shared memory
• So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  – Partition data into subsets that fit into shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    • Copying results from shared memory to global memory
    (See the skeleton after this list.)
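A generic skeleton of that load / sync / compute / sync pattern (a sketch assuming a 1D tile and blockDim.x == 256; the tiled matrix kernel later in this lecture is the real instance):

__global__ void TiledPattern(float* in, float* out, int n)
{
    __shared__ float tile[256];                  // subset in shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // 1. collaborative load
    __syncthreads();                             //    wait for the whole tile

    float r = 2.0f * tile[threadIdx.x];          // 2. compute from shared
    __syncthreads();                             //    (placeholder operation)

    if (i < n) out[i] = r;                       // 3. write results back
}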
A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory (DRAM) – much slower access than shared memory
  – But… cached!
  – Highly efficient access for read-only data
• Carefully divide data according to access patterns
  – Read-only → constant memory (very fast if in cache)
  – Read/write shared within a block → shared memory (very fast)
  – Read/write within each thread → registers (very fast)
  – Read/write inputs/results → global memory (very slow)

Texture memory – later
Idea: Use Shared Memory to Reuse Global Memory Data
• Each input element is read by WIDTH threads
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
  – Tiled algorithms

[Figure: M, N, P as before; thread (tx, ty) shares loaded elements with the other threads of its block]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

[Figure: Md, Nd, Pd with TILE_WIDTH × TILE_WIDTH tiles; block (bx, by) computes sub-matrix Pdsub, thread (tx, ty) one element of it]
Breaking Md and Nd into Tiles

[Figure: the 4×4 example again, with Md, Nd, and Pd each partitioned into 2×2 tiles]
Each Phase of a Thread Block Uses One Tile from Md and One from Nd

In each phase, every thread loads one Md element into Mds (Step 4) and one Nd element into Nds (Step 5), then accumulates into its PValue (Step 6). Time runs left to right across the phases.

T0,0: Phase 1: Md0,0→Mds0,0, Nd0,0→Nds0,0, PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
      Phase 2: Md2,0→Mds0,0, Nd0,2→Nds0,0, PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
T1,0: Phase 1: Md1,0→Mds1,0, Nd1,0→Nds1,0, PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
      Phase 2: Md3,0→Mds1,0, Nd1,2→Nds1,0, PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
T0,1: Phase 1: Md0,1→Mds0,1, Nd0,1→Nds0,1, PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
      Phase 2: Md2,1→Mds0,1, Nd0,3→Nds0,1, PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
T1,1: Phase 1: Md1,1→Mds1,1, Nd1,1→Nds1,1, PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1
      Phase 2: Md3,1→Mds1,1, Nd1,3→Nds1,1, PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1
Threads, Warps, Blocks
• There are (up to) 32 threads in a warp
  – Fewer than 32 only when there are fewer than 32 total threads
• There are (up to) 16 warps in a block
• Each block (and thus, each warp) executes on a single SM
• G80 has 16 SMs
• At least 16 blocks are required to “fill” the device
• More is better
  – If resources (registers, thread space, shared memory) allow, more than one block can occupy each SM
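As a concrete example (simple arithmetic, not from the slides): a 16×16 thread block has 256 threads, which the hardware schedules as 256/32 = 8 warps; with 16 SMs, at least 16 such blocks are needed before every SM has work.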
First-order Size Considerations in G80
• Each thread block should have many threads
  – TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
  – A 1024*1024 Pd gives 64*64 = 4096 thread blocks
• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations
  – That is 16 operations per load, so memory bandwidth is no longer a limiting factor
CUDA Code – Kernel Execution Configuration

// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
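The slide stops at the configuration; following the invocation pattern shown earlier for the single-block version, the launch itself would be:

MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);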
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
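One practical note: TILE_WIDTH must be a compile-time constant here, because Mds and Nds are statically sized shared-memory arrays. A typical definition, matching the 16×16 choice discussed on the surrounding slides, would be:

#define TILE_WIDTH 16   // 16*16 = 256 threads per block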
Tiled Multiply
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH
• Each thread computes one element of Pdsub

[Figure: the same tiling diagram, now annotated with the loop variables m (tile index) and k (position within the tile)]
G80 Shared Memory and Threading
• Each SM in G80 has 16KB shared memory
  – Shared memory size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
  – Can potentially have up to 8 thread blocks actively executing
    • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  – The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
Tiling Size Effects

[Chart: performance (y-axis 0–100, presumably GFLOPS) of the kernel when not tiled and with 4x4, 8x8, 12x12, and 16x16 tiles, each in “tiled only” and “tiled & unrolled” variants]