GPU Programming
Lecture 1: Introduction
Miaoqing Huang, University of Arkansas
Fall 2013
Outline
- Course Introduction
- GPUs as Parallel Computers
  - Trend and Design Philosophies
  - Programming and Execution Model
- Programming GPUs using Nvidia CUDA
Course Objective
- GPU programming
  - Learn how to program massively parallel processors and achieve
    - High performance
    - Scalability across future generations
  - Acquire the technical knowledge required to achieve the above goals
    - Principles and patterns of parallel programming
    - Processor architecture features and constraints
    - Programming APIs, tools, and techniques
Lab Assignments, Projects, and Course Grading
- No homework, no exams
- Constituent components of course grading:
  - 7 lab assignments
    - 2 points for lab-0, 8 points each for the remaining labs
  - 1 project
    - 40 points; development starts in the middle of the semester
  - Classroom presentation
    - Lab discussion and project
  - Class attendance
- All lab assignments and the project are to be carried out individually
Lab Equipment
- GPU computing
  - Fermi GPU: GeForce GTX 480
    - 480 Fermi CUDA cores
    - All the workstations in JBHT 237 are equipped with one GTX 480 GPU
  - Fermi GPU: Tesla C2075
    - 448 Fermi CUDA cores
    - Low power consumption, high double-precision floating-point performance
  - Kepler GPU: Tesla K20
    - 2496 CUDA cores
    - High double-precision floating-point performance, advanced features
Why Massively Parallel Processors?
- A quiet revolution and potential build-up
  - Performance advantages against multicore CPUs (in 2009):
    - GFLOPS: 1,000 vs. 100
    - Memory bandwidth: 200 GB/s vs. 20 GB/s
  - GPUs in every PC and workstation: massive volume and potential impact
Different Design Philosophies

[Figure: a CPU die dominated by control logic, a large cache, and a few complex ALUs over DRAM, versus a GPU die filled with many simple ALUs over DRAM]

- CPU: sequential execution
  - A few complex ALUs
  - Complicated control logic, e.g., branch prediction
  - Large caches
- GPU: parallel computing
  - Many simple processing cores
  - Simple control and scheduling logic
  - No or small caches
Architecture of Fermi GPU

[Figure: a Fermi streaming multiprocessor (SM) contains 32 CUDA cores, 16 load/store units, four special function units, a 32,768 × 32-bit register file, two warp schedulers each with a dispatch unit, an instruction cache, an interconnect network, and 64 KB of shared memory / L1 cache. The 16-SM Fermi GPU adds a shared L2 cache, six DRAM interfaces, the GigaThread engine, and the host interface. Each CUDA core holds a dispatch port, an operand collector, an FP unit, an INT unit, and a result queue.]

- 512 Fermi streaming processors in 16 streaming multiprocessors
Architecture of Kepler GPU
- 2,880 streaming processors in 15 streaming multiprocessors
Architecture of Kepler Streaming Multiprocessor
- Each streaming multiprocessor contains 192 single-precision cores and 64 double-precision cores
Basic Programming Model on GPU

(Excerpt from Chapter 2, "Programming Model," of the CUDA C Programming Guide, Version 3.1)
The number of threads per block and the number of blocks per grid specified in the <<<…>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.
Each block within the grid can be identified by a one-dimensional or two-dimensional index accessible within the kernel through the built-in blockIdx variable. The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.
Extending the previous MatAdd() example to handle multiple blocks, the code becomes as follows.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
[Figure 2-1. Grid of Thread Blocks: a 3 × 2 grid of blocks, with Block (1, 1) expanded into its 4 × 3 grid of threads]
- Issue hundreds of thousands of threads targeting thousands of processors
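A host-side sketch of how MatAdd might be launched. The 16 × 16 block size follows the guide's own example; the allocation and copy steps are elided there, and are sketched here with the standard cudaMalloc/cudaMemcpy runtime calls (hA, hB, hC are hypothetical host arrays):

```
// Hypothetical host code; assumes N is a multiple of 16 for simplicity
float (*dA)[N], (*dB)[N], (*dC)[N];
size_t bytes = N * N * sizeof(float);
cudaMalloc(&dA, bytes);
cudaMalloc(&dB, bytes);
cudaMalloc(&dC, bytes);
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // hA, hB: host inputs
cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

dim3 threadsPerBlock(16, 16);                       // 256 threads per block
dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(dA, dB, dC); // launch the 2-D grid

cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // retrieve the result
cudaFree(dA); cudaFree(dB); cudaFree(dC);
```

Each thread computes exactly one element of C, which is why the grid must contain at least N × N threads in total.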
Execution Model on GPU

(Figures from "Beyond Programmable Shading: Fundamentals")

[Figure: hiding shader stalls over time, in clocks. One core with a shared fetch/decode unit, eight ALUs, and on-chip storage holding several thread-group contexts (Ctx); fragments 1 ... 8 form one group]
[Figure: the context storage is partitioned into four groups of eight fragments each (1 ... 8, 9 ... 16, 17 ... 24, 25 ... 32), which the core can switch among]
[Figure: when the running group stalls on memory, the core switches to the next runnable group instead of idling]
[Figure: as each of the four groups stalls in turn, the scheduler rotates to another runnable group, so some group is always executing]
Throughput!

[Figure: with four interleaved groups, each group starts, computes, stalls, resumes, and finishes in turn; every clock is filled with some group's work]

- Increase the run time of one group to maximize the throughput of many groups
How to deal with branches?

(From "Beyond Programmable Shading: Fundamentals")

[Figure: eight ALUs stepping fragments 1 ... 8 through a branch, clock by clock]

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>
[Figure: the same branch with the per-lane condition mask T T T F F F F F: lanes 1-3 take the if-branch, lanes 4-8 take the else-branch]
[Figure: masked lanes sit idle while the other side of the branch executes]

- Not all ALUs do useful work! Worst case: 1/8 performance
Partition of an Application

[Figure: an application divided into sequential portions, covered by the traditional CPU, and parallel portions, covered by the GPU]

Obstacles:
- Increase the data-parallel portion of an application
- Analyze an existing application
- Expand the data volume of the parallel part
Nvidia CUDA
- CUDA driver
  - Handles the communication with Nvidia GPUs
- CUDA toolkit
  - Contains the tools needed to compile and build a CUDA application
- CUDA SDK
  - Includes sample projects that provide source code and other resources for constructing CUDA programs
Program GPUs in JBHT 237
1. CUDA 5, including drivers and toolkits, has been installed on all computers in JBHT 237
2. The environment variables have been properly set to compile your code
3. Log into the machine using your UARK ID: gacl\username
4. To log into a machine remotely (e.g., hostname csce-t7500-13):
   ssh gacl\\[email protected]