Graphics Processing Units
References:
•Computer Architecture, 5th Edition, Hennessy and Patterson, 2012
•http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
•http://www.realworldtech.com/page.cfm?articleid=RWT093009110932&p=1
•http://www.moderngpu.com/intro/performance.html
•http://heather.cs.ucdavis.edu/parprocbook
CPU vs. GPU
http://chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html
• CPU: small fraction of chip used for arithmetic
CPU vs. GPU
http://www.pcper.com/reviews/Graphics-Cards/NVIDIA-GT200-Revealed-GeForce-GTX-280-and-GTX-260-Review/NVIDIA-GT200-Archite
• GPU: large fraction of chip used for arithmetic
CPU vs. GPU
• Intel Haswell: 170 GFLOPS (quad-core at 3.4 GHz)
• AMD Radeon R9 290: 4800 GFLOPS (at 0.95 GHz)
• Nvidia GTX 970: 5000 GFLOPS (at 1.05 GHz)
GPGPU
• General-purpose GPU programming; massively parallel
• Scientific computing, brain simulations, etc.
• In supercomputers: 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking), including those in 2nd and 6th place
OpenCL vs. CUDA
• Both are for GPGPU
• OpenCL: open standard; supported on AMD, NVIDIA, Intel, Altera, …
• CUDA: proprietary (Nvidia); losing ground to OpenCL?
• Similar performance
CUDA
Programming on Parallel Machines, Norm Matloff, Chapter 5
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Uses a thread hierarchy: threads are grouped into blocks, and blocks into a grid
Thread
• Executes an instance of a kernel (program)
• Has a ThreadID (within its block), program counter, registers, private memory, and input and output parameters
• Private memory is used for register spills, function calls, and array variables
Nvidia Fermi Whitepaper pg 6
Block
• Set of concurrently executing threads
• Threads cooperate via barrier synchronization and shared memory (fast but small)
• Has a BlockID (within its grid)
Nvidia Fermi Whitepaper pg 6
Grid
• Array of thread blocks running the same kernel
• Blocks read and write global memory (slow: hundreds of cycles)
• Synchronization occurs between dependent kernel calls
Nvidia Fermi Whitepaper pg 6
Hardware Mapping
• A GPU executes one or more kernel (program) grids
• A streaming multiprocessor (SM) executes one or more thread blocks
• A CUDA core executes a thread
Fermi Architecture
• Debuted in 2010
• 512 CUDA cores, each executing one FP or integer instruction per cycle
• 32 CUDA cores per SM, 16 SMs per GPU
• Six 64-bit memory ports
• PCI-Express interface to the CPU
• GigaThread scheduler distributes blocks to SMs; each SM has a hardware thread scheduler with fast context switching
• 3 billion transistors
Nvidia Fermi Whitepaper pg 7
CUDA core
• Pipelined integer and FP units
• IEEE 754-2008 FP with fused multiply-add
• Integer unit: boolean, shift, move, compare, ...
Nvidia Fermi Whitepaper pg 8
Streaming Multiprocessor (SM)
• 32 CUDA cores
• 16 load/store units to calculate source/destination addresses
• 4 Special Function Units: sine, cosine, reciprocal, square root
Nvidia Fermi Whitepaper pg 8
Warps
• 32 threads from a block are bundled into warps, which execute the same instruction each cycle
• The warp is thus the minimum size of SIMD data
• Warps are implicitly synchronized; if threads branch in different directions, they step through both paths using predicated instructions
• Two warp schedulers each select one instruction from a warp to issue to 16 cores, 16 load/store units, or 4 SFUs
Maxwell Architecture
2014
16 streaming multiprocessors * 128 cores/SM = 2048 cores
Programming CUDA
C code
void daxpy(int n, double a, double *x, double *y) {
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

daxpy(n, 2.0, x, y); // invoke
Programming CUDA
CUDA code
// host code
int nblocks = (n + 511) / 512;          // grid size
daxpy<<<nblocks, 512>>>(n, 2.0, x, y);  // 512 threads/block

__global__ void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
n = 8192, 512 threads/block

grid
  block 0
    warp 0:  Y[0]=A*X[0]+Y[0] ... Y[31]=A*X[31]+Y[31]
    ...
    warp 15: Y[480]=A*X[480]+Y[480] ... Y[511]=A*X[511]+Y[511]
  ...
  block 15
    warp 0:  Y[7680]=A*X[7680]+Y[7680] ... Y[7711]=A*X[7711]+Y[7711]
    ...
    warp 15: Y[8160]=A*X[8160]+Y[8160] ... Y[8191]=A*X[8191]+Y[8191]
Moving data between host and GPU
int main() {
  double *x, *y, *dx, *dy;
  x = (double *)malloc(n*sizeof(double));
  y = (double *)malloc(n*sizeof(double));
  // initialize x and y
  …
  cudaMalloc(&dx, n*sizeof(double));
  cudaMalloc(&dy, n*sizeof(double));
  cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, y, n*sizeof(double), cudaMemcpyHostToDevice);
  …
  daxpy<<<nblocks,512>>>(n, 2.0, dx, dy);  // kernel gets device pointers
  cudaThreadSynchronize();
  cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(dx); cudaFree(dy);
  free(x); free(y);
}