IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
![Page 1: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/1.jpg)
IAP09 CUDA@MIT / 6.963
Supercomputing on your desktop:
Programming the next generation of cheap
and massively parallel hardware using CUDA
Lecture 04
CUDA Advanced #1
Nicolas Pinto (MIT)
![Page 2: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/2.jpg)
During this course,
we’ll try to
and use existing material ;-)
“ ”
adapted for 6.963
![Page 3: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/3.jpg)
![Page 4: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/4.jpg)
![Page 5: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/5.jpg)
warp != wrap
![Page 6: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/6.jpg)
Today
yey!!
![Page 7: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/7.jpg)
Textures & OpenGL
Async API
Libraries
Interfacing CUDA
Performance
![Page 8: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/8.jpg)
CUDA Textures and OpenGL
![Page 9: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/9.jpg)
CUDA Texture Functionality
© NVIDIA Corporation 2008 160
Textures in CUDA
Different hardware path to memory
Benefits of CUDA textures:
Texture fetches are cached
Optimized for 2D locality
Textures are addressable in 2D
Using integer or normalized coordinates
Means fewer addressing calculations in code
Provide filtering for free
Free wrap modes (boundary conditions)
Clamp to edge / repeat
Limitations of CUDA textures:
Read-only
Currently either 1D or 2D (3D will be added)
9-bit accuracy of filter weights
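To make the "free" wrap modes and filtering more concrete, here is a small host-side C sketch of what clamp and repeat addressing and 1D linear filtering compute. It is a CPU reference only, not the hardware path: the function names are hypothetical, texel centers are assumed to sit at i + 0.5, and the filter weight is assumed to be truncated to 8 fractional bits to imitate the limited weight precision mentioned above.

```c
/* Clamp-to-edge: out-of-range indices stick to the nearest valid texel. */
static int addr_clamp(int i, int n)
{
    if (i < 0) return 0;
    if (i >= n) return n - 1;
    return i;
}

/* Repeat (wrap): indices are taken modulo the texture width. */
static int addr_wrap(int i, int n)
{
    int r = i % n;
    return r < 0 ? r + n : r;
}

/* 1D linear filtering at unnormalized coordinate x with clamp addressing.
   Assumptions (not from the slides): texel centers at i + 0.5, and the
   blend weight truncated to a multiple of 1/256. */
static float tex1d_linear_clamp(const float *texels, int n, float x)
{
    float xb = x - 0.5f;
    int i = (int)xb;
    if ((float)i > xb) i--;                /* floor for negative xb */
    float frac = xb - (float)i;            /* in [0, 1) */
    float alpha = (float)(int)(frac * 256.0f) / 256.0f;
    float a = texels[addr_clamp(i, n)];
    float b = texels[addr_clamp(i + 1, n)];
    return (1.0f - alpha) * a + alpha * b;
}
```

With texels {0, 10, 20, 30}, sampling at x = 1.0 falls halfway between the first two texel centers, so the sketch blends them equally; sampling at x = 0.0 clamps both neighbors to the first texel.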
![Page 10: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/10.jpg)
Two CUDA Texture Types
Bound to linear memory
Global memory is bound to a texture
Only 1D
Integer addressing
No filtering, no addressing modes
Bound to CUDA arrays
CUDA array is bound to a texture
1D or 2D
Float addressing (size-based or normalized)
Filtering
Addressing modes (clamping, repeat)
Both:
Return either element type or normalized float

CUDA Texturing Steps
Host (CPU) code:
Allocate/obtain memory (global linear, or CUDA array)
Create a texture reference object
Currently must be at file-scope
Bind the texture reference to memory/array
When done:
Unbind the texture reference, free resources
Device (kernel) code:
Fetch using texture reference
Linear memory textures:
tex1Dfetch()
Array textures:
tex1D() or tex2D()
![Page 11: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/11.jpg)
![Page 12: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/12.jpg)
Texture Reference
Immutable parameters (compile-time)
Type: type returned when fetching
Basic int, float types
CUDA 1-, 2-, 4-element vectors
Dimensionality:
Currently 1 or 2 (3 will be supported in the future)
Read Mode:
cudaReadModeElementType
cudaReadModeNormalizedFloat (valid for 8- or 16-bit ints)
returns [-1,1] for signed, [0,1] for unsigned
Mutable parameters (run-time, only for array-textures)
Normalized:
non-zero = addressing range [0, 1]
Filter Mode:
cudaFilterModePoint
cudaFilterModeLinear
Address Mode:
cudaAddressModeClamp
cudaAddressModeWrap
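As a rough illustration of what cudaReadModeNormalizedFloat returns, the following host-side C sketch maps integer texels to floats. The function names are hypothetical and this is a CPU reference, not the texture unit itself. The unsigned mapping follows directly from the [0,1] range above; for the signed case the sketch assumes scaling by the largest positive value and clamping at -1, which is an assumption rather than something the slides state.

```c
/* Unsigned 16-bit texel mapped to [0, 1]; 0xFFFF becomes exactly 1.0. */
static float norm_u16(unsigned short v)
{
    return (float)v / 65535.0f;
}

/* Signed 8-bit texel mapped to [-1, 1].
   Assumption: divide by 127 and clamp, so -128 also maps to -1.0. */
static float norm_s8(signed char v)
{
    float f = (float)v / 127.0f;
    return f < -1.0f ? -1.0f : f;
}
```

This is the conversion a kernel would see when fetching from a texture declared as texture<unsigned short, 1, cudaReadModeNormalizedFloat>.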
Example: Host code for linear mem
// declare texture reference (must be at file-scope)
texture<unsigned short, 1, cudaReadModeNormalizedFloat> texRef;
...
// set up linear memory
unsigned short *dA = 0;
cudaMalloc((void**)&dA, numBytes);
cudaMemcpy(dA, hA, numBytes, cudaMemcpyHostToDevice);
// bind texture reference to linear memory
cudaBindTexture(NULL, texRef, dA);
![Page 13: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/13.jpg)
![Page 14: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/14.jpg)
cudaArray Type
Channel format, width, height
cudaChannelFormatDesc structure
int x, y, z, w: bits for each component
enum cudaChannelFormatKind – one of:
cudaChannelFormatKindSigned
cudaChannelFormatKindUnsigned
cudaChannelFormatKindFloat
some predefined constructors:
cudaCreateChannelDesc<float>(void);
cudaCreateChannelDesc<float4>(void);
Management functions:
cudaMallocArray, cudaFreeArray,
cudaMemcpyToArray, cudaMemcpyFromArray, ...

Example: Host code for 2D array tex
// declare texture reference (must be at file-scope)
texture<float, 2, cudaReadModeElementType> texRef;
...
// set up the CUDA array
cudaChannelFormatDesc cf = cudaCreateChannelDesc<float>();
cudaArray *texArray = 0;
cudaMallocArray(&texArray, &cf, dimX, dimY);
cudaMemcpyToArray(texArray, 0, 0, hA, numBytes, cudaMemcpyHostToDevice);
// specify mutable texture reference parameters
texRef.normalized = 0;
texRef.filterMode = cudaFilterModeLinear;
// addressMode is an array with one entry per texture dimension
texRef.addressMode[0] = cudaAddressModeClamp;
texRef.addressMode[1] = cudaAddressModeClamp;
// bind texture reference to array
cudaBindTextureToArray(texRef, texArray);
![Page 15: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/15.jpg)
![Page 16: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/16.jpg)
OpenGL Interoperability
OpenGL buffer objects can be mapped into the CUDA address space and then used as global memory
Vertex buffer objects
Pixel buffer objects
Direct3D9 Vertex objects can be mapped
Data can be accessed like any other global data in the device code
Image data can be displayed from pixel buffer objects using glDrawPixels / glTexImage2D
Requires copy in video memory, but still fast

OpenGL Interop Steps
Register a buffer object with CUDA
cudaGLRegisterBufferObject(GLuint buffObj);
OpenGL can use a registered buffer only as a source
Unregister the buffer prior to rendering to it by OpenGL
Map the buffer object to CUDA memory
cudaGLMapBufferObject(void **devPtr, GLuint buffObj);
Returns an address in global memory
Buffer must be registered prior to mapping
Launch a CUDA kernel to process the buffer
Unmap the buffer object prior to use by OpenGL
cudaGLUnmapBufferObject(GLuint buffObj);
Unregister the buffer object
cudaGLUnregisterBufferObject(GLuint buffObj);
Optional: needed if the buffer is a render target
Use the buffer object in OpenGL code
![Page 17: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/17.jpg)
![Page 18: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/18.jpg)
Interop Scenario: Dynamic CUDA-generated texture
Register the texture PBO with CUDA
For each frame:
Map the buffer
Generate the texture in a CUDA kernel
Unmap the buffer
Update the texture
Render the textured object
unsigned char *p_d=0;
cudaGLMapBufferObject((void**)&p_d, pbo);
prepTexture<<<height,width>>>(p_d, time);
cudaGLUnmapBufferObject(pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
glBindTexture(GL_TEXTURE_2D, texID);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0,0, 256,256,
GL_BGRA, GL_UNSIGNED_BYTE, 0);

Interop Scenario: Frame Post-processing by CUDA
For each frame:
Render to PBO with OpenGL
Register the PBO with CUDA
Map the buffer
Process the buffer with a CUDA kernel
Unmap the buffer
Unregister the PBO from CUDA
unsigned char *p_d=0;
cudaGLRegisterBufferObject(pbo);
cudaGLMapBufferObject((void**)&p_d, pbo);
postProcess<<<blocks,threads>>>(p_d);
cudaGLUnmapBufferObject(pbo);
cudaGLUnregisterBufferObject(pbo);
...
![Page 19: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/19.jpg)
![Page 20: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/20.jpg)
CUDA Async API
![Page 21: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/21.jpg)
Asynchronous memory copy
Asynchronous host ↔ device memory copy for page-locked memory frees up CPU on all CUDA capable devices
Overlap implemented by using a CUDA stream
CUDA Stream = Sequence of CUDA operations that execute in order
Stream API:
Each stream has an ID: 0 = default stream
cudaMemcpyAsync(dst, src, size, dir, 0);
![Page 22: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/22.jpg)
Overlap kernel and memory copy
Concurrent execution of a kernel and a host ↔ device memory copy for page-locked memory
Compute capability >= 1.1 (G84 and up)
Available as a preview feature in CUDA 1.1
Overlaps kernel execution in one stream with a memory copy from another stream
Stream API:
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);
cudaStreamQuery(stream2);
overlapped: the cudaMemcpyAsync in stream1 and the kernel launch in stream2
![Page 23: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/23.jpg)
CUDA Event API
Events are inserted (recorded) into CUDA call streams
Usage scenarios:
measure elapsed time for CUDA calls (clock cycle precision)
query the status of an asynchronous CUDA call
block CPU until CUDA calls prior to the event are completed
asyncAPI sample in CUDA SDK
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float et;
cudaEventElapsedTime(&et, start, stop);
cudaEventDestroy(start); cudaEventDestroy(stop);
![Page 24: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/24.jpg)
CUDA Libraries
![Page 25: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/25.jpg)
M02: High Performance Computing with CUDA
CUDA libraries
CUDA includes 2 widely used libraries
CUBLAS: BLAS implementation
CUFFT: FFT implementation
CUDPP (Data Parallel Primitives), available from
http://www.gpgpu.org/developer/cudpp/ :
Reduction
Scan
Sort
![Page 26: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/26.jpg)
Closely Coupled CPU-GPU
[Figure: CPU/GPU timeline with Init, Alloc, Function and Lib stages across Operation 1, Operation 2 and Operation 3]
Integrated programming model
High speed data transfer – up to 5.5GB/sec
Asynchronous data transfer
Large GPU memory systems
![Page 27: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/27.jpg)
CUBLAS
Implementation of BLAS (Basic Linear Algebra Subprograms) on top of CUDA driver
Self-contained at the API level, no direct interaction with CUDA driver
Basic model for use
Create matrix and vector objects in GPU memory space
Fill objects with data
Call sequence of CUBLAS functions
Retrieve data from GPU
CUBLAS library contains helper functions
Creating and destroying objects in GPU space
Writing data to and retrieving data from objects
![Page 28: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/28.jpg)
Using CUBLAS
Interface to CUBLAS library is in cublas.h
Function naming convention
cublas + BLAS name
E.g., cublasSgemm
Error handling
CUBLAS core functions do not return error
CUBLAS provides function to retrieve last error recorded
CUBLAS helper functions do return error
Helper functions:
Memory allocation, data transfer
Implemented using C-based CUDA tool chain
Interfacing to C/C++ applications is trivial
![Page 29: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/29.jpg)
Supported Features

| | Single Precision, Real | Single Precision, Complex | Double Precision*, Real | Double Precision*, Complex |
|---|---|---|---|---|
| Level 1 | ✓ | ✓ | ✓ | |
| Level 2 | ✓ | | dgemv, dger, dsyr, dtrsv | |
| Level 3 | ✓ | cgemm | ✓ | zgemm |

*Double-precision functions only supported on GPUs with double-precision hardware
![Page 30: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/30.jpg)
CUBLAS Helper Functions
cublasInit()
Initializes CUBLAS library
cublasShutdown()
Releases resources used by CUBLAS library
cublasGetError()
Returns last error from CUBLAS core function (+ resets)
cublasAlloc()
Wrapper around cudaMalloc() to allocate space for array
cublasFree()
Destroys object in GPU memory
cublas[Set|Get][Vector|Matrix]()
Copies array elements between CPU and GPU memory
Accommodates non-unit strides
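To show what "accommodates non-unit strides" means for the vector copy helpers, here is a host-only C sketch of the copy semantics. The name ref_set_vector is hypothetical, it never touches the GPU, and it assumes positive strides; it only mirrors the argument order used by cublasSetVector in this lecture (count, element size, source, source stride, destination, destination stride).

```c
#include <stddef.h>
#include <string.h>

/* Copy n elements of elem_size bytes from x (stride incx, in elements)
   to y (stride incy, in elements). Assumes incx > 0 and incy > 0. */
static void ref_set_vector(int n, int elem_size,
                           const void *x, int incx,
                           void *y, int incy)
{
    const char *src = (const char *)x;
    char *dst = (char *)y;
    for (int i = 0; i < n; ++i) {
        memcpy(dst + (size_t)i * (size_t)incy * (size_t)elem_size,
               src + (size_t)i * (size_t)incx * (size_t)elem_size,
               (size_t)elem_size);
    }
}
```

For example, copying 3 floats from a 6-element array with incx = 2 and incy = 1 gathers every other source element into a packed destination, which is how one row of a column-major matrix could be extracted.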
![Page 31: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/31.jpg)
sgemmExample.c
#include <stdio.h>
#include <stdlib.h>
#include "cublas.h"
int main(void)
{
float *a_h, *b_h, *c_h;
float *a_d, *b_d, *c_d;
float alpha = 1.0f, beta = 0.0f;
int N = 2048, n2 = N*N;
int nBytes, i;
nBytes = n2*sizeof(float);
a_h = (float *)malloc(nBytes);
b_h = (float *)malloc(nBytes);
c_h = (float *)malloc(nBytes);
for (i=0; i < n2; i++) {
a_h[i] = rand() / (float) RAND_MAX;
b_h[i] = rand() / (float) RAND_MAX;
}
cublasInit();
cublasAlloc(n2, sizeof(float), (void **)&a_d);
cublasAlloc(n2, sizeof(float), (void **)&b_d);
cublasAlloc(n2, sizeof(float), (void **)&c_d);
cublasSetVector(n2, sizeof(float), a_h, 1, a_d, 1);
cublasSetVector(n2, sizeof(float), b_h, 1, b_d, 1);
cublasSgemm('n', 'n', N, N, N, alpha, a_d, N,
b_d, N, beta, c_d, N);
cublasGetVector(n2, sizeof(float), c_d, 1, c_h, 1);
free(a_h); free(b_h); free(c_h);
cublasFree(a_d); cublasFree(b_d);
cublasFree(c_d);
cublasShutdown();
return 0;
}
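For checking results from a call like the cublasSgemm('n', 'n', ...) above, a plain CPU reference can help. The sketch below is an assumption-labeled illustration, not part of CUBLAS: ref_sgemm_nn is a hypothetical name, and it computes C = alpha*A*B + beta*C for column-major matrices without transposes, following the usual BLAS convention. Because the example above passes beta = 0 with an uninitialized c_d, the sketch does not read C when beta is zero.

```c
/* Reference C = alpha*A*B + beta*C, no transposes, column-major storage.
   A is m x k with leading dimension lda, B is k x n (ldb), C is m x n (ldc). */
static void ref_sgemm_nn(int m, int n, int k, float alpha,
                         const float *A, int lda,
                         const float *B, int ldb,
                         float beta, float *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            if (beta == 0.0f)
                C[i + j * ldc] = alpha * acc;   /* do not read C */
            else
                C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

Element (i, j) of a column-major matrix with leading dimension ld lives at index i + j*ld, which is why the example passes N as every leading dimension for its N x N matrices.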
![Page 32: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/32.jpg)
Calling CUBLAS from FORTRAN
Two interfaces:
Thunking (define CUBLAS_USE_THUNKING when compiling fortran.c)
Allows interfacing to existing applications without any changes
During each call, the wrappers allocate GPU memory, copy source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy back the results to CPU memory space and deallocate the GPGPU memory
Intended for light testing due to call overhead
Non-Thunking (default)
Intended for production code
Substitute device pointers for vector and matrix arguments in all BLAS functions
Existing applications need to be modified slightly to allocate and deallocate data structures in GPGPU memory space (using CUBLAS_ALLOC and CUBLAS_FREE) and to copy data between GPU and CPU memory spaces (using CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX)
![Page 33: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/33.jpg)
SGEMM example (THUNKING)
! Define 3 single precision matrices A, B, C
real , dimension(m1,m1):: A, B, C
……
! Initialize
……
#ifdef CUBLAS
! Call SGEMM in CUBLAS library using THUNKING interface (library takes care of
! memory allocation on device and data movement)
call cublasSGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
#else
! Call SGEMM in host BLAS library
call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
#endif
To use the host BLAS routine:
g95 -O3 code.f90 -L/usr/local/lib -lblas
To use the CUBLAS routine (fortran.c is provided by NVIDIA):
gcc -O3 -DCUBLAS_USE_THUNKING -I/usr/local/cuda/include -c fortran.c
g95 -O3 -DCUBLAS code.f90 fortran.o -L/usr/local/cuda/lib -lcublas
![Page 34: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/34.jpg)
SGEMM example (NON-THUNKING)
! Define 3 single precision matrices A, B, C
real , dimension(m1,m1):: A, B, C
integer:: devPtrA, devPtrB, devPtrC, size_of_real=4
……
! Initialize A, B, C
………
! Allocate matrices on GPU
cublasAlloc(m1*m1, size_of_real, devPtrA)
cublasAlloc(m1*m1, size_of_real, devPtrB)
cublasAlloc(m1*m1, size_of_real, devPtrC)
!Copy data from CPU to GPU
cublasSetMatrix(m1,m1, size_of_real, A,m1, devPtrA, m1)
cublasSetMatrix(m1,m1, size_of_real, B,m1, devPtrB, m1)
cublasSetMatrix(m1,m1, size_of_real, C,m1, devPtrC, m1)
! Call SGEMM in CUBLAS library using NON-THUNKING interface (library is expecting data in GPU memory)
call cublasSGEMM ('n','n',m1,m1,m1,alpha,devPtrA,m1,devPtrB,m1,beta,devPtrC,m1)
!Copy data from GPU to CPU
cublasGetMatrix(m1,m1, size_of_real, devPtrC,m1, C, m1)
! Free memory on device
cublasFree(devPtrA)
……
g95 -O3 code.f90 -L/usr/local/cuda/lib -lcublas
![Page 35: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/35.jpg)
Volkov and Demmel (SC08)
| GPU name | GeForce GTX280 | GeForce 9800GTX | GeForce 8800GTX | GeForce 8600GTS |
|---|---|---|---|---|
| # of vector cores | 30 | 16 | 16 | 4 |
| core clock, GHz | 1.30 | 1.67 | 1.35 | 1.45 |
| registers/core | 64KB | 32KB | 32KB | 32KB |
| smem/core | 16KB | 16KB | 16KB | 16KB |
| memory bus, GHz | 1.1 | 1.1 | 0.9 | 1.0 |
| memory bus, pins | 512 | 256 | 384 | 128 |
| bandwidth, GB/s | 141 | 70 | 86 | 32 |
| memory amount | 1GB | 512MB | 768MB | 256MB |
| SP, peak Gflop/s | 624 | 429 | 346 | 93 |
| SP, peak per core | 21 | 27 | 22 | 23 |
| SP, flops:word | 18 | 25 | 16 | 12 |
| DP, peak Gflop/s | 78 | – | – | – |
| DP, flops:word | 4.4 | – | – | – |

Table 1: The list of the GPUs used in this study. SP is single precision and DP is double precision. Smem is shared memory. Peak flop rates are shown for multiply and add operations. Flops:word is the ratio of peak Gflop/s rate to pin-memory bandwidth in words.
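The flops:word figure in Table 1 can be reproduced with a line of arithmetic. The C sketch below is illustrative only: flops_per_word is a hypothetical helper, and it assumes a word is 4 bytes in single precision and 8 bytes in double precision. For the GeForce GTX280 row (624 Gflop/s SP, 78 Gflop/s DP, 141 GB/s) it gives roughly 17.7 and 4.4, consistent with the rounded table entries of 18 and 4.4.

```c
/* Ratio of peak arithmetic rate to memory bandwidth measured in words.
   peak_gflops: peak rate in Gflop/s
   bandwidth_gbs: pin-memory bandwidth in GB/s
   bytes_per_word: 4 for single precision, 8 for double (assumed) */
static double flops_per_word(double peak_gflops, double bandwidth_gbs,
                             double bytes_per_word)
{
    double gwords_per_s = bandwidth_gbs / bytes_per_word;
    return peak_gflops / gwords_per_s;
}
```

A ratio near 18 means a single-precision kernel must perform about 18 floating-point operations per word fetched from device memory to stay compute-bound, which is why blocking matters so much for GEMM.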
Benchmarking GPUs to Tune Dense Linear Algebra
Vasily Volkov
Computer Science Division, University of California at Berkeley
James W. Demmel
Computer Science Division and Department of Mathematics, University of California at Berkeley
Abstract
We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80–90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.
1 Introduction
We make the following contributions. For the first time, we show an LU, QR and Cholesky factorization that achieve computational rates over 300 Gflop/s on a GPU. These are three of the most widely used factorizations in dense linear algebra and pave the way for the implementation of the entire LAPACK library [Anderson et al. 1990] for the GPUs.
Our results also include performance on the 8-series of NVIDIA GPUs that was not previously attained in the 1.5 years since these GPUs were available. We provide new insights into programming these and newer GPUs that help us achieve performance in such basic kernels as matrix-matrix multiply that is 60% faster than those in the optimized vendor's library CUBLAS 1.1. Some of our codes have been licensed by NVIDIA and included in CUBLAS 2.0. In our approach we think of the GPU as a multithreaded vector unit and our best algorithms were found to closely resemble earlier solutions found for vector processors.
We perform detailed benchmarks of the GPU and reveal some of the bottlenecks, such as access to the on-chip memory that bounds the performance of our best codes, and kernel launch overhead that prohibits efficient fine-grain computations. The benchmarks reveal the structure of the GPU memory system, including sizes and latencies of the L1 and L2 caches and TLB. For the first time we implement and measure the performance of a global barrier that runs entirely on the GPU. We believe this is an important step towards operating GPUs with lower CPU intervention.
To achieve the best performance in matrix factorizations we use state of art techniques such as look-ahead, overlapping CPU and GPU computation, autotuning, smarter variants of 2-level blocking, and choosing the right memory layout; we also use a novel algorithm with modified numerics. We analyze the performance of our implementations in detail to show that all components of the final system run at the nearly optimal rates.
Our best speedups vs. one quad core CPU are over 4× in all three factorizations.
The rest of this paper is organized as follows. Section 2 describes the architecture of the GPUs we used, highlighting the features common to vector architectures. Section 3 benchmarks operations including memory transfer, kernel start-up, and barriers, and uses these to analyze the performance of the panel factorization of LU. Section 4 discusses the design and performance evaluation of matrix multiplication. Section 5 discusses the design of LU, QR and Cholesky, and Section 6 evaluates their performance. Section 7 summarizes and describes future work.
2 GPU Architecture
In this work we are concerned with programming 8 series, 9 series, and 200 series of NVIDIA GPUs, as listed in Table 1. For the description of their architecture see the CUDA programming guide [NVIDIA 2008a], technical briefs [NVIDIA 2006; NVIDIA 2008b] and lecture slides in the course on programming GPUs at the University of Illinois, Urbana-Champaign [Hwu and Kirk 2007]. Additional insights can be found in decuda [1], which is a third-party disassembler of GPU binaries based on reverse-engineering of the native instruction set. The instruction set called PTX that was released by vendor is an abstraction that requires further compilation and so provides fewer insights.
2.1 Notation
The GPU programming model used in the CUDA programming environment [NVIDIA 2008a] borrows much from abstractions used in graphics, e.g. such as used in the DirectX and OpenGL standards. GPU programs are run as collections of scalar threads that run faster if they remain convergent in an SIMD fashion. Similarly, individual arithmetic pipelines that execute scalar instructions are exposed as individual processing cores. For example, the technical brief on the latest GPU [NVIDIA 2008b]
[1] http://www.cs.rug.nl/~wladimir/decuda/
SC2008 November 2008, Austin, Texas, USA 978-1-4244-2835-9/08 $25.00 ©2008 IEEE
![Page 36: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/36.jpg)
Volkov and Demmel (SC08)
| GPU name | GeForce GTX280 | GeForce 9800GTX | GeForce 8800GTX | GeForce 8600GTS |
|---|---|---|---|---|
| # of vector cores | 30 | 16 | 16 | 4 |
| core clock, GHz | 1.30 | 1.67 | 1.35 | 1.45 |
| registers/core | 64KB | 32KB | 32KB | 32KB |
| smem/core | 16KB | 16KB | 16KB | 16KB |
| memory bus, GHz | 1.1 | 1.1 | 0.9 | 1.0 |
| memory bus, pins | 512 | 256 | 384 | 128 |
| bandwidth, GB/s | 141 | 70 | 86 | 32 |
| memory amount | 1GB | 512MB | 768MB | 256MB |
| SP, peak Gflop/s | 624 | 429 | 346 | 93 |
| SP, peak per core | 21 | 27 | 22 | 23 |
| SP, flops:word | 18 | 25 | 16 | 12 |
| DP, peak Gflop/s | 78 | - | - | - |
| DP, flops:word | 4.4 | - | - | - |

Table 1: The list of the GPUs used in this study. SP is single precision and DP is double precision. Smem is shared memory. Peak flop rates are shown for multiply and add operations. Flops:word is the ratio of peak Gflop/s rate to pin-memory bandwidth in words.
Benchmarking GPUs to Tune Dense Linear Algebra
Vasily Volkov
Computer Science Division, University of California at Berkeley
James W. Demmel
Computer Science Division and Department of Mathematics, University of California at Berkeley
Abstract
We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.
1 Introduction
We make the following contributions. For the first time, we show an LU, QR and Cholesky factorization that achieve computational rates over 300 Gflop/s on a GPU. These are three of the most widely used factorizations in dense linear algebra and pave the way for the implementation of the entire LAPACK library (Anderson et al. 1990) for the GPUs.
Our results also include performance on the 8-series of NVIDIA GPUs that was not previously attained in the 1.5 years since these GPUs were available. We provide new insights into programming these and newer GPUs that help us achieve performance in such basic kernels as matrix-matrix multiply that is 60% faster than those in the optimized vendor's library CUBLAS 1.1. Some of our codes have been licensed by NVIDIA and included in CUBLAS 2.0. In our approach we think of the GPU as a multithreaded vector unit and our best algorithms were found to closely resemble earlier solutions found for vector processors.
We perform detailed benchmarks of the GPU and reveal some of the bottlenecks, such as access to the on-chip memory that bounds the performance of our best codes, and kernel launch overhead that prohibits efficient fine-grain computations. The benchmarks reveal the structure of the GPU memory system, including sizes and latencies of the L1 and L2 caches and TLB. For the first time we implement and measure the performance of a global barrier that runs entirely on the GPU. We believe this is an important step towards operating GPUs with lower CPU intervention.
To achieve the best performance in matrix factorizations we use state of art techniques such as look-ahead, overlapping CPU and GPU computation, autotuning, smarter variants of 2-level blocking, and choosing the right memory layout; we also use a novel algorithm with modified numerics. We analyze the performance of our implementations in detail to show that all components of the final system run at the nearly optimal rates.
Our best speedups vs. one quad core CPU are over 4× in all three factorizations.
Library
![Page 37: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/37.jpg)
M02: High Performance Computing with CUDA
DGEMM Performance
Library
![Page 38: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/38.jpg)
© 2008 NVIDIA Corporation.
Additional Resources
CUDA SDK example: simpleCUBLAS
CUBLAS Library documentation: in the doc folder of the CUDA Toolkit, or download from CUDA Zone
Library
![Page 39: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/39.jpg)
CUFFT
The Fast Fourier Transform (FFT) is a divide-and-conquer algorithm for efficiently computing the discrete Fourier transform of complex or real-valued data sets.
CUFFT is the CUDA FFT library
Provides a simple interface for computing parallel FFTs on an NVIDIA GPU
Allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation
Library
![Page 40: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/40.jpg)
Supported Features
1D, 2D and 3D transforms of complex and real-valued data
Batched execution for doing multiple 1D transforms in parallel
1D transform size up to 8M elements
2D and 3D transform sizes in the range [2, 16384]
In-place and out-of-place transforms for real and complex data
Library
![Page 41: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/41.jpg)
Transform Types
Library supports real and complex transforms: CUFFT_C2C, CUFFT_C2R, CUFFT_R2C
Directions: CUFFT_FORWARD (-1) and CUFFT_INVERSE (1), according to the sign of the complex exponential term
Real and imaginary parts of complex input and output arrays are interleaved; the cufftComplex type is defined for this
Real-to-complex FFTs: the output array holds only the non-redundant coefficients
N -> N/2+1
N0 x N1 x … x Nn -> N0 x N1 x … x (Nn/2+1)
For in-place transforms the input/output arrays need to be padded
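The size rule above can be turned into two small helpers. This is only a sketch in plain C: the function names `r2c_output_len` and `r2c_inplace_real_len` are made up for illustration and are not part of CUFFT, and the padding rule assumed here is the usual one for in-place real-to-complex transforms (the real buffer must be large enough to hold the interleaved complex output).

```c
#include <stddef.h>

/* Number of complex outputs kept by a 1D real-to-complex FFT of n real
 * inputs: only the non-redundant half of the spectrum, n/2 + 1.
 * (Hypothetical helper, not a CUFFT function.) */
size_t r2c_output_len(size_t n)
{
    return n / 2 + 1;
}

/* Number of real (float) slots an in-place 1D R2C buffer must provide:
 * each complex output occupies two interleaved floats. */
size_t r2c_inplace_real_len(size_t n)
{
    return 2 * r2c_output_len(n);
}
```

For N = 1024 this gives 513 complex outputs, so an in-place buffer would hold 1026 floats rather than 1024.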
Library
![Page 42: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/42.jpg)
More on Transforms
For 2D and 3D transforms, CUFFT performs transforms in row-major (C) order
If calling from FORTRAN or MATLAB, remember to change the order of the size parameters during plan creation
CUFFT performs un-normalized transforms: IFFT(FFT(A)) = length(A)*A
The CUFFT API is modeled after FFTW. It is based on plans, which completely specify the optimal configuration to execute a particular size of FFT
Once a plan is created, the library stores whatever state is needed to execute the plan multiple times without recomputing the configuration
This works very well for CUFFT, because different kinds of FFTs require different thread configurations and GPU resources
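To see where the length(A) factor comes from, here is a tiny CPU-only sketch of an un-normalized 4-point DFT. It is not CUFFT code: the type `cpx` and the functions `mul_i_pow` and `dft4` are hypothetical names, and the size is fixed at 4 so that every twiddle factor is a power of i and no trigonometry is needed.

```c
/* Interleaved complex value, laid out like the slides describe for
 * cufftComplex (real part, then imaginary part). Hypothetical type. */
typedef struct { float x, y; } cpx;

/* Multiply v by i^p for p in 0..3 (exact, no rounding). */
static cpx mul_i_pow(cpx v, int p)
{
    cpx r;
    switch (p & 3) {
    case 0:  r.x =  v.x; r.y =  v.y; break;   /* times  1 */
    case 1:  r.x = -v.y; r.y =  v.x; break;   /* times  i */
    case 2:  r.x = -v.x; r.y = -v.y; break;   /* times -1 */
    default: r.x =  v.y; r.y = -v.x; break;   /* times -i */
    }
    return r;
}

/* Un-normalized 4-point DFT. sign = -1 is the forward direction,
 * sign = +1 the inverse; neither applies a 1/4 scale. */
void dft4(const cpx in[4], cpx out[4], int sign)
{
    for (int k = 0; k < 4; k++) {
        cpx sum = {0.0f, 0.0f};
        for (int j = 0; j < 4; j++) {
            /* twiddle = exp(sign * 2*pi*i * j*k / 4) = i^(sign*j*k) */
            int p = (sign * j * k) % 4;
            if (p < 0) p += 4;
            cpx t = mul_i_pow(in[j], p);
            sum.x += t.x;
            sum.y += t.y;
        }
        out[k] = sum;
    }
}
```

Running `dft4` forward and then inverse returns 4 times the input, which is why the CUFFT example divides by N before comparing against the original data.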
Library
![Page 43: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/43.jpg)
CUFFT Types and Definitions
cufftHandle: type used to store and access CUFFT plans
cufftResult: enumeration of API function return values
cufftReal: single-precision, real datatype
cufftComplex: single-precision, complex datatype
Real and complex transforms: CUFFT_C2C, CUFFT_C2R, CUFFT_R2C
Directions: CUFFT_FORWARD, CUFFT_INVERSE
Library
![Page 44: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/44.jpg)
CUFFT Example
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main(int argc, char *argv[])
    {
        cufftComplex *a_h, *a_d;
        cufftHandle plan;
        int N = 1024, batchSize = 10;
        int i, nBytes;
        double maxError;

        /* Fill a host array with batchSize signals of length N */
        nBytes = sizeof(cufftComplex)*N*batchSize;
        a_h = (cufftComplex *)malloc(nBytes);
        for (i = 0; i < N*batchSize; i++) {
            a_h[i].x = sinf(i);
            a_h[i].y = cosf(i);
        }

        /* Copy the input to the device */
        cudaMalloc((void **)&a_d, nBytes);
        cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

        /* Batched 1D plan, then forward and inverse transforms in place */
        cufftPlan1d(&plan, N, CUFFT_C2C, batchSize);
        cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);
        cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);

        cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);

        /* check error - normalize (the round trip scales the data by N) */
        for (maxError = 0.0, i = 0; i < N*batchSize; i++) {
            maxError = fmax(fabs(a_h[i].x/N - sinf(i)), maxError);
            maxError = fmax(fabs(a_h[i].y/N - cosf(i)), maxError);
        }
        printf("Max fft error = %g\n", maxError);

        cufftDestroy(plan);
        free(a_h); cudaFree(a_d);
        return 0;
    }
Library
![Page 45: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/45.jpg)
Additional CUFFT Resources
CUDA SDK examples: simpleCUFFT, convolutionFFT2D, oceanFFT
CUFFT Library documentation: in the doc folder of the CUDA Toolkit, or download from CUDA Zone
Library
![Page 46: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/46.jpg)
![Page 47: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/47.jpg)
Glue ?
![Page 48: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/48.jpg)
Interfacing CUDA
IAP09 CUDA@MIT / 6.963
![Page 49: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/49.jpg)
Interfacing CUDA with other languages
CUDA kernels from FORTRAN, allocate pinned memory from FORTRAN
Calling CUDA from MATLAB with MEX files
Several packages (open source and commercial) to interface CUDA with Python, IDL, .NET, FORTRAN (Flagon). Browse CUDA Zone to find all the packages.
Glue
![Page 50: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/50.jpg)
Pinned memory from FORTRAN
use iso_c_binding
! The allocation is performed by C function calls. Define the C pointer as type (C_PTR)
type(C_PTR) :: cptr_A, cptr_B, cptr_C
! Define Fortran arrays as pointer.
real, dimension(:,:), pointer :: A, B, C
! Allocating memory with cudaMallocHost.
! The Fortran arrays, now defined as pointers, are then associated with the C pointers using the
! new interoperability defined in iso_c_binding. This is equivalent to allocate(A(m1,m1))
res = cudaMallocHost ( cptr_A, m1*m1*sizeof(fp_kind) )
call c_f_pointer ( cptr_A, A, (/ m1, m1 /) )
! Use A as usual.
! See example code for cudaMallocHost interface code
Pinned memory provides a fast PCI-e transfer speed and enables use of streams:
•Allocation needs to be done with cudaMallocHost
•Use new Fortran 2003 features for interoperability with C.
http://www.nvidia.com/object/cuda_programming_tools.html
Glue
![Page 51: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/51.jpg)
Calling CUDA kernels from FORTRAN
From Fortran, call a C function that will call the CUDA kernel:
! Fortran -> C -> CUDA -> C -> Fortran
call cudafunction(c,c2,N)
/* NB: Fortran subroutine arguments are passed by reference. */
extern "C" void cudafunction_(cuComplex *a, cuComplex *b, int *Np)
{
    ...
    int N = *Np;
    cudaMalloc((void **)&a_d, sizeof(cuComplex)*N);
    cudaMemcpy(a_d, a, sizeof(cuComplex)*N, cudaMemcpyHostToDevice);
    dim3 dimBlock(block_size);
    dim3 dimGrid(N/dimBlock.x);
    if (N % block_size != 0) dimGrid.x += 1;
    square_complex<<<dimGrid,dimBlock>>>(a_d, a_d, N);
    cudaMemcpy(b, a_d, sizeof(cuComplex)*N, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
}
Makefile:
complex_mul: main.f90 cuda_function.o
	$(FC) -o complex_mul main.f90 cuda_function.o -L/usr/local/cuda/lib -lcudart
cuda_function.o: cuda_function.cu
	nvcc -c -O3 cuda_function.cu
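The calling convention used above (lower-case name with a trailing underscore, every argument passed by reference) can be tried without a GPU. The sketch below is a CPU-only stand-in: `square_array_` is a hypothetical routine, not part of the slide's code, and the trailing underscore is a common compiler convention rather than a guarantee (Fortran 2003 `bind(C)` is the portable alternative).

```c
/* Hypothetical CPU-only stand-in for a wrapper such as cudafunction_.
 * A Fortran caller would write:  call square_array(a, b, n)
 * and the scalar n arrives here as a pointer. */
void square_array_(const float *a, float *b, const int *np)
{
    int n = *np;                 /* dereference the by-reference scalar */
    for (int i = 0; i < n; i++)
        b[i] = a[i] * a[i];
}
```

From C the same routine has to be called with the address of the length, which mirrors what the Fortran compiler does behind the scenes.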
Glue
![Page 52: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/52.jpg)
CUDA & MATLAB
Even though MATLAB is built on many well-optimized libraries, some functions can perform better when written in a compiled language (e.g. C and Fortran).
MATLAB provides a convenient API for interfacing code written in C and FORTRAN to MATLAB functions with MEX files.
MEX files could be used to exploit multi-core processors with OpenMP or threaded codes or, as in this case, to offload functions to the GPU.
Glue
![Page 53: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/53.jpg)
NVMEX
Native MATLAB script cannot parse CUDA code
New MATLAB script nvmex.m compiles CUDA code (.cu) to create MATLAB function files
Syntax similar to original mex script:
>> nvmex -f nvmexopts.bat filename.cu -IC:\cuda\include -LC:\cuda\lib -lcudart
Available for Windows and Linux from:
http://developer.nvidia.com/object/matlab_cuda.html
Glue
![Page 54: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/54.jpg)
Mex files for CUDA
A typical mex file will perform the following steps:
1. Convert from double to single precision
2. Rearrange the data layout for complex data
3. Allocate memory on the GPU
4. Transfer the data from the host to the GPU
5. Perform computation on GPU (library, custom code)
6. Transfer results from the GPU to the host
7. Rearrange the data layout for complex data
8. Convert from single to double
9. Clean up memory and return results to MATLAB
Some of these steps will go away with new versions of the library (2, 7) and new hardware (1, 8)
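Steps 1-2 and 7-8 are plain host-side loops. A minimal sketch follows, assuming MATLAB hands over complex data as two separate double arrays (real parts and imaginary parts, with the imaginary pointer possibly NULL for real input). The type `complex_f` and both function names are illustrative only and are not part of the MEX or CUFFT APIs.

```c
#include <stddef.h>

/* Interleaved single-precision complex value: real part, then imaginary. */
typedef struct { float x, y; } complex_f;

/* Steps 1-2: split double arrays -> interleaved single precision. */
void split_to_interleaved(const double *re, const double *im,
                          complex_f *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i].x = (float)re[i];
        out[i].y = im ? (float)im[i] : 0.0f;   /* im == NULL: real input */
    }
}

/* Steps 7-8: interleaved single precision -> split double arrays. */
void interleaved_to_split(const complex_f *in, double *re, double *im,
                          size_t n)
{
    for (size_t i = 0; i < n; i++) {
        re[i] = in[i].x;
        im[i] = in[i].y;
    }
}
```

Both passes touch every element once on the CPU, which is part of why the slide counts them as overhead that newer library versions and hardware can remove.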
Glue
![Page 55: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/55.jpg)
CUDA MEX example
/*Parse input, convert to single precision and to interleaved complex format */
…..
/* Allocate array on the GPU */
cufftComplex *rhs_complex_d;
cudaMalloc( (void **) &rhs_complex_d,sizeof(cufftComplex)*N*M);
/* Copy input array in interleaved format to the GPU */
cudaMemcpy( rhs_complex_d, input_single, sizeof(cufftComplex)*N*M, cudaMemcpyHostToDevice);
/* Create plan for CUDA FFT NB: transposing dimensions*/
cufftPlan2d(&plan, N, M, CUFFT_C2C) ;
/* Execute FFT on GPU */
cufftExecC2C(plan, rhs_complex_d, rhs_complex_d, CUFFT_INVERSE) ;
/* Copy result back to host */
cudaMemcpy( input_single, rhs_complex_d, sizeof(cufftComplex)*N*M, cudaMemcpyDeviceToHost);
/* Clean up memory and plan on the GPU */
cufftDestroy(plan); cudaFree(rhs_complex_d);
/*Convert back to double precision and to split complex format */
….
Additional code in MEX file to handle CUDA
Glue
![Page 56: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/56.jpg)
Timing details
1024x1024 mesh, 400 RK4 steps on Windows, 2D isotropic turbulence

| | Runtime, Opteron 250 | Speed up | Runtime, Opteron 2210 | Speed up |
|---|---|---|---|---|
| Standard MATLAB | 8098 s | | 9525 s | |
| Overload FFT2 and IFFT2 | 4425 s | 1.8x | 4937 s | 1.9x |
| Overload Szeta | 735 s | 11.x | 789 s | 12.x |
| Overload Szeta, FFT2 and IFFT2 | 577 s | 14.x | 605 s | 15.7x |

PCI-e bandwidth, host to/from device: 1483 MB/s, 1223 MB/s, 1135 MB/s, 1003 MB/s
Glue
![Page 57: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/57.jpg)
Glue
![Page 58: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/58.jpg)
Glue
![Page 59: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/59.jpg)
Glue
![Page 60: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/60.jpg)
Glue
![Page 61: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/61.jpg)
Glue
![Page 62: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/62.jpg)
![Page 63: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/63.jpg)
Wanna Play with The Big Guys?
![Page 64: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/64.jpg)
CUDA Performance Strategies
IAP09 CUDA@MIT / 6.963
![Page 65: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/65.jpg)
© NVIDIA Corporation 2006
Programming Model
A kernel is executed as a grid of thread blocks
A thread block is a batch of threads that can cooperate with each other by:
Sharing data through shared memory
Synchronizing their execution
Threads from different blocks cannot cooperate
[Figure: the Host launches Kernel 1 and Kernel 2 on the Device. Kernel 1 runs as Grid 1, made of Block (0, 0) through Block (2, 1); Kernel 2 runs as Grid 2. Block (1, 1) is expanded to show its threads, Thread (0, 0) through Thread (4, 2).]
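The grid-of-blocks picture boils down to two pieces of index arithmetic. The sketch below is host-side C with hypothetical helper names (`blocks_for`, `global_index`); inside a real kernel the second expression would be written with the built-in `blockIdx.x`, `blockDim.x` and `threadIdx.x`.

```c
/* Number of blocks needed so that blocks * block_size >= n
 * (ceiling division). Hypothetical helper. */
unsigned int blocks_for(unsigned int n, unsigned int block_size)
{
    return (n + block_size - 1) / block_size;
}

/* Global index of a thread in a 1D grid of 1D blocks. */
unsigned int global_index(unsigned int block_idx, unsigned int block_dim,
                          unsigned int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}
```

When n is not a multiple of the block size, the last block has threads whose global index is n or more, so a kernel normally guards its body with a bounds check.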
Threading
![Page 66: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/66.jpg)
© NVIDIA Corporation 2008
Data Movement in a CUDA Program
Host Memory
Device Memory
[Shared Memory]
COMPUTATION
[Shared Memory]
Device Memory
Host Memory
Memory
![Page 67: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/67.jpg)
Optimize Algorithms for the GPU
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Sometimes it's better to recompute than to cache
GPU spends its transistors on ALUs, not memory
Do more computation on the GPU to avoid costly data transfers
Even low parallelism computations can sometimes be faster than transferring back and forth to host
Perf
![Page 68: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/68.jpg)
Optimize Memory Coherence
Coalesced vs. non-coalesced = order of magnitude
Global/Local device memory
Optimize for spatial locality in cached texture memory
In shared memory, avoid high-degree bank conflicts
Perf
![Page 69: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/69.jpg)
Take Advantage of Shared Memory
Hundreds of times faster than global memory
Threads can cooperate via shared memory
Use one / a few threads to load / compute data shared by all threads
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
Matrix transpose example later
Perf
![Page 70: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/70.jpg)
Use Parallelism Efficiently
Partition your computation to keep the GPU multiprocessors equally busy
Many threads, many thread blocks
Keep resource usage low enough to support multiple active thread blocks per multiprocessor
Registers, shared memory
Perf
![Page 71: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/71.jpg)
Memory optimizations
Optimizing memory transfers
Coalescing global memory accesses
Using shared memory effectively
Perf
![Page 72: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/72.jpg)
Data Transfers
Device memory to host memory bandwidth much lower than device memory to device bandwidth
4GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600)
8GB/s for PCI-e 2.0
Minimize transfers
Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
Group transfers
One large transfer much better than many small ones
Perf
![Page 73: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/73.jpg)
Page-Locked Memory Transfers
cudaMallocHost() allows allocation of page-locked host memory
Enables highest cudaMemcpy performance
3.2 GB/s+ common on PCI-express (x16)
~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e)
See the "bandwidthTest" CUDA SDK sample
Use with caution
Allocating too much page-locked memory can reduce overall system performance
Test your systems and apps to learn their limits
Perf
![Page 74: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/74.jpg)
Global Memory Reads/Writes
Highest latency instructions: 400-600 clock cycles
Likely to be performance bottleneck
Optimizations can greatly increase performance
Coalescing: up to 10x speedup
Latency hiding: up to 2.5x speedup
Perf
![Page 75: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/75.jpg)
Coalescing
A coordinated read by a half-warp (16 threads)
A contiguous region of global memory:
64 bytes - each thread reads a word: int, float, …
128 bytes - each thread reads a double-word: int2, float2, …
256 bytes - each thread reads a quad-word: int4, float4, …
Additional restrictions on G8X/G9X architecture:
Starting address for a region must be a multiple of region size
The kth thread in a half-warp must access the kth element in a block being read
Exception: not all threads must be participating
Predicated access, divergence within a half-warp
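The G8X/G9X rules listed above can be written down as a small host-side checker. This is a simplified model for reasoning about access patterns, not a hardware simulator; `is_coalesced`, its arguments and the `HALF_WARP` constant are hypothetical names chosen for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define HALF_WARP 16

/* Returns 1 if the pattern satisfies the rules on the slide:
 * every active thread k reads the k-th word of one region whose start
 * is a multiple of the region size (HALF_WARP * word_size).
 * addr[k]   : byte address read by thread k
 * active[k] : 0 if thread k does not participate
 * word_size : 4, 8 or 16 bytes */
int is_coalesced(const uintptr_t addr[HALF_WARP],
                 const int active[HALF_WARP], size_t word_size)
{
    size_t region = HALF_WARP * word_size;
    int have_base = 0;
    uintptr_t base = 0;

    for (int k = 0; k < HALF_WARP; k++) {
        if (!active[k])
            continue;                      /* inactive threads are allowed */
        if (addr[k] < (uintptr_t)k * word_size)
            return 0;
        uintptr_t b = addr[k] - (uintptr_t)k * word_size;
        if (!have_base) {
            if (b % region != 0)
                return 0;                  /* misaligned starting address */
            base = b;
            have_base = 1;
        } else if (b != base) {
            return 0;                      /* thread k is not on element k */
        }
    }
    return 1;
}
```

With floats starting at byte address 128 the check passes, also when a few threads sit out; swapping two threads or starting at 132 makes it fail.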
Perf
![Page 76: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/76.jpg)
Coalesced Access: Reading floats
[Figure: threads t0 to t15 of a half-warp read consecutive floats from a row of addresses running from 128 to 192 in 4-byte steps. Two panels: "All threads participate" and "Some Threads Do Not Participate".]
Perf
![Page 77: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/77.jpg)
Uncoalesced Access: Reading floats
[Figure: threads t0 to t15 over the same row of addresses, 128 to 192. Two panels: "Permuted Access by Threads" and "Misaligned Starting Address (not a multiple of 64)".]
Perf
![Page 78: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/78.jpg)
Coalescing: Timing Results
Experiment on G80:
Kernel: read a float, increment, write back
3M floats (12MB)
Times averaged over 10K runs
12K blocks x 256 threads:
356µs - coalesced
357µs - coalesced, some threads don't participate
3,494µs - permuted/misaligned thread access
Perf
![Page 79: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/79.jpg)
Coalescing: Structures of size ≠ 4, 8, or 16 Bytes
Use a Structure of Arrays (SoA) instead of Array of Structures (AoS)
If SoA is not viable:
Force structure alignment: __align__(X), where X = 4, 8, or 16
Use SMEM to achieve coalescing
x y z: Point structure
x y z x y z x y z: AoS
x x x y y y z z z: SoA
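The layouts in the diagram can be compared by looking at the byte offset each thread would read. A small sketch in plain C, with `Point`, `aos_x_offset` and `soa_x_offset` as illustrative names only:

```c
#include <stddef.h>

/* Three floats: 12 bytes, which is not 4, 8, or 16. */
typedef struct { float x, y, z; } Point;

/* AoS: byte offset of the x field of element k in an array of Point.
 * Neighbouring threads end up 12 bytes apart. */
size_t aos_x_offset(size_t k)
{
    return k * sizeof(Point) + offsetof(Point, x);
}

/* SoA: byte offset of element k in a separate x[] array.
 * Neighbouring threads end up 4 bytes apart, i.e. contiguous words. */
size_t soa_x_offset(size_t k)
{
    return k * sizeof(float);
}
```

In the SoA case the 16 threads of a half-warp cover one contiguous 64-byte region, which is the pattern the coalescing rules ask for; in the AoS case the same reads are spread over 192 bytes.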
Perf
![Page 80: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/80.jpg)
Coalescing: Summary
Coalescing greatly improves throughput
Critical to memory-bound kernels
Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
Prefer Structures of Arrays over AoS
If SoA is not viable, read/write through SMEM
Additional resources: Aligned Types SDK Sample
Perf
![Page 81: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/81.jpg)
Parallel Memory Architecture
In a parallel machine, many threads access memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
[Figure: memory drawn as a column of banks, Bank 0 through Bank 15.]
Perf
![Page 82: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/82.jpg)
Bank Addressing Examples
No Bank Conflicts: linear addressing, stride == 1
No Bank Conflicts: random 1:1 permutation
[Figure: for each case, Thread 0 through Thread 15 are mapped onto Bank 0 through Bank 15.]
Perf
![Page 83: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/83.jpg)
Bank Addressing Examples
2-way Bank Conflicts: linear addressing, stride == 2
8-way Bank Conflicts: linear addressing, stride == 8
[Figure: for each case, threads of a half-warp are mapped onto Bank 0 through Bank 15; in the stride == 8 case the banks that are hit carry an "x8" label.]
Perf
![Page 84: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/84.jpg)
How addresses map to banks on G80
Bandwidth of each bank is 32 bits per 2 clock cycles
Successive 32-bit words are assigned to successive banks
G80 has 16 banks
So bank = address % 16
Same as the size of a half-warp
No bank conflicts between different half-warps, only within a single half-warp
Perf
![Page 85: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/85.jpg)
Shared memory bank conflicts
Shared memory is as fast as registers if there are no bank conflicts
The fast case:
  If all threads of a half-warp access different banks, there is no bank conflict
  If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
The slow case:
  Bank Conflict: multiple threads in the same half-warp access the same bank
  Must serialize the accesses
  Cost = max # of simultaneous accesses to a single bank
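The cost rule can be written down directly. The following host-side C++ sketch is a simplified model, not a description of the hardware: it assumes 16 banks of 32-bit words, takes the byte addresses touched by one half-warp, and only recognises the broadcast case when every thread reads the same address. bank_of and conflict_cost are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bank of a shared-memory byte address, assuming successive 32-bit words
// go to successive banks and there are 16 banks.
static unsigned bank_of(unsigned byte_address) {
    return (byte_address / 4u) % 16u;
}

// Serialization cost for one half-warp: the largest number of threads
// that hit the same bank. If all threads read the identical address the
// word is broadcast and the cost is 1.
static unsigned conflict_cost(const std::vector<unsigned>& byte_addresses) {
    assert(!byte_addresses.empty() && byte_addresses.size() <= 16);
    bool all_same = true;
    for (std::size_t i = 1; i < byte_addresses.size(); ++i)
        if (byte_addresses[i] != byte_addresses[0]) all_same = false;
    if (all_same) return 1;  // broadcast case

    unsigned hits[16] = {0};
    unsigned worst = 0;
    for (std::size_t i = 0; i < byte_addresses.size(); ++i) {
        unsigned b = bank_of(byte_addresses[i]);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

A half-warp reading consecutive floats (byte addresses 0, 4, 8, ...) costs 1, the same half-warp reading every other float costs 2, and sixteen reads of one address also cost 1 through the broadcast path.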
![Page 86: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/86.jpg)
Conflicts, Coalescing, Warps... I hate growing up.
![Page 87: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/87.jpg)
Optimization Example: Matrix Transpose
![Page 88: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/88.jpg)
Matrix Transpose
SDK Sample ("transpose")
Illustrates:
  Coalescing
  Avoiding SMEM bank conflicts
  Speedups for even small matrices
[Figure: a 4x4 matrix and its transpose]
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16

1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
![Page 89: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/89.jpg)
Uncoalesced Transpose
    __global__ void transpose_naive(float *odata, float *idata, int width, int height)
    {
        // One thread per matrix element
        unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
        unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
        if (xIndex < width && yIndex < height)
        {
            // Adjacent threads read adjacent words (coalesced) ...
            unsigned int index_in = xIndex + width * yIndex;
            // ... but write words that are `height` apart (uncoalesced)
            unsigned int index_out = yIndex + height * xIndex;
            odata[index_out] = idata[index_in];
        }
    }
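A host-side reference is handy for checking any transpose kernel. The sketch below reuses the index arithmetic of the naive kernel in plain C++; transpose_reference is an illustrative name, and the matrices are assumed to be stored row-major in std::vector<float>.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference transpose using the same index arithmetic as the
// naive kernel: element (x, y) of a row-major width x height input lands
// at (y, x) of the row-major height x width output.
static std::vector<float> transpose_reference(const std::vector<float>& in,
                                              unsigned width, unsigned height) {
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> out(in.size());
    for (unsigned y = 0; y < height; ++y) {
        for (unsigned x = 0; x < width; ++x) {
            unsigned index_in  = x + width * y;   // adjacent x: stride 1
            unsigned index_out = y + height * x;  // adjacent x: stride `height`
            out[index_out] = in[index_in];
        }
    }
    return out;
}
```

Comparing the GPU result against transpose_reference element by element catches indexing mistakes before any time is spent on tuning.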
![Page 90: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/90.jpg)
Uncoalesced Transpose
Reads input from GMEM
  Stride = 1, coalesced
Write output to GMEM
  Stride = 16, uncoalesced
[Figure: two 16x16 grids of index pairs from 0,0 to 15,15 in GMEM, showing the order in which threads read the input and write the transposed output]
![Page 91: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/91.jpg)
Coalesced Transpose
Assumption: matrix is partitioned into square tiles
Threadblock (bx, by):
  Read the (bx, by) input tile, store into SMEM
  Write the SMEM data to (by, bx) output tile
    Transpose the indexing into SMEM
Thread (tx, ty):
  Reads element (tx, ty) from input tile
  Writes element (tx, ty) into output tile
Coalescing is achieved if:
  Block/tile dimensions are multiples of 16
![Page 92: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/92.jpg)
Coalesced Transpose
[Figure: four 16x16 grids of index pairs (0,0 to 15,15) tracing one tile through the kernel. Panels: "Reads from GMEM", "Writes to SMEM", "Reads from SMEM", "Writes to GMEM"]
![Page 93: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/93.jpg)
SMEM Optimization
Threads read SMEM with stride = 16
  Bank conflicts
Solution
  Allocate an "extra" column
  Read stride = 17
  Threads read from consecutive banks
[Figure: "Reads from SMEM" index grids (0,0 to 15,15) for the stride = 16 tile and for the tile with the extra column]
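Why one extra column is enough can be checked on the host. This is a small C++ sketch under the same assumptions as the slides (16 banks of 32-bit words, 16 threads per half-warp, thread t reading row t of a fixed column); column_read_conflicts is an illustrative name.

```cpp
#include <cassert>

// Worst-case number of half-warp threads hitting one bank when 16 threads
// read one column of a shared-memory tile (thread t reads row t).
// `pitch` is the row length in 32-bit words: 16 for an unpadded 16x16
// tile, 17 with one "extra" column. Assumes 16 banks.
static unsigned column_read_conflicts(unsigned pitch, unsigned column) {
    assert(column < pitch);
    unsigned hits[16] = {0};
    unsigned worst = 0;
    for (unsigned t = 0; t < 16; ++t) {
        unsigned word = t * pitch + column;  // index into the tile array
        unsigned bank = word % 16u;
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

With pitch 16 every thread lands in the same bank (a 16-way conflict). With pitch 17 the word index is 17 * t + column, which is t + column modulo 16, so the 16 threads hit 16 different banks.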
![Page 94: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/94.jpg)
![Page 95: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/95.jpg)
Coalesced Transpose
    #define BLOCK_DIM 16  // tile edge; assumed equal to blockDim.x and blockDim.y

    __global__ void transpose(float *odata, float *idata, int width, int height)
    {
        // BLOCK_DIM+1 words per row: the extra column avoids SMEM bank conflicts
        __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

        unsigned int xBlock = blockDim.x * blockIdx.x;
        unsigned int yBlock = blockDim.y * blockIdx.y;
        unsigned int xIndex = xBlock + threadIdx.x;
        unsigned int yIndex = yBlock + threadIdx.y;
        unsigned int index_out, index_transpose;

        if (xIndex < width && yIndex < height)
        {
            // Coalesced read of the input tile into shared memory
            unsigned int index_in = width * yIndex + xIndex;
            unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
            block[index_block] = idata[index_in];

            // Transposed position inside the tile and in the output matrix
            index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
            index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
        }

        // Wait until the whole tile has been written to shared memory
        __syncthreads();

        if (xIndex < width && yIndex < height)
            odata[index_out] = block[index_transpose];  // coalesced write
    }
![Page 96: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/96.jpg)
Transpose Timings
Speedups with coalescing and SMEM optimization:
  128x128: 0.011ms vs. 0.022ms (2.0X speedup)
  512x512: 0.07ms vs. 0.33ms (4.5X speedup)
  1024x1024: 0.30ms vs. 1.92ms (6.4X speedup)
  1024x2048: 0.79ms vs. 6.6ms (8.4X speedup)
Coalescing without SMEM optimization:
  128x128: 0.014ms
  512x512: 0.101ms
  1024x1024: 0.412ms
  1024x2048: 0.869ms
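The timings are easier to compare once they are turned into speedups and effective bandwidth. Below is a small host-side C++ sketch of that arithmetic. It assumes float elements, counts one read and one write per element, and uses 1 GB = 10^9 bytes; the function names are illustrative. Because the slide's timings are rounded, the recomputed factors agree with the quoted ones only approximately (0.33 / 0.07 is about 4.7 against the quoted 4.5X).

```cpp
#include <cassert>

// Speedup of the optimized kernel relative to the naive one.
static double speedup(double naive_ms, double optimized_ms) {
    assert(optimized_ms > 0.0);
    return naive_ms / optimized_ms;
}

// Effective bandwidth in GB/s for a float transpose: every element is
// read once and written once (2 * 4 bytes per element).
static double effective_bandwidth_gbs(unsigned width, unsigned height, double ms) {
    assert(ms > 0.0);
    double bytes = 2.0 * 4.0 * width * height;
    return bytes / (ms * 1e-3) / 1e9;
}
```

For the 1024x1024 case, speedup(1.92, 0.30) gives 6.4 and effective_bandwidth_gbs(1024, 1024, 0.30) gives roughly 28 GB/s, which can be set against the memory bandwidth of the card being used.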
![Page 97: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/97.jpg)
Execution Configuration Optimizations
![Page 98: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/98.jpg)
Occupancy
Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently
Limited by resource usage:
  Registers
  Shared memory
![Page 99: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/99.jpg)
Grid/Block Size Heuristics
Number of blocks > number of multiprocessors
  So all multiprocessors have at least one block to execute
Number of blocks / number of multiprocessors > 2
  Multiple blocks can run concurrently in a multiprocessor
  Blocks that aren't waiting at a __syncthreads() keep the hardware busy
  Subject to resource availability: registers, shared memory
Number of blocks > 100 to scale to future devices
  Blocks executed in pipeline fashion
  1000 blocks per grid will scale across multiple generations
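These rules of thumb are easy to check before launching a kernel. The sketch below is host-side C++ and purely illustrative: blocks_for, GridAdvice and check_grid are made-up names, and the multiprocessor count has to come from the device being targeted (16 for a GeForce 8800 GTX, for instance).

```cpp
#include <cassert>

// Blocks needed to cover n elements, one thread per element
// (ceiling division).
static unsigned blocks_for(unsigned n, unsigned threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// The three rules of thumb from the slide, checked for a given grid.
struct GridAdvice {
    bool every_sm_has_work;   // blocks > multiprocessors
    bool blocks_can_overlap;  // blocks / multiprocessors > 2
    bool scales_to_future;    // blocks > 100
};

static GridAdvice check_grid(unsigned blocks, unsigned multiprocessors) {
    assert(multiprocessors > 0);
    GridAdvice a;
    a.every_sm_has_work  = blocks > multiprocessors;
    a.blocks_can_overlap = blocks > 2 * multiprocessors;
    a.scales_to_future   = blocks > 100;
    return a;
}
```

A million elements at 256 threads per block gives 3907 blocks, which passes all three checks on a 16-multiprocessor device. A 40-block grid passes the first two but not the third.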
![Page 100: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/100.jpg)
Register Dependency
Read-after-write register dependency
  Instruction's result can be read ~22 cycles later
  Scenarios:
    CUDA: x = y + 5;
          z = x + 3;
    PTX:  add.f32 $f3, $f1, $f2
          add.f32 $f5, $f3, $f4
    CUDA: s_data[0] += 3;
    PTX:  ld.shared.f32 $f3, [$r31+0]
          add.f32 $f3, $f3, $f4
To completely hide the latency:
  Run at least 192 threads (6 warps) per multiprocessor
    At least 25% occupancy
  Threads do not have to belong to the same thread block
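Where the 192-thread figure comes from can be checked with a little arithmetic. This is a minimal host-side C++ sketch, assuming (the slide does not state this) that a G80 multiprocessor issues one warp instruction every 4 clock cycles, that is 32 threads over 8 scalar cores, and that at most 768 threads are resident per multiprocessor. warps_to_hide is an illustrative name.

```cpp
#include <cassert>

// Number of warps needed so that, by the time the scheduler returns to a
// warp, its previous result is already readable.
// latency_cycles: read-after-write latency (about 22 on the slide).
// issue_cycles:   clocks one warp instruction occupies the multiprocessor
//                 (assumed to be 4 on G80).
static unsigned warps_to_hide(unsigned latency_cycles, unsigned issue_cycles) {
    assert(issue_cycles > 0);
    return (latency_cycles + issue_cycles - 1) / issue_cycles;  // round up
}
```

With these assumptions warps_to_hide(22, 4) is 6 warps, or 6 * 32 = 192 threads, and 192 out of 768 resident threads is the 25% occupancy quoted on the slide.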
![Page 101: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/101.jpg)
Register Pressure
Hide latency by using more threads per SM
Limiting Factors:
  Number of registers per kernel
    8192 per SM, partitioned among concurrent threads
  Amount of shared memory
    16KB per SM, partitioned among concurrent threadblocks
Check .cubin file for # registers / kernel
Use -maxrregcount=N flag to NVCC
  N = desired maximum registers / kernel
  At some point "spilling" into LMEM may occur
    Reduces performance: LMEM is slow
    Check .cubin file for LMEM usage
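The limits above can be turned into a rough occupancy estimate. The host-side C++ sketch below is a simplified model, not the CUDA Occupancy Calculator. It takes the 8192-register and 16KB figures from the slide, and additionally assumes 768 resident threads (24 warps) and 8 resident blocks per SM, which are the G80 limits. It ignores allocation granularity. KernelUsage, blocks_per_sm and occupancy are illustrative names.

```cpp
#include <algorithm>
#include <cassert>

struct KernelUsage {
    unsigned threads_per_block;
    unsigned regs_per_thread;  // "reg" in the cubin
    unsigned smem_per_block;   // bytes, "smem" in the cubin
};

// Resident blocks per SM: the tightest of the register, shared memory,
// thread-count and block-count limits.
static unsigned blocks_per_sm(const KernelUsage& k) {
    const unsigned kRegsPerSM = 8192, kSmemPerSM = 16384;
    const unsigned kMaxThreadsPerSM = 768, kMaxBlocksPerSM = 8;
    assert(k.threads_per_block > 0 && k.regs_per_thread > 0);
    unsigned by_regs = kRegsPerSM / (k.regs_per_thread * k.threads_per_block);
    unsigned by_smem = k.smem_per_block ? kSmemPerSM / k.smem_per_block
                                        : kMaxBlocksPerSM;
    unsigned by_threads = kMaxThreadsPerSM / k.threads_per_block;
    return std::min(std::min(by_regs, by_smem),
                    std::min(by_threads, kMaxBlocksPerSM));
}

// Occupancy = resident warps / maximum resident warps (24 assumed).
static double occupancy(const KernelUsage& k) {
    const unsigned kWarpSize = 32, kMaxWarpsPerSM = 24;
    unsigned warps_per_block = (k.threads_per_block + kWarpSize - 1) / kWarpSize;
    return static_cast<double>(blocks_per_sm(k) * warps_per_block) / kMaxWarpsPerSM;
}
```

In this model a kernel with 256 threads per block, 10 registers per thread and 4KB of shared memory per block reaches full occupancy, while the same block size at 20 registers per thread fits only one block per SM (one third occupancy). That is the kind of case where limiting registers with -maxrregcount may be worth trying.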
![Page 102: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/102.jpg)
Determining resource usage
Use the "--ptxas-options=-v" option to nvcc
Or, compile the kernel code with the -cubin flag to determine register usage.
Open the .cubin file with a text editor and look for the "code" section.
    architecture {sm_10}
    abiversion {0}
    modname {cubin}
    code {
        name = BlackScholesGPU
        lmem = 0      <- per thread local memory
        smem = 68     <- per thread block shared memory
        reg = 20      <- per thread registers
        bar = 0
        bincode {
            0xa0004205 0x04200780 0x40024c09 0x00200780
            …
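Reading these fields by eye gets tedious when many kernels are involved. Below is a small host-side C++ sketch that pulls a "name = value" field out of cubin text. It assumes the plain-text cubin layout shown on the slide and handles nothing beyond those simple lines; cubin_field is a hypothetical helper, not part of any CUDA tool.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Look up an integer field such as "reg = 20" in the text of a .cubin
// file. Returns -1 if the field is not present. Lines that do not have
// the form "<name> = <integer>" are skipped.
static long cubin_field(const std::string& cubin_text, const std::string& name) {
    assert(!name.empty());
    std::istringstream lines(cubin_text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string key, equals;
        long value;
        if ((fields >> key >> equals >> value) && key == name && equals == "=")
            return value;
    }
    return -1;
}
```

Applied to the listing on the slide, cubin_field(text, "reg") would return 20, "smem" 68 and "lmem" 0; a non-zero lmem value is the sign of register spilling.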
![Page 103: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/103.jpg)
CUDA Occupancy Calculator
![Page 104: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/104.jpg)
Optimizing threads per block
Choose threads per block as a multiple of warp size
  Avoid wasting computation on under-populated warps
More threads per block == better memory latency hiding
But, more threads per block == fewer registers per thread
  Kernel invocations can fail if too many registers are used
Heuristics
  Minimum: 64 threads per block
    Only if multiple concurrent blocks
  192 or 256 threads a better choice
    Usually still enough regs to compile and invoke successfully
  This all depends on your computation, so experiment!
![Page 105: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/105.jpg)
Occupancy != Performance
Increasing occupancy does not necessarily increase performance
BUT…
Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels
  (It all comes down to arithmetic intensity and available parallelism)
![Page 106: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/106.jpg)
Parameterize Your Application
Parameterization helps adaptation to different GPUs
GPUs vary in many ways:
  Number of multiprocessors
  Memory bandwidth
  Shared memory size
  Register file size
  Threads per block
You can even make apps self-tuning (like FFTW and ATLAS)
  "Experiment" mode discovers and saves optimal configuration
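An "experiment" mode can be very small. The following host-side C++ sketch shows the idea for a single parameter, the block size. The timing callback stands in for whatever launches and times the real kernel, so time_ms and pick_block_size are hypothetical names and nothing here calls the CUDA runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Try every candidate block size, time it with the supplied callback and
// return the fastest one. TimeFn is any callable taking a block size and
// returning the measured time in milliseconds.
template <typename TimeFn>
static unsigned pick_block_size(const std::vector<unsigned>& candidates,
                                TimeFn time_ms) {
    assert(!candidates.empty());
    unsigned best = candidates[0];
    double best_ms = time_ms(best);
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        double ms = time_ms(candidates[i]);
        if (ms < best_ms) {
            best_ms = ms;
            best = candidates[i];
        }
    }
    return best;
}
```

In a real application the winning value would be saved per device, so the search runs once per GPU model rather than on every start.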
![Page 107: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/107.jpg)
Conclusion
Understand CUDA performance characteristics
  Memory coalescing
  Divergent branching
  Bank conflicts
  Latency hiding
Use peak performance metrics to guide optimization
Understand parallel algorithm complexity theory
Know how to identify type of bottleneck
  e.g. memory, core computation, or instruction overhead
Optimize your algorithm, then unroll loops
Use template parameters to generate optimal code
![Page 108: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/108.jpg)
The CUDA Visual Profiler
![Page 109: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/109.jpg)
The CUDA Visual Profiler
Helps measure and find potential performance problems
  GPU and CPU timing for all kernel invocations and memcpys
  Time stamps
Access to hardware performance counters
![Page 110: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/110.jpg)
Profiler demo
![Page 111: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/111.jpg)
Signals
Events are tracked with hardware counters on signals in the chip:
  timestamp
  gld_incoherent, gld_coherent, gst_incoherent, gst_coherent
    Global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent)
  local_load, local_store
    Local loads/stores
  branch, divergent_branch
    Total branches and divergent branches taken by threads
  instructions: instruction count
  warp_serialize: thread warps that serialize on address conflicts to shared or constant memory
  cta_launched: executed thread blocks
![Page 112: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/112.jpg)
Interpreting profiler counters
Values represent events within a thread warp
Only targets one multiprocessor
  Values will not correspond to the total number of warps launched for a particular kernel.
  Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.
Values are best used to identify relative performance differences between unoptimized and optimized code
  In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize
![Page 113: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/113.jpg)
COME
![Page 114: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/114.jpg)
Back Pocket Slides
slide by David Cox
![Page 115: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/115.jpg)
Dense Linear Algebra
IAP09 CUDA@MIT / 6.963
![Page 116: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/116.jpg)
Dense Linear Algebra
Major problems:
  • Linear systems
  • Eigenvalues
  • Singular values
Source of large-scale cases:
  • Aeronautics: BEM
  • Computational chemistry
  • Data mining
© 2008 NVIDIA Corporation.
![Page 117: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/117.jpg)
Dense Linear Algebra
Major problems:
  • Linear systems
  • Eigenvalues
  • Singular values
Algorithms:
  • One-sided factorizations: LU, Cholesky, QR
  • Two-sided factorizations: QR alg., Jacobi
  • Two-sided factorizations: SVD
![Page 118: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/118.jpg)
BLAS (Fortran 77):
    CALL SGEMM( 'N', 'N',
                M, N, K,
                1.0, A, LDA,
                B, LDB,
                1.0, C, LDC )
Matrix-matrix products: C = C + A * B
CUBLAS (C):
    cublasSgemm( 'N', 'N',
                 M, N, K,
                 1.0, dA, LDA,
                 dB, LDB,
                 1.0, dC, LDC );
Computation in GPU requires:
  • Initialization of CUDA environment
  • Allocation of data structures in GPU memory (handles dA, dB, dC)
  • Transfer of data (matrices A, B, C)
  • Computation (cublasSgemm)
  • Retrieve result (matrix C)
  • Free data structures in GPU memory
  • Termination of CUDA environment
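When moving a BLAS call to the GPU it helps to have a host-side reference for what the call computes. The sketch below spells out the SGEMM operation for the 'N', 'N' case in plain C++, using the BLAS column-major convention with leading dimensions. sgemm_nn_reference is an illustrative name and the code makes no claim about how CUBLAS implements the operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference for SGEMM with transa = transb = 'N':
//   C = alpha * A * B + beta * C
// Column-major storage: element (i, j) of A lives at A[i + j * lda].
// A is m x k, B is k x n, C is m x n.
static void sgemm_nn_reference(int m, int n, int k, float alpha,
                               const std::vector<float>& A, int lda,
                               const std::vector<float>& B, int ldb,
                               float beta, std::vector<float>& C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

Running the same small matrices through this function and through the GPU path, then comparing C within a floating-point tolerance, is a cheap way to catch a wrong leading dimension or a row-major/column-major mix-up.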
![Page 119: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/119.jpg)
Benchmarking GPUs to Tune Dense Linear Algebra

Vasily Volkov
Computer Science Division, University of California at Berkeley
James W. Demmel
Computer Science Division and Department of Mathematics, University of California at Berkeley

Abstract
We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80–90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.

1 Introduction
We make the following contributions. For the first time, we show an LU, QR and Cholesky factorization that achieve computational rates over 300 Gflop/s on a GPU. These are three of the most widely used factorizations in dense linear algebra and pave the way for the implementation of the entire LAPACK library [Anderson et al. 1990] for the GPUs.
Our results also include performance on the 8-series of NVIDIA GPUs that was not previously attained in the 1.5 years since these GPUs were available. We provide new insights into programming these and newer GPUs that help us achieve performance in such basic kernels as matrix-matrix multiply that is 60% faster than those in the optimized vendor's library CUBLAS 1.1. Some of our codes have been licensed by NVIDIA and included in CUBLAS 2.0. In our approach we think of the GPU as a multithreaded vector unit and our best algorithms were found to closely resemble earlier solutions found for vector processors.
We perform detailed benchmarks of the GPU and reveal some of the bottlenecks, such as access to the on-chip memory that bounds the performance of our best codes, and kernel launch overhead that prohibits efficient fine-grain computations. The benchmarks reveal the structure of the GPU memory system, including sizes and latencies of the L1 and L2 caches and TLB. For the first time we implement and measure the performance of a global barrier that runs entirely on the GPU. We believe this is an important step towards operating GPUs with lower CPU intervention.
To achieve the best performance in matrix factorizations we use state of art techniques such as look-ahead, overlapping CPU and GPU computation, autotuning, smarter variants of 2-level blocking, and choosing the right memory layout; we also use a novel algorithm with modified numerics. We analyze the performance of our implementations in detail to show that all components of the final system run at the nearly optimal rates.
Our best speedups vs. one quad core CPU are over 4× in all three factorizations.
The rest of this paper is organized as follows. Section 2 describes the architecture of the GPUs we used, highlighting the features common to vector architectures. Section 3 benchmarks operations including memory transfer, kernel start-up, and barriers, and uses these to analyze the performance of the panel factorization of LU. Section 4 discusses the design and performance evaluation of matrix multiplication. Section 5 discusses the design of LU, QR and Cholesky, and Section 6 evaluates their performance. Section 7 summarizes and describes future work.

2 GPU Architecture
In this work we are concerned with programming 8 series, 9 series, and 200 series of NVIDIA GPUs, as listed in Table 1. For the description of their architecture see the CUDA programming guide [NVIDIA 2008a], technical briefs [NVIDIA 2006; NVIDIA 2008b] and lecture slides in the course on programming GPUs at the University of Illinois, Urbana-Champaign [Hwu and Kirk 2007]. Additional insights can be found in decuda¹, which is a third-party disassembler of GPU binaries based on reverse-engineering of the native instruction set. The instruction set called PTX that was released by vendor is an abstraction that requires further compilation and so provides fewer insights.

2.1 Notation
The GPU programming model used in the CUDA programming environment [NVIDIA 2008a] borrows much from abstractions used in graphics, e.g. such as used in the DirectX and OpenGL standards. GPU programs are run as collections of scalar threads that run faster if they remain convergent in an SIMD fashion. Similarly, individual arithmetic pipelines that execute scalar instructions are exposed as individual processing cores. For example, the technical brief on the latest GPU [NVIDIA 2008b]

| GPU name | GeForce GTX280 | GeForce 9800GTX | GeForce 8800GTX | GeForce 8600GTS |
|---|---|---|---|---|
| # of vector cores | 30 | 16 | 16 | 4 |
| core clock, GHz | 1.30 | 1.67 | 1.35 | 1.45 |
| registers/core | 64KB | 32KB | 32KB | 32KB |
| smem/core | 16KB | 16KB | 16KB | 16KB |
| memory bus, GHz | 1.1 | 1.1 | 0.9 | 1.0 |
| memory bus, pins | 512 | 256 | 384 | 128 |
| bandwidth, GB/s | 141 | 70 | 86 | 32 |
| memory amount | 1GB | 512MB | 768MB | 256MB |
| SP, peak Gflop/s | 624 | 429 | 346 | 93 |
| SP, peak per core | 21 | 27 | 22 | 23 |
| SP, flops:word | 18 | 25 | 16 | 12 |
| DP, peak Gflop/s | 78 | - | - | - |
| DP, flops:word | 4.4 | - | - | - |

Table 1: The list of the GPUs used in this study. SP is single precision and DP is double precision. Smem is shared memory. Peak flop rates are shown for multiply and add operations. Flops:word is the ratio of peak Gflop/s rate to pin-memory bandwidth in words.

¹ http://www.cs.rug.nl/~wladimir/decuda/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SC2008 November 2008, Austin, Texas, USA 978-1-4244-2835-9/08 $25.00 ©2008 IEEE
![Page 120: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/120.jpg)
Volkov and Demmel (SC08)
![Page 121: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/121.jpg)
Volkov and Demmel (SC08)
![Page 122: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/122.jpg)
![Page 123: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/123.jpg)
!
!
!
"!
#!!
#"!
$!!
$"!
%!!
%"!
&' #$( $"& "#$ #!$' $!'( '!)& (#)$ #&%('
*+,-./0
123425-+567829:
;<=>-,40?@AB
C(D
')D
"#D
"#$%&'!()!*+,'-!+./#'0'1!#2!,/'!3+.,4+,#42-6!7'&.'2,-!#21#.+,'!
,/'!/#$/'-,!3&+.,#42!43!,/'!-8-,'9:-!7'+;!<=>?@A>?!4&!A>?!
42B8C!+./#'0'1D!
[Plot: speedup vs. Core2 Quad (0.0 to 4.5) against order of matrix (64 to 16384) for QR, Cholesky and LU on GTX280 and 8800GTX; best speedups marked on the right: 4.4× and 2.7×]
Figure 8: Speedup versus 3.0GHz Core2 Quad. Numbers on the right are the best speedups.
| | Q6850 Gflop/s | 8800GTX+E6700 Gflop/s | 8800GTX+E6700 speedup | GTX280+E6700 Gflop/s | GTX280+E6700 speedup |
|---|---|---|---|---|---|
| LU | 73 | 179 | 2.5× | 309 | 4.1× |
| Cholesky | 70 | 183 | 2.7× | 315 | 4.4× |
| QR | 75 | 192 | 2.6× | 340 | 4.4× |
| SGEMM | 88 | 208 | 2.4× | 375 | 4.3× |
| peak | 96 | 388 | 4.0× | 667 | 6.9× |

Table 4: Comparison of best Gflop/s rates in the CPU and GPU versions and best speedup vs. the CPU-alone versions. SGEMM rates for the GPU+CPU systems include GPU rates only.
[Plot: Gflop/s (0 to 550) vs. order of matrix (0 to 22500); best rates marked on the right: 538, 309, 298 and 179 Gflop/s]
Figure 9: Performance of one-GPU and two-GPU versions of the LU decomposition with best rates in Gflop/s shown on right.
storing the coarser blocks.
Results for Matrix Factorizations
For the results in this section we used a desktop system based on 2.67GHz Core2 Duo E6700 equipped with multiple PCIe 1.1 ×16 slots. For the results with one or two GeForce 8800GTX we used 32-bit Windows XP and CUDA 1.1. For the results with GeForce GTX280 we used 64-bit Windows XP and CUDA 2.0. CPU-only results were obtained on 3.0GHz Core2 Quad Q6850 running 64-bit Linux. In all cases the Intel MKL 10.0 library is used for factorizations on the CPU. We noted that it runs substantially slower in 32-bit. All results are in single precision.
Input and output data are in the pinned CPU memory, which provides a compromise between usefulness in applications (that are likely to run on the CPU) and performance (slower transfers to/from GPU if the data is in pageable memory). The cost of the memory allocation is not included in the timings.
Matrices are padded to an odd multiple of 64 words. This helps avoid anomalous performance drops at some matrix sizes.
The correctness of the algorithms is tested in the following way. Input matrix $A$ is synthesized with random entries uniformly distributed in $[-1,1]$ (to guarantee symmetric positive definiteness, $A = 0.001 \cdot I + X^T X$ is used instead in testing the Cholesky factorization, where $X$ is the random matrix as described above and $I$ is the identity matrix). Output factors are multiplied and the max-norm of their difference with the input matrix is found. This measures the backward error in the factorization. We found that this error is about the same whether using our GPU-based algorithm or the purely CPU-based algorithm in the Intel MKL (always within a factor of 2, and within 20% in most cases). The variant of the LU factorization that multiplies by the inverses of the diagonal blocks of the triangular matrix has shown about the same accuracy as when running triangular solves on the GPU. As an example, the errors as measured above in LU, QR and Cholesky at $n = 8192$ are about $2000\,\varepsilon\,\|A\|_{\max}$, $200\,\varepsilon\,\|A\|_{\max}$ and $17\,\varepsilon\,\|A\|_{\max}$ resp., where $\varepsilon = 2^{-23}$ is machine epsilon in IEEE single precision and $\|A\|_{\max}$ is the max-norm of $A$.
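In symbols, the test described above can be read roughly as follows. This is a sketch: $\widehat{A}$ (the product of the computed factors) and the constant $c$ are notation introduced here for illustration, not taken from the paper.

```latex
\[
  \|A - \widehat{A}\|_{\max} \approx c\,\varepsilon\,\|A\|_{\max},
  \qquad \|M\|_{\max} = \max_{i,j} |m_{ij}|,
  \qquad \varepsilon = 2^{-23} \approx 1.19 \times 10^{-7},
\]
\[
  c \approx 2000 \ (\text{LU}), \qquad c \approx 200 \ (\text{QR}), \qquad c \approx 17 \ (\text{Cholesky})
  \qquad \text{at } n = 8192 .
\]
```

For LU this puts the measured backward error near $2000 \times 1.19 \times 10^{-7} \approx 2.4 \times 10^{-4}$ relative to $\|A\|_{\max}$.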
Summary of Performance
Fig. 7 shows the Gflop/s rates sustained in the GPU-based matrix factorization routines and using Core2 Quad alone, and Fig. 8 details the speedups vs. Core2 Quad. According to the figure, the crossover between the GPU-based and CPU-alone implementations is around n = 1000 for all but Cholesky run on GTX280, which is around n = 600. The best performances are summarized in Table 4. It shows that the speedup is nearly the same as the speedup in matrix-matrix multiply (SGEMM). However, the difference in theoretical arithmetic peak rates is substantially higher, highlighting that there are more computational resources available than we could harvest.
Fig. 9 shows the performance of the LU decomposition that achieves 538 Gflop/s at n ≈ 21,000 by running two GPUs in parallel. Note that a single GTX 280 yields higher rates than two 8800 GTX. See the notes below on scaling.
Performance Analysis
Fig. 10 shows the breakdown of runtime in the LU factorization on 8800GTX. The breakdown shows that up to 90% of the runtime is consumed by computing on the GPU and about 10% of this time overlaps with computing on the CPU. We expect the GPU part to be smaller when computing with faster GPUs, producing better overlap at large matrix sizes. Time spent in the CPU-GPU transfers is substantial at small and medium sized
![Page 124: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/124.jpg)
![Page 125: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/125.jpg)
![Page 126: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/126.jpg)
![Page 127: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/127.jpg)
[Stacked plot: share of time (0% to 100%) vs. order of matrix (448 to 11264); components: CPU-GPU transfer, transpose, look-ahead, CPU/GPU overlap, GPU, CPU]
Figure 10: The breakdown of time in the LU decomposition run on GeForce 8800 GTX.
[Plot: slowdown (0.9 to 2.0) vs. order of matrix (64 to 16384) when omitting one optimization: overlap CPU/GPU, transpose matrix, TRSM via GEMM, batch pivoting]
Figure 11: Slowdown when omitting one of the optimizations used when running on GeForce GTX 280.
matrices and should be improved with the newer PCIe interconnect supported by the newer GPUs and motherboards. Time spent in transposing the matrices is not substantial. Individual measurements have shown that transpose runs at 25-45 GB/s for n > 1000. This variation in bandwidth is due to the moderate granularity of this operation. For example, it takes ~7μs to copy or transpose a 1024×64 matrix at the peak sustained bandwidth of 76 GB/s, which is close to the kernel launch overhead. CPU-GPU transfers run at 3.0-3.3 GB/s for n > 1000, which approaches the peak sustained rate.
Fig. 11 evaluates the impacts of different optimizations used when computing on GTX280. The most important optimization was using row-major layout on the GPU that nearly doubled the performance at large problem sizes. Individual measurements have shown that pivoting takes 1-10% of time in the entire computation for n > 1500 if done in the row-major layout. In that case it achieves 7-25 GB/s of effective bandwidth. When using column-major layout, it takes 27-50% of the total time and runs at 0.3-1.8 GB/s, with slower rates for larger matrices.
A surprisingly large speedup (up to 30%) was obtained by performing triangular solve via multiplying by the inverse matrix. Triangular solve with a 64×64 triangular matrix and 8192 right hand sides runs at 18 Gflop/s on GTX280 when using CUBLAS 2.0. It is an order of magnitude slower than the 268 Gflop/s rate achieved in multiplying a 64×64 matrix by a 64×8192 matrix that does the same work (this is 134 Gflop/s if not counting the redundant work).
Amortization of kernel launch overhead due to batch pivoting yields 30-100% speedup at n < 1024. Effect of all optimizations decreases at larger problem sizes, where time is dominated by matrix-matrix multiplies. Rates in these multiplies are affected by using 2-level schemes in LU and Cholesky and using autotuning to choose block size in QR. These techniques gave up to 4-7% speedup and factored in only for n > 4096.
According to Fig. 9, using two 8800GTX yields only 67% improvement in the peak Gflop/s rate. This result corresponds to pre-allocating pinned memory in the master CPU thread before GPU contexts are created in the child CPU threads. As a result, all transfers run at a small fraction of the peak PCIe bandwidth as if the memory was not pinned. Higher improvement of 74% when using two GTX280 corresponds to allocating pinned memory in one of the child CPU threads after the GPU contexts are attached. This memory is used to store the CPU's copy of the matrix, i.e. both the input and output data of the routine. This allows running transfers at full bandwidth to one of the GPUs. There are other reasons for less than ideal scaling, such as extra CPU-GPU bandwidth consumption, lack of 2-level blocking and not scaling the CPU side of the system.
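Two of the numbers in this analysis can be checked with a little arithmetic. This is a back-of-envelope sketch under stated assumptions: 4-byte single-precision words, one read plus one write per element for the copy, and the reading that 298 and 538 Gflop/s are the two-GPU best rates marked in Fig. 9 while 179 and 309 Gflop/s are the matching one-GPU rates.

```latex
\[
  t_{\text{copy}} \approx
  \frac{2 \times 1024 \times 64 \times 4\ \text{B}}{76 \times 10^{9}\ \text{B/s}}
  = \frac{524\,288\ \text{B}}{76 \times 10^{9}\ \text{B/s}}
  \approx 6.9\ \mu\text{s},
\]
\[
  \frac{298}{179} \approx 1.66 \quad (\text{two 8800GTX vs. one}), \qquad
  \frac{538}{309} \approx 1.74 \quad (\text{two GTX280 vs. one}).
\]
```

The first result agrees with the quoted ~7μs; the ratios agree with the quoted 67% and 74% improvements to within rounding of the plotted rates.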
Comparison with Other Work
The first implementation of the LU factorization using GPUs that we know was published by Galoppo et al. [2005] and ran at up to ~10 Gflop/s for n = 4000 without pivoting and at ~6 Gflop/s for n = 3500 with partial pivoting on the older GeForce 7800. They use a non-blocked algorithm that is bandwidth and/or overhead bound. Scaling these numbers with bandwidth gives up to 26 Gflop/s on GTX280, an order of magnitude less than in our implementation. Our solution works faster due to large blocking enabled by shared memory. Our high performance when pivoting is enabled by the high-bandwidth access to linear address space available on modern GPUs.
Barrachina et al. [2008] report 50 Gflop/s in LU factorization and 41 Gflop/s in Cholesky factorization for n = 5000 using CUBLAS 1.0 on GeForce 8800 Ultra. Our implementation achieves 2.9× and 3.7× higher speed for LU and Cholesky resp. on the slightly slower 8800GTX. This is due to our improved matrix-matrix multiply routine and the optimizations evaluated above.
Baboulin et al. [2008] describe implementations of LU and QR algorithms that run at up to ≈55 Gflop/s on Quadro FX5600 for n ≈ 19,000 using CUBLAS 1.0. This GPU has slightly slower memory than 8800GTX and is otherwise similar. Their implementation of Cholesky runs at up to 90 Gflop/s if using CUBLAS and approaches 160 Gflop/s if using an early version of the matrix multiply described in this paper and offloading BLAS1/BLAS2 operations to the CPU. Our implementation achieves higher rates due to a deeper performance analysis and tuning.
Castillo et al. [2008] report results for Cholesky factorization run on 4-GPU NVIDIA Tesla S870. Each of these GPUs is similar to Quadro FX5600 described above. Authors report 180 Gflop/s on a system at n ≈ 10,000. We achieve this performance using a single 8800GTX. Their result was later improved to 424 Gflop/s at n ≈ 20,000 by using the matrix multiply routine presented in this paper [Quintana-Orti et al. 2008].
Conclusions
We have presented the fastest (so far) implementations of dense LU, QR and Cholesky factorizations running on a single or double NVIDIA GPUs. Based on our performance benchmarking and modeling, they attain 80%-90% of the peak speeds possible for large matrices. This speed was achieved by carefully choosing optimizations to match the capabilities of the hardware…
![Page 128: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/128.jpg)
CUFFT Example
IAP09 CUDA@MIT / 6.963
![Page 129: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/129.jpg)
M02: High Performance Computing with CUDA
CUDA Example: Fourier-spectral Poisson Solver
Solve a Poisson equation on a rectangular domain with
periodic boundary conditions using a Fourier-spectral
method.
This example will show how to use the FFT library, transfer
the data to/from GPU and perform simple computations on
the GPU.
![Page 130: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/130.jpg)
Mathematical background
$$\nabla^2 u = r \quad \xrightarrow{\ \text{FFT}\ } \quad -(k_x^2 + k_y^2)\,\hat{u} = \hat{r}$$
1. Apply 2D forward FFT to r to obtain r(k), where k is the
wave number
2. Apply the inverse of the Laplace operator to r(k) to obtain
u(k): simple element-wise division in Fourier space
3. Apply 2D inverse FFT to u(k) to obtain u
$$\hat{u} = -\frac{\hat{r}}{k_x^2 + k_y^2}$$
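Why the Laplacian becomes an element-wise multiplication in Fourier space can be sketched in two lines. The Fourier-mode expansion below is standard for a periodic domain; the summation notation is introduced here for illustration and does not appear on the slide.

```latex
\[
  u(x,y) = \sum_{k_x,k_y} \hat{u}(k_x,k_y)\, e^{i(k_x x + k_y y)}
  \;\Longrightarrow\;
  \nabla^2 u = \sum_{k_x,k_y} -(k_x^2 + k_y^2)\,\hat{u}(k_x,k_y)\, e^{i(k_x x + k_y y)},
\]
\[
  \nabla^2 u = r
  \;\Longrightarrow\;
  -(k_x^2 + k_y^2)\,\hat{u} = \hat{r}
  \;\Longrightarrow\;
  \hat{u} = -\frac{\hat{r}}{k_x^2 + k_y^2},
  \qquad (k_x,k_y) \neq (0,0).
\]
```

The mode $(k_x,k_y) = (0,0)$ is left undetermined because the solution of a periodic Poisson problem is only defined up to an additive constant; the reference MATLAB code handles this by setting that entry of the divisor to 1 and later shifting u so that one corner value is 0.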
![Page 131: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/131.jpg)
Reference MATLAB implementation
% No. of Fourier modes
N = 64;
% Domain size (assumed square)
L = 1;
% Characteristic width of f (make << 1)
sig = 0.1;
% Vector of wavenumbers
k = (2*pi/L)*[0:(N/2-1) (-N/2):(-1)];
% Matrix of (x,y) wavenumbers corresponding
% to Fourier mode (m,n)
[KX KY] = meshgrid(k,k);
% Laplacian matrix acting on the wavenumbers
delsq = -(KX.^2 + KY.^2);
% Kludge to avoid division by zero for
% wavenumber (0,0).
% (this waveno. of fhat should be zero anyway!)
delsq(1,1) = 1;
% Grid spacing
h = L/N;
x = (0:(N-1))*h;
y = (0:(N-1))*h;
[X Y] = meshgrid(x,y);
% Construct RHS f(x,y) at the Fourier gridpoints
rsq = (X-0.5*L).^2 + (Y-0.5*L).^2;
sigsq = sig^2;
f = exp(-rsq/(2*sigsq)).*...
    (rsq - 2*sigsq)/(sigsq^2);
% Spectral inversion of Laplacian
fhat = fft2(f);
u = real(ifft2(fhat./delsq));
% Specify arbitrary constant by forcing corner
% u = 0.
u = u - u(1,1);
% Compute L2 and Linf norm of error
uex = exp(-rsq/(2*sigsq));
errmax = norm(u(:)-uex(:),inf);
errmax2 = norm(u(:)-uex(:),2)/(N*N);
% Print L2 and Linf norm of error
fprintf('N=%d\n',N);
fprintf('Solution at (%d,%d): ',N/2,N/2);
fprintf('computed=%10.6f reference = %10.6f\n',u(N/2,N/2),uex(N/2,N/2));
fprintf('Linf err=%10.6e L2 norm err = %10.6e\n',errmax, errmax2);
http://www.atmos.washington.edu/2005Q2/581/matlab/pois_FFT.m
![Page 132: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/132.jpg)
Implementation steps
The following steps need to be performed:
1. Allocate memory on host: r (NxN), u (NxN), kx (N) and ky (N)
2. Allocate memory on device: r_d, u_d, kx_d, ky_d
3. Transfer r, kx and ky from host memory to the corresponding arrays in device memory
4. Initialize plan for FFT
5. Compute execution configuration
6. Transform real input to complex input
7. 2D forward FFT
8. Solve Poisson equation in Fourier space
9. 2D inverse FFT
10. Transform complex output to real output
11. Transfer results from the GPU back to the host
We are not taking advantage of the symmetries (C2C transform for real data) to keep the code simple.
![Page 133: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/133.jpg)
Solution walk-through (steps 1-2)
/* Allocate arrays on the host */
float *kx, *ky, *r;
kx = (float *) malloc(sizeof(float)*N);
ky = (float *) malloc(sizeof(float)*N);
r = (float *) malloc(sizeof(float)*N*N);
/* Allocate arrays on the GPU with cudaMalloc */
float *kx_d, *ky_d, *r_d;
cudaMalloc( (void **) &kx_d, sizeof(float)*N);
cudaMalloc( (void **) &ky_d, sizeof(float)*N);
cudaMalloc( (void **) &r_d , sizeof(float)*N*N);
/* Complex work array used by the FFT */
cufftComplex *r_complex_d;
cudaMalloc( (void **) &r_complex_d, sizeof(cufftComplex)*N*N);
![Page 134: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/134.jpg)
Code walk-through (steps 3-4)
/* Initialize r, kx and ky on the host */
……………
/*Transfer data from host to device with
cudaMemcpy(target, source, size, direction)*/
cudaMemcpy (kx_d, kx, sizeof(float)*N , cudaMemcpyHostToDevice);
cudaMemcpy (ky_d, ky, sizeof(float)*N , cudaMemcpyHostToDevice);
cudaMemcpy (r_d , r , sizeof(float)*N*N, cudaMemcpyHostToDevice);
/* Create plan for CUDA FFT (interface similar to FFTW) */
cufftHandle plan;
cufftPlan2d( &plan, N, N, CUFFT_C2C);
![Page 135: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/135.jpg)
37M02: High Performance Computing with CUDA
Code walk-through (step 5)
/* Compute the execution configuration
NB: block_size_x*block_size_y = number of threads per block
On G80 number of threads per block <= 512 */
dim3 dimBlock(block_size_x, block_size_y);
dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
/* Handle N not multiple of block_size_x or block_size_y */
if (N % block_size_x !=0 ) dimGrid.x+=1;
if (N % block_size_y !=0 ) dimGrid.y+=1;
[Figure: N x N domain tiled by thread blocks of size block_size_x by block_size_y]
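The grid-size computation above (integer division plus one extra block when there is a remainder) is equivalent to a rounded-up division. A minimal host-side sketch, with the helper name `grid_dim` chosen here for illustration:

```c
#include <assert.h>

/* Number of blocks needed to cover n elements with blocks of
   block_size threads: integer division rounded up. */
static unsigned int grid_dim(unsigned int n, unsigned int block_size)
{
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}
```

For N = 64 with a 32 x 16 block this gives a 2 x 4 grid. When N is not a multiple of the block size, the last block in each direction contains threads outside the domain, which is why the kernels guard with `idx < N && idy < N`.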
![Page 136: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/136.jpg)
Code walk-through (steps 6-10)
/* Transform real input to complex input */
real2complex<<<dimGrid, dimBlock>>> (r_d, r_complex_d, N);
/* Compute in place forward FFT */
cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_FORWARD);
/* Solve Poisson equation in Fourier space */
solve_poisson<<<dimGrid, dimBlock>>> (r_complex_d, kx_d, ky_d,N);
/* Compute in place inverse FFT */
cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_INVERSE);
/* Copy the solution back to a real array and apply scaling (an FFT followed by
an inverse FFT gives back the same array times the length of the transform) */
float scale = 1.f / ( (float) N * (float) N );
complex2real_scaled<<<dimGrid, dimBlock>>> (r_d, r_complex_d, N, scale);
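The scaling step exists because the forward and inverse transforms are both unnormalized. The sketch below illustrates this on the host with a 4-point DFT, used here as a stand-in for CUFFT rather than its actual implementation. For length 4 the twiddle factors are exactly 1, i, -1, -i, so integer arithmetic suffices. The function name `dft4` is hypothetical.

```c
#include <assert.h>

/* Unnormalized 4-point DFT on integer complex data.
   sign = -1 gives the forward transform, sign = +1 the inverse.
   For n = 4 the twiddle factors exp(sign*2*pi*i*j*k/4) are exactly
   1, i, -1, -i, so no floating point is needed. */
static void dft4(const int in_re[4], const int in_im[4],
                 int out_re[4], int out_im[4], int sign)
{
    /* cos and sin of 2*pi*q/4 for q = 0..3 */
    static const int cos4[4] = { 1, 0, -1, 0 };
    static const int sin4[4] = { 0, 1, 0, -1 };
    int j, k;
    assert(sign == 1 || sign == -1);
    for (k = 0; k < 4; k++) {
        out_re[k] = 0;
        out_im[k] = 0;
        for (j = 0; j < 4; j++) {
            int q = (j * k) % 4;
            int wr = cos4[q];
            int wi = sign * sin4[q];
            /* accumulate (in_re + i*in_im) * (wr + i*wi) */
            out_re[k] += in_re[j] * wr - in_im[j] * wi;
            out_im[k] += in_re[j] * wi + in_im[j] * wr;
        }
    }
}
```

Applying `dft4` with sign -1 and then with sign +1 returns the input multiplied by 4, the transform length. In the 2D N x N case the factor is N*N, hence the scale of 1/(N*N) in the code above.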
![Page 137: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/137.jpg)
Code walk-through (step 11)
/*Transfer data from device to host with
cudaMemcpy(target, source, size, direction)*/
cudaMemcpy (r , r_d , sizeof(float)*N*N, cudaMemcpyDeviceToHost);
/* Destroy plan and clean up memory on device*/
cufftDestroy( plan);
cudaFree(r_complex_d);
…….
cudaFree(kx_d);
![Page 138: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/138.jpg)
real2complex
/* Copy real data to complex data */
__global__ void real2complex (float *a, cufftComplex *c, int N)
{
/* compute idx and idy, the location of the element in the original NxN array */
int idx = blockIdx.x*blockDim.x+threadIdx.x;
int idy = blockIdx.y*blockDim.y+threadIdx.y;
if ( idx < N && idy <N)
{
int index = idx + idy*N;
c[index].x = a[index];
c[index].y = 0.f;
}
}
[Figure: (idx, idy) is the location of the element in the N x N array]
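One way to check a kernel like this is to compare it against a plain host loop over the same row-major index. The sketch below is such a reference, not part of the original example. The type `complex_f` is a local stand-in for cufftComplex (assumed to hold the real part in `x` and the imaginary part in `y`, as the kernel's accesses suggest), and the function name is hypothetical.

```c
#include <assert.h>

/* Local stand-in for cufftComplex: x = real part, y = imaginary part. */
typedef struct { float x, y; } complex_f;

/* Host reference of real2complex: element (idx, idy) of the row-major
   n x n array lives at linear index idx + idy*n. */
static void real2complex_host(const float *a, complex_f *c, int n)
{
    int idx, idy;
    assert(n > 0);
    for (idy = 0; idy < n; idy++) {
        for (idx = 0; idx < n; idx++) {
            int index = idx + idy * n;
            c[index].x = a[index];
            c[index].y = 0.0f;
        }
    }
}
```

The two nested loops play the role of the 2D grid of threads: each (idx, idy) pair handled by one loop iteration here is handled by one thread in the kernel.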
![Page 139: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/139.jpg)
solve_poisson (with shared memory)
__global__ void solve_poisson (cufftComplex *c, float *kx, float *ky, int N)
{
unsigned int idx = __umul24(blockIdx.x,blockDim.x)+threadIdx.x;
unsigned int idy = __umul24(blockIdx.y,blockDim.y)+threadIdx.y;
// use shared memory to minimize multiple access to same k values
__shared__ float kx_s[BLOCK_WIDTH], ky_s[BLOCK_HEIGHT];
// the first row of threads loads kx, the first column loads ky
if (threadIdx.y == 0 && idx < N) kx_s[threadIdx.x] = kx[idx];
if (threadIdx.x == 0 && idy < N) ky_s[threadIdx.y] = ky[idy];
__syncthreads();
if ( idx < N && idy <N)
{
unsigned int index = idx +__umul24(idy ,N);
float scale = - ( kx_s[threadIdx.x]*kx_s[threadIdx.x]
+ ky_s[threadIdx.y]*ky_s[threadIdx.y] );
if ( idx ==0 && idy == 0 ) scale =1.f;
scale = 1.f / scale;
c[index].x *= scale;
c[index].y*= scale;
}
}
$$\hat{\phi} = -\frac{\hat{r}}{k_x^2 + k_y^2}$$
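The Fourier-space solve divides every coefficient by -(kx^2 + ky^2), except the (0,0) mode, where the denominator would be zero. A host-side sketch of the same arithmetic, useful as a reference when checking the kernel, is given below. It assumes kx[0] = ky[0] = 0 and that no other mode has kx^2 + ky^2 = 0. The type `complex_f` and the function name are hypothetical stand-ins.

```c
#include <assert.h>

/* Local stand-in for cufftComplex: x = real part, y = imaginary part. */
typedef struct { float x, y; } complex_f;

/* Host reference of solve_poisson: divide each Fourier coefficient by
   -(kx^2 + ky^2). Assumes the only zero wavenumber pair is (0,0). */
static void solve_poisson_host(complex_f *c, const float *kx,
                               const float *ky, int n)
{
    int idx, idy;
    assert(n > 0);
    for (idy = 0; idy < n; idy++) {
        for (idx = 0; idx < n; idx++) {
            int index = idx + idy * n;
            float denom = -(kx[idx] * kx[idx] + ky[idy] * ky[idy]);
            /* The (0,0) mode has k = 0; like the kernel, leave it unscaled. */
            if (idx == 0 && idy == 0) denom = 1.0f;
            c[index].x /= denom;
            c[index].y /= denom;
        }
    }
}
```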
![Page 140: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/140.jpg)
Compile and run poisson
Compile the example poisson.cu:
nvcc -O3 -o poisson poisson.cu \
-I/usr/local/cuda/include -L/usr/local/cuda/lib -lcufft \
-I/usr/local/NVIDIA_CUDA_SDK/common/inc \
-L/usr/local/NVIDIA_CUDA_SDK/lib -lcutil
Run the example:
./poisson -N64
Poisson solver on a domain 64 x 64
dimBlock 32 16 (512 threads)
dimGrid 2 4
L2 error 9.436995e-08:
Time 0.000569:
Time I/O 0.000200 (0.000136 + 0.000064):
Solution at (32,32)
computed=0.975879 reference=0.975882
Reference values from MATLAB: N=64
Solution at (32,32): computed= 0.975879 reference= 0.975882
Linf err=2.404194e-05 L2 norm err = 9.412790e-08
![Page 141: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/141.jpg)
Misc
IAP09 CUDA@MIT / 6.963
![Page 142: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/142.jpg)
Tesla C1060 Computing Processor

| Item | Specification |
|---|---|
| Processor | 1 x Tesla T10P |
| Core clock | 1.33 GHz |
| Form factor | Full ATX: 4.736” (H) x 10.5” (L), dual slot wide |
| On-board memory | 4 GB |
| System I/O | PCIe x16 gen2 |
| Memory I/O | 512-bit, 800 MHz DDR, 102 GB/s peak bandwidth |
| Display outputs | None |
| Typical power | 160 W |
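The quoted peak bandwidth follows from the bus width and memory clock. A small sketch of the arithmetic, assuming "DDR" means two transfers per clock cycle; the function name is chosen here for illustration:

```c
#include <assert.h>

/* Peak memory bandwidth in MB/s (10^6 bytes per second):
   bytes per transfer * memory clock in MHz * transfers per clock. */
static unsigned long peak_bandwidth_mb_s(unsigned int bus_width_bits,
                                         unsigned int clock_mhz,
                                         unsigned int transfers_per_clock)
{
    assert(bus_width_bits % 8 == 0);
    return (unsigned long)(bus_width_bits / 8) * clock_mhz * transfers_per_clock;
}
```

For a 512-bit bus at 800 MHz with two transfers per clock this gives 64 * 800 * 2 = 102400 MB/s, or about 102 GB/s, consistent with the figure on the slide.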
![Page 143: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/143.jpg)
Tesla S1070 1U System

| Item | Specification |
|---|---|
| Processors | 4 x Tesla T10P |
| Core clock | 1.5 GHz |
| Form factor | 1U for an EIA 19” 4-post rack |
| Total 1U system memory | 16 GB (4.0 GB per GPU) |
| System I/O | 2 PCIe x16 |
| Memory I/O per processor | 512-bit, 800 MHz GDDR, 102 GB/s peak bandwidth |
| Display outputs | None |
| Typical power | 700 W |
| Chassis dimensions | 1.73” H x 17.5” W x 28.5” D |
![Page 144: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)](https://reader038.fdocuments.net/reader038/viewer/2022102716/5469ff0faf7959e8488b5007/html5/thumbnails/144.jpg)
Double Precision Floating Point

| Feature | NVIDIA GPU | SSE2 | Cell SPE |
|---|---|---|---|
| Precision | IEEE 754 | IEEE 754 | IEEE 754 |
| Rounding modes for FADD and FMUL | All 4 IEEE: round to nearest, zero, inf, -inf | All 4 IEEE: round to nearest, zero, inf, -inf | Round to zero/truncate only |
| Denormal handling | Full speed | Supported, costs 1000's of cycles | Flush to zero |
| NaN support | Yes | Yes | No |
| Overflow and Infinity support | Yes | Yes | No infinity, clamps to max norm |
| Flags | No | Yes | Some |
| FMA | Yes | No | Yes |
| Square root | Software with low-latency FMA-based convergence | Hardware | Software only |
| Division | Software with low-latency FMA-based convergence | Hardware | Software only |
| Reciprocal estimate accuracy | 24 bit | 12 bit | 12 bit |
| Reciprocal sqrt estimate accuracy | 23 bit | 12 bit | 12 bit |
| log2(x) and 2^x estimates accuracy | 23 bit | No | No |