Lecture 21: Manycore GPU Architectures and Programming, Part 3
-- Streaming, Library and Tuning
Concurrent and Multicore Programming
Department of Computer Science and Engineering, Yonghong Yan
[email protected]/~yan
1
Manycore GPU Architectures and Programming: Outline
• Introduction
– GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA
– Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models
– OpenACC and OpenMP
2
Offloading Processing Flow
1. Copy input data from CPU memory to GPU memory
[Figure: CPU and GPU memories connected via the PCI bus]
3
Offloading Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
4
Offloading Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
5
Overlapping Communication and Computation
[Figure: on the PCIe bus, copies of successive chunks are pipelined with GPU compute, so each copy overlaps with computation on a previous chunk]
• Three sequential steps for a single kernel execution
• Multiple kernels
– Asynchrony is a first-class citizen of most GPU programming frameworks
– Computation-communication overlap is a common technique in GPU programming
6
Abstract Concurrency
• What kinds of action overlap are possible in CUDA?
1. Overlapped host computation and device computation
2. Overlapped host computation and host-device data transfer
3. Overlapped host-device data transfer and device computation
4. Concurrent device computation
• CUDA streams are used to achieve each of these types of overlap
7
CUDA Streams
• A CUDA stream is a FIFO queue of CUDA actions to be performed
– Placing a new action at the head of a stream is asynchronous
– Actions are executed from the tail as CUDA resources allow
– Every action (kernel launch, cudaMemcpy, etc.) runs in an implicit or explicit stream
[Figure: a CUDA application enqueues actions (kernel, cudaMemcpy, cudaMemcpy) at the head of a stream; the CUDA runtime and GPU execute them from the tail]
8
CUDA Streams
• Two types of streams in a CUDA program
– The implicitly declared stream (NULL stream)
– Explicitly declared streams (non-NULL streams)
• Up until now, all code has been using the NULL stream by default:
cudaMemcpy(...);
kernel<<<...>>>(...);
cudaMemcpy(...);
• Non-NULL streams require manual allocation and management by the CUDA programmer
9
CUDA Streams
• To create a CUDA stream:
cudaError_t cudaStreamCreate(cudaStream_t *stream);
• To destroy a CUDA stream:
cudaError_t cudaStreamDestroy(cudaStream_t stream);
• To wait for all actions in a CUDA stream to finish:
cudaError_t cudaStreamSynchronize(cudaStream_t stream);
• To check if all actions in a CUDA stream have finished:
cudaError_t cudaStreamQuery(cudaStream_t stream);
10
CUDA Streams
• cudaMemcpyAsync: asynchronous memcpy
cudaError_t cudaMemcpyAsync(void *dst, const void *src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0);
• cudaMemcpyAsync does the same as cudaMemcpy, but may return before the transfer is actually complete
• Pinned host memory is a requirement for cudaMemcpyAsync
– Memory that is resident in physical memory pages and cannot be swapped out, also referred to as page-locked
• Recall that malloc normally reserves virtual address space first; physical pages are only allocated later
11
CUDA Streams
• Performing a cudaMemcpyAsync:
int *h_arr, *d_arr;
cudaStream_t stream;
cudaMalloc((void **)&d_arr, nbytes);
cudaMallocHost((void **)&h_arr, nbytes);   // page-locked memory allocation
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice, stream);   // call returns before transfer completes
...   // do something while data is being moved
cudaStreamSynchronize(stream);   // sync to make sure operations complete
cudaFree(d_arr);
cudaFreeHost(h_arr);
cudaStreamDestroy(stream);
12
CUDA Streams
• Associate kernel launches with a non-NULL stream
– Note that kernel launches are always asynchronous
kernel<<<nblocks, threads_per_block, smem_size, stream>>>(...);
• The effects of cudaMemcpyAsync and kernel launching:
– Operations are put in the stream queue for execution
– The actual operations may not have happened yet
• A host-side timer around these calls times only the calls themselves
– Not the actual time of the operations
13
CUDA Streams
• Vector sum example, A + B = C
• Partition the vectors and use CUDA streams to overlap copy and compute
[Figure: in the NULL stream, Copy A, Copy B, vector_sum<<<...>>>, and Copy C run back to back; with four streams A-D, each stream runs its own Copy A, Copy B, vector_sum, Copy C sequence, staggered so copies in one stream overlap with compute in another]
14
CUDA Streams
• How can this be implemented in code?
for (int i = 0; i < nstreams; i++) {
    int offset = i * eles_per_stream;
    cudaMemcpyAsync(&d_A[offset], &h_A[offset], eles_per_stream * sizeof(int), cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(&d_B[offset], &h_B[offset], eles_per_stream * sizeof(int), cudaMemcpyHostToDevice, streams[i]);
    ...
    vector_sum<<<..., streams[i]>>>(d_A + offset, d_B + offset, d_C + offset);
    cudaMemcpyAsync(&h_C[offset], &d_C[offset], eles_per_stream * sizeof(int), cudaMemcpyDeviceToHost, streams[i]);
}
for (int i = 0; i < nstreams; i++)
    cudaStreamSynchronize(streams[i]);
15
CUDA Events
• Timing asynchronous operations
– A host-side timer only measures the time for the call, not the actual time for the data movement or kernel execution
• Events are added to streams to mark specific points in stream execution
• Events are manually created and destroyed:
cudaError_t cudaEventCreate(cudaEvent_t *event);
cudaError_t cudaEventDestroy(cudaEvent_t *event);
[Figure: an event marking a point in a stream of Copy A, Copy B, vector_sum<<<...>>>, Copy C]
16
CUDA Events
• To add an event to a CUDA stream:
cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream);
– The event marks the point in time after all preceding actions in the stream complete, and before any actions added after the cudaEventRecord call run
• To make the host wait for some CUDA actions to finish:
cudaError_t cudaEventSynchronize(cudaEvent_t event);
– Waits for all the operations before this event to complete, but not those after it
17
CUDA Events
• Check if an event has been reached without waiting for it:
cudaError_t cudaEventQuery(cudaEvent_t event);
• Get the elapsed milliseconds between two events:
cudaError_t cudaEventElapsedTime(float *ms, cudaEvent_t start, cudaEvent_t stop);
[Figure: start and stop events bracketing the actions Copy A, Copy B, vector_sum<<<...>>>, Copy C in a stream]
18
CUDA Events
• In code:
float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
kernel<<<grid, block>>>(arguments);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
19
Implicit and Explicit Synchronization
• Two types of host-device synchronization:
– Implicit synchronization causes the host to wait on the GPU as a side effect of other CUDA actions
– Explicit synchronization causes the host to wait on the GPU because the programmer has asked for that behavior
20
Implicit and Explicit Synchronization
• Five CUDA operations that include implicit synchronization:
1. A pinned host memory allocation (cudaMallocHost, cudaHostAlloc)
2. A device memory allocation (cudaMalloc)
3. A device memset (cudaMemset)
4. A memory copy between two addresses on the same device (cudaMemcpy(..., cudaMemcpyDeviceToDevice))
5. A modification of the L1/shared memory configuration (cudaThreadSetCacheConfig, cudaDeviceSetCacheConfig)
21
Implicit and Explicit Synchronization
• Four ways to explicitly synchronize in CUDA:
1. Synchronize on a device:
cudaError_t cudaDeviceSynchronize();
2. Synchronize on a stream:
cudaError_t cudaStreamSynchronize(cudaStream_t stream);
3. Synchronize on an event:
cudaError_t cudaEventSynchronize(cudaEvent_t event);
4. Synchronize across streams using an event:
cudaError_t cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event);
22
Implicit and Explicit Synchronization
• cudaStreamWaitEvent adds inter-stream dependencies
– Causes the specified stream to wait on the specified event before executing any further actions
– The event does not need to have been recorded in the same stream
cudaEventRecord(event, stream1);
...
cudaStreamWaitEvent(stream2, event);
...
– No actions added to stream2 after the call to cudaStreamWaitEvent will execute until the event is satisfied
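Putting the fragments together, a hedged sketch of cross-stream ordering (the stream, event, and kernel names here are illustrative and assumed created elsewhere; the trailing flags argument of cudaStreamWaitEvent is passed as 0):

```cuda
// Work in stream1 produces data that a kernel in stream2 consumes.
producer<<<grid, block, 0, stream1>>>(d_data);
cudaEventRecord(event, stream1);        // marks completion of producer in stream1

// stream2 will not run anything enqueued after this call until the
// event is satisfied; the host itself does not block here.
cudaStreamWaitEvent(stream2, event, 0);
consumer<<<grid, block, 0, stream2>>>(d_data);
```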
23
Suggested Readings
1. Chapter 6 in Professional CUDA C Programming
2. Justin Luitjens. CUDA Streams: Best Practices and Common Pitfalls. GTC 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf
3. Steve Rennich. CUDA C/C++ Streams and Concurrency. 2011. http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
24
Manycore GPU Architectures and Programming: Outline
• Introduction
– GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA
– Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models
– OpenACC and OpenMP
25
CUDA Libraries
• CUDA libraries offer pre-packaged and expertly-optimized functions that implement commonly useful operations
– Vector addition, matrix-vector, matrix-matrix, FFT, etc.
26
CUDA Libraries
• What are the advantages of CUDA libraries?
– Support a wide range of application domains
– Highly usable, high-level APIs that are familiar to domain experts
– Tuned by CUDA experts to perform well across platforms and datasets
– Often offer the quickest route for porting: simply swap out API calls
– Low maintenance: the developer of the library takes on responsibility for bug fixes and feature requests
27
CUDA Libraries
28
Workflow to Use a CUDA Library
1. Create a library-specific handle that manages contextual information useful for the library's operation.
– Many CUDA libraries have the concept of a handle, which stores opaque library-specific information on the host that many library functions access
– It is the programmer's responsibility to manage this handle
– For example: cublasHandle_t, cufftHandle, cusparseHandle_t, curandGenerator_t
2. Allocate device memory for inputs and outputs to the library function.
– Use cudaMalloc as usual
29
Common Library Workflow
3. If inputs are not already in a library-supported format, convert them to be accessible by the library.
– Many CUDA libraries only accept data in a specific format
– For example: column-major vs. row-major arrays
4. Populate the pre-allocated device memory with inputs in a supported format.
– In many cases, this step simply implies a cudaMemcpy or one of its variants to make the data accessible on the GPU
– Some libraries provide custom transfer functions, for example cublasSetVector, which optimizes strided copies for the cuBLAS library
30
Common Library Workflow
5. Configure the library computation to be executed.
– In some libraries, this is a no-op
– Others require additional metadata to execute library computation correctly
– In some cases this configuration takes the form of extra parameters passed to library functions; others set fields in the library handle
6. Execute a library call that offloads the desired computation to the GPU.
– No GPU-specific knowledge required
31
Common Library Workflow
7. Retrieve the results of that computation from device memory, possibly in a library-determined format.
– Again, this may be as simple as a cudaMemcpy or require a library-specific function
8. If necessary, convert the retrieved data to the application's native format.
– If a conversion to a library-specific format was necessary, this step ensures the application can now use the calculated data
– In general, it is best to keep the application format and library format the same, reducing overhead from repeated conversions
32
Common Library Workflow
9. Release CUDA resources.
– Includes the usual CUDA cleanup (cudaFree, cudaStreamDestroy, etc.) plus any library-specific cleanup
10. Continue with the remainder of the application.
33
Common Library Workflow
34
• Not all libraries follow this workflow, and not all libraries require every step in this workflow
– In fact, for many libraries many steps are skipped
– Keeping this workflow in mind will help give you context on what the library might be doing behind the scenes and where you are in the process
• Next, we'll take a look at two commonly useful libraries
– Try to keep the common workflow in mind while we work with them
cuBLAS
• cuBLAS is a port of a popular linear algebra library, BLAS
• cuBLAS (like BLAS) splits its subroutines into multiple levels based on the data types processed:
– Level 1: vector-only operations (e.g. vector addition)
– Level 2: matrix-vector operations (e.g. matrix-vector multiplication)
– Level 3: matrix-matrix operations (e.g. matrix multiplication)
35
cuBLAS Idiosyncrasies
• For legacy compatibility, cuBLAS operates on column-major matrices
• cuBLAS also has a legacy API, deprecated since CUDA 4.0; this lecture will use the new cuBLAS API
– If you find cuBLAS code that doesn't quite match up, you may be looking at the old cuBLAS API
• Example: the matrix
3 0 0
6 0 0
0 2 1
is laid out as 3 0 0 6 0 0 0 2 1 in row-major order, but as 3 6 0 0 0 2 0 0 1 in column-major order
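The two orderings above come down to how a logical (row, col) index maps to a flat array offset. A small plain-C sketch of the indexing (illustrative only, not part of cuBLAS):

```c
#include <assert.h>

/* Offset of element (row, col) in column-major order, as cuBLAS expects.
   lda is the leading dimension: the number of rows in the underlying
   allocation. */
int idx_col_major(int row, int col, int lda) {
    return row + col * lda;
}

/* Offset in row-major order, as is conventional for C arrays.
   ld here is the number of columns in the underlying allocation. */
int idx_row_major(int row, int col, int ld) {
    return row * ld + col;
}
```

For the 3x3 matrix above, the element 6 at (row 1, col 0) sits at flat offset 3 in row-major order but at offset 1 in column-major order.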
36
cuBLAS Data Management
• Device memory in cuBLAS is allocated as you're used to: cudaMalloc
• Transferring data to/from the device uses cuBLAS-specific functions:
– cublasGetVector / cublasSetVector
– cublasGetMatrix / cublasSetMatrix
37
cuBLAS Data Management
• Example:
cublasStatus_t cublasSetVector(int n, int elemSize, const void *x, int incx, void *y, int incy);
where:
• n is the number of elements to transfer to the GPU
• elemSize is the size of each element (e.g. sizeof(int))
• x is the vector on the host to copy from
• incx is a stride in x between the array cells to transfer
• y is the vector on the GPU to copy to
• incy is a stride in y between the array cells to transfer
38
cuBLAS Data Management
• Example:
cublasSetVector(5, sizeof(int), h_x, 3, d_x, 2);
[Figure: every third element of h_x is copied into every second slot of d_x]
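The incx/incy strides can be pictured with a plain-C analogue of the element indexing cublasSetVector performs (a sketch of the indexing only; the real call copies from host to device memory):

```c
#include <assert.h>

/* Copy n elements from x (stride incx) to y (stride incy), mimicking
   the element indexing of cublasSetVector. Elements are ints here
   instead of being sized by an elemSize parameter. */
void strided_copy(const int *x, int incx, int *y, int incy, int n) {
    for (int i = 0; i < n; i++) {
        y[i * incy] = x[i * incx];
    }
}
```

With n = 5, incx = 3, incy = 2 as in the call above, elements x[0], x[3], x[6], x[9], x[12] land in y[0], y[2], y[4], y[6], y[8].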
39
cuBLAS Data Management
• Similarly:
cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb);
where:
• rows is the number of rows in the matrix to copy
• cols is the number of columns in the matrix to copy
• elemSize is the size of each cell in the matrix (e.g. sizeof(int))
• A is the source matrix on the host
• lda is the number of rows in the underlying array for A
• B is the destination matrix on the GPU
• ldb is the number of rows in the underlying array for B
40
cuBLAS Data Management
• Similarly:
cublasSetMatrix(3, 3, sizeof(int), h_A, 4, d_A, 5);
[Figure: a 3x3 submatrix is copied out of a host array with leading dimension 4 into a device array with leading dimension 5]
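The role of lda and ldb in the call above can be sketched in plain C (indexing only; the real call transfers host memory to the device):

```c
#include <assert.h>

/* Copy a rows-by-cols submatrix from column-major A (leading dimension
   lda) to column-major B (leading dimension ldb), mimicking the
   indexing of cublasSetMatrix. */
void submatrix_copy(int rows, int cols, const int *A, int lda,
                    int *B, int ldb) {
    for (int c = 0; c < cols; c++) {
        for (int r = 0; r < rows; r++) {
            B[r + c * ldb] = A[r + c * lda];
        }
    }
}
```

With rows = cols = 3, lda = 4, ldb = 5 as in the call above, column c of the source starts at offset 4*c and is written starting at offset 5*c in the destination.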
41
cuBLAS Example
• Matrix-vector multiplication
– Uses 6 of the 10 steps in the common library workflow:
1. Create a cuBLAS handle using cublasCreate
2. Allocate device memory for inputs and outputs using cudaMalloc
3. Populate device memory using cublasSetVector, cublasSetMatrix
4. Call cublasSgemv to run matrix-vector multiplication on the GPU
5. Retrieve results from the GPU using cublasGetVector
6. Release CUDA and cuBLAS resources using cudaFree, cublasDestroy
42
cuBLAS Example
• You can build and run the example cublas.cu:
cublasCreate(&handle);
cudaMalloc((void **)&dA, sizeof(float) * M * N);
cudaMalloc((void **)&dX, sizeof(float) * N);
cudaMalloc((void **)&dY, sizeof(float) * M);
cublasSetVector(N, sizeof(float), X, 1, dX, 1);
cublasSetVector(M, sizeof(float), Y, 1, dY, 1);
cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);
cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1, &beta, dY, 1);
cublasGetVector(M, sizeof(float), dY, 1, Y, 1);
/* for sgemm */
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWA);
43
cuBLAS Portability
• Porting to cuBLAS from BLAS is a straightforward process. In general, it requires:
– Adding device memory allocation/freeing (cudaMalloc, cudaFree)
– Adding device transfer functions (cublasSetVector, cublasSetMatrix, etc.)
– Transforming library routine calls from BLAS to cuBLAS (e.g. cblas_sgemv → cublasSgemv)
44
cuBLAS Portability
• Some common optimizations following a naive BLAS → cuBLAS port are:
– Reusing device memory allocations
– Removing redundant data transfers from and to the device
– Adding streamed execution using cublasSetStream
45
cuBLAS Summary
• cuBLAS makes accelerating legacy BLAS applications simple and easy
– Very little added code
– Straightforward mapping from BLAS routines to cuBLAS routines
– Flexible API improves portability
• For new linear algebra applications, cuBLAS offers a high-performance alternative to BLAS
– High-performance kernels with very little programmer time
46
cuFFT
• cuFFT offers an optimized implementation of the fast Fourier transform
47
cuFFT Configuration
48
• In cuFFT terminology, plans == handles
– cuFFT plans define a single FFT transformation to be performed
• cuFFT uses plans to derive the internal memory allocations, transfers, and kernels required to implement the desired transform
• Plans are created with:
cufftResult cufftPlan1d(cufftHandle *plan, int nx, cufftType type, int batch);
cufftResult cufftPlan2d(cufftHandle *plan, int nx, int ny, cufftType type);
cufftResult cufftPlan3d(cufftHandle *plan, int nx, int ny, int nz, cufftType type);
cuFFT Configuration
49
• cufftType refers to the data types of a transformation, for example:
– Complex-to-complex: CUFFT_C2C
– Real-to-complex: CUFFT_R2C
– Complex-to-real: CUFFT_C2R
cuFFT Example
• Creating a complex-to-complex 1D cuFFT plan and executing it, using 6 of the 10 steps in the common library workflow:
1. Create and configure a cuFFT plan
2. Allocate GPU memory for the input samples and output frequencies using cudaMalloc
3. Populate GPU memory with input samples using cudaMemcpy
4. Execute the plan using a cufftExec* function
5. Retrieve the calculated frequencies from GPU memory using cudaMemcpy
6. Release CUDA and cuFFT resources using cudaFree, cufftDestroy
50
cuFFT Example
• You can build and run an example cufft.cu:
cufftPlan1d(&plan, N, CUFFT_C2C, 1);
cudaMalloc((void **)&dComplexSamples, sizeof(cufftComplex) * N);
cudaMemcpy(dComplexSamples, complexSamples, sizeof(cufftComplex) * N, cudaMemcpyHostToDevice);
cufftExecC2C(plan, dComplexSamples, dComplexSamples, CUFFT_FORWARD);
cudaMemcpy(complexFreq, dComplexSamples, sizeof(cufftComplex) * N, cudaMemcpyDeviceToHost);
51
cuFFT Summary
• Like cuBLAS, cuFFT offers a high-level and usable API for porting legacy FFT applications or writing new ones
– cuFFT's API is deliberately similar to the industry-standard library FFTW to improve programmability
– Offers higher performance for little developer effort
52
Drop-In CUDA Libraries
• Drop-in CUDA libraries allow seamless integration of CUDA performance with existing codebases
– Full compatibility with industry-standard libraries; they expose the same external APIs
– BLAS → NVBLAS
– FFTW → cuFFTW
• Two ways to use drop-in libraries:
– Re-link to CUDA libraries
– LD_PRELOAD CUDA libraries before their host equivalents
53
Drop-In CUDA Libraries
• Re-linking legacy applications to CUDA libraries:
– Suppose you have a legacy application that relies on BLAS:
$ gcc app.c -lblas -o app
– Recompiling with NVBLAS linked will automatically accelerate all BLAS calls:
$ gcc app.c -lnvblas -o app
• Alternatively, simply set LD_PRELOAD when executing the application:
$ env LD_PRELOAD=libnvblas.so ./app
54
Survey of CUDA Library Performance
55
• We've seen that cuBLAS and cuFFT are high-level, programmable libraries (like their host counterparts)
– No CUDA-specific concepts (e.g. thread blocks, pinned memory, etc.)
• Let's do a brief survey of CUDA library performance to see the performance improvements possible
– We focus on the same libraries (cuBLAS and cuFFT), but similar data on other libraries is available in the book and online
Survey of CUDA Library Performance
56
Survey of CUDA Library Performance
57
Suggested Readings
1. All sections in Chapter 8 of Professional CUDA C Programming except "Using OpenACC"
2. cuSPARSE User Guide. 2014. http://docs.nvidia.com/cuda/cusparse/
3. cuBLAS User Guide. 2014. http://docs.nvidia.com/cuda/cublas/
4. cuRAND User Guide. 2014. http://docs.nvidia.com/cuda/curand/
5. cuFFT User Guide. 2014. http://docs.nvidia.com/cuda/cufft/
6. CUDA Toolkit 5.0 Performance Report. 2013. http://on-demand.gputechconf.com/gtc-express/2013/presentations/cuda-5.0-math-libraries-performance.pdf
58
Manycore GPU Architectures and Programming: Outline
• Introduction
– GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA
– Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models
– OpenACC and OpenMP
59
GPU Parallelization
• A many-faceted process
– Performance varies dramatically depending on the implementation of the same algorithms
• From naïve to highly optimized versions
• Many types of optimizations for GPUs:
– Shared memory
– Constant memory
– Global memory access patterns
– Warp shuffle instructions
– Computation-communication overlap
– CUDA compiler flags, e.g. loop unrolling, etc.
– Increasing parallelism
– ...
60
Optimization Opportunities
• Kernel-level optimization:
– Exposing sufficient parallelism
– Optimizing memory access
– Optimizing instruction execution
• Host-GPU optimization:
– E.g. kernel and data transfer overlap using CUDA streams
• Profile-driven optimization improves optimization selection
61
Kernel-Level Optimization
• Exposing sufficient parallelism
– Increase the amount of concurrent work on the GPU so as to saturate instruction and memory bandwidth
• Can be accomplished by:
1. More concurrently active warps per SM (warp-level parallelism)
2. More independent work assigned to each thread (instruction-level parallelism)
62
Kernel-Level Optimization
• Increasing the number of warps per SM/thread block does not guarantee performance improvement
– It results in fewer per-SM resources assigned to each thread (e.g. registers, shared memory)
– The trade-off: less parallelism with more per-thread resources vs. more parallelism with fewer per-thread resources
63
Kernel-Level Optimization
• Creating more independent work per thread
– Loop unrolling or other code transformations that expose instruction-level parallelism,
– but may also increase per-thread resource requirements

int sum = 0;
for (int i = 0; i < 4; i++) {
    sum += a[i];
}
// Requires 2 registers (sum, i), no instruction-level parallelism

int i1 = a[0];
int i2 = a[1];
int i3 = a[2];
int i4 = a[3];
int sum = i1 + i2 + i3 + i4;
// Requires 5 registers (i1, i2, i3, i4, sum),
// four-way instruction-level parallelism
64
Kernel-Level Optimization
• Optimizing memory access to maximize:
– Memory bandwidth utilization (efficiency of memory access patterns)
– Memory access concurrency (sufficient memory requests to hide memory latency)
65
Kernel-Level Optimization
• Aligned, coalesced global and shared memory accesses optimize memory bandwidth utilization
– Constant memory prefers a broadcast access pattern
66
Kernel-Level Optimization
• Optimizing instruction execution focuses on:
– Hiding instruction latency by keeping a sufficient number of warps active
– Avoiding divergent execution paths within warps (e.g. an if inside a kernel)
• Experimenting with thread execution configuration can produce unexpected performance gains from more or fewer active warps
• Divergent execution within a warp reduces parallelism, as warp execution of multiple code paths is serialized
67
Profile-Driven Optimization
• Profile-driven optimization is an iterative process to optimize a program based on quantitative profiling info
– As we apply optimization techniques, we analyze the results using nvprof and decide if they are beneficial
68
[Figure: the iterative cycle: gather profiling information → identify hotspots → determine performance inhibitors → optimize → repeat]
Profile-Driven Optimization
69
• The key challenge in profile-driven optimization is to determine performance inhibitors in hotspots
– nvvp and nvprof are invaluable tools for this
Profile-Driven Optimization
• nvprof profiling modes:
– Summary mode: the default mode; displays execution time information on high-level actions such as kernels or data transfers
– Trace mode: provides a timeline of CUDA events or actions in chronological order
– Event/metric summary mode: aggregates event/metric counts across all kernel invocations
– Event/metric trace mode: displays event/metric counts for each kernel invocation
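As a sketch, these modes correspond to nvprof command-line flags roughly as follows (./app and the metric name are placeholders for your own binary and metric of interest):

```shell
# Summary mode (default)
nvprof ./app

# Trace mode: chronological timeline of GPU actions and API calls
nvprof --print-gpu-trace ./app
nvprof --print-api-trace ./app

# Metric summary, aggregated across kernel invocations
nvprof --metrics gld_efficiency ./app
```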
70
Profile-Driven Optimization
• The NVIDIA Visual Profiler (nvvp) is also a powerful tool for guiding profile-driven optimization
– Offers a number of views to inspect different parts of a CUDA application
71
CUDA Debugging
• An important part of CUDA software development is the ability to debug CUDA applications
• CUDA offers a number of debugging tools, split into two categories:
– Kernel debugging
– Memory debugging
72
CUDA Debugging
• Kernel debugging tools help us analyze the correctness of running CUDA kernels by inspecting running application state
• Memory debugging tools help us detect application bugs by observing irregular or out-of-bounds memory accesses performed by CUDA kernels
73
CUDA Kernel Debugging
• Primary tool for the job: cuda-gdb
– Intentionally built to be similar to the host debugging tool gdb
– Requires compilation with special flags to be useful:
$ nvcc -g -G foo.cu -o foo
• Once an application is compiled in debug mode, running it under cuda-gdb is possible using:
$ cuda-gdb foo
...
(cuda-gdb)
74
CUDA Kernel Debugging
• cuda-gdb uses most of the same commands as gdb
• One main difference is the idea of CUDA focus: the current thread that cuda-gdb is focused on and against which all commands run
– Query the current focus using:
(cuda-gdb) cuda thread lane warp block sm grid device kernel
– Example of setting focus to the 128th thread in the current block:
(cuda-gdb) cuda thread (128)
75
CUDA Kernel Debugging
• printf is another form of CUDA kernel debugging
– Only available on devices of compute capability 2.0 or higher
• Prints are buffered on the device and periodically transferred back to the host for display
– The size of this buffer is configurable with cudaDeviceSetLimit
• Buffer contents are transferred to the host after any CUDA kernel launch, any host-side explicit synchronization, and any synchronous memory copies
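A minimal device-side printf sketch (the kernel and variable names here are illustrative):

```cuda
#include <cstdio>

__global__ void inspect(const int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Print from a single thread to keep the buffered output readable;
    // every thread that reaches printf writes into the device buffer.
    if (i == 0 && n > 0) {
        printf("data[0] = %d (block %d, thread %d)\n",
               data[0], blockIdx.x, threadIdx.x);
    }
}
```

If output is being dropped, the print buffer can be enlarged with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, bytes) before the launch.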
76
CUDA Memory Debugging
• Memory debugging detects memory errors in CUDA kernels that are likely indicative of bugs in the code
– For example: out-of-bounds memory accesses
• There is a single tool for memory debugging, cuda-memcheck, which contains two utilities:
– The memcheck tool
– The racecheck tool
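Both utilities are invoked from the same front end; as a sketch (./app is a placeholder for your own binary):

```shell
# memcheck tool (the default): memory correctness checks
cuda-memcheck ./app

# racecheck tool: __shared__ memory hazard detection
cuda-memcheck --tool racecheck ./app
```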
77
CUDA Memory Debugging
• The compilation process for cuda-memcheck is more involved than for cuda-gdb
– Building with full debug options affects performance, which may make memory errors harder to hit
– Applications should always be compiled with -lineinfo
– Applications should also be compiled to include symbol information, but doing this varies by platform:
• Linux: -Xcompiler -rdynamic
• Windows: -Xcompiler /Zi
• ...
78
CUDA Memory Debugging
• Once the application is compiled, memcheck can be used to check for 6 different types of memory errors:
1. Memory access error: out-of-bounds or misaligned memory access
2. Hardware exception: error reported by hardware
3. malloc/free errors: improper use of CUDA dynamic memory allocation
4. CUDA API errors: any error return code from a CUDA API call
5. cudaMalloc memory leaks: cudaMalloc allocations that are not cudaFree'd
6. Device heap memory leaks: dynamic memory allocations on the device that are never freed
79
CUDA Memory Debugging
• The two cuda-memcheck utilities offer very different capabilities:
– memcheck performs a wide range of memory correctness checks
– racecheck verifies that __shared__ memory usage is correct in an application, a particularly difficult task to perform manually
• cuda-memcheck offers a more automated approach to debugging than cuda-gdb
80
CUDA Error Handling
• Proper error handling is an important part of robust CUDA deployment
– Every CUDA function returns an error code that must be checked
– If asynchronous operations are used, this error may be the result of a different asynchronous operation failing
– A return code of cudaSuccess indicates success
• CUDA also offers a number of error-handling functions
81
CUDA Error Handling
cudaError_t cudaGetLastError();
– Retrieve the latest CUDA error, clearing the CUDA runtime's internal error state back to cudaSuccess
cudaError_t cudaPeekAtLastError();
– Retrieve the latest CUDA error, but do not clear the CUDA runtime's internal error state
const char *cudaGetErrorString(cudaError_t error);
– Fetch a human-readable string for the provided error
82
Suggested Readings
1. Chapter 10 in Professional CUDA C Programming
2. Adam DeConinck. Introduction to the CUDA Toolkit as an Application Build Tool. GTC 2013. http://on-demand.gputechconf.com/gtc/2013/webinar/cuda-toolkit-as-build-tool.pdf
3. Sandarbh Jain. CUDA Profiling Tools. GTC 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4587-cuda-profiling-tools.pdf
4. Thomas Bradley. GPU Performance Analysis and Optimization. 2012. http://people.maths.ox.ac.uk/gilesm/cuda/lecs/NV_Profiling_lowres.pdf
5. Julien Demouth. CUDA Optimization with NVIDIA Nsight(TM) Visual Studio Edition: A Case Study. GTC 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4160-cuda-optimization-nvidia-nsight-vse-case-study.pdf
83