Lecture 21: Manycore GPU Architectures and Programming, Part 3
-- Streaming, Library and Tuning
Concurrent and Multicore Programming
Department of Computer Science and Engineering, Yonghong Yan
[email protected]/~yan
1
Manycore GPU Architectures and Programming: Outline
• Introduction
– GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA
– Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models
– OpenACC and OpenMP
2
Offloading Processing Flow
1. Copy input data from CPU memory to GPU memory
[Figure: CPU and GPU memories connected via the PCI bus]
3
Offloading Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
4
Offloading Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
5
Overlapping Communication and Computation
[Figure: on the PCIe bus, copies of successive chunks are pipelined with GPU compute, so each copy overlaps with computation on a previous chunk]
• Three sequential steps for a single kernel execution
• Multiple kernels
– Asynchrony is a first-class citizen of most GPU programming frameworks
– Computation-communication overlap is a common technique in GPU programming
6
Abstract Concurrency
• What kinds of action overlap are possible in CUDA?
1. Overlapped host computation and device computation
2. Overlapped host computation and host-device data transfer
3. Overlapped host-device data transfer and device computation
4. Concurrent device computation
• CUDA streams are used to achieve each of these types of overlap
7
CUDA Streams
• A CUDA stream is a FIFO queue of CUDA actions to be performed
– Placing a new action at the head of a stream is asynchronous
– Actions are executed from the tail as CUDA resources allow
– Every action (kernel launch, cudaMemcpy, etc.) runs in an implicit or explicit stream
[Figure: a CUDA application enqueues actions (kernel, cudaMemcpy, cudaMemcpy) at the head of a stream; the CUDA runtime and GPU execute them from the tail]
8
CUDA Streams
• Two types of streams in a CUDA program
– The implicitly declared stream (NULL stream)
– Explicitly declared streams (non-NULL streams)
• Up until now, all code has been using the NULL stream by default:
cudaMemcpy(...);
kernel<<<...>>>(...);
cudaMemcpy(...);
• Non-NULL streams require manual allocation and management by the CUDA programmer
9
CUDA Streams
• To create a CUDA stream:
cudaError_t cudaStreamCreate(cudaStream_t *stream);
• To destroy a CUDA stream:
cudaError_t cudaStreamDestroy(cudaStream_t stream);
• To wait for all actions in a CUDA stream to finish:
cudaError_t cudaStreamSynchronize(cudaStream_t stream);
• To check if all actions in a CUDA stream have finished:
cudaError_t cudaStreamQuery(cudaStream_t stream);
10
CUDA Streams
• cudaMemcpyAsync: asynchronous memcpy
cudaError_t cudaMemcpyAsync(void *dst, const void *src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0);
• cudaMemcpyAsync does the same as cudaMemcpy, but may return before the transfer is actually complete
• Pinned host memory is a requirement for cudaMemcpyAsync
– Memory that is resident in physical memory pages and cannot be swapped out, also referred to as page-locked
• Recall that malloc normally reserves virtual address space first; physical pages are only allocated later
11
CUDA Streams
• Performing a cudaMemcpyAsync:
int *h_arr, *d_arr;
cudaStream_t stream;
cudaMalloc((void **)&d_arr, nbytes);
cudaMallocHost((void **)&h_arr, nbytes);   // page-locked memory allocation
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice, stream);   // call returns before transfer completes
...   // do something while data is being moved
cudaStreamSynchronize(stream);   // sync to make sure operations complete
cudaFree(d_arr);
cudaFreeHost(h_arr);
cudaStreamDestroy(stream);
12
CUDA Streams
• Associate kernel launches with a non-NULL stream
– Note that kernel launches are always asynchronous
kernel<<<nblocks, threads_per_block, smem_size, stream>>>(...);
• The effects of cudaMemcpyAsync and kernel launching:
– Operations are put in the stream queue for execution
– The actual operations may not have happened yet
• A host-side timer around these calls times only the calls themselves
– Not the actual time of the operations
13
CUDA Streams
• Vector sum example, A + B = C
• Partition the vectors and use CUDA streams to overlap copy and compute
[Figure: in the NULL stream, Copy A, Copy B, vector_sum<<<...>>>, and Copy C run back to back; with four streams A-D, each stream runs its own Copy A, Copy B, vector_sum, Copy C sequence, staggered so copies in one stream overlap with compute in another]
14
CUDA Streams
• How can this be implemented in code?
for (int i = 0; i < nstreams; i++) {
    int offset = i * eles_per_stream;
    cudaMemcpyAsync(&d_A[offset], &h_A[offset], eles_per_stream * sizeof(int), cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(&d_B[offset], &h_B[offset], eles_per_stream * sizeof(int), cudaMemcpyHostToDevice, streams[i]);
    ...
    vector_sum<<<..., streams[i]>>>(d_A + offset, d_B + offset, d_C + offset);
    cudaMemcpyAsync(&h_C[offset], &d_C[offset], eles_per_stream * sizeof(int), cudaMemcpyDeviceToHost, streams[i]);
}
for (int i = 0; i < nstreams; i++)
    cudaStreamSynchronize(streams[i]);
15
CUDA Events
• Timing asynchronous operations
– A host-side timer only measures the time for the call, not the actual time for the data movement or kernel execution
• Events are added to streams to mark specific points in stream execution
• Events are manually created and destroyed:
cudaError_t cudaEventCreate(cudaEvent_t *event);
cudaError_t cudaEventDestroy(cudaEvent_t *event);
[Figure: an event marking a point in a stream of Copy A, Copy B, vector_sum<<<...>>>, Copy C]
16
CUDA Events
• To add an event to a CUDA stream:
cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream);
– The event marks the point in time after all preceding actions in the stream complete, and before any actions added after the cudaEventRecord call run
• To make the host wait for some CUDA actions to finish:
cudaError_t cudaEventSynchronize(cudaEvent_t event);
– Waits for all the operations before this event to complete, but not those after it
17
CUDA Events
• Check if an event has been reached without waiting for it:
cudaError_t cudaEventQuery(cudaEvent_t event);
• Get the elapsed milliseconds between two events:
cudaError_t cudaEventElapsedTime(float *ms, cudaEvent_t start, cudaEvent_t stop);
[Figure: start and stop events bracketing the actions Copy A, Copy B, vector_sum<<<...>>>, Copy C in a stream]
18
CUDA Events
• In code:
float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
kernel<<<grid, block>>>(arguments);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
19
Implicit and Explicit Synchronization
• Two types of host-device synchronization:
– Implicit synchronization causes the host to wait on the GPU as a side effect of other CUDA actions
– Explicit synchronization causes the host to wait on the GPU because the programmer has asked for that behavior
20
Implicit and Explicit Synchronization
• Five CUDA operations that include implicit synchronization:
1. A pinned host memory allocation (cudaMallocHost, cudaHostAlloc)
2. A device memory allocation (cudaMalloc)
3. A device memset (cudaMemset)
4. A memory copy between two addresses on the same device (cudaMemcpy(..., cudaMemcpyDeviceToDevice))
5. A modification of the L1/shared memory configuration (cudaThreadSetCacheConfig, cudaDeviceSetCacheConfig)
21
Implicit and Explicit Synchronization
• Four ways to explicitly synchronize in CUDA:
1. Synchronize on a device:
cudaError_t cudaDeviceSynchronize();
2. Synchronize on a stream:
cudaError_t cudaStreamSynchronize(cudaStream_t stream);
3. Synchronize on an event:
cudaError_t cudaEventSynchronize(cudaEvent_t event);
4. Synchronize across streams using an event:
cudaError_t cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event);
22
Implicit and Explicit Synchronization
• cudaStreamWaitEvent adds inter-stream dependencies
– Causes the specified stream to wait on the specified event before executing any further actions
– The event does not need to have been recorded in the same stream
cudaEventRecord(event, stream1);
...
cudaStreamWaitEvent(stream2, event);
...
– No actions added to stream2 after the call to cudaStreamWaitEvent will execute until the event is satisfied
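Putting the fragments together, a hedged sketch of cross-stream ordering (the stream, event, and kernel names here are illustrative and assumed created elsewhere; the trailing flags argument of cudaStreamWaitEvent is passed as 0):

```cuda
// Work in stream1 produces data that a kernel in stream2 consumes.
producer<<<grid, block, 0, stream1>>>(d_data);
cudaEventRecord(event, stream1);        // marks completion of producer in stream1

// stream2 will not run anything enqueued after this call until the
// event is satisfied; the host itself does not block here.
cudaStreamWaitEvent(stream2, event, 0);
consumer<<<grid, block, 0, stream2>>>(d_data);
```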
23
Suggested Readings
1. Chapter 6 in Professional CUDA C Programming
2. Justin Luitjens. CUDA Streams: Best Practices and Common Pitfalls. GTC 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf
3. Steve Rennich. CUDA C/C++ Streams and Concurrency. 2011. http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
24
Manycore GPU Architectures and Programming: Outline
• Introduction
– GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA
– Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models
– OpenACC and OpenMP
25
CUDA Libraries
• CUDA libraries offer pre-packaged and expertly-optimized functions that implement commonly useful operations
– Vector addition, matrix-vector, matrix-matrix, FFT, etc.
26
CUDA Libraries
• What are the advantages of CUDA libraries?
– Support a wide range of application domains
– Highly usable, high-level APIs that are familiar to domain experts
– Tuned by CUDA experts to perform well across platforms and datasets
– Often offer the quickest route for porting: simply swap out API calls
– Low maintenance: the developer of the library takes on responsibility for bug fixes and feature requests
27
CUDA Libraries
28
Workflow to Use a CUDA Library
1. Create a library-specific handle that manages contextual information useful for the library's operation.
– Many CUDA libraries have the concept of a handle, which stores opaque library-specific information on the host that many library functions access
– It is the programmer's responsibility to manage this handle
– For example: cublasHandle_t, cufftHandle, cusparseHandle_t, curandGenerator_t
2. Allocate device memory for inputs and outputs to the library function.
– Use cudaMalloc as usual
29
Common Library Workflow
3. If inputs are not already in a library-supported format, convert them to be accessible by the library.
– Many CUDA libraries only accept data in a specific format
– For example: column-major vs. row-major arrays
4. Populate the pre-allocated device memory with inputs in a supported format.
– In many cases, this step simply implies a cudaMemcpy or one of its variants to make the data accessible on the GPU
– Some libraries provide custom transfer functions, for example cublasSetVector, which optimizes strided copies for the cuBLAS library
30
Common Library Workflow
5. Configure the library computation to be executed.
– In some libraries, this is a no-op
– Others require additional metadata to execute library computation correctly
– In some cases this configuration takes the form of extra parameters passed to library functions; others set fields in the library handle
6. Execute a library call that offloads the desired computation to the GPU.
– No GPU-specific knowledge required
31
Common Library Workflow
7. Retrieve the results of that computation from device memory, possibly in a library-determined format.
– Again, this may be as simple as a cudaMemcpy or require a library-specific function
8. If necessary, convert the retrieved data to the application's native format.
– If a conversion to a library-specific format was necessary, this step ensures the application can now use the calculated data
– In general, it is best to keep the application format and library format the same, reducing overhead from repeated conversions
32
Common Library Workflow
9. Release CUDA resources.
– Includes the usual CUDA cleanup (cudaFree, cudaStreamDestroy, etc.) plus any library-specific cleanup
10. Continue with the remainder of the application.
33
Common Library Workflow
34
• Not all libraries follow this workflow, and not all libraries require every step in this workflow
– In fact, for many libraries many steps are skipped
– Keeping this workflow in mind will help give you context on what the library might be doing behind the scenes and where you are in the process
• Next, we'll take a look at two commonly useful libraries
– Try to keep the common workflow in mind while we work with them
cuBLAS
• cuBLAS is a port of a popular linear algebra library, BLAS
• cuBLAS (like BLAS) splits its subroutines into multiple levels based on the data types processed:
– Level 1: vector-only operations (e.g. vector addition)
– Level 2: matrix-vector operations (e.g. matrix-vector multiplication)
– Level 3: matrix-matrix operations (e.g. matrix multiplication)
35
cuBLAS Idiosyncrasies
• For legacy compatibility, cuBLAS operates on column-major matrices
• cuBLAS also has a legacy API, deprecated since CUDA 4.0; this lecture will use the new cuBLAS API
– If you find cuBLAS code that doesn't quite match up, you may be looking at the old cuBLAS API
• Example: the matrix
3 0 0
6 0 0
0 2 1
is laid out as 3 0 0 6 0 0 0 2 1 in row-major order, but as 3 6 0 0 0 2 0 0 1 in column-major order
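The two orderings above come down to how a logical (row, col) index maps to a flat array offset. A small plain-C sketch of the indexing (illustrative only, not part of cuBLAS):

```c
#include <assert.h>

/* Offset of element (row, col) in column-major order, as cuBLAS expects.
   lda is the leading dimension: the number of rows in the underlying
   allocation. */
int idx_col_major(int row, int col, int lda) {
    return row + col * lda;
}

/* Offset in row-major order, as is conventional for C arrays.
   ld here is the number of columns in the underlying allocation. */
int idx_row_major(int row, int col, int ld) {
    return row * ld + col;
}
```

For the 3x3 matrix above, the element 6 at (row 1, col 0) sits at flat offset 3 in row-major order but at offset 1 in column-major order.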
36
cuBLAS Data Management
• Device memory in cuBLAS is allocated as you're used to: cudaMalloc
• Transferring data to/from the device uses cuBLAS-specific functions:
– cublasGetVector / cublasSetVector
– cublasGetMatrix / cublasSetMatrix
37
cuBLAS Data Management
• Example:
cublasStatus_t cublasSetVector(int n, int elemSize, const void *x, int incx, void *y, int incy);
where:
• n is the number of elements to transfer to the GPU
• elemSize is the size of each element (e.g. sizeof(int))
• x is the vector on the host to copy from
• incx is a stride in x between the array cells to transfer
• y is the vector on the GPU to copy to
• incy is a stride in y between the array cells to transfer
38
cuBLAS Data Management
• Example:
cublasSetVector(5, sizeof(int), h_x, 3, d_x, 2);
[Figure: every third element of h_x is copied into every second slot of d_x]
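The incx/incy strides can be pictured with a plain-C analogue of the element indexing cublasSetVector performs (a sketch of the indexing only; the real call copies from host to device memory):

```c
#include <assert.h>

/* Copy n elements from x (stride incx) to y (stride incy), mimicking
   the element indexing of cublasSetVector. Elements are ints here
   instead of being sized by an elemSize parameter. */
void strided_copy(const int *x, int incx, int *y, int incy, int n) {
    for (int i = 0; i < n; i++) {
        y[i * incy] = x[i * incx];
    }
}
```

With n = 5, incx = 3, incy = 2 as in the call above, elements x[0], x[3], x[6], x[9], x[12] land in y[0], y[2], y[4], y[6], y[8].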
39
cuBLAS Data Management
• Similarly:
cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb);
where:
• rows is the number of rows in the matrix to copy
• cols is the number of columns in the matrix to copy
• elemSize is the size of each cell in the matrix (e.g. sizeof(int))
• A is the source matrix on the host
• lda is the number of rows in the underlying array for A
• B is the destination matrix on the GPU
• ldb is the number of rows in the underlying array for B
40
cuBLAS Data Management
• Similarly:
cublasSetMatrix(3, 3, sizeof(int), h_A, 4, d_A, 5);
[Figure: a 3x3 submatrix is copied out of a host array with leading dimension 4 into a device array with leading dimension 5]
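The role of lda and ldb in the call above can be sketched in plain C (indexing only; the real call transfers host memory to the device):

```c
#include <assert.h>

/* Copy a rows-by-cols submatrix from column-major A (leading dimension
   lda) to column-major B (leading dimension ldb), mimicking the
   indexing of cublasSetMatrix. */
void submatrix_copy(int rows, int cols, const int *A, int lda,
                    int *B, int ldb) {
    for (int c = 0; c < cols; c++) {
        for (int r = 0; r < rows; r++) {
            B[r + c * ldb] = A[r + c * lda];
        }
    }
}
```

With rows = cols = 3, lda = 4, ldb = 5 as in the call above, column c of the source starts at offset 4*c and is written starting at offset 5*c in the destination.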
41
cuBLAS Example
• Matrix-vector multiplication
– Uses 6 of the 10 steps in the common library workflow:
1. Create a cuBLAS handle using cublasCreate
2. Allocate device memory for inputs and outputs using cudaMalloc
3. Populate device memory using cublasSetVector, cublasSetMatrix
4. Call cublasSgemv to run matrix-vector multiplication on the GPU
5. Retrieve results from the GPU using cublasGetVector
6. Release CUDA and cuBLAS resources using cudaFree, cublasDestroy
42
cuBLAS Example
• You can build and run the example cublas.cu:
cublasCreate(&handle);
cudaMalloc((void **)&dA, sizeof(float) * M * N);
cudaMalloc((void **)&dX, sizeof(float) * N);
cudaMalloc((void **)&dY, sizeof(float) * M);
cublasSetVector(N, sizeof(float), X, 1, dX, 1);
cublasSetVector(M, sizeof(float), Y, 1, dY, 1);
cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);
cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1, &beta, dY, 1);
cublasGetVector(M, sizeof(float), dY, 1, Y, 1);
/* for sgemm */
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWA);
43
cuBLAS Portability
• Porting to cuBLAS from BLAS is a straightforward process. In general, it requires:
– Adding device memory allocation/freeing (cudaMalloc, cudaFree)
– Adding device transfer functions (cublasSetVector, cublasSetMatrix, etc.)
– Transforming library routine calls from BLAS to cuBLAS (e.g. cblas_sgemv → cublasSgemv)
44
cuBLAS Portability
• Some common optimizations following a naive BLAS → cuBLAS port are:
– Reusing device memory allocations
– Removing redundant data transfers from and to the device
– Adding streamed execution using cublasSetStream
45
cuBLAS Summary
• cuBLAS makes accelerating legacy BLAS applications simple and easy
– Very little added code
– Straightforward mapping from BLAS routines to cuBLAS routines
– Flexible API improves portability
• For new linear algebra applications, cuBLAS offers a high-performance alternative to BLAS
– High-performance kernels with very little programmer time
46
cuFFT
• cuFFT offers an optimized implementation of the fast Fourier transform
47
cuFFT Configuration
48
• In cuFFT terminology, plans == handles
– cuFFT plans define a single FFT transformation to be performed
• cuFFT uses plans to derive the internal memory allocations, transfers, and kernels required to implement the desired transform
• Plans are created with:
cufftResult cufftPlan1d(cufftHandle *plan, int nx, cufftType type, int batch);
cufftResult cufftPlan2d(cufftHandle *plan, int nx, int ny, cufftType type);
cufftResult cufftPlan3d(cufftHandle *plan, int nx, int ny, int nz, cufftType type);
cuFFT Configuration
49
• cufftType refers to the data types of a transformation, for example:
– Complex-to-complex: CUFFT_C2C
– Real-to-complex: CUFFT_R2C
– Complex-to-real: CUFFT_C2R
cuFFT Example
• Creating a complex-to-complex 1D cuFFT plan and executing it, using 6 of the 10 steps in the common library workflow:
1. Create and configure a cuFFT plan
2. Allocate GPU memory for the input samples and output frequencies using cudaMalloc
3. Populate GPU memory with input samples using cudaMemcpy
4. Execute the plan using a cufftExec* function
5. Retrieve the calculated frequencies from GPU memory using cudaMemcpy
6. Release CUDA and cuFFT resources using cudaFree, cufftDestroy
50
cuFFT Example
• You can build and run an example cufft.cu:
cufftPlan1d(&plan, N, CUFFT_C2C, 1);
cudaMalloc((void **)&dComplexSamples, sizeof(cufftComplex) * N);
cudaMemcpy(dComplexSamples, complexSamples, sizeof(cufftComplex) * N, cudaMemcpyHostToDevice);
cufftExecC2C(plan, dComplexSamples, dComplexSamples, CUFFT_FORWARD);
cudaMemcpy(complexFreq, dComplexSamples, sizeof(cufftComplex) * N, cudaMemcpyDeviceToHost);
51
cuFFT Summary
• Like cuBLAS, cuFFT offers a high-level and usable API for porting legacy FFT applications or writing new ones
– cuFFT's API is deliberately similar to the industry-standard library FFTW to improve programmability
– Offers higher performance for little developer effort
52
Drop-In CUDA Libraries
• Drop-in CUDA libraries allow seamless integration of CUDA performance with existing codebases
– Full compatibility with industry-standard libraries; they expose the same external APIs
– BLAS → NVBLAS
– FFTW → cuFFTW
• Two ways to use drop-in libraries:
– Re-link to CUDA libraries
– LD_PRELOAD CUDA libraries before their host equivalents
53
Drop-In CUDA Libraries
• Re-linking legacy applications to CUDA libraries:
– Suppose you have a legacy application that relies on BLAS:
$ gcc app.c -lblas -o app
– Recompiling with NVBLAS linked will automatically accelerate all BLAS calls:
$ gcc app.c -lnvblas -o app
• Alternatively, simply set LD_PRELOAD when executing the application:
$ env LD_PRELOAD=libnvblas.so ./app
54
Survey of CUDA Library Performance
55
• We've seen that cuBLAS and cuFFT are high-level, programmable libraries (like their host counterparts)
– No CUDA-specific concepts (e.g. thread blocks, pinned memory, etc.)
• Let's do a brief survey of CUDA library performance to see the performance improvements possible
– We focus on the same libraries (cuBLAS and cuFFT), but similar data on other libraries is available in the book and online
Survey of CUDA Library Performance
56
Survey of CUDA Library Performance
57
Suggested Readings
1. All sections in Chapter 8 of Professional CUDA C Programming except "Using OpenACC"
2. cuSPARSE User Guide. 2014. http://docs.nvidia.com/cuda/cusparse/
3. cuBLAS User Guide. 2014. http://docs.nvidia.com/cuda/cublas/
4. cuRAND User Guide. 2014. http://docs.nvidia.com/cuda/curand/
5. cuFFT User Guide. 2014. http://docs.nvidia.com/cuda/cufft/
6. CUDA Toolkit 5.0 Performance Report. 2013. http://on-demand.gputechconf.com/gtc-express/2013/presentations/cuda-5.0-math-libraries-performance.pdf
58
Manycore GPU Architectures and Programming: Outline
• Introduction
– GPU architectures, GPGPUs, and CUDA
• GPU execution model
• CUDA programming model
• Working with memory in CUDA
– Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models
– OpenACC and OpenMP
59
GPU Parallelization
• A many-faceted process
– Performance varies dramatically depending on the implementation of the same algorithms
• From naïve to highly optimized versions
• Many types of optimizations for GPUs:
– Shared memory
– Constant memory
– Global memory access patterns
– Warp shuffle instructions
– Computation-communication overlap
– CUDA compiler flags, e.g. loop unrolling, etc.
– Increasing parallelism
– ...
60
Optimization Opportunities
• Kernel-level optimization:
– Exposing sufficient parallelism
– Optimizing memory access
– Optimizing instruction execution
• Host-GPU optimization:
– E.g. kernel and data transfer overlap using CUDA streams
• Profile-driven optimization improves optimization selection
61
Kernel-Level Optimization
• Exposing sufficient parallelism
– Increase the amount of concurrent work on the GPU so as to saturate instruction and memory bandwidth
• Can be accomplished by:
1. More concurrently active warps per SM (warp-level parallelism)
2. More independent work assigned to each thread (instruction-level parallelism)
62
Kernel-Level Optimization
• Increasing the number of warps per SM/thread block does not guarantee performance improvement
– It results in fewer per-SM resources assigned to each thread (e.g. registers, shared memory)
– The trade-off: less parallelism with more per-thread resources vs. more parallelism with fewer per-thread resources
63
Kernel-Level Optimization
• Creating more independent work per thread
– Loop unrolling or other code transformations that expose instruction-level parallelism,
– but may also increase per-thread resource requirements

int sum = 0;
for (int i = 0; i < 4; i++) {
    sum += a[i];
}
// Requires 2 registers (sum, i), no instruction-level parallelism

int i1 = a[0];
int i2 = a[1];
int i3 = a[2];
int i4 = a[3];
int sum = i1 + i2 + i3 + i4;
// Requires 5 registers (i1, i2, i3, i4, sum),
// four-way instruction-level parallelism
64
Kernel-Level Optimization
• Optimizing memory access to maximize:
– Memory bandwidth utilization (efficiency of memory access patterns)
– Memory access concurrency (sufficient memory requests to hide memory latency)
65
Kernel-Level Optimization
• Aligned, coalesced global and shared memory accesses optimize memory bandwidth utilization
– Constant memory prefers a broadcast access pattern
66
Kernel-Level Optimization
• Optimizing instruction execution focuses on:
– Hiding instruction latency by keeping a sufficient number of warps active
– Avoiding divergent execution paths within warps (e.g. an if inside a kernel)
• Experimenting with thread execution configuration can produce unexpected performance gains from more or fewer active warps
• Divergent execution within a warp reduces parallelism, as warp execution of multiple code paths is serialized
67
Profile-Driven Optimization
• Profile-driven optimization is an iterative process to optimize a program based on quantitative profiling info
– As we apply optimization techniques, we analyze the results using nvprof and decide if they are beneficial
68
[Figure: the iterative cycle: gather profiling information → identify hotspots → determine performance inhibitors → optimize → repeat]
Profile-Driven Optimization
69
• The key challenge in profile-driven optimization is to determine performance inhibitors in hotspots
– nvvp and nvprof are invaluable tools for this
Profile-Driven Optimization
• nvprof profiling modes:
– Summary mode: the default mode; displays execution time information on high-level actions such as kernels or data transfers
– Trace mode: provides a timeline of CUDA events or actions in chronological order
– Event/metric summary mode: aggregates event/metric counts across all kernel invocations
– Event/metric trace mode: displays event/metric counts for each kernel invocation
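As a sketch, these modes correspond to nvprof command-line flags roughly as follows (./app and the metric name are placeholders for your own binary and metric of interest):

```shell
# Summary mode (default)
nvprof ./app

# Trace mode: chronological timeline of GPU actions and API calls
nvprof --print-gpu-trace ./app
nvprof --print-api-trace ./app

# Metric summary, aggregated across kernel invocations
nvprof --metrics gld_efficiency ./app
```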
70
Profile-Driven Optimization
• The NVIDIA Visual Profiler (nvvp) is also a powerful tool for guiding profile-driven optimization
– Offers a number of views to inspect different parts of a CUDA application
71
CUDA Debugging
• An important part of CUDA software development is the ability to debug CUDA applications
• CUDA offers a number of debugging tools, split into two categories:
– Kernel debugging
– Memory debugging
72
CUDA Debugging
• Kernel debugging tools help us analyze the correctness of running CUDA kernels by inspecting running application state
• Memory debugging tools help us detect application bugs by observing irregular or out-of-bounds memory accesses performed by CUDA kernels
73
CUDA Kernel Debugging
• Primary tool for the job: cuda-gdb
– Intentionally built to be similar to the host debugging tool gdb
– Requires compilation with special flags to be useful:
$ nvcc -g -G foo.cu -o foo
• Once an application is compiled in debug mode, running it under cuda-gdb is possible using:
$ cuda-gdb foo
...
(cuda-gdb)
74
CUDA Kernel Debugging
• cuda-gdb uses most of the same commands as gdb
• One main difference is the idea of CUDA focus: the current thread that cuda-gdb is focused on and against which all commands run
– Query the current focus using:
(cuda-gdb) cuda thread lane warp block sm grid device kernel
– Example of setting focus to the 128th thread in the current block:
(cuda-gdb) cuda thread (128)
75
CUDA Kernel Debugging
• printf is another form of CUDA kernel debugging
– Only available on devices of compute capability 2.0 or higher
• Prints are buffered on the device and periodically transferred back to the host for display
– The size of this buffer is configurable with cudaDeviceSetLimit
• Buffer contents are transferred to the host after any CUDA kernel launch, any host-side explicit synchronization, and any synchronous memory copies
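A minimal device-side printf sketch (the kernel and variable names here are illustrative):

```cuda
#include <cstdio>

__global__ void inspect(const int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Print from a single thread to keep the buffered output readable;
    // every thread that reaches printf writes into the device buffer.
    if (i == 0 && n > 0) {
        printf("data[0] = %d (block %d, thread %d)\n",
               data[0], blockIdx.x, threadIdx.x);
    }
}
```

If output is being dropped, the print buffer can be enlarged with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, bytes) before the launch.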
76
CUDA Memory Debugging
• Memory debugging detects memory errors in CUDA kernels that are likely indicative of bugs in the code
– For example: out-of-bounds memory accesses
• There is a single tool for memory debugging, cuda-memcheck, which contains two utilities:
– The memcheck tool
– The racecheck tool
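Both utilities are invoked from the same front end; as a sketch (./app is a placeholder for your own binary):

```shell
# memcheck tool (the default): memory correctness checks
cuda-memcheck ./app

# racecheck tool: __shared__ memory hazard detection
cuda-memcheck --tool racecheck ./app
```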
77
CUDA Memory Debugging
• The compilation process for cuda-memcheck is more involved than for cuda-gdb
– Building with full debug options affects performance, which may make memory errors harder to hit
– Applications should always be compiled with -lineinfo
– Applications should also be compiled to include symbol information, but doing this varies by platform:
• Linux: -Xcompiler -rdynamic
• Windows: -Xcompiler /Zi
• ...
78
CUDA Memory Debugging
• Once the application is compiled, memcheck can be used to check for 6 different types of memory errors:
1. Memory access error: out-of-bounds or misaligned memory access
2. Hardware exception: error reported by hardware
3. malloc/free errors: improper use of CUDA dynamic memory allocation
4. CUDA API errors: any error return code from a CUDA API call
5. cudaMalloc memory leaks: cudaMalloc allocations that are not cudaFree'd
6. Device heap memory leaks: dynamic memory allocations on the device that are never freed
79
CUDA Memory Debugging
• The two cuda-memcheck utilities offer very different capabilities:
– memcheck performs a wide range of memory correctness checks
– racecheck verifies that __shared__ memory usage is correct in an application, a particularly difficult task to perform manually
• cuda-memcheck offers a more automated approach to debugging than cuda-gdb
80
CUDA Error Handling
• Proper error handling is an important part of robust CUDA deployment
– Every CUDA function returns an error code that must be checked
– If asynchronous operations are used, this error may be the result of a different asynchronous operation failing
– A return code of cudaSuccess indicates success
• CUDA also offers a number of error-handling functions
81
CUDA Error Handling
cudaError_t cudaGetLastError();
– Retrieve the latest CUDA error, clearing the CUDA runtime's internal error state back to cudaSuccess
cudaError_t cudaPeekAtLastError();
– Retrieve the latest CUDA error, but do not clear the CUDA runtime's internal error state
const char *cudaGetErrorString(cudaError_t error);
– Fetch a human-readable string for the provided error
82
Suggested Readings
1. Chapter 10 in Professional CUDA C Programming
2. Adam DeConinck. Introduction to the CUDA Toolkit as an Application Build Tool. GTC 2013. http://on-demand.gputechconf.com/gtc/2013/webinar/cuda-toolkit-as-build-tool.pdf
3. Sandarbh Jain. CUDA Profiling Tools. GTC 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4587-cuda-profiling-tools.pdf
4. Thomas Bradley. GPU Performance Analysis and Optimization. 2012. http://people.maths.ox.ac.uk/gilesm/cuda/lecs/NV_Profiling_lowres.pdf
5. Julien Demouth. CUDA Optimization with NVIDIA Nsight(TM) Visual Studio Edition: A Case Study. GTC 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4160-cuda-optimization-nvidia-nsight-vse-case-study.pdf
83