Transcript of Lecture 21: Manycore GPU Architectures and Programming, Part 3 -- Streaming, Library and Tuning

Page 1:

Lecture 21: Manycore GPU Architectures and Programming, Part 3
-- Streaming, Library and Tuning

Concurrent and Multicore Programming

Department of Computer Science and Engineering
Yonghong Yan
[email protected]
www.secs.oakland.edu/~yan

Page 2:

Manycore GPU Architectures and Programming: Outline

•  Introduction
   –  GPU architectures, GPGPUs, and CUDA
•  GPU Execution model
•  CUDA Programming model
•  Working with Memory in CUDA
   –  Global memory, shared and constant memory
•  Streams and concurrency
•  CUDA instruction intrinsics and library
•  Performance, profiling, debugging, and error handling
•  Directive-based high-level programming model
   –  OpenACC and OpenMP

Page 3:

Offloading Processing Flow

1.  Copy input data from CPU memory to GPU memory

[Figure: CPU and GPU memories connected by the PCI bus]

Page 4:

Offloading Processing Flow

1.  Copy input data from CPU memory to GPU memory
2.  Load GPU program and execute, caching data on chip for performance

[Figure: CPU and GPU memories connected by the PCI bus]

Page 5:

Offloading Processing Flow

1.  Copy input data from CPU memory to GPU memory
2.  Load GPU program and execute, caching data on chip for performance
3.  Copy results from GPU memory to CPU memory

[Figure: CPU and GPU memories connected by the PCI bus]

Page 6:

Overlapping Communication and Computation

[Figure: GPU timeline in which PCIe bus copies (Copy x5) are overlapped with GPU compute (Compute x4)]

•  Three sequential steps for a single kernel execution
•  Multiple kernels
   –  Asynchrony is a first-class citizen of most GPU programming frameworks
   –  Computation-communication overlap is a common technique in GPU programming

Page 7:

Abstract Concurrency

•  What kinds of action overlap are possible in CUDA?
   1.  Overlapped host computation and device computation
   2.  Overlapped host computation and host-device data transfer
   3.  Overlapped host-device data transfer and device computation
   4.  Concurrent device computation
•  CUDA Streams are used to achieve each of these types of overlap (a sketch of type 1 follows below)
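As a minimal sketch of overlap type 1 (not from the original slides; the kernel and the host-work placeholder are hypothetical): because kernel launches return immediately to the host, the CPU can do independent work until an explicit synchronization point.

// Minimal sketch: overlapped host and device computation
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void overlap_host_device(float *d_x, int n) {
    scale<<<(n + 255) / 256, 256>>>(d_x, n); // asynchronous: returns immediately
    /* ... do independent host computation here while the GPU runs ... */
    cudaDeviceSynchronize();                 // block until the kernel finishes
}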

Page 8:

CUDA Streams

•  A CUDA stream is a FIFO queue of CUDA actions to be performed
   –  Placing a new action at the head of a stream is asynchronous
   –  Actions are executed from the tail as CUDA resources allow
   –  Every action (kernel launch, cudaMemcpy, etc.) runs in an implicit or explicit stream

[Figure: a CUDA application enqueues actions (kernel, cudaMemcpy, cudaMemcpy) at the head of a CUDA stream; the CUDA runtime and GPU consume them from the tail]

Page 9:

CUDA Streams

•  Two types of streams in a CUDA program
   –  The implicitly declared stream (NULL stream)
   –  Explicitly declared streams (non-NULL streams)

•  Up until now, all code has been using the NULL stream by default:

cudaMemcpy(...);
kernel<<<...>>>(...);
cudaMemcpy(...);

•  Non-NULL streams require manual allocation and management by the CUDA programmer

Page 10:

CUDA Streams

•  To create a CUDA stream:
   cudaError_t cudaStreamCreate(cudaStream_t *stream);

•  To destroy a CUDA stream:
   cudaError_t cudaStreamDestroy(cudaStream_t stream);

•  To wait for all actions in a CUDA stream to finish:
   cudaError_t cudaStreamSynchronize(cudaStream_t stream);

•  To check whether all actions in a CUDA stream have finished:
   cudaError_t cudaStreamQuery(cudaStream_t stream);

(A sketch combining these calls follows below.)
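A minimal sketch of the typical stream lifecycle (a fixed stream count and no error checking are assumed; not from the original slides):

#define NSTREAMS 4

cudaStream_t streams[NSTREAMS];

// Create the streams once, up front
for (int i = 0; i < NSTREAMS; i++)
    cudaStreamCreate(&streams[i]);

/* ... enqueue asynchronous work into streams[i] here ... */

// cudaStreamQuery returns cudaSuccess if all enqueued work has finished,
// cudaErrorNotReady otherwise -- it never blocks
if (cudaStreamQuery(streams[0]) == cudaSuccess) { /* stream 0 has drained */ }

// Block until each stream has finished, then release it
for (int i = 0; i < NSTREAMS; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}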

Page 11:

CUDA Streams

•  cudaMemcpyAsync: asynchronous memcpy

cudaError_t cudaMemcpyAsync(void *dst, const void *src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0);

•  cudaMemcpyAsync does the same as cudaMemcpy, but may return before the transfer is actually complete

•  Pinned host memory is a requirement for cudaMemcpyAsync
   –  Memory that is resident in physical memory pages and cannot be swapped out, also referred to as page-locked
   –  Recall that malloc normally reserves virtual address space first; physical pages are only allocated later

Page 12:

CUDA Streams

•  Performing a cudaMemcpyAsync:

int *h_arr, *d_arr;
cudaStream_t stream;

cudaMalloc((void **)&d_arr, nbytes);
cudaMallocHost((void **)&h_arr, nbytes);   // page-locked memory allocation
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice,
                stream);                   // call returns before transfer completes
...                                        // do something while data is being moved
cudaStreamSynchronize(stream);             // sync to make sure operations complete

cudaFree(d_arr);
cudaFreeHost(h_arr);
cudaStreamDestroy(stream);

Page 13:

CUDA Streams

•  Associate kernel launches with a non-NULL stream
   –  Note that kernels are always asynchronous

kernel<<<nblocks, threads_per_block, smem_size, stream>>>(...);

•  The effects of cudaMemcpyAsync and kernel launching:
   –  Operations are put in the stream queue for execution
   –  The actual operations may not have happened yet

•  A host-side timer therefore times only the calls themselves
   –  Not the actual time of the operations (see the sketch below)
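A minimal sketch of this pitfall (wtime() is a hypothetical host timer helper, not part of CUDA):

double wtime(void);   // hypothetical wall-clock timer, e.g. built on clock_gettime

double t0 = wtime();
cudaMemcpyAsync(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice, stream);
kernel<<<nblocks, threads_per_block, 0, stream>>>(d_arr);
double t1 = wtime();   // t1 - t0: only the time to *enqueue* the work

cudaStreamSynchronize(stream);
double t2 = wtime();   // t2 - t0: includes the actual transfer and kernel time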

Page 14:

CUDA Streams

•  Vector sum example, A + B = C
•  Partition the vectors and use CUDA streams to overlap copy and compute

[Figure: the NULL-stream version runs Copy A, Copy B, vector_sum<<<...>>>, Copy C back-to-back; with four streams (A-D), each stream's Copy A, Copy B, vector_sum (v_s), and Copy C are staggered so copies in one stream overlap compute in another]

Page 15:

CUDA Streams

•  How can this be implemented in code?

for (int i = 0; i < nstreams; i++) {
    int offset = i * eles_per_stream;
    cudaMemcpyAsync(&d_A[offset], &h_A[offset],
                    eles_per_stream * sizeof(int),
                    cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(&d_B[offset], &h_B[offset],
                    eles_per_stream * sizeof(int),
                    cudaMemcpyHostToDevice, streams[i]);
    ...
    vector_sum<<<..., streams[i]>>>(d_A + offset, d_B + offset,
                                    d_C + offset);
    cudaMemcpyAsync(&h_C[offset], &d_C[offset],
                    eles_per_stream * sizeof(int),
                    cudaMemcpyDeviceToHost, streams[i]);
}

for (int i = 0; i < nstreams; i++)
    cudaStreamSynchronize(streams[i]);

(The setup this loop assumes is sketched below.)

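The loop above assumes the streams and the pinned host and device buffers already exist; a minimal sketch of that setup (variable names chosen to match the loop, sizes assumed to divide evenly; not from the original slides):

int nstreams = 4;
int eles_per_stream = N / nstreams;   // assumes N divides evenly

int *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;
cudaStream_t streams[4];

// Pinned host memory is required for truly asynchronous copies
cudaMallocHost((void **)&h_A, N * sizeof(int));
cudaMallocHost((void **)&h_B, N * sizeof(int));
cudaMallocHost((void **)&h_C, N * sizeof(int));

cudaMalloc((void **)&d_A, N * sizeof(int));
cudaMalloc((void **)&d_B, N * sizeof(int));
cudaMalloc((void **)&d_C, N * sizeof(int));

for (int i = 0; i < nstreams; i++)
    cudaStreamCreate(&streams[i]);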

Page 16:

CUDA Events

•  Timing asynchronous operations
   –  A host-side timer only measures the time for the call, not the actual time for the data movement or kernel execution

•  Events are added to streams to mark specific points in stream execution

•  Events are manually created and destroyed:
   cudaError_t cudaEventCreate(cudaEvent_t *event);
   cudaError_t cudaEventDestroy(cudaEvent_t event);

[Figure: an event marker placed in a stream containing Copy A, Copy B, vector_sum<<<...>>>, Copy C]

Page 17:

CUDA Events

•  To add an event to a CUDA stream:
   cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream);
   –  The event marks the point in time after all preceding actions in the stream complete, and before any actions added after the cudaEventRecord call

•  The host can wait for some CUDA actions to finish:
   cudaError_t cudaEventSynchronize(cudaEvent_t event);
   –  Waits for all the operations before this event to complete, but not those after it

[Figure: an event marker placed in a stream containing Copy A, Copy B, vector_sum<<<...>>>, Copy C]

Page 18:

CUDA Events

•  Check whether an event has been reached without waiting for it:
   cudaError_t cudaEventQuery(cudaEvent_t event);

•  Get the elapsed milliseconds between two events:
   cudaError_t cudaEventElapsedTime(float *ms, cudaEvent_t start, cudaEvent_t stop);

[Figure: a start event before Copy A and a stop event after Copy C in the stream]

Page 19:

CUDA Events

•  In code:

float time;
cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel<<<grid, block>>>(arguments);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Page 20:

Implicit and Explicit Synchronization

•  Two types of host-device synchronization:
   –  Implicit synchronization causes the host to wait on the GPU, but as a side effect of other CUDA actions
   –  Explicit synchronization causes the host to wait on the GPU because the programmer has asked for that behavior

Page 21:

Implicit and Explicit Synchronization

•  Five CUDA operations that include implicit synchronization:
   1.  A pinned host memory allocation (cudaMallocHost, cudaHostAlloc)
   2.  A device memory allocation (cudaMalloc)
   3.  A device memset (cudaMemset)
   4.  A memory copy between two addresses on the same device (cudaMemcpy(..., cudaMemcpyDeviceToDevice))
   5.  A modification to the L1/shared memory configuration (cudaThreadSetCacheConfig, cudaDeviceSetCacheConfig)

Page 22:

Implicit and Explicit Synchronization

•  Four ways to explicitly synchronize in CUDA:
   1.  Synchronize on a device:
       cudaError_t cudaDeviceSynchronize();
   2.  Synchronize on a stream:
       cudaError_t cudaStreamSynchronize(cudaStream_t stream);
   3.  Synchronize on an event:
       cudaError_t cudaEventSynchronize(cudaEvent_t event);
   4.  Synchronize across streams using an event:
       cudaError_t cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event);

Page 23:

Implicit and Explicit Synchronization

•  cudaStreamWaitEvent adds inter-stream dependencies
   –  Causes the specified stream to wait on the specified event before executing any further actions
   –  The event does not need to be an event recorded in stream

cudaEventRecord(event, stream1);
...
cudaStreamWaitEvent(stream2, event);
...

   –  No actions added to stream2 after the call to cudaStreamWaitEvent will execute until the event is satisfied (a fuller sketch follows below)

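A minimal sketch of such an inter-stream dependency (buffer and kernel names are hypothetical; not from the original slides): the kernel in stream2 is held back until the transfer in stream1 has completed.

cudaStream_t stream1, stream2;
cudaEvent_t event;

cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaEventCreate(&event);

// Stage data in stream1, then mark the point at which it is ready
cudaMemcpyAsync(d_in, h_in, nbytes, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(event, stream1);

// consume<<<...>>> in stream2 will not start until the event is satisfied,
// i.e., until the copy in stream1 has finished
cudaStreamWaitEvent(stream2, event, 0);
consume<<<grid, block, 0, stream2>>>(d_in);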

Page 24:

Suggested Readings

1.  Chapter 6 in Professional CUDA C Programming
2.  Justin Luitjens. CUDA Streams: Best Practices and Common Pitfalls. GTC 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf
3.  Steve Rennich. CUDA C/C++ Streams and Concurrency. 2011. http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf

Page 25:

Manycore GPU Architectures and Programming: Outline

•  Introduction
   –  GPU architectures, GPGPUs, and CUDA
•  GPU Execution model
•  CUDA Programming model
•  Working with Memory in CUDA
   –  Global memory, shared and constant memory
•  Streams and concurrency
•  CUDA instruction intrinsics and library
•  Performance, profiling, debugging, and error handling
•  Directive-based high-level programming model
   –  OpenACC and OpenMP

Page 26:

CUDA Libraries

•  CUDA libraries offer pre-packaged and expertly optimized functions that implement commonly useful operations
   –  Vector addition, matrix-vector, matrix-matrix, FFT, etc.

Page 27:

CUDA Libraries

•  What are the advantages of CUDA libraries?
   –  Support a wide range of application domains
   –  Highly usable, high-level APIs that are familiar to domain experts
   –  Tuned by CUDA experts to perform well across platforms and datasets
   –  Often offer the quickest route for porting: simply swap out API calls
   –  Low maintenance: the developer of the library takes on responsibility for bug fixes and feature requests

Page 28:

CUDA Libraries

[Figure: overview of the CUDA library ecosystem]

Page 29:

Workflow to Use a CUDA Library

1.  Create a library-specific handle that manages contextual information useful for the library's operation.
   –  Many CUDA libraries have the concept of a handle, which stores opaque library-specific information on the host that many library functions access
   –  It is the programmer's responsibility to manage this handle
   –  For example: cublasHandle_t, cufftHandle, cusparseHandle_t, curandGenerator_t

2.  Allocate device memory for inputs and outputs to the library function.
   –  Use cudaMalloc as usual

Page 30:

Common Library Workflow

3.  If inputs are not already in a library-supported format, convert them to be accessible by the library.
   –  Many CUDA libraries only accept data in a specific format
   –  For example: column-major vs. row-major arrays

4.  Populate the pre-allocated device memory with inputs in a supported format.
   –  In many cases, this step simply implies a cudaMemcpy or one of its variants to make the data accessible on the GPU
   –  Some libraries provide custom transfer functions; for example, cublasSetVector optimizes strided copies for the cuBLAS library

Page 31:

Common Library Workflow

5.  Configure the library computation to be executed.
   –  In some libraries, this is a no-op
   –  Others require additional metadata to execute the library computation correctly
   –  In some cases this configuration takes the form of extra parameters passed to library functions; other libraries set fields in the library handle

6.  Execute a library call that offloads the desired computation to the GPU.
   –  No GPU-specific knowledge is required

Page 32:

Common Library Workflow

7.  Retrieve the results of that computation from device memory, possibly in a library-determined format.
   –  Again, this may be as simple as a cudaMemcpy or require a library-specific function

8.  If necessary, convert the retrieved data to the application's native format.
   –  If a conversion to a library-specific format was necessary, this step ensures the application can now use the calculated data
   –  In general, it is best to keep the application format and library format the same, reducing overhead from repeated conversions

Page 33:

Common Library Workflow

9.  Release CUDA resources.
   –  Includes the usual CUDA cleanup (cudaFree, cudaStreamDestroy, etc.) plus any library-specific cleanup

10.  Continue with the remainder of the application.

Page 34:

Common Library Workflow

•  Not all libraries follow this workflow, and not all libraries require every step in this workflow
   –  In fact, for many libraries many steps are skipped
   –  Keeping this workflow in mind will help give you context on what the library might be doing behind the scenes and where you are in the process

•  Next, we'll take a look at two commonly useful libraries
   –  Try to keep the common workflow in mind while we work with them

Page 35:

cuBLAS

•  cuBLAS is a port of a popular linear algebra library, BLAS

•  cuBLAS (like BLAS) splits its subroutines into multiple levels based on the data types processed:
   –  Level 1: vector-only operations (e.g. vector addition)
   –  Level 2: matrix-vector operations (e.g. matrix-vector multiplication)
   –  Level 3: matrix-matrix operations (e.g. matrix multiplication)

(A Level 1 sketch follows below.)
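As a minimal sketch of a Level 1 routine (not from the original slides; assumes device vectors d_x and d_y of length n have already been populated), SAXPY computes y = alpha*x + y:

// Requires #include <cublas_v2.h>
cublasHandle_t handle;
cublasCreate(&handle);

const float alpha = 2.0f;
// Level 1: d_y = alpha * d_x + d_y, with unit strides
cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

cublasDestroy(handle);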

Page 36:

cuBLAS Idiosyncrasies

•  For legacy compatibility, cuBLAS operates on column-major matrices

•  cuBLAS also has a legacy API, which was deprecated as of CUDA 4.0; this lecture uses the new cuBLAS API
   –  If you find cuBLAS code that doesn't quite match up, you may be looking at the old cuBLAS API

[Figure: the same 3x3 matrix stored row-major as 3 0 0 6 0 0 0 2 1 versus column-major as 3 6 0 0 0 2 0 0 1]

Page 37:

cuBLAS Data Management

•  Device memory in cuBLAS is allocated as you're used to: cudaMalloc

•  Transferring data to/from the device uses cuBLAS-specific functions:
   –  cublasGetVector / cublasSetVector
   –  cublasGetMatrix / cublasSetMatrix

Page 38:

cuBLAS Data Management

•  Example:
cublasStatus_t cublasSetVector(int n, int elemSize, const void *x, int incx, void *y, int incy);

where:
•  n is the number of elements to transfer to the GPU
•  elemSize is the size of each element (e.g. sizeof(int))
•  x is the vector on the host to copy from
•  incx is the stride between the elements of x to transfer
•  y is the vector on the GPU to copy to
•  incy is the stride between the elements of y to store to

Page 39:

cuBLAS Data Management

•  Example:
cublasSetVector(5, sizeof(int), h_x, 3, d_x, 2);

[Figure: 5 elements are gathered from h_x with stride 3 (h_x[0], h_x[3], h_x[6], h_x[9], h_x[12]) and stored to d_x with stride 2 (d_x[0], d_x[2], d_x[4], d_x[6], d_x[8])]

Page 40:

cuBLAS Data Management

•  Similarly:
cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb);

where:
•  rows is the number of rows in the matrix to copy
•  cols is the number of cols in the matrix to copy
•  elemSize is the size of each cell in the matrix (e.g. sizeof(int))
•  A is the source matrix on the host
•  lda is the number of rows in the underlying array for A (the leading dimension)
•  B is the destination matrix on the GPU
•  ldb is the number of rows in the underlying array for B (the leading dimension)

Page 41:

cuBLAS Data Management

•  Similarly:
cublasSetMatrix(3, 3, sizeof(int), h_A, 4, d_A, 5);

[Figure: a 3x3 submatrix is copied out of a host array with leading dimension 4 into a device array with leading dimension 5]

Page 42:

cuBLAS Example

•  Matrix-vector multiplication
   –  Uses 6 of the 10 steps in the common library workflow:
   1.  Create a cuBLAS handle using cublasCreate
   2.  Allocate device memory for inputs and outputs using cudaMalloc
   3.  Populate device memory using cublasSetVector, cublasSetMatrix
   4.  Call cublasSgemv to run matrix-vector multiplication on the GPU
   5.  Retrieve results from the GPU using cublasGetVector
   6.  Release CUDA and cuBLAS resources using cudaFree, cublasDestroy

Page 43:

cuBLAS Example

•  You can build and run the example cublas.cu:

cublasCreate(&handle);

cudaMalloc((void **)&dA, sizeof(float) * M * N);
cudaMalloc((void **)&dX, sizeof(float) * N);
cudaMalloc((void **)&dY, sizeof(float) * M);

cublasSetVector(N, sizeof(float), X, 1, dX, 1);
cublasSetVector(M, sizeof(float), Y, 1, dY, 1);
cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);

cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1, &beta, dY, 1);

cublasGetVector(M, sizeof(float), dY, 1, Y, 1);

/* for sgemm */
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA,
            &alpha, d_B, matrix_size.uiWB,
            d_A, matrix_size.uiWA,
            &beta, d_C, matrix_size.uiWA);
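Filling in the declarations around the SGEMV portion, a self-contained version might look like the following sketch (assumptions: small hard-coded sizes, column-major fill, no error checking):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    int M = 3, N = 2;                 // rows, cols of A
    float alpha = 1.0f, beta = 0.0f;
    float A[6] = {1, 2, 3, 4, 5, 6};  // column-major 3x2: columns [1 2 3] and [4 5 6]
    float X[2] = {1, 1};
    float Y[3] = {0, 0, 0};

    float *dA, *dX, *dY;
    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaMalloc((void **)&dA, sizeof(float) * M * N);
    cudaMalloc((void **)&dX, sizeof(float) * N);
    cudaMalloc((void **)&dY, sizeof(float) * M);

    cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);
    cublasSetVector(N, sizeof(float), X, 1, dX, 1);
    cublasSetVector(M, sizeof(float), Y, 1, dY, 1);

    // Y = alpha * A * X + beta * Y
    cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1, &beta, dY, 1);

    cublasGetVector(M, sizeof(float), dY, 1, Y, 1);
    printf("Y = [%f %f %f]\n", Y[0], Y[1], Y[2]);   // expect [5 7 9]

    cudaFree(dA); cudaFree(dX); cudaFree(dY);
    cublasDestroy(handle);
    return 0;
}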

Page 44:

cuBLAS Portability

•  Porting to cuBLAS from BLAS is a straightforward process. In general, it requires:
   –  Adding device memory allocation/freeing (cudaMalloc, cudaFree)
   –  Adding device transfer functions (cublasSetVector, cublasSetMatrix, etc.)
   –  Transforming library routine calls from BLAS to cuBLAS (e.g. cblas_sgemv -> cublasSgemv)

Page 45:

cuBLAS Portability

•  Some common optimizations following a naive BLAS -> cuBLAS port are:
   –  Reusing device memory allocations
   –  Removing redundant data transfers to and from the device
   –  Adding streamed execution using cublasSetStream

Page 46:

cuBLAS Summary

•  cuBLAS makes accelerating legacy BLAS applications simple and easy
   –  Very little added code
   –  Straightforward mapping from BLAS routines to cuBLAS routines
   –  Flexible API improves portability

•  For new linear algebra applications, cuBLAS offers a high-performance alternative to BLAS
   –  High-performance kernels with very little programmer time

Page 47:

cuFFT

•  cuFFT offers an optimized implementation of the fast Fourier transform

Page 48:

cuFFT Configuration

•  In cuFFT terminology, plans == handles
   –  cuFFT plans define a single FFT transformation to be performed
   –  cuFFT uses plans to derive the internal memory allocations, transfers, and kernels required to implement the desired transform

•  Plans are created with:

cufftResult cufftPlan1d(cufftHandle *plan, int nx, cufftType type, int batch);
cufftResult cufftPlan2d(cufftHandle *plan, int nx, int ny, cufftType type);
cufftResult cufftPlan3d(cufftHandle *plan, int nx, int ny, int nz, cufftType type);

Page 49:

cuFFT Configuration

•  cufftType refers to the data types of a transformation, for example:
   –  Complex-to-complex: CUFFT_C2C
   –  Real-to-complex: CUFFT_R2C
   –  Complex-to-real: CUFFT_C2R

Page 50:

cuFFT Example

•  Creating a complex-to-complex 1D cuFFT plan and executing it, using 6 of the 10 steps in the common library workflow:
   1.  Create and configure a cuFFT plan
   2.  Allocate GPU memory for the input samples and output frequencies using cudaMalloc
   3.  Populate GPU memory with input samples using cudaMemcpy
   4.  Execute the plan using a cufftExec* function
   5.  Retrieve the calculated frequencies from GPU memory using cudaMemcpy
   6.  Release CUDA and cuFFT resources using cudaFree, cufftDestroy

Page 51:

cuFFT Example

•  You can build and run an example cufft.cu:

cufftPlan1d(&plan, N, CUFFT_C2C, 1);

cudaMalloc((void **)&dComplexSamples, sizeof(cufftComplex) * N);

cudaMemcpy(dComplexSamples, complexSamples, sizeof(cufftComplex) * N,
           cudaMemcpyHostToDevice);

cufftExecC2C(plan, dComplexSamples, dComplexSamples, CUFFT_FORWARD);

cudaMemcpy(complexFreq, dComplexSamples, sizeof(cufftComplex) * N,
           cudaMemcpyDeviceToHost);
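Filling in the declarations and cleanup around that snippet, a minimal self-contained sketch (assumptions: N samples already loaded into complexSamples, in-place forward transform, no error checking; not from the original slides):

#include <cuda_runtime.h>
#include <cufft.h>

void forward_fft(cufftComplex *complexSamples, cufftComplex *complexFreq, int N) {
    cufftHandle plan;
    cufftComplex *dComplexSamples;

    // Step 1: create and configure the plan (batch of 1)
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);

    // Steps 2-3: allocate and populate device memory
    cudaMalloc((void **)&dComplexSamples, sizeof(cufftComplex) * N);
    cudaMemcpy(dComplexSamples, complexSamples, sizeof(cufftComplex) * N,
               cudaMemcpyHostToDevice);

    // Step 4: execute the plan in place
    cufftExecC2C(plan, dComplexSamples, dComplexSamples, CUFFT_FORWARD);

    // Step 5: retrieve the calculated frequencies
    cudaMemcpy(complexFreq, dComplexSamples, sizeof(cufftComplex) * N,
               cudaMemcpyDeviceToHost);

    // Step 6: release CUDA and cuFFT resources
    cudaFree(dComplexSamples);
    cufftDestroy(plan);
}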

Page 52:

cuFFT Summary

•  Like cuBLAS, cuFFT offers a high-level and usable API for porting legacy FFT applications or writing new ones
   –  cuFFT's API is deliberately similar to the industry-standard library FFTW to improve programmability
   –  Offers higher performance for little developer effort

Page 53:

Drop-In CUDA Libraries

•  Drop-in CUDA libraries allow seamless integration of CUDA performance with existing code bases
   –  Full compatibility with industry-standard libraries; they expose the same external APIs
   –  BLAS -> NVBLAS
   –  FFTW -> cuFFTW

•  Two ways to use drop-in libraries:
   –  Re-link to CUDA libraries
   –  LD_PRELOAD CUDA libraries before their host equivalents

Page 54:

Drop-In CUDA Libraries

•  Re-linking legacy applications to CUDA libraries:
   –  Suppose you have a legacy application that relies on BLAS:

$ gcc app.c -lblas -o app

   –  Recompiling with NVBLAS linked will automatically accelerate all BLAS calls:

$ gcc app.c -lnvblas -o app

•  Alternatively, simply set LD_PRELOAD when executing the application:

$ env LD_PRELOAD=libnvblas.so ./app

Page 55:

Survey of CUDA Library Performance

•  We've seen that cuBLAS and cuFFT are high-level, programmable libraries (like their host counterparts)
   –  No CUDA-specific concepts (e.g. thread blocks, pinned memory, etc.)

•  Let's do a brief survey of CUDA library performance to see the performance improvements possible
   –  We focus on the same libraries (cuBLAS and cuFFT), but similar data on other libraries is available in the book and online

Page 56:

Survey of CUDA Library Performance

[Figure: library performance comparison charts]

Page 57:

Survey of CUDA Library Performance

[Figure: library performance comparison charts]

Page 58:

Suggested Readings

1.  All sections in Chapter 8 of Professional CUDA C Programming except "Using OpenACC"
2.  cuSPARSE User Guide. 2014. http://docs.nvidia.com/cuda/cusparse/
3.  cuBLAS User Guide. 2014. http://docs.nvidia.com/cuda/cublas/
4.  cuRAND User Guide. 2014. http://docs.nvidia.com/cuda/curand/
5.  cuFFT User Guide. 2014. http://docs.nvidia.com/cuda/cufft/
6.  CUDA Toolkit 5.0 Performance Report. 2013. http://on-demand.gputechconf.com/gtc-express/2013/presentations/cuda-5.0-math-libraries-performance.pdf

Page 59:

Manycore GPU Architectures and Programming: Outline

•  Introduction
   –  GPU architectures, GPGPUs, and CUDA
•  GPU Execution model
•  CUDA Programming model
•  Working with Memory in CUDA
   –  Global memory, shared and constant memory
•  Streams and concurrency
•  CUDA instruction intrinsics and library
•  Performance, profiling, debugging, and error handling
•  Directive-based high-level programming model
   –  OpenACC and OpenMP

Page 60:

GPU Parallelization

•  A many-faceted process
   –  Performance varies dramatically depending on the implementation of the same algorithms
      •  From naïve to highly optimized versions

•  Many types of optimizations for GPUs
   –  Shared memory
   –  Constant memory
   –  Global memory access patterns
   –  Warp shuffle instructions
   –  Computation-communication overlap
   –  CUDA compiler flags, e.g. loop unrolling, etc.
   –  Increasing parallelism
   –  ...

Page 61:

Optimization Opportunities

•  Kernel-level optimization:
   –  Exposing sufficient parallelism
   –  Optimizing memory access
   –  Optimizing instruction execution

•  Host-GPU optimization
   –  E.g. kernel and data transfer overlap using CUDA streams

•  Profile-driven optimization improves the selection of optimizations

Page 62:

Kernel-Level Optimization

•  Exposing sufficient parallelism
   –  Increase the amount of concurrent work on the GPU so as to saturate instruction and memory bandwidth

•  Can be accomplished by:
   1.  More concurrently active warps per SM (warp-level parallelism)
   2.  More independent work assigned to each thread (instruction-level parallelism)

Page 63:

Kernel-Level Optimization

•  Increasing the number of warps per SM/thread block does not guarantee performance improvement
   –  It results in fewer per-SM resources assigned to each thread (e.g. registers, shared memory)

[Figure: trade-off between less parallelism with more per-thread resources and more parallelism with fewer per-thread resources]

Page 64:

Kernel-Level Optimization

•  Creating more independent work per thread
   –  Loop unrolling or other code transformations that expose instruction-level parallelism
   –  But this may also increase per-thread resource requirements

int sum = 0;
for (int i = 0; i < 4; i++) {
    sum += a[i];
}
// Requires 2 registers (sum, i), no instruction-level parallelism

int i1 = a[0];
int i2 = a[1];
int i3 = a[2];
int i4 = a[3];
int sum = i1 + i2 + i3 + i4;
// Requires 5 registers (i1, i2, i3, i4, sum),
// four-way instruction-level parallelism

Page 65:

Kernel-Level Optimization

•  Optimizing memory access to maximize:
   –  Memory bandwidth utilization (efficiency of memory access patterns)
   –  Memory access concurrency (sufficient memory requests to hide memory latency)

Page 66:

Kernel-Level Optimization

•  Aligned, coalesced global and shared memory accesses optimize memory bandwidth utilization
   –  Constant memory prefers a broadcast access pattern

Page 67:

Kernel-Level Optimization

•  Optimizing instruction execution focuses on:
   –  Hiding instruction latency by keeping a sufficient number of warps active
   –  Avoiding divergent execution paths within warps
      •  E.g. if statements inside a kernel

•  Experimenting with thread execution configuration can produce unexpected performance gains from more or fewer active warps

•  Divergent execution within a warp reduces parallelism, as warp execution of multiple code paths is serialized

Page 68:

Profile-Driven Optimization

•  Profile-driven optimization is an iterative process to optimize a program based on quantitative profiling info
   –  As we apply optimization techniques, we analyze the results using nvprof and decide if they are beneficial

[Figure: iterative cycle — gather profiling information, identify hotspots, determine performance inhibitors, optimize, repeat]

Page 69:

Profile-Driven Optimization

•  The key challenge in profile-driven optimization is to determine performance inhibitors in hotspots
   –  nvvp and nvprof are invaluable tools for this

Page 70:

Profile-Driven Optimization

•  nvprof profiling modes:
   –  Summary Mode: the default mode; displays execution time information on high-level actions such as kernels or data transfers
   –  Trace Mode: provides a timeline of CUDA events or actions in chronological order
   –  Event/Metric Summary Mode: aggregates event/metric counts across all kernel invocations
   –  Event/Metric Trace Mode: displays event/metric counts for each kernel invocation

(Example invocations are sketched below.)
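As a sketch of how these modes might be invoked from the command line (./app is a placeholder for your binary; the event and metric names here are illustrative examples, and availability varies by GPU):

$ nvprof ./app                                # Summary Mode (default)
$ nvprof --print-gpu-trace ./app              # Trace Mode: chronological GPU timeline
$ nvprof --events branch ./app                # event counts
$ nvprof --metrics achieved_occupancy ./app   # metric values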

Page 71:

Profile-Driven Optimization

•  The NVIDIA Visual Profiler (nvvp) is also a powerful tool for guiding profile-driven optimization
   –  Offers a number of views to inspect different parts of a CUDA application

Page 72:

CUDA Debugging

•  An important part of CUDA software development is the ability to debug CUDA applications

•  CUDA offers a number of debugging tools, split into two categories:
   –  Kernel debugging
   –  Memory debugging

Page 73:

CUDA Debugging

•  Kernel debugging tools help us analyze the correctness of running CUDA kernels by inspecting running application state

•  Memory debugging tools help us detect application bugs by observing irregular or out-of-bounds memory accesses performed by CUDA kernels

Page 74:

CUDA Kernel Debugging

•  Primary tool for the job: cuda-gdb
   –  Intentionally built to be similar to the host debugging tool gdb
   –  Requires compilation with special flags to be useful:

$ nvcc -g -G foo.cu -o foo

•  Once an application is compiled in debug mode, running it under cuda-gdb is possible using:

$ cuda-gdb foo
...
(cuda-gdb)

Page 75:

CUDA Kernel Debugging

•  cuda-gdb uses most of the same commands as gdb

•  One main difference is the idea of CUDA Focus, the current thread that cuda-gdb is focused on and against which all commands run
   –  Query the current focus using:

(cuda-gdb) cuda thread lane warp block sm grid device kernel

   –  Example of setting focus to the 128th thread in the current block:

(cuda-gdb) cuda thread (128)

Page 76:

CUDA Kernel Debugging

•  printf is another form of CUDA kernel debugging
   –  Only available on devices of compute capability 2.0 or higher

•  Prints are buffered on the device and periodically transferred back to the host for display
   –  The size of this buffer is configurable with cudaDeviceSetLimit

•  Buffer contents are transferred to the host after any CUDA kernel launch, any host-side explicit synchronization, or any synchronous memory copies
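A minimal sketch of device-side printf (the kernel name and launch configuration are hypothetical):

#include <cstdio>    // device printf is provided via the CUDA toolchain

__global__ void debug_kernel(int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread prints its index and value; output is buffered on the device
    printf("thread %d: data = %d\n", i, data[i]);
}

// Usage:
//   cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024); // enlarge buffer
//   debug_kernel<<<4, 32>>>(d_data);
//   cudaDeviceSynchronize();   // flush buffered prints to the host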

Page 77:

CUDA Memory Debugging

•  Memory debugging detects memory errors in CUDA kernels that are likely indicative of bugs in the code
   –  For example: out-of-bounds memory accesses

•  There is a single tool for memory debugging, cuda-memcheck, which contains two utilities:
   –  The memcheck tool
   –  The racecheck tool

Page 78:

CUDA Memory Debugging

•  The compilation process for cuda-memcheck is more involved than for cuda-gdb
   –  Building with full debug options affects performance, which may make memory errors harder to hit
   –  Applications should always be compiled with -lineinfo
   –  Applications should also be compiled to include symbol information, but doing this varies by platform:
      •  Linux: -Xcompiler -rdynamic
      •  Windows: -Xcompiler /Zi
      •  ...

Page 79:

CUDA Memory Debugging

•  Once the application is compiled, memcheck can be used to check for 6 different types of memory errors:
   1.  Memory Access Error: out-of-bounds or misaligned memory access
   2.  Hardware Exception: error reported by hardware
   3.  malloc/free Errors: improper use of CUDA dynamic memory allocation
   4.  CUDA API Errors: any error return code from a CUDA API call
   5.  cudaMalloc Memory Leaks: cudaMalloc allocations that are not cudaFree'd
   6.  Device Heap Memory Leaks: dynamic memory allocations on the device heap that are never freed

Page 80:

CUDA Memory Debugging

•  The two cuda-memcheck utilities offer very different capabilities:
   –  memcheck performs a wide range of memory correctness checks
   –  racecheck verifies that __shared__ memory usage is correct in an application, a particularly difficult task to perform manually

•  cuda-memcheck offers a more automated approach to debugging than cuda-gdb

Page 81:

CUDA Error Handling

•  Proper error handling is an important part of robust CUDA deployment
   –  Every CUDA function returns an error code that must be checked
   –  If asynchronous operations are used, this error may be the result of a different asynchronous operation failing
   –  A return code of cudaSuccess indicates success

•  CUDA also offers a number of error-handling functions

Page 82:

CUDA Error Handling

cudaError_t cudaGetLastError();
   –  Retrieve the latest CUDA error, clearing the CUDA runtime's internal error state to cudaSuccess

cudaError_t cudaPeekAtLastError();
   –  Retrieve the latest CUDA error, but do not clear the CUDA runtime's internal error state

const char *cudaGetErrorString(cudaError_t error);
   –  Fetch a human-readable string for the provided error

(A common error-checking macro built on these is sketched below.)
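A common idiom (a sketch, not from the original slides) wraps every CUDA call in a checking macro built on cudaGetErrorString:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Report the failing call site and error string, then abort
#define CHECK(call)                                              \
do {                                                             \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                    \
        fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                __FILE__, __LINE__, cudaGetErrorString(err));    \
        exit(EXIT_FAILURE);                                      \
    }                                                            \
} while (0)

// Usage:
//   CHECK(cudaMalloc((void **)&d_arr, nbytes));
//   kernel<<<grid, block>>>(d_arr);
//   CHECK(cudaGetLastError());        // catch launch-time errors
//   CHECK(cudaDeviceSynchronize());   // surface asynchronous errors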

Page 83:

Suggested Readings

1.  Chapter 10 in Professional CUDA C Programming
2.  Adam DeConinck. Introduction to the CUDA Toolkit as an Application Build Tool. GTC 2013. http://on-demand.gputechconf.com/gtc/2013/webinar/cuda-toolkit-as-build-tool.pdf
3.  Sandarbh Jain. CUDA Profiling Tools. GTC 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4587-cuda-profiling-tools.pdf
4.  Thomas Bradley. GPU Performance Analysis and Optimization. 2012. http://people.maths.ox.ac.uk/gilesm/cuda/lecs/NV_Profiling_lowres.pdf
5.  Julien Demouth. CUDA Optimization with NVIDIA Nsight(TM) Visual Studio Edition: A Case Study. GTC 2014. http://on-demand.gputechconf.com/gtc/2014/presentations/S4160-cuda-optimization-nvidia-nsight-vse-case-study.pdf