CUDA Asynchronous Memory Usage and Execution
Yukai Hung
a0934147@gmail.com
Department of Mathematics, National Taiwan University

Page-Locked Memory


Host memory comes in two kinds: regular pageable memory and page-locked (pinned) memory. Pinned memory cannot be paged out, so allocating too much of it reduces overall system performance by shrinking the physical memory available to the rest of the system.

[Figure: pageable memory may be paged out to local disk (virtual memory) and restored into physical memory on demand; page-locked memory always stays resident.]
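A minimal sketch of the difference, with illustrative buffer names and a 4 MB size (not from the slides): both kinds of host memory use the same copy call, but only the pinned buffer can be transferred by DMA directly.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    int bytes=sizeof(float)*1048576;
    float *pageable,*pinned,*d_data;

    //regular pageable host memory: the OS may page it out, so the
    //driver stages transfers through an internal pinned buffer
    pageable=(float*)malloc(bytes);

    //page-locked (pinned) host memory: always resident in
    //physical memory, so the device can DMA it directly
    cudaMallocHost((void**)&pinned,bytes);

    cudaMalloc((void**)&d_data,bytes);

    //same API for both, but the pinned source transfers faster
    cudaMemcpy(d_data,pageable,bytes,cudaMemcpyHostToDevice);
    cudaMemcpy(d_data,pinned,bytes,cudaMemcpyHostToDevice);

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(d_data);

    return 0;
}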


On some devices, copies between page-locked host memory and device memory can be performed concurrently with kernel execution.

[Figure: tiled matrix multiplication Md × Nd = Pd. One step loads a tile into shared memory; the next computes the Pdsub tile from shared memory while loading the following tile.]


On some devices, copies between page-locked host memory and device memory can be performed concurrently with kernel execution.

[Figure: timeline of host-to-device copies and kernel executions; the copy of the next chunk proceeds while the kernel runs on the previous chunk.]


Using page-locked host memory also allows more than one device kernel to execute concurrently on compute capability 2.0 hardware.


Portable memory
- by default, a block of page-locked memory is usable only by the host thread that allocated it; pass the portable-memory flag to share the page-locked memory with other host threads

[Figure: host threads 0-2 sharing one block of portable page-locked host memory.]


How to allocate portable memory?

float* pointer;

//allocate host page-locked portable memory
cudaHostAlloc((void**)&pointer,bytes,cudaHostAllocPortable);

//free allocated memory space
cudaFreeHost(pointer);


Write-combining memory
- page-locked memory is allocated as cacheable by default
- page-locked memory can instead be allocated as write-combining memory by passing a special flag, which frees up L1 and L2 cache resources

Advantages and disadvantages
- write-combining memory is not snooped during transfers across the bus, which can improve transfer performance by up to 40%
- reading write-combining memory from the host is slow, so it should in general be used only for memory that the host writes to


How to allocate write-combining memory?

float* pointer;

//allocate host page-locked write-combining memory
cudaHostAlloc((void**)&pointer,bytes,cudaHostAllocWriteCombined);

//free allocated memory space
cudaFreeHost(pointer);

Transfer results for 1 GB of data:

                  normal      write-combining
host to device    0.533522    0.338092
device to host    0.591750    0.320989


Mapped memory
- page-locked host memory can be mapped into the address space of the device by passing a special flag when allocating the memory

[Figure: without mapping, a host memory pointer and a device memory pointer refer to separate buffers connected by explicit memory copies between host and device.]


Mapped memory
- page-locked host memory can be mapped into the address space of the device by passing a special flag when allocating the memory

[Figure: the host memory allocation is mapped into device memory, so the host pointer and the device pointer refer to the same storage.]

Reads and writes then act on a single memory space; data transfers are performed implicitly and asynchronously.


How to allocate mapped memory?

float* pointer;

//allocate host page-locked mapped memory
cudaHostAlloc((void**)&pointer,bytes,cudaHostAllocMapped);

//free allocated memory space
cudaFreeHost(pointer);


Is the feature supported by the hardware?
- check the device properties to ensure that mapping host page-locked memory into the device address space is available

cudaDeviceProp deviceprop;

//query the device hardware properties
//the structure records all device properties
cudaGetDeviceProperties(&deviceprop,0);

//check whether mapped memory is available
if(!deviceprop.canMapHostMemory)
    printf("cudaError:cannot map host to device memory\n");


What does the property structure contain?
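As an illustrative sketch (the full structure listing is in the CUDA runtime headers), the fields used later in these slides can be printed like this:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp deviceprop;
    cudaGetDeviceProperties(&deviceprop,0);

    //a few of the fields referenced throughout these slides
    printf("name:              %s\n",deviceprop.name);
    printf("totalGlobalMem:    %zu\n",deviceprop.totalGlobalMem);
    printf("sharedMemPerBlock: %zu\n",deviceprop.sharedMemPerBlock);
    printf("canMapHostMemory:  %d\n",deviceprop.canMapHostMemory);
    printf("deviceOverlap:     %d\n",deviceprop.deviceOverlap);
    printf("concurrentKernels: %d\n",deviceprop.concurrentKernels);
    printf("integrated:        %d\n",deviceprop.integrated);

    return 0;
}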


Mapped Memory

#define size 1048576

int main(int argc,char** argv)
{
    int loop;
    float residual;
    cudaError_t error;

    float *h_a, *d_a;
    float *h_b, *d_b;
    float *h_c, *d_c;

    cudaDeviceProp deviceprop;

    //query the device hardware properties
    //the structure records all device properties
    cudaGetDeviceProperties(&deviceprop,0);

    //check whether mapped memory is available
    if(!deviceprop.canMapHostMemory)
        printf("cudaError:cannot map host to device memory\n");


    //this flag must be set in order to allocate pinned
    //host memory that is accessible to the device
    cudaSetDeviceFlags(cudaDeviceMapHost);

    //allocate host page-locked memory accessible to the device
    //maps the allocation on the host into the cuda device address space
    cudaHostAlloc((void**)&h_a,sizeof(float)*size,cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_b,sizeof(float)*size,cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_c,sizeof(float)*size,cudaHostAllocMapped);

    //initialize host vectors
    for(loop=0;loop<size;loop++)
    {
        h_a[loop]=(float)rand()/(RAND_MAX-1);
        h_b[loop]=(float)rand()/(RAND_MAX-1);
    }

    //pass back the device pointers mapped to the host allocations
    cudaHostGetDevicePointer((void**)&d_a,(void*)h_a,0);
    cudaHostGetDevicePointer((void**)&d_b,(void*)h_b,0);
    cudaHostGetDevicePointer((void**)&d_c,(void*)h_c,0);


    //execute device kernel for vector addition
    vectorAdd<<<(int)ceil((float)size/256),256>>>(d_a,d_b,d_c,size);
    cudaThreadSynchronize();

    //check the residual value of the result
    for(loop=0,residual=0.0;loop<size;loop++)
        residual=residual+(h_a[loop]+h_b[loop]-h_c[loop]);

    printf("residual value is %f\n",residual);

    //free the memory space, which must have been returned
    //by a previous call to cudaMallocHost or cudaHostAlloc
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFreeHost(h_c);

    //catch and check cuda error message
    if((error=cudaGetLastError())!=cudaSuccess)
        printf("cudaError:%s\n",cudaGetErrorString(error));

    return 0;
}


__global__ void vectorAdd(float* da,float* db,float* dc,int size)
{
    int index;

    //calculate each thread's global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    //each thread computes one component
    if(index<size)
        dc[index]=da[index]+db[index];

    return;
}


Page-Locked Memory

Several advantages
- there is no need to allocate a block in device memory and copy data between it and the host block; data transfers are implicitly performed as the kernel needs them
- there is no need to use streams to overlap data transfers with kernel execution; kernel-originated data transfers overlap with kernel execution automatically
- mapped memory can exploit the full duplex of the PCI Express bus by reading and writing at the same time, whereas an ordinary memory copy moves data in only one direction (half duplex)


Several disadvantages
- the page-locked memory is shared between host and device, so the application must avoid writing to it from both sides simultaneously
- atomic functions operating on mapped page-locked memory are not atomic from the point of view of the host or other devices


Portable and mapped memory
- page-locked host memory can be allocated as both portable and mapped, so that each host thread can map the same page-locked memory into a different device address space

[Figure: host threads 0-2 map the same page-locked host memory into the memories of devices 0-2.]


Integrated systems
- mapped page-locked memory is especially suitable on integrated systems that use part of host memory as device memory; check the integrated field in the cuda device properties structure, as in the sketch below
- mapped memory is also faster when data is read from or written to global memory only once; coalescing is even more important with mapped memory in order to reduce the number of transfers
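A minimal sketch of that check, mirroring the property-query pattern used earlier in these slides:

cudaDeviceProp deviceprop;

//query the device hardware properties
cudaGetDeviceProperties(&deviceprop,0);

//on an integrated device the "device memory" is part of host
//memory, so mapped page-locked memory avoids any real copy
if(deviceprop.integrated && deviceprop.canMapHostMemory)
    printf("integrated device:prefer mapped page-locked memory\n");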


Asynchronous Execution


Asynchronous execution
- some functions support asynchronous launching in order to facilitate concurrent execution between host and device resources
- control is returned to the host thread before the device has finished the requested work

[Figure: data transfers between host and device, device kernels, and host functions overlapping in time.]


Asynchronous execution
- some functions support asynchronous launching in order to facilitate concurrent execution between host and device resources
- control is returned to the host thread before the device has finished the requested work

Asynchronous operations:
- device kernel launches
  kernel<<<blocknum,blocksize,0,stream>>>(…)
- data transfers between host and device, and between device and device
  cudaMemcpyAsync(destination,source,bytes,direction,stream);
- global memory sets


#define size 1048576

int main(int argc,char** argv)
{
    int loop;
    int bytes;
    int counter;
    float elapsed;
    float residual;
    cudaError_t error;

    float *h_a, *d_a;
    float *h_b, *d_b;
    float *h_c, *d_c;

    //allocate host page-locked memory
    cudaMallocHost((void**)&h_a,sizeof(float)*size);
    cudaMallocHost((void**)&h_b,sizeof(float)*size);
    cudaMallocHost((void**)&h_c,sizeof(float)*size);

    //allocate device global memory
    cudaMalloc((void**)&d_a,sizeof(float)*size);
    cudaMalloc((void**)&d_b,sizeof(float)*size);
    cudaMalloc((void**)&d_c,sizeof(float)*size);


    cudaEvent_t stop;
    cudaEvent_t start;

    //create event objects which are used to
    //record device execution elapsed time
    cudaEventCreate(&stop);
    cudaEventCreate(&start);

    //initialize host vectors
    for(loop=0;loop<size;loop++)
    {
        h_a[loop]=(float)rand()/(RAND_MAX-1);
        h_b[loop]=(float)rand()/(RAND_MAX-1);
    }


    bytes=sizeof(float)*size;

    //set time event recorder
    cudaEventRecord(start,0);

    //copy data from host to device memory asynchronously
    cudaMemcpyAsync(d_a,h_a,bytes,cudaMemcpyHostToDevice,0);
    cudaMemcpyAsync(d_b,h_b,bytes,cudaMemcpyHostToDevice,0);

    //execute device kernel asynchronously
    vectorAdd<<<(int)ceil((float)size/256),256,0,0>>>(d_a,d_b,d_c,size);

    //copy data from device to host memory asynchronously
    cudaMemcpyAsync(h_c,d_c,bytes,cudaMemcpyDeviceToHost,0);

    //set time event recorder
    cudaEventRecord(stop,0);


    counter=0;

    //increase the counter until the queried
    //cuda event has actually been completed
    while(cudaEventQuery(stop)==cudaErrorNotReady)
        counter=counter+1;

    //calculate device execution elapsed time
    cudaEventElapsedTime(&elapsed,start,stop);

    //check the residual value of the result
    for(loop=0,residual=0.0;loop<size;loop++)
        residual=residual+(h_c[loop]-h_a[loop]-h_b[loop]);

    printf("counter:%d\n",counter);
    printf("residual:%f\n",residual);


    //free the memory space, which must have been returned
    //by a previous call to cudaMallocHost or cudaHostAlloc
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFreeHost(h_c);

    //free the device memory space
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    //free the cuda event objects
    cudaEventDestroy(stop);
    cudaEventDestroy(start);

    //catch and check cuda error message
    if((error=cudaGetLastError())!=cudaSuccess)
        printf("cudaError:%s\n",cudaGetErrorString(error));

    return 0;
}


__global__ void vectorAdd(float* da,float* db,float* dc,int size)
{
    int index;

    //calculate each thread's global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    //each thread computes one component
    if(index<size)
        dc[index]=da[index]+db[index];

    return;
}


Stream


Stream
- applications manage concurrency through streams
- a stream is a sequence of commands that execute in order
- all device requests made from the host code are put into a queue

[Figure: the host thread issues commands to the device driver, which keeps a first-in first-out command queue for the device.]


How to create a stream?

cudaStream_t stream;

//create an asynchronous new stream
cudaStreamCreate(&stream);

//destroy stream
cudaStreamDestroy(stream);


Stream
- different streams may execute their commands out of order with respect to one another, or concurrently, but each individual stream is still a sequence of commands that execute in order

[Figure: one host thread feeding the device driver through a single command queue versus through multiple stream queues.]


#define snum 10
#define size 1048576

int main(int argc,char** argv)
{
    int loop;
    int bytes;

    float *h_a, *d_a;
    float *h_b, *d_b;
    float *h_c, *d_c;

    float *sp1, *dp1;
    float *sp2, *dp2;
    float *sp3, *dp3;

    cudaStream_t stream[snum];

    //create new asynchronous streams
    //which act as device work queues
    for(loop=0;loop<snum;loop++)
        cudaStreamCreate(stream+loop);


    //allocate host page-locked memory
    cudaMallocHost((void**)&h_a,sizeof(float)*size*snum);
    cudaMallocHost((void**)&h_b,sizeof(float)*size*snum);
    cudaMallocHost((void**)&h_c,sizeof(float)*size*snum);

    //allocate device global memory
    cudaMalloc((void**)&d_a,sizeof(float)*size*snum);
    cudaMalloc((void**)&d_b,sizeof(float)*size*snum);
    cudaMalloc((void**)&d_c,sizeof(float)*size*snum);

    //initialize host vectors
    for(loop=0;loop<size*snum;loop++)
    {
        h_a[loop]=(float)rand()/(RAND_MAX-1);
        h_b[loop]=(float)rand()/(RAND_MAX-1);
    }


    //put all the work into the default stream
    //executes all the work by using only one stream
    for(loop=0;loop<snum;loop++)
    {
        bytes=sizeof(float)*size;

        sp1=h_a+loop*size; dp1=d_a+loop*size;
        sp2=h_b+loop*size; dp2=d_b+loop*size;
        sp3=d_c+loop*size; dp3=h_c+loop*size;

        //copy data from host to device memory asynchronously
        cudaMemcpyAsync(dp1,sp1,bytes,cudaMemcpyHostToDevice,0);
        cudaMemcpyAsync(dp2,sp2,bytes,cudaMemcpyHostToDevice,0);

        //execute device kernel asynchronously on this chunk
        vectorAdd<<<(int)ceil((float)size/256),256,0,0>>>(dp1,dp2,sp3,size);

        //copy data from device to host memory asynchronously
        cudaMemcpyAsync(dp3,sp3,bytes,cudaMemcpyDeviceToHost,0);
    }

    //wait until the stream is finished
    cudaThreadSynchronize();


    //put the work into different asynchronous streams
    //each stream executes three copies and one kernel
    for(loop=0;loop<snum;loop++)
    {
        bytes=sizeof(float)*size;

        sp1=h_a+loop*size; dp1=d_a+loop*size;
        sp2=h_b+loop*size; dp2=d_b+loop*size;
        sp3=d_c+loop*size; dp3=h_c+loop*size;

        //copy data from host to device memory asynchronously
        cudaMemcpyAsync(dp1,sp1,bytes,cudaMemcpyHostToDevice,stream[loop]);
        cudaMemcpyAsync(dp2,sp2,bytes,cudaMemcpyHostToDevice,stream[loop]);

        //execute device kernel asynchronously on this chunk
        vectorAdd<<<(int)ceil((float)size/256),256,0,stream[loop]>>>(dp1,dp2,sp3,size);

        //copy data from device to host memory asynchronously
        cudaMemcpyAsync(dp3,sp3,bytes,cudaMemcpyDeviceToHost,stream[loop]);
    }

    //wait until all streams are finished
    cudaThreadSynchronize();


    //free the memory space, which must have been returned
    //by a previous call to cudaMallocHost or cudaHostAlloc
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFreeHost(h_c);

    //free the device memory space
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    //free the asynchronous streams
    for(loop=0;loop<snum;loop++)
        cudaStreamDestroy(stream[loop]);

    return 0;
}


//the iteration count inflates the kernel's work so the overlap
//is measurable (the value used on the original slide is not given)
#define iteration 100

__global__ void vectorAdd(float* da,float* db,float* dc,int size)
{
    int loop;
    int index;

    volatile float temp1;
    volatile float temp2;

    //calculate each thread's global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    if(index<size)
        for(loop=0;loop<iteration;loop++)
        {
            temp1=da[index];
            temp2=db[index];
            dc[index]=temp1+temp2;
        }

    return;
}


How about the performance?

                   Fermi C2050    Tesla C1060
single stream      64.096382      180.179825
multiple streams   31.996338      166.010757


Stream controlling

cudaThreadSynchronize()
called at the end to make sure all streams are finished before proceeding further; it forces the runtime to wait until all device tasks or commands in all asynchronous streams have completed

cudaStreamSynchronize()
forces the runtime to wait until all preceding device tasks or host commands in one specific stream have completed

cudaStreamQuery()
provides applications with a way to know whether all preceding device tasks or host commands in a stream have completed


Stream controlling

cudaStreamDestroy()
waits for all preceding tasks in the given stream to complete before destroying the stream and returning control to the host thread; the call blocks until the stream has finished all commands or tasks (see the sketch below)
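A small sketch of how these control calls combine; the buffer name, size, and the use of cudaMemsetAsync as the queued work are assumptions for illustration:

float *h_a,*d_a;
int bytes=sizeof(float)*1048576;
cudaStream_t stream;

cudaStreamCreate(&stream);
cudaMallocHost((void**)&h_a,bytes);
cudaMalloc((void**)&d_a,bytes);

//queue asynchronous work into the stream
cudaMemcpyAsync(d_a,h_a,bytes,cudaMemcpyHostToDevice,stream);
cudaMemsetAsync(d_a,0,bytes,stream);

//poll without blocking:cudaSuccess means the queued work is done
if(cudaStreamQuery(stream)==cudaErrorNotReady)
    printf("stream still busy,host can do other work\n");

//block the host until this one stream has drained
cudaStreamSynchronize(stream);

//blocks until remaining tasks finish,then releases the stream
cudaStreamDestroy(stream);

cudaFreeHost(h_a);
cudaFree(d_a);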


Overlap of data transfer and kernel execution
- on capable devices, data transfers between page-locked host memory and device memory can be performed concurrently with kernel execution

[Figure: kernel execution and host-device data transfers issued through the device driver, overlapping in time.]


Overlap of data transfer and kernel execution
- an application may query this capability by calling the device management function and checking the property flag

cudaDeviceProp deviceprop;

//query the device hardware properties
//the structure records all device properties
cudaGetDeviceProperties(&deviceprop,0);

//check whether overlapping is available
if(!deviceprop.deviceOverlap)
    printf("cudaError:cannot overlap kernel and transfer\n");


Concurrent kernel execution
- some hardware can execute multiple kernels concurrently

[Figure: two kernels issued from the host thread through the device driver; on capable devices they execute concurrently, otherwise one after the other.]


Concurrent kernel execution
- an application may query this capability by calling the device management function and checking the property flag

cudaDeviceProp deviceprop;

//query the device hardware properties
//the structure records all device properties
cudaGetDeviceProperties(&deviceprop,0);

//check whether concurrent kernels are available
if(!deviceprop.concurrentKernels)
    printf("cudaError:cannot use concurrent kernels\n");


Concurrent kernel execution
- kernels that use many textures or registers or much shared memory are less likely to execute concurrently with other kernels (see the launch sketch below)
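As a minimal sketch (the kernel and buffer names are hypothetical), two small independent kernels launched into different streams may run concurrently on hardware that reports concurrentKernels:

__global__ void touch(float* data,int size)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;
    if(index<size) data[index]=data[index]+1.0f;
}

//inside the host code, after d_a and d_b have been allocated
cudaStream_t s0,s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

//two independent kernels in two streams: on capable
//hardware these may execute at the same time
touch<<<(int)ceil((float)size/256),256,0,s0>>>(d_a,size);
touch<<<(int)ceil((float)size/256),256,0,s1>>>(d_b,size);

//wait for both streams to finish
cudaThreadSynchronize();

cudaStreamDestroy(s0);
cudaStreamDestroy(s1);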


Concurrent data transfer
- some devices can copy data from page-locked host memory to device memory while simultaneously copying data from device memory to page-locked host memory

[Figure: host-to-device and device-to-host copies issued through the device driver; older devices serialize the two directions, newer devices perform them concurrently. A sketch follows below.]
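A sketch of the pattern (buffer names and streams assumed): issuing the two directions in different streams lets devices with two copy engines overlap them, while older devices simply serialize the copies.

//upload into the device in stream[0] while downloading
//results from the device in stream[1]; page-locked host
//memory is required on both sides for the copies to be async
cudaMemcpyAsync(d_in,h_in,bytes,cudaMemcpyHostToDevice,stream[0]);
cudaMemcpyAsync(h_out,d_out,bytes,cudaMemcpyDeviceToHost,stream[1]);

//wait for both directions to complete
cudaThreadSynchronize();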


Streams on the matrix-matrix multiplication

[Figure: the row blocks of Md and the corresponding row blocks of Pd are partitioned across streams 0-2, while Nd is shared by all streams; a sketch of the idea follows below.]
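A sketch of the idea in the figure (the matmul kernel, the grid and block shapes, and all names are hypothetical): Nd is needed by every stream, so it is copied once, while each stream moves one row block of Md in, computes, and moves the matching row block of Pd out.

//Nd is shared by all streams, so copy it once up front
cudaMemcpy(d_Nd,h_Nd,width*width*sizeof(float),cudaMemcpyHostToDevice);

for(loop=0;loop<snum;loop++)
{
    int rows=width/snum;            //rows of Md and Pd per stream
    int offset=loop*rows*width;     //element offset of this row block

    //stream one row block of Md in, compute, and stream Pd out
    cudaMemcpyAsync(d_Md+offset,h_Md+offset,rows*width*sizeof(float),
                    cudaMemcpyHostToDevice,stream[loop]);
    matmul<<<grid,block,0,stream[loop]>>>
          (d_Md+offset,d_Nd,d_Pd+offset,rows,width);
    cudaMemcpyAsync(h_Pd+offset,d_Pd+offset,rows*width*sizeof(float),
                    cudaMemcpyDeviceToHost,stream[loop]);
}

//wait until all streams are finished
cudaThreadSynchronize();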


Event


Event
- the runtime provides a way to closely monitor device progress by letting the program record events at any point in the program

cudaEvent_t event1;
cudaEvent_t event2;

//create and initialize events
cudaEventCreate(&event1);
cudaEventCreate(&event2);

//insert event recorders into the stream
cudaEventRecord(event1,stream);
cudaEventRecord(event2,stream);

//destroy created event recorders
cudaEventDestroy(event1);
cudaEventDestroy(event2);


Event

cudaEventSynchronize()
blocks until the event has actually been recorded, since recording an event is an asynchronous operation

cudaEventQuery()
provides applications with a way to know whether a specific event recorded in a stream has completed; it returns cudaSuccess when it has (see the sketch below)
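A small sketch combining the two calls with the timing pattern from the earlier example (the buffer names and byte count are assumed):

float elapsed;
cudaEvent_t start,stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

//bracket the asynchronous work with event recorders
cudaEventRecord(start,0);
cudaMemcpyAsync(d_a,h_a,bytes,cudaMemcpyHostToDevice,0);
cudaEventRecord(stop,0);

//non-blocking progress check
if(cudaEventQuery(stop)==cudaErrorNotReady)
    printf("work recorded after start is still in flight\n");

//block until the stop event has actually been recorded
cudaEventSynchronize(stop);

//both events are now recorded, so the elapsed time is valid
cudaEventElapsedTime(&elapsed,start,stop);
printf("elapsed time:%f ms\n",elapsed);

cudaEventDestroy(start);
cudaEventDestroy(stop);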


Multi-GPU


GPUs cannot share global memory
- one GPU cannot access another GPU's memory directly
- application code is responsible for moving data between GPUs, as in the sketch below
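A sketch of moving data from one GPU to another by staging through page-locked host memory (all names assumed). Under the one-context-per-thread model described on the next slide, each cudaSetDevice section would live in its own host thread; it is flattened into one thread here only for brevity. Later CUDA releases added direct peer-to-peer copies, which these slides predate.

//copy d_src on device 0 to d_dst on device 1 via the host
float* h_stage;
cudaMallocHost((void**)&h_stage,bytes);

//download from the source device
cudaSetDevice(0);
cudaMemcpy(h_stage,d_src,bytes,cudaMemcpyDeviceToHost);

//upload to the destination device
cudaSetDevice(1);
cudaMemcpy(d_dst,h_stage,bytes,cudaMemcpyHostToDevice);

cudaFreeHost(h_stage);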


A host thread can maintain one context at a time
- as many host threads as GPUs are needed to maintain all devices
- multiple host threads can establish contexts with the same GPU; the hardware driver handles time-sharing and resource partitioning

[Figure: host threads 0-2, each holding a context on one of devices 0-2, sharing host memory.]


Device management calls

cudaGetDeviceCount()
returns the number of devices in the current system with compute capability greater than or equal to 1.0 that are available for execution

cudaSetDevice()
sets the specific device on which the active host thread executes device code; returns an error if the host thread has already initialized the CUDA runtime by calling non-device-management runtime functions

must be called prior to context creation; it fails if a context has already been established; one can force context creation with cudaFree(0)

cudaGetDevice()
returns the device on which the active host thread executes its code


Device management calls

cudaThreadExit()
explicitly cleans up all runtime-related resources associated with the calling host thread; any subsequent call reinitializes the runtime; this function is called implicitly on host thread exit

cudaGetDeviceProperties()
returns the properties of one specific device; this is useful when a system contains different devices, in order to choose the best one (see the enumeration sketch below)
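A minimal sketch that puts these calls together to enumerate the devices and pick one:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int loop;
    int count;
    cudaDeviceProp deviceprop;

    //number of devices available for execution
    cudaGetDeviceCount(&count);

    for(loop=0;loop<count;loop++)
    {
        cudaGetDeviceProperties(&deviceprop,loop);
        printf("device %d:%s compute %d.%d\n",loop,
               deviceprop.name,deviceprop.major,deviceprop.minor);
    }

    //must happen before any context is created
    cudaSetDevice(0);

    //force context creation on the chosen device
    cudaFree(0);

    return 0;
}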

