Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the...

27
Streams con thrust y alguito mas... Clase 15.5

Transcript of Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the...

Page 1: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Streams con thrust y alguito mas...

Clase 15.5

Page 2: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Códigos para la clase

Códigos: async_thrust.cu, async_reduce_thrust.cu● cp -r /share/apps/codigos/alumnos_icnpg2016/clase_15/thrust_streams . ;● cd thrust18

Page 3: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Thrust 1.8: streams, new execution policies, dynamic paralellism, etc.

Page 4: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Streams with Thrust

Page 5: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Streams

Ver async.cu, clase streams...

Page 6: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Streams con Thrustthrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system.

Ahora puede tomar como argumento el stream!

Notación:

thrust::cuda::par.on(stream[i]),

On the other hand, it's safe to use thrust::cuda::par with raw pointers allocated bycudaMalloc, even when the pointer isn't wrapped by thrust::device_ptr:

Page 7: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Streams con Thrust

Ejemplo de clase de streams reciclado, mas nuevo material de demo ...● cp -r /share/apps/codigos/alumnos_icnpg2016/thrust_streams . ; cd thrust_streams

Incluimos thrust 1.8 headers (no necesario con cuda 7 toolkit en adelante), y el macro THRUST18 (omitir para cambiar a la version async.cu).

● nvcc async_thrust.cu -I /share/apps/codigos/thrust-master -DTHRUST18

Ejecutamos nvprof -o a.prof ./a.out ● qsub submit_gpu.sh

Verificar concurrencia de streams (overlaps copy-kernel, kernel-kernel...)● nvpp → import a.prof

Page 8: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Streams con Thrust

// baseline case - sequential transfer and execute

// asynchronous version 1: loop over {copy, kernel, copy}

Page 9: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Streams con Thrust

A optimizar! ...

Page 10: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Algorithm invocation in CUDA __device__ code

Page 11: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Algorithm invocation in CUDA __device__ codetemplate<typename Iterator, typename T, typename BinaryOperation, typename Pointer>__global__ void reduce_kernel(Iterator first, Iterator last, T init, BinaryOperation binary_op, Pointer result){ *result = thrust::reduce(thrust::cuda::par, first, last, init, binary_op);}

int main(){ size_t n = 1 << 20; thrust::device_vector<unsigned int> data(n, 1); thrust::device_vector<unsigned int> result(1, 0);

// method 1: call thrust::reduce from an asynchronous CUDA kernel launch

// create a CUDA stream cudaStream_t s; cudaStreamCreate(&s);

// launch a CUDA kernel with only 1 thread on our stream reduce_kernel<<<1,1,0,s>>>(data.begin(), data.end(), 0, thrust::plus<int>(), result.data());

// wait for the stream to finish cudaStreamSynchronize(s);

// our result should be ready assert(result[0] == n);

cudaStreamDestroy(s);

return 0;} Ejemplo de la librería:

/share/apps/codigos/thrust-master/examples/cuda/async_reduce.cu

Page 12: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

template<typename Iterator, typename T, typename BinaryOperation, typename Pointer>__global__ void reduce_kernel(Iterator first, Iterator last, T init, BinaryOperation binary_op, Pointer result){ *result = thrust::reduce(thrust::cuda::par, first, last, init, binary_op);}

int main(){ size_t n = 1 << 18; thrust::device_vector<unsigned int> data(n, 1); thrust::device_vector<unsigned int> result(1, 0);

// method 1: call thrust::reduce from an asynchronous CUDA kernel launch

// create a CUDA stream cudaStream_t s; cudaStreamCreate(&s); // launch a CUDA kernel with only 1 thread on our stream reduce_kernel<<<1,1,0,s>>>(data.begin(), data.end(), 0, thrust::plus<int>(), result.data());

nvtxRangePushA("la CPU hace su propia reduccion... quien gana?"); unsigned res=0; unsigned inc=1; for(int i=0;i<n;i++) res+=inc; nvtxRangePop();

// wait for the stream to finish cudaStreamSynchronize(s);

Ejemplo de la librería modificado para la clase: /share/apps/codigos/alumnos_icnpg2016/thrust_streams/async_reduce_thrust.cu

Page 13: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Algorithm invocation in CUDA __device__ code

Si todavia no lo hicieron...● cp -r /share/apps/codigos/alumnos_icnpg2015/thrust18 . ; cd thrust18

Incluimos thrust 1.8 headers, NVTX, y flags para paralelismo dinamico...

● nvcc -arch=sm_35 -rdc=true async_reduce_thrust.cu -lcudadevrt -lnvToolsExt -I /share/apps/codigos/thrust-master -o a.out

Ejecutamos nvprof -o a.prof ./a.out ● qsub submit_gpu.sh

Page 14: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Algorithm invocation in CUDA __device__ code

Ejemplo de reduccion asincrónica usando GPU

size_t n = 1 << 18;

Prueben aumentarlo...

Page 15: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Paralelismo dinámico

Dynamic Parallelism in CUDA is supported via an extension to the CUDA programming model that enables a CUDA kernel to create and synchronize new nested work. Basically, a child CUDA Kernel can be called from within a parent CUDA kernel and then optionally synchronize on the completion of that child CUDA Kernel. The parent CUDA kernel can consume the output produced from the child CUDA Kernel, all without CPU involvement.

Page 16: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Dynamic Parallelism

http://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/

Early CUDA programs had to conform to a flat, bulk parallel programming model. Programs had to perform a sequence of kernel launches, and for best performance each kernel had to expose enough parallelism to efficiently use the GPU. For applications consisting of “parallel for” loops the bulk parallel model is not too limiting, but some parallel patterns—such as nested parallelism—cannot be expressed so easily. Nested parallelism arises naturally in many applications, such as those using adaptive grids, which are often used in real-world applications to reduce computational complexity while capturing the relevant level of detail. Flat, bulk parallel applications have to use either a fine grid, and do unwanted computations, or use a coarse grid and lose finer details.

Page 17: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Dynamic Parallelismhttp://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/

Dynamic parallelism is generally useful for problems where nested parallelism cannot be avoided. This includes, but is not limited to, the following classes of algorithms:

● algorithms using hierarchical data structures, such as adaptive grids;

● algorithms using recursion, where each level of recursion has parallelism, such as quicksort;

● algorithms where work is naturally split into independent batches, where each batch involves complex parallel processing but cannot fully use a single GPU.

Page 18: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Mandelbrot Set

NVIDIA_CUDA-6.5_Samples/2_Graphics/Mandelbrot/

En negro

Page 19: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Mandelbrot Set

The Escape Time Algorithm

Page 20: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Mandelbrot Set

The Escape Time Algorithm

Page 21: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Mandelbrot SetMariani-Silver Algorithm

Recursion: Subdivision de cuadrados

Grilla Madre

Grilla Nieta

Grilla Hija

Page 22: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Dynamic Parallelismhttp://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/

nvcc -arch=sm_35 -rdc=true myprog.cu -lcudadevrt -o myprog.o

De un solo paso seria asi:

Page 23: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Dynamic Parallelismhttp://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/

Page 24: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Dynamic Parallelismhttp://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/

Grid Nesting and Synchronization Memory Consistency

Passing Pointers to Child Grids ● Device Streams and Events● Recursion Depth and Device

Limits● Etc...

child grids always complete before the parent grids that launch them, even if there is no explicit synchronization,

Page 25: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Dynamic Parallelism

http://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/http://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/

Intro to Parallel Programming UDACITY, Lesson 7.2 https://www.udacity.com/course/cs344.

Cool Thing You Could Do with Dynamic Parallelism - Intro to Parallel Programming

https://www.youtube.com/watch?v=QVvHbsMIQzY

https://www.youtube.com/watch?v=8towMTm82DM

Problems Suitable for Dynamic Parallelism - Intro to Parallel Programming

Page 26: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Dinamica Molecular con interacciones de largo alcance, por ejemplo, gravitatorias...

● Esquema PP: O(N*N)● Esquema PM: O(M*log M) ● Esquema PPPM=P3M● Tree codes:

Page 27: Streams con thrust y alguito mas - Argentina.gob.ar · Streams con Thrust thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system. Ahora puede

Algorithm invocation in CUDA __device__ code

farber1-11_async.cu

Piense como mejorarlo.... ahora que sabe reducción asincrónica

1 thread de la CPU ocioso

farber1-11_sync.cu

CPU ociosa