How to Combine OpenMP, Streams, and ArrayFire...
![Page 1: How to Combine OpenMP, Streams, and ArrayFire …on-demand.gputechconf.com/gtc/2014/presentations/S4386...Combined OpenMP threads Challenge 3: Batching Many small operations Individually](https://reader033.fdocuments.net/reader033/viewer/2022060417/5f147a728adc3721a3769f94/html5/thumbnails/1.jpg)
How to Combine OpenMP, Streams, and ArrayFire for Maximum Multi-GPU
Throughput
Shehzan Mohammed, @shehzanm
@arrayfire
Outline
● Introduction to ArrayFire
● Case Study 1: glasses.com
● Case Study 2: Accelerate Diagnostics
ArrayFire
● World’s leading GPU experts
  ○ In the industry since 2007
  ○ NVIDIA Partner
● Deep experience working with thousands of customers
  ○ Analysis
  ○ Acceleration
  ○ Algorithm development
● GPU Training
  ○ Hands-on course with a CUDA engineer
  ○ Customized to meet your needs
ArrayFire
● Hundreds of parallel functions
  ○ Targeting image processing, machine learning, etc.
● Support for multiple languages
  ○ C/C++, Fortran, Java and R
● Linux, Windows, Mac OS X
● OpenGL based graphics
● Based around one data structure
● Just-in-Time (JIT)
  ○ Combine multiple operations into one kernel
● GFOR, the only data parallel loop
ArrayFire Functions
● Hundreds of parallel functions
  ○ Building blocks (non-exhaustive)
    ■ Reductions, scan, sort
    ■ Set operations
    ■ Statistics
    ■ Matrix operations
    ■ Image processing
    ■ Signal processing
    ■ Sparse matrix
    ■ Visualizations
Case 1: Glasses.com
Case 1: Glasses.com
● 3D face reconstruction from images
● Image and coordinate geometry processing
● Came to us with a slow application
  ○ Made use of OpenCV and OpenMP
  ○ One thread per PC, 8 threads: 30+ seconds
  ○ Developed on OS X
Case 1: Glasses.com
● Required a significant hardware investment
  ○ Increased maintenance
  ○ Financially not viable in production
  ○ Had Windows infrastructure
The challenge: Speed, Speed, and much more
Challenge 1: Multithreading
● Multithreading benefits CPU code
  ○ Calling CUDA from multiple threads may not offer much benefit
    ■ Overheads of memory management, kernel launches
    ■ Streams, pinned memory, etc. required to harness full potential
  ○ GPU parallelism is faster than CPU for these operations
● Goal:
  ○ Make host code single threaded
  ○ Move all multithreaded sections to GPU
Challenge 1: Multithreading
● Most multithreaded sections easily ported
  ○ Images are easy to combine and operate on at once
  ○ ArrayFire gfor
● Some sections more difficult
  ○ Require serial access and/or complex operations
  ○ Need to be run on the host - more memory transfers
  ○ Need a combination of OpenMP and ArrayFire
Challenge 2: Multithreading ArrayFire
● ArrayFire was not thread safe
  ○ Designed for GPU performance
  ○ Required substantial work
    ■ Iterative process
  ○ Trade-offs
    ■ Cost of adding critical section vs.
    ■ Cost of adding multithreading support
  ○ Limiting access to data for each thread
Challenge 2: Multithreading ArrayFire
● Required adding critical sections around key operations
● Constant memory and textures
  ○ No way to make this thread safe
    ■ Except critical section
  ○ Add critical section vs. use global memory
    ■ Analyse and customize for specific operation
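The critical-section option above can be illustrated with a minimal OpenMP sketch in plain C++. This is not ArrayFire code: `SharedScratch` and `run_guarded` are hypothetical names, and the struct only stands in for a shared resource that cannot be made thread safe (such as a single constant-memory buffer).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a shared resource that is not thread safe
// (for example, one constant-memory buffer used by every caller).
struct SharedScratch {
    int value = 0;
    long uses = 0;
};

// Each iteration writes its parameter into the shared scratch and reads it
// back. The critical section makes the write/read pair atomic per iteration,
// so threads cannot interleave and observe each other's parameter.
std::vector<int> run_guarded(int n) {
    assert(n >= 0);
    SharedScratch scratch;
    std::vector<int> out(n, 0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp critical(scratch_access)
        {
            scratch.value = i * 2;       // write shared state
            out[i] = scratch.value + 1;  // read it back before anyone else writes
            ++scratch.uses;
        }
    }
    assert(scratch.uses == n);
    return out;
}
```

The cost is visible in the structure: everything inside the critical block runs serially, which is why the slide weighs it against moving the data to global memory instead.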
Challenge 3: Batching
● Image operations can be easily batched
  ○ Most operations work on pixel-level or neighborhood-level
● Problem when operations are more complex
● Batching does not always map
  ○ e.g. affine matrix multiplications
  ○ Indexing needs to be changed - expensive memcpys
Challenge 3: Batching
● Used OpenMP for parallelism
  ○ One frame per thread
  ○ Optimized for CPU
● One CPU thread + GPU
  ○ Parallelism on GPU vs. Parallelism on CPU
● Combined OpenMP threads
Challenge 3: Batching
● Many small operations
  ○ Individually it didn’t make sense to port to the GPU
● Increase dimensionality of the data
  ○ 2D -> 3D
  ○ GFOR and Strided Access
● Moved to single threaded code
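A minimal sketch of the 2D -> 3D idea, in plain C++ rather than ArrayFire: many small images are stacked into one contiguous buffer so a single pass (one kernel or one GFOR loop on the GPU) covers all of them. `Batch3D`, `stack`, and `scale_all` are hypothetical names, and the row-major layout is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A batch of `count` images of size rows x cols, stored one after another.
struct Batch3D {
    std::size_t rows, cols, count;
    std::vector<float> data;
    // Strided access: element (r, c) of image k.
    std::size_t index(std::size_t r, std::size_t c, std::size_t k) const {
        return k * rows * cols + r * cols + c;
    }
};

// Copy many small 2D images into a single 3D buffer.
Batch3D stack(const std::vector<std::vector<float>>& images,
              std::size_t rows, std::size_t cols) {
    Batch3D b{rows, cols, images.size(), {}};
    b.data.reserve(rows * cols * images.size());
    for (const auto& img : images) {
        assert(img.size() == rows * cols);
        b.data.insert(b.data.end(), img.begin(), img.end());
    }
    return b;
}

// One flat pass over the whole batch: a pixel-level operation applied to
// every image at once instead of one call per image.
void scale_all(Batch3D& b, float factor) {
    for (float& v : b.data) v *= factor;
}
```

Individually, each small image is too little work to justify a GPU launch; stacked, the same operation runs over all frames in one call from a single host thread.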
Challenge 3: Batching
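● Call custom CUDA kernels
  ○ Special indexing
● Specialized Matrix Multiply
  ○ ssyrk vs. gemm
  ○ 2x faster
  ○ Concurrent execution using streams
// device<float>() returns the raw device pointer behind the ArrayFire array
float *bound = boundary.device<float>();
// CUDA launch configuration is <<<blocks, threads>>>: grid size first, then block size
kernel<<<blocks, threads>>>(bound, boundary.elements());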
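One plausible reading of the "ssyrk vs. gemm, 2x faster" bullet: the BLAS routine ssyrk computes a symmetric product of the form A * A^T and fills only one triangle of the result, so it does roughly half the multiply-adds of a general gemm. The sketch below counts operations for both approaches in plain C++; the function names are hypothetical and the loops are naive, not the tuned library routines.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major storage

// General product C = A * A^T (A is n x k), computing every entry as gemm
// would. Returns the number of multiply-adds performed.
std::size_t gemm_aat(const Matrix& a, std::size_t n, std::size_t k, Matrix& c) {
    assert(a.size() == n * k);
    c.assign(n * n, 0.0f);
    std::size_t ops = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p, ++ops)
                c[i * n + j] += a[i * k + p] * a[j * k + p];
    return ops;
}

// Symmetric rank-k update: the result is symmetric, so only the lower
// triangle (j <= i) is computed. The upper triangle is left at zero.
std::size_t syrk_lower(const Matrix& a, std::size_t n, std::size_t k, Matrix& c) {
    assert(a.size() == n * k);
    c.assign(n * n, 0.0f);
    std::size_t ops = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j <= i; ++j)
            for (std::size_t p = 0; p < k; ++p, ++ops)
                c[i * n + j] += a[i * k + p] * a[j * k + p];
    return ops;
}
```

For an n x k input the general version performs n * n * k multiply-adds and the triangular version n * (n + 1) / 2 * k, which approaches a factor of two as n grows.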
Challenge 3: Batching
● Results
  ○ 90 ms -> 28 ms on a GTX 690
● Other Improvements
  ○ Overlapped pinned memory transfers
  ○ Generic to Specialized matrix multiply
  ○ Streams
Concurrent Computation
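● Overlap CPU and GPU computation
  ○ CPU handles variable length data sets one frame at a time
  ○ GPU handles fixed length data sets all frames concurrently
// "parallel" is required unless an enclosing parallel region already exists;
// a bare "sections" construct would run both sections on one thread
#pragma omp parallel sections
{
    #pragma omp section
    {
        // GPU code
    }
    #pragma omp section
    {
        // CPU code
    }
}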
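A filled-in version of the skeleton above, as a sketch only. Both branches are plain CPU code here: the first section stands in for the GPU work on fixed-length data, the second for the CPU work on variable-length frames. `FrameResults` and `process` are hypothetical names.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

struct FrameResults {
    long fixed_total = 0;     // written only by the "GPU" section
    long variable_total = 0;  // written only by the CPU section
};

// Runs the two independent workloads in separate OpenMP sections so they can
// overlap. Each section writes a different field, so no locking is needed.
FrameResults process(const std::vector<int>& fixed,
                     const std::vector<std::vector<int>>& variable) {
    FrameResults r;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            // Stand-in for GPU work: all fixed-length data in one go.
            r.fixed_total = std::accumulate(fixed.begin(), fixed.end(), 0L);
        }
        #pragma omp section
        {
            // CPU work: variable-length data, one frame at a time.
            long sum = 0;
            for (const auto& frame : variable)
                sum += std::accumulate(frame.begin(), frame.end(), 0L);
            r.variable_total = sum;
        }
    }
    assert(r.fixed_total >= 0 || !fixed.empty());
    return r;
}
```

Compiled without OpenMP support the pragmas are ignored and the two sections simply run one after the other, which gives the same results without the overlap.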
Results
● 1 Process (5 threads): 8 seconds
● 6 Processes (2 threads): 22 seconds
● Demo
Case 2: Accelerate Diagnostics
Case 2: Accelerate Diagnostics
● Multithreaded Java code with CUDA integration
● Image processing of large images (4k x 4k each)
● Port to C++
● Hard time constraint
● Hard reliability constraint

The challenge: Maximize PCIe throughput
  ○ Image processing is very parallel
  ○ Memory transfer is the majority of application run time
Case 2: Accelerate Diagnostics
● Target hardware:
  ○ Intel Xeon CPU
  ○ 2 GTX Titans per system
  ○ 64 GB RAM
● Required speed up: ~5x
● Required reliability: 48 hour stress test
The Framework
● Master thread - scheduling and management
● Slave threads - each handling 1 ‘pipeline’
● Each pipeline handled one ‘site’ at a time
● Continuous execution
● Pipeline - serial flow of execution for one site
● “Rabbit Hole”
● Site - independent data set of images
● Rabbit
The Framework - Initial
[Diagram: Site Database, Master Thread (reads), Pipe 0 and Pipe 1 each holding one Site, GPU 0 and GPU 1]
Master Thread
● Minimalist
● Initializes and controls pipeline
● Feeds sites to pipelines
[Diagram: Master Thread above Thread 0 (Pipe 0, Site) and Thread 1 (Pipe 1, Site); GPU 0 and GPU 1]
Pipeline
● Serial execution within pipelines
● Processes one site at a time
[Diagram: Pipe 0 holding one Site]
Challenge 1: CPU Parallelism
How to parallelize pipelines independently?
● Each thread processes one pipeline
  ○ At pipeline level, application is single threaded
● Allot one GPU to each pipeline
● Pipelines initialized once per run
● Perpetual execution
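The master/pipeline arrangement described above can be sketched with standard C++ threads. This is an assumption about the shape of the framework, not the actual implementation: `SiteQueue` and `run_pipelines` are hypothetical names, sites are plain integers, and the per-site processing is left as a comment.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Thread-safe queue the master uses to feed site ids to the pipelines.
class SiteQueue {
public:
    void push(int site) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(site); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until a site is available; returns false once closed and drained.
    bool pop(int& site) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        site = q_.front();
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
    bool closed_ = false;
};

// Master: start one thread per pipeline, feed sites, wait for completion.
// Returns how many sites each pipeline processed.
std::vector<int> run_pipelines(int num_pipelines, int num_sites) {
    assert(num_pipelines > 0);
    SiteQueue queue;
    std::vector<int> processed(num_pipelines, 0);
    std::vector<std::thread> workers;
    for (int p = 0; p < num_pipelines; ++p) {
        workers.emplace_back([&queue, &processed, p] {
            int site;
            while (queue.pop(site)) {
                // Serial, single-threaded processing of one site goes here.
                ++processed[p];  // each pipeline touches only its own slot
            }
        });
    }
    for (int s = 0; s < num_sites; ++s) queue.push(s);
    queue.close();
    for (auto& w : workers) w.join();
    return processed;
}
```

Each pipeline thread is created once and keeps pulling sites until the queue closes, which mirrors the "initialized once per run" and "perpetual execution" points. Binding a pipeline to a GPU would happen once at the top of the worker lambda.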
Parallelism: Results
● On CPU side, worked fine
● GPU - not so much
  ○ Too many blocking syncs to allocate and deallocate memory
  ○ Copy/kernel execution collisions between threads
  ○ No concurrency
  ○ Extremely slow memory transfer speeds
    ■ Each image is 16 MB, multiple transfers per kernel call
  ○ Although programmatically parallel, execution was almost serial, probably slower
Challenge 2A: Pinned Memory
[Diagram: Pageable Memory Copy (host pageable -> host pinned -> device DRAM) vs. Pinned Memory Copy (host pinned -> device DRAM)]
Transfer speeds can double with pinned memory
● For pageable memory, CUDA first transfers to pinned memory and then to the GPU
● Pinned (non-pageable) memory cannot be paged out by the OS
● With pinned memory, CUDA skips the pageable -> pinned transfer
Challenge 2B: Streams
Concurrency, concurrency, concurrency
● Increases PCIe throughput
  ○ Streams allow simultaneous copy and execution
  ○ Together with pinned memory, allows asynchronous copy
● Each pipeline has one stream allotted to it
  ○ Stream remains active through the lifetime of the pipeline
  ○ All CUDA operations and kernel launches are done asynchronously for the stream
  ○ Use of cudaStreamSynchronize (vs. cudaDeviceSynchronize)
Pinned Memory and Streams: Results
● Memory transfer speeds increased ~2x
● Problems:
  ○ Allocating/freeing pinned memory is a full system block
    ■ All threads on CPU, all streams on GPU are blocked
    ■ Very very bad
    ■ Benefits of using streams are negated
  ○ Device memory alloc/free is also a blocking sync
  ○ Too many memory API calls negating the benefits of parallelism
  ○ Possible memory leaks - very bad for reliability
    ■ Will be revealed in stress testing
Challenge 3: Better Memory Management
Minimize the number of memory allocations and deallocations
● On CPU and GPU
● The memory used in the processing of each site is deterministic and constant
● Solution: Create a memory manager
Memory Manager
● Goals:
  ○ Manage host and device memory for each pipeline
  ○ Allocate, free memory
  ○ Assign, retract memory
  ○ Manage transfers between host and device
  ○ Ensure consistency between host and device memory
  ○ Free memory only at the end of the application
Memory Manager
[Diagram: Memory Manager and “Mirrored Array”]
● Mirrored Array fields: Type, Stream, Size, Flags, Host Pointer, Device Pointer
● Operations: Create(), Push(), Pull(), Update(), Free(), Release()
● Backing storage: Dev Mem, Host Mem
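The slide does not define the operations on a mirrored array, so the following is a guess at the core idea only: one logical buffer with a host copy and a device copy that are kept consistent by explicit transfers. The sketch assumes Push() means host -> device and Pull() means device -> host, simulates device memory with a second host vector, and leaves out the stream, flags, and the remaining operations. `MirroredArray`, `device_scale`, and `consistent` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One logical buffer with a host copy and a (simulated) device copy.
// In a CUDA build the device side would be device memory and push/pull
// would be asynchronous copies on the pipeline's stream.
class MirroredArray {
public:
    explicit MirroredArray(std::size_t size)
        : host_(size, 0.0f), device_(size, 0.0f) {
        assert(size > 0);
    }

    std::size_t size() const { return host_.size(); }
    std::vector<float>& host() { return host_; }

    // Assumed meaning of Push(): copy host -> device.
    void push() { device_ = host_; }
    // Assumed meaning of Pull(): copy device -> host.
    void pull() { host_ = device_; }

    // Stand-in for a kernel that modifies only the device copy.
    void device_scale(float factor) {
        for (float& v : device_) v *= factor;
    }

    // True when both copies hold the same values.
    bool consistent() const { return host_ == device_; }

private:
    std::vector<float> host_;
    std::vector<float> device_;
};
```

A typical cycle would be: fill `host()`, call `push()`, run device work, then `pull()` before the host reads the result. Keeping both pointers in one object is what lets a manager track and reuse the pair together.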
Memory Manager
● Memory usage is deterministic, memory once allocated can be reused as needed.
  ○ After 1st site run, no new pinned or device memory will need to be created
  ○ Most of the memory required can be created in initialization
  ○ Same chunk of memory can be reused using pointers
  ○ Pointers can release the memory back to manager when processing is completed
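The reuse policy above can be sketched as a small pool in plain C++. `BufferPool` is a hypothetical name, and `std::malloc` and `std::free` are stand-ins for the expensive, blocking pinned-host and device allocators; the point is only that a buffer of a given size is allocated once and then recycled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Allocates each buffer at most once and hands it back out on later requests.
// Memory is returned to the system only when the pool itself is destroyed.
class BufferPool {
public:
    ~BufferPool() {
        for (auto& kv : idle_)
            for (void* p : kv.second) std::free(p);
        for (auto& kv : in_use_) std::free(kv.first);
    }

    void* acquire(std::size_t bytes) {
        std::vector<void*>& bucket = idle_[bytes];
        void* p = nullptr;
        if (!bucket.empty()) {
            p = bucket.back();  // reuse: no allocator call
            bucket.pop_back();
        } else {
            p = std::malloc(bytes);  // stand-in for the blocking allocator
            assert(p != nullptr);
            ++allocations_;
        }
        in_use_[p] = bytes;
        return p;
    }

    // Hand a buffer back to the pool instead of freeing it.
    void release(void* p) {
        auto it = in_use_.find(p);
        assert(it != in_use_.end());
        idle_[it->second].push_back(p);
        in_use_.erase(it);
    }

    std::size_t allocations() const { return allocations_; }
    std::size_t outstanding() const { return in_use_.size(); }

private:
    std::map<std::size_t, std::vector<void*>> idle_;  // free buffers by size
    std::map<void*, std::size_t> in_use_;             // buffers handed out
    std::size_t allocations_ = 0;
};
```

Because per-site memory use is deterministic, `allocations()` should stop growing after the first site, and `outstanding()` should return to zero at the end of every site. A value that keeps rising points at a leak.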
Better Memory Management: Results
● Drastic reduction in alloc/free calls
● Much better parallelism
  ○ Streams are much more concurrent as blocks are reduced
  ○ CPU threads do not need to be synced
● Stable across multiple site processing
● Memory leaks are easily discovered
  ○ Increase in usage after 1st run shows leaks
  ○ Memory manager can be used to make sure all memory is released at the end of each site
The Framework
[Diagram: Master Thread above Pipe 0 and Pipe 1; each pipe has its own Stream (Stream 0, Stream 1), Memory, and Site; GPU 0 and GPU 1]
Results
● Significant performance improvements
● Excellent PCIe throughput
● Highly parallel
● GPU kernel execution time is lower than memcpy time
The Framework - Final
● Increase pipelines to 4
  ○ 2 per GPU
● 4 pipelines good for CPU
  ○ 4 heavy processing threads
  ○ 1 master light thread
  ○ 4 threads = optimal usage
[Diagram: Master Thread above Pipe 0 to Pipe 3; each pipe has its own Stream (Stream 0 to Stream 3), Memory, and Site; GPU 0 and GPU 1]
Results
● Improvement in times
  ○ Almost 2x better than required
● Stable memory usage
● GPU usage optimal
● Problems?
Results
● Problem: OVERHEATING!
● Solution:
  ○ Use software tools to lower GPU clock speeds
  ○ Control fan speeds on GPU
  ○ Set target power and temperature
● No major reduction in performance
Case 2: Takeaways
● Application is only as fast as its slowest part
● True multithreading is awesome
  ○ Not easy - but can be done
● Memory management is crucial to parallelism
● Be ready to tackle any problem
  ○ Overheating? Really?
Q & A