Taming GPU Threads with F# and Alea GPU
CodeMesh 2014, London, November 5th 2014
Dr. Daniel Egloff
[email protected] +41 44 520 0117 +41 79 430 036141
About Us
Technology driven financial services company
Software and solution provider for high performance and GPU computing
Topics
- GPGPU
- The Alea Platform
- Taming GPU Kernels with F#
- Alea Reactive Dataflow
GPGPU
- Increasing importance
  - More data
  - More compute-intensive applications
  - Mobile devices
  - Embedded systems
- GPGPU programming APIs: CUDA and OpenCL
  - Simplify general-purpose GPU programming
  - Very popular
    - Simple
    - Scalable
    - Easy to learn
    - Good tooling
    - Good platform support
Why Modern Languages for GPUs?
- The CUDA / OpenCL development environments provided by vendors are based on C/C++
- Anatomy of a GPU-accelerated code base
  - Data preparation
  - Delayed computations
  - Dispatching for different problem sizes and hardware architectures
  - Only a small fraction of the code is highly optimized GPU code
- Modern runtimes and languages such as F# are a better fit
  - Functional patterns for delayed computation
  - Pattern matching
  - Containers, LINQ
  - GC
- Developer benefit
  - Productivity
  - Cross platform
Alea GPU Platform
- Language: C#, F#, VB.NET; Alea CUDA language extension
- Tooling: Visual Studio, Xamarin Studio; Alea compilers; CUDA tools (Nsight, profilers, etc.)
- Framework: .NET / Mono; Alea runtime; CUDA driver
- Hardware: CPU; CUDA-enabled GPU
- Library: cross-language .NET libraries; Alea reactive dataflow framework; Alea performance primitives; bindings for NVIDIA libraries (cuBLAS, cuFFT, …)
Alea Compilers
- Different compilation schemes
  - Dynamic compilation
  - Ahead-of-time compilation
- Same performance as CUDA C/C++
- Relying on LLVM technology
[Diagram: F# quotations and IL (from F#, C#, VB.NET) -> LLVM IR -> PTX -> SASS -> GPU; the stages are handled by the Alea.GPU compilers and the CUDA driver]
CUDA Crash Course
- Kernel = function executing sequential code in parallel with many threads
- Threads are grouped in blocks; multiple blocks form a grid
- A thread in a block is identified by (threadIdx.x, threadIdx.y, threadIdx.z)
- Block size: (blockDim.x, blockDim.y, blockDim.z)
- A block in the grid is identified by (blockIdx.x, blockIdx.y, blockIdx.z)
- Grid size: (gridDim.x, gridDim.y, gridDim.z)
- Threads are always scheduled in groups, so-called warps
[Figure: thread, thread block, grid]
Parallel Transform Example
- Transform a large array of data with a unary function in parallel
- Simple data-parallel problem
- Standard approach
  - Fix block and grid size based on GPU hardware capability
  - Each thread processes multiple array elements so that all elements are covered
- Example: 3 blocks with 4 threads each
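The access pattern of the 3-blocks-by-4-threads example can be traced on the CPU. The following is a minimal sketch in plain F#, not GPU code: the array length `n = 30` and the helper names `indicesOf` and `simulateTransform` are assumptions made for illustration only.

```fsharp
// CPU-only simulation of the grid-stride pattern; no GPU or Alea GPU needed.
// gridDim and blockDim follow the slide's example; n is a hypothetical length.
let gridDim = 3      // number of blocks
let blockDim = 4     // threads per block
let n = 30           // array length (illustrative)

// Indices visited by one thread: start at its global id, then jump by the
// total number of threads in the grid until the end of the array.
let indicesOf blockIdx threadIdx =
    let start = blockIdx * blockDim + threadIdx
    let stride = gridDim * blockDim
    [ start .. stride .. n - 1 ]

// Apply a unary function the way the simulated threads would.
// The input is expected to have exactly n elements.
let simulateTransform (f: float -> float) (input: float[]) =
    let output = Array.zeroCreate<float> input.Length
    for b in 0 .. gridDim - 1 do
        for t in 0 .. blockDim - 1 do
            for i in indicesOf b t do
                output.[i] <- f input.[i]
    output

let input = Array.init n float
let output = simulateTransform (fun x -> x * x) input
printfn "block 1, thread 2 handles %A" (indicesOf 1 2)
```

With 12 threads in total, every thread strides by 12, so each array element is visited by exactly one thread regardless of how large `n` is.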
Alea GPU with F# - Kernel
Quotation-based approach: <@ … @> or [<ReflectedDefinition>]
Benefits
- Discriminated unions
- Records
- Higher-order functions
- Interoperability with the IL-based approach

    let kernel transform =
        <@ fun (n:int) (input:deviceptr<float>) (output:deviceptr<float>) ->
            // global index of this thread and total number of threads in the grid
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let mutable i = start
            // each thread handles elements start, start + stride, ...
            while i < n do
                output.[i] <- (%transform) input.[i]
                i <- i + stride @>
Alea GPU with F# - Workflow

    let transform transform = cuda {
        // splice the unary function into the kernel and register it
        let! kernel = transform |> kernel |> Compiler.DefineKernel
        return Entry(fun (program:Program) ->
            let worker = program.Worker
            let kernel = program.Apply(kernel)
            fun (input:float[]) ->
                let n = input.Length
                // device memory, released when the function returns
                use input = worker.Malloc(input)
                use output = worker.Malloc(n)
                // launch configuration derived from the hardware
                let blockSize = 256
                let numSm = program.Worker.Device.Attributes.MULTIPROCESSOR_COUNT
                let gridSize = min (16*numSm) (divup n blockSize)
                let lp = LaunchParam(gridSize, blockSize)
                kernel.Launch lp n input.Ptr output.Ptr
                // copy the result back to the host
                output.Gather() ) }
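The launch configuration in the workflow is plain integer arithmetic and can be checked without a GPU. This is a small sketch under assumptions: `divup` is written out locally as a rounded-up integer division, `gridSizeFor` is a hypothetical helper name, and the multiprocessor count of 13 is an arbitrary illustrative value rather than a property of any particular device.

```fsharp
// Integer division rounded up; written out here so the sketch is self-contained.
let divup n d = (n + d - 1) / d

// Same formula as in the workflow: at most 16 blocks per multiprocessor,
// and never more blocks than are needed to cover n elements.
let gridSizeFor numSm blockSize n =
    min (16 * numSm) (divup n blockSize)

let blockSize = 256
let numSm = 13   // hypothetical multiprocessor count

for n in [ 10; 100000; 10000000 ] do
    printfn "n = %8d -> gridSize = %d" n (gridSizeFor numSm blockSize n)
```

Small inputs get a single block, while large inputs are capped at 16 blocks per multiprocessor and rely on the kernel's stride loop to cover the remaining elements.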
Alea GPU with F# - Compilation / Exec

    // transform applies kernel to the quotation itself, so pass the unary function directly
    use transform = Worker.Default.LoadProgram(transform <@ fun x -> x*x @>)
    let input = Array.init 10 (fun i -> float(i))
    let output = transform.Run input
    printfn "%A" output
Programming Style
- Functional style
  - GPU resource definition
  - GPU algorithm composition
  - Data preparation
- Imperative style
  - GPU kernel algorithms
  - Fine-grained control over GPU data and memory
  - Index calculation for difficult numerical calculations
  - Some functional patterns, such as pattern matching, are also possible in kernels
Example II
- Warp scan performance primitive
- Advanced GPU programming at warp level
- Optimized version for compute capability 3.0 and later using warp shuffle
- Dispatching logic
[Figure: scan steps labeled __shfl_up output 1 1, __shfl_up output 2 1, __shfl_up output 4 1, __shfl_up output 8 1]
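The doubling offsets in the figure correspond to a step-wise inclusive scan, which can be modeled on the CPU. The sketch below is not the Alea GPU primitive: `shflUp` is a hypothetical array-based stand-in for the hardware shuffle, and `warpScan` is an illustrative name.

```fsharp
// CPU-only model of a warp-level inclusive scan built from shuffle-up steps.
// shflUp is a hypothetical stand-in for the hardware shuffle, not an Alea GPU call.
let warpSize = 32

// Every lane reads the value held by the lane `delta` positions below it.
// Lanes with no such neighbour keep their own value.
let shflUp (values: int[]) delta =
    Array.init values.Length (fun lane ->
        if lane >= delta then values.[lane - delta] else values.[lane])

// Inclusive prefix sum with offsets 1, 2, 4, 8, ... until the offset
// reaches the number of lanes.
let warpScan (input: int[]) =
    let rec loop (output: int[]) delta =
        if delta >= output.Length then output
        else
            let shifted = shflUp output delta
            // Only lanes that actually received a neighbour's value accumulate it.
            let next =
                output |> Array.mapi (fun lane x ->
                    if lane >= delta then x + shifted.[lane] else x)
            loop next (delta * 2)
    loop (Array.copy input) 1

let ones = Array.create warpSize 1
printfn "%A" (warpScan ones)
```

For 32 lanes the loop runs five steps (offsets 1 through 16); on the GPU all lanes of a warp execute each step together, which is why no shared memory is needed for the exchange.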
Example III
- Special matrix multiplication X = A .* v * B
- Very long and slim matrices
- Weighting with a probability vector v
- Implementation relies on a low-level block reduce performance primitive
- Used in a large-scale optimization problem for asset management
[Figure: matrices A and B, vector v, result X]
Example III
- Two-step reduction
- Number of elements after the 1st reduction = divUp(numCols, 8 * 256)
[Figure: A, v and B reduced to the intermediate result X’]
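The two-step scheme can be illustrated on the CPU for a single entry of X. This is a sketch under stated assumptions, none of which are confirmed by the slides: one entry is taken to be the sum over k of a.[k] * v.[k] * b.[k] along the long dimension, and the factor 8 * 256 is read as one block of 256 threads handling 8 elements per thread. The names `partialSums` and `weightedDot` are hypothetical.

```fsharp
// CPU-only sketch of a two-step reduction for one entry of X.
// Assumption: the entry is the sum over k of a.[k] * v.[k] * b.[k], and each
// first-step chunk covers 8 * 256 elements (one simulated GPU block).
let divUp n d = (n + d - 1) / d
let chunk = 8 * 256

// Step 1: one partial sum per chunk, as one GPU block would produce.
let partialSums (a: float[]) (v: float[]) (b: float[]) =
    let numCols = a.Length
    Array.init (divUp numCols chunk) (fun blockId ->
        let first = blockId * chunk
        let last = min numCols (first + chunk) - 1
        let mutable acc = 0.0
        for k in first .. last do
            acc <- acc + a.[k] * v.[k] * b.[k]
        acc)

// Step 2: reduce the few partial sums to the final value.
let weightedDot (a: float[]) (v: float[]) (b: float[]) =
    partialSums a v b |> Array.sum

let numCols = 10000
let a = Array.init numCols (fun k -> float (k % 7))
let b = Array.init numCols (fun k -> float (k % 5))
let v = Array.create numCols (1.0 / float numCols)   // uniform probability vector
printfn "partial sums after step 1: %d" (partialSums a v b).Length
printfn "entry of X: %f" (weightedDot a v b)
```

With 10000 columns the first step leaves divUp(10000, 2048) = 5 values, so the second step only has to combine a handful of numbers.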
Alea Reactive Dataflow – Goals
GPU parallel programming for (almost) everyone
- Radical simplification
  - No GPU experience required
  - Fast development
  - High performance comes automatically
  - Guaranteed memory safety
- Cross platform
- Multiple targets
  - Single GPU
  - Multiple GPUs
  - Clusters and grids of GPUs
- Based on the Alea GPU platform
Alea Reactive Dataflow – Model
- Dataflow
  - Graph of operations
  - Data propagated through graph
- Reactive
  - Feed input in arbitrary intervals
  - Listen for asynchronous output
- Program is purely descriptive
  - What, not how
- Efficient execution behind the scenes
  - Vector-parallel operations
  - Stream operations on GPU
  - Minimize memory copying
  - Hybrid multi-platform scheduling
Alea Reactive Dataflow – Operations
- Unit of calculation (typically vector-parallel)
- Input and output ports
- Port = stream of typed data
- Consumes input, produces output
Examples of operations
- Map: Input: T[]; Output: U[]
- MatrixProduct: Left: T[,], Right: T[,]; Output: T[,]
- Splitter: Input: Tuple<T, U>; First: T, Second: U
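The idea of an operation as a unit with typed input and output ports can be modeled in a few lines of plain F#. The types below are hypothetical and synchronous, written only to illustrate the concept; they are not the Alea Reactive Dataflow API, and `Port`, `MapOp` and `SplitterOp` are invented names.

```fsharp
// Hypothetical, CPU-only model of operations and ports; the names are
// illustrative and are not the Alea Reactive Dataflow API.

// A port is a stream of typed data: senders push values, listeners receive them.
type Port<'T>() =
    let listeners = ResizeArray<('T -> unit)>()
    member this.OnReceive(handler: 'T -> unit) = listeners.Add(handler)
    member this.Send(value: 'T) =
        for handler in listeners do
            handler value

// Map: consumes T[] on its input port and produces U[] on its output port.
type MapOp<'T, 'U>(f: 'T -> 'U) =
    let input = Port<'T[]>()
    let output = Port<'U[]>()
    do input.OnReceive(fun xs -> output.Send(Array.map f xs))
    member this.Input = input
    member this.Output = output

// Splitter: consumes a pair and forwards each component to its own port.
type SplitterOp<'T, 'U>() =
    let input = Port<'T * 'U>()
    let first = Port<'T>()
    let second = Port<'U>()
    do input.OnReceive(fun (a, b) ->
        first.Send(a)
        second.Send(b))
    member this.Input = input
    member this.First = first
    member this.Second = second

let square = MapOp<int, int>(fun x -> x * x)
let received = ResizeArray<int[]>()
square.Output.OnReceive(fun ys -> received.Add(ys))
square.Input.Send([| 1; 2; 3 |])
printfn "%A" (List.ofSeq received)
```

This model delivers values immediately on the calling thread; the framework described in the talk is asynchronous and schedules work on GPU and CPU, which the sketch does not attempt to reproduce.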
Alea Reactive Dataflow – Graph
[Figure: graph Random -> Pairing -> Map -> Average; port types shown: float[], int[], Pair<float>[]]

    let randoms = Random<_>(0.0, 1.0)
    let coordinates = Pairing<_>()
    let inUnitCircle =
        Map<_, _> (fun p ->
            if p.Left * p.Left + p.Right * p.Right <= 1.0 then 1.0 else 0.0)
    let average = new Average<_>()

    // wire the operations together
    randoms.Output.ConnectTo(coordinates.Input)
    coordinates.Output.ConnectTo(inUnitCircle.Input)
    inUnitCircle.Output.ConnectTo(average.Input)
Alea Reactive Dataflow – Dataflow
- Send data to input port
- Receive from output port
- All asynchronous
[Figure: graph Random -> Pairing -> Map -> Average; labels shown: 1000, Write(…), 1000000, Write(…)]

    // listen on the output port, then feed two requests into the graph
    average.Output.OnReceive(fun x -> printfn "Pi = %f" (4.0*x))
    randoms.Input.Send(1000)
    randoms.Input.Send(1000000)
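What the graph computes is a Monte Carlo estimate of Pi, which can be reproduced sequentially for comparison. The following plain F# sketch mirrors the four stages on the CPU without any Alea types; the function name `estimatePi` and the seed 42 are arbitrary choices made for illustration.

```fsharp
// Plain F# on the CPU, mirroring the four stages of the graph; no Alea types are used.
// The seed below is an arbitrary choice for reproducibility.
let estimatePi (rng: System.Random) (count: int) =
    // Random: `count` uniform samples in [0, 1)
    let randoms = Array.init count (fun _ -> rng.NextDouble())
    // Pairing: consecutive samples become (x, y) coordinates
    let coordinates =
        randoms
        |> Array.chunkBySize 2
        |> Array.filter (fun p -> p.Length = 2)
    // Map: 1.0 if the point lies inside the unit circle, else 0.0
    let inUnitCircle =
        coordinates
        |> Array.map (fun p -> if p.[0] * p.[0] + p.[1] * p.[1] <= 1.0 then 1.0 else 0.0)
    // Average, scaled by 4 as in the OnReceive handler
    4.0 * Array.average inUnitCircle

let rng = System.Random(42)
for count in [ 1000; 1000000 ] do
    printfn "count = %7d -> Pi = %f" count (estimatePi rng count)
```

The fraction of points inside the quarter circle approaches Pi / 4, so the larger request should give the closer estimate.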
Alea Reactive Dataflow – Scheduling
- Operations are implemented for GPU and/or CPU
- GPU operations are combined into streams
- Memory is copied only when needed
- Host scheduling with .NET TPL
- Automatic memory management
[Figure: operations scheduled on GPU and CPU, with COPY steps between them]
Conclusion
- Alea GPU platform is a professional GPU development stack for .NET / Mono
  - Productivity with modern languages and first-class tooling
  - Simplifies cross-platform GPU development and portability
  - Same performance as CUDA C/C++
  - Unique features such as GPU scripting
  - Full flexibility of CUDA for advanced GPU coding on .NET
- Alea reactive dataflow
  - Makes GPU acceleration accessible to programmers without GPU knowledge
  - Fully independent of the CUDA GPU programming model
- Alea GPU V2 is scheduled for February 2015
Thank you