Taming GPU Threads with F# and Alea GPU
CodeMesh 2014, London, November 5th 2014
Dr. Daniel Egloff
[email protected]
+41 44 520 0117
+41 79 430 036141


About Us

- Technology-driven financial services company
- Software and solution provider for high-performance and GPU computing

Topics

- GPGPU
- The Alea Platform
- Taming GPU Kernels with F#
- Alea Reactive Dataflow

GPGPU

- Increasing importance
  - More data
  - More compute-intensive applications
  - Mobile devices
  - Embedded systems
- GPGPU programming APIs CUDA and OpenCL
  - Simplify general purpose GPU programming
  - Very popular
    - Simple
    - Scalable
    - Easy to learn
    - Good tooling
    - Good platform support


Why Modern Languages for GPUs?

- The CUDA / OpenCL development environments provided by vendors are based on C/C++
- Anatomy of a GPU-accelerated code base
  - Data preparation
  - Delayed computations
  - Dispatching for different problem sizes and hardware architectures
  - Only a small fraction of the code is highly optimized GPU code
- Modern runtimes and languages such as F# are a better fit
  - Functional patterns for delayed computation
  - Pattern matching
  - Containers, LINQ
  - GC
- Developer benefit
  - Productivity
  - Cross platform

Alea GPU Platform

- Language
  - C#, F#, VB.NET
  - Alea CUDA language extension
- Tooling
  - Visual Studio, Xamarin Studio
  - Alea compilers
  - CUDA tools (Nsight, profilers, etc.)
- Framework
  - .NET / Mono
  - Alea runtime
  - CUDA driver
- Hardware
  - CPU
  - CUDA enabled GPU
- Library
  - Cross language .NET libraries
  - Alea reactive dataflow framework
  - Alea performance primitives
  - Binding for NVIDIA libraries (cuBLAS, cuFFT, …)

Alea Compilers

- Different compilation schemes
  - Dynamic compilation
  - Ahead of time compilation
- Same performance as CUDA C/C++
- Relying on LLVM technology

[Diagram: F# quotations and IL (from F#, C#, VB.NET) -> Alea.GPU compilers -> LLVM IR -> PTX -> CUDA driver -> SASS -> GPU]


CUDA Crash Course

- Kernel = function executing sequential code in parallel with many threads
- Threads are grouped in blocks; multiple blocks form a grid
- Thread in a block identified by (threadIdx.x, threadIdx.y, threadIdx.z)
- Block size (blockDim.x, blockDim.y, blockDim.z)
- Block in grid identified by (blockIdx.x, blockIdx.y, blockIdx.z)
- Grid size (gridDim.x, gridDim.y, gridDim.z)
- Threads are always scheduled in groups, so-called warps

[Diagram: thread, thread block, grid]
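To make the index vocabulary above concrete, here is a minimal CPU-side sketch in plain Python (not Alea GPU code). The helper names `Dim3`, `linear_thread_index`, `global_index_1d` and `warp_of` are made up for this illustration; the warp size of 32 matches current NVIDIA GPUs.

```python
from collections import namedtuple

# Hypothetical stand-in for CUDA's three-component index/size values.
Dim3 = namedtuple("Dim3", "x y z")

WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def linear_thread_index(thread_idx, block_dim):
    """Flatten (threadIdx.x, threadIdx.y, threadIdx.z) within one block."""
    return (thread_idx.x
            + thread_idx.y * block_dim.x
            + thread_idx.z * block_dim.x * block_dim.y)


def global_index_1d(block_idx_x, block_dim_x, thread_idx_x):
    """Global position of a thread in a one-dimensional launch."""
    return block_idx_x * block_dim_x + thread_idx_x


def warp_of(thread_idx, block_dim):
    """Warp number of a thread inside its block."""
    return linear_thread_index(thread_idx, block_dim) // WARP_SIZE
```

For a one-dimensional launch only the `x` components matter, which is why the kernels later in the deck use `blockIdx.x * blockDim.x + threadIdx.x`.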

Parallel Transform Example

- Transform a large array of data with a unary function in parallel
  - Simple data parallel problem
- Standard approach
  - Fix block and grid size based on GPU hardware capability
  - Each thread handles multiple array elements so that all elements are processed
- Example: 3 blocks with 4 threads each
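The "3 blocks with 4 threads each" example can be traced on the CPU. The sketch below is plain Python, not Alea GPU code; it runs the threads one after another and only shows which elements each thread would touch. The function names are hypothetical.

```python
def elements_for_thread(n, grid_dim, block_dim, block_idx, thread_idx):
    """Indices one thread processes when striding over n elements."""
    start = block_idx * block_dim + thread_idx   # first element of this thread
    stride = grid_dim * block_dim                # total number of threads
    return list(range(start, n, stride))


def parallel_transform(f, data, grid_dim=3, block_dim=4):
    """Sequential emulation of the launch: every thread fills its own slots."""
    out = [None] * len(data)
    for b in range(grid_dim):
        for t in range(block_dim):
            for i in elements_for_thread(len(data), grid_dim, block_dim, b, t):
                out[i] = f(data[i])
    return out
```

With 12 threads in total, thread 0 of block 0 handles elements 0, 12, 24, and so on, so every element is covered exactly once regardless of the array length.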

Alea GPU with F# - Kernel

Quotation based approach: <@ … @> or [<ReflectedDefinition>]

Benefits
- Discriminated unions
- Records
- Higher order functions
- Interoperability with IL based approach

    let kernel transform =
        <@ fun (n:int) (input:deviceptr<float>) (output:deviceptr<float>) ->
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let mutable i = start
            while i < n do
                output.[i] <- (%transform) input.[i]
                i <- i + stride @>

Alea GPU with F# - Workflow

    let transform transform = cuda {
        let! kernel = transform |> kernel |> Compiler.DefineKernel

        return Entry(fun (program:Program) ->
            let worker = program.Worker
            let kernel = program.Apply(kernel)

            fun (input:float[]) ->
                let n = input.Length
                use input = worker.Malloc(input)
                use output = worker.Malloc(n)

                let blockSize = 256
                let numSm = program.Worker.Device.Attributes.MULTIPROCESSOR_COUNT
                let gridSize = min (16*numSm) (divup n blockSize)
                let lp = LaunchParam(gridSize, blockSize)

                kernel.Launch lp n input.Ptr output.Ptr

                output.Gather() ) }
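The launch configuration in the workflow above is plain integer arithmetic, which the following Python sketch reproduces on the CPU. The function `launch_param` and its parameter names are hypothetical; the values 256 and 16 are taken from the slide, and the multiprocessor count of 13 in the usage note is only an example.

```python
def divup(n, d):
    """Integer division rounded up, as used for the grid size."""
    return (n + d - 1) // d


def launch_param(n, num_sm, block_size=256, blocks_per_sm=16):
    """Grid and block size: enough blocks to cover n, capped at 16 per SM."""
    grid_size = min(blocks_per_sm * num_sm, divup(n, block_size))
    return grid_size, block_size
```

For a device with 13 multiprocessors, `launch_param(1000, 13)` gives 4 blocks, while `launch_param(10_000_000, 13)` is capped at 208 blocks; in the capped case the strided loop in the kernel lets each thread process several elements.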

Alea GPU with F# - Compilation / Exec

    // Pass the element-wise function as a quotation; the workflow wraps it with `kernel`.
    use transform =
        Worker.Default.LoadProgram(
            transform <@ fun x -> x*x @>)

    let input = Array.init 10 (fun i -> float(i))
    let output = transform.Run input
    printfn "%A" output

Programming Style

- Functional style
  - GPU resource definition
  - GPU algorithm composition
  - Data preparation
- Imperative style
  - GPU kernel algorithms
  - Fine-grained control over GPU data and memory
  - Index calculation for difficult numerical calculations
  - Some functional patterns are also possible in kernels, like pattern matching

Example II

- Warp scan performance primitive
  - Advanced GPU programming at warp level
  - Optimized version for compute capability 3.x using warp shuffle
  - Dispatching logic

[Diagram: warp scan built from successive __shfl_up steps on output, with offsets 1, 2, 4, 8]
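The shuffle steps in the diagram can be emulated on the CPU to see why doubling the offset yields a prefix sum. This is a plain Python sketch of an inclusive scan across the lanes of one warp, not the Alea GPU primitive itself; `shfl_up` here is a list-based stand-in for the CUDA intrinsic, in which lanes below the offset keep their own value.

```python
def shfl_up(values, delta):
    """Each lane reads the value of the lane `delta` positions below it.

    Lanes with index < delta have no source lane and keep their own value.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def warp_inclusive_scan(values):
    """Inclusive prefix sum over one warp using offsets 1, 2, 4, ..."""
    out = list(values)
    offset = 1
    while offset < len(out):
        shifted = shfl_up(out, offset)
        # Only lanes that actually received a neighbour's value add it.
        out = [out[i] + shifted[i] if i >= offset else out[i]
               for i in range(len(out))]
        offset *= 2
    return out
```

For 16 lanes the loop runs with offsets 1, 2, 4 and 8, matching the four steps shown on the slide; a full 32-lane warp needs one more step with offset 16.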

Example III

- Special matrix multiplication X = A .* v * B
  - Very long and slim matrices
  - Weighting with a probability vector v
  - Implementation relies on low level block reduce performance primitive
  - Used in large scale optimization problem for asset management

[Diagram: A, v, B, X]

Example III

- Two step reduction
  - Number of elements after 1st reduction = divUp(numCols, 8 * 256)

[Diagram: A, v, B, X’]
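The slides do not spell out the shapes, so the following plain Python sketch rests on assumptions: A is m × numCols, B is numCols × k, v has numCols entries, and X[i][j] = Σ_c A[i][c] · v[c] · B[c][j]. It also assumes that 8 * 256 means 8 elements per thread in a block of 256 threads. Under those assumptions the two-step reduction first collapses each tile of 2048 columns to one partial sum and then adds the partials.

```python
ITEMS_PER_THREAD = 8    # assumed meaning of the factor 8 on the slide
BLOCK_SIZE = 256        # assumed meaning of the factor 256 on the slide
TILE = ITEMS_PER_THREAD * BLOCK_SIZE


def divup(n, d):
    """Integer division rounded up."""
    return (n + d - 1) // d


def weighted_dot_two_step(a, v, b):
    """Sum of a[c] * v[c] * b[c] computed in two reduction steps.

    Returns the total and the number of partial sums after step one.
    """
    n = len(a)
    # Step 1: one partial sum per tile (one GPU block per tile).
    partials = [sum(a[c] * v[c] * b[c] for c in range(s, min(s + TILE, n)))
                for s in range(0, n, TILE)]
    # Step 2: reduce the partial sums.
    return sum(partials), len(partials)


def weighted_product(A, v, B):
    """X = A .* v * B under the shape assumptions stated above."""
    k = len(B[0])
    return [[weighted_dot_two_step(row, v, [B[c][j] for c in range(len(v))])[0]
             for j in range(k)]
            for row in A]
```

With 5000 columns the first step leaves divup(5000, 2048) = 3 partial sums per output element, which is the count the slide's formula gives.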


Alea Reactive Dataflow – Goals

GPU parallel programming for (almost) everyone

- Radical simplification
  - No GPU experience required
  - Fast development
  - High performance comes automatically
  - Guaranteed memory safety
- Cross platform
- Multiple targets
  - Single GPU
  - Multiple GPUs
  - Clusters and grids of GPUs
- Based on the Alea GPU platform

Alea Reactive Dataflow – Model

- Dataflow
  - Graph of operations
  - Data propagated through graph
- Reactive
  - Feed input in arbitrary intervals
  - Listen for asynchronous output
- Program is purely descriptive
  - What, not how
- Efficient execution behind the scenes
  - Vector-parallel operations
  - Stream operations on GPU
  - Minimize memory copying
  - Hybrid multi-platform scheduling

Alea Reactive Dataflow – Operations

- Unit of calculation (typically vector-parallel)
- Input and output ports
- Port = stream of typed data
- Consumes input, produces output

[Diagram: example operations and their ports]
- Map: Input: T[]; Output: U[]
- MatrixProduct: Left: T[,], Right: T[,]; Output: T[,]
- Splitter: Input: Tuple<T, U>; First: T, Second: U

Alea Reactive Dataflow – Graph

[Diagram: Random -> Pairing -> Map -> Average; edge labels float[], int[], Pair<float>[]]

    let randoms = Random<_>(0.0, 1.0)
    let coordinates = Pairing<_>()
    let inUnitCircle = Map<_, _>(fun p ->
        if p.Left * p.Left + p.Right * p.Right <= 1.0
        then 1.0 else 0.0)
    let average = new Average<_>()

    randoms.Output.ConnectTo(coordinates.Input)
    coordinates.Output.ConnectTo(inUnitCircle.Input)
    inUnitCircle.Output.ConnectTo(average.Input)

Alea Reactive Dataflow – Dataflow

- Send data to input port
- Receive from output port
- All asynchronous

[Diagram: Random -> Pairing -> Map -> Average; the inputs 1000 and 1000000 are sent in and each produces a Write(…) at the output]

    average.Output.OnReceive(fun x ->
        printfn "Pi = %f" (4.0*x))

    randoms.Input.Send(1000)
    randoms.Input.Send(1000000)
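The Random -> Pairing -> Map -> Average graph estimates π by Monte Carlo sampling. The following plain Python sketch mimics the send/receive style on the CPU. It is not the Alea Reactive Dataflow API: the classes `Port` and `Operation` are hypothetical, and everything runs synchronously, whereas the framework described here is asynchronous.

```python
import random


class Port:
    """Output port: forwards every value to connected inputs and listeners."""

    def __init__(self):
        self.listeners = []

    def connect_to(self, operation):
        self.listeners.append(operation.send)

    def on_receive(self, callback):
        self.listeners.append(callback)

    def emit(self, value):
        for listener in self.listeners:
            listener(value)


class Operation:
    """Unit of calculation: consumes one input, produces one output."""

    def __init__(self, fn):
        self.fn = fn
        self.output = Port()

    def send(self, value):
        self.output.emit(self.fn(value))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
randoms = Operation(lambda n: [rng.random() for _ in range(n)])
coordinates = Operation(lambda xs: list(zip(xs[0::2], xs[1::2])))
in_unit_circle = Operation(
    lambda ps: [1.0 if x * x + y * y <= 1.0 else 0.0 for x, y in ps])
average = Operation(lambda hits: sum(hits) / len(hits))

randoms.output.connect_to(coordinates)
coordinates.output.connect_to(in_unit_circle)
in_unit_circle.output.connect_to(average)

results = []
average.output.on_receive(lambda x: results.append(4.0 * x))

randoms.send(1000)
randoms.send(100000)
```

Each `send` pushes one batch through the whole graph and appends one estimate of π to `results`; the larger batch gives the tighter estimate.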

Alea Reactive Dataflow – Scheduling

- Operations are implemented for GPU and/or CPU
- GPU operations are combined into a stream
- Memory is copied only when needed
- Host scheduling with .NET TPL
- Automatic memory management

[Diagram: operations assigned to GPU or CPU, with COPY steps between them]
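"Memory copy only when needed" can be illustrated with a toy planner that inserts a COPY step only where consecutive operations run on different devices. This is a hypothetical Python sketch, not the Alea scheduler; the function `plan_copies` and the device assignment in the usage note are invented for the illustration.

```python
def plan_copies(stages, source="CPU", sink="CPU"):
    """Insert COPY steps between stages that run on different devices.

    stages is a list of (operation name, device) pairs in execution order.
    """
    plan = []
    location = source  # where the data currently lives
    for name, device in stages:
        if device != location:
            plan.append(("COPY", location + "->" + device))
            location = device
        plan.append((name, device))
    if location != sink:
        plan.append(("COPY", location + "->" + sink))
    return plan
```

For the stages Random, Pairing and Map on the GPU followed by Average on the CPU, the plan contains two copies: one onto the GPU before Random and one back to the CPU before Average. The three GPU operations run back to back without any transfer in between.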

Conclusion

- Alea GPU platform is a professional GPU development stack for .NET / Mono
  - Productivity with modern languages and first class tooling
  - Simplifies cross-platform GPU development and portability
  - Same performance as CUDA C/C++
  - Unique features such as GPU scripting
  - Full flexibility of CUDA for advanced GPU coding on .NET
- Alea reactive dataflow
  - Makes GPU acceleration accessible to programmers without GPU knowledge
  - Fully independent of the CUDA GPU programming model
- Alea GPU V2 is scheduled for February 2015


Thank you