Efficient Abstractions for GPGPU Programming
Mathias Bourgoin & Emmanuel Chailloux
16.12.2015
Efficient abstractions for GPGPU programming
PhD (LIP6/UPMC)
GPGPU programming → general-purpose computations on the GPU
Abstractions → languages and algorithmic constructs
Efficient → high-performance computing
Applications → computational science and numerical simulation

OpenGPU project (Systematic cluster)
Academic and industrial partners
Goal : provide open-source solutions for GPGPU programming
Success criterion : develop real-size numerical applications
Mathias Bourgoin & Emmanuel Chailloux Efficient Abstractions for GPGPU Programming 16.12.15 2 / 25
Graphic card
Properties of a dedicated graphic card
Several multi-processors
Dedicated memory
Connected to a host (CPU) via a PCI-Express bus
Implies data transfers between host and graphic-card memories
Complex and specific programming
Current hardware
              CPU           GPU
# cores       4-18          300-3000
Max memory    8 GB-1.5 TB   1 GB-12 GB
GFLOPS (SP)   200-720       1000-6000
GFLOPS (DP)   100-360       100-2000
GPGPU Programming
Two main frameworks
Cuda (NVidia)
OpenCL (the OpenCL consortium)

Different languages
To write kernels : assembly (PTX, SPIR, IL,…) or subsets of C/C++
To manage kernels : C/C++/Objective-C; bindings : Fortran, Python, Java, …

Stream processing
From a data set (stream), a series of computations (kernel) is applied to each element of the stream.
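For intuition, the stream-processing model can be sketched sequentially in plain OCaml (a hypothetical stream_map helper, not part of any GPGPU API): a scalar kernel is applied to every element of the stream; on a GPU, each application would run in its own thread.

```ocaml
(* Stream processing, sequential sketch: apply a scalar "kernel"
   to each element of a stream (here an OCaml array). On a GPU,
   each application would execute in its own thread, in parallel. *)
let stream_map kernel stream = Array.map kernel stream

let () =
  let stream = [| 1.0; 2.0; 3.0 |] in
  let result = stream_map (fun x -> x *. x) stream in
  assert (result = [| 1.0; 4.0; 9.0 |])
```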
GPGPU programming in practice 1

(Figure, built up step by step.) A kernel launch defines a grid of blocks (Block 0, Block 1, …), each block containing threads (Thread (0,0), Thread (1,0), …). Every thread owns private registers and local memory; the threads of a block share a per-block shared memory; all blocks read and write the device-wide global memory.
Do not forget transfers between the host and its guests :

                   CPU-x86      GPU mobile   GPU gamer     GPU gamer          GPU HPC
                   Xeon E7 v3   GTX 980M     GTX 980 Ti    Radeon R9 Fury X   Tesla K40
Memory bandwidth   102 GB/s     160 GB/s     336.5 GB/s    512 GB/s           288 GB/s

PCI-Express 3.0 maximum bandwidth is 16 GB/s.
PCI-Express 4.0 (not before 2017) : 32 GB/s.
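The grid/block/thread hierarchy also fixes how each thread locates its portion of the data. A sequential OCaml sketch of the 1-D index computation (hypothetical helper names, mirroring Cuda's blockIdx.x * blockDim.x + threadIdx.x):

```ocaml
(* Global thread index in a 1-D launch: each of [grid_dim] blocks
   runs [block_dim] threads; thread [thread_idx] of block
   [block_idx] gets index block_idx * block_dim + thread_idx.
   Threads whose index exceeds the data size must return early,
   as the vec_add kernel below does. *)
let global_thread_id ~block_dim ~block_idx ~thread_idx =
  block_idx * block_dim + thread_idx

let () =
  let n = 1000 and block_dim = 256 in
  (* round the grid size up so every element is covered *)
  let grid_dim = (n + block_dim - 1) / block_dim in
  assert (grid_dim = 4);
  (* the last useful thread covers element n-1; later ones are idle *)
  assert (global_thread_id ~block_dim ~block_idx:3 ~thread_idx:231 = 999);
  assert (grid_dim * block_dim >= n)
```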
GPGPU programming in practice 2
Kernel : small example using OpenCL
Vector addition
__kernel void vec_add(__global const double *a,
                      __global const double *b,
                      __global double *c, int N)
{
    int nIndex = get_global_id(0);
    if (nIndex >= N)
        return;
    c[nIndex] = a[nIndex] + b[nIndex];
}
GPGPU programming in practice 2
Host : small example using C
// create OpenCL device & context
cl_context hContext;
hContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, 0, 0, 0);
// query all devices available to the context
size_t nContextDescriptorSize;
clGetContextInfo(hContext, CL_CONTEXT_DEVICES,
                 0, 0, &nContextDescriptorSize);
cl_device_id *aDevices = malloc(nContextDescriptorSize);
clGetContextInfo(hContext, CL_CONTEXT_DEVICES,
                 nContextDescriptorSize, aDevices, 0);
// create a command queue for the first device the context reported
cl_command_queue hCmdQueue;
hCmdQueue = clCreateCommandQueue(hContext, aDevices[0], 0, 0);
// create & compile program
cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1, sProgramSource, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);
// create kernel
cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vec_add", 0);
// allocate device memory
cl_mem hDeviceMemA, hDeviceMemB, hDeviceMemC;
hDeviceMemA = clCreateBuffer(hContext,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             cnDimension * sizeof(cl_double), pA, 0);
hDeviceMemB = clCreateBuffer(hContext,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             cnDimension * sizeof(cl_double), pB, 0);
hDeviceMemC = clCreateBuffer(hContext,
                             CL_MEM_WRITE_ONLY,
                             cnDimension * sizeof(cl_double), 0, 0);
// set up parameter values
clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);
// the kernel's fourth argument (N) must be set as well
int n = (int)cnDimension;
clSetKernelArg(hKernel, 3, sizeof(int), (void *)&n);
// execute kernel
clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, 0,
                       &cnDimension, 0, 0, 0, 0);
// copy results from device back to host
clEnqueueReadBuffer(hCmdQueue, hDeviceMemC, CL_TRUE, 0,
                    cnDimension * sizeof(cl_double), pC, 0, 0, 0);

clReleaseMemObject(hDeviceMemA);
clReleaseMemObject(hDeviceMemB);
clReleaseMemObject(hDeviceMemC);
GPGPU Programming with OCaml
OCaml
High-level general-purpose programming language
Efficient sequential computations
Statically typed
Type inference
Multiparadigm (imperative, object-oriented, functional, modular)
Compiles to bytecode/native code
Automatic memory management (with a very efficient garbage collector)
Interactive toplevel (to learn, test and debug)
Interoperable with C

Portable
OS : Windows - Unix (OS X, Linux…)
Architectures : x86, x86-64, PowerPC, ARM…
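A small illustration of these points in practice (plain OCaml, no GPGPU involved): sum_sq receives the type float array -> float by inference alone, and the same computation can be written functionally or imperatively.

```ocaml
(* OCaml infers types: [sum_sq] gets type float array -> float
   without any annotation, written in functional style (fold). *)
let sum_sq v = Array.fold_left (fun acc x -> acc +. x *. x) 0.0 v

(* The same computation in imperative style: both paradigms mix
   freely in one program. *)
let sum_sq_imp v =
  let acc = ref 0.0 in
  Array.iter (fun x -> acc := !acc +. x *. x) v;
  !acc

let () =
  let v = [| 1.0; 2.0; 3.0 |] in
  assert (sum_sq v = 14.0);
  assert (sum_sq_imp v = sum_sq v)
```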
OCaml - Usage
Domains
Teaching
Research
Industry
The OCaml Consortium

Software examples :
The Coq proof assistant (Inria)
LexiFi's Modeling Language for Finance
The FFTW library (MIT)
Facebook's Hack and Flow compilers
XenServer (Citrix)
Main Goals
Target Cuda/OpenCL frameworks with OCaml
Unify these two frameworks
Abstract memory transfers
Use static type checking to verify kernels
Propose abstractions for GPGPU programming
Keep the high performance
Host-side solution : an OCaml library
SPOC overview
Abstract frameworks
Unify both APIs (Cuda/OpenCL) via dynamic linking.
Portable solution : multi-GPGPU, heterogeneous

Abstract transfers
Vectors move automatically between CPU and GPGPUs
On-demand (lazy) transfers
Automatic allocation/deallocation of the memory space used by vectors (on the host as well as on GPGPU devices)
Failure during allocation on a GPGPU triggers a garbage collection
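A minimal sketch of the on-demand transfer idea, with hypothetical types (SPOC's actual implementation is considerably more involved): each vector records where its up-to-date copy lives, and a transfer happens only when the data is actually needed elsewhere.

```ocaml
(* Hypothetical sketch of lazy transfers: a vector remembers the
   location of its current copy; [on_device] / [on_host] trigger
   a transfer only when the data is not already where it is needed. *)
type location = Host | Device of int

type 'a vec = { data : 'a array; mutable loc : location }

let transfers = ref 0  (* count simulated PCI-Express transfers *)

let on_device dev v =
  if v.loc <> Device dev then (incr transfers; v.loc <- Device dev)

let on_host v =
  if v.loc <> Host then (incr transfers; v.loc <- Host)

let () =
  let v = { data = [| 1.0; 2.0 |]; loc = Host } in
  on_device 0 v;   (* first use on GPU 0: one transfer *)
  on_device 0 v;   (* already on GPU 0: no transfer *)
  on_host v;       (* read back on the CPU: one transfer *)
  assert (!transfers = 2)
```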
A small example
CPU RAM
GPU1 RAM
GPU0 RAM
Example
let dev = Devices.init ()
let n = 1_000_000
let v1 = Vector.create Vector.float64 n
let v2 = Vector.create Vector.float64 n
let v3 = Vector.create Vector.float64 n

let k = vec_add (v1, v2, v3, n)
let block = { blockX = 1024; blockY = 1; blockZ = 1 }
let grid  = { gridX = (n + 1024 - 1) / 1024; gridY = 1; gridZ = 1 }

let main () =
  random_fill v1;
  random_fill v2;
  Kernel.run k (block, grid) dev.(0);
  for i = 0 to Vector.length v3 - 1 do
    Printf.printf "res[%d] = %f; " i v3.[<i>]
  done;
(The slide is repeated with its figure animated: v1, v2 and v3 are first allocated in CPU RAM; when the kernel runs on dev.(0) they move to GPU0 RAM; when v3 is read on the host, only v3 is transferred back to CPU RAM, while v1 and v2 stay on the GPU.)
Sarek : Stream ARchitecture using Extensible Kernels
Vector addition with Sarek
let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then
    c.[<idx>] <- add a.[<idx>] b.[<idx>]
Vector addition with OpenCL
__kernel void vec_add(__global const double *a,
                      __global const double *b,
                      __global double *c, int N)
{
    int nIndex = get_global_id(0);
    if (nIndex >= N)
        return;
    c[nIndex] = a[nIndex] + b[nIndex];
}
Sarek
Vector addition with Sarek
let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then
    c.[<idx>] <- add a.[<idx>] b.[<idx>]

Sarek features
ML-like syntax
Type inference
Static type checking
Static compilation to OCaml code
Dynamic compilation to Cuda/OpenCL
Sarek static compilation
Sarek code :

  kern a -> let idx = Std.global_thread_id () in a.[<idx>] <- 0

Typing turns this into a typed IR, e.g. :

  Bind((Id 0), (ModuleAccess((Std), (global_thread_id)), (VecSet(VecAcc…))))

From the typed IR, three artefacts are generated :

OCaml code generation :

  fun a -> let idx = Std.global_thread_id () in a.[<idx>] <- 0

Kir generation (kernel IR) :

  KernParams (VecVar 0, VecVar 1, …)

spoc_kernel generation :

  class spoc_class1
    method run = …
    method compile = …
  end
  new spoc_class1
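The code-generation steps can be pictured as a walk over an IR that prints source text. This toy OCaml sketch (a hypothetical AST, far simpler than Sarek's typed IR) emits the OpenCL body of the vec_add kernel:

```ocaml
(* Toy kernel IR: just enough to express
     c.[<idx>] <- a.[<idx>] + b.[<idx>]
   A real compiler such as Sarek's also type-checks the IR and
   emits the kernel signature, bounds guard, etc. *)
type expr =
  | Var of string
  | Get of string * expr          (* v.[<i>]      *)
  | Add of expr * expr
  | Set of string * expr * expr   (* v.[<i>] <- e *)

let rec emit = function
  | Var x -> x
  | Get (v, i) -> Printf.sprintf "%s[%s]" v (emit i)
  | Add (l, r) -> Printf.sprintf "(%s + %s)" (emit l) (emit r)
  | Set (v, i, e) -> Printf.sprintf "%s[%s] = %s;" v (emit i) (emit e)

let () =
  let body = Set ("c", Var "idx",
                  Add (Get ("a", Var "idx"), Get ("b", Var "idx"))) in
  assert (emit body = "c[idx] = (a[idx] + b[idx]);")
```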
Sarek dynamic compilation
let my_kernel = kern … -> …
… ;;
Kirc.gen my_kernel ;

Kirc.gen compiles the kernel source either to a Cuda C source file, which nvcc (-O3 -ptx …) turns into Cuda ptx assembly, or to OpenCL C99.

Kirc.run my_kernel dev (block,grid) ;

Kirc.run picks the kernel source matching the device (Cuda ptx assembly or OpenCL C99), compiles and runs it on that device, then returns to OCaml code execution.
Vector addition

SPOC + Sarek

open Spoc

let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then
    c.[<idx>] <- add a.[<idx>] b.[<idx>]

let dev = Devices.init ()
let n = 1_000_000
let v1 = Vector.create Vector.float64 n
let v2 = Vector.create Vector.float64 n
let v3 = Vector.create Vector.float64 n

let block = { blockX = 1024; blockY = 1; blockZ = 1 }
let grid  = { gridX = (n + 1024 - 1) / 1024; gridY = 1; gridZ = 1 }

let main () =
  random_fill v1;
  random_fill v2;
  Kirc.gen vec_add;
  Kirc.run vec_add (v1, v2, v3, n) (block, grid) dev.(0);
  for i = 0 to Vector.length v3 - 1 do
    Printf.printf "res[%d] = %f; " i v3.[<i>]
  done;

OCaml
No explicit transfer
Type inference
Static type checking
Portable
Heterogeneous
Sarek transformations
Using Sarek
Transformations are OCaml functions modifying the Sarek AST.
Example : map (kern a -> b)
Scalar computations ('a -> 'b) are transformed into vector ones ('a vector -> 'b vector).

Vector addition

let v1 = Vector.create Vector.float64 10_000
and v2 = Vector.create Vector.float64 10_000 in
let v3 = map2 (kern a b -> a + b) v1 v2

val map2 :
  ('a -> 'b -> 'c) sarek_kernel ->
  ?dev:Spoc.Devices.device ->
  'a Spoc.Vector.vector ->
  'b Spoc.Vector.vector -> 'c Spoc.Vector.vector
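Ignoring devices and transfers, the behaviour of map2 can be approximated on plain OCaml arrays (a CPU sketch only, not the real skeleton, which compiles the Sarek kernel and runs it on a device):

```ocaml
(* CPU approximation of the [map2] skeleton: lift a scalar
   function 'a -> 'b -> 'c to vectors, element-wise. *)
let map2 f v1 v2 =
  if Array.length v1 <> Array.length v2 then
    invalid_arg "map2: vectors must have the same length";
  Array.init (Array.length v1) (fun i -> f v1.(i) v2.(i))

let () =
  let v1 = [| 1.0; 2.0; 3.0 |] and v2 = [| 10.0; 20.0; 30.0 |] in
  assert (map2 ( +. ) v1 v2 = [| 11.0; 22.0; 33.0 |])
```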
Real size example
PROP
Awarded by the UK Research Councils' HEC Strategy Committee
Simulates the scattering of e− in H-like ions at intermediate energies
Programmed in FORTRAN
Compatible with sequential architectures, HPC clusters and supercomputers

Versions                          Time (s)
FORTRAN CPU                       4271
FORTRAN GPU                       951
OCaml GPU                         1195
OCaml GPU (with native kernels)   1018

SPOC+Sarek achieves 80% of hand-tuned Fortran performance.
SPOC+external kernels is on par with Fortran (93%).

Type-safe : 30% code reduction.
Memory manager + GC : no more explicit transfers.
Tools - 1
OCaml Toplevel + SPOC + HPC libraries
Tools - 2
SPOC for the web
Access GPGPU from web browsers
Using the js_of_ocaml compiler
Translation of the low-level part of SPOC + development of a dedicated memory manager
Source and web demos/tutorials : http://www.algo-prog.info/spoc/
SPOC can be installed via OPAM (OCaml Package Manager)

Accessibility and teaching
Simpler than classic tools : no more explicit transfers
Web = instantly accessible
Perfect playground for GPGPU/HPC courses
focused on kernel optimization
but mostly on algorithm composition
SPOC for the web
Conclusion
SPOC
Unifies Cuda/OpenCL
Automatic transfers
Compatible with existing optimized libraries

Sarek
OCaml-like syntax
Type inference and static type checking
Easily extensible

Skeletons/Transformations
Simplify programming
Offer additional automatic optimizations
Conclusion
Benchmarks
Same performance as with other solutions
Heterogeneous
Efficient with GPGPUs as well as with multicore CPUs

Application : PROP
More safety (memory/types)
Keeps the level of performance
Validates our solution
Current (and future) work
Extend the implementation
Extend Sarek : types, functions, recursion, polymorphism…
Optimize code generation
Dynamic and automatic optimizations for multiple architectures
Target new architectures

Extend the skeletons
Cost model for Sarek
More skeletons based on Sarek
Skeletons dedicated to very heterogeneous architectures (supercomputers)
Thanks
SPOC : http://www.algo-prog.info/spoc/
SPOC is compatible with x86_64 Unix (Linux, Mac OS X) and Windows.

For more information : [email protected]