
Efficient Abstractions for GPGPU Programming

Mathias Bourgoin & Emmanuel Chailloux

16.12.2015

Efficient abstractions for GPGPU programming

PhD (LIP6/UPMC)

GPGPU programming → general purpose computations on the GPU

Abstractions → languages and algorithmic constructs

Efficient → High Performance Computing

Applications → computational science and numerical simulation

OpenGPU project

Systematic Cluster

Academic and Industrial partners

Goal : provide open-source solutions for GPGPU programming

Success : develop real size numerical applications


Graphic card

Properties of a dedicated graphic card

Several multi-processors

Dedicated memory

Connected to a host (CPU) via a PCI-Express bus

Implies data transfers between host and graphic card memories

Complex and specific programming

Current hardware

              CPU          GPU
# cores       4-18         300-3000
Max Memory    8GB-1.5TB    1GB-12GB
GFLOPS SP     200-720      1000-6000
GFLOPS DP     100-360      100-2000


GPGPU Programming

Two main frameworks

Cuda (NVidia)

OpenCL (Consortium OpenCL)

Different languages

To write kernels
Assembly (PTX, SPIR, IL, …)
Subsets of C/C++

To manage kernels
C/C++/Objective-C
Bindings : Fortran, Python, Java, …

Stream Processing
From a data set (stream), a series of computations (kernel) is applied to each element of the stream.


GPGPU programming in practice 1

[Figure: GPU execution and memory hierarchy, built up step by step]

Grid
  Global memory
  Block 0
    Shared memory
    Thread (0,0) : registers, local mem.
    Thread (1,0) : registers, local mem.
  Block 1
    Shared memory
    Thread (0,1) : registers, local mem.
    Thread (1,1) : registers, local mem.

Do not forget transfers between the host and its guests

                    CPU-X86      GPU Mobile   GPU Gamer    GPU Gamer          GPU HPC
                    Xeon E7 v3   GTX 980M     GTX 980 Ti   Radeon R9 Fury X   Tesla K40
Memory bandwidth    102GB/s      160GB/s      336.5GB/s    512GB/s            288GB/s

PCI-Express 3.0 maximum bandwidth is 16GB/s

PCI-Express 4.0 (not before 2017) 32GB/s
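
The gap between on-card memory bandwidth and the PCI-Express link can be made concrete with a little arithmetic. The sketch below is only an idealised estimate: the function names are hypothetical, 1 GB is taken as 1e9 bytes, and latency and protocol overhead are ignored.

```c
/* Idealised transfer time in seconds for `bytes` bytes over a link that
   sustains `gb_per_s` gigabytes per second (1 GB = 1e9 bytes assumed,
   latency and protocol overhead ignored). */
double transfer_seconds(double bytes, double gb_per_s)
{
    return bytes / (gb_per_s * 1e9);
}

/* How many times faster one link is than another. */
double bandwidth_ratio(double fast_gb_per_s, double slow_gb_per_s)
{
    return fast_gb_per_s / slow_gb_per_s;
}
```

As an illustration, three vectors of 1 000 000 doubles (24 MB) would need about 1.5 ms over PCI-Express 3.0 at its 16GB/s peak, and the Tesla K40's 288GB/s memory is 18 times faster than that link.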


GPGPU programming in practice 2
Kernel : small example using OpenCL

Vector addition

__kernel void vec_add(__global const double *a,
                      __global const double *b,
                      __global double *c, int N)
{
    int nIndex = get_global_id(0);
    if (nIndex >= N)
        return;
    c[nIndex] = a[nIndex] + b[nIndex];
}
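
To see what the kernel does without a GPU, the launch can be emulated on the CPU. This is a minimal sketch, not how an OpenCL runtime works: the hypothetical `vec_add_item` is the kernel body with the global id passed explicitly, and `vec_add_emulated` runs the work-items one after another instead of in parallel.

```c
#include <stddef.h>

/* Kernel body for one work-item. `gid` plays the role of get_global_id(0). */
static void vec_add_item(size_t gid, const double *a, const double *b,
                         double *c, int n)
{
    /* Work-items beyond the end of the vectors do nothing. */
    if (n < 0 || gid >= (size_t)n)
        return;
    c[gid] = a[gid] + b[gid];
}

/* Sequential stand-in for a 1-D launch of `global_size` work-items. */
void vec_add_emulated(size_t global_size, const double *a, const double *b,
                      double *c, int n)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        vec_add_item(gid, a, b, c, n);
}
```

Launching more work-items than there are elements is harmless here because of the bounds test, which is why the kernel carries it.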


GPGPU programming in practice 2
Host : small example using C

// create OpenCL device & context
cl_context hContext;
hContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                   0, 0, 0);
// query all devices available to the context
size_t nContextDescriptorSize;
clGetContextInfo(hContext, CL_CONTEXT_DEVICES,
                 0, 0, &nContextDescriptorSize);
cl_device_id *aDevices = malloc(nContextDescriptorSize);
clGetContextInfo(hContext, CL_CONTEXT_DEVICES,
                 nContextDescriptorSize, aDevices, 0);
// create a command queue for first device the context reported
cl_command_queue hCmdQueue;
hCmdQueue = clCreateCommandQueue(hContext, aDevices[0], 0, 0);
// create & compile program
cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1,
                                     sProgramSource, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);

// create kernel
cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vec_add", 0);

// allocate device memory
cl_mem hDeviceMemA, hDeviceMemB, hDeviceMemC;
hDeviceMemA = clCreateBuffer(hContext,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             cnDimension * sizeof(cl_double),
                             pA,
                             0);

hDeviceMemB = clCreateBuffer(hContext,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             cnDimension * sizeof(cl_double),
                             pB,  // second input vector (the slide repeated pA)
                             0);

hDeviceMemC = clCreateBuffer(hContext,
                             CL_MEM_WRITE_ONLY,
                             cnDimension * sizeof(cl_double),
                             0, 0);

// setup parameter values
clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);
// the vec_add kernel also takes N as its fourth argument
cl_int nCount = (cl_int)cnDimension;
clSetKernelArg(hKernel, 3, sizeof(cl_int), (void *)&nCount);
// execute kernel
clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, 0,
                       &cnDimension, 0, 0, 0, 0);
// copy results from device back to host (the read is enqueued on the command queue)
clEnqueueReadBuffer(hCmdQueue, hDeviceMemC, CL_TRUE, 0,
                    cnDimension * sizeof(cl_double),
                    pC, 0, 0, 0);

clReleaseMemObject(hDeviceMemA);
clReleaseMemObject(hDeviceMemB);
clReleaseMemObject(hDeviceMemC);


GPGPU Programming with OCaml


OCaml

High-level general-purpose programming language
Efficient sequential computations
Statically typed
Type inference
Multiparadigm (imperative, object oriented, functional, modular)
Compile to bytecode/native code
Automatic memory management (with very efficient Garbage Collector)
Interactive Toplevel (to learn, test and debug)
Interoperable with C

Portable
OS : Windows - Unix (OS-X, Linux…)
Architectures : x86, x86-64, PowerPC, ARM…


OCaml - Usage

Domains

Teaching

Research

Industry

The OCaml consortium

Software examples :

The Coq proof assistant (Inria)

LexiFi’s Modeling Language for Finance

The FFTW library (MIT)

Facebook’s Hack and Flow compilers

XenServer (Citrix)


Main Goals

Target Cuda/OpenCL frameworks with OCaml

Unify these two frameworks

Abstract memory transfers

Use static type checking to verify kernels

Propose abstractions for GPGPU programming

Keep the high performance

Host-side solution : an OCaml library


SPOC overview

Abstract frameworks
Unify both APIs (Cuda/OpenCL), dynamic linking.
Portable solution, multi-GPGPU, heterogeneous

Abstract transfers
Vectors move automatically between CPU and GPGPUs

On-demand (lazy) transfers

Automatic allocation/deallocation of the memory space used by vectors (on the host as well as on GPGPU devices)

Failure during allocation on a GPGPU triggers a garbage collection
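
The idea of on-demand transfers can be sketched in a few lines. This is not SPOC's implementation: the `lazy_vector` record and every `lv_*` name are hypothetical, and the "device" is just a second host buffer standing in for GPU memory. Each vector remembers where its current copy lives and is copied only when it is needed on the other side.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { ON_HOST, ON_DEVICE } location;

/* Hypothetical vector record: a host buffer, a simulated device buffer,
   a tag saying which copy is current, and a count of copies performed. */
typedef struct {
    double  *host;
    double  *device;    /* stands in for GPU memory */
    size_t   len;
    location where;
    int      transfers;
} lazy_vector;

/* Allocate both buffers up front. Returns NULL if any allocation fails. */
lazy_vector *lv_create(size_t len)
{
    lazy_vector *v = malloc(sizeof *v);
    if (!v) return NULL;
    v->host = calloc(len, sizeof *v->host);
    v->device = calloc(len, sizeof *v->device);
    if (!v->host || !v->device) {
        free(v->host); free(v->device); free(v);
        return NULL;
    }
    v->len = len;
    v->where = ON_HOST;
    v->transfers = 0;
    return v;
}

void lv_free(lazy_vector *v)
{
    if (!v) return;
    free(v->host); free(v->device); free(v);
}

/* Copy the data only if it is not already where it is needed. */
static void lv_move(lazy_vector *v, location target)
{
    if (v->where == target) return;
    if (target == ON_DEVICE)
        memcpy(v->device, v->host, v->len * sizeof *v->host);
    else
        memcpy(v->host, v->device, v->len * sizeof *v->host);
    v->where = target;
    v->transfers++;
}

/* Host-side accesses pull the vector back to the host on demand. */
double lv_get(lazy_vector *v, size_t i) { lv_move(v, ON_HOST); return v->host[i]; }
void lv_set(lazy_vector *v, size_t i, double x) { lv_move(v, ON_HOST); v->host[i] = x; }

/* Simulated kernel launch: pushes the vector to the device if necessary. */
void lv_scale_on_device(lazy_vector *v, double factor)
{
    lv_move(v, ON_DEVICE);
    for (size_t i = 0; i < v->len; ++i)
        v->device[i] *= factor;
}
```

With this scheme two consecutive kernel launches on the same vector cost a single host-to-device copy, and the copy back happens only when the host reads an element.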


A small example

[Figure: memory diagram with three boxes: CPU RAM, GPU1 RAM, GPU0 RAM]

Example

let dev = Devices.init ()
let n = 1_000_000
let v1 = Vector.create Vector.float64 n
let v2 = Vector.create Vector.float64 n
let v3 = Vector.create Vector.float64 n

let k = vec_add (v1, v2, v3, n)
let block = {blockX = 1024; blockY = 1; blockZ = 1}
let grid = {gridX = (n + 1024 - 1) / 1024; gridY = 1; gridZ = 1}

let main () =
  random_fill v1;
  random_fill v2;
  Kernel.run k (block, grid) dev.(0);
  for i = 0 to Vector.length v3 - 1 do
    Printf.printf "res[%d] = %f; " i v3.[<i>]
  done;
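
The grid size in the example, (n + 1024 - 1) / 1024, is a rounded-up integer division: it gives the smallest number of 1024-thread blocks that covers n elements. A minimal sketch of that arithmetic, with hypothetical function names and assuming a non-zero block size:

```c
/* Smallest number of blocks of `block_size` threads covering `n` elements.
   Assumes block_size > 0. */
unsigned long blocks_for(unsigned long n, unsigned long block_size)
{
    return (n + block_size - 1) / block_size;
}

/* Threads launched beyond the last element. These are the threads that a
   bounds test inside the kernel has to switch off. */
unsigned long surplus_threads(unsigned long n, unsigned long block_size)
{
    return blocks_for(n, block_size) * block_size - n;
}
```

For n = 1 000 000 and blocks of 1024 threads this gives 977 blocks, that is 1 000 448 threads, of which 448 have no element to process.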


A small example

[Figure: same memory diagram (CPU RAM, GPU1 RAM, GPU0 RAM); v1, v2 and v3 are shown together in the same memory while the example code runs]


A small example

[Figure: same memory diagram (CPU RAM, GPU1 RAM, GPU0 RAM); v3 is now shown in a different memory from v1 and v2]


Sarek : Stream ARchitecture using Extensible Kernels

Vector addition with Sarek

let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then
    c.[<idx>] <- add a.[<idx>] b.[<idx>]

Vector addition with OpenCL

__kernel void vec_add(__global const double *a,
                      __global const double *b,
                      __global double *c, int N)
{
    int nIndex = get_global_id(0);
    if (nIndex >= N)
        return;
    c[nIndex] = a[nIndex] + b[nIndex];
}


Sarek

Vector addition with Sarek

let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then
    c.[<idx>] <- add a.[<idx>] b.[<idx>]

Sarek features

ML-like syntax

type inference

static type checking

static compilation to OCaml code

dynamic compilation to Cuda/OpenCL


Sarek static compilation

[Figure: compilation pipeline from Sarek code to a spoc_kernel]

Sarek code :
  kern a → let idx = Std.global_thread_id () in a.[<idx>] ← 0

IR :
  Bind((Id 0), (ModuleAccess((Std), (global_thread_id)), (VecSet(VecAcc…))))

Typing : IR → typed IR

OCaml code generation → OCaml code :
  fun a -> let idx = Std.global_thread_id () in a.[<idx>] <- 0l

Kir generation → Kir :
  Kern
    Params
      VecVar 0
      VecVar 1
      …

spoc_kernel generation → spoc_kernel :
  class spoc_class1
    method run = …
    method compile = …
  end
  new spoc_class1


Sarek dynamic compilation

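[Figure: what happens at run time when a Sarek kernel is generated and run]

let my_kernel = kern … -> …

… ;;

Kirc.gen my_kernel ;
  Compile to : Cuda C source file, OpenCL C99
  Cuda C source file : compile to Cuda ptx assembly (nvcc -O3 -ptx …)

Kirc.run my_kernel dev (block,grid) ;
  Kernel source chosen from the device : OpenCL → OpenCL C99, Cuda → Cuda ptx assembly
  Compile and run on the device
  Return to OCaml code execution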


Vectors addition

SPOC + Sarek

open Spoc
let vec_add = kern a b c n ->
  let open Std in
  let open Math.Float64 in
  let idx = global_thread_id in
  if idx < n then
    c.[<idx>] <- add a.[<idx>] b.[<idx>]

let dev = Devices.init ()
let n = 1_000_000
let v1 = Vector.create Vector.float64 n
let v2 = Vector.create Vector.float64 n
let v3 = Vector.create Vector.float64 n

let block = {blockX = 1024; blockY = 1; blockZ = 1}
let grid = {gridX = (n + 1024 - 1) / 1024; gridY = 1; gridZ = 1}

let main () =
  random_fill v1;
  random_fill v2;
  Kirc.gen vec_add;
  Kirc.run vec_add (v1, v2, v3, n) (block, grid) dev.(0);
  for i = 0 to Vector.length v3 - 1 do
    Printf.printf "res[%d] = %f; " i v3.[<i>]
  done;

OCaml
No explicit transfer
Type inference
Static type checking
Portable
Heterogeneous


Sarek transformations

Using Sarek
Transformations are OCaml functions modifying Sarek AST :
Example : map (kern a -> b)
Scalar computations ('a → 'b) are transformed into vector ones ('a vector → 'b vector).

Vector addition

let v1 = Vector.create Vector.float64 10_000
and v2 = Vector.create Vector.float64 10_000 in
let v3 = map2 (kern a b -> a + b) v1 v2

val map2 :
  ('a -> 'b -> 'c) sarek_kernel ->
  ?dev:Spoc.Devices.device ->
  'a Spoc.Vector.vector ->
  'b Spoc.Vector.vector -> 'c Spoc.Vector.vector
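
The effect of map2 (a scalar function of two arguments lifted to whole vectors) can be imitated in plain C with a function pointer. This is only a sequential sketch of the idea: the names `binary_op`, `map2` and `add_op` are hypothetical, the element type is fixed to double, and nothing here generates or runs GPU code as the Sarek transformation does.

```c
#include <stddef.h>

/* A scalar computation taking two values and returning one. */
typedef double (*binary_op)(double, double);

/* Apply `f` element by element to `a` and `b`, writing the results to `out`.
   All three arrays are assumed to hold at least `n` elements. */
void map2(binary_op f, const double *a, const double *b, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = f(a[i], b[i]);
}

/* Scalar addition, the counterpart of `kern a b -> a + b`. */
double add_op(double x, double y)
{
    return x + y;
}
```

Calling `map2(add_op, v1, v2, v3, n)` then plays the role of the vector addition, with the loop standing in for the threads of the generated kernel.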


Real size example

PROP

Awarded by the UK Research Councils' HEC Strategy Committee

Simulates the scattering of e− in H-like ions at intermediate energies

Programmed in FORTRAN

Compatible with : sequential architectures, HPC clusters, supercomputers

[Figure: bar chart of running time per version]

Version                            Time (s)
FORTRAN CPU                        4 271
FORTRAN GPU                          951
OCaml GPU                          1 195
OCaml GPU (with native kernels)    1 018

SPOC+Sarek achieves 80% of hand-tuned Fortran performance.
SPOC+external kernels is on par with Fortran (93%)

Type-safe                30% code reduction
Memory manager + GC      No more transfers
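
The 80% and 93% figures can be recovered from the running times as a ratio of the reference time to the measured time. The sketch below assumes that 951 s belongs to FORTRAN GPU, 1 195 s to OCaml GPU and 1 018 s to OCaml GPU with native kernels, which is the assignment of chart values that reproduces the quoted percentages. The function name is hypothetical.

```c
/* Performance of a version relative to a reference, from running times:
   1.0 means equally fast, 0.8 means 80% of the reference performance. */
double relative_performance(double reference_seconds, double measured_seconds)
{
    return reference_seconds / measured_seconds;
}
```

Under that assumption, 951 / 1195 is about 0.80 and 951 / 1018 is about 0.93, while the FORTRAN GPU version is roughly 4.5 times faster than the FORTRAN CPU one (4271 / 951).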


Tools - 1

OCaml Toplevel + SPOC + HPC libraries


Tools - 2

SPOC for the web
Access GPGPU from web browsers

Using the js_of_ocaml compiler

Translation of the low-level part of SPOC + development of a dedicated memory manager

Source and web demos/tutorials : http://www.algo-prog.info/spoc/
SPOC can be installed via OPAM (OCaml Package Manager)

Accessibility and teaching
Simpler than classic tools : no more transfers

Web = instantly accessible
Perfect playground for GPGPU/HPC courses

focused on kernel optimization
but mostly on algorithms composition



SPOC for the web


Conclusion

SPOC
Unifies Cuda/OpenCL

Automatic transfers

Compatible with existing optimized libraries

Sarek
OCaml-like syntax

Type inference and static type checking

Easily extensible

Skeletons/Transformations
Simplifies programming

Offers additional automatic optimizations


Conclusion

Benchmarks
Same performance as with other solutions

Heterogeneous

Efficient with GPGPUs as well as with multicore CPUs

Application : PROP
More safety (memory/types)

Keeps the level of performance

Validates our solution


Current (and future) work

Extend implementation
Extend Sarek : types, functions, recursion, polymorphism…

Optimize code generation

Dynamic and automatic optimizations for multiple architectures

Target new architectures

Extend skeletons
Cost model for Sarek

More skeletons based on Sarek

Skeletons dedicated to very heterogeneous architectures (super-computers)


Thanks

SPOC : http://www.algo-prog.info/spoc/
Spoc is compatible with x86_64 Unix (Linux, Mac OS X), Windows

for more information :[email protected]

