GPU Computing
Transcript of GPU Computing
Parallel Computing on GPUs
Christian Kehl, 01.01.2011
2
Overview
• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
3
Basics of Parallel Computing
Ref.: René Fink, „Untersuchungen zur Parallelverarbeitung mit wissenschaftlich-technischen Berechnungsumgebungen" („Investigations on Parallel Processing with Scientific-Technical Computing Environments"), PhD thesis, University of Rostock, 2007
4
Basics of Parallel Computing
5
Overview
• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
6
Brief History of SIMD vs. MIMD Architectures
7
Brief History of SIMD vs. MIMD Architectures
8
Brief History of SIMD vs. MIMD Architectures
9
Brief History of SIMD vs. MIMD Architectures
• 2004 – programmable GPU cores via shader technology
• 2007 – CUDA (Compute Unified Device Architecture) Release 1.0
• December 2008 – first Open Compute Language (OpenCL) specification
• March 2009 – uniform shaders, first beta releases of OpenCL
• August 2009 – release and implementation of OpenCL 1.0
10
Brief History of SIMD vs. MIMD Architectures
• SIMD technologies in GPUs:
– vector processing (ILLIAC IV)
– mathematical operation units (ILLIAC IV)
– pipelining (CRAY-1)
– local memory caching (CRAY-1)
– atomic instructions (CRAY-1)
– synchronized instruction execution and memory access (MASPAR)
11
Overview
• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
12
Platform Model
• One host + one or more compute devices
• Each compute device is composed of one or more compute units
• Each compute unit is further divided into one or more processing elements
OpenCL
13
• Total number of work-items = Gx * Gy
• Size of each work-group = Sx * Sy
• Global ID can be computed from work-group ID and local ID
Kernel Execution (OpenCL)
14
Memory Management (OpenCL)
15
Memory Management (OpenCL)
16
• Address spaces:
– Private – private to a work-item
– Local – local to a work-group
– Global – accessible by all work-items in all work-groups
– Constant – read-only global space
Memory Model (OpenCL)
17
Programming Language (OpenCL)
• Every GPU computing technology is natively programmed in C/C++ (host)
• Host-code bindings to several other languages exist (Fortran, Java, C#, Ruby)
• Device code is written exclusively in standard C plus extensions
18
• Pointers to functions are not allowed
• Pointers to pointers are allowed within a kernel, but not as an argument
• Bit-fields are not supported
• Variable-length arrays and structures are not supported
• Recursion is not supported
• Writes to pointers of types smaller than 32 bits are not supported
• Double types are not supported, but the keyword is reserved
• 3D image writes are not supported
• Some restrictions are addressed through extensions
Language Restrictions (OpenCL)
19
Overview
• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
20
• Multimedia data and tasks are best suited for SIMD processing
• Multimedia data: sequential byte streams; each byte independent
• Image processing is particularly well suited for GPUs
• The original GPU task: „compute <several FLOP> for every pixel of the screen" (computer graphics)
• The same task applies to images; only the FLOPs differ
Common Application Domain
21
• possible features realizable on the GPU:
– contrast and luminance configuration
– gamma scaling
– (pixel-by-pixel) histogram scaling
– convolution filtering
– edge highlighting
– negative image / image inversion
– …
Common Application Domain – Image Processing
22
• simple example: inversion
• implementation and use of a framework for switching between different GPGPU technologies
• creation of a command queue for each GPU
• reading the GPU kernel from a kernel file on the fly
• creation of buffers for the input and output images
• memory copy of the input image data to global GPU memory
• setting of the kernel arguments and kernel execution
• memory copy of the GPU output buffer data to a new image
Image Processing – Inversion
23
Evaluated and confirmed minimum speedup – G80 GPU (OpenCL) vs. 8-core CPU (OpenMP):
4 : 1
Image Processing – Inversion
25
Overview
• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
26
MC Study of a SMS using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé
27
Task
• The spring-mass system is defined by a differential equation
• The behavior of the system must be simulated over varying damping values
• Therefore: numerical solution in t, tε[0.0 … 2] s, with step size h = 1/1000
• Analysis of computation time and speed-up for different compute architectures
28
Task
Based on Simulation News Europe (SNE) CP2:
• 1000 simulation iterations over the simulation horizon with generated damping values (Monte Carlo study)
• consecutive averaging for s(t)
• tε[0 … 2] s; h = 0.01 → 200 steps
29
Task
Too lightweight for present architectures → modification:
• 5000 Monte Carlo iterations
• h = 0.001 → 2000 steps
Aim of the analysis: knowledge about the spring behavior for different damping values (trajectory array)
30
Task
• Simple spring-mass system:
d … damping constant
c … spring constant
• Equation of motion derived from Newton's 2nd axiom
• Modelling needed → „Massenfreischnitt" (free-body cut of the mass):
– the mass is moved
– force-balance equation
31
MC Study of a SMS using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé
32
Modelling
• numerical integration based on a 2nd-order differential equation
• a DE of order n → n DEs of 1st order

2nd-order DE, from Newton's 2nd axiom:
F_T + F_D + F_C = 0
m·s''(t) + d·s'(t) + c·s(t) = 0
s''(t) = -(d/m)·s'(t) - (c/m)·s(t)

with s(t) … position, s'(t) = v(t) … velocity, s''(t) = a(t) … acceleration
33
Modelling
• Transformation by substitution:
s_1(t) = s(t), s_2(t) = s_1'(t) = s'(t)
s_1'(t) = s_2(t)
s_2'(t) = s''(t) = -(d/m)·s'(t) - (c/m)·s(t)
→ s_2' = -(d/m)·s_2 - (c/m)·s_1
• given by SNE CP2:
c = 9000; m = 450 kg
t_start = 0 s; t_end = 2 s
start values: s(0) = 0 m; s'(0) = v(0) = 0.1 m/s
• random damping parameter d within the interval limits [800; 1200]
• 5000 iterations
34
MC Study of a SMS using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé
35
Euler as simple ODE solver
• numerical integration by the explicit Euler method

Trajectory s(t)? → ODE system!
Start values: t, s and s'

Solution:
s(t_0) = s_0
s(t_1) = s(t_0 + h) = s_0 + h·f(t_0, s_0)
s(t_2) = s_1 + h·f(t_1, s_1)
s(t_3) = s_2 + h·f(t_2, s_2)
…

Use for the spring-mass problem – iterate over all steps:
s_1' = s_2(t)
s_2' = -(d/m)·s_2(t) - (c/m)·s_1(t)
s_1(t+h) = s_1(t) + h·s_1'
s_2(t+h) = s_2(t) + h·s_2'
36
MC Study of a SMS using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé
37
existing MIMD Solutions
38
existing MIMD Solutions
• The approach cannot be applied to GPU architectures
• MIMD requirements:
– each PE has its own instruction flow
– each PE can access RAM individually
• GPU architecture → SIMD:
– each PE computes the same instruction at the same time
– each PE has to be at the same instruction when accessing RAM
Therefore: development of an SIMD approach
39
MC Study of a SMS using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé
40
An SIMD Approach
• S.P./R.F.:
– simultaneous execution of the sequential simulation with varying d-parameter on spatially distributed PEs
– averaging over the trajectories
• C.K.:
– simultaneous computation with all d-parameters for time t_n, iterative repetition until t_end
– averaging per step
41
An SIMD-Approach
42
MC Study of a SMS using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé
43
OpenMP
• Parallelization technology based on the shared-memory principle
• Synchronization is hidden from the developer
• Thread management is controllable
• For System-V-based OSes:
– parallelization by process forking
• For Windows-based OSes:
– parallelization by WinThread creation (AMD study / Intel tech paper)
44
OpenMP
• In C/C++: pragma-based preprocessor directives
• In C#: represented by parallel loops
• More than just parallelizing loops (AMD tech report)
• Literature:
– AMD/Intel tech papers
– Thomas Rauber, „Parallele Programmierung" („Parallel Programming")
– Barbara Chapman, „Using OpenMP: Portable Shared Memory Parallel Programming"
45
MC Study of a SMS using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plot
• Speed-up study
• Parallelization conclusions
• Résumé
46
Result Plot
resulting trajectory for all technologies
47
MC Study of a SMS using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé
48
Speed-Up Study
# Cores | MIMD Single   | MIMD OpenMP | SIMD Single   | SIMD OpenMP | SIMD OpenCL
--------|---------------|-------------|---------------|-------------|--------------
1       | 1.0 (T=56.53) | 1.0         | 0.9 (T=64.63) | 0.9         | 0.4 (T=144.6)
2       | X             | 1.8         | X             | 1.4         | X
4       | X             | 3.5         | X             | 2.0         | X
8       | X             | 5.7         | X             | 1.7         | X
16      | X             | 5.1         | X             | 0.5         | X
dyn/std | 1.0           | 5.7         | 0.9           | 1.7         | 0.4

OpenMP – own study – comparison CPU/GPU
SIMD Single: presented SIMD approach on the CPU
SIMD OpenMP: presented SIMD approach parallelized on the CPU
SIMD OpenCL: control of the number of executing units is not possible, therefore only one value
49
Speed-Up Study – plot legend: SIMD OpenCL, SIMD Single, MIMD Single, SIMD OpenMP, MIMD OpenMP
50
MC Study of a SMS using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé
51
Parallelization Conclusions
• The problem is not well suited for SIMD parallelization
• On-GPU reduction is too expensive in time, therefore:
– Euler computation on the GPU
– average computation on the CPU
• The most time-intensive operation is the MemCopy between GPU and main memory
• For more complex problems or different ODE solver procedures the speed-up behavior can change
52
Parallelization Conclusions
• The MIMD approach of S.P./R.F. is efficient for SNE CP2
• An OpenMP realization is possible (and was done) for both the MIMD and the SIMD approach
• The OpenMP MIMD realization shows an almost linear speedup
• Setting more threads than physically available PEs leads to significant thread overhead
• With dynamic assignment, OpenMP automatically matches the number of threads to the physically available PEs
53
MC Study of a SMS using OpenCL and OpenMP
• Task
• Modelling
• Euler as a simple ODE solver
• Existing MIMD solutions
• An SIMD approach
• OpenMP
• Result plots
• Speed-up study
• Parallelization conclusions
• Résumé
54
Résumé
• The task can be solved on both CPUs and GPUs
• GPU computing requires new approaches and porting of algorithms
• Although GPUs have a massive number of parallel operating cores, a speed-up is not achievable for every application domain
55
Résumé
• Advantages of GPU computing:
– very fast and scalable for suited problems (e.g. multimedia)
– cheap HPC technology in comparison to scientific supercomputers
– energy-efficient
– massive computing power in a small form factor
• Disadvantages of GPU computing:
– limited instruction set
– strictly SIMD
– SIMD algorithm development is hard
– no execution supervision (e.g. segmentation/page faults)
56
Overview
• Basics of Parallel Computing
• Brief History of SIMD vs. MIMD Architectures
• OpenCL
• Common Application Domain
• Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP