Manycore Application Migration:
Methodology and Tools
Moscow State University, Laurent Morin, 29 August 2011
Methodology to Port Applications
Phase 1 - Define your parallel project (Hours to Days)
•Performance goal
•Validation process
•Hotspots selection
•Continuous integration

Phase 2 - Port your application on GPU (Days to Weeks): Hotspots, Parallelization
•Exhibit parallelism
•Move kernels to GPU
Result: a GPGPU operational application with known potential

Optimize your GPGPU application (Weeks to Months): Tuning
•Exploit CPU and GPU
•Optimize kernel execution
•Reduce CPU-GPU data transfers
•Optimize CPU code
Take a decision
• Go
  o Dense hotspot
  o Fast GPU kernels
  o Low CPU-GPU data transfers
  o Midterm perspective: prepare for manycore parallelism
• No Go
  o Flat profile
  o GPU results not accurate enough, i.e. the computation cannot be validated
  o Slow GPU kernels, i.e. no speedup to be expected
  o Complex data structures
  o Big memory space requirement
Validate your Computation
• Parallel execution on manycore architectures will (probably) change floating-point computation results…
  o Parallel programming changes the computation order in an extensive way
  o Hardware floating-point units are not always IEEE 754 compliant
  o The convergence speed of iterative algorithms can be impacted
  o …
• … but that does not imply that the computation is wrong!
  o A floating-point computation is always an approximation
• The validation must be provided by the application providers
  o It is the mathematicians' or physicists' responsibility
  o They may or may not require binary-equal results
  o If not, a validation protocol must be provided
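Such a validation protocol often reduces to comparing the GPU result against the CPU reference within a relative tolerance, rather than requiring binary-equal results. A minimal sketch (the function name and tolerance threshold are illustrative, not from the slides):

```c
#include <math.h>
#include <stddef.h>

/* Illustrative validation helper: check that a GPU result matches the
 * CPU reference within a relative tolerance, instead of requiring
 * binary-equal results. Returns 1 on success, 0 on failure. */
int validate_rel(const float *ref, const float *gpu, size_t n, float tol)
{
    for (size_t i = 0; i < n; i++) {
        /* Guard against division by zero for near-zero reference values. */
        float denom = fabsf(ref[i]) > 1e-30f ? fabsf(ref[i]) : 1e-30f;
        if (fabsf(ref[i] - gpu[i]) / denom > tol)
            return 0;
    }
    return 1;
}
```

The acceptable tolerance is exactly what the mathematicians or physicists must provide when binary-equal results are not required.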
Tools to improve the methodology productivity
Tools in the Methodology
• Define your parallel project (Hours to Days):
  o Profiling tools: Gprof, KCachegrind, Vampir, Acumem ThreadSpotter
  o HMPP Wizard
• Port your application on GPU (Days to Weeks):
  o HMPP Workbench
  o Debugging tools: gdb, Allinea DDT, TotalView
• Optimize your GPGPU application (Weeks to Months):
  o Parallel profiling tools: Paraver, TAU, Vampir
  o HMPP Performance Analyzer
  o Target-specific tools: Nvidia Insight
Focus on Hotspots (Gprof example)
Profiling:
• Profile your CPU application
• Select hotspots
• Build a coherent kernel set
• Estimate the achievable speedup using Amdahl's law:

  Speedup = 1 / ((1 - P) + P/S)

  where P is the fraction of execution time spent in the parallelized hotspots and S is the speedup obtained on them.
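Amdahl's law is easy to evaluate when selecting hotspots; the helper below (illustrative, not part of the HMPP toolchain) plugs profile numbers into the formula:

```c
/* Amdahl's law: overall speedup when a fraction P of the execution
 * time is accelerated by a factor S (illustrative helper). */
double amdahl_speedup(double P, double S)
{
    return 1.0 / ((1.0 - P) + P / S);
}
```

For example, a hotspot covering 90% of the runtime and accelerated 10x on the GPU yields only about a 5.3x overall speedup, which is why a dense hotspot is a "Go" criterion.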
Move Computation to GPU:
HMPP Workbench
What is HMPP? (Hybrid Manycore Parallel Programming)
• A directive-based programming model
  o Provides an incremental path to exploit GPUs in legacy applications
  o Complements existing parallel APIs (OpenMP or MPI)
  o Avoids exit costs: a future-proof solution
  o Based on an open standard
• An integrated compilation chain
  o Code generators from C and Fortran to GPU (CUDA or OpenCL)
  o Source-to-source compiler: uses the hardware vendor's SDK
  o One single compiler driver interfacing the GPU compilers
• A runtime support for heterogeneous architectures
  o Manages the hardware resources and their availability
  o Handles transfers to and synchronizations with the hardware
  o Fast and easy application deployment on new hardware
  o Free of use
Port your application with HMPP Workbench
HMPP usage in three steps
1. A set of directives to program hardware accelerators
   • Drive your HWAs, manage transfers
   • Optimize kernels at source level
2. A complete tool-chain to build manycore applications
   • Build your hybrid application
3. A runtime to adapt to the platform configuration
HMPP Directives: drive Hybrid Applications
[Diagram: directives drive the HW-specific code generation of codelets; the HMPP Runtime manages data transfers and codelet execution on the HWA]
int main(int argc, char **argv) {
  // ...
  #pragma hmpp compute advancedload, args[M;N]
  for( j = 0 ; j < 2 ; j++ ) {
    #pragma hmpp compute callsite
    compute( size, size, alpha, a, b );
  }
  // ...
}

#pragma hmpp compute codelet, target=CUDA, args[a].io=inout
void compute( int M, int N, float alpha, float a[M][N], float b[M][N] )
{
  #pragma hmppcg grid blocksize 16 X 16
  #pragma hmppcg unroll 2, jam
  for (i = 0; i < M; i += 1) {
    #pragma hmppcg unroll 4, guarded
    for (j = 0; j < N; j += 1) {
      a[i][j] = cos(a[i][j]) + cos(b[i][j]) - alpha;
    }
  }
}
#pragma hmpp sgemm codelet, target=CUDA, args[vout].io=inout
extern void sgemm( int m, int n, int k, float alpha,
                   const float vin1[n][n], const float vin2[n][n],
                   float beta, float vout[n][n] );

int main(int argc, char **argv) {
  // ...
  for( j = 0 ; j < 2 ; j++ ) {
    #pragma hmpp sgemm callsite
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
  }
  // ...
}
HMPP Basic Programming
One directive to set up a GPU codelet…
… and another one inserted in the application to call it
HMPP BLAS Performance on NVIDIA Fermi
[Chart: SGEMM performance in Gflop/s (0 to 700) as a function of matrix size (64 to 8000), comparing MKL, HMPP, CUBLAS, and MAGMA.
MKL: 2 x Intel(R) Xeon(R) X5560 @ 2.80GHz (8 cores). HMPP, CUBLAS, MAGMA: NVidia Tesla C2050, ECC activated.]
Build a Parallel Project: HMPP Wizard
A diagnosis tool to make the code "GPU friendly"
• Features:
  o Generates GPU porting advice for C/Fortran source code
    • Unsupported constructs (pointers, goto labels, …)
    • Loop structure and parallelism validation
    • Diagnoses the translation of the loop nest into a GPU grid of threads
  o Estimates kernel behavior on a GPU and provides specific advice:
    • Reduction inside a kernel
    • Bad memory coalescing
    • Inefficient memory coalescing
    • Too low computation density
    • Detection of well-known patterns such as convolutions
  o Proposes basic GPU tuning directives
• Core technology: based on static analysis
  o Analyzes memory access patterns on the GPU
  o Takes into account the full HMPP code generation process
  o Evaluates HMPPCG optimization directives
Make your kernels more GPU friendly
HMPP Wizard: visual
• Analyze your memory access pattern
• Use HMPPCG directives to make your kernel GPU friendly

An advice: is the loop GPU friendly?
  o Example: memory access pattern over the Z dimension (3D)
    • The GPU grid is 2D → the coalescing must be on the X dimension
  o A permute pragma is proposed to fix the problem
Tune your application with the HMPP Wizard
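The coalescing advice can be sketched in plain C (array shape, names, and the directive placement are assumptions; with a regular compiler both nests simply run on the CPU):

```c
#define NX 8
#define NY 8
#define NZ 8

/* Illustrative sketch of the Wizard's advice: with the loop over the
 * contiguous X dimension outermost, consecutive inner iterations stride
 * through memory, which maps to poor coalescing on a 2D GPU grid. */
void scale_z_inner(float a[NZ][NY][NX], float s)
{
    for (int x = 0; x < NX; x++)
        for (int y = 0; y < NY; y++)
            for (int z = 0; z < NZ; z++)   /* strided: z varies slowest in memory */
                a[z][y][x] *= s;
}

/* After permuting the nest (what a proposed permute pragma would do),
 * x, the contiguous dimension, is innermost and accesses coalesce. */
void scale_x_inner(float a[NZ][NY][NX], float s)
{
    for (int z = 0; z < NZ; z++)
        for (int y = 0; y < NY; y++)
            for (int x = 0; x < NX; x++)   /* unit stride over x */
                a[z][y][x] *= s;
}
```

Both versions compute the same result; only the memory access order, and hence GPU coalescing, differs.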
Debug your hybrid & parallel application: Allinea DDT
www.allinea.com

Allinea Software
• Software tools company since 2001
  – Allinea DDT: the scalable parallel debugger
  – Allinea OPT: the optimization tool for MPI and non-MPI codes
  – Users at all scales: from 1 to 100,000 cores and above
  – World's only petascale debugger!
• Simplifying the challenge of multi- and many-core development
  – Bugs at scale need a debugger at scale
    • … until recently, debuggers were limited to ~4,000-8,000 cores
  – Bugs on GPUs need a debugger for GPUs
    • … until recently, GPU software couldn't be debugged

DDT for CAPS HMPP
• Debugging inside C kernels (codelets) on the GPU
• F90 multi-dimensional array support in progress for the GPU
• Auto-reporting of CAPS runtime errors to the DDT GUI
• Also able to debug codelets running on the CPU
Optimize CPU-GPU integration: PARAVER
Analyze the CPU-GPU Transfers
• HMPP + PARAVER
• Get a precise view of the behavior of HMPP elements
• Transfer time, execution time…
Optimize GPU Performance: HMPP Performance Analyzer
Provides information about the GPU behavior at source level for HMPP applications
• Main features
  o Analyzes execution profiles made on the GPU by target-specific profilers
  o Computes synthetic metrics from raw profile data
    • Memory throughput, grid size, computation density…
  o Provides detailed statistics from raw profile data
• Benefits
  o Avoids using target-specific profiling tools for HMPP applications
    • Nvidia Insight profiles CUDA code!
  o Get precise and specific information about the loop nest behavior
  o Explore and exploit the GPU power at best from the source level
  o Fine-tune kernels using HMPPCG directives
Understand how well kernels are performing
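As an illustration of what such synthetic metrics look like when derived from raw counters (a sketch, not the actual HMPP Performance Analyzer code; the structure and field names are assumptions):

```c
/* Illustrative raw profile counters for one kernel execution. */
typedef struct {
    double bytes_moved;   /* bytes read + written by the kernel */
    double flops;         /* floating-point operations executed */
    double exec_seconds;  /* measured kernel execution time     */
} kernel_profile;

/* Memory throughput in GB/s, one of the synthetic metrics named above. */
double mem_throughput_gbs(const kernel_profile *p)
{
    return p->bytes_moved / p->exec_seconds / 1e9;
}

/* Computation density in flops per byte: low values indicate a
 * memory-bound kernel, one of the warning cases the tools report. */
double computation_density(const kernel_profile *p)
{
    return p->flops / p->bytes_moved;
}
```

Comparing the measured throughput against the hardware peak is what turns raw profile data into tuning guidance.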
HMPP Performance Analyzer: visual
• Get precise and specific information about the kernel behavior
• Explore and exploit the GPU power at best from the C source level

HMPP Performance Analyzer: example
• Synthetic kernel metrics
• Global view of the execution time distribution
• Detailed statistics of the raw profiling data
An ideal package for a proven methodology
• A multi-level tool suite for manycore application definition, porting and optimization
• A GPU-focused porting methodology and a state-of-the-art product ecosystem
• A pool of resources to guide you through your porting issues
PACKAGES
• For developers' workstations: DevDeck DEVELOPER (node-locked license)
  o HMPP Hybrid Compiler
  o Development tools: HMPP Wizard, HMPP Performance Analyzer
  o Customized services: expertise, tools recommendation, training…
• For computing centers & large companies: DevDeck ENTERPRISE (floating license)
  o HMPP Hybrid Compiler
  o Development tools: HMPP Wizard, HMPP Performance Analyzer
  o Allinea DDT
  o Scientific libraries
• For constructors / integrators: DevDeck OEM (OEM license)
  o HMPP Hybrid Compiler
  o Development tools: HMPP Wizard, HMPP Performance Analyzer
+ Access to web resources according to the package: HMPP Workbench, Wizard, Performance Analyzer, Allinea DDT, libraries, methodology supports, cook book, tutorials, use cases…
Conclusion
• Application diagnosis by CAPS: Go / No-Go analysis
• Incremental approach: provided by HMPP
• Don't redevelop what already exists: use GPU libraries/functions from HMPP
• Understand the data transfers in detail: use tools like Paraver, Vampir, TAU
• Good understanding of the HW: HMPP Performance Analyser
• Lots of experimentation at a lower cost: HMPPCG directives
Cost-effective migration requirements

Gravimetry application
Finite volumes method
• Effort
  o 1 man-month
• Nehalem X5560 improvement
  o x5.15
• Main porting operation
  o Reduce kernel memory accesses, use GPU shared memory, and optimize data movement
The finite volumes method (left) is more accurate than the analytic solution (right), which overestimates the central peak.
Heat transfer simulation
Monte Carlo simulation for thermal radiation
• Effort
  o 1.5 man-months
• Size
  o ~1 kLoC of C code (DP)
• GPU C2050 improvement
  o x6 over 8 Nehalem cores
• Full hybrid version
  o x23 with 8 nodes over 8 Nehalem cores
• Main porting operation
  o Adding a few HMPP directives
• Main difficulty
  o None
Numerical Weather Prediction
A global cloud-resolving model
• Effort
  o 1 man-month (part of the code already ported)
• GPU C1060 improvement
  o x11 over a Nehalem core
  o x19 over a Harpertown core
• Main porting operation
  o Reduction of CPU-GPU transfers
• Main difficulty
  o GPU memory size is the limiting factor
Computer vision & Medical imaging
MultiView Stereo
• Resource spent
  o 1 man-month
• Size
  o ~1 kLoC of C99 (DP)
• CPU improvement
  o x4.86
• GPU C2050 improvement
  o x120 over serial code on Nehalem
• Main porting operation
  o Rethinking the algorithm
Image processing
Edge detection algorithm
• Sobel filter benchmark
• Size
  o ~200 lines of C code
• GPU C2070 improvement
  o x25.8 over serial code on Nehalem
• Main porting operation
  o Use of basic HMPP directives
Biosciences, phylogenetics
Phylip, DNA distance
• In association with the HMPP Center of Excellence for APAC
• Computes a matrix of distances between DNA sequences
• Resource spent
  o A first CUDA version developed by Shanghai Jiao Tong University, HPC Lab
  o 1 man-month
• Size
  o 8700 lines of C code, one main kernel (99% of the execution time)
• GPU C2070 improvement
  o x44 over serial code on Nehalem
• Main porting operation
  o Leverage kernel parallelism and data transfer coalescing
  o Conversion from double-precision to single-precision computation
OpenHMPP – http://www.openhmpp.org –
CAPS entreprise – http://www.caps-entreprise.com –