Manycore Application Migration:
Methodology and Tools
Moscow State University, Laurent Morin, 29 August 2011
Methodology to Port Applications
Phase 1 - Define your parallel project (Hours to Days)
•Performance goal
•Validation process
•Hotspots selection
•Continuous integration

Phase 2 - Port your application on GPU (Days to Weeks): Hotspots, Parallelization
•Exhibit parallelism
•Move kernels to GPU
Result: a GPGPU operational application with known potential

Optimize your GPGPU application (Weeks to Months): Tuning
•Exploit CPU and GPU
•Optimize kernel execution
•Reduce CPU-GPU data transfers
•Optimize CPU code
Take a decision
• Go
  o Dense hotspot
  o Fast GPU kernels
  o Low CPU-GPU data transfers
  o Midterm perspective: prepare for manycore parallelism
• No Go
  o Flat profile
  o GPU results not accurate enough, i.e. the computation cannot be validated
  o Slow GPU kernels, i.e. no speedup to be expected
  o Complex data structures
  o Big memory space requirement
Validate your Computation
• Parallel execution on manycore architectures will (probably) change floating-point computation results…
  o Parallel programming changes the computation order in an extensive way
  o Hardware floating-point units are not always IEEE 754 compliant
  o The convergence speed of iterative algorithms can be impacted
  o …
• … but that does not imply that the computation is wrong!
  o A floating-point computation is always an approximation
• The validation must be provided by the application providers
  o It is the mathematicians' or physicists' responsibility
  o They may or may not require binary-equal results
  o If not, a validation protocol must be provided
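Such a validation protocol often reduces to comparing the GPU result against the CPU reference within a relative tolerance, rather than requiring binary-equal results. A minimal sketch (the function name and tolerance threshold are illustrative, not from the slides):

```c
#include <math.h>
#include <stddef.h>

/* Illustrative validation helper: check that a GPU result matches the
 * CPU reference within a relative tolerance, instead of requiring
 * binary-equal results. Returns 1 on success, 0 on failure. */
int validate_rel(const float *ref, const float *gpu, size_t n, float tol)
{
    for (size_t i = 0; i < n; i++) {
        /* Guard against division by zero for near-zero reference values. */
        float denom = fabsf(ref[i]) > 1e-30f ? fabsf(ref[i]) : 1e-30f;
        if (fabsf(ref[i] - gpu[i]) / denom > tol)
            return 0;
    }
    return 1;
}
```

The acceptable tolerance is exactly what the mathematicians or physicists must provide when binary-equal results are not required.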
Tools to improve the methodology productivity
Tools in the Methodology
• Define your parallel project (Hours to Days):
  o Profiling tools: Gprof, KCachegrind, Vampir, Acumem ThreadSpotter
  o HMPP Wizard
• Port your application on GPU (Days to Weeks):
  o HMPP Workbench
  o Debugging tools: gdb, Allinea DDT, TotalView
• Optimize your GPGPU application (Weeks to Months):
  o Parallel profiling tools: Paraver, TAU, Vampir
  o HMPP Performance Analyzer
  o Target-specific tools: Nvidia Insight
Focus on Hotspots (Gprof example)
Profiling:
• Profile your CPU application
• Select hotspots
• Build a coherent kernel set
• Estimate the achievable speedup using Amdahl's law:

  Speedup = 1 / ((1 - P) + P/S)

  where P is the fraction of execution time spent in the parallelized hotspots and S is the speedup obtained on them.
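Amdahl's law is easy to evaluate when selecting hotspots; the helper below (illustrative, not part of the HMPP toolchain) plugs profile numbers into the formula:

```c
/* Amdahl's law: overall speedup when a fraction P of the execution
 * time is accelerated by a factor S (illustrative helper). */
double amdahl_speedup(double P, double S)
{
    return 1.0 / ((1.0 - P) + P / S);
}
```

For example, a hotspot covering 90% of the runtime and accelerated 10x on the GPU yields only about a 5.3x overall speedup, which is why a dense hotspot is a "Go" criterion.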
Move Computation to GPU:
HMPP Workbench
What is HMPP? (Hybrid Manycore Parallel Programming)
• A directive-based programming model
  o Provides an incremental path to exploit GPUs in legacy applications
  o Complements existing parallel APIs (OpenMP or MPI)
  o Avoids exit costs: a future-proof solution
  o Based on an open standard
• An integrated compilation chain
  o Code generators from C and Fortran to GPU (CUDA or OpenCL)
  o Source-to-source compiler: uses the hardware vendor's SDK
  o One single compiler driver interfacing the GPU compilers
• A runtime support for heterogeneous architectures
  o Manages the hardware resources and their availability
  o Handles transfers to and synchronizations with the hardware
  o Fast and easy application deployment on new hardware
  o Free of use
Port your application with HMPP Workbench
HMPP usage in three steps
1. A set of directives to program hardware accelerators
   • Drive your HWAs, manage transfers
   • Optimize kernels at source level
2. A complete tool-chain to build manycore applications
   • Build your hybrid application
3. A runtime to adapt to the platform configuration
HMPP Directives: drive Hybrid Applications
[Diagram: directives drive the HW-specific code generation of codelets; the HMPP Runtime manages data transfers and codelet execution on the HWA]
int main(int argc, char **argv) {
  // ...
  #pragma hmpp compute advancedload, args[M;N]
  for( j = 0 ; j < 2 ; j++ ) {
    #pragma hmpp compute callsite
    compute( size, size, alpha, a, b );
  }
  // ...
}

#pragma hmpp compute codelet, target=CUDA, args[a].io=inout
void compute( int M, int N, float alpha, float a[M][N], float b[M][N] )
{
  #pragma hmppcg grid blocksize 16 X 16
  #pragma hmppcg unroll 2, jam
  for (i = 0; i < M; i += 1) {
    #pragma hmppcg unroll 4, guarded
    for (j = 0; j < N; j += 1) {
      a[i][j] = cos(a[i][j]) + cos(b[i][j]) - alpha;
    }
  }
}
#pragma hmpp sgemm codelet, target=CUDA, args[vout].io=inout
extern void sgemm( int m, int n, int k, float alpha,
                   const float vin1[n][n], const float vin2[n][n],
                   float beta, float vout[n][n] );

int main(int argc, char **argv) {
  // ...
  for( j = 0 ; j < 2 ; j++ ) {
    #pragma hmpp sgemm callsite
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
  }
  // ...
}
HMPP Basic Programming
One directive to set up a GPU codelet…
… and another one inserted in the application to call it
HMPP BLAS Performance on NVIDIA Fermi
[Chart: SGEMM performance in Gflop/s (0 to 700) as a function of matrix size (64 to 8000), comparing MKL, HMPP, CUBLAS, and MAGMA.
MKL: 2 x Intel(R) Xeon(R) X5560 @ 2.80GHz (8 cores). HMPP, CUBLAS, MAGMA: NVidia Tesla C2050, ECC activated.]
Build a Parallel Project: HMPP Wizard
A diagnosis tool to make the code "GPU friendly"
• Features:
  o Generates GPU porting advice for C/Fortran source code
    • Unsupported constructs (pointers, goto labels, …)
    • Loop structure and parallelism validation
    • Diagnoses the translation of the loop nest into a GPU grid of threads
  o Estimates kernel behavior on a GPU and provides specific advice:
    • Reduction inside a kernel
    • Bad memory coalescing
    • Inefficient memory coalescing
    • Too low computation density
    • Detection of well-known patterns such as convolutions
  o Proposes basic GPU tuning directives
• Core technology: based on static analysis
  o Analyzes memory access patterns on the GPU
  o Takes into account the full HMPP code generation process
  o Evaluates HMPPCG optimization directives
Make your kernels more GPU friendly
HMPP Wizard: visual
• Analyze your memory access pattern
• Use HMPPCG directives to make your kernel GPU friendly

An advice: is the loop GPU friendly?
  o Example: memory access pattern over the Z dimension (3D)
    • The GPU grid is 2D → the coalescing must be on the X dimension
  o A permute pragma is proposed to fix the problem
Tune your application with the HMPP Wizard
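The coalescing advice can be sketched in plain C (array shape, names, and the directive placement are assumptions; with a regular compiler both nests simply run on the CPU):

```c
#define NX 8
#define NY 8
#define NZ 8

/* Illustrative sketch of the Wizard's advice: with the loop over the
 * contiguous X dimension outermost, consecutive inner iterations stride
 * through memory, which maps to poor coalescing on a 2D GPU grid. */
void scale_z_inner(float a[NZ][NY][NX], float s)
{
    for (int x = 0; x < NX; x++)
        for (int y = 0; y < NY; y++)
            for (int z = 0; z < NZ; z++)   /* strided: z varies slowest in memory */
                a[z][y][x] *= s;
}

/* After permuting the nest (what a proposed permute pragma would do),
 * x, the contiguous dimension, is innermost and accesses coalesce. */
void scale_x_inner(float a[NZ][NY][NX], float s)
{
    for (int z = 0; z < NZ; z++)
        for (int y = 0; y < NY; y++)
            for (int x = 0; x < NX; x++)   /* unit stride over x */
                a[z][y][x] *= s;
}
```

Both versions compute the same result; only the memory access order, and hence GPU coalescing, differs.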
Debug your hybrid & parallel application: Allinea DDT
www.allinea.com

Allinea Software
• Software tools company since 2001
  – Allinea DDT: the scalable parallel debugger
  – Allinea OPT: the optimization tool for MPI and non-MPI codes
  – Users at all scales: from 1 to 100,000 cores and above
  – World's only petascale debugger!
• Simplifying the challenge of multi- and many-core development
  – Bugs at scale need a debugger at scale
    • … until recently, debuggers were limited to ~4,000-8,000 cores
  – Bugs on GPUs need a debugger for GPUs
    • … until recently, GPU software couldn't be debugged

DDT for CAPS HMPP
• Debugging inside C kernels (codelets) on the GPU
• F90 multi-dimensional array support in progress for the GPU
• Auto-reporting of CAPS runtime errors to the DDT GUI
• Also able to debug codelets running on the CPU
Optimize CPU-GPU integration: PARAVER
Analyze the CPU-GPU Transfers
• HMPP + PARAVER
• Get a precise view of the behavior of HMPP elements
• Transfer time, execution time…
Optimize GPU Performance: HMPP Performance Analyzer
Provides information about the GPU behavior at source level for HMPP applications
• Main features
  o Analyzes execution profiles made on the GPU by target-specific profilers
  o Computes synthetic metrics from raw profile data
    • Memory throughput, grid size, computation density…
  o Provides detailed statistics from raw profile data
• Benefits
  o Avoids using target-specific profiling tools for HMPP applications
    • Nvidia Insight profiles CUDA code!
  o Get precise and specific information about the loop nest behavior
  o Explore and exploit the GPU power at best from the source level
  o Fine-tune kernels using HMPPCG directives
Understand how well kernels are performing
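As an illustration of what such synthetic metrics look like when derived from raw counters (a sketch, not the actual HMPP Performance Analyzer code; the structure and field names are assumptions):

```c
/* Illustrative raw profile counters for one kernel execution. */
typedef struct {
    double bytes_moved;   /* bytes read + written by the kernel */
    double flops;         /* floating-point operations executed */
    double exec_seconds;  /* measured kernel execution time     */
} kernel_profile;

/* Memory throughput in GB/s, one of the synthetic metrics named above. */
double mem_throughput_gbs(const kernel_profile *p)
{
    return p->bytes_moved / p->exec_seconds / 1e9;
}

/* Computation density in flops per byte: low values indicate a
 * memory-bound kernel, one of the warning cases the tools report. */
double computation_density(const kernel_profile *p)
{
    return p->flops / p->bytes_moved;
}
```

Comparing the measured throughput against the hardware peak is what turns raw profile data into tuning guidance.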
HMPP Performance Analyzer: visual
• Get precise and specific information about the kernel behavior
• Explore and exploit the GPU power at best from the C source level

HMPP Performance Analyzer: example
• Synthetic kernel metrics
• Global view of the execution time distribution
• Detailed statistics of the raw profiling data
An ideal package for a proven methodology
• A multi-level tool suite for manycore application definition, porting and optimization
• A GPU-focused porting methodology and a state-of-the-art product ecosystem
• A pool of resources to guide you through your porting issues
PACKAGES
• For developers' workstations: DevDeck DEVELOPER (node-locked license)
  o HMPP Hybrid Compiler
  o Development tools: HMPP Wizard, HMPP Performance Analyzer
  o Customized services: expertise, tools recommendation, training…
• For computing centers & large companies: DevDeck ENTERPRISE (floating license)
  o HMPP Hybrid Compiler
  o Development tools: HMPP Wizard, HMPP Performance Analyzer
  o Allinea DDT
  o Scientific libraries
• For constructors / integrators: DevDeck OEM (OEM license)
  o HMPP Hybrid Compiler
  o Development tools: HMPP Wizard, HMPP Performance Analyzer
+ Access to web resources according to the package: HMPP Workbench, Wizard, Performance Analyzer, Allinea DDT, libraries, methodology supports, cook book, tutorials, use cases…
Conclusion
• Application diagnosis by CAPS: Go / No-Go analysis
• Incremental approach: provided by HMPP
• Don't redevelop what already exists: use GPU libraries/functions from HMPP
• Understand the data transfers in detail: use tools like Paraver, Vampir, TAU
• Good understanding of the HW: HMPP Performance Analyser
• Lots of experimentation at a lower cost: HMPPCG directives
Cost-effective migration requirements

Gravimetry application
Finite volumes method
• Effort
  o 1 man-month
• Nehalem X5560 improvement
  o x5.15
• Main porting operation
  o Reduce kernel memory accesses, use GPU shared memory, and optimize data movement
The finite volumes method (left) is more accurate than the analytic solution (right), which overestimates the central peak.
Heat transfer simulation
Monte Carlo simulation for thermal radiation
• Effort
  o 1.5 man-months
• Size
  o ~1 kLoC of C code (DP)
• GPU C2050 improvement
  o x6 over 8 Nehalem cores
• Full hybrid version
  o x23 with 8 nodes over 8 Nehalem cores
• Main porting operation
  o Adding a few HMPP directives
• Main difficulty
  o None
Numerical Weather Prediction
A global cloud-resolving model
• Effort
  o 1 man-month (part of the code already ported)
• GPU C1060 improvement
  o x11 over a Nehalem core
  o x19 over a Harpertown core
• Main porting operation
  o Reduction of CPU-GPU transfers
• Main difficulty
  o GPU memory size is the limiting factor
Computer vision & Medical imaging
MultiView Stereo
• Resource spent
  o 1 man-month
• Size
  o ~1 kLoC of C99 (DP)
• CPU improvement
  o x4.86
• GPU C2050 improvement
  o x120 over serial code on Nehalem
• Main porting operation
  o Rethinking the algorithm
Image processing
Edge detection algorithm
• Sobel filter benchmark
• Size
  o ~200 lines of C code
• GPU C2070 improvement
  o x25.8 over serial code on Nehalem
• Main porting operation
  o Use of basic HMPP directives
Biosciences, phylogenetics
Phylip, DNA distance
• In association with the HMPP Center of Excellence for APAC
• Computes a matrix of distances between DNA sequences
• Resource spent
  o A first CUDA version developed by Shanghai Jiao Tong University, HPC Lab
  o 1 man-month
• Size
  o 8700 lines of C code, one main kernel (99% of the execution time)
• GPU C2070 improvement
  o x44 over serial code on Nehalem
• Main porting operation
  o Leverage kernel parallelism and data transfer coalescing
  o Conversion from double-precision to single-precision computation
OpenHMPP – http://www.openhmpp.org –
CAPS entreprise – http://www.caps-entreprise.com –