-
Innovation Intelligence®
Speedup Altair RADIOSS Solvers
Using NVIDIA GPU
Eric LEQUINIOU, HPC Director
Hongwei Zhou, Senior Software Developer
May 16, 2012
-
ALTAIR OVERVIEW
-
Copyright © 2012 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Altair’s Vision
Simulation, predictive analytics and optimization
leveraging high performance computing for engineering
and business decision making
-
25+ Years of Innovation
40+ Offices in 16 Countries
1500+ Employees Worldwide
Altair Engineering
-
Altair’s Brands and Companies
Solid State Lighting
Products
Engineering
Simulation Platform
On-demand Cloud
Computing Technology
Product Innovation
Consulting
Business Intelligence &
Data Analytics Solutions
Industrial Design
Technology
-
RADIOSS … AcuSolve … MotionSolve
Statics
NVH
Non-Linear (Implicit)
Non-Linear (Explicit)
Thermal and CFD
Optimization – OptiStruct and HyperStudy
Data and Process Management
Multi-Body Dynamics
Partner Alliance Solutions
1-D Systems, Fatigue, Ergonomics, Industrial
Design, Injection Molding, Noise and Vibration (NVH),
Composite Materials, Electromagnetics
Pre-Post
HyperWorks for Analysis and Optimization
-
Simulate Real-life Models with HyperWorks Solvers
-
AcuSolve Already Benefits from GPU Acceleration
• High performance computational fluid dynamics software (CFD)
• The leading finite element based CFD technology
• High accuracy and scalability on massively parallel architectures
[Chart: AcuSolve elapsed time, lower is better. S-duct benchmark, 80K degrees of freedom; hybrid MPI/OpenMP for the multi-GPU test. Xeon Nehalem 4-core CPU: 549 s. 1 core + 1 Tesla GPU: 279 s (2X vs. 4-core CPU). 2 cores + 2 GPUs: 165 s (3.3X vs. 4-core CPU).]
-
RADIOSS PORTING ON GPU
-
Motivations to use GPU
[Chart: peak GFLOPS (double precision), cost per GFLOPS in $, and GFLOPS per watt for four configurations: 1 node Intel X5670, 2 nodes Intel X5670, 1 node + 1 Nvidia M2090, 1 node + 2 Nvidia M2090. Peak DP ranges from 140 GFLOPS (1 CPU node) to 1450 GFLOPS (1 node + 2 M2090); the GPU configurations improve both cost per GFLOPS and GFLOPS per watt (0.74 CPU-only vs. 1.92 with GPUs).]
• Cost-effective solution
• Power-efficient solution
How much of the peak can we get?
– Which part of the code is best suited?
– What coding effort is required?
– What is the speedup for my application?
-
RADIOSS Porting on Nvidia GPU
• Assess the potential of GPU for RADIOSS
• Focus on Implicit
  – Direct solver
    · Highly optimized, compute-intensive solver
    · Limited scalability on multicores and clusters
  – Iterative solver
    · Solves huge problems with low memory requirements
    · Efficient parallelization
    · High cost on CPU due to the large number of iterations needed for convergence
• Double precision required
• Integrate GPU parallelization into our hybrid parallelization technology
-
RADIOSS DIRECT SOLVER PORTING
-
Multifrontal Sparse Direct Solver
do j = 1, N
  assemble(A(j))
  factor(j)
  update(j)
end

Non-pivoting: Cholesky. Pivoting: LDLT.
-
Concerns with GPU Acceleration
• CUBLAS (DGEMM): a perfect candidate to speed up the "update" module
  – The frontal matrix could be too large to fit in GPU memory
  – The frontal matrix could be too small, making the GPU inefficient
  – Data transfer is not trivial
  – Only the lower triangular part of the matrix is needed
• Pivoting is required in real applications
  – Pivot searching is a sequential operation
  – The factor module has limited parallel potential
  – In extreme cases it can be as expensive as the "update" module
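The "too large to fit in GPU memory" concern is commonly handled by streaming the update in column panels, so each DGEMM-sized piece can be staged through device memory, and a panel loop also makes it easy to skip the upper triangle. A CPU-only sketch of that idea, with an illustrative panel width `nb` (not the RADIOSS strategy):

```c
#include <assert.h>

/* Apply C -= L21 * L21^T to the m x m trailing block C (row-major),
   lower triangle only, processed in column panels of width nb. On a
   GPU each panel would be one staged DGEMM; here it is plain loops. */
void panel_update(const double *L21, double *C, int m, int k, int nb)
{
    for (int j0 = 0; j0 < m; j0 += nb) {          /* one panel at a time */
        int jend = (j0 + nb < m) ? j0 + nb : m;
        for (int j = j0; j < jend; j++)
            for (int i = j; i < m; i++)           /* lower triangle only */
                for (int p = 0; p < k; p++)
                    C[i*m + j] -= L21[i*k + p] * L21[j*k + p];
    }
}
```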
-
CUBLAS Speeds up the "Update" Module
• Improve the profile of the "update" module
  – Base: BCSLIB 4.3
• Asynchronous computing
  – Overlap the computation
  – Overlap the communication
[Figure: frontal matrix T = LDL^T partitioned into blocks S1, S2, S3.]
-
Numerical Test : Non-Pivoting Case
[Chart: GPU speedup, non-pivoting case. Solver elapsed time (s), 4 CPU vs. 4 CPU + 1 GPU, lower is better, with speedups of 2.9X and 3.9X; profile (%) of "update" and total, base (BCSLIB-EXT) vs. improved.]
Linear static, non-pivoting case
Benchmark: 2.8 million degrees of freedom
Platform: Intel Xeon X5550, 4 cores, 48 GB RAM; Nvidia C2070, CUDA 4.0; MKL 10.3; RHEL 5.1
-
Numerical Test: Pivoting Case
[Chart: GPU speedup, pivoting case. Solver elapsed time (s), 4 CPU vs. 4 CPU + 1 GPU, lower is better, with speedups of 2.5X and 2.8X; profile (%) of "update" and total, base (BCSLIB-EXT) vs. improved.]
Nonlinear static, pivoting case; customer model: engine block
Benchmark: 2.5 million degrees of freedom
Platform: Intel Xeon X5550, 4 cores, 48 GB RAM; Nvidia C2070, CUDA 4.0; MKL 10.3; RHEL 5.1
-
Challenging Work
• On some models, the "update" module is not so dominant
  – E.g. an engine block meshed with 1st-order elements
• In-core / "quasi" in-core is preferred
  – A memory threshold for reasonably good speedup will be provided to the user
  – Make sure the essential computation stays in core
[Chart: solver elapsed time (s), 4 CPU vs. 4 CPU + 1 GPU, lower is better, with speedups of 1.8X and 2.1X; profile (%) of "update" and total, base (BCSLIB-EXT) vs. improved. 1.5 million DOFs with pivoting.]
-
Summary & Perspectives
• GPU and CUDA significantly enhance the performance of the direct solver
  – CUBLAS on the Fermi card is very fast
  – Asynchronous computing is the key component
  – Improving the profile is a necessary step (Amdahl's law)
• Application areas
  – Nonlinear analysis: robust and accurate solution
  – Optimization: matrix factorization reused in sensitivity analysis
• Future work
  – Other dynamic solvers
  – Block Lanczos eigenvalue solver
  – Altair AMSES (automated multilevel substructure eigenvalue solver)
-
RADIOSS ITERATIVE SOLVER PORTING
-
Preconditioned Conjugate Gradient Method
• Problem
  – Solve the linear equation Ax – b = 0
  – with A symmetric positive-definite
• Solution
  – Preconditioned Conjugate Gradient (PCG) solves M^-1 · (Ax – b) = 0 iteratively
  – The method converges quickly
  – Efficiency depends on M
• Other advantages
  – Low memory consumption
  – Efficient parallelization: SMP, MPI
-
PCG Porting on GPU using Cuda
r_0 = b – A·x_0
z_0 = M^-1 · r_0
p_0 = z_0
k = 0
DO WHILE NOT DONE
  alpha_k = (r_k^T · z_k) / (p_k^T · A · p_k)
  x_k+1 = x_k + alpha_k · p_k
  r_k+1 = r_k – alpha_k · A · p_k
  IF (||r_k+1|| < precision) THEN
    DONE = TRUE
  END IF
  z_k+1 = M^-1 · r_k+1
  beta_k = (z_k+1^T · r_k+1) / (z_k^T · r_k)
  p_k+1 = z_k+1 + beta_k · p_k
  k = k + 1
END DO
result = x_k+1

• Pretty simple original Fortran code
• Few kernels to write in CUDA
  – Sparse matrix-vector product
  – Focus optimization effort on this kernel
• Use of CUBLAS
  – DAXPY
  – DDOT
-
Hybrid MPP Computing with CUDA and MPI
RADIOSS 11 Hybrid MPP Speedup
• Hybrid MPP version of RADIOSS
  – 2 parallelization levels
    · MPI parallelization based on domain decomposition
    · Multi-threaded MPI processes under OpenMP
  – Enhanced performance
    · Higher scalability
    · Better efficiency
• Extend Hybrid to multi-GPU programming
  – Integrate GPU parallelization into our hybrid programming model
  – MPI manages communication between GPUs
  – Portions of code that are not GPU-enabled benefit from OpenMP parallelization
  – Programming model extendable to multiple nodes with multiple GPUs
-
PCG Multi GPUs Parallelization
• Domain decomposition to split the original problem
  – Optimal load balancing
• MPI communications between domains
  – Controlled by the CPU
  – One GPU associated with one MPI process
• OpenMP multithreading
  – Speeds up calculations not performed on the GPU
  – Leverages the available CPU cores
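A CPU-only sketch of the decomposition idea for one PCG dot product: each domain (one per MPI process, hence one per GPU in this model) owns a contiguous row block and computes a partial sum, and the partials are then reduced, which is an MPI_Allreduce in the real code. The even row split and the pragma are illustrative; OpenMP supplies the second parallelization level of the hybrid scheme.

```c
#include <assert.h>

/* Domain-decomposed dot product: each domain owns a contiguous row
   block (even split as an illustrative load-balancing choice), and
   the per-domain partials are reduced at the end. Assumes no more
   than 64 domains for the fixed-size partial array. */
double decomposed_ddot(int ndomains, int n, const double *a, const double *b)
{
    double partial[64];                 /* one slot per domain */
    int chunk = (n + ndomains - 1) / ndomains;
    #pragma omp parallel for            /* OpenMP level of the hybrid model */
    for (int d = 0; d < ndomains; d++) {
        int lo = d * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        double s = 0.0;
        for (int i = lo; i < hi; i++) s += a[i] * b[i];
        partial[d] = s;                 /* each domain writes its own slot */
    }
    double sum = 0.0;                   /* stands in for MPI_Allreduce */
    for (int d = 0; d < ndomains; d++) sum += partial[d];
    return sum;
}
```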
-
Benchmark #1
Linear Problem #1
Hood of a car with pressure loads
Compute displacements and stresses
Benchmark: 0.9 million degrees of freedom
  24 million non-zeros
  140,000 shells + 13,000 solids + 1,100 RBE3
  4,300 iterations
Platform: Nvidia PSG cluster, 2 nodes, each with:
  Dual Nvidia M2090 GPUs
  CUDA v3.2
  Intel Westmere 2x6 X5670 @ 2.93 GHz
  Linux RHEL 5.4 with Intel MPI 4.0
Performance: elapsed time decreased by up to 9X
[Chart: elapsed time (s), lower is better; gains vs. SMP 6-core]
1 node, 2 Nvidia M2090:
  SMP 6-core: 351 s
  Hybrid 2 MPI x 6 SMP: 185 s
  SMP 6 + 1 GPU: 84 s (4.2X)
  Hybrid 2 MPI x 6 SMP + 2 GPUs: 53 s (6.5X)
2 nodes, 4 Nvidia M2090:
  Hybrid 4 MPI x 6 SMP: 104 s
  Hybrid 4 MPI x 6 SMP + 4 GPUs: 38 s (9.2X)
-
Benchmark #2
Linear Problem #2
Hood of a car with pressure loads
Refined model
Compute displacements and stresses
Benchmark: 2.2 million degrees of freedom
  62 million non-zeros
  380,000 shells + 13,000 solids + 1,100 RBE3
  5,300 iterations
Platform: Nvidia PSG cluster, 2 nodes, each with:
  Dual Nvidia M2090 GPUs
  CUDA v3.2
  Intel Westmere 2x6 X5670 @ 2.93 GHz
  Linux RHEL 5.4 with Intel MPI 4.0
Performance: elapsed time decreased by up to 13X
[Chart: elapsed time (s), lower is better; gains vs. SMP 6-core]
1 node, 2 Nvidia M2090:
  SMP 6-core: 1106 s
  Hybrid 2 MPI x 6 SMP: 572 s
  SMP 6 + 1 GPU: 254 s (4.3X)
  Hybrid 2 MPI x 6 SMP + 2 GPUs: 143 s (7.5X)
2 nodes, 4 Nvidia M2090:
  Hybrid 4 MPI x 6 SMP: 306 s
  Hybrid 4 MPI x 6 SMP + 4 GPUs: 85 s (13X)
-
Benchmark #3
Gravity Problem with contacts
Full car model
Compute displacements and stresses
Benchmark: 0.85 million degrees of freedom
  21 million non-zeros
  138,000 shells + 11,000 solids + 5,700 1-D elements + 230 rigid bodies + 6 rigid walls
  1 gravity load and 22 contact interfaces
  ~9,000 iterations
Platform: Nvidia PSG cluster, 2 nodes, each with:
  Dual Nvidia M2090 GPUs
  CUDA v3.2
  Intel Westmere 2x6 X5670 @ 2.93 GHz
  Linux RHEL 5.4 with Intel MPI 4.0
Performance: elapsed time decreased by up to 9X
[Chart: elapsed time (s), lower is better; gains vs. SMP 6-core]
1 node, 2 Nvidia M2090:
  SMP 6-core: 1430 s
  Hybrid 2 MPI x 6 SMP: 1223 s
  SMP 6 + 1 GPU: 443 s (3.2X)
  Hybrid 2 MPI x 6 SMP + 2 GPUs: 225 s (6.4X)
2 nodes, 4 Nvidia M2090:
  Hybrid 4 MPI x 6 SMP: 531 s
  Hybrid 4 MPI x 6 SMP + 4 GPUs: 163 s (8.8X)
-
CONCLUSION
-
Conclusion
• RADIOSS implicit direct and iterative solvers have been successfully ported to Nvidia GPUs using CUDA
• Adding a GPU significantly improves the performance of these solvers
• For the iterative solver, Hybrid MPP enables runs on multi-GPU workstations and GPU clusters with good scalability and enhanced performance
• GPU support for the implicit solvers is planned for HyperWorks 12
A big thanks to the Nvidia team for their great support: Stan Posey, Steven Rennich, Thomas Reed, Peng Wang!