-
Innovation Intelligence®
Speedup Altair RADIOSS Solvers
Using NVIDIA GPU
Eric LEQUINIOU, HPC Director
Hongwei Zhou, Senior Software Developer
May 16, 2012
-
ALTAIR OVERVIEW
-
Copyright © 2012 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Altair’s Vision
Simulation, predictive analytics and optimization
leveraging high performance computing for engineering
and business decision making
-
25+ Years of Innovation
40+ Offices in 16 Countries
1500+ Employees Worldwide
Altair Engineering
-
Altair’s Brands and Companies
Solid State Lighting
Products
Engineering
Simulation Platform
On-demand Cloud
Computing Technology
Product Innovation
Consulting
Business Intelligence &
Data Analytics Solutions
Industrial Design
Technology
-
RADIOSS … AcuSolve … MotionSolve
Statics
NVH
Non-Linear (Implicit)
Non-Linear (Explicit)
Thermal and CFD
Optimization – OptiStruct and HyperStudy
Data and Process Management
Multi-Body Dynamics
Partner Alliance Solutions
1-D Systems, Fatigue, Ergonomics, Industrial
Design, Injection Molding, Noise and Vibration (NVH),
Composite Materials, Electromagnetics
Pre-Post
HyperWorks for Analysis and Optimization
-
Simulate Real-life Models with HyperWorks Solvers
-
AcuSolve Already Benefits from GPU Acceleration
• High performance computational fluid dynamics software (CFD)
• The leading finite element based CFD technology
• High accuracy and scalability on massively parallel architectures
[Chart: AcuSolve elapsed time, lower is better. S-duct benchmark, 80K degrees of freedom; hybrid MPI/OpenMP for the multi-GPU test. Xeon Nehalem 4-core CPU: 549 s. 1 core + 1 Tesla GPU: 279 s (2X vs. 4-core CPU). 2 cores + 2 GPUs: 165 s (3.3X vs. 4-core CPU).]
-
RADIOSS PORTING ON GPU
-
Motivations to use GPU
[Chart: peak GFLOPS (double precision), cost per GFLOPS in $, and GFLOPS per watt for four configurations: 1 node Intel X5670, 2 nodes Intel X5670, 1 node + 1 Nvidia M2090, 1 node + 2 Nvidia M2090. Peak DP ranges from 140 GFLOPS (1 CPU node) to 1450 GFLOPS (1 node + 2 M2090); the GPU configurations improve both cost per GFLOPS and GFLOPS per watt (0.74 CPU-only vs. 1.92 with GPUs).]
• Cost-effective solution
• Power-efficient solution
How much of the peak can we get?
– Which part of the code is best suited?
– What coding effort is required?
– What is the speedup for my application?
-
RADIOSS Porting on Nvidia GPU
• Assess the potential of GPU for RADIOSS
• Focus on Implicit
  – Direct solver
    · Highly optimized, compute-intensive solver
    · Limited scalability on multicores and clusters
  – Iterative solver
    · Solves huge problems with low memory requirements
    · Efficient parallelization
    · High cost on CPU due to the large number of iterations needed for convergence
• Double precision required
• Integrate GPU parallelization into our hybrid parallelization technology
-
RADIOSS DIRECT SOLVER PORTING
-
Multifrontal Sparse Direct Solver
do j = 1, N
  assemble(A(j))
  factor(j)
  update(j)
end

Non-pivoting: Cholesky. Pivoting: LDLT.
-
Concerns with GPU Acceleration
• CUBLAS (DGEMM): a perfect candidate to speed up the "update" module
  – The frontal matrix could be too large to fit in GPU memory
  – The frontal matrix could be too small, making the GPU inefficient
  – Data transfer is not trivial
  – Only the lower triangular part of the matrix is needed
• Pivoting is required in real applications
  – Pivot searching is a sequential operation
  – The factor module has limited parallel potential
  – In extreme cases it can be as expensive as the "update" module
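The "too large to fit in GPU memory" concern is commonly handled by streaming the update in column panels, so each DGEMM-sized piece can be staged through device memory, and a panel loop also makes it easy to skip the upper triangle. A CPU-only sketch of that idea, with an illustrative panel width `nb` (not the RADIOSS strategy):

```c
#include <assert.h>

/* Apply C -= L21 * L21^T to the m x m trailing block C (row-major),
   lower triangle only, processed in column panels of width nb. On a
   GPU each panel would be one staged DGEMM; here it is plain loops. */
void panel_update(const double *L21, double *C, int m, int k, int nb)
{
    for (int j0 = 0; j0 < m; j0 += nb) {          /* one panel at a time */
        int jend = (j0 + nb < m) ? j0 + nb : m;
        for (int j = j0; j < jend; j++)
            for (int i = j; i < m; i++)           /* lower triangle only */
                for (int p = 0; p < k; p++)
                    C[i*m + j] -= L21[i*k + p] * L21[j*k + p];
    }
}
```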
-
CUBLAS Speeds up the "Update" Module
• Improve the profile of the "update" module
  – Base: BCSLIB 4.3
• Asynchronous computing
  – Overlap the computation
  – Overlap the communication
[Figure: frontal matrix T = LDL^T partitioned into blocks S1, S2, S3.]
-
Numerical Test : Non-Pivoting Case
[Chart: GPU speedup, non-pivoting case. Solver elapsed time (s), 4 CPU vs. 4 CPU + 1 GPU, lower is better, with speedups of 2.9X and 3.9X; profile (%) of "update" and total, base (BCSLIB-EXT) vs. improved.]
Linear static, non-pivoting case
Benchmark: 2.8 million degrees of freedom
Platform: Intel Xeon X5550, 4 cores, 48 GB RAM; Nvidia C2070, CUDA 4.0; MKL 10.3; RHEL 5.1
-
Numerical Test: Pivoting Case
[Chart: GPU speedup, pivoting case. Solver elapsed time (s), 4 CPU vs. 4 CPU + 1 GPU, lower is better, with speedups of 2.5X and 2.8X; profile (%) of "update" and total, base (BCSLIB-EXT) vs. improved.]
Nonlinear static, pivoting case; customer model: engine block
Benchmark: 2.5 million degrees of freedom
Platform: Intel Xeon X5550, 4 cores, 48 GB RAM; Nvidia C2070, CUDA 4.0; MKL 10.3; RHEL 5.1
-
Challenging Work
• On some models, the "update" module is not so dominant
  – E.g. an engine block meshed with 1st-order elements
• In-core / "quasi" in-core is preferred
  – A memory threshold for reasonably good speedup will be provided to the user
  – Make sure the essential computation stays in core
[Chart: solver elapsed time (s), 4 CPU vs. 4 CPU + 1 GPU, lower is better, with speedups of 1.8X and 2.1X; profile (%) of "update" and total, base (BCSLIB-EXT) vs. improved. 1.5 million DOFs with pivoting.]
-
Summary & Perspectives
• GPU and CUDA significantly enhance the performance of the direct solver
  – CUBLAS on the Fermi card is very fast
  – Asynchronous computing is the key component
  – Improving the profile is a necessary step (Amdahl's law)
• Application areas
  – Nonlinear analysis: robust and accurate solution
  – Optimization: matrix factorization reused in sensitivity analysis
• Future work
  – Other dynamic solvers
  – Block Lanczos eigenvalue solver
  – Altair AMSES (automated multilevel substructure eigenvalue solver)
-
RADIOSS ITERATIVE SOLVER PORTING
-
Preconditioned Conjugate Gradient Method
• Problem
  – Solve the linear equation Ax – b = 0
  – with A symmetric positive-definite
• Solution
  – Preconditioned Conjugate Gradient (PCG) solves M^-1 · (Ax – b) = 0 iteratively
  – The method converges quickly
  – Efficiency depends on M
• Other advantages
  – Low memory consumption
  – Efficient parallelization: SMP, MPI
-
PCG Porting on GPU using Cuda
r_0 = b – A·x_0
z_0 = M^-1 · r_0
p_0 = z_0
k = 0
DO WHILE NOT DONE
  alpha_k = (r_k^T · z_k) / (p_k^T · A · p_k)
  x_k+1 = x_k + alpha_k · p_k
  r_k+1 = r_k – alpha_k · A · p_k
  IF (||r_k+1|| < precision) THEN
    DONE = TRUE
  END IF
  z_k+1 = M^-1 · r_k+1
  beta_k = (z_k+1^T · r_k+1) / (z_k^T · r_k)
  p_k+1 = z_k+1 + beta_k · p_k
  k = k + 1
END DO
result = x_k+1

• Pretty simple original Fortran code
• Few kernels to write in CUDA
  – Sparse matrix-vector product
  – Focus optimization effort on this kernel
• Use of CUBLAS
  – DAXPY
  – DDOT
-
Hybrid MPP Computing with CUDA and MPI
RADIOSS 11 Hybrid MPP Speedup
• Hybrid MPP version of RADIOSS
  – 2 parallelization levels
    · MPI parallelization based on domain decomposition
    · Multi-threaded MPI processes under OpenMP
  – Enhanced performance
    · Higher scalability
    · Better efficiency
• Extend Hybrid to multi-GPU programming
  – Integrate GPU parallelization into our hybrid programming model
  – MPI manages communication between GPUs
  – Portions of code that are not GPU-enabled benefit from OpenMP parallelization
  – Programming model extendable to multiple nodes with multiple GPUs
-
PCG Multi GPUs Parallelization
• Domain decomposition to split the original problem
  – Optimal load balancing
• MPI communications between domains
  – Controlled by the CPU
  – One GPU associated with one MPI process
• OpenMP multithreading
  – Speeds up calculations not performed on the GPU
  – Leverages the available CPU cores
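A CPU-only sketch of the decomposition idea for one PCG dot product: each domain (one per MPI process, hence one per GPU in this model) owns a contiguous row block and computes a partial sum, and the partials are then reduced, which is an MPI_Allreduce in the real code. The even row split and the pragma are illustrative; OpenMP supplies the second parallelization level of the hybrid scheme.

```c
#include <assert.h>

/* Domain-decomposed dot product: each domain owns a contiguous row
   block (even split as an illustrative load-balancing choice), and
   the per-domain partials are reduced at the end. Assumes no more
   than 64 domains for the fixed-size partial array. */
double decomposed_ddot(int ndomains, int n, const double *a, const double *b)
{
    double partial[64];                 /* one slot per domain */
    int chunk = (n + ndomains - 1) / ndomains;
    #pragma omp parallel for            /* OpenMP level of the hybrid model */
    for (int d = 0; d < ndomains; d++) {
        int lo = d * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        double s = 0.0;
        for (int i = lo; i < hi; i++) s += a[i] * b[i];
        partial[d] = s;                 /* each domain writes its own slot */
    }
    double sum = 0.0;                   /* stands in for MPI_Allreduce */
    for (int d = 0; d < ndomains; d++) sum += partial[d];
    return sum;
}
```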
-
Benchmark #1
Linear Problem #1
Hood of a car with pressure loads
Compute displacements and stresses
Benchmark: 0.9 million degrees of freedom
  24 million non-zeros
  140,000 shells + 13,000 solids + 1,100 RBE3
  4,300 iterations
Platform: Nvidia PSG cluster, 2 nodes, each with:
  Dual Nvidia M2090 GPUs
  CUDA v3.2
  Intel Westmere 2x6 X5670 @ 2.93 GHz
  Linux RHEL 5.4 with Intel MPI 4.0
Performance: elapsed time decreased by up to 9X
[Chart: elapsed time (s), lower is better; gains vs. SMP 6-core]
1 node, 2 Nvidia M2090:
  SMP 6-core: 351 s
  Hybrid 2 MPI x 6 SMP: 185 s
  SMP 6 + 1 GPU: 84 s (4.2X)
  Hybrid 2 MPI x 6 SMP + 2 GPUs: 53 s (6.5X)
2 nodes, 4 Nvidia M2090:
  Hybrid 4 MPI x 6 SMP: 104 s
  Hybrid 4 MPI x 6 SMP + 4 GPUs: 38 s (9.2X)
-
Benchmark #2
Linear Problem #2
Hood of a car with pressure loads
Refined model
Compute displacements and stresses
Benchmark: 2.2 million degrees of freedom
  62 million non-zeros
  380,000 shells + 13,000 solids + 1,100 RBE3
  5,300 iterations
Platform: Nvidia PSG cluster, 2 nodes, each with:
  Dual Nvidia M2090 GPUs
  CUDA v3.2
  Intel Westmere 2x6 X5670 @ 2.93 GHz
  Linux RHEL 5.4 with Intel MPI 4.0
Performance: elapsed time decreased by up to 13X
[Chart: elapsed time (s), lower is better; gains vs. SMP 6-core]
1 node, 2 Nvidia M2090:
  SMP 6-core: 1106 s
  Hybrid 2 MPI x 6 SMP: 572 s
  SMP 6 + 1 GPU: 254 s (4.3X)
  Hybrid 2 MPI x 6 SMP + 2 GPUs: 143 s (7.5X)
2 nodes, 4 Nvidia M2090:
  Hybrid 4 MPI x 6 SMP: 306 s
  Hybrid 4 MPI x 6 SMP + 4 GPUs: 85 s (13X)
-
Benchmark #3
Gravity Problem with contacts
Full car model
Compute displacements and stresses
Benchmark: 0.85 million degrees of freedom
  21 million non-zeros
  138,000 shells + 11,000 solids + 5,700 1-D elements + 230 rigid bodies + 6 rigid walls
  1 gravity load and 22 contact interfaces
  ~9,000 iterations
Platform: Nvidia PSG cluster, 2 nodes, each with:
  Dual Nvidia M2090 GPUs
  CUDA v3.2
  Intel Westmere 2x6 X5670 @ 2.93 GHz
  Linux RHEL 5.4 with Intel MPI 4.0
Performance: elapsed time decreased by up to 9X
[Chart: elapsed time (s), lower is better; gains vs. SMP 6-core]
1 node, 2 Nvidia M2090:
  SMP 6-core: 1430 s
  Hybrid 2 MPI x 6 SMP: 1223 s
  SMP 6 + 1 GPU: 443 s (3.2X)
  Hybrid 2 MPI x 6 SMP + 2 GPUs: 225 s (6.4X)
2 nodes, 4 Nvidia M2090:
  Hybrid 4 MPI x 6 SMP: 531 s
  Hybrid 4 MPI x 6 SMP + 4 GPUs: 163 s (8.8X)
-
CONCLUSION
-
Conclusion
• RADIOSS implicit direct and iterative solvers have been successfully ported to Nvidia GPUs using CUDA
• Adding a GPU significantly improves the performance of these solvers
• For the iterative solver, Hybrid MPP enables runs on multi-GPU workstations and GPU clusters with good scalability and enhanced performance
• GPU support for the implicit solvers is planned for HyperWorks 12
A big thanks to the Nvidia team for their great support: Stan Posey, Steven Rennich, Thomas Reed, Peng Wang!