Transcript of Presentation File 2
2
Introduction of GPUs in HPC
Progress of CFD on GPUs
Review of OpenFOAM on GPUs
Discussion on WRF Developments
Agenda: GPU Progress and Directions for CAE
3
CFD Algorithm Suitability for GPUs
CFD speed-ups have been demonstrated across a range of time schemes and spatial discretizations:
Explicit [usually compressible], structured grid: stencil operations, uniform memory refs; ~15x; strategy: directives
Explicit [usually compressible], unstructured grid: stencil operations, renumbering schemes; ~5x; strategy: directives
Implicit [usually incompressible], structured grid: linear algebra solver, uniform memory refs; ~5x; strategy: libraries
Implicit [usually incompressible], unstructured grid: linear algebra solver, renumbering schemes; ~2-3x; strategy: libraries
~x factors are based on comparisons with a Xeon 8-core Sandy Bridge CPU.
4
Typical Routine Simulation
Large-scale Simulation ~19x Speedup
Source: http://www.turbostream-cfd.com/
Sample Turbostream GPU Simulations
Turbostream: CFD for Turbomachinery
5
GPU Application
Jameson-developed CFD software SD++ for high-order-method aerodynamic simulations
GPU Benefit
Use of 16 x Tesla M2070: 15 hrs vs. 202 hrs for 16 x Xeon X5670
Fast turnaround of complex LES simulations that would otherwise be impractical on CPU-only systems
Stanford University Aerospace Computing Lab – Prof. Antony Jameson
SD++ and Jameson Aerodynamics Research
15 hours on 16 x M2070s
202 hours (> one week)
on 16 x Xeon X5670 CPUs
Transitional flow over an SD7003 airfoil: 21M DOF, Ma = 0.2, Re = 60K, AoA = 4 deg, 4th-order scheme, 400K RK iterations
6
GPU Application
NRL-developed CFD software JENRE for simulation of jet engine acoustics
GPU Benefit
Use of Tesla M2070: 3x vs. a hex-core Intel (Westmere) CPU
Finer-mesh simulations over longer durations of jet engine transient conditions
U.S. DoD Naval Research Laboratory, Laboratory for Computational Physics and Fluid Dynamics
Fighter Jet Engine Noise Reduction on GPUs
7
GPU Application
SJTU-developed CFD software NUS3D for aerodynamic simulations of wing shapes
GPU Benefit
Use of Tesla C2070: 20x-37x vs. a single-core Intel Core i7 CPU
Faster simulations for more wing design candidates vs. wind tunnel testing
Expanding to multi-GPU and full aircraft
COMAC (Commercial Aircraft Corporation of China) and SJTU (Shanghai Jiao Tong University)
COMAC Wing Candidate
ONERA M6 Wing CFD Simulation
Commercial Aircraft Wing Design on GPUs
8
Particle-based CFD (LBM, SPH, etc.) is generally a better GPU fit than continuum methods
Fully ported explicit solvers generally outperform implicit solvers
Explicit i,j,k stencil operations are a good fit for massively parallel threads
Most CFD is distributed parallel across CPU multicores/nodes; this fits the GPU parallel model well and preserves the costly MPI investment
Focus is on hybrid parallel schemes that utilize all CPU cores plus GPUs
GPU development strategy depends on the profile starting point (a minimal OpenACC sketch follows below):
Legacy explicit scheme: compiler directives such as OpenACC
New explicit scheme: CUDA and stencil libraries
Legacy implicit scheme: CUDA and libraries for the solver; OpenACC for the rest
New implicit scheme: CUDA and libraries for the solver, matrix assembly, etc.
GPU Development Status for CFD
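A minimal sketch of the directives strategy for a legacy explicit scheme, assuming a simple 7-point update on a structured i,j,k grid (grid sizes, names, and the update formula are illustrative, not taken from any code discussed here):

```c
/* Sketch: offloading a legacy explicit i,j,k stencil update with OpenACC. */
#define NX 128
#define NY 128
#define NZ 128
#define IDX(i, j, k) ((i) + NX * ((j) + NY * (k)))

void explicit_step(const double *restrict u, double *restrict unew, double dt)
{
    /* The directive asks the compiler to generate a GPU kernel; the loop
     * body is untouched, which is why flat-profile legacy explicit codes
     * suit this approach. */
    #pragma acc parallel loop collapse(3) \
        copyin(u[0:NX*NY*NZ]) copyout(unew[0:NX*NY*NZ])
    for (int k = 1; k < NZ - 1; ++k)
        for (int j = 1; j < NY - 1; ++j)
            for (int i = 1; i < NX - 1; ++i)
                unew[IDX(i, j, k)] = u[IDX(i, j, k)] + dt *
                    (u[IDX(i + 1, j, k)] + u[IDX(i - 1, j, k)] +
                     u[IDX(i, j + 1, k)] + u[IDX(i, j - 1, k)] +
                     u[IDX(i, j, k + 1)] + u[IDX(i, j, k - 1)] -
                     6.0 * u[IDX(i, j, k)]);
}
```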
9
Courtesy of FluiDyna and LBultra CFD Software: www.fluidyna.de
RTT DeltaGen for photo-realistic 3D visualization
Integrates FluiDyna LBultra CFD functionality as a plug-in
The designer only needs to specify resolution and velocity
Simulation data is displayed live at GPU performance
FluiDyna and Aerodynamic-Aware Surface Design
10
Prometech and Particleworks for Multiphase Flow
Oil Flow in HB Gearbox
Courtesy of Prometech Software and Particleworks CFD Software
MPS-based method developed at the University of Tokyo [Prof. Koshizuka]
Particleworks 3.0: GPU vs. 4-core Core i7
http://www.prometech.co.jp
11
ISV / Software | Application | Method | GPU Status
PowerFLOW | Aerodynamics | LBM | Evaluation
LBultra | Aerodynamics | LBM | Available v2.0
XFlow | Aerodynamics | LBM | Evaluation
Project Falcon | Aerodynamics | LBM | Evaluation
Particleworks | Multiphase/FS | MPS (~SPH) | Available v3.1
BARRACUDA | Multiphase/FS | MP-PIC | In development
EDEM | Discrete phase | DEM | In development
ANSYS Fluent (DDPM) | Multiphase/FS | DEM | In development
STAR-CCM+ | Multiphase/FS | DEM | Evaluation
AFEA | High impact | SPH | Available v2.0
ESI | High impact | SPH, ALE | In development
LSTC | High impact | SPH, ALE | Evaluation
Altair | High impact | SPH, ALE | Evaluation
Availability of Commercial DSFD-Based Software
12
ISV | Primary Applications (green in the original slide indicates CUDA-ready during 2013)
ANSYS | ANSYS Mechanical; ANSYS Fluent; ANSYS HFSS
DS SIMULIA | Abaqus/Standard; Abaqus/Explicit; Abaqus/CFD
MSC Software | MSC Nastran; Marc; Adams
Altair | RADIOSS; AcuSolve
CD-adapco | STAR-CD; STAR-CCM+
Autodesk | AS Mechanical; Moldflow; AS CFD
ESI Group | PAM-CRASH (implicit); CFD-ACE+
Siemens | NX Nastran
LSTC | LS-DYNA; LS-DYNA CFD
Mentor | FloEFD; FloTherm
Metacomp | CFD++
Grid-Based Commercial CFD and GPU Progress
13
Additional Commercial GPU Developments
ISV | Domain | Location | Primary Applications
FluiDyna | CFD | Germany | Culises for OpenFOAM; LBultra
Vratis | CFD | Poland | Speed-IT for OpenFOAM; ARAEL
Prometech | CFD | Japan | Particleworks
Turbostream | CFD | England, UK | Turbostream
IMPETUS | Explicit FEA | Sweden | AFEA
AVL | CFD | Austria | FIRE
CoreTech | CFD (molding) | Taiwan | Moldex3D
Intes | Implicit FEA | Germany | PERMAS
Next Limit | CFD | Spain | XFlow
CPFD | CFD | USA | BARRACUDA
Flow Science | CFD | USA | FLOW-3D
14
Every primary ISV has products available on GPUs or undergoing evaluation
The 4 largest ISVs all have products based on GPUs, some at 3rd generation
#1 ANSYS, #2 DS SIMULIA, #3 MSC Software, and #4 Altair
Four of the top 5 ISV applications are available on GPUs today:
ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, MSC Nastran (LS-DYNA implicit only)
Several new ISVs were founded with GPUs as a primary competitive strategy: Prometech, FluiDyna, Vratis, IMPETUS, Turbostream
Open-source CFD OpenFOAM is available on GPUs today with many options; commercial options: FluiDyna, Vratis; open-source options: Cufflink, Symscape ofgpu, RAS, etc.
Status Summary of ISVs and GPU Computing
15
Categories: structured-grid FV, unstructured FV, unstructured FE (finite volume vs. finite element)
CFD Algorithm Characterization: Discretization
16
Rows: explicit (usually compressible) and implicit (usually incompressible) time integration
Columns: structured-grid FV, unstructured FV, unstructured FE (finite volume vs. finite element)
CFD Algorithm Characterization: Time Integration
17
Same characterization matrix as the previous slide, adding for the explicit row:
Numerical operations on an i,j,k stencil, no "solver" [typically flat profiles: GPU strategy of directives (OpenACC)]
CFD Algorithm Characterization: Time Integration
18
GPU Acceleration Relative to Single 8-Core CPU
Same matrix (explicit/implicit vs. structured-grid FV, unstructured FV, unstructured FE), with example codes:
Explicit, structured-grid FV (~15x): Turbostream; SJTU RANS
Explicit, unstructured (~5x): SD++ Stanford (Jameson); FEFLO (Lohner); Veloxi
19
GPU Acceleration Relative to Single 8-Core CPU
Same chart as the previous slide, adding for the implicit row:
Sparse matrix linear algebra, iterative solvers [hot spot ~50% of runtime, small % of LoC: GPU strategy of CUDA and libraries]
20
GPU Acceleration Relative to Single 8-Core CPU
Same chart, adding implicit example codes (~2x):
Implicit, unstructured FV: ANSYS Fluent; Culises for OpenFOAM; SpeedIT for OpenFOAM; CFD-ACE+; FIRE
Implicit, unstructured FE: Moldflow; AcuSolve; Moldex3D
(Explicit examples as before: Turbostream and SJTU RANS at ~15x; SD++, FEFLO, and Veloxi at ~5x)
21
Commercial CFD Focus on Sparse Solvers for GPU
Diagram: CFD application software executing across CPU + GPU.
CPU: read input, matrix set-up; global solution, write output.
GPU: implicit sparse matrix operations, 50%-65% of profile time, small % of LoC.
GPU strategies: hand-written CUDA parallel code; GPU libraries such as CUBLAS; OpenACC directives. (A hedged library-call sketch follows below.)
(Investigating OpenACC for moving more tasks onto the GPU.)
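As a rough illustration of the libraries strategy, here is a sketch of one iteration's hot-spot operations handed to GPU libraries. It assumes the legacy cuSPARSE csrmv interface (removed in CUDA 11 in favor of cusparseSpMV) and device-resident CSR data; the function and variable names are illustrative:

```c
/* Sketch: the per-iteration hot spot of an implicit solver (sparse
 * matrix-vector product plus vector updates) via cuSPARSE/cuBLAS. */
#include <cublas_v2.h>
#include <cusparse.h>

void solver_hot_spot(cusparseHandle_t sp, cublasHandle_t bl,
                     int n, int nnz, cusparseMatDescr_t descrA,
                     const double *d_valA, const int *d_rowPtrA,
                     const int *d_colIndA,
                     const double *d_x, double *d_y, double *d_r)
{
    const double one = 1.0, zero = 0.0, minus_one = -1.0;

    /* y = A*x: the dominant cost of each solver iteration */
    cusparseDcsrmv(sp, CUSPARSE_OPERATION_NON_TRANSPOSE, n, n, nnz,
                   &one, descrA, d_valA, d_rowPtrA, d_colIndA,
                   d_x, &zero, d_y);

    /* r = r - y: residual update via cuBLAS */
    cublasDaxpy(bl, n, &minus_one, d_y, 1, d_r, 1);
}
```

Because these calls replace only the hot spot, the surrounding application (input, assembly, output) stays as-is, which is the point of the small-%-of-LoC observation above.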
22
Library of nested solvers for large sparse Ax = b
Nesting creates a solver hierarchy, e.g. BiCGStab -> AMG -> Jacobi / MC-DILU (sketched below)
Example solvers:
Jacobi: simple local (neighbor) operations, no/little setup
BiCGStab: local and global operations, no setup
MC-DILU: graph coloring and factorization at setup
AMG: multi-level scheme; on each level, graph coarsening and matrix-matrix products at setup
Goals:
Accelerate state-of-the-art multi-level linear solvers in targeted application domains
Primary targets: CFD and reservoir simulation; other domains will follow
Focus on difficult-to-parallelize algorithms; parallelize both setup and solve phases
Difficult problems: parallel graph algorithms, sparse matrix manipulation, parallel smoothers
No group has yet successfully mapped production-quality algorithms of this kind to fine-grained parallel architectures
Ensure the NVIDIA architecture team understands these applications and is influenced by them
NVIDIA-Developed Library of Linear Solvers
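A minimal sketch of the nesting idea, using hypothetical types and stub solver steps (this is not the library's actual API): each solver holds a pointer to an inner solver used as its preconditioner or smoother.

```c
/* Hypothetical illustration of nested solvers forming a hierarchy. */
#include <stddef.h>

typedef struct Solver Solver;
struct Solver {
    const char *name;
    void (*apply)(const Solver *self, const double *b, double *x, int n);
    const Solver *inner;   /* nested preconditioner/smoother, may be NULL */
};

/* Stub steps; real implementations do the numerical work. */
static void jacobi_apply(const Solver *s, const double *b, double *x, int n)
{
    (void)s;                                  /* no nested solver, no setup */
    for (int i = 0; i < n; ++i) x[i] = b[i];  /* placeholder sweep */
}
static void amg_vcycle(const Solver *s, const double *b, double *x, int n)
{
    /* smooth -> restrict -> coarse-solve -> prolong -> smooth;
     * only the nested smoother call is shown here */
    if (s->inner) s->inner->apply(s->inner, b, x, n);
}
static void bicgstab_apply(const Solver *s, const double *b, double *x, int n)
{
    /* each BiCGStab iteration applies the nested preconditioner */
    if (s->inner) s->inner->apply(s->inner, b, x, n);
}

/* The hierarchy from this slide: BiCGStab preconditioned by AMG, with
 * Jacobi as the AMG smoother (MC-DILU would slot in the same way). */
static const Solver jacobi   = { "Jacobi",   jacobi_apply,   NULL    };
static const Solver amg      = { "AMG",      amg_vcycle,     &jacobi };
static const Solver bicgstab = { "BiCGStab", bicgstab_apply, &amg    };
```

The design point is that composition is uniform: AMG is just another Solver, so it can sit under BiCGStab or host a different smoother per level without special cases.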
23
Committed: ANSYS – ANSYS Fluent and ANSYS CFD: #1 in CFD
FluiDyna – Culises library use in OpenFOAM: OpenFOAM is #2 in CFD for leveraged hardware
Evaluation: Autodesk – AS Moldflow: the leader in plastic mold injection simulation
Autodesk – AS CFD: important to the design engineering market and being hosted on Autodesk cloud
Discussion: CD-adapco – STAR-CCM+: the #2 CFD code by software revenue, either #2 or #3 for leveraged hardware
ESI – CFD-ACE+: important CFD code in the semiconductor/electronics industry along with others
Cradle – SC/Tetra: #3 CFD in Japan (behind ANSYS Fluent and STAR-CCM+) and primary CFD code at Toyota
Targets: Altair – AcuSolve: GMRES
Metacomp – CFD++: AMG
Mentor – FloEFD: AMG
SIMULIA – Abaqus/CFD: use ML from Petsc
LSTC – LS-DYNA CFD: AMG
AVL – FIRE: AMG
Convergent Technologies – Converge CFD: GMRES
ISV Progress with NVIDIA CFD Solver Library
24
ANSYS and NVIDIA Technical Collaboration
Release | ANSYS Mechanical | ANSYS Fluent | ANSYS EM
13.0 Dec 2010 | SMP, single GPU, sparse and PCG/JCG solvers | (none) | ANSYS Nexxim
14.0 Dec 2011 | + Distributed ANSYS; + multi-node support | Radiation heat transfer (beta) | ANSYS Nexxim
14.5 Nov 2012 | + Multi-GPU support; + hybrid PCG; + Kepler GPU support | + Radiation HT; + GPU AMG solver (beta), single GPU | ANSYS Nexxim
15.0 Q4-2013 | + CUDA 5 Kepler tuning | + Multi-GPU AMG solver; + CUDA 5 Kepler tuning | ANSYS Nexxim; ANSYS HFSS (transient)
25
Radiation HT Applications:
- Underhood cooling
- Cabin comfort HVAC
- Furnace simulations
- Solar loads on buildings
- Combustor in turbine
- Electronics passive cooling
ANSYS Fluent 14.5 and Radiation HT on GPU
VIEWFAC utility: runs on CPUs, GPUs, or both; ~2x speedup
RAY TRACING utility: uses the OptiX library from NVIDIA, with up to ~15x speedup (GPU only)
26
Runtime profile of the nonlinear iteration loop:
Assemble linear system of equations: ~35% of runtime
Solve linear system of equations Ax = b: ~65% of runtime; accelerate this first
Converged? No: next nonlinear iteration; yes: stop
(A minimal sketch of this loop follows below.)
ANSYS Fluent CPU Job Profile for Coupled PBNS
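A minimal sketch of the profile above, with placeholder function names (the split between CPU-side assembly and the offloaded solve is the point, not the names):

```c
/* Sketch: outer nonlinear loop alternating assembly (~35%, CPU) with
 * the linear solve Ax = b (~65%, the first thing moved to the GPU). */
#include <stdbool.h>

void assemble_linear_system(void);      /* ~35%: stays on the CPU          */
void solve_linear_system_on_gpu(void);  /* ~65%: AMG solve, offloaded first */
bool check_convergence(void);

void coupled_pbns_outer_loop(void)
{
    bool converged = false;
    while (!converged) {
        assemble_linear_system();
        solve_linear_system_on_gpu();
        converged = check_convergence();
    }
}
```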
27
Chart: ANSYS Fluent AMG solver time per iteration (sec); lower is better; times are for the solver only.
Cases: Airfoil (hex 784K) and Aircraft (hex 1798K) models with hexahedral cells; Tesla K20X vs. Core i7-3930K (6 cores); 2.4x speedup on both cases.
Solver settings:
CPU Fluent solver: F-cycle, agg8, DILU, 0pre, 3post
GPU nvAMG solver: V-cycle, agg8, MC-DILU, 0pre, 3post (a minimal V-cycle sketch follows below)
Hardware: 2 x Core i7-3930K, only 6 cores used
ANSYS Fluent 14.5 Performance – Results by NVIDIA, Nov 2012
ANSYS Fluent GPU-Based AMG Solver from NVIDIA
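The "0pre, 3post" settings describe smoothing sweeps per multigrid level. Here is a recursive V-cycle sketch, assuming placeholder types and transfer operators (in the real solver these come out of the aggregation setup):

```c
/* Recursive V-cycle sketch matching the "0pre, 3post" settings above. */
typedef struct Level Level;  /* one grid level: matrix, transfer ops, vectors */

void smooth(Level *lv, int sweeps);                   /* e.g. MC-DILU sweeps   */
void restrict_residual(Level *fine, Level *coarse);   /* r_c = R * r_f         */
void prolong_and_correct(Level *coarse, Level *fine); /* x_f += P * e_c        */
void coarse_solve(Level *lv);                         /* direct or cheap solve */

void vcycle(Level **levels, int nlevels, int l)
{
    if (l == nlevels - 1) { coarse_solve(levels[l]); return; }
    smooth(levels[l], 0);                       /* 0 pre-smoothing sweeps  */
    restrict_residual(levels[l], levels[l + 1]);
    vcycle(levels, nlevels, l + 1);             /* recurse to coarser grid */
    prolong_and_correct(levels[l + 1], levels[l]);
    smooth(levels[l], 3);                       /* 3 post-smoothing sweeps */
}
```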
28
Comparison of AMG Cycles on CPU and GPU
Chart: 2D convection case, solver time by cycle type (lower is better); the F-cycle is best for both CPU (CPU-F) and GPU (GPU-F).
29
GPUs and Distributed Cluster Computing
Diagram: geometry is decomposed into partitions 1-4, placed on independent cluster nodes N1-N4 for CPU distributed parallel processing; nodes run distributed parallel using MPI; partition results combine into the global solution.
30
GPUs and Distributed Cluster Computing (continued)
Same diagram, adding execution on CPU + GPU: each node's partition also executes on a GPU (G1-G4), with shared-memory (OpenMP) parallelism on each node under the distributed MPI scheme. (A minimal rank-to-GPU mapping sketch follows below.)
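A minimal sketch of keeping the MPI decomposition while attaching each rank to one of its node's GPUs. The local-rank environment variable is an assumption (it varies by MPI launcher); everything else is standard MPI and CUDA runtime calls:

```c
/* Sketch: hybrid MPI + GPU execution under the existing decomposition. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    /* Round-robin local ranks onto the node's GPUs (G1..G4 on the slide). */
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK"); /* launcher-specific */
    int local_rank = lr ? atoi(lr) : 0;
    if (ngpus > 0)
        cudaSetDevice(local_rank % ngpus);

    /* ... partition-local assembly and GPU-accelerated solve here, with
     *     MPI halo exchanges at partition boundaries as before ... */

    MPI_Finalize();
    return 0;
}
```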
31
NOTE: times are for the solver only
Solver settings:
CPU Fluent solver: F-cycle, agg8, DILU, 0pre, 3post
GPU nvAMG solver: V-cycle, agg8, MC-DILU, 0pre, 3post
Hardware: 2 x E5-2680 Sandy Bridge CPUs (16 cores total); only 2 cores used alongside the GPUs
Chart: solver time per iteration (lower is better); 2 x K20X vs. E5-2680 (16 cores); 1.7x and 2.1x speedups on the Helix (tet 1173K) and Airfoil (hex 784K) cases
ANSYS Fluent 15.0 Preview Performance – Results by NVIDIA, Feb 2013
ANSYS Fluent Preview for 2 x CPU + 2 x Tesla K20X
32
ANSYS Fluent Scaling Results for 4 x Tesla K20X
ANSYS Fluent 15.0 Preview Performance – Results by NVIDIA, Mar 2013
Chart: scaling from 1 to 4 K20X GPUs vs. perfect scaling; higher is better.
Cases: Helix (tet 1.2M), Airfoil (hex 0.78M), Sedan (mixed 3.6M)
GPU solver settings: V-cycle, agg8/2, MC-DILU, 0pre, 3post
Hardware setup: 2 server nodes, 2 GPUs per node, InfiniBand network
NOTES: results are for the solver only; the Sedan case starts at 2 GPUs
33
ANSYS Fluent 15.0 Multi-GPU Demonstration
Multi-GPU acceleration of a 16-core ANSYS Fluent simulation of external aero (slide links to a demonstration movie)
Hardware: 16-core server node (2 x 8 cores), Xeon E5-2667 CPUs + 4 x Tesla K20X GPUs (G1-G4)
2.9x solver speedup: CPU configuration vs. CPU + GPU configuration
34
Problem Statement:
Demand for increased CFD model resolution to improve simulation accuracy
CFD use today is roughly 80% steady-state RANS, a short-cut taken for faster turnaround
Fluid flow is inherently unsteady and needs better turbulence treatment
CPU-based HPC limits advanced CFD
Opportunity:
CFD ISVs have developed URANS, DES, and LES capabilities that see very limited use
CPU-based turnaround times are impractical for many product development workflows
Large Eddy Simulation (LES) is of most interest and has a high degree of arithmetic intensity
GPU computing can offer a practical solution for LES that doesn’t exist today with CPUs
Summary: Opportunity for Advanced CFD
35
Opportunities exist for GPUs to provide significant performance acceleration for solver-intensive large jobs:
Improved product quality
Shorter product engineering cycles (faster time-to-market)
Better total cost of ownership (TCO)
Reduced energy consumption in the CAE process
Simulations recently considered intractable are now possible:
Large Eddy Simulation (LES), with its high degree of arithmetic intensity
Parameter optimization with a greatly increased number of jobs
Conclusions for CAE on GPUs