MSC Nastran Explicit Nonlinear (SOL 700) on Advanced SGI® Architectures
Presented By:
Dr. Olivier Schreiber, Application Engineering, SGI
Walter Schrauwen, Senior Engineer, Finite Element Development, MSC Software
April 10, 2013
Agenda
• SGI introduction
• SGI and MSC Software
• SGI solutions for CAE
• SGI architectures
• MSC Nastran SMP, DMP technologies
• Benchmarks: effect of processor frequency, Turbo, HyperThreading, memory, interconnect, filesystem; running on a subset of cores; scaling
• Summary
SGI's focus
SGI: The Trusted Leader in Technical Computing
[Diagram: SGI's focus spans business computing (business applications, optimized for redundancy) and technical computing (technical applications and Big Data, workload optimized for scale and speed).]

SGI CAE Customers
SGI Partnership with MSC Software
MSC Software builds on all SGI-supported platforms and software stacks:
• x86_64, LP64, ILP64
• NUMA, multi-core, DMP, SMP
• Popular Linux distribution choices: RHEL, SLES
• Intel MPI, Platform MPI, Open MPI
The md2010.2 release started to include SGI math kernels licensed by MSC Software. A 15.2-million-DOF SOL 101 automotive industry engine model shows the performance boost on x86_64 (results courtesy of MSC Software).
SGI Solution - Mixed Workflow Environment
• SGI UV: minimizes time to solution; ultimate flexibility running shared- and distributed-memory applications in one system using a single OS instance.
• SGI Rackable or ICE X cluster: cost-effective solution and performance leader for solvers; ideal for scale-up and scale-out.
• Storage: SGI® CXFS, Lustre, Gluster, Panasas over an InfiniBand or GigE fabric.
[Diagram: SGI® UV, SGI® Rackable, and SGI® ICE X systems serving a mixed workflow.]
SGI Rackable: Highly configurable cluster
Suggested configuration:
• Intel® Xeon® 8-core 2.6 GHz E5-2670 or 6-core 2.9 GHz E5-2667
• SGI C2108-RP2 "Summit" head node; SGI C2112-4RP4 "Steelhead" compute nodes
• 500 GB SATA RAID 1 mirror on the head node
• Two striped 500 GB SATA drives for local scratch on each compute node
• 4-8 GB memory per core (2 GB per million elements)
• InfiniBand QDR or FDR interconnect
• SGI InfiniteStorage 5000 (IS5000) for scratch/storage
• Altair PBSPro batch scheduler v11
• SLES or RHEL; SGI Performance Suite, Accelerate; 20U and 42U racks available
SGI ICE X: Highly scalable, integrated, cable-free cluster
Suggested configuration:
• Intel® Xeon® 8-core 2.6 GHz E5-2670 or 6-core 2.9 GHz E5-2667
• Up to 72 nodes per rack (144 sockets, 1,152 cores)
• Up to two 2.5" SATA drives per node for local swap/scratch
• 4-8 GB memory per core (2 GB per million elements)
• Integrated InfiniBand FDR interconnect, hypercube or fat-tree topology
• Single- or dual-plane network topology (multi-rail to separate MPI and I/O traffic, or for message splitting)
• SGI InfiniteStorage 5000 (IS5000) for scratch/storage
• SLES or RHEL; SGI Performance Suite, Accelerate
• Altair PBSPro batch scheduler v11
SGI UV 2: Highly scalable, shared-memory, latest-generation x86 system (SGI® UV 2000)
Suggested configuration:
• Air- or water-cooled 19" 42U rack, H 79.5" x W 31.3" x D 46.2"
• 64 sockets (512 cores), 16 TB shared memory per rack
• Intel® Xeon® 8-core 2.7 GHz E5-4650 or 6-core 2.9 GHz E5-4617
• Graphics card, GPGPU, and Intel® Xeon Phi™ support
• SGI NUMAlink™ 6 interconnect
• SGI InfiniteStorage 5000 (IS5000) for scratch/storage
• SLES or RHEL; SGI Performance Suite, Accelerate
• Altair PBSPro batch scheduler v11 with CPUSET MOM
Customer example: Hamilton Sundstrand
Challenge: Hamilton Sundstrand wanted a low-administration, low-maintenance system with fast turnaround times for the development of integrated component systems for commercial, regional, corporate, and military aircraft.
Solution: UV 1000 with 384 Intel® Xeon® X7542 cores @ 2.66 GHz, 2 TB RAM, IS4600 storage/fast scratch, and Altair PBSPro batch scheduling.
Applications: MSC Nastran, LS-DYNA, ANSYS, STAR-CCM+.
Impact: significantly less system administration, greater ease of use, improved I/O scalability, and RAM leveraged across the whole system to run large MSC Nastran jobs. This frees aircraft engineers to focus on core competencies, with no additional IT staff investment.
SGI Cyclone: completing the range of options for customers
• Traditional data center: scale & control
• Cloud: on demand
• Modular data center: modular & mobile
SGI customers can move freely between these environments.
SGI Cyclone Solution
• SGI UV: minimizes time to solution; ultimate flexibility running shared- and distributed-memory applications in one system using a single OS instance.
• SGI ICE X or Rackable cluster: cost-effective solution and performance leader for all solver runs.
Analysis types: bandwidth performance, large statics, dynamics, NVH, multiphysics, crash/impact, CFD.
[Diagram: customers A through E each log in through the SGI firewall to shared SGI® UV, SGI® Rackable, and SGI® ICE resources.]
Industry Context
New hardware technologies extend computational performance; MSC Nastran provides computational technologies to exploit them:
• Shared Memory Parallel (SMP)
• Distributed Memory Parallel (DMP)
• Hybrid mode (a combination of the two paradigms above)
MSC Nastran parallelism (version 2012.2)
• Shared Memory Parallelism (SMP, 1980s, v68): OpenMP Application Programming Interface around computational loops.
• Distributed Memory Parallelism (DMP, 1990s, v2001): MPI API around physical domain decomposition (coarser granularity than SMP loops).
The paradigms map onto two different system hardware levels:
• inter-node or cluster parallelism (memory local to each node): DMP only;
• intra-node or multi-core parallelism (memory shared by all cores of each node): both SMP and DMP.
Submittal procedures should ensure (a sketch follows this list):
• placement of processes and threads across cores and nodes;
• control of process memory allocation to stay within node capacity;
• use of adequate scratch files across nodes or the network.
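A minimal submission sketch of such a procedure, assuming the Altair PBSPro scheduler suggested in the configurations above (job name, resource string, paths, and input deck are illustrative assumptions, not values from this presentation):

    #!/bin/bash
    #PBS -N sol700_job
    # Hypothetical placement: 4 nodes x 16 cores, 8 MPI ranks per node
    # with 2 OpenMP threads each (8 x 2 = 16 cores per node).
    #PBS -l select=4:ncpus=16:mpiprocs=8:ompthreads=2
    #PBS -l walltime=04:00:00

    cd $PBS_O_WORKDIR
    # mem= kept within per-node capacity; scratch on local (DAS) storage.
    nastran model.dat dmp=32 smp=2 mem=4gb sdirectory=/scratch/$USER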
MSC Nastran execution control
Submittal-line control options:
• sdirectory=pathname: directory for scratch files
• mem=: explicit memory allocation for in-core processing
• smp=: number of OpenMP threads
• dmp=: number of distributed processes
• hosts=host1:...:hostn: hosts on which to run the distributed processes
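Taken together, a submittal line using these options might look like the following sketch (deck name, host names, and values are hypothetical; only the keywords come from the list above):

    # 8 distributed processes, 2 OpenMP threads each, over two hosts,
    # with 4 GB allocated for in-core processing and local scratch:
    nastran model.dat dmp=8 smp=2 mem=4gb \
            sdirectory=/scratch/tmp hosts=node1:node2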
Scratch-space filesystem topology
• NAS (Network Attached Storage)
• DAS (Direct Attached Storage): root drive (/tmp) or memory filesystem (/dev/shm)
Linux filesystem buffering applies in either case.
Hardware nomenclature: node = host = blade (= chassis), identified by MAC/IP address; each comprises two or more sockets holding processors with four (quad-core), six (hexa-core), eight, or twelve cores; CPU = core.
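As a hedged illustration of how these topology choices appear at submittal time (deck name and paths hypothetical), scratch can be pointed at the memory filesystem when the model fits in RAM, or at local or network storage otherwise:

    # Memory filesystem (fastest, but consumes node RAM):
    nastran model.dat dmp=16 sdirectory=/dev/shm
    # Striped local drives (DAS):
    nastran model.dat dmp=16 sdirectory=/scratch
    # Network storage (NAS; subject to network contention):
    nastran model.dat dmp=16 sdirectory=/nas/scratch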
Benchmarks
Models were selected for increasing size and for changes in solver, material, site, and phenomenon modeled, as indicated by their names.
Small model 'EulerSnowBoxCompression' (240K elements)
Soil is compressed by a structure to test the snow yield model.
Medium model 'RoeAirBunkerBlast' (880K elements)
An explosive is ignited outside a bunker; the blast wave hits the bunker, the bunker fails, and explosive gas enters it. Flow between Euler domains is modeled using the second-order Roe solver.
Large model 'RoeAirMineBlast' (2 x 800K elements)
A blast wave expands below a car model with an opening, and gas flows inside the car. Flow between Euler domains is simulated using the second-order Roe solver.
Largest model 'EulerSoilVehicleBlast' (2.2M elements)
Simulates a soil-covered JWL explosive below a vehicle, using a biased Euler mesh: the blast-location volume has an especially fine Euler mesh to capture the blast in the soil.
CPU frequency effect on performance
A nominal frequency ratio of 1.11 yields an observed speedup of only 1.05.
Turbo Boost effect on performance
Noticeable gains
Running with more nodes
Dataset dependent; the number of processes is the relevant parameter. Scaling continues up to 256 processes, but job rating falls above 128 processes.
Interconnect effect on performance
Dataset dependent
Gigabit Ethernet interconnect
No scalability beyond 64 processes or 4 nodes: job rating, which continues to increase up to 128 processes over InfiniBand, levels off at 64 processes over GigE.
Leaving cores unused across nodes
Dataset dependent: at a constant number of processes, job rating increases for some datasets when the processes are spread across more nodes, and decreases for others (a request sketch follows).
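Assuming the PBSPro scheduler used elsewhere in this deck, the two placements compared on the following slides could be requested along these lines (resource strings are illustrative assumptions):

    # 64 processes packed onto 4 nodes (16 per node, all cores used):
    #PBS -l select=4:ncpus=16:mpiprocs=16
    # 64 processes spread across 16 nodes (4 per node, 12 cores idle):
    #PBS -l select=16:ncpus=16:mpiprocs=4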
Leaving cores unused across nodes: benefit
[Plots: 64 processes run across 16 nodes vs. 4 nodes, small case @ 2.9 GHz, FDR.]
Leaving cores unused across nodes: no benefit
[Plots: 64 processes run across 16 nodes vs. 4 nodes, medium case @ 2.9 GHz, FDR.]
Anomaly
Two processes on two nodes take 60% more time than on one node.
[Plots: 2 processes run on 2 nodes vs. 1 node, medium case @ 2.9 GHz, FDR.]
Results observations recapitulation:
• Good scaling for DMP beyond one node / 16 cores.
• HyperThreading is not beneficial, due to communication costs.
• InfiniBand can be twice as fast as a GigE interconnect.
• Frequency and Turbo matter when there is no I/O bottleneck, i.e., when a DAS or RAM filesystem is used rather than NAS.
• More RAM accommodates a larger Linux buffer cache.
• All effects depend on the dataset.
Upgrading a single system attribute (CPU frequency, interconnect, number of cores per node, RAM speed) brings diminishing returns if the others are unchanged. Trades can be made on metrics to be optimized, such as dataset turnaround time or throughput, and acquisition, licensing, energy, facilities, and maintenance costs to be minimized.
Summary
MSC Nastran offers integrated shared- and distributed-memory parallel computational features applicable to advanced SGI® computer architectures such as Rackable, ICE X, and UV, with Intel® Xeon® processors; InfiniBand QDR, FDR, and NUMAlink® 6 interconnect hardware; and Linux®, SGI® Performance Suite™, and Accelerate™ software, allowing greatly enhanced workflow optimization.
Olivier Schreiber [email protected]
Tony DeVarco [email protected]
Walter Schrauwen [email protected]
Thank you