MSC Nastran Explicit Nonlinear (SOL 700) on Advanced SGI® Architectures
Presented By:
Dr. Olivier Schreiber, Application Engineering, SGI
Walter Schrauwen, Senior Engineer, Finite Element Development, MSC Software
April 10, 2013
Agenda
• SGI introduction
• SGI and MSC Software
• SGI solutions for CAE
• SGI architectures
• MSC Nastran SMP, DMP technologies
• Benchmarks: effect of processor frequency, Turbo, HyperThreading, memory, interconnect, filesystem; running on a subset of cores; scaling
• Summary
SGI's focus
SGI: The Trusted Leader in Technical Computing
[Diagram: SGI's focus spans business computing (business applications, optimized for redundancy) and technical computing (technical applications and Big Data, workload optimized for scale and speed).]

SGI CAE Customers
SGI Partnership with MSC Software
MSC Software builds on all SGI-supported platforms and software stacks:
• x86_64, LP64, ILP64
• NUMA, multi-core, DMP, SMP
• Popular Linux distribution choices: RHEL, SLES
• Intel MPI, Platform MPI, Open MPI
The md2010.2 release started to include SGI math kernels licensed by MSC Software. A 15.2-million-DOF SOL 101 automotive industry engine model shows the performance boost on x86_64 (results courtesy of MSC Software).
SGI Solution - Mixed Workflow Environment
• SGI UV: minimizes time to solution; ultimate flexibility running shared- and distributed-memory applications in one system using a single OS instance.
• SGI Rackable or ICE X cluster: cost-effective solution and performance leader for solvers; ideal for scale-up and scale-out.
• Storage: SGI® CXFS, Lustre, Gluster, Panasas over an InfiniBand or GigE fabric.
[Diagram: SGI® UV, SGI® Rackable, and SGI® ICE X systems serving a mixed workflow.]
SGI Rackable: Highly configurable cluster
Suggested configuration:
• Intel® Xeon® 8-core 2.6 GHz E5-2670 or 6-core 2.9 GHz E5-2667
• SGI C2108-RP2 "Summit" head node; SGI C2112-4RP4 "Steelhead" compute nodes
• 500 GB SATA RAID 1 mirror on the head node
• Two striped 500 GB SATA drives for local scratch on each compute node
• 4-8 GB memory per core (2 GB per million elements)
• InfiniBand QDR or FDR interconnect
• SGI InfiniteStorage 5000 (IS5000) for scratch/storage
• Altair PBSPro batch scheduler v11
• SLES or RHEL; SGI Performance Suite, Accelerate; 20U and 42U racks available
SGI ICE X: Highly scalable, integrated, cable-free cluster
Suggested configuration:
• Intel® Xeon® 8-core 2.6 GHz E5-2670 or 6-core 2.9 GHz E5-2667
• Up to 72 nodes per rack (144 sockets, 1,152 cores)
• Up to two 2.5" SATA drives per node for local swap/scratch
• 4-8 GB memory per core (2 GB per million elements)
• Integrated InfiniBand FDR interconnect, hypercube or fat-tree topology
• Single- or dual-plane network topology (multi-rail to separate MPI and I/O traffic, or for message splitting)
• SGI InfiniteStorage 5000 (IS5000) for scratch/storage
• SLES or RHEL; SGI Performance Suite, Accelerate
• Altair PBSPro batch scheduler v11
SGI UV 2: Highly scalable, shared-memory, latest-generation x86 system (SGI® UV 2000)
Suggested configuration:
• Air- or water-cooled 19" 42U rack, H 79.5" x W 31.3" x D 46.2"
• 64 sockets (512 cores), 16 TB shared memory per rack
• Intel® Xeon® 8-core 2.7 GHz E5-4650 or 6-core 2.9 GHz E5-4617
• Graphics card, GPGPU, and Intel® Xeon Phi™ support
• SGI NUMAlink™ 6 interconnect
• SGI InfiniteStorage 5000 (IS5000) for scratch/storage
• SLES or RHEL; SGI Performance Suite, Accelerate
• Altair PBSPro batch scheduler v11 with CPUSET MOM
Customer example: Hamilton Sundstrand
Challenge: Hamilton Sundstrand wanted a low-administration, low-maintenance system with fast turnaround times for the development of integrated component systems for commercial, regional, corporate, and military aircraft.
Solution: UV 1000 with 384 Intel® Xeon® X7542 cores @ 2.66 GHz, 2 TB RAM, IS4600 storage/fast scratch, and Altair PBSPro batch scheduling.
Applications: MSC Nastran, LS-DYNA, ANSYS, STAR-CCM+.
Impact: significantly less system administration, greater ease of use, improved I/O scalability, and RAM leveraged across the whole system to run large MSC Nastran jobs. This frees aircraft engineers to focus on core competencies, with no additional IT staff investment.
SGI Cyclone: completing the range of options for customers
• Traditional data center: scale & control
• Cloud: on demand
• Modular data center: modular & mobile
SGI customers can move freely between these environments.
SGI Cyclone Solution
• SGI UV: minimizes time to solution; ultimate flexibility running shared- and distributed-memory applications in one system using a single OS instance.
• SGI ICE X or Rackable cluster: cost-effective solution and performance leader for all solver runs.
Analysis types: bandwidth performance, large statics, dynamics, NVH, multiphysics, crash/impact, CFD.
[Diagram: customers A through E each log in through the SGI firewall to shared SGI® UV, SGI® Rackable, and SGI® ICE resources.]
Industry Context
New hardware technologies extend computational performance; MSC Nastran provides computational technologies to exploit them:
• Shared Memory Parallel (SMP)
• Distributed Memory Parallel (DMP)
• Hybrid mode (a combination of the two paradigms above)
MSC Nastran parallelism (version 2012.2)
• Shared Memory Parallelism (SMP, 1980s, v68): OpenMP Application Programming Interface around computational loops.
• Distributed Memory Parallelism (DMP, 1990s, v2001): MPI API around physical domain decomposition (coarser granularity than SMP loops).
The paradigms map onto two different system hardware levels:
• inter-node or cluster parallelism (memory local to each node): DMP only;
• intra-node or multi-core parallelism (memory shared by all cores of each node): both SMP and DMP.
Submittal procedures should ensure (a sketch follows this list):
• placement of processes and threads across cores and nodes;
• control of process memory allocation to stay within node capacity;
• use of adequate scratch files across nodes or the network.
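A minimal submission sketch of such a procedure, assuming the Altair PBSPro scheduler suggested in the configurations above (job name, resource string, paths, and input deck are illustrative assumptions, not values from this presentation):

    #!/bin/bash
    #PBS -N sol700_job
    # Hypothetical placement: 4 nodes x 16 cores, 8 MPI ranks per node
    # with 2 OpenMP threads each (8 x 2 = 16 cores per node).
    #PBS -l select=4:ncpus=16:mpiprocs=8:ompthreads=2
    #PBS -l walltime=04:00:00

    cd $PBS_O_WORKDIR
    # mem= kept within per-node capacity; scratch on local (DAS) storage.
    nastran model.dat dmp=32 smp=2 mem=4gb sdirectory=/scratch/$USER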
MSC Nastran execution control
Submittal-line control options:
• sdirectory=pathname: directory for scratch files
• mem=: explicit memory allocation for in-core processing
• smp=: number of OpenMP threads
• dmp=: number of distributed processes
• hosts=host1:...:hostn: hosts on which to run the distributed processes
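Taken together, a submittal line using these options might look like the following sketch (deck name, host names, and values are hypothetical; only the keywords come from the list above):

    # 8 distributed processes, 2 OpenMP threads each, over two hosts,
    # with 4 GB allocated for in-core processing and local scratch:
    nastran model.dat dmp=8 smp=2 mem=4gb \
            sdirectory=/scratch/tmp hosts=node1:node2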
Scratch-space filesystem topology
• NAS (Network Attached Storage)
• DAS (Direct Attached Storage): root drive (/tmp) or memory filesystem (/dev/shm)
Linux filesystem buffering applies in either case.
Hardware nomenclature: node = host = blade (= chassis), identified by MAC/IP address; each comprises two or more sockets holding processors with four (quad-core), six (hexa-core), eight, or twelve cores; CPU = core.
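As a hedged illustration of how these topology choices appear at submittal time (deck name and paths hypothetical), scratch can be pointed at the memory filesystem when the model fits in RAM, or at local or network storage otherwise:

    # Memory filesystem (fastest, but consumes node RAM):
    nastran model.dat dmp=16 sdirectory=/dev/shm
    # Striped local drives (DAS):
    nastran model.dat dmp=16 sdirectory=/scratch
    # Network storage (NAS; subject to network contention):
    nastran model.dat dmp=16 sdirectory=/nas/scratch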
Benchmarks
Models were selected for increasing size and for changes in solver, material, site, and phenomenon modeled, as indicated by their names.
Small model 'EulerSnowBoxCompression' (240K elements)
Soil is compressed by a structure to test the snow yield model.
Medium model 'RoeAirBunkerBlast' (880K elements)
An explosive is ignited outside a bunker; the blast wave hits the bunker, the bunker fails, and explosive gas enters it. Flow between Euler domains is modeled using the second-order Roe solver.
Large model 'RoeAirMineBlast' (2 x 800K elements)
A blast wave expands below a car model with an opening, and gas flows inside the car. Flow between Euler domains is simulated using the second-order Roe solver.
Largest model 'EulerSoilVehicleBlast' (2.2M elements)
Simulates a soil-covered JWL explosive below a vehicle, using a biased Euler mesh: the blast-location volume has an especially fine Euler mesh to capture the blast in the soil.
CPU frequency effect on performance
A nominal frequency ratio of 1.11 yields an observed speedup of only 1.05.
Turbo Boost effect on performance
Noticeable gains
Running with more nodes
Dataset dependent; the number of processes is the relevant parameter. Scaling continues up to 256 processes, but job rating falls above 128 processes.
Interconnect effect on performance
Dataset dependent
Gigabit Ethernet interconnect
No scalability beyond 64 processes or 4 nodes: job rating, which continues to increase up to 128 processes over InfiniBand, levels off at 64 processes over GigE.
Leaving cores unused across nodes
Dataset dependent: at a constant number of processes, job rating increases for some datasets when the processes are spread across more nodes, and decreases for others (a request sketch follows).
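Assuming the PBSPro scheduler used elsewhere in this deck, the two placements compared on the following slides could be requested along these lines (resource strings are illustrative assumptions):

    # 64 processes packed onto 4 nodes (16 per node, all cores used):
    #PBS -l select=4:ncpus=16:mpiprocs=16
    # 64 processes spread across 16 nodes (4 per node, 12 cores idle):
    #PBS -l select=16:ncpus=16:mpiprocs=4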
Leaving cores unused across nodes: benefit
[Plots: 64 processes run across 16 nodes vs. 4 nodes, small case @ 2.9 GHz, FDR.]
Leaving cores unused across nodes: no benefit
[Plots: 64 processes run across 16 nodes vs. 4 nodes, medium case @ 2.9 GHz, FDR.]
Anomaly
Two processes on two nodes take 60% more time than on one node.
[Plots: 2 processes run on 2 nodes vs. 1 node, medium case @ 2.9 GHz, FDR.]
Results observations recapitulation:
• Good scaling for DMP beyond one node / 16 cores.
• HyperThreading is not beneficial, due to communication costs.
• InfiniBand can be twice as fast as a GigE interconnect.
• Frequency and Turbo matter when there is no I/O bottleneck, i.e., when a DAS or RAM filesystem is used rather than NAS.
• More RAM accommodates a larger Linux buffer cache.
• All effects depend on the dataset.
Upgrading a single system attribute (CPU frequency, interconnect, number of cores per node, RAM speed) brings diminishing returns if the others are unchanged. Trades can be made on metrics to be optimized, such as dataset turnaround time or throughput, and acquisition, licensing, energy, facilities, and maintenance costs to be minimized.
Summary
MSC Nastran offers integrated shared- and distributed-memory parallel computational features applicable to advanced SGI® computer architectures such as Rackable, ICE X, and UV, with Intel® Xeon® processors; InfiniBand QDR, FDR, and NUMAlink® 6 interconnect hardware; and Linux®, SGI® Performance Suite™, and Accelerate™ software, allowing greatly enhanced workflow optimization.
Olivier Schreiber [email protected]
Tony DeVarco [email protected]
Walter Schrauwen [email protected]
Thank you