HPC for Mechanical ANSYS


Transcript of HPC for Mechanical ANSYS

  • High Performance Computing for Mechanical Simulations using ANSYS

    Jeff Beisheim, ANSYS, Inc.

  • HPC Defined

    High Performance Computing (HPC) at ANSYS: an ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing.

    It is a hardware and software initiative!

  • Need for Speed

    Impact product design. Enable large models. Allow parametric studies.

    Modal, nonlinear, multiphysics, dynamics.

    Assemblies, CAD to mesh, capture fidelity.

  • A History of HPC Performance

    1980s      Vector processing on mainframes
    1990       Shared Memory Multiprocessing (SMP) available
    1994       Iterative PCG solver introduced for large analyses
    1999-2000  64-bit large memory addressing
    2004       1st company to solve 100M structural DOF
    2005-2007  Distributed PCG solver; Distributed ANSYS (DMP) released; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
    2007-2009  Optimized for multicore processors; teraflop performance at 512 cores
    2010       GPU acceleration (single GPU; SMP)
    2012       GPU acceleration (multiple GPUs; DMP)

  • HPC Revolution

    Recent advancements have revolutionized the computational speed available on the desktop: multicore processors (every core is really an independent processor), large amounts of RAM and SSDs, and GPUs.

  • Parallel Processing - Hardware

    2 types of memory systems:

    Shared memory parallel (SMP): single box, workstation/server

    Distributed memory parallel (DMP): multiple boxes, cluster

    [Figure: workstation vs. cluster]

  • Parallel Processing - Software

    2 types of parallel processing for Mechanical APDL:

    Shared memory parallel (-np > 1): first available in v4.3; can only be used on a single machine.

    Distributed memory parallel (-dis -np > 1): first available in v6.0 with the DDS solver; can be used on a single machine or a cluster.

    GPU acceleration (-acc): first available in v13.0 using NVIDIA GPUs; supports using either a single GPU or multiple GPUs; can be used on a single machine or a cluster. (Illustrative launch lines follow below.)
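    As a minimal illustration of these three modes, the launch lines below use the -np, -dis, and -acc options named above; the executable name (ansys145 here) and the input/output file names are assumptions that vary by release and job:

    # SMP: shared-memory parallel on 4 cores of a single machine
    ansys145 -b -np 4 -i input.dat -o output.out

    # DMP: Distributed ANSYS on 8 cores (single machine or cluster)
    ansys145 -b -dis -np 8 -i input.dat -o output.out

    # GPU acceleration layered on top of a DMP run
    ansys145 -b -dis -np 8 -acc nvidia -i input.dat -o output.out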

  • Distributed ANSYS - Design Requirements

    No limitation in simulation capability: must support all features; continually working to add more functionality with each release.

    Reproducible and consistent results: same answers achieved using 1 core or 100 cores; same quality checks and testing are done as with the SMP version; uses the same code base as the SMP version of ANSYS.

    Support all major platforms: most widely used processors, operating systems, and interconnects; supports the same platforms that the SMP version supports; uses the latest versions of MPI software, which support the latest interconnects.

  • Distributed ANSYS - Design

    Distributed steps (-dis -np N): at the start of the first load step, decompose the FEA model into N pieces (domains). Each domain goes to a different core to be solved.

    The solution is not independent!! Lots of communication is required to achieve the solution, and lots of synchronization is required to keep all processes together.

    Each process writes its own set of files (file0*, file1*, file2*, ..., file[N-1]*). Results are automatically combined at the end of the solution, which facilitates postprocessing in /POST1, /POST26, or Workbench. (A cluster launch sketch follows below.)
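    As a sketch of how a cluster run is specified, Distributed ANSYS accepts a -machines list naming each host and its core count; the node names below are taken from the statistics listing later in this presentation, and the file names are illustrative:

    # 4 distributed processes: 2 on each of two nodes.
    # Each process solves one domain and writes its own file set
    # (file0*, file1*, file2*, file3*), combined automatically at the end.
    ansys145 -b -dis -machines hpclnxsmc00:2:hpclnxsmc01:2 -i input.dat -o output.out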

  • Distributed ANSYS - Capabilities

    A wide variety of features and analysis capabilities are supported:

    Static linear or nonlinear analyses; buckling analyses; modal analyses; harmonic response analyses using the FULL method; transient response analyses using the FULL method; single-field structural and thermal analyses; low-frequency electromagnetic analysis; high-frequency electromagnetic analysis; coupled-field analyses.

    All widely used element types and materials; superelements (use pass); NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE; linear perturbation; multiframe restarts; cyclic symmetry analyses; User Programmable Features (UPFs).

  • Distributed ANSYS - Equation Solvers

    Sparse direct solver (default): supports SMP, DMP, and GPU acceleration; can handle all analysis types and options; foundation for the Block Lanczos, Unsymmetric, Damped, and QR damped eigensolvers.

    PCG iterative solver: supports SMP, DMP, and GPU acceleration; symmetric, real-value matrices only (i.e., static/full transient); foundation for the PCG Lanczos eigensolver.

    JCG/ICCG iterative solvers: support SMP only. (A solver-selection sketch follows below.)
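    A minimal sketch of selecting a solver explicitly with the EQSLV command (the PCG tolerance shown is given only for illustration):

    /SOLU
    EQSLV,SPARSE         ! sparse direct solver (the default)
    ! ...or, for a symmetric static/full transient analysis:
    EQSLV,PCG,1.0E-8     ! PCG iterative solver with an illustrative tolerance
    SOLVE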

  • Distributed ANSYS - Eigensolvers

    Block Lanczos eigensolver (including QR damp): supports SMP and GPU acceleration.

    PCG Lanczos eigensolver: supports SMP, DMP, and GPU acceleration; great for large models (>5M DOF) with relatively few modes. (See the sketch below.)
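    A minimal sketch of choosing between these eigensolvers with MODOPT in a modal analysis; the mode count is an arbitrary illustration:

    /SOLU
    ANTYPE,MODAL
    MODOPT,LANB,10       ! Block Lanczos: request the first 10 modes
    ! ...or, for very large models needing relatively few modes:
    MODOPT,LANPCG,10     ! PCG Lanczos
    SOLVE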

  • Distributed ANSYS - Benefits

    Better architecture: more computations are performed in parallel, giving faster solution time. Better speedups than SMP: can achieve >10x on 16 cores (try getting that with SMP!) and can be used for jobs running on 1000+ CPU cores.

    Can take advantage of resources on multiple machines: memory usage and bandwidth scale, and disk (I/O) usage scales. A whole new class of problems can be solved!

  • Distributed ANSYS - Performance

    Need fast interconnects to feed fast processors. Two main characteristics for each interconnect: latency and bandwidth. Distributed ANSYS is highly bandwidth bound.

    +--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+

    Release: 14.5         Build: UP20120802      Platform: LINUX x64
    Date Run: 08/09/2012  Time: 23:07
    Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

    Total number of cores available    : 32
    Number of physical cores available : 32
    Number of cores requested          : 4 (Distributed Memory Parallel)
    MPI Type: INTELMPI

    Core  Machine Name  Working Directory
    ----------------------------------------------------
      0   hpclnxsmc00   /data1/ansyswork
      1   hpclnxsmc00   /data1/ansyswork
      2   hpclnxsmc01   /data1/ansyswork
      3   hpclnxsmc01   /data1/ansyswork

    Latency time from master to core 1 = 1.171 microseconds
    Latency time from master to core 2 = 2.251 microseconds
    Latency time from master to core 3 = 2.225 microseconds

    Communication speed from master to core 1 = 7934.49 MB/sec   (same machine)
    Communication speed from master to core 2 = 3011.09 MB/sec   (QDR InfiniBand)
    Communication speed from master to core 3 = 3235.00 MB/sec   (QDR InfiniBand)

  • Distributed ANSYS - Performance

    Need fast interconnects to feed fast processors.

    [Chart: Interconnect Performance. Rating (runs/day), 0 to 60, vs. core count (8, 16, 32, 64, 128 cores) for Gigabit Ethernet and DDR InfiniBand.]

    Turbine model: 2.1 million DOF, SOLID187 elements, nonlinear static analysis, sparse solver (DMP), Linux cluster (8 cores per node).

  • Distributed ANSYS - Performance

    Need fast hard drives to feed fast processors. Check the bandwidth specs.

    ANSYS Mechanical can be highly I/O bandwidth bound: the sparse solver in the out-of-core memory mode does lots of I/O. Distributed ANSYS can be highly I/O latency bound: the seek time to read/write each set of files causes overhead.

    Consider SSDs: high bandwidth and extremely low seek times.

    Consider RAID configurations: RAID 0 for speed; RAID 1, 5 for redundancy; RAID 10 for speed and redundancy.

  • Distributed ANSYS - Performance

    Need fast hard drives to feed fast processors.

    [Chart: Hard Drive Performance. Rating (runs/day), 0 to 30, vs. core count (1, 2, 4, 8 cores) for HDD and SSD.]

    8 million DOF, linear static analysis, sparse solver (DMP), Dell T5500 workstation (12 Intel Xeon X5675 cores, 48 GB RAM, single 7.2k rpm HDD, single SSD, Win7).

    Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Volume IV, Issue 1, pp. 37, 2010.

  • Distributed ANSYS - Performance

    Avoid waiting for I/O to complete! Check to see whether a job is I/O bound or compute bound by checking the output file for CPU and Elapsed times.

    When Elapsed time >> main thread CPU time, the job is I/O bound: consider adding more RAM or a faster hard drive configuration.

    When Elapsed time ~= main thread CPU time, the job is compute bound: consider moving the simulation to a machine with faster processors, using Distributed ANSYS (DMP) instead of SMP, or running on more cores or possibly using GPU(s).

    Total CPU time for main thread : 167.8 seconds
    . . . . . .
    Elapsed Time (sec) = 388.000    Date = 08/21/2012

    In this example the elapsed time (388 s) is more than twice the main thread CPU time (167.8 s), suggesting the job is I/O bound. (A quick way to pull these lines from an output file is sketched below.)
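    As a small convenience sketch, the two lines can be pulled straight from the output file on Linux; the job output name (job.out) is an assumption:

    # Compare wall-clock elapsed time against main-thread CPU time;
    # Elapsed >> CPU suggests the run is waiting on I/O.
    grep "Total CPU time for main thread" job.out
    grep "Elapsed Time (sec)" job.out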

  • Distributed ANSYS - Performance

                                  ANSYS 11.0   ANSYS 12.0   ANSYS 12.1   ANSYS 13.0 SP2   ANSYS 14.0

    Thermal (full model), 3M DOF
      Time                        4 hours      4 hours      4 hours      4 hours          1 hour / 0.8 hour
      Cores                       8            8            8            8                8 + 1 GPU / 32

    Thermomechanical simulation (full model), 7.8M DOF
      Time                        ~5.5 days    34.3 hours   12.5 hours   9.9 hours        7.5 hours
      Iterations                  163          164          195          195              195
      Cores                       8            20           64           64               128

    Interpolation of boundary conditions
      Time                        37 hours     37 hours     37 hours     0.2 hour         0.2 hour
      Load steps                  16           16           16           improved algo.   16

    Submodel: creep strain analysis, 5.5M DOF
      Time                        ~5.5 days    38.5 hours   8.5 hours    6.1 hours        5.9 hours / 4.2 hours
      Iterations                  492          492          492          488              498 / 498
      Cores                       18           16           76           128              64 + 8 GPU / 256

    Total time                    2 weeks      5 days       2 days       1 day            0.5 day

    All runs with the sparse solver. Hardware for 12.0: dual X5460 (3.16 GHz Harpertown Intel Xeon), 64 GB RAM per node. Hardware for 12.1 and 13.0: dual X5570 (2.93 GHz Nehalem Intel Xeon), 72 GB RAM per node. ANSYS 12.0 to 14.0 runs with DDR InfiniBand interconnect. ANSYS 14.0 creep runs with NROPT,CRPL + DDOPT,METIS.

    Results courtesy of MicroConsult Engineering, GmbH.

  • Distributed ANSYS - Performance

    [Chart: Solution Scalability. Speedup, 0 to 25, vs. number of cores (0 to 64).]

    Minimum time to solution is more important than scaling.

    Turbine model: 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node).

  • Distributed ANSYS - Performance

    [Chart: Solution Scalability. Solution elapsed time, 0 to 45,000 seconds, vs. number of cores (0 to 64); elapsed time drops from 11 hrs, 48 mins to 1 hr, 20 mins to 30 mins.]

    Minimum time to solution is more important than scaling.

    Turbine model: 2.1 million DOF, nonlinear static analysis, 1 load step, 7 substeps, 25 equilibrium iterations, Linux cluster (8 cores per node).

  • GPU Accelerator Capability

    Graphics processing units (GPUs): widely used for gaming and graphics rendering, and recently made available as general purpose accelerators, with support for double precision computations and performance exceeding the latest multicore CPUs.

    So how can ANSYS make use of this new technology to reduce the overall time to solution??

  • GPU Accelerator Capability

    Accelerate the sparse direct solver (SMP & DMP): the GPU is used to factor many dense frontal matrices. The decision on when to send data to the GPU is made automatically: if a frontal matrix is too small, there is too much overhead and it stays on the CPU; if a frontal matrix is too large, it exceeds GPU memory and is only partially accelerated.

    Accelerate the PCG/JCG iterative solvers (SMP & DMP): the GPU is only used for the sparse matrix-vector multiply (SpMV kernel). The decision on when to send data to the GPU is made automatically: if the model is too small, there is too much overhead and it stays on the CPU; if the model is too large, it exceeds GPU memory and is only partially accelerated. (A launch sketch follows below.)
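    A minimal sketch of enabling GPU acceleration at launch (R14.5-era option names; core and GPU counts are illustrative):

    # SMP run on 8 cores accelerated by one NVIDIA GPU
    ansys145 -b -np 8 -acc nvidia -i input.dat -o output.out

    # DMP run on 16 cores using multiple GPUs (-na sets the GPU count)
    ansys145 -b -dis -np 16 -acc nvidia -na 4 -i input.dat -o output.out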

  • GPU Accelerator Capability

    Supported hardware: currently supports NVIDIA Tesla 20-series, Quadro 6000, and Quadro K5000 cards. Next generation NVIDIA Tesla cards (Kepler) should work with R14.5.

    Installing a GPU requires the following: a larger power supply (a single card needs ~250 W) and an open 2x form factor PCIe x16 2.0 (or 3.0) slot.

    Supported platforms: Windows and Linux 64-bit platforms only; does not include the Linux Itanium (IA-64) platform.

  • GPU Accelerator Capability

    Targeted hardware (all NVIDIA):

                                 Tesla C2075   Tesla M2090   Quadro 6000   Quadro K5000   Tesla K10   Tesla K20
    Power (W)                    225           250           225           122            250         250
    Memory                       6 GB          6 GB          6 GB          4 GB           8 GB        6 to 24 GB
    Memory bandwidth (GB/s)      144           177.4         144           173            320         288
    Peak speed SP/DP (GFlops)    1030/515      1331/665      1030/515      2290/95        4577/190    5184/1728

    These NVIDIA Kepler-based products are not released yet, so specifications may be incorrect.

  • GPU Accelerator Capability

    GPUs can offer significantly faster time to solution.

    [Chart: GPU Performance. Relative speedup: 1.0x at 2 cores (no GPU), 2.6x at 8 cores (no GPU), 3.8x at 8 cores (1 GPU).]

    6.5 million DOF, linear static analysis, sparse solver (DMP), 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7.

  • GPU Accelerator Capability

    GPUs can offer significantly faster time to solution.

    [Chart: GPU Performance. Relative speedup: 1.0x at 2 cores (no GPU), 2.7x at 8 cores (1 GPU), 5.2x at 16 cores (4 GPUs).]

    11.8 million DOF, linear static analysis, PCG solver (DMP), 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7.

  • GPU Accelerator Capability

    Supports the majority of ANSYS users: covers both the sparse direct and PCG iterative solvers, with only a few minor limitations.

    Ease of use: requires at least one supported GPU card to be installed and at least one HPC Pack license; no rebuild, no additional installation steps.

    Performance: ~10-25% reduction in time to solution when using 8 CPU cores. Should never slow down your simulation!

  • How will you use all of this computing power?

    Design optimization studies.

    Higher fidelity: full assemblies, more nonlinear.

  • HPC Licensing

    ANSYS HPC Packs enable high-fidelity insight. Each simulation consumes one or more packs, and the parallel enabled increases quickly with added packs.

    A single solution for all physics and any level of fidelity. Flexibility as your HPC resources grow: reallocate packs as resources allow.

    Packs per simulation       1    2    3     4     5
    Parallel enabled (cores)   8    32   128   512   2048
    GPUs enabled               1    4    16    64    256

  • HPC Parametric Pack Licensing

    Scalable, like ANSYS HPC Packs. Enhances the customer's ability to include many design points as part of a single study, ensuring sound product decision making.

    Amplifies the complete workflow: design points can include execution of multiple products (pre, solve, HPC, post).

    Packaged to encourage adoption of the path to robust design!

    Number of HPC Parametric Pack licenses         1    2    3     4     5
    Number of simultaneous design points enabled   4    8    16    32    64

  • HPC Revolution

    The right combination of algorithms and hardware leads to maximum efficiency:

    SMP vs. DMP; HDD vs. SSDs; interconnects; clusters; GPUs.

  • HPC Revolution

    Every computer today is a parallel computer.

    Every simulation in ANSYS can benefit from parallel processing.