HPC for Mechanical ANSYS

Date posted: 31-Oct-2015

  • High Performance Computing for Mechanical Simulations using ANSYS

    Jeff Beisheim, ANSYS, Inc.

  • HPC Defined

    High Performance Computing (HPC) at ANSYS: an ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing.

    It is a hardware and software initiative!

  • Need for Speed

    Impact product design. Enable large models. Allow parametric studies.

    Modal, nonlinear, multiphysics, dynamics.

    Assemblies, CAD to mesh, capture fidelity.

  • A History of HPC Performance

    1980s      Vector processing on mainframes
    1990       Shared Memory Multiprocessing (SMP) available
    1994       Iterative PCG solver introduced for large analyses
    1999-2000  64-bit large memory addressing
    2004       1st company to solve 100M structural DOF
    2005-2007  Distributed PCG solver; Distributed ANSYS (DMP) released; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
    2007-2009  Optimized for multicore processors; teraflop performance at 512 cores
    2010       GPU acceleration (single GPU; SMP)
    2012       GPU acceleration (multiple GPUs; DMP)

  • HPC Revolution

    Recent advancements have revolutionized the computational speed available on the desktop: multicore processors (every core is really an independent processor), large amounts of RAM and SSDs, and GPUs.

  • Parallel Processing: Hardware

    2 types of memory systems:

    Shared memory parallel (SMP): single box, workstation/server.

    Distributed memory parallel (DMP): multiple boxes, cluster.

    (Diagram: workstation vs. cluster)

  • Parallel Processing: Software

    2 types of parallel processing for Mechanical APDL:

    Shared memory parallel (-np > 1): first available in v4.3; can only be used on a single machine.

    Distributed memory parallel (-dis -np > 1): first available in v6.0 with the DDS solver; can be used on a single machine or cluster.

    GPU acceleration (-acc): first available in v13.0 using NVIDIA GPUs; supports using either a single GPU or multiple GPUs; can be used on a single machine or cluster.
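As a sketch, the launch line for each of these modes can be assembled as follows. The `-np`, `-dis`, and `-acc` flags are the options named above; the executable name `mapdl` and the input/output file names are placeholders, since the actual launcher name varies by release.

```python
# Sketch: assemble a Mechanical APDL batch launch command for each parallel
# mode. "mapdl" is a placeholder executable name (real launcher names vary
# by release); the flags are the -np / -dis / -acc options described above.

def mapdl_command(cores=1, distributed=False, gpu=False,
                  inp="input.dat", out="output.out"):
    """Return the argv list for a Mechanical APDL batch run."""
    cmd = ["mapdl"]
    if distributed:
        cmd.append("-dis")           # distributed memory parallel (DMP)
    cmd += ["-np", str(cores)]       # number of cores (SMP or DMP)
    if gpu:
        cmd += ["-acc", "nvidia"]    # GPU acceleration
    cmd += ["-i", inp, "-o", out]
    return cmd

print(" ".join(mapdl_command(cores=8, distributed=True, gpu=True)))
# mapdl -dis -np 8 -acc nvidia -i input.dat -o output.out
```

Omitting `-dis` gives the SMP form (`mapdl -np 4 ...`); adding it switches the same run to DMP.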

  • Distributed ANSYS Design Requirements

    No limitation in simulation capability: must support all features; continually working to add more functionality with each release.

    Reproducible and consistent results: same answers achieved using 1 core or 100 cores; same quality checks and testing are done as with the SMP version; uses the same code base as the SMP version of ANSYS.

    Support all major platforms: most widely used processors, operating systems, and interconnects; supports the same platforms that the SMP version supports; uses the latest versions of MPI software, which support the latest interconnects.

  • Distributed ANSYS Design

    Distributed steps (-dis -np N): at the start of the first load step, decompose the FEA model into N pieces (domains). Each domain goes to a different core to be solved.

    The solution is not independent!! Lots of communication is required to achieve the solution, and lots of synchronization is required to keep all processes together.

    Each process writes its own set of files (file0*, file1*, file2*, ..., file[N-1]*). Results are automatically combined at the end of the solution, facilitating postprocessing in /POST1, /POST26, or Workbench.
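The per-process file naming above can be illustrated with a small sketch (the jobname `file` matches the slide; the `.esav` extension is just one illustrative example of the per-domain file sets):

```python
# Sketch of the per-process file naming described above: a DMP run with N
# processes writes file0.*, file1.*, ..., file[N-1].* for the default
# jobname "file". The .esav extension is one illustrative example.

def dmp_filenames(nproc, jobname="file", ext="esav"):
    """Names of one per-domain file set a Distributed ANSYS run writes."""
    return [f"{jobname}{rank}.{ext}" for rank in range(nproc)]

print(dmp_filenames(4))
# ['file0.esav', 'file1.esav', 'file2.esav', 'file3.esav']
```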

  • Distributed ANSYS Capabilities

    A wide variety of features and analysis capabilities are supported: static linear or nonlinear analyses; buckling analyses; modal analyses; harmonic response analyses using the FULL method; transient response analyses using the FULL method; single-field structural and thermal analyses; low-frequency electromagnetic analyses; high-frequency electromagnetic analyses; coupled-field analyses; all widely used element types and materials; superelements (use pass); NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE; linear perturbation; multiframe restarts; cyclic symmetry analyses; User Programmable Features (UPFs).

  • Distributed ANSYS Equation Solvers

    Sparse direct solver (default): supports SMP, DMP, and GPU acceleration; can handle all analysis types and options; foundation for the Block Lanczos, Unsymmetric, Damped, and QR damped eigensolvers.

    PCG iterative solver: supports SMP, DMP, and GPU acceleration; symmetric, real-value matrices only (i.e., static/full transient); foundation for the PCG Lanczos eigensolver.

    JCG/ICCG iterative solvers: support SMP only.

  • Distributed ANSYS Eigensolvers

    Block Lanczos eigensolver (including QR damp): supports SMP and GPU acceleration.

    PCG Lanczos eigensolver: supports SMP, DMP, and GPU acceleration; great for large models (>5M DOF) with relatively few modes (…

  • Distributed ANSYS Benefits

    Better architecture: more computations performed in parallel means faster solution time. Better speedups than SMP: can achieve >10x on 16 cores (try getting that with SMP!); can be used for jobs running on 1000+ CPU cores.

    Can take advantage of resources on multiple machines: memory usage and bandwidth scale; disk (I/O) usage scales. A whole new class of problems can be solved!

  • Distributed ANSYS Performance

    Need fast interconnects to feed fast processors. Two main characteristics for each interconnect: latency and bandwidth. Distributed ANSYS is highly bandwidth bound.

    +--------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S ------------+
    Release: 14.5        Build: UP20120802     Platform: LINUX x64
    Date Run: 08/09/2012       Time: 23:07
    Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
    Total number of cores available    : 32
    Number of physical cores available : 32
    Number of cores requested          : 4 (Distributed Memory Parallel)
    MPI Type: INTELMPI

    Core  Machine Name  Working Directory
    ----------------------------------------------------
       0  hpclnxsmc00   /data1/ansyswork
       1  hpclnxsmc00   /data1/ansyswork
       2  hpclnxsmc01   /data1/ansyswork
       3  hpclnxsmc01   /data1/ansyswork

    Latency time from master to core 1 = 1.171 microseconds
    Latency time from master to core 2 = 2.251 microseconds
    Latency time from master to core 3 = 2.225 microseconds

    Communication speed from master to core 1 = 7934.49 MB/sec  (same machine)
    Communication speed from master to core 2 = 3011.09 MB/sec  (QDR InfiniBand)
    Communication speed from master to core 3 = 3235.00 MB/sec  (QDR InfiniBand)
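A rough per-message cost model, time = latency + size/bandwidth, makes the bandwidth-bound point concrete using the sample values from the statistics block above. The model itself is a simplification for illustration, not something ANSYS reports:

```python
# Toy cost model for one MPI message: time ~= latency + size / bandwidth.
# Latency and bandwidth values below are the sample numbers from the
# Distributed ANSYS statistics listing above.

def transfer_time_us(size_mb, latency_us, bandwidth_mb_per_s):
    """Estimated transfer time in microseconds for one message."""
    return latency_us + size_mb / bandwidth_mb_per_s * 1e6

# 8 MB message: master -> core 1 (same machine) vs core 2 (QDR InfiniBand)
same_machine = transfer_time_us(8, 1.171, 7934.49)
infiniband   = transfer_time_us(8, 2.251, 3011.09)
print(f"{same_machine:.0f} us vs {infiniband:.0f} us")
# 1009 us vs 2659 us
```

For messages of this size the latency term is negligible next to the bandwidth term, which is why the slide calls Distributed ANSYS highly bandwidth bound.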

  • Distributed ANSYS Performance

    Need fast interconnects to feed fast processors.

    [Chart: Interconnect Performance. Rating (runs/day) vs. core count (8, 16, 32, 64, 128 cores), Gigabit Ethernet vs. DDR InfiniBand.]

    Turbine model: 2.1 million DOF, SOLID187 elements, nonlinear static analysis, sparse solver (DMP), Linux cluster (8 cores per node).

  • Distributed ANSYS Performance

    Need fast hard drives to feed fast processors. Check the bandwidth specs.

    ANSYS Mechanical can be highly I/O bandwidth bound: the sparse solver in the out-of-core memory mode does lots of I/O.

    Distributed ANSYS can be highly I/O latency bound: the seek time to read/write each set of files causes overhead.

    Consider SSDs: high bandwidth and extremely low seek times.

    Consider RAID configurations: RAID 0 for speed; RAID 1, 5 for redundancy; RAID 10 for speed and redundancy.
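The RAID trade-offs listed above can be sketched as a toy capacity/redundancy calculator. This is a deliberate simplification (identical drives, nominal behavior per level); real controllers and workloads differ:

```python
# Simplified sketch of the RAID levels named above: usable capacity and
# whether the array survives a single drive failure. Assumes identical
# drives and nominal behavior; real controllers differ.

def raid_usable(level, n_drives, drive_tb):
    """Return (usable capacity in TB, has redundancy) for a RAID level."""
    if level == "0":     # striping: speed, no redundancy
        return n_drives * drive_tb, False
    if level == "1":     # mirroring: redundancy, capacity of one drive
        return drive_tb, True
    if level == "5":     # striping with parity: one drive's worth lost
        return (n_drives - 1) * drive_tb, True
    if level == "10":    # striped mirrors: speed and redundancy
        return n_drives // 2 * drive_tb, True
    raise ValueError(f"unknown RAID level: {level}")

print(raid_usable("10", 4, 1.0))
# (2.0, True)
```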

  • Distributed ANSYS Performance

    Need fast hard drives to feed fast processors.

    [Chart: Hard Drive Performance. Rating (runs/day) vs. core count (1, 2, 4, 8 cores), HDD vs. SSD.]

    8 million DOF, linear static analysis, sparse solver (DMP), Dell T5500 workstation (12 Intel Xeon X5675 cores, 48 GB RAM, single 7.2k rpm HDD, single SSD, Win7).

    Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Volume IV, Issue 1, pp. 37, 2010.

  • Distributed ANSYS Performance

    Avoid waiting for I/O to complete! Check to see whether the job is I/O bound or compute bound by checking the output file for the CPU and Elapsed times.

    When Elapsed time >> main thread CPU time, the job is I/O bound: consider adding more RAM or a faster hard drive configuration.

    When Elapsed time is close to the main thread CPU time, the job is compute bound: consider moving the simulation to a machine with faster processors, using Distributed ANSYS (DMP) instead of SMP, or running on more cores or possibly using GPU(s).

    Total CPU time for main thread : 167.8 seconds
    . . . . . .
    Elapsed Time (sec) = 388.000   Date = 08/21/2012
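The check above can be written in a few lines, using the sample times from the listing. The 1.5x threshold is an illustrative assumption, not an ANSYS rule; the slide only says "much greater than":

```python
# Heuristic from the slide: a run whose elapsed (wall clock) time is much
# larger than its main-thread CPU time is spending the difference waiting,
# typically on I/O. The 1.5x factor is an illustrative assumption.

def is_io_bound(cpu_time_s, elapsed_s, factor=1.5):
    """True when elapsed time is much larger than main-thread CPU time."""
    return elapsed_s > factor * cpu_time_s

# Sample numbers from the output listing above: 167.8 s CPU, 388.0 s elapsed.
print(is_io_bound(167.8, 388.0))
# True: this run is likely I/O bound
```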

  • Distributed ANSYS Performance

    All runs with the sparse solver. Hardware, 12.0: dual X5460 (3.16 GHz Harpertown Intel Xeon), 64 GB RAM per node. Hardware, 12.1 + 13.0: dual X5570 (2.93 GHz Nehalem Intel Xeon), 72 GB RAM per node. ANSYS 12.0 to 14.0 runs with DDR InfiniBand interconnect. ANSYS 14.0 creep runs with NROPT,CRPL + DDOPT,METIS.

    Columns: ANSYS 11.0 | ANSYS 12.0 | ANSYS 12.1 | ANSYS 13.0 SP2 | ANSYS 14.0
    (rows with six entries appear to list two ANSYS 14.0 configurations)

    Thermal (full model), 3M DOF
      Time:  4 hours | 4 hours | 4 hours | 4 hours | 1 hour | 0.8 hour
      Cores: 8 | 8 | 8 | 8 | 8+1 GPU | 32

    Thermomechanical simulation (full model), 7.8M DOF
      Time:       ~5.5 days | 34.3 hours | 12.5 hours | 9.9 hours | 7.5 hours
      Iterations: 163 | 164 | 195 | 195 | 195
      Cores:      8 | 20 | 64 | 64 | 128

    Interpolation of boundary conditions
      Time:       37 hours | 37 hours | 37 hours | 0.2 hour | 0.2 hour
      Load steps: 16 | 16 | 16 | improved algorithm | 16

    Submodel: creep strain analysis, 5.5M DOF
      Time:       ~5.5 days | 38.5 hours | 8.5 hours | 6.1 hours | 5.9 hours | 4.2 hours
      Iterations: 492 | 492 | 492 | 488 | 498 | 498
      Cores:      18 | 16 | 76 | 128 | 64+8 GPU | 256

    Total time: 2 weeks | 5 days | 2 days | 1 day | 0.5 day

    Results courtesy of MicroConsult Engineering, GmbH

  • Distributed ANSYS Performance

    [Chart: Speedup (0 to 25) vs. number of cores (0 to 64). Title truncated in source: "SolutionSca…"]