Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and...

29
Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on Extreme Scale Computing (JLESC) Franck Cappello, Argonne National Laboratory University of Illinois at Urbana Champaign

Transcript of Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and...

Page 1: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

Inria-UIUC/NCSA-ANL-BSC-JSC-RikenJointLaboratoryonExtremeScale

Computing(JLESC)

FranckCappello,ArgonneNationalLaboratory

UniversityofIllinoisatUrbanaChampaign

Page 2: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

Why?AndWhatisit?

Page 3: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

JLESCPurpose

JLESCInternational,

virtualorganization

Fundinginstitutions

FullPartners

Researchers,Students,Engineers,FromCS,MathsAndalsootherdiscisplines

MakingthebridgebetweenPetascale andExtremecomputing

Page 4: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

JLESCObjectives

JLESCInitiateandfacilitateinternationalcollaborations

Context:Computationalanddatafocusedsimulationandanalyticsatscale.

Originalideas,publications,discussionforums,researchreports,productsandopensourcesoftware

Page 5: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

JLESCtimeline

• Createdin2009withInria andUIUCasthepartners• Argonnejoinedin2013• BarcelonaSupercomputingCenter joinedin2014• Julich Supercomputingcenterjoinedin2015• Riken/AICS joinedin2015à Runningfullspeed(mayenlargethenumberofpartnerslater)

TheJLESCagreementwassignedin2014for4years.Renewedifthemajorityoffullpartnersagree

Page 6: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

• AddressHeterogeneity– Cores:(slim)computecores,accelerators,(fat)systemcores– Deepermemoryhierarchy:Caches,onsockethighbandwidthmemory,DRAM,3D

Xpoint (fastnonvolatile),NVM(NANDbased)

• Loadbalancing– Staticreasons:processvariation-gatelength,wirewidth,etc.,– Dynamicreasons:powermanagement,Turbomodes,etc.– HeterogeneousApplications(multi-physics,workflows)

• Resilience– notonlyforprocesscrash,DUEandradiationinducedSDC– ButalsoforBugsthatcouldleadtosystematicerrors

• PlateauingoftheHDDstoragebandwidth– Memorykeepsincreasing(X10PB),storageI/Oisnot(1TB/s).– Insitudataanalytics,BurstBuffers,Compression(errorboundlossy compression)

à AlotofpressureonApplications,Numericalalgorithms,Runtimes,SystemandI/Odevelopers

ChallengesfromPetascale toExtremescale

Page 7: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

JLESCResearchtopics

•Computationalsciences,application,mini-apps•Parallelprogrammingmodelsandlibraries,•Numericalalgorithmsandlibraries,•Dataanalytics,graphalgorithms,•ParallelI/Osystemsandlibraries,•Storageinfrastructuredesignandefficiency•Systemandapplicationperformanceanalysisandmodeling(Tools),•Resilience(checkpoint/restart,SDCdetection)

Page 8: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

JLESCMatchMaking/Reporting

• A2to3-daysworkshopeverysixmonths(nextonesLyonandKobe)• Asummerschooleveryyear• Researcher/studentexchanges. (fromdaysto1yearormore)

Page 9: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

JLESCResearchProject

JLESCJointProjects

Partnerinstitutions

Co-investigator

Co-investigator

NewideasPublications,Software

Aprojectreportpresented intheJLESCactivityreport

Aprojectproposal:Astatementofgoalsfortheproject,aprogramofwork,andaschedule toaccomplishthegoals.

ResearchresultsPresentedatJLESCworkshops

EvaluatedbytheJLESCcommittees

EvaluatedbytheJLESCcommittees

Page 10: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

JLESCFacilities

AninstitutioninvolvedinJLESCwillprovidetoJLESCvisitorsinvolvedincollaborations(ifavailable):•Officespace,internetaccess,administrativesupport,and•AccesstoitslocalHPCresourcesduringtheirvisit.

Page 11: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

WhatCollaborations?

Page 12: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

Applications arethedriversofJLPCresearch

HACC:(Hardware/HybridAccelerated CosmologyCode)Framework.NewCosmologicalN-BodyFramework,DesignedforextremeperformanceANDportability,includingheterogeneoussystems,Supportsmultipleprogrammingmodels,Memoryefficient,InsituanalysisframeworkLargestExecutionsonMira

TheCommunityEarthSystemModel(CESM/ACME)isafully-coupled,globalclimatemodelthatprovidesstate-of-the-artcomputersimulationsoftheEarth'spast,present,andfutureclimatestates.TheCESMprojectwilladdressimportantareasofclimatesystemresearch.Inparticular,itisaimedatunderstandingandpredictingtheclimatesystem.Multiplecoupledmodules

SPECFEM3Dsimulatesseismicwavepropagationinsedimentarybasinsoranyotherregionalgeologicalmodel.SPECFEM3Dversion2.0,named"Sesame",usesthecontinuousGalerkin spectral-elementmethod,tosimulateforwardandadjointcoupledacoustic-(an)elastic seismicwavepropagationonarbitraryunstructuredhexahedralmeshes.HybridcodemixingCPUsandGPUs

NAMD,recipientofa2002GordonBellAward,isaparallelmoleculardynamicscodedesignedforhigh-performancesimulationoflargebiomolecular systems.BasedonCharm++parallelobjects,NAMDscalestohundredsofprocessorsonhigh-endparallelplatforms.Largescaledistributedobjectsprogram

Page 13: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

InternationalcollaborationswithEuropeancollaborators

ComparisonofMeshingandCFDMethodsforAccurateFlowSimulationsonHPCsystemMembers:M.Tsubokura (RIKEN),A.Lintermann (JSC)

StrongScalabilityEnhancements toFMMforMolecularDynamicsSimulationsMembers:A.Amer,P.Balaji (ANL),I.Kabadshow (JSC),D.Haensel (JSC)

FastIntegratorsforScalableQuantumMolecularDynamicsMembers:A.Schleife (UIUC),E.Constantinescu(ANL)

HPClibrariesforsolvingdensesymmetriceigenvalueproblemsMembers:T.Imamura(RIKEN),I.Gutheil (JSC)

SharedInfrastructureforSourceTransformationAutomaticDifferentiationMembers:L.Hascoët (INRIA),S.Hari KrishnaNarayanan,PaulHovland (ANL)

ReducingCommunication inSparseIterativeandDirectSolversMembers:A.Bienz,L.Olson,B.Gropp (UIUC),L.Grigori,(INRIA)

IterativeanddirectsolversinahybridMPI/OpenMP computationalmechanicscode(Alya)Members:G.Houzeaux (BSC),LucGiraud,PierreRamet (Inria)

OptimizingChASE eigensolver forBethe-Salpeter computationsonmulti-GPUsMembers:A.Schleife (UIUC),E.DiNaopli (JSC)

ActiveProjects(Apps&Numerics)

Page 14: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

InternationalcollaborationswithEuropeancollaborators

EnhancingAsynchronousParallelisminOmpSs withArgobotsMembers:P.Balaji,A.Amer,S.Seo (ANL),J.Labarta,R.M.Badia,X.Teruel,V.BeltranQuerol (BSC)

LoadrebalancingofaparticlecodeMembers:C.Lachat,E.Jeannot,A.Denis (Inria),G.Houzeaux,M.Vazquez (BSC)

Developertoolsforporting&tuningparallelapplicationsonextreme-scaleparallelsystemsMembers:B.J.N.WylieC.Feld(JSC),M.TsujiH.Muraj (RIKEN),J.Gimenez (BSC)

StudytheuseoftheFoldinghardware-basedprofilertoassistondatadistributionforheterogeneousmemorysystemsinHPC

Members:A.J.Pena,H.Servat,J.Labarta (BSC),P.Balaji (ANL)

ActiveProjects(Prog.Models,Runtime&tools)

Page 15: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

InternationalcollaborationswithEuropeancollaborators

Resourcemanagementandschedulingfordata-intensiveHPCworkflowsMembers:N.l Cheriere,GabrielAntoniu (Inria),M.Dorier,R.Ross(ANL)

ElasticprovisioningfordatastreamsMembers:L.Pineda, O.Marcu,Alexandru Costan,GabrielAntoniu (Inria),B. Subramaniam,KateKeahey (ANL),

Towardtaminglargeandcomplexdataflowsindatacentric supercomputingMembers:E.Jeannot (Inria),FrançoisTessier (Inria,nowANL),V.Vishwanath (ANL)

Extreme-ScaleWorkflowTools:Swift,Decaf,Damaris,andFlowVRMembers:J.M.Wozniak,T.Peterka,M.Dreher,M.Dorier,M.Dreher,R.Ross (ANL),S.Ibrahim,O.Yildiz,G.Antoniu,B.Raffin (INRIA)

SmartinsituvisualizationMembers:L.Rahmani,L.Bougé,G.Antoniu(Inria),M.Dorier (Inria,nowANL),T.Peterka (ANL)

ExploitingOmnisc’IO toimporve scheduling,prefetchingandcachinginExascale systemsMembers:S.Ibrahim,G.Antoniu (Inria),M.Dorier (Inria nowANL),R.Ross(ANL)

TowardsInterference-awarescheduling inHPCsystemsMembers:O.Yildiz,S.Ibrahim,G.Antoniu (Inria),M.Dorier (Inria,nowANL),R.Ross(ANL)

ModelingandavoidingexecutioninterferencesMembers:F.Wagner,R.Bleuse,D.Trystram (Inria),G.Bauer(UIUC),F.Cappello,(ANL)

ActiveProjects(I/O,Visualization,Storage)

Page 16: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

InternationalcollaborationswithEuropeancollaborators

EvaluatingLossy CompressionforHPCcheckpointingMembers:J.CalhounL.OlsonM.Snir (UIUC),ShengDi,F.Cappello (ANL)

NewTechniquestoDesignSilentDataCorruptionDetectorsMembers:O.Subasi,L.BautistaGomez(ANL,nowBSC),O.Unsal,J.Labarta (BSC),S.Di,P.-L.Guhur,F.Cappello(ANL),A.Benoit,Y.Robert,A.Cavelan,H.Sun(Inria)

Hybridresilience forMPI+taskscodesMembers:T.Martsinkevich (Inria,nowRiken),F.Cappello (ANL),O.Subasi,O.Unsal (BSC)

OptimizationoffaulttolerancestrategyforworkflowMembers:S.Di,T.Peterka ,F.Cappello (ANL),A.Cavelan ,A.Benoit,YvesRobert(Inria)

ActiveProjects(Resilience)

Page 17: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

Successstories

Page 18: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

TheG8(multi-disciplinary)ECSprojectEnablingClimateSimulationatExtremeScale

F.Cappello(MainPI, ANL-UIUC,INRIA),M.Snir (ANL,UIUC),G.Bosilca (UTK),L.Giraud(INRIA),J.Labarta (BSC),R.Loft(NCAR),S.Matsuoka(Titech),M.Sato(U.Tsukuba),F.Wolf(GRS),O.Unsal,A.Weaver(U.Victoria),J.White(NCAR),D.Wuebbles (UIUC)

• Objectives:• Understandwhatwillbethemajorobstacles that

theclimatecommunitywillfaceatExascale• Proposeandevaluatepossiblewaystoovercome

theseobstacles• TargetCodesandSystems:

• CESM(NorthAmerica)andNICAM(Japan)• KComputer,BlueGene P/Q,BlueWaters,Tsubame 2

• Results:• Identifiedscenariosofclimatesimulationsatextremescale• Modelingandauto-tuning/scheduling forintra-nodeheterogeneity• Identifylikelyscalabilitybottlenecks&definepossiblesolutions• Newhybridfault-tolerantprotocols

CESMNICAM

• 58participants• 40publications,25talks• 7softwareprototypes• 11advisedstudents• 24visits

Page 19: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

CrackingtheFaultToleranceChallengeofExascale Systems(since2009)

Team:OsmanUnsal,OmerSubasi (BSC),TatianaMartsinlevich,TomasRopars,AminaGuermouche,YvesRobert (Inria),LeonardoBautistaGomez,ShengDi,FranckCappello,MarcSnir (ANLandUIUC)Firstchallenge:ReduceCheckpoint/restarttimeà FTI(faulttoleranceInterface)formulti-level checkpointingà Lossy compressionofcheckpointsandrestartfromlossy statesSecondchallenge:Reduceasmuchaspossiblecheckpoint/restartoverheadà Formulationofoptimalcheckpointintervalsformulti-level checkpointingThirdChallenge: Acceleratingrestartà Identifiedthesend-determinism anddeveloped

severallowoverheadmessage loggingprotocol.à Solvedmemoryoccupationissueofmessage logging.FourthChallenge:DetectionofSystematicandnon-systematicSilentDataCorruptionsà DevelopedanewclassoflowoverheadSDCdetectorsbasedon

surrogatefunctions(usingprediction,machine learningandexploitingnumerical properties)

>10visits,>10publications,5softwareprototypes

Restartfromlossy state

Recoveryacceleration

Recall

Page 20: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

Idea:onededicatedI/OcorepermulticorenodeOriginality:sharedmemory,asynchronousprocessingImplementation:softwarelibraryApplication:tornadosimulation,fluiddynamicsTransferredtotheBlueWaterssupercomputer(2011)

AddressingI/OBottleneck(AcriticalproblematExascale)

Damaris:AMiddlewareforI/OonMulticoreTeam:GabrielAntoniu (Inria),Matthieu Dorier (InrianowANL),RobRoss(ANL),MarcSnir (UIUC)

20

• Scaleson10,000+coresonKraken(11thofTop500in2010)

• Scaleson16,000+coresonTitan(1stofTop500in2012)

• x12lessfiles• x15writethroughput• Jitterfullyhidden• Predictableperformance• Damaris ADTproject:(2015-2017)

20

Page 21: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

AddressingI/OInterference

2121

• Identifyingpotentialrootcausesofinterference• Networkinterfaceandprotocols• Networkcomponents• Storageservers,schedulers• Storagedevices(disks,SSDs)

• Interplaybetweentheabovesources

O.Yildiz etal.IPDPS2016,OntheRootCausesofCross-ApplicationI/OInterferenceinHPCStorageSystems

• Mitigating I/Ointerference:theCALCioM approach

• Mainideas:• applicationscommunicate

withoneanother• differentcoordination

strategiescanbeimplemented

M.Dorieretal.IPDPS2014,CALCioM:MitigatingI/OInterferenceinHPCSystemsthroughCross-ApplicationCoordination

Page 22: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

22

LaurentHascoët,Valeri Pascual (Inria)PaulHovland,SriHari KrishnaNarayanan,MichelSchanen,JeanUtke*(ANL).

PavingtheroadofAlgorithmicDifferentiation(Fundamentalandtechnicallychallenging)

• ADisatechnique forcomputingderivativesofprograms– ExampleofDerivativesuse:sensitivityanalysis

• StudythesensitivityofthemodeloutputsWRTuncertaininputparameters• IfRisafunctionofXandY,hentheuncertaintyinRisobtainedby:• WhereåX andåY aretheuncertaintyonXandY

– Challenges:Efficiencyofderivativecode,complexityofADtooldevelopment,dynamicallocationofmemory.

• Efficiency:DevelopedanimplementationoftheChristiansonmethodfordifferentiatingfixedpoint iterationswithinOpenAD (withDanGoldberg fromEdinburgh).Application:MITgcm packagehalfpipe_streamice

• ComplexityofADTools:DevelopingTapenadeXAIF -- AninterfacebetweenInria’sTapenadeandOpenAD fromArgonne.LeveragesTapenade’ssourceanalysisandOpenAD’s differentiationalgorithms.Application:OpenAD regressionsuite

• MemoryAllocation:DevelopedADMM– thefirstlibrarytosupportADofcodescontainingdynamicmemoryallocation/deallocation.Application:Chemicalengineeringprocessmodel: simulatedmovingbed(SMB)

SmithGlacier,Antarctica

Page 23: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

PromisingStories

Page 24: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

24

-Discussion initiatedDecember2015JLESCworkshoponFPGAforHPCanddataanalytics-SpecificworkshopatANLonFPGA(Riken,BSC,UIUCandANL)withXilinxandAltera-Sessiononadvancedarchitecture intheJune2016JLESCworkshop-SpecificworkshopatUIUConFPGA

BeyondMoore’slaw:ReconfigurableComputing

Example:Stratix 10FPGASoC:

SecureDeviceManager

Quad64bitsARMCortex-A53HardProcessor(1.5GHz)

EmbeddedMulti-DieInterconnectBridge

5.5Mlogicelements1GHzfabricclocking(2Xpreviousgen.)

Intel14nmtri-gatetechnology(KnightsLanding)

Multi-dieonsubstrate

•10TFLOPs (single-precision IEEE754DSPs)•~100GFlops/Watt(SP)•64bitquad-coreARM(1.5GHz)•1Tbps bandwidth (Hybrid MemoryCube)•Upto14430Gbps transceivers•5.5Mlogicelements(FPGA)•1GHzfabricclocking(2Xpreviousgen.)

HowtoincreasetheperformancewhenFrequency,Power,andIntegrationarebounded?1)Dedicatedarchitectures(Anton),2)Reconfigurable

Page 25: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

WhatImpact?

Page 26: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

EvaluationCriteriafortheScientific/AdvisoryCommittee• Impacton extremescaleplatforms• ImpactonthebroadHPCcommunity• ImpactonHPCresearchinthedifferentinstitutions

• Qualityandnumberofpublications• Softwareprototypes– open-sourceornot,numberofdownloads• IntegrationofthesoftwareprototypesintoExtremeScalePlatform

softwarestack• Researchgrants• Visibility:Keynotes,pressreleases,interviews• Educationofyoungresearchers:advisingandteaching• Transfer, involvementofindustry,patents,standardization• Researchactivitiesmanagement(includingworkshoporganization,

meetingorganization)• Dissemination(toalargepublicaudience)

Page 27: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

Quantitativeevaluation(notcounting2015-2016)

Numberofactiveprojects:10to23

TotalnumberofPersonMonths:829

Numberofvisits:63

Numberofacceptedpublications:61

Numberofsoftware:6

Studentadvising:29

Numberofadditionalfunding:5

Numberofstudentawards:4

727

2009-2010 2010-2011 2011-2012 2012-2013 2013-2014 2014-2015

5152009-2010 2010-2011 2011-2012 2012-2013 2013-2014 2014-2015

Page 28: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

JLESCasaResearchObjectJLESCisthemostsuccessfulinternationalcollaborationsofUIUC

Why?

ScootPoole(HumanScience)usesJLESCasaresearchobject:

• Understandwhatmake collaborationswork

• Formalize

• Helpothercollaborations

JLESC

Page 29: Inria-UIUC/NCSA-ANL-BSC-JSC-Riken Joint Laboratory on ... · àLossycompression of checkpoints and restart from lossystates Second challenge: Reduce as much as possible checkpoint/restart

Thankyou