
Lawrence Berkeley National Laboratory Recent Work

Title: Storage 2020: A Vision for the Future of HPC Storage

Permalink: https://escholarship.org/uc/item/744479dp

Authors: Lockwood, GK; Hazen, D; Koziol, Q; et al.

Publication Date: 2017-10-20

Peer reviewed


Storage 2020: A Vision for the Future of HPC Storage

Glenn K. Lockwood, Damian Hazen, Quincey Koziol, Shane Canon, Katie Antypas, Jan Balewski, Nicholas Balthaser, Wahid Bhimji, James Botts, Jeff Broughton, Tina L. Butler, Gregory F. Butler, Ravi Cheema, Christopher Daley, Tina Declerck, Lisa Gerhardt, Wayne E. Hurlbert, Kristy A. Kallback-Rose, Stephen Leak, Jason Lee, Rei Lee, Jialin Liu, Kirill Lozinskiy, David Paul, Prabhat, Cory Snavely, Jay Srinivasan, Tavia Stone Gibbins, Nicholas J. Wright

National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory

Berkeley, CA 94720

Report No. LBNL-2001072

November 2017

This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.


This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor the Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or the Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof or the Regents of the University of California.


Table of Contents

1. Introduction
2. NERSC Storage Hierarchy
   2.1. Current Storage Infrastructure at NERSC
   2.2. Workflow-based Model for Storage
3. Requirements
   3.1. Current I/O Patterns
   3.2. NERSC-9 Requirements
   3.3. DOE Exascale Requirements Reviews
   3.4. Emerging Applications and Use Cases
   3.5. Operational Requirements
      3.5.1. Reliability, Durability, Longevity, and Disaster Recovery
      3.5.2. Space management and curation features
      3.5.3. Availability
   3.6. Gaps and Challenges
      3.6.1. Tiering
      3.6.2. Data Movement
      3.6.3. Data Curation
      3.6.4. Workload Diversity
      3.6.5. Storage System Software
      3.6.6. Hardware Concerns
      3.6.7. POSIX and Middleware
4. Technology Landscape and Trends
   4.1. Hardware
      4.1.1. Magnetic Disk
      4.1.2. Solid-State Storage
      4.1.3. Storage Class Memory and Nonvolatile RAM
      4.1.4. Magnetic Tape
      4.1.5. Storage System Design
   4.2. Software
      4.2.1. Non-POSIX Storage System Software
      4.2.2. Application Interfaces and Middleware
5. Next Steps
   5.1. Vision for the Future
   5.2. Strategy
      5.2.1. Near Term (2017–2020)
      5.2.2. Long Term (2020–2025)
      5.2.3. Opportunities to Innovate and Contribute
6. Conclusion


Executive Summary

The explosive growth in data over the next five years that will accompany exascale simulations and new experimental detectors will enable new data-driven science across virtually every domain. At the same time, new nonvolatile storage technologies will enter the market in volume and upend long-held principles used to design the storage hierarchy. The disruption that these forces will bring to bear on high-performance computing (HPC) will also create significant opportunities to innovate and accelerate scientific discovery. To ensure that NERSC fully capitalizes on these opportunities, we have developed a comprehensive vision for the future of storage in HPC and identified short- and long-term strategic goals to effectively realize this vision. This report presents the results of this effort and offers a blueprint for designing a storage infrastructure for supporting HPC through 2025 and beyond.

At a high level, a broad survey of scientific workflows and user requirements reviews identified four logical tiers of data storage with different performance, capacity, shareability, and manageability requirements:

• Temporary storage, which contains data being actively used by simulation and data analysis applications over the course of hours to days.

• Campaign storage, which contains data being actively used by larger workflows and science projects over the course of weeks to months.

• Community storage, which contains larger datasets that are shared among different projects within a scientific community over the course of years.

• Forever storage, which contains high-value or irreplaceable datasets indefinitely.

These four tiers do not neatly map to the physical storage hierarchy deployed at NERSC today, but over the next several years, NERSC will use tactical deployments to closely align storage resources with these requirements. By 2020, our aim is to accommodate Temporary storage data and much of the Campaign storage data on a single, flash-based storage system that is tightly integrated with the NERSC-9 compute platform that will be deployed that year. Simultaneously, the disk-based Community and tape-based Forever tiers will be more closely coupled and provide a single, seamless user interface that will simplify the management of long-lived data for both users and center staff. These tiers will be implemented off-platform to enable them to grow in response to user needs and persist beyond the lifetime of the NERSC-9 compute system.

By 2025, the nonvolatile media underpinning the converged Temporary/Campaign storage tier will expose extreme performance and scalability through a high-performance object interface. Users who want to use a familiar POSIX file system interface to access data on this system will use POSIX middleware that provides compatibility at the cost of performance. Similarly, the off-platform Community/Forever tiers will converge into a single mass storage system by 2025, and data access will occur through industry-standard object storage interfaces that more naturally map to the use patterns of long-lived data. Today's file system interfaces and custom HPSS client software will be alternate access modes, but the underlying storage system will transparently combine the economics of tape and the accessibility of disk into one seamless data repository.

The transition from file systems to object stores as exascale becomes widespread in 2025 will require users to change their applications or adopt I/O middleware that abstracts away the interface changes. Ensuring that users, applications, and workflows will be ready for this transition will require immediate investment in testbeds that incorporate both new nonvolatile storage technologies and advanced object storage software systems that effectively use them. These testbeds will also provide a foundation on which a new class of data management tools can be built to leverage the flexibility of user-defined object-level metadata.

As the DOE Office of Science's mission computing facility, NERSC will follow this roadmap and deploy these new storage technologies to continue delivering storage resources that meet the needs of its broad user community. NERSC's diversity of workflows encompasses significant portions of open science workloads as well, and the findings presented in this report are also intended to be a blueprint for how the evolving storage landscape can be best utilized by the greater HPC community. Executing the strategy presented here will ensure that emerging I/O technologies will be both applicable to and effective in enabling scientific discovery through extreme-scale simulation and data analysis in the coming decade.


1. Introduction

The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory is the mission scientific computing facility for the Office of Science (SC) in the U.S. Department of Energy (DOE). As one of the largest facilities in the world devoted to providing computational resources and expertise for basic scientific research, NERSC is a world leader in accelerating scientific discovery through high performance computing (HPC) and data analysis. Storage systems play a critical role in supporting NERSC's mission by enabling the retention and dissemination of science data used and produced at the center. Over the past 10 years, the total volume of data stored at NERSC has increased from 3.5 PiB to 146 PiB and continues to grow at an annual rate of 30%, driven by a 1000x increase in system performance and a 100x increase in system memory. In addition, there has been dramatic growth in experimental and observational data, and experimental facilities such as the Large Synoptic Survey Telescope (LSST)1 and Linac Coherent Light Source (LCLS)2 are increasingly turning to NERSC to meet their data analysis and storage requirements.

As these data requirements continue to grow, the technologies underpinning traditional storage in HPC are rapidly transforming. Solid-state drives are now being integrated into HPC systems as a new tier of high-performance storage, shifting the role of magnetic disk media away from performance, and tape revenues are on a slow decline. Economic drivers coming from cloud and hyperscale datacenter providers are altering the mass storage ecosystem as well, rapidly advancing the state of the art in object-based storage systems over POSIX-based parallel file systems. In addition to these changing tides, non-volatile storage-class memory (SCM) is emerging as an extremely high-performance, low-latency medium whose role in the storage hierarchy remains the subject of intense research. The combination of these factors broadens the design space of future storage systems, creating new opportunities for innovation while simultaneously introducing new uncertainties.

To clarify how the evolving storage requirements of the NERSC user community can be best met given the storage technology landscape over the next ten years, we present here a detailed analysis of NERSC users' data requirements and relevant hardware, middleware, and software technologies and trends. From this we propose a reference storage architecture that addresses the increasing data demands from external experimental facilities, data science, and other emerging workloads while continuing to support the needs of traditional HPC users. We enumerate the requirements of longer-term storage resources that enable publication, collaboration, and curation over multiple years.

We lay out a roadmap for the center to deploy storage resources that best serve NERSC users in 2020 and identify the actions required to realize this strategy. We then describe the evolution of storage systems beyond 2020 and how advances in storage hardware and innovation within DOE and in industry will impact our long-term storage strategy through 2025. With this roadmap and long-term strategy, we identify areas where NERSC is positioned to provide leadership in storage in the coming decade to ensure our users are able to make the most productive use of all relevant storage technologies. Because of the NERSC workload's diversity across scientific domains, this analysis and the reference storage architecture should be relevant to HPC storage planning outside of NERSC and the DOE.

1 Ivezić, Z. et al. 2011. Large Synoptic Survey Telescope (LSST) Science Requirements Document. https://docushare.lsst.org/docushare/dsweb/Get/LPM-17. Accessed September 11, 2017.
2 2016. LCLS Data Analysis Strategy. https://portal.slac.stanford.edu/sites/lcls_public/Documents/LCLSDataAnalysisStrategy.pdf. Accessed September 11, 2017.

2. NERSC Storage Hierarchy

NERSC has more than 6,000 active users with more than 700 active projects that span a broad range of science disciplines, such as materials science, astrophysics, bioinformatics, and climate science. The diversity of workflows at NERSC results in a wide range of I/O patterns, data volumes, and retention requirements; for example, a number of projects use data from experimental and observational facilities as part of their workflow and need high-capacity storage at NERSC to ingest observational data that is transferred over the wide-area network. A growing number of projects also combine modeling and simulation with experimental or observational data, which is increasing the complexity of workflows and the demand for storage resources accessible from both extreme-scale compute systems and the wide-area network. To meet these diverse needs, NERSC maintains different tiers of storage, each optimized for a different balance of performance, capacity, and manageability.

2.1. Current Storage Infrastructure at NERSC

As of 2017, the NERSC storage hierarchy consists of a 1.6 PiB flash-based burst buffer, a 27 PiB Lustre scratch file system built using hard disk drives (HDDs), a 10.7 PiB disk-based project file system that provides medium-term storage, and a 130 PiB enterprise tape-based archive. These tiers, depicted schematically in Figure 1, vary in capacity, performance, reliability, and data management policies.

Figure 1. Storage hierarchy at NERSC in 2017.

The top two tiers (burst buffer and scratch) are optimized for performance and provide sufficient capacity to support typical active workloads in the system. These storage systems are either actively purged or require users to request resources as part of their job. They are advertised as scratch space and managed as more volatile and less robust resources, and users are encouraged to save critical data and results to the other tiers. The disk-based scratch tier is currently implemented using the Lustre parallel file system, and the burst buffer currently uses Cray's DataWarp file system and infrastructure.


The project and archive tiers are optimized for capacity and durability but still provide sufficient performance to allow users to move data effectively in and out. These tiers are not actively purged but are instead managed via quotas. The project tier is disk-based and runs IBM's Spectrum Scale parallel file system (previously known as GPFS), while the archive tier uses a combination of disk and tape that are managed by the HPSS software developed by a collaboration between DOE labs and IBM.

Reliability and manageability are a major concern for the project and archive tiers since they are often the repositories for users' most critical data. Data stored in these systems are critical to supporting the scientific process itself, since scientific results must be maintained for long periods of time and are often shared throughout the community via data portals3 associated with these storage systems. Consequently, the storage software technologies used for these tiers must be highly robust. These tiers must also be able to grow over time to allow for external projects to sponsor additional space to meet mission or science requirements. For example, various experimental projects such as STAR4 and ALICE,5 along with experimental facilities such as the ALS6 and JGI,7 have augmented NERSC's project file system to store their data. This contrasts sharply with the burst buffer and scratch tiers, which are typically designed specifically to meet the needs of the computational platform with which they are procured.

2.2. Workflow-based Model for Storage

In preparation for NERSC's next major system, to be deployed in 2020, and as part of the Alliance for Application Performance at Extreme Scale (APEX),8 the NERSC division of Lawrence Berkeley National Laboratory, Los Alamos National Laboratory (LANL), and Sandia National Laboratories (SNL) surveyed their users' scientific workflows to inform the technical requirements for the procurement of the NERSC-9 and Crossroads systems. The results of this analysis, summarized in the APEX Workflows white paper,9 present the data movement between different stages of workflows as workflow diagrams to help reason about system architecture; an example of such a diagram is shown in Figure 2. The vertical axis captures the required retention time for the data inputs and outputs and is a major contributor to storage system capacity requirements. The vertical axis also speaks to the performance requirements of each tier, as data that is generated (and deleted) more frequently will require higher performance than those data products that are generated much less frequently.

3 ALS Data and Simulation Portal. https://spot.nersc.gov/. Accessed September 4, 2017.
4 Adams, J. et al. 2005. Experimental and theoretical challenges in the search for the quark–gluon plasma: The STAR Collaboration's critical assessment of the evidence from RHIC collisions. Nuclear Physics A. 757, 1–2 (Aug. 2005), 102–183.
5 Aamodt, K. et al. 2008. The ALICE experiment at the CERN LHC. Journal of Instrumentation. 3, 8 (Aug. 2008), S08002.
6 Advanced Light Source. https://als.lbl.gov/. Accessed September 3, 2017.
7 DOE Joint Genome Institute: A DOE Office of Science User Facility of Lawrence Berkeley National Laboratory. https://jgi.doe.gov/. Accessed September 3, 2017.
8 Alliance for Application Performance at Extreme Scale. http://www.lanl.gov/projects/apex/. Accessed April 30, 2017.
9 APEX Workflows. http://www.nersc.gov/assets/apex-workflows-v2.pdf. Accessed April 30, 2017.


Figure 2. Data motion and retention in an archetypal simulation science pipeline. From the APEX Workflows white paper.10

Overall, this study found commonality across DOE in compute and storage requirements, and it presented a taxonomy of workflows' storage requirements in the form of three logical storage tiers: Temporary, Campaign, and Forever:

• Temporary storage, used for the duration of a single workflow instance, is used to store and deliver working sets, checkpoints, and job outputs. It is the highest-performing storage resource and, as such, is typically tightly coupled to the compute system.

• Campaign storage, used for the duration of a project or allocation, enables collaboration within a group of researchers, provides space for postprocessing and input sets for subsequent runs, and facilitates data curation for later publication or movement to longer-term storage. It requires greater capacity but less performance than the Temporary storage tier.

• Forever storage, used for long-term storage, acts as a repository for high-value data that is irreplaceable or prohibitively expensive to reproduce. It will contain raw datasets, often too large to store in other resources, and may also store golden datasets that are of wider value to scientific communities. Its performance requirements are lower than Campaign storage, but it must be able to reliably hold years or decades worth of data.

In addition to these three tiers formalized in the APEX Workflows document, there are additional design criteria that are critical to NERSC's users: the ability to ingest and store data from remote instruments, the availability of access controls for publishing and sharing, and the ability to efficiently index, search, and describe datasets. Thus, we also identify a fourth resource, Community storage, that is optimized to ingest data from experimental and observational facilities, share data with researchers at other centers, and facilitate the curation of data.

10 2016. APEX Workflows Whitepaper. http://www.nersc.gov/assets/apex-workflows-v2.pdf. Accessed April 30, 2017.

Figure 3 summarizes the functionality of these four logical tiers in terms of their balance of capacity and performance and how much optimization is invested in making their contents searchable, shareable, and otherwise easily curated.

Figure 3. Functional view of storage tiers.

While Figure 3 depicts a functional view of storage, Figure 4 shows how the functional model maps to the NERSC resources shown in Figure 1.

Figure 4. Mapping between the functional model and actual storage resources available at NERSC.


As is clear in this diagram, the storage resources provided by NERSC today do not precisely align with the four logical tiers we have identified. However, with the understanding that four logical tiers need not necessarily map to four physical storage resources, this serves as a sound approach to defining the design optima and goals for future physical storage resources.

3. Requirements

As previously indicated, the NERSC workload is evolving as a result of a variety of scientific and technological changes. To ensure that future compute and storage resources will meet these evolving needs, we draw on a variety of requirements studies that include current workloads, the APEX Workflows white paper,11 the DOE Exascale Requirements Reviews,12 and NERSC staff experiences.

11 APEX Workflows. http://www.nersc.gov/assets/apex-workflows-v2.pdf. Accessed April 30, 2017.
12 DOE Exascale Requirements Review. http://www.exascaleage.org/. Accessed August 31, 2017.

3.1. Current I/O Patterns

Examining current user and application I/O behavior targeting scratch file systems (the Temporary storage tier) at NERSC shows that the volumes of data read from and written to these scratch file systems are approximately equal, as shown in Figure 5. This is likely due to a balance between checkpoint-heavy workloads (many write-heavy checkpoint operations for each read-heavy restart operation), common experimental and simulation datasets being re-read multiple times, and write-once, read-once intermediate files generated by scientific workflows, as noted in Figure 2.

Figure 5. Weekly I/O read and write volumes on NERSC Edison's scratch1 and scratch2 Lustre file systems. The overall annual average read/write ratio is 11/9.

This analysis indicates that Temporary storage needs to provide balanced read and write capabilities and that storage media, APIs, or access semantics that emphasize one over the other would not be suitable for the NERSC workload. In addition, the Temporary and Campaign storage tiers should be strongly coupled to streamline data motion of hot datasets between the working space and a storage resource that facilitates data management over the course of the larger scientific study.
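As an illustration of how this read/write balance can be monitored in practice, the sketch below tallies cumulative read and write byte counters from Lustre client-side statistics. It is a minimal example, assuming the conventional llite stats layout in which the read_bytes and write_bytes lines end with a cumulative byte total; the stats path is typical but may vary by Lustre version.

    import glob

    def lustre_rw_bytes(stats_glob="/proc/fs/lustre/llite/*/stats"):
        """Sum cumulative read/write byte counters across Lustre client mounts."""
        read_total = write_total = 0
        for stats_file in glob.glob(stats_glob):
            with open(stats_file) as f:
                for line in f:
                    fields = line.split()
                    # Assumed layout: "read_bytes <n> samples [bytes] <min> <max> <sum>"
                    if fields and fields[0] == "read_bytes":
                        read_total += int(fields[-1])
                    elif fields and fields[0] == "write_bytes":
                        write_total += int(fields[-1])
        return read_total, write_total

    if __name__ == "__main__":
        r, w = lustre_rw_bytes()
        if w:
            print(f"read/write ratio: {r / w:.2f}")  # ~1.2 would match the 11/9 balance above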

As shown in Figure 6, NERSC applications also use a variety of POSIX metadata calls within Temporary and Campaign storage systems, with the vast majority being opens, closes, and stats. It is therefore essential that the Temporary storage resource's system software implement these calls in a highly scalable fashion; for example, calculating the size of a file that is striped across hundreds of storage servers must be efficient, and allowing users to obtain file handles by which they can access their stored data must incur minimal latency.

Figure 6. Distribution of metadata operation counts on NERSC Edison's scratch1 and scratch2 Lustre file systems from June 2016 to June 2017.
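The latency of the three dominant operations in Figure 6 is straightforward to probe. The following sketch times a tight open/fstat/close loop over a set of files; it is illustrative only, and the SCRATCH directory is a placeholder.

    import os
    import time

    def metadata_op_rate(paths):
        """Measure open+fstat+close throughput (the dominant ops in Figure 6)."""
        start = time.perf_counter()
        for path in paths:
            fd = os.open(path, os.O_RDONLY)  # open
            os.fstat(fd)                     # stat via the open handle
            os.close(fd)                     # close
        elapsed = time.perf_counter() - start
        return len(paths) / elapsed

    if __name__ == "__main__":
        scratch = os.environ.get("SCRATCH", ".")  # placeholder directory
        files = [e.path for e in os.scandir(scratch) if e.is_file()][:1000]
        if files:
            print(f"{metadata_op_rate(files):.0f} open/stat/close sequences per second")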

Intuitively, accesses to Forever storage should skew toward writes, but this is not pronounced at NERSC; 24% of data written to the archive is recalled at some point. In fact, some archived data shows a high skew toward reads as a result of science communities continually accessing large datasets. The net result is that NERSC's archive read-to-write ratio is remarkably balanced, with reads accounting for 40% of system I/O. Given that NERSC's Forever tier is magnetic tape, and tape reads are more difficult to manage and slower due to volume mount and seek latencies on linear-access media, we conclude that this read traffic is born out of necessity rather than the best use of the system or the desires of users. Re-reads would be better served from lower-latency Campaign or Community storage layers if capacity allowed.

With sufficiently sized Community and Campaign storage tiers, Forever storage should be optimized for high-performance write capabilities rather than read performance, as read duty is mainly fulfilled by Community and Campaign storage resources. However, as shown in Figure 4 (which depicts the reality at NERSC), this is a system design point rather than a statement of how the current storage systems work. The discrepancy is driven in large part by the fact that tape is still the most cost-effective mass storage medium on a dollars-per-bit basis.

The coupling between Forever and Community storage can be looser than that between Temporary and Campaign, as the data in Forever and Community space is principally static. Community storage should be sized such that data does not migrate frequently to and from Forever storage. Because of the difficulty of interacting with tape, Community storage needs to be large enough that it effectively eliminates repeated re-reads from the Forever tier. From the standpoint of evaluating effective technologies, POSIX I/O operations are much simpler in the Community and Forever space, mainly composed of put/get/stat on whole files, with other operations to create and maintain directory hierarchies and very little else. Such a write-once, read-many (WORM) workload is an area where inexpensive capacity storage systems without full POSIX compliance could be deployed; for example, the object storage systems used extensively in the cloud and hyperscale markets are specifically optimized for WORM I/O.
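The put/get/stat triplet maps directly onto the primitives of cloud object stores. As a hedged sketch (the endpoint, bucket, and object names are hypothetical, and boto3 is just one common S3-compatible client), the whole-file workload described above reduces to:

    import boto3

    # Hypothetical S3-compatible Community storage endpoint; any object store
    # exposing put/get/head semantics would serve the same role.
    s3 = boto3.client("s3", endpoint_url="https://objectstore.example.gov")

    # put: store a whole file as one object
    s3.upload_file("golden_dataset.h5", "community-bucket", "project42/golden_dataset.h5")

    # stat: retrieve object metadata without reading any data
    meta = s3.head_object(Bucket="community-bucket", Key="project42/golden_dataset.h5")
    print(meta["ContentLength"], meta["LastModified"])

    # get: read the whole object back
    s3.download_file("community-bucket", "project42/golden_dataset.h5", "local_copy.h5")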

As science teams move from using small numbers of applications during their research to more complex interactions between many applications, scientific workflows are expected to become the dominant mode of operation at NERSC. The compute concurrency of these workflows is diverse and can be extremely low for image- or other instrument-analysis workflows. These data-oriented workflows are anticipated to grow more in throughput than in problem size by 2020, and because many constituent applications do not strong-scale well, the increased concurrency of NERSC's future systems will be utilized by bundling multiple workflow pipelines into a single job.13 Unlike the scaling behavior of traditional simulation science applications, this will demand scalable metadata performance from the storage system as each node processes larger numbers of files concurrently.14

13 Daley, C.S. et al. 2015. Analyses of Scientific Workflows for Effective Use of Future Architectures. Proceedings of the 6th International Workshop on Big Data Analytics: Challenges, and Opportunities (BDAC-15) (Austin, TX, 2015).
14 Daley, C.S. et al. 2016. Performance Characterization of Scientific Workflows for the Optimal Use of Burst Buffers. Proceedings of the Workshop on Workflows in Support of Large-Scale Science (WORKS 2016) (Salt Lake City, 2016), 69–73.

Figure 7. Percentage of data generated by NERSC workflows that will be retained in Forever storage.

A key finding of the APEX Workflows study was that NERSC users want to save a significant fraction of the data files used and produced by their workflows for a long time, perhaps indefinitely. Figure 7 shows the percentage of I/O generated by the surveyed NERSC workflows that is saved forever. Even if users are able to make use of in-situ or in-transit analytics to reduce data movement during workflow execution, a large fraction of the generated data is irreducible and must be retained long-term.

Thus, in-flight analytics are not magic bullets that can be relied upon to stem the increasing volumes of data being generated by scientific workflows, and we are rapidly approaching the need for O(exabyte) of capacity storage unless NERSC users re-architect their workflows to save less data. Extrapolating the historic 45% annual growth rate of NERSC's current archive system alone predicts 1 exabyte of user data by 2022. Given the aforementioned observation that NERSC users are currently using the archive as both Community and Forever storage, effectively balancing the capacity of Community storage relative to Forever storage indicates the need for hundreds of petabytes of capacity in the Community storage tier by 2023.
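A back-of-the-envelope check of that extrapolation, assuming the roughly 146 PiB stored at NERSC in 2017 (Section 1) as the starting point:

    # Compound the 45% annual archive growth rate from a 2017 baseline.
    capacity_pib = 146.0          # approximate 2017 holdings (PiB)
    for year in range(2018, 2023):
        capacity_pib *= 1.45
        print(year, round(capacity_pib), "PiB")
    # 2022 -> ~936 PiB, i.e. about 1.05e18 bytes, on the order of 1 exabyte.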

The findings presented above indicate two corollaries:

• Campaign storage is, in a sense, "cold" Temporary storage, and Community storage is "hot" Forever storage.

• The data stored in Temporary/Campaign storage serves the goals of individual research projects and their users, while data in Community/Forever storage may be of interest to broader scientific communities and many research projects.

These suggest a broad dichotomy between Temporary/Campaign storage and Community/Forever storage in both their data retention times and the breadth of users they serve. It follows that Temporary/Campaign storage is best implemented close to specific compute systems to emphasize high-performance analysis and access by a small cohort of users. Conversely, Community/Forever storage is best maintained closer to the wide-area network and more centrally within a facility to emphasize sharing and broad access by larger user communities.

From these user requirements, several key design criteria become apparent. The Temporary and Campaign tiers should be closely coupled and provide balanced read/write performance and scalable metadata to support the NERSC workload. The Community and Forever tiers do not need such tight coupling, but they should be sized such that most read activity targets data that is stored in the Community tier rather than Forever storage. This would allow Community storage to make use of technologies optimal for WORM workloads, leaving Forever storage for highly valuable but cold data.

3.2. NERSC-9 Requirements

In 2020, NERSC plans to deploy its NERSC-9 system, which is targeted to increase the processing capability of the center by 4-5x over the NERSC-8 system, Cori. With the potential for dramatic data growth as emerging areas in the data sciences mature, this increase in computing capability is expected to be accompanied by at least a proportional increase in the rate and volume of data generation within NERSC. The NERSC-9 system will include platform storage that is explicitly designed to:

"[retain]allapplicationinput,output,andworkingdatafor12weeks(84days),estimatedata

minimumof36%ofbaselinesystemmemory[3PiB]perday."15

15 APEX 2020 Technical Requirements Document for Crossroads and NERSC-9 Systems. http://www.lanl.gov/projects/apex/request-for-proposal.php. Accessed April 30, 2017.


as well as deliver sufficient performance to absorb checkpointing. The technical requirements for the NERSC-9 system were specified such that platform-integrated storage will fulfill the role of Temporary storage and a portion of Campaign.

While the utility and capability of this platform storage will be well defined in the 2020 timeframe, it is designed to retain data for only 84 days. Therefore, users and projects that wish to retain data long-term must store it on alternate, longer-term storage resources that fall outside of the NERSC-9 procurement. However, in the NERSC-9 storage technical requirements, vendors were given the flexibility to respond with innovative solutions surrounding features that are more relevant to longer-term data management, including background data integrity verification, detailed monitoring of storage performance and utilization, fast metadata traversal, and connectivity to external file systems and other data sources. As a result, the NERSC-9 procurement could be used as a vehicle for procuring and satisfying the requirements of the Campaign, Community, and Forever storage tiers as well.

3.3. DOE Exascale Requirements Reviews

The DOE Advanced Scientific Computing Research (ASCR) program has conducted a number of requirements-gathering efforts with other DOE SC programs to ensure that the exascale systems to be fielded in 2021-2023 are aligned with the mission needs of each DOE SC program office. These efforts build on a long history of engagement with the scientific community that helps drive future system requirements and architectures, going back to NERSC's Greenbook16 review in 2002 and extending to the recent DOE Exascale Requirements Reviews.17 The output of these efforts directs the planning and acquisition strategies for NERSC, the Leadership Computing Facilities, and ESnet.

These comprehensive reports span a broad range of areas, including computational requirements, software and middleware needs, networking, data management, and data analysis. Some of the common data and storage requirements relevant to NERSC's storage strategy that emerged from those efforts are as follows:

1. Many of the program offices anticipate exabyte-scale storage needs in the coming decade, with many projects generating and processing hundreds of terabytes of data today and projecting 10-50x growth during that decade. Multiple projects are predicting 100-petabyte or greater datasets in the 2025 timeframe. These use cases underline the need for cost-effective, capacity-optimized Community and Forever storage.

2. There is an increasing need to integrate observational and simulation data in workflows that require data to be co-located for effective analysis. This is, in part, a direct result of typical observational and simulation results now surpassing the analysis capabilities of computing systems at users' home institutions. This will drive the need to improve data movement tools, increase storage capacity, and provide high-bandwidth, wide-area networking connectivity. This speaks to the need for effective integration between all storage tiers to minimize the complexity of data movement during workflows.

3. Data management needs to extend beyond NERSC to the wide-area network, as other compute and experimental facilities integrate more closely with NERSC. External connectivity requirements are also being driven by a growing demand to share common, curated datasets with the wider community, driving the need for a robust Community storage resource.

16 Greenbook – Needs and Directions in High-Performance Computing for the Office of Science. https://www.nersc.gov/assets/For-Users/DOEGreenbook.pdf. Accessed April 27, 2017.
17 DOE Exascale Requirements Review. http://www.exascaleage.org/. Accessed August 31, 2017.


4. Users have a strong need for integrated data tracking and provenance within the storage system. This includes expanded capabilities around metadata storage, searching and querying, and event triggering. These are features that are principal to the Campaign and Community storage tiers.

5. There is a transition from individual large-scale simulations toward ensembles, uncertainty quantification, and more complex workflows that must connect and integrate simulation and analysis. This shift toward ensemble workflows will require that Campaign storage simplify data management across large projects and the other tiers.

6. The dramatic growth in data storage demands is accompanied by a desire to apply new forms of data analysis and analytics, including machine learning, to effectively process the massive amounts of data resulting from experimental sources, extreme-scale simulation, and uncertainty quantification. This aligns with the observation at NERSC that Temporary storage must deliver balanced read and write performance.

All divisions within DOE SC anticipate that the dramatic increase in their computational requirements will drive similarly dramatic increases in their data storage and management requirements. Simply providing high-capacity, high-bandwidth storage will no longer satisfy the broad range of requirements that arise from the aforementioned shift toward workflow-oriented processing and experimental analysis. Rather, future storage systems will have to deliver low latency (high IOPS), rich metadata facilities, and external connectivity, in addition to high parallel I/O bandwidth. These user requirements reinforce the need to treat storage infrastructure design as a multi-dimensional problem and support the approach described in Section 2.2.

3.4. Emerging Applications and Use Cases

A growing number of domain sciences need to leverage the capabilities of HPC systems yet have data requirements that contrast with those of traditional HPC workloads. Many of these emerging data workloads are driven by machine learning and other data analytics techniques that rely on workflow frameworks (e.g., Apache Spark), analytics packages (e.g., Caffe and TensorFlow), and domain-specific libraries that traditionally have not been used in HPC. These analysis tools often exhibit I/O patterns that perform poorly on HPC systems as a result of their genesis in cloud environments, and while individual analytics tools can be refactored for use on HPC systems, the field of data science is evolving rapidly and independently of the HPC community. The next set of popular tools may exhibit the same deleterious I/O behavior and poor out-of-the-box performance, and they will need to be adapted to HPC environments because of their prioritization of productivity and their momentum in the larger data analytics community.

Many of these emerging application areas are associated with observational and experimental facilities that are already generating large volumes of data, and, as highlighted in Section 3.3, their projected growth rates are staggering. For example, NERSC is collaborating with the Linac Coherent Light Source (LCLS) to enable real-time analysis of data generated by high-speed, high-resolution instruments. These instruments currently generate hundreds of megabytes per second of data but are projected to generate tens to hundreds of gigabytes per second with future upgrades. Instruments at the National Center for Electron Microscopy,18 the Advanced Photon Source,19 the Spallation Neutron Source,20 and elsewhere project similar increases. These facilities also often run 24x7 for months at a time, so the availability and reliability of the compute, storage, and network resources supporting these workflows are critical. Given that researchers are often allocated very limited time on these instruments, providing continuity of storage and computing resources, even through system maintenance periods, is important.

18 National Center for Electron Microscopy (NCEM). http://foundry.lbl.gov/facilities/ncem/. Accessed September 11, 2017.
19 Advanced Photon Source. https://www1.aps.anl.gov/. Accessed September 11, 2017.
20 Spallation Neutron Source. https://neutrons.ornl.gov/sns. Accessed September 11, 2017.

Direct interactions between NERSC staff and the staff and users of a number of experimental facilities and projects have revealed several key storage requirements. There will be a need to transfer hundreds of GB/sec from the wide-area network directly to a durable storage resource such as Campaign or Forever storage in a reliable way. This translates to a need for high availability and accessibility of data on these tiers through maintenance, software upgrades, and storage expansion. Furthermore, predictable I/O performance for both data and metadata accesses is critical for co-scheduling experimental and computational resources, and providing quality-of-service controls is highly desirable across all storage tiers.

3.5. Operational Requirements

User requirements reviews and other surveys define many design criteria for the storage system architecture, such as I/O performance and data manageability, but operational considerations and data lifecycle management needs give rise to additional requirements that are not directly user-facing. These operational requirements are especially critical for the Community and Forever storage resources, which will retain long-lived data. Data on these resources will routinely outlast the four- to five-year lifespan of individual compute platforms and must be available across all compute systems and accompanying edge services at the center.

As discussed in Section 2.1, the role of Community storage at NERSC is currently fulfilled by the project file system, which has been in existence for more than 10 years. Forever storage is fulfilled by the HPSS-based archive, which has been managed for more than 20 years. Dozens of NERSC staff have accumulated hundreds of years of direct experience managing long-lived HPC storage systems, contributing to community best practices and working with peers at other DOE HPC facilities. They have identified critical attributes needed to maintain and run these systems effectively. These operational requirements can be organized into three general categories, described in Sections 3.5.1-3.5.3.

3.5.1. Reliability, Durability, Longevity, and Disaster Recovery

Because Community and Forever storage are expressly designed to store valuable data, ensuring that the data is highly resistant to corruption, available even in the presence of component failures, and quickly restorable in the event of a disaster is paramount. Although virtually all mass storage systems make assurances about these features, it is important to note the effort required by storage system operators to exercise these features in practice. This effort has a direct effect on the staffing levels required to support the storage system as it increases in capacity and may be of critical importance to ensuring the minimal downtime during outages required by the emerging applications and use cases discussed in Section 3.4.

Required features include:


• Highly durable hardware and software. For the archive, tape media has offered not only cost-effective capacity but additional durability assurance because the data is offline. This makes it far less prone to data corruption due to software error, as evidenced by a 2011 software-induced disaster at a leading hyperscale provider.21

• A high degree of reliability and integrity for data at rest and in motion. This may be addressed by mechanisms like T10 DIF and data checksumming and is critical to preventing silent data corruption, as evidenced by a 2013 hardware-related data corruption issue within Internet2.22 (An end-to-end checksumming sketch follows this list.)

• The ability to shrink, grow, and migrate data "live" as capacity is increased or reconfigured. This is an essential feature for repacking old data onto new, higher-capacity media. It also enables NERSC to allow large experiments and other data-intensive users to purchase additional storage to be co-located with their compute resources.

• The ability to mount storage resources across different compute and login systems and over tens or hundreds of thousands of client nodes. This is important for all tiers but particularly essential for the Campaign and Community storage tiers, which must interface with a diversity of environments to ingest experimental data and share datasets.

• Flexible support for a variety of high-performance networks. This allows the storage to remain compatible as the center's network and compute technologies evolve with changing user requirements and emerging technologies.
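As referenced above, here is a minimal sketch of end-to-end integrity checking for data in motion: the digest is computed while the bytes are streamed and compared against the value recorded at the source. This illustrates the principle behind mechanisms like T10 DIF and transfer-tool checksumming rather than any particular product's implementation.

    import hashlib

    def copy_with_checksum(src, dst, chunk_size=1 << 20):
        """Stream src to dst, returning the SHA-256 digest of the bytes written."""
        digest = hashlib.sha256()
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            while True:
                chunk = fin.read(chunk_size)
                if not chunk:
                    break
                digest.update(chunk)   # checksum the data as it moves
                fout.write(chunk)
        return digest.hexdigest()

    # Usage: compare against the checksum recorded when the file was created, e.g.
    #   expected = "..."  # from the source system's manifest
    #   assert copy_with_checksum("input.dat", "/archive/input.dat") == expected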

3.5.2. Space management and curation features

Effectively managing storage resource utilization reduces storage costs and improves quality of service. While management features such as user- and group-level quotas are supported by virtually all storage systems, quotas can be an inflexible and opaque approach if users do not have the ability to determine what data they have. Giving users and administrators the ability to determine which datasets are consuming the most space and where these large datasets are located reduces their data management overhead. Required space management and curation features include:

• Flexible methods to track usage and to specify and enforce limits (e.g., user quotas, tree quotas, etc.). This allows users and operators to make more informed decisions about which data can or should be deleted to ensure a fair share of storage resources.

• Methods to quickly walk the storage resource namespace (see the sketch after this list). In addition to helping inform space management decisions, understanding the distribution of file or object sizes, access frequencies, and other metadata informs policy decisions and system performance optimization.

• The ability to manage hardware that has different characteristics (bandwidth, capacity, IOPS) within the same system. This allows the storage system to grow along independent dimensions (e.g., performance and capacity) and is of increasing importance with emerging NAND and SCM media.
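A simple illustration of the namespace-walking requirement: the sketch below recursively scans a directory tree and bins files by size to show where capacity is concentrated. Production tools would parallelize the walk or query the file system's own metadata database, but the output is the kind of distribution described above.

    import os
    from collections import Counter

    def size_histogram(root):
        """Walk a namespace and bin file sizes into power-of-two buckets."""
        buckets = Counter()
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                try:
                    size = os.stat(os.path.join(dirpath, name)).st_size
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                buckets[max(size, 1).bit_length()] += 1  # bucket = ceil(log2(size))
        return buckets

    if __name__ == "__main__":
        for bucket, count in sorted(size_histogram(".").items()):
            print(f"< {2 ** bucket} bytes: {count} files")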

21 Treynor, B. 2011. Gmail back soon for everyone. https://gmail.googleblog.com/2011/02/gmail-back-soon-for-everyone.html. Accessed September 4, 2017.
22 Foster, I. 2013. Globus Online ensures research data integrity. https://www.globus.org/blog/globus-online-ensures-research-data-integrity. Accessed September 4, 2017.

3.5.3. Availability

Maintaining the highest possible availability of storage resources is essential to operating a supercomputing center; an entire center can be rendered offline if its storage systems are offline. Furthermore, the need to maintain extreme availability and minimize maintenance outages only becomes greater as experimental facilities become coupled to HPC facilities; as described in Section 3.4, storage system downtime can severely impact the ability of a user of an experimental facility to do research. As such, we have identified the following operational requirements to ensure maximum availability:

• Strong support for live updates, rolling upgrades, live configuration changes, etc. This minimizes the need to take the system offline, especially for extended periods of time, and speaks directly to the requirement of maintaining high availability during maintenance.

• Support for centralized management and monitoring. This improves operational efficiency and reduces downtime by decreasing the amount of effort required for storage engineers to manage multiple tiers of highly distributed storage.

• The ability to recover cleanly from faults or failures with minimal cleanup and manual intervention. As with the previous operational requirements, this is directly tied to reducing downtime and staffing requirements.

3.6. Gaps and Challenges

While the current storage hierarchy described in Section 2 has served NERSC well, contrasting it with the requirements stated in this section reveals some shortcomings in its overall architecture, the deployed technology, and its ease of use. If these gaps are not addressed, they will be further aggravated by technology trends and emerging user needs.

3.6.1. Tiering

The number of layers in the hierarchy is driven by cost optimizations to provide fast, high-performance storage to support running simulations and analysis (Temporary storage); high capacity to support longer-term projects (Campaign/Community storage); and archiving of data to support the scientific process (Community/Forever storage). Tiered storage adds complexity for users and staff, and the lack of automated data movement between tiers is a significant burden to NERSC users. Each layer of the storage hierarchy is a complex, independent system that requires expertise to manage, and collapsing tiers would simplify storage administration for NERSC and reduce data management complexity for users.

3.6.2. Data Movement

At present, moving data between NERSC's Temporary and Campaign/Community storage tiers is relatively frictionless, as they both provide a POSIX file system interface. Movement in and out of Forever storage is more challenging because it requires users to interact with custom client software similar to FTP or UNIX tar. The difficulty is compounded by the fact that data resides on tape, which introduces volume mount latencies that may span several minutes and linear read or write access restrictions, and by the fact that data may be scattered over many different tape cartridges. Providing a common interface for all tiers, whether file-based or object-based, would streamline data movement and simplify the task of building more productive user interfaces to manage data movement.
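For concreteness, the archive's custom clients at NERSC are hsi (an FTP-like interface to HPSS) and htar (a tar-like aggregator). The following is a hedged sketch of driving them from a workflow script; exact flags and site configuration vary, so treat these invocations as representative only.

    import subprocess

    # Hedged example: hsi offers FTP-like put/get against HPSS, and htar bundles
    # many small files into one archive-resident tar file (fewer tape mounts).
    # Flags and paths are illustrative; consult site documentation before use.
    subprocess.run(["hsi", "put", "results.h5"], check=True)              # store one file
    subprocess.run(["htar", "-cvf", "run42.tar", "run42/"], check=True)   # aggregate a directory
    subprocess.run(["hsi", "get", "results.h5"], check=True)              # recall it later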

3.6.3. Data Curation

Integrated search and discovery tools are lacking at all levels of the storage hierarchy today. This is more problematic for Community and Forever storage, where significant quantities of data are resident for years or decades. These tiers often serve as shared data repositories for multiple projects over a long period of time, and the individual owner or steward of a dataset may change over the course of a project.


To address these issues, large projects have built their own data catalogs that are completely external to the NERSC storage resources. Some, such as JAMO,23 are focused narrowly on cataloging and data movement, while others, including those developed by the ATLAS24 experiment at the Large Hadron Collider and by the Advanced Light Source25 at Berkeley Lab, include web presentation and workflow features. Although we do not intend to define a metadata schema for all NERSC users, having a common set of metadata features across all tiers on which user communities can build their domain-specific cataloging systems would simplify data management and curation as NERSC's storage hierarchy continues to evolve over the next 10 years.

3.6.4. Workload Diversity

The span of NERSC user workloads is broad, and, consequently, the scale and distribution of file characteristics and I/O patterns vary greatly. As discussed in Section 3.1, simulations running at scale often write very large checkpoints that stress the entire data path from interconnect to media. At the other end of the spectrum, many experimentally driven projects run many low-concurrency jobs over large collections of smaller files. This can stress the metadata service and the storage system's ability to efficiently handle high volumes of small I/O operations, which has knock-on effects on other users of the file system. Providing a means to distribute metadata over multiple storage servers is an essential requirement, and features that allow more intelligent partitioning of metadata on the basis of users, projects, or arbitrary data properties would benefit quality of service.

3.6.5. Storage System Software

Usability and manageability gaps exist in the storage system software used across all of NERSC's current storage tiers. For example, the Lustre-based scratch file system deployed as part of the Cori system's Temporary storage tier provides no straightforward way to add storage capacity or rebalance data across Lustre object storage targets. Lustre's management tools are also relatively immature; aside from Intel's now-unsupported Intel Manager for Lustre software,26 there is no single-pane file system management interface for Lustre, and the majority of available tools are ad hoc scripts contributed by the community.

NERSC's Spectrum Scale-based project file system has its own set of challenges. Maintenance operations, such as file system integrity checks that require the file system to be taken offline for an extended period, work directly against the high availability requirements identified in Section 3.4. Furthermore, Spectrum Scale is a proprietary, closed-source system with annual licensing costs, and much recent development effort at IBM has gone into supporting requirements driven by enterprise, not HPC, needs.

The Forever tier, implemented using HPSS, is engineered to present a POSIX-compliant interface despite a simple put/get interface being sufficient for nearly all use cases. This POSIX compliance adds significant complexity to the software, yet the user interface to this tier is through custom client software. A file system interface to the archive, either through integration with Spectrum Scale or FUSE, is possible, but the underlying tape storage can make operations that are unremarkable in a disk-based file system extremely inefficient and time-consuming without careful planning.

23 New Metadata Organizer and Archive Streamlines JGI Data Management. http://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2013/new-metadata-organizer-and-archive-streamlines-jgi-data-management. Accessed March 6, 2017.
24 PDSF data disk summary. http://portal.nersc.gov/atlas_diskstat. Accessed March 6, 2017.
25 Deslippe, J. et al. 2014. Workflow Management for Real-Time Analysis of Lightsource Experiments. 9th Workshop on Workflows in Support of Large-Scale Science. (Nov. 2014), 31–40.
26 Damkroger, T. 2017. A New Path with Lustre. http://intel.cmail20.com/t/ViewEmail/d/C316287F828160FA/5FC4DCCCE8C49BF9F6A1C87C670A6B9F. Accessed April 20, 2017.

3.6.6. Hardware Concerns

While all of the disk-based storage systems are architected for reliability with enterprise-class RAID and redundancy, the demand for storage capacity is now being satisfied with more, not simply larger, disks. This has a significant effect on the overall reliability of a storage system and its characteristic mean time to data loss, and the extreme-scale storage industry is transitioning from block-based parity within each failure domain (e.g., RAID6) to highly distributed, object-level erasure coding across shelves, racks, and even datacenters. File systems built upon block-based storage cannot make use of these advances in erasure coding despite the nature of magnetic disks effectively requiring it for resilience in the future, so moving the Campaign and Community storage tiers toward technologies that balance parity and resilience more effectively will be essential.
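The trade-off can be made concrete with a small calculation. In a k+m erasure code, k data blocks are encoded into k+m blocks, any k of which suffice to reconstruct the data. The sketch below compares a conventional RAID6-style group with a wider, distributed code; the specific geometries are illustrative, not a NERSC configuration.

    def erasure_profile(k, m):
        """Report capacity overhead and fault tolerance for a k+m erasure code."""
        return {
            "total_blocks": k + m,
            "survivable_failures": m,                # any m lost blocks are recoverable
            "capacity_overhead": f"{m / k:.0%}",     # extra capacity spent on redundancy
        }

    # RAID6-style group: 8 data + 2 parity, confined to one failure domain.
    print("RAID6 8+2:", erasure_profile(8, 2))
    # Wider object-level code spread across racks: twice the fault tolerance
    # for the same 25% overhead.
    print("EC 16+4:  ", erasure_profile(16, 4))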

3.6.7. POSIX and Middleware

Over the last 50 years, the POSIX I/O standard27 has stood the test of time as the canonical way to access storage devices. However, advances in software scalability and hardware performance have strained the appropriateness of the existing standard and its semantics. Either revisions to the standard or entirely new performance-optimized standards would be valuable for future applications to deal with emerging high-performance storage technologies.

27 2009. International Standard - Information Technology Portable Operating System Interface (POSIX) Base Specifications, Issue 7. ISO/IEC/IEEE 9945:2009(E). (Sep. 2009), 1–3880.

Further, a great deal of I/O middleware, such as HDF5, PnetCDF, and ADIOS, is tuned to operate with traditional memory-to-disk I/O endpoints. This middleware provides great value to application developers by isolating users from the vagaries of extracting peak performance from the underlying storage system, but it will need to be updated to handle the transition to a multi-tiered I/O configuration. Prefetching data from scratch or project into a burst buffer and migrating changes back again, support for asynchronous I/O operations, and other improvements to leverage new technologies are needed to continue supporting user requirements.

4. Technology Landscape and Trends

Having identified both the functional requirements of a future storage infrastructure at NERSC and the requirements coming from users, experimental facilities, and operators, we now present the hardware and software technologies that are or will be available to implement the Temporary, Campaign, Community, and Forever tiers over the next decade.

4.1. Hardware

Although the HPC industry has historically been a significant driver of mass storage hardware, the emergence of cloud and other hyperscale service providers has had a dramatic effect on the storage industry and its roadmaps for storage media. These economic forces, combined with the impending scaling limits of some physical media and the emergence of entirely new forms of others, are causing rapid and significant changes in the future landscape of storage hardware.

[27] 2009. International Standard - Information Technology Portable Operating System Interface (POSIX) Base Specifications, Issue 7. ISO/IEC/IEEE 9945:2009(E). (Sep. 2009), 1–3880.

4.1.1. Magnetic Disk

Magnetic disk is transitioning from a medium designed for both capacity and bandwidth into one designed solely for capacity as a result of two factors:

• Magnetic storage media is reaching a physical limit on how small individual magnetic domains on the disk surface can be.

• High-performance NAND is proliferating, satisfying storage performance requirements and disincentivizing innovation toward better magnetic disk performance.

Because the I/O performance of rotating media scales only with the square root of bit density, the disparity between disk capacity and performance is expected to widen.
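
To make the square-root relationship concrete (a back-of-the-envelope sketch of our own, not a figure from this report): sequential bandwidth tracks the linear, down-track bit density, which grows as the square root of areal density, while capacity grows linearly with areal density:

    % AD = areal density; C = drive capacity; B = sequential bandwidth.
    % Capacity scales linearly with areal density, bandwidth only with
    % its square root, so full-drive read (or rebuild) time grows:
    \[
      C \propto AD, \qquad B \propto \sqrt{AD}
      \quad\Longrightarrow\quad
      \frac{C}{B} \propto \sqrt{AD}
    \]

Under this scaling, a 10x areal density gain yields roughly 10x the capacity but only about 3.2x the bandwidth, so the time to read or rebuild an entire drive roughly triples.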

That said, there are a number of capacity-focused improvements on the magnetic disk roadmaps of vendors and industry consortia. As shown in Figure 8, there are technology improvements that are projected to deliver a 10x increase in areal density over the next 10 years.

FIGURE 8. PROJECTED AREAL DENSITY IMPROVEMENTS FOR MAGNETIC DISK STORAGE TECHNOLOGY. BASED ON PROJECTIONS FROM SEAGATE [28] AND ATSC [29]. PERPENDICULAR MAGNETIC RECORDING (PMR) IS THE STANDARD TECHNOLOGY OF TODAY.

The modest 10% areal density (AD) improvement from two-dimensional magnetic recording (TDMR) [30] is likely to reach the enterprise market in the near term, and heat-assisted magnetic recording (HAMR) and bit-patterned magnetic recording (BPM) [31] promise to deliver more aggressive increases in bit density in the longer term. However, both HAMR and BPM represent largely new recording techniques rather than small refinements to existing approaches, and there is a nontrivial risk that HAMR will not be a commercially or economically viable option in 2020.

[28] Anderson, D. 2016. Whither Hard Disk Archives? 32nd International Conference on Massive Storage Systems and Technology. (May 2016).
[29] 2016 ATSC Technology Roadmap. http://idema.org/?page_id=5868. Accessed September 3, 2017.

Thus, it is more likely that vendors will continue increasing per-drive storage capacity by relying on refinements to shingling (e.g., via TDMR) and increasing platter counts. These two approaches will result in high-capacity drives with reduced write performance, flat read performance, and slightly increased power consumption. While suitable for the WORM workloads prevalent in enterprise applications and content distribution networks, the evolution of spinning disk media is moving away from the balanced read-write workloads described in Section 3.1 and common to scientific computing in general.

4.1.2. Solid-State Storage

NAND-based solid-state storage devices (flash) have become a growing presence in HPC in the form of node-local scratch storage [32] and centralized burst buffers [33] designed to reach a better performance-per-bit than magnetic disk media. As demand for flash media continues to increase, driven by both mobile electronics and hyperscale markets, the lower power consumption and high performance of flash are expected to continue to push magnetic disk into lower-performance roles.

The low power consumption and high bit density of flash make it an attractive archival medium. Although the cost-per-bit of flash is still significantly higher than that of magnetic disk and tape, the cost-per-bit of flash storage can be reduced by sacrificing performance and endurance. Hyperscale consumers (e.g., Facebook [34]) are driving the development of quad-level cell (QLC) flash as a low-power, high-density medium for WORM and archival storage, and the first QLC NAND products have recently been announced by vendors including Samsung [35] and Toshiba [36]. By the 2020 timeframe, it is entirely conceivable that QLC flash may find a role alongside higher-performance, higher-endurance MLC and TLC NAND in tiered, all-flash storage systems.

The cost-per-bit of flash is also expected to drop precipitously before 2020 as the global NAND manufacturing industry completes the process of converting 2D (planar) NAND fabrication plants to 3D NAND. This will likely push prices for performance flash below $0.10 per GB, encroaching on a market traditionally held by magnetic disk [37]. Advances in 3D NAND fabrication technology, driven by healthy competition in the marketplace, will allow 3D NAND to scale well beyond 2020 as well; approaches such as string stacking are expected to push the areal densities of flash to at least 5-10x the state of the art today.

[30] Victora, R.H. et al. 2012. Two-Dimensional Magnetic Recording at 10 Tbits/in^2. IEEE Transactions on Magnetics. 48, 5 (May 2012), 1697–1703.
[31] Albrecht, T.R. et al. 2015. Bit-Patterned Magnetic Recording: Theory, Media Fabrication, and Recording Performance. IEEE Transactions on Magnetics. 51, 5 (May 2015), 1–42.
[32] Strande, S.M. et al. 2012. Gordon: design, performance, and experiences deploying and supporting a data intensive supercomputer. Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment - XSEDE (Chicago, 2012), 1.
[33] Bhimji, W. et al. 2016. Accelerating Science with the NERSC Burst Buffer Early User Program. Proceedings of the 2016 Cray User Group (London, 2016).
[34] Rao, V. 2016. "How We Use Flash at Facebook: Tiered Solid State Storage." Flash Memory Summit 2016. (August 2016).
[35] Elliot, J. 2017. "Advancements in SSD and 3D NAND Reshaping Storage Market." Flash Memory Summit 2017. (August 2017).
[36] Toshiba Develops World's First QLC BiCS FLASH 3D Memory with 4-Bit-Per-Cell Technology. https://toshiba.semicon-storage.com/us/company/taec/news/2017/06/memory-20170627-1.html. Accessed September 9, 2017.
[37] Handy, J. Flash Market Current & Future. Flash Memory Summit 2017. (August 2017).

The NVMe over Fabrics (NVMeoF) protocol is a rapidly evolving standard that enables block-level access to NVMe devices over any network fabric that supports remote direct memory access (RDMA), including InfiniBand and Intel Omni-Path. In combination with RDMA fabrics whose bandwidth and performance align with the performance of NAND, NVMeoF is expected to enable fabric-attached NVMe devices as a viable, high-performance, disaggregated storage architecture.

Furthermore, it is technologically feasible to use NVMeoF to transfer block-based data to remote targets without CPU intervention and without copying blocks through host memory. Although such a feature requires extensive hardware support and driver compatibility between NVMe devices and RDMA-enabled network interfaces, it has the potential to enable hyperconverged node designs for HPC that do not suffer from I/O-induced jitter. While such zero-jitter architectures appear in key vendors' roadmaps, it is important to stress that these solutions remain unproven in production environments. Moreover, block-level data transfer will still require storage system software to run on top of NVMeoF, and that software layer is not itself jitter-free.

A complementary technology is the Storage Performance Development Kit (SPDK) [38], which is an emerging set of libraries that provide a mechanism for applications to perform I/O to NVMe and NVMeoF devices entirely in user space. This significantly reduces the I/O latency of interacting with flash media by completely removing the need for data to transit the system kernel, and it is one of several efforts to provide a completely new interface to storage media that exposes the full capabilities of the hardware. SPDK is not widely used in production storage systems at present, but it is an instrumental component in future products, including DAOS [39].

4.1.3. Storage Class Memory and Nonvolatile RAM

Storage class memory (SCM) technologies, which include Intel/Micron's 3D XPoint, are on the horizon and promise to deliver nonvolatile and byte-addressable storage whose performance lies somewhere between today's DRAM and NAND. Although such technologies deliver higher performance and durability than NAND, their significantly higher cost per bit (and therefore lower capacity) renders SCM a pure performance technology that is likely to be integrated into larger, flash-based storage systems to mitigate the software overheads incurred by processes such as data journaling. While SCM will undeniably play a role in storage systems in the 2020 timeframe, it is likely to first appear as highly integrated components within a larger storage system. This is analogous to how flash was first integrated into enterprise storage as an extension of traditional RAM-based cache tiers, such as ZFS's ZIL/L2ARC [40].

There is opportunity for SCM to be directly used by users and applications in the form of byte-addressable nonvolatile storage with extremely low latency, but the consistency semantics of reading and writing data from a global storage resource with a load/store interface present a number of new challenges that remain a subject of intense research [41]. Of note, the NVM Library [42] is an emerging interface for persistent memory that preserves most of the low-latency benefits of SCM and flash by enabling user-space I/O directly to such devices through key-value, block, and other semantics. Although the NVM Library is currently being used to develop experimental storage services on SCM [43], libraries and applications that can make direct use of the byte-addressability of SCM are unlikely to be production-ready by 2020.

[38] Storage Performance Development Kit. http://www.spdk.io/. Accessed September 10, 2017.
[39] Paciucci, G. HPC Storage Trends. HPC Advisory Council Swiss Conference. (April 2017).
[40] Leventhal, A. 2008. Flash storage memory. Communications of the ACM. 51, 7 (Jul. 2008), 47.
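
To give a flavor of the user-space persistent-memory programming model that the NVM Library enables, the following is a minimal sketch using its libpmem layer; the device path and region size are illustrative assumptions, and error handling is abbreviated:

    /* Minimal libpmem sketch (NVM Library, http://pmem.io): map a
     * persistent-memory file, store to it with ordinary CPU stores,
     * then flush to the persistence domain. Path and size are
     * illustrative. Build with: cc pmem_demo.c -lpmem */
    #include <stdio.h>
    #include <string.h>
    #include <libpmem.h>

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;

        /* Map a small region backed by a DAX-mounted SCM device. */
        char *buf = pmem_map_file("/mnt/pmem/demo", 4096,
                                  PMEM_FILE_CREATE, 0666,
                                  &mapped_len, &is_pmem);
        if (buf == NULL) {
            perror("pmem_map_file");
            return 1;
        }

        /* The data path is a plain store: no read()/write() system
         * calls and no kernel transition. */
        strcpy(buf, "hello, storage class memory");

        /* Make the stores durable. */
        if (is_pmem)
            pmem_persist(buf, mapped_len);   /* CPU cache flush only */
        else
            pmem_msync(buf, mapped_len);     /* fallback for non-pmem */

        pmem_unmap(buf, mapped_len);
        return 0;
    }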

4.1.4. Magnetic Tape

LTO and enterprise magnetic tape media have a comfortable technological runway because they capitalize on the investments made toward improving magnetic disk media. Furthermore, state-of-the-art magnetic tape technology typically comes to market five or more years after the same technology reached the magnetic disk market, giving the tape industry a healthy lead time in the event that magnetic disk reaches any fundamental barriers to improvement.

As a consequence of tape technology trailing disk technology, though, the roadmap for magnetic tape is driven by economics, not technology. Taking LTO tape (which holds a vast majority share of the magnetic tape market) as an example, tape revenue has been steadily decreasing despite steadily increasing volumes of capacity shipped, as shown in Figure 9.

FIGURE 9. ANNUAL REVENUE AND EXABYTES SHIPPED OF LTO TAPE MEDIA. DATA FROM FONTANA AND DECAD [44].

[41] Chowdhury, M. and Rangaswami, R. 2017. Native OS Support for Persistent Memory with Regions. Proceedings of the 2017 International Conference on Massive Storage Systems and Technologies. (2017).
[42] pmem.io: NVM Library. http://pmem.io/nvml/. Accessed September 9, 2017.
[43] Carns, P. et al. 2016. Enabling NVM for Data-Intensive Scientific Services. 4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW '16) (Savannah, GA, 2016).
[44] Fontana, R., Decad, G. 2016. Storage Media Overview: Historic Perspectives. 32nd International Conference on Massive Storage Systems and Technology. (May 2016).

In addition, the diversity of the tape manufacturing market has shrunk dramatically over the last decade: as of 2014, only Sony and Fujifilm continue to manufacture magnetic tape media, and as of 2017, IBM remains the only vendor to develop tape drives and cartridges. As a direct consequence of the steady decline of tape revenue and market competition, it is likely that the rate of innovation in magnetic tape will decelerate relative to magnetic disk. The perceptible effects of this decline are less certain, though, and it is not clear whether the cost advantages of tape for archival storage will be surpassed by another medium in the next five to ten years.

If one assumes that data generation rates are ultimately bounded by the available capacity being produced, and that the majority of storage capacity is provided by magnetic disk (as evidenced in Figure 10), then the deceleration of magnetic tape capacity shipments relative to magnetic disk presents a significant risk: maintaining a constant ratio of disk to tape would require an ever-increasing investment in tape-based storage for every dollar invested in disk. Thus, while tape remains cost-effective for archival storage in the near term, it is unlikely to be the optimal long-term solution. However, the cross-over point is not imminent, and it is not clear that this point will occur before 2025.

FIGURE 10. ANNUAL EXABYTES OF STORAGE MEDIA SHIPPED. DATA FROM FONTANA AND DECAD [45].

[45] Fontana, R., Decad, G. 2016. Storage Media Overview: Historic Perspectives. 32nd International Conference on Massive Storage Systems and Technology. (May 2016).

The low cost-per-bit of tape, combined with its minimal power consumption as an offline storage medium, continues to make it an attractive archival storage technology in the short term. Given the uncertainties outlined above, though, tracking the economics of the tape market and following vendor roadmaps are essential for longer-term planning.

4.1.5. Storage System Design

Storage system architectures in 2020 will be shaped by the technological developments outlined in this section in several key ways:

1. NAND devices will stratify into performance-oriented, high-endurance MLC/TLC and low-performance, high-capacity QLC, both of which consume less power and possess higher bit density than magnetic disk.

2. Magnetic disk media will disappear from performance-critical data paths and become a capacity-only medium.

3. Magnetic tape, which has historically been a capacity-only medium, has an uncertain future as its revenues drop. However, dramatic shifts in the economics of tape are unlikely to manifest before 2020-2025.

Based on these technological and economic trends, the roles of these different media will also evolve:

1. MLC/TLC NAND will replace magnetic disk in all performance-critical applications, and QLC NAND will begin to supplant magnetic disk in many WORM application areas.

2. Magnetic disk will begin to eat away at the most performance-sensitive applications of magnetic tape, including hot archive and replicated tape.

3. Magnetic tape's role in the data center will continue to shrink toward deep archive applications as QLC NAND and magnetic disk approach it in cost.

4.2. Software

Beyond the changes coming in the hardware realm, there are many improvements and additions needed in extreme-scale storage and I/O software as well. The increasing difficulty of scaling POSIX-based parallel file systems to extreme scales is becoming a significant impediment, and, as discussed in Section 4.1.3, new software interfaces are required to make optimal use of emerging low-latency storage hardware. Because these new non-POSIX interfaces are optimized for performance over usability, though, I/O middleware will become more important to bridge the semantic gap between the I/O operations that scientific applications demand and the I/O operations supported by the underlying storage system.

4.2.1. Non-POSIX Storage System Software

The stateful, file-based nature of POSIX I/O, combined with its prescriptive metadata schema and strong consistency semantics, makes it difficult to scale POSIX-based file systems to the extreme levels of parallelism anticipated for exascale systems. Object stores, initially driven by the extreme-scale I/O needs of cloud providers, eschew POSIX I/O semantics in favor of stateless put/get operations and immutable data objects. By exposing these I/O primitives directly to applications, they provide a much more scalable foundation on which more feature-rich storage services and systems can be built.
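
To illustrate the semantic contrast with POSIX, the following self-contained sketch (our own illustration; the function names are hypothetical and do not correspond to the DAOS or Ceph APIs) shows a put/get interface that deals only in whole, immutable objects keyed by a flat identifier:

    /* Hypothetical put/get object interface with a toy in-memory
     * backend so the sketch compiles and runs. Unlike POSIX, there
     * are no byte streams, seeks, renames, or directory trees: an
     * object is written whole, once, and fetched whole by key. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    typedef uint64_t object_id_t;            /* flat namespace key */

    struct obj { object_id_t id; void *data; size_t len; };
    static struct obj store[128];            /* toy object store */
    static size_t nobjs;

    /* Store an immutable object; stateless, no open handle is kept. */
    int obj_put(object_id_t id, const void *buf, size_t len)
    {
        for (size_t i = 0; i < nobjs; i++)
            if (store[i].id == id)
                return -1;                   /* objects are immutable */
        if (nobjs == 128)
            return -1;
        store[nobjs].data = malloc(len);
        memcpy(store[nobjs].data, buf, len);
        store[nobjs].id = id;
        store[nobjs].len = len;
        nobjs++;
        return 0;
    }

    /* Fetch a whole object by key into a caller-supplied buffer. */
    int obj_get(object_id_t id, void *buf, size_t maxlen, size_t *len_out)
    {
        for (size_t i = 0; i < nobjs; i++) {
            if (store[i].id == id && store[i].len <= maxlen) {
                memcpy(buf, store[i].data, store[i].len);
                *len_out = store[i].len;
                return 0;
            }
        }
        return -1;
    }

    int main(void)
    {
        double state[4] = { 1.0, 2.0, 3.0, 4.0 }, back[4];
        size_t len;

        obj_put(7, state, sizeof state);     /* e.g., checkpoint step 7 */
        obj_get(7, back, sizeof back, &len); /* e.g., restart */
        printf("restored %zu bytes, state[0] = %g\n", len, back[0]);
        return 0;
    }

Because every operation is stateless and objects never change after they are written, the storage system is free to place, replicate, and erasure-code objects without coordinating locks or caches across thousands of clients.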

As a result, we expect to see scalable object-based storage systems, such as DAOS [46] or Ceph [47], take on a more prominent role in HPC systems in the near future. POSIX file-based interaction will still be an option for users' source code, configuration files, and input decks, but this POSIX interface will be implemented as middleware atop a native object interface rather than being the lowest-level user interface to storage. As POSIX moves from a native interface to a middleware layer, we anticipate that the hardware advances described in Section 4.1 will drive a gradual replacement of parallel file systems with object stores for both performance and capacity without requiring immediate, disruptive changes to user applications.

[46] Gorda, B. 2015. DAOS: An Architecture for Exascale Storage. 31st International Conference on Massive Storage Systems and Technology. (May 2015).
[47] Weil, S.A. et al. 2006. Ceph: A Scalable, High-Performance Distributed File System. Proceedings of USENIX Symposium on Operating Systems Design and Implementation. (2006).

4.2.2. Application Interfaces and Middleware

As POSIX evolves into middleware, we also see a greater percentage of the application community moving to other I/O middleware packages like HDF5 [48] and ADIOS [49]. This shift allows application teams to use more semantically meaningful APIs (e.g., store a whole array rather than manually serialize data structures) and to benefit from the effort and experience of the middleware package developers. The increasing adoption of I/O middleware packages will also insulate applications from the underlying shift away from current POSIX consistency semantics, allowing them to automatically gain the benefits of new hardware without having to directly interact with the storage system's native API.
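
For example, a minimal HDF5 program (the file and dataset names here are illustrative) stores an entire 2D array as a named, typed dataset in a single call, leaving serialization and on-disk layout to the library:

    /* Minimal HDF5 example: write a whole 2D array as a typed,
     * named dataset. File/dataset names are illustrative.
     * Build with the HDF5 compiler wrapper: h5cc hdf5_demo.c */
    #include "hdf5.h"

    int main(void)
    {
        double data[4][6];
        hsize_t dims[2] = { 4, 6 };

        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 6; j++)
                data[i][j] = i * 6 + j;      /* sample values */

        hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "/temperature", H5T_NATIVE_DOUBLE,
                                 space, H5P_DEFAULT, H5P_DEFAULT,
                                 H5P_DEFAULT);

        /* One call stores the whole array; HDF5 handles serialization
         * and the mapping onto whatever storage sits underneath. */
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                 H5P_DEFAULT, data);

        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }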

Increased storage of observational data and a push toward improved reproducibility of science results also lead to a need for storing provenance information on all data, as identified in Section 3.3. Enhancing I/O middleware to automatically add provenance to application data will go a long way toward improving the current wild-west conditions of data curation by providing always-available, queryable information on the storage system. These data curation improvements will add to the momentum for a long-lived Community/Forever storage that is independent of Temporary/Campaign storage.

5. Next Steps

As discussed in previous sections, the diversity of NERSC's workload will continue to drive NERSC's storage requirements in several different dimensions. File system performance must be measured not only in bandwidth but also in metadata performance, latency, and variability. Partnerships with experimental facilities and the continued growth of data science workloads will also add new data retention requirements in terms of both durability and manageability. In addition, the size of NERSC-9 will demand new levels of scalability and resilience. These requirements drive our vision for the future and our strategy for getting there.

5.1. Vision for the Future

While every HPC user desires a single storage system that is high-performance, high-capacity, and highly durable, cost will continue to require tiered storage at HPC centers. As has been the case for the past two decades, HPC will continue to deploy storage systems built from enterprise components whose economics are now driven largely by consumer and cloud markets. In the 2020-2025 timeframe, the most notable shift will be the move in platform storage away from HDDs and toward higher-performance but economical nonvolatile memory technologies.

The massive disk-based parallel file system, which has served the HPC community for more than two decades, will see its role diminished. It will no longer be the high-bandwidth resource used for all job I/O, as emerging storage technologies expressly built for NVM, such as Intel DAOS [50], IBM's burst buffer [51], and Cray DataWarp [52], become the principal interface to on-platform storage. For off-platform storage, cost-effective and scalable solutions such as object stores will begin to replace it. On-platform Temporary/Campaign storage will almost certainly be built entirely out of performance NAND and SCM, while off-platform Community/Forever storage will be a mix of QLC NAND, magnetic disk, and tape in a combination dictated by cost, technological evolution, and performance/capacity balance.

[48] Folk, M. et al. 2011. An overview of the HDF5 technology suite and its applications. Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases. (2011).
[49] Lofstead, J. et al. 2009. Adaptable, metadata rich IO methods for portable high performance IO. 2009 IEEE International Symposium on Parallel & Distributed Processing (May 2009), 1–10.
[50] Gorda, B. 2015. DAOS: An Architecture for Exascale Storage. 31st International Conference on Massive Storage Systems and Technology. (May 2015).
[51] Goldstone, R. 2016. The Road to Coral… and Beyond. HPC Advisory Council Stanford Conference. (February 2016).

An increasing number of scientific applications will interact with storage through an I/O middleware layer, allowing highly scalable storage (which provides POSIX compliance as an option, not a default) to transparently serve as the backing store. Nonvolatile memory will make inroads throughout the storage hierarchy, and as it does, storage software will be reengineered to wring out performance bottlenecks that appear when latencies are no longer dominated by the physical characteristics of disk drives. We are beginning to see this in the form of low-latency, user-space I/O libraries such as Mercury [53] and the NVM Library [54], and this trend toward optimizing software for low latency will become a requirement to match the low latency of emerging nonvolatile memory technologies.

Archival storage software, one of the last vestiges of purpose-built system software for HPC, will be radically impacted by software innovations from cloud providers. The same put/get interfaces used to store data in cloud services such as Amazon S3 also suffice for storage in the onsite archive, and the archive will provide access via these standard object APIs, including S3 and Swift. For long-term storage, the lines may well be blurred between data that resides within the local facility and data that resides offsite, either in a commercial cloud or at another open science center. Data replication currently offered by commercial object stores and cloud providers, including attributes to guarantee geographical separation, will become part of the archival software suite.

Throughout the HPC storage stack, there will be an emphasis on ease of movement between storage tiers. A new set of standards-based APIs to interact with the performance, capacity, and archival tiers will help with adoption and portability, and efforts are already underway within DOE and amongst vendors to develop these APIs. Job-scheduling software will be able to move data between all tiers as part of a run, with resource managers including Slurm, Torque, and PBS Pro already beginning to support this. The combination of standard APIs and scheduler-moderated data motion will enable users to steer jobs and marshal data between tiers more expressively. This rich, procedural interface will ensure that data is in the correct place as different workflow stages ingest, manipulate, and store data in different ways.

The hierarchical file system of today will be only one of a number of views through which users can interact with their data. Alternate views of data, searchable by user-defined attributes associated with the data, are a feature of today's cloud-based storage that will find their way into the HPC space. There are a handful of efforts to provide rich metadata capabilities atop existing parallel file systems, but they are implemented as an external software layer and have seen limited adoption in production HPC. We anticipate that search and discovery based on user-defined metadata will be better integrated directly into the storage system, and this will catalyze broader user adoption and provide a more stable foundation on which domain-specific metadata catalogs can be developed.

[52] Henseler, D. et al. 2016. Architecture and Design of Cray DataWarp. Proceedings of the 2016 Cray User Group (London, 2016).
[53] Soumagne, J. et al. 2013. Mercury: Enabling remote procedure call for high-performance computing. 2013 IEEE International Conference on Cluster Computing (CLUSTER) (Sep. 2013), 1–8.
[54] pmem.io: NVM Library. http://pmem.io/nvml/. Accessed September 9, 2017.

Although the high-bandwidth Temporary tier will continue to be purchased with the supercomputer, Community and Forever storage will be best managed as separate resources owing to the longevity of the data they will store. By decoupling these longer-term tiers' refresh cadences from the compute systems' procurement cycles, we will be able to deploy the most feature-rich storage resources the market offers, integrate new technology over time, and realize the cost benefits of purchasing storage only when it needs to be deployed.

5.2. Strategy

The changes required to realize this vision for the future of storage in HPC will require innovations that involve hardware vendors, software and middleware developers, and the larger research community. The following strategy, divided into near-term (present day through 2020) and long-term (2020-2025) targets, strives to ensure a smooth transition for NERSC users and to identify areas where NERSC leadership and community engagement would be most beneficial. The evolution of the storage hierarchy during this period is summarized in Figure 11.

FIGURE 11. EVOLUTION OF THE NERSC STORAGE HIERARCHY BETWEEN TODAY AND 2025.

In the following sections, we detail the actions required to realize this evolution.

5.2.1. Near Term (2017–2020)

The most significant change to the storage hierarchy in the 2020 timeframe will be a collapse of the burst buffer and disk-based scratch file system back into a single, high-performance, modest-capacity tier. Through the highly successful Burst Buffer Early User Program at NERSC [55] and ongoing production use of the burst buffer on Cori, solid-state media has demonstrated its viability for Temporary storage, and a single-tier, all-flash platform storage system would simplify data management for users without sacrificing substantial functionality. Given the trends in the NAND industry discussed in Section 4.1, this should be economically viable as well.

[55] Bhimji, W. et al. 2016. Accelerating Science with the NERSC Burst Buffer Early User Program. Proceedings of the 2016 Cray User Group (London, 2016).

In addition to this all-flash platform-integrated tier, a disk-based, POSIX-compatible storage system will also need to exist during this time period to satisfy the needs of the colder portions of the Campaign tier and the hotter portions of the Community tier. Unlike the NERSC project file system of today, though, this tier will be optimized for capacity and manageability, not performance. It will meet the needs of data that must be retained beyond the design life of NERSC-9's Temporary tier, such as high-value experimental observations, community-curated datasets, and other emerging use cases outlined in Sections 3.3 and 3.4. This capacity-optimized tier will present a familiar file system interface to support existing data management and transfer tools, but it will also provide access via more future-looking, object-based APIs to allow users to begin transitioning applications to put/get semantics.

The 2020 Campaign/Community storage will also satisfy many of the operational requirements discussed in Section 3.5. NERSC presently relies on key storage manageability features, including metadata replication, dynamic storage resizing, snapshotting, and enforcement of project-based quotas. The 2020 Campaign/Community storage system will expand upon these manageability features and provide a foundation to begin developing additional system monitoring and management tools for the future. It will also serve as the basis for future data curation tools and interfaces that NERSC will provide to users, and it will support features to facilitate object and file metadata searches and queries.

Due to the different performance, capacity, and feature requirements of this 2020 Campaign/Community tier, it will be acquired and managed as a resource that is independent of system platform storage through the 2020 timeframe. Unlike compute, storage is not a resource that is fully utilized as soon as it arrives; incremental growth guided by user needs and center policy will take advantage of the expected 10%-30% annual reduction in cost-per-bit and allow economical resale of extra storage to projects that need it. This planned growth allows us to adopt new storage and network technologies incrementally, deploy novel solutions earlier, and increase NERSC's agility to innovate on the new techniques and technologies in storage described in Section 4.
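
To illustrate the economics with our own arithmetic (using the cost-decline range quoted above): if the cost-per-bit falls by a fraction r each year, deferring part of a purchase by n years buys proportionally more capacity per dollar:

    % Cost per bit after n years at annual decline rate r:
    %   c_n = c_0 (1 - r)^n
    % At the quoted r = 10%-30%, waiting three years buys
    %   1/(1-r)^3 = 1.4x (r = 0.10) to 2.9x (r = 0.30)
    % more capacity per dollar than buying everything up front.
    \[
      c_n = c_0 (1 - r)^n, \qquad
      \frac{c_0}{c_3} = \frac{1}{(1-r)^3} \approx
      \begin{cases} 1.4, & r = 0.10 \\ 2.9, & r = 0.30 \end{cases}
    \]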

The 2020 Forever storage will remain predominantly tape-based due to tape's economic advantages. Tape technology will continue to be more cost-effective than disk through 2020, and transitioning an exabyte of data (or more) to a new storage medium would require significant capital investment and time. There may be opportunity to explore alternative archive media, but there are no truly compelling options in the near term. Other key technologies that may become technologically viable for archive, such as low-durability NAND [56] or hyperscale disk-based object stores, will still not be cost-competitive versus tape by 2020.

NERSC will undoubtedly continue to deploy tape-based storage beyond 2020, but it is unlikely that tape's economic scaling rates will continue. Although NERSC's Forever storage has been treated as a limitless data store for users in the past, the economics of the tape market are making this an unsustainable policy. We have already begun to take steps to sharpen the focus of the NERSC archive, resulting in a 10% reduction in size, and further refinements will be made based on close monitoring of the tape market.

The sum of these findings drives us toward the storage hierarchy for NERSC in 2020 shown in Figure 12.

[56] Peglar, R. 2016. Innovations in Non-Volatile Memory: 3D NAND and its Implications. 32nd International Conference on Massive Storage Systems and Technologies (2016).

FIGURE 12. TARGET THREE-TIER STORAGE HIERARCHY FOR NERSC IN 2020.

To meet these near-term requirements and evolve the storage hierarchy toward this design, several critical actions must be taken before 2020:

1. The present NERSC project file system must be expanded significantly to reflect its role relative to the platform-integrated Temporary/Campaign tiers on Cori and NERSC-9. Because this storage system is optimized for manageability, accessibility, and usability, its capacity should reflect the desire of users to store the bulk of their working data on it; the aim is for a size of 2-3x the performance tier. This is in contrast with today's hierarchy, where users store data for as long as possible on the performance tier (before the data gets purged) and then move it to the Forever tier.

2. Investments must be made toward fully utilizing the data management features present in NERSC's project file system and archive. Building new data management tools that unify these tiers will be essential; this includes improving accessibility (via new interfaces such as industry-standard object APIs) and introspection (via expanded indexing, monitoring, and characterization capabilities).

3. Given that the project file system will hold the Community tier, we expect decelerated growth for the tape-based archive. Policies and stricter quotas may be necessary to ensure that maintaining Forever storage is economically sustainable.

The result of these efforts will be a single, high-performance, platform-integrated storage system that satisfies the role of Temporary storage and some very hot Campaign storage; a high-capacity but scalable and manageable storage system that satisfies the role of Campaign and Community storage; and a closely integrated, high-capacity, high-durability storage system that satisfies the role of very cold Community storage and Forever storage.

5.2.2. Long Term (2020–2025)

The next evolutionary step beyond the 2020 storage architecture will aim to transform the closely integrated Community and Forever storage systems into a single Community/Forever tier for long-term data retention, curation, and sharing. This results in a two-tier storage hierarchy, as shown in Figure 13.

FIGURE 13. TARGET TWO-TIER STORAGE HIERARCHY FOR NERSC IN 2025.

As with the 2020 storage infrastructure, the platform-integrated tier will emphasize performance first. It will provide a native interface that delivers extreme performance through asynchronous I/O, relaxed consistency semantics, and a user-space client implementation [57]. Users will still be able to access this tier through a familiar POSIX interface implemented as middleware, but this file-based API will not deliver the full performance of the underlying NAND- and SCM-based hardware. Rather, applications that require extreme performance will have to either use I/O middleware that supports the native interface or restructure their I/O to use the native interface directly. Given the disruptive nature of such a change, the semantics of this new API should be well defined by 2020, and experimental systems must be available to allow users to begin testing and modernizing their application I/O.

[57] See the discussion of Mercury, NVML, DAOS, and other software interfaces in Sections 4.1.3, 4.2.1, and 5.1.

At the Community and Forever tiers, preparing for a transition away from established solutions like tape-based storage and HPSS toward object-storage solutions backed by shingled disk or archival NAND will require a careful assessment of the potential replacement technologies and production hardening. As a point of reference, DOE has invested decades in the development of HPSS to meet its mission needs, but adopting off-the-shelf technologies (such as open-source or commercial object-storage solutions) will pay future dividends by aligning our approach to mass storage with those of the cloud and hyperscale communities. Moving users to an object-based interface for the archive will allow us to transparently migrate away from tape-based media should tape continue to decline. However, building these bridges requires connecting users with these technologies, and ensuring they meet user and operational requirements will require investment on the part of NERSC and the HPC community.

Preparing the NERSC storage hierarchy to transition into this long-term vision by 2025 requires additional actions within the next five years:

1. The NERSC Data Archive mission must be redefined to align its growth trajectory with the long-term target capacities and investments so that the transition to 2025 is seamless. This will involve engagement with those users whose data needs will exceed storage capacity projections, and it will involve developing software and infrastructure to assist users in managing and migrating their data.

2. Test platforms must be fielded to explore new I/O paradigms, including performance-oriented object stores and software systems capable of effectively utilizing next-generation nonvolatile memory technologies. This will allow NERSC to establish a credible understanding of how difficult a future transition to such systems would be for our users and also allow us to develop tools that address those difficulties. Such a system would also inform the return on investment users can expect from this effort and maintain our understanding of these technologies' maturity.

3. We must develop the tools and infrastructure that allow the performance/project tiers and campaign/archive tiers to collapse. For example, DAOS includes components that could glue together its high-performance asynchronous object interface with a lower-performance but higher-durability flash layer. Similarly, a software technology such as IBM's GHI would have to be proven out to integrate a GPFS-based campaign tier with an HPSS-based archive tier.

5.2.3. Opportunities to Innovate and Contribute

NERSC is uniquely positioned to lead a transition to this storage architecture because of its broad user base, deep understanding of user requirements, and proven ability to partner with application developers in code modernization efforts. As such, our role in leading a transition to future storage technologies is centered around two key areas:

1. Driving requirements that will steer emerging software, middleware, and hardware technologies in a direction that will be broadly accessible and useful across all segments of the HPC and scientific computing markets.

2. Demonstrating and hardening emerging software, middleware, and hardware technologies in extreme-scale but highly diverse workload environments that span traditional high-performance simulation, high-throughput experimental data processing and synthesis, and machine learning-driven data analytics at scale.

Ultimately, leading the ground-up design of novel storage systems or defining new storage paradigms at the bleeding edge of computational science is not within the NERSC mission. Rather, our expertise lies in understanding how such radical changes will affect each of the scientific domain areas' workflows at all scales, and this is where NERSC's leadership will be essential to ensure that emerging I/O technologies will be viable and sustainable as they mature into the broader HPC ecosystem. This contribution is essential to help new storage systems and APIs meet their full potential by broadening user adoption. Opportunities to drive requirements are manifold, and we categorize these opportunities as being in software, middleware, and hardware.

At the software level, NERSC's broad user base serves as a unique sounding board for emerging I/O APIs and software technologies. The NERSC Burst Buffer Early User Program has been an exemplar of how well NERSC is suited to proving out new storage systems, new modes of user-defined configuration, and new mechanisms of data access. The program provided the vendor with continuous feedback about how different user communities wanted to interact with flash storage, and it both drove the burst buffer's design and demonstrated its viability to the greater HPC community. Not only did this work strengthen the burst buffer software (much to the benefit of the user community and the vendor), it demonstrated that software-defined storage and flash-based file systems are viable technologies for the future. This effort is now augmented by the Tiered Storage Working Group, a partnership of DOE labs and burst buffer vendors, to define standards-based APIs for interacting with future multi-tier storage platforms.

It is critical that NERSC continue to make investments in partnering with storage software providers to ensure that our users' needs are represented in designs. The strategic importance of this cannot be overstated as the HPC industry begins to explore radically new alternatives to the traditional parallel file system and as the enterprise industry drives object-based archival solutions into the HPC space. Failing to engage both software vendors and users in exploring new storage paradigms presents a significant risk that these storage solutions will evolve in directions not suitable for the broad user community and that compute- and data-intensive computing will bifurcate at the storage layer.

The middleware level represents an ideal area where NERSC should lead in bridging the gap between rapidly changing storage hardware and the diversity of user applications that change much more slowly. A case in point was a recent demonstration of using the HDF5 middleware to interface directly with DAOS [58]; because a significant fraction of NERSC data is stored as HDF5, a substantial amount of the work required to port applications to entirely new I/O APIs and paradigms can be done in the middleware layer, effectively enabling broad adoption at only a modest investment from NERSC. Given the broad and increasing use of I/O middleware in HPC, this investment would be of significant benefit to the greater HPC community as well.

It is therefore essential that we continue to engage with the broad user community to transition applications to use I/O middleware where appropriate. Furthermore, we must continue close engagement with middleware developers to ensure that the features essential to users, including metadata, provenance tracking, and ease of use, guide the development of this middleware. Failure to invest in this will hold open a gap between today's applications and the native interfaces of non-POSIX storage systems, reducing the performance and scalability benefits offered by new, nonvolatile hardware.

At the hardware level, NERSC has begun an effort to integrate the monitoring of the storage tiers into a holistic understanding of emerging I/O demands, and continuing this work will provide critical feedback to vendors. For example, monitoring the workloads and wear rates on Cori's burst buffer has identified that HPC workloads would benefit greatly from multi-stream support in SSD firmware [59], and ongoing vendor engagement and sharing of endurance data has found that HPC workloads would be better served by trading high write endurance for added capacity on enterprise SSDs. Furthermore, these monitoring efforts are improving the performance, reliability, and usability of NERSC's storage systems by establishing detailed baseline behavior and maintaining relationships with vendors that facilitate rapid diagnosis, resolution, and improvements when aberrations arise.

Tracking NERSC production workload telemetry, curating and contextualizing it, sharing it with the larger vendor and research community, and actively maintaining productive engagements with vendors and researchers have provided significant returns for NERSC and the larger HPC community. In the absence of NERSC investment, the evolution of new storage technologies may be shaped by boutique workloads and the enterprise market, resulting in an overall loss of value in future generations of NVM, network technologies, and SCM.

[58] Breitenfeld, M.S. et al. 2016. Use of a new I/O stack for extreme-scale systems in scientific applications. Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (2016).
[59] Han, J. et al. 2017. Accelerating a Burst Buffer via User-Level I/O Isolation. 2017 IEEE International Conference on Cluster Computing (CLUSTER) (2017), 245–255.

6. Conclusion

The increased amount of data generated at experimental facilities and the prevalence of high-speed network connections between their instruments and centers such as NERSC point to an explosive increase in the volume of experimental data stored at computing sites. This, combined with the massive increase of data produced by exascale computations, requires rethinking the HPC storage hierarchy to maintain acceptable performance and cost. We have established four logical tiers of data storage based on required performance, capacity, shareability, and manageability and mapped these logical tiers to physical storage systems based on the prevalent trends in storage technologies.

In the short term, collapsing platform-integrated, high-performance, flash-based storage systems into a single tier that satisfies the requirements of Temporary and hot Campaign storage is feasible and desirable to simplify I/O for scientific workflows and data management. Moving the colder, disk-based Campaign/Community and tape-based Forever storage tiers into a more closely integrated group of systems is also tractable by 2020 and positions NERSC for a two-tier storage hierarchy in 2025.

This two-tiered 2025 storage system establishes a converged Temporary/Campaign storage system and a Community/Forever storage system, allowing NERSC to separately optimize extreme I/O performance from the orthogonal needs of long-lived, high-value community datasets. This transition will be critical to meeting the needs of NERSC users with the best available storage technologies in both 2020 and 2025, and immediate investments in software, middleware, and hardware technologies are necessary to achieve the benefits foreseen for that transition.

As the principal provider of HPC services to the DOE Office of Science, NERSC will deploy these new storage technologies while continuing to provide fast and reliable storage resources that meet the needs of our broad spectrum of users. The diversity of workflows and unique datasets that rely on NERSC's computational and storage resources put NERSC in a strong position to understand how the changing storage landscape will affect the scientific domain areas' workflows at all scales. Executing the strategy presented in this document will ensure that emerging I/O technologies will be viable and sustainable solutions to meeting the needs of the DOE Office of Science as well as the broader HPC community.