Sandia National Laboratories is a multi-mission laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Large-Scale Data Analytics and Its Relationship to Simulation
CMSE Frontiers in Data Science and Computing Workshop
Michigan State University
October 4, 2016
Rob Leland
Vice President, Science & Technology
Chief Technology Officer
Sandia National Laboratories
SAND2016-9893
Outline
§ Some necessary background
§ A charge from the National Strategic Computing Initiative
§ Answers to three key questions
  § Why is increasing coherence between simulation and analytics important?
  § What is really meant by “increasing coherence” between the two?
  § How might coherence be furthered in practice?
§ A unifying vision
Terms and context
§ Simulation
  § Computations to understand physical phenomena or conduct engineering
§ Large Scale Data Analytics (LSDA)
  § Data Analytics = Discovering meaningful patterns in data
  § Large Scale = Requiring leading-edge processing and storage capabilities
§ LSDA is increasing in importance
  § Pervasive
    § Commerce, finance, healthcare, science, engineering, national security, ...
  § Lasting societal significance
    § Internet search, genomics, climate modeling, Higgs particle, ...
§ LSDA is getting “harder”
  § Captured data growing exponentially with time
  § Individual analysis becoming more sophisticated
  § More people examining more data more frequently
  § Aggregate work growing much faster than Moore’s Law
Source: The Economist
National Strategic Computing Initiative (NSCI)
NSCI Strategic Objectives
§ (1) Accelerating delivery of a capable exascale computing system that integrates hardware and software capability to deliver approximately 100 times the performance of current 10 petaflop systems across a range of applications representing government needs.
§ (2) Increasing coherence between the technology base used for modeling and simulation and that used for data analytic computing.
§ (3) Establishing, over the next 15 years, a viable path forward for future HPC systems even after the limits of current semiconductor technology are reached (the "post-Moore's Law era").
§ (4) Increasing the capacity and capability of an enduring national HPC ecosystem by employing a holistic approach that addresses relevant factors such as networking technology, workflow, downward scaling, foundational algorithms and software, accessibility, and workforce development.
§ (5) Developing an enduring public-private collaboration to ensure that the benefits of the research and development advances are, to the greatest extent, shared between the United States Government and industrial and academic sectors.
Q1: Why is increasing coherence between simulation and analytics important?
§ For simulation
  § HPC simulation must ride on some commodity curve
  § Larger market forces behind analytics
  § Can exploit commodity component technology from analytics
§ For analytics
  § Large Scale Data Analytics problems becoming ever more sophisticated
  § Requiring more coupled methods
  § Can exploit architectural lessons from HPC simulation
§ For both: Integration of simulation and analytics in the same workflow
  § Automation of analysis of data from simulation
  § Creation of synthetic data via simulation to augment analysis
  § Automated generation and testing of hypotheses
  § Exploration of new scientific and technical scenarios
  § ...
Mutual inspiration, technical synergy, and economies of scale in the creation, deployment, and use of HPC resources
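As one hedged illustration of "integration of simulation and analytics in the same workflow", the sketch below interleaves a toy simulation step with an analysis step inside a single loop. Everything in it is an assumption made for illustration (the 1-D periodic diffusion model, the names `diffuse` and `run_coupled`, and the choice of the peak value as the analyzed quantity); none of it is drawn from the talk.

```python
def diffuse(u, alpha=0.25):
    """One explicit diffusion step on a periodic 1-D grid (toy simulation)."""
    n = len(u)
    return [u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n])
            for i in range(n)]


def run_coupled(steps, n=32):
    """Advance the simulation and analyze each state as it is produced."""
    u = [0.0] * n
    u[n // 2] = 1.0              # initial condition: a single spike
    peaks = []                   # analytics output, collected in situ
    for _ in range(steps):
        u = diffuse(u)           # simulation step
        peaks.append(max(u))     # analysis step on fresh data, no file I/O
    return u, peaks


if __name__ == "__main__":
    final_state, peaks = run_coupled(steps=10)
    print("peak value per step:", [round(p, 4) for p in peaks])
```

The point of the sketch is only the loop structure: the analysis consumes each simulation state in memory rather than reading it back from storage afterwards.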
A challenge because simulation and analytics differ in many respects…
Data structures describing simulation and analytics differ
Graphs from simulations may be irregular, but have more locality than those derived from analytics
[Figure panels. Computational simulation of physical phenomena: climate modeling, car crash. Large Scale Data Analytics: Internet connectivity, yeast protein interactions.]
Figures from Leland et al., courtesy of Yelick, LBNL.
Computation and communication patterns differ
[Figure: three panels, labeled Simulation and Analytics.]
Black = time spent computing
Green = time spent communicating
White = time spent waiting for data to be communicated
§ The U.S. road map, which has spatial locality and is thus the most similar of the three in structure to computational patterns that would arise in typical physical simulations.
§ The Erdős-Rényi graph, a well-studied example in graph theory work.
§ A scale-free graph, an example more reflective of real-world networks.
Figure from Leland et al., courtesy of Johnson, PNNL.
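To make the contrast between the Erdős-Rényi and scale-free graphs concrete, here is a small sketch that generates one of each with roughly the same number of edges and compares their largest vertex degrees. The generator names, the preferential-attachment growth rule used as the scale-free model, and the sizes `n = 1000`, `m = 2` are illustrative assumptions, not the graphs used in the figure.

```python
import random


def erdos_renyi_degrees(n, p, rng):
    """Degrees of a G(n, p) graph: each vertex pair is linked with probability p."""
    deg = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                deg[u] += 1
                deg[v] += 1
    return deg


def preferential_attachment_degrees(n, m, rng):
    """Degrees of a graph grown by attaching each new vertex to m existing
    vertices chosen with probability proportional to their current degree."""
    deg = [0] * n
    endpoints = []                      # vertex v appears deg[v] times
    for u in range(m + 1):              # seed: a clique on m + 1 vertices
        for v in range(u + 1, m + 1):
            deg[u] += 1
            deg[v] += 1
            endpoints += [u, v]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:         # m distinct, degree-biased targets
            targets.add(rng.choice(endpoints))
        for t in targets:
            deg[new] += 1
            deg[t] += 1
            endpoints += [new, t]
    return deg


if __name__ == "__main__":
    rng = random.Random(0)
    n, m = 1000, 2
    sf = preferential_attachment_degrees(n, m, rng)
    mean_degree = sum(sf) / n
    er = erdos_renyi_degrees(n, mean_degree / (n - 1), rng)
    print("Erdős-Rényi max degree:", max(er))
    print("scale-free  max degree:", max(sf))
```

With matched average degree, the scale-free graph develops a few very high-degree hubs while the Erdős-Rényi degrees stay close to the mean. Hubs of that kind are one plausible source of the uneven communication the figure depicts.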
Memory performance demands differ
A key differentiator in the performance of simulation and analytics
[Figure: benchmarks and applications plotted as circles, grouped as Simulation and Analytics.]
Area of the circle = relative data intensiveness (i.e., total amount of unique data accessed over a fixed interval of instructions)
Standard benchmarks include:
• LINPACK (smallest data intensiveness; barely visible on graph)
• STREAM
• SPECfp
• SPECint
Figure from Murphy & Kogge, with adjustment to double the radius of the LINPACK data point to make it visible.
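The definition of data intensiveness above can be made concrete with a short sketch. It assumes a memory trace given as a list of addresses, one entry per access, used here as a stand-in for a fixed interval of instructions. The function name `data_intensiveness`, the trace format, and the window size are illustrative assumptions, and the measurement procedure in Murphy & Kogge may differ in detail.

```python
def data_intensiveness(trace, window):
    """Mean number of unique addresses touched per window of `window` accesses."""
    if window <= 0:
        raise ValueError("window must be positive")
    counts = [len(set(trace[i:i + window]))
              for i in range(0, len(trace) - window + 1, window)]
    if not counts:
        raise ValueError("trace is shorter than one window")
    return sum(counts) / len(counts)


if __name__ == "__main__":
    dense_kernel = [i % 16 for i in range(4096)]   # reuses 16 addresses
    streaming = list(range(4096))                  # never reuses an address
    print(data_intensiveness(dense_kernel, 1024))  # small working set
    print(data_intensiveness(streaming, 1024))     # large working set
```

A kernel that keeps reusing a small working set scores low, which matches the slide's note that LINPACK is barely visible. A code that streams through new data scores high.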
Application code characteristics differ
Contrasting properties:
Application code property   Simulation                         Analytics
Spatial locality            High                               Low
Temporal locality           Moderate                           Low
Memory footprint            Moderate                           High
Computation type            May be floating-point dominated*   Integer intensive
Input-output orientation    Output dominated                   Input dominated
*Increasingly, simulation work has become less floating-point dominated
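The spatial and temporal locality properties contrasted here can be estimated from an address trace. The sketch below uses two simple proxies, both of which are assumptions for illustration rather than the metrics behind the slide. Spatial locality is the fraction of accesses within `max_stride` of the previous access. Temporal locality is the fraction of accesses that repeat an address seen within the last `history` accesses. The traces are synthetic, and the functions assume at least two accesses.

```python
def spatial_locality(trace, max_stride=8):
    """Fraction of accesses that land within max_stride of the previous access."""
    near = sum(1 for a, b in zip(trace, trace[1:]) if abs(b - a) <= max_stride)
    return near / (len(trace) - 1)


def temporal_locality(trace, history=64):
    """Fraction of accesses that repeat an address seen in the last `history` accesses."""
    hits = 0
    for i, addr in enumerate(trace):
        if addr in trace[max(0, i - history):i]:
            hits += 1
    return hits / len(trace)


if __name__ == "__main__":
    sweep = list(range(1000))                              # regular array sweep
    hot_loop = [0, 1, 2, 3] * 250                          # small reused working set
    scattered = [(i * 997) % 100003 for i in range(1000)]  # widely spread addresses
    for name, trace in [("sweep", sweep), ("hot_loop", hot_loop),
                        ("scattered", scattered)]:
        print(name, spatial_locality(trace), temporal_locality(trace))
```

Under these proxies the scattered trace scores low on both measures, the pattern the table attributes to analytics codes, while the sweep and hot-loop traces each show one form of locality.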
Q2: So what is meant by “increasing coherence” between simulation and analytics?
§ NOT one system ostensibly optimized for both simulation and analytics
§ Greater commonality in underlying componentry and design principles
§ Greater interoperability, allowing interleaving of both types of computations
…A more common hardware and software roadmap between simulation and analytics
And yet, there is hope…
Simulation and analytics are evolving to become more similar in their architectural needs
§ Current challenges for the LSDA community (…similar to HPC simulation)
  § Data movement
  § Power consumption
  § Memory/interconnect bandwidth
  § Scaling efficiency
§ Instruction mix for Sandia’s HPC engineering codes (…similar to LSDA)
  § Memory operations 40%
  § Integer operations 40%
  § Floating point 10%
  § Other 10%
§ Common design impacts of energy cost trends
  § Increased concurrency (processing threads, cores, memory depth)
  § Increased complexity and burden on system software, languages, tools, runtime support, codes
Energy cost of moving data is becoming dominant
[Figure: Energy cost for various common operations. Vertical axis: energy cost, in picojoules (pJ), per 64-bit floating-point operation. Legend: cost estimates for technology year.]
From Dan McMorrow, Technical Challenges of Exascale Computing, JSR-12-310, JASON, MITRE Corporation, April 2013.
Emerging architectural and system software synergies
Similar needs:
Architectural characteristic, with the need for Simulation and for Analytics:
Computation
  Simulation: Memory address generation dominated
  Analytics: Same
Primary memory
  Simulation: Low power, high bandwidth, semi-random access
  Analytics: Same
Secondary memory
  Simulation: Emerging technologies may offset cost, allowing much more memory
  Analytics: …require extremely large memory spaces
Storage
  Simulation: Integration of another layer of memory hierarchy to support checkpoint/restart
  Analytics: …to support out-of-core data set access
Interconnect technology
  Simulation: High bisection bandwidth (for relatively coarse-grained access)
  Analytics: …(for fine-grained access)
System software (node-level)
  Simulation: Low dependence on system services, increasingly adaptive, resource management for structured parallelism
  Analytics: …highly adaptive, resource management for unstructured parallelism
System software (system-level)
  Simulation: Increasingly irregular workflows
  Analytics: Irregular workflows
Q3: How might coherence be furthered in practice?
§ Making it an element of national strategy
  § Check via the NSCI
§ Building this into exascale computing efforts
  § Also a component of the NSCI
§ Communicating with and enlisting the technical communities concerned
  § This forum and similar events
§ Further developing the vision
  § Today’s dialogue session!
A unifying vision for simulation and analytics
From The Fourth Paradigm: Data-Intensive Scientific Discovery by Jim Gray
Data analysis complements theory, experiment, and computation
Acknowledgements
Additional references
§ The Economist, “Data, Data, Everywhere,” Feb 25th, 2010.
§ R. C. Murphy and P. M. Kogge, “On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications,” IEEE Transactions on Computers 56 (7, July 2007): 937–945.
§ R. Murphy, “Power Issues,” presentation to JASON 2012, June 2012.
§ Peter Kogge (editor) et al., ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. DARPA, 2008.
§ Dan McMorrow, Technical Challenges of Exascale Computing, JSR-12-310, JASON, MITRE Corporation, April 2013.
§ Tony Hey, Stewart Tansley, and Kristin Tolle (editors), The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009.
§ Jim Gray, The Fourth Paradigm: Data-Intensive Scientific Discovery
Suggested questions for breakout dialogue
§ Why would increasing the coherence between the technology base used for simulation and that for analytics bring value in the context of your work?
§ What research and development would best support development of a more common component roadmap and design principles bridging simulation and analytics?
§ How would this research be best organized?
Supplementary Material
Graph matching example of data analytics
A key analytic primitive: used to find a specific instance of an abstract pattern of interest
From Coffman, Greenblatt, and Marcus, Graph-Based Technologies for Intelligence Analysis, Communications of the ACM, 47, March 2004.
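A minimal sketch of the graph matching primitive follows. It is a backtracking search that maps each vertex of a small pattern onto distinct vertices of a data graph so that every pattern edge lands on a data-graph edge. The function name `find_matches`, the undirected edge-list input, and the toy graph are assumptions for illustration; this is not presented as the method of Coffman, Greenblatt, and Marcus, and it is not tuned for large graphs.

```python
def find_matches(pattern_edges, graph_edges):
    """Yield dicts mapping pattern vertices to distinct graph vertices such
    that every (undirected) pattern edge maps onto a graph edge.
    Assumes every pattern vertex appears in at least one pattern edge."""
    adj, p_adj = {}, {}
    for table, edges in ((adj, graph_edges), (p_adj, pattern_edges)):
        for a, b in edges:
            table.setdefault(a, set()).add(b)
            table.setdefault(b, set()).add(a)
    order = sorted(p_adj)                 # fixed order for assigning pattern vertices

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)           # complete, consistent assignment
            return
        pv = order[len(mapping)]
        for gv in sorted(adj):
            if gv in mapping.values():
                continue                  # keep the mapping one-to-one
            # every already-mapped pattern neighbor must be a graph neighbor
            if all(mapping[q] in adj[gv] for q in p_adj[pv] if q in mapping):
                mapping[pv] = gv
                yield from extend(mapping)
                del mapping[pv]           # backtrack

    yield from extend({})


if __name__ == "__main__":
    graph = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
    triangle = [("x", "y"), ("y", "z"), ("z", "x")]
    for match in find_matches(triangle, graph):
        print(match)
```

On the toy graph the only triangle is a-b-c, so the search reports that one instance under each of its six vertex orderings. A practical tool would typically remove such symmetric duplicates and add vertex or edge labels to the pattern.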