Twister2: A High-Performance Big Data Programming Environment
Geoffrey Fox, May 21, 2018
Judy Qiu, Supun Kamburugamuve, Department of Intelligent Systems Engineering
Work with Shantenu Jha, Kannan Govindarajan, Pulasthi Wickramasinghe, Gurhan Gunduz, Ahmet Uyar
gcf@indiana.edu, http://www.dsc.soic.indiana.edu/, http://spidal.org/
HPBDC 2018: The 4th IEEE International Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing
Abstract
• We analyse the components that are needed in programming environments for Big Data Analysis Systems with scalable HPC performance and the functionality of ABDS, the Apache Big Data Software Stack.
• One highlight is Harp-DAAL, a machine learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem.
• Another highlight is Twister2, which consists of a set of middleware components to support batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink, but with high performance.
• Twister2 covers bulk synchronous and dataflow communication; task management as in Mesos, Yarn and Kubernetes; dataflow graph execution models; launching of the Harp-DAAL library; streaming and repository data access interfaces; and in-memory databases and fault tolerance at dataflow nodes.
• Similar capabilities are available in current Apache systems, but as integrated packages that do not allow the customization needed for different application scenarios.
Requirements
• On general principles, parallel and distributed computing have different requirements even if they sometimes have similar functionalities.
• The Apache stack ABDS typically uses distributed computing concepts. For example, the Reduce operation is different in MPI (Harp) and Spark (see the sketch after this list).
• Large-scale simulation requirements are well understood. Big Data requirements are not agreed, but there are a few key use types:
1) Pleasingly parallel processing (including local machine learning, LML), as of different tweets from different users, with perhaps a MapReduce style of statistics and visualizations; possibly streaming
2) Database model with queries, again supported by MapReduce for horizontal scaling
3) Global Machine Learning (GML) with a single job using multiple nodes, as in classic parallel computing
4) Deep Learning, which certainly needs HPC, possibly only multiple small systems
• Current workloads stress 1) and 2) and are suited to current clouds and to Apache Big Data Software (with no HPC).
• This explains why Spark, with poor GML performance, can be so successful.
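To make the Reduce difference concrete, here is a minimal Java sketch contrasting the two styles. It assumes the Open MPI Java bindings and the Spark Java API; cluster configuration and data loading are omitted, and the numbers carried are just a sum.

  import mpi.MPI;                                  // Open MPI Java bindings
  import org.apache.spark.api.java.JavaRDD;

  public class ReduceStyles {
    // MPI (Harp-like) style: every rank holds a partial value in place and every
    // rank receives the combined result; state lives in the long-running process.
    static double mpiStyle(double partial) throws Exception {
      double[] buf = {partial};
      MPI.COMM_WORLD.allReduce(buf, 1, MPI.DOUBLE, MPI.SUM);  // in-place collective
      return buf[0];                               // same value on every rank
    }

    // Spark (dataflow) style: reduce is an edge in the dataflow graph; the combined
    // value is materialized at the driver, not left in the executors.
    static double sparkStyle(JavaRDD<Double> partials) {
      return partials.reduce((a, b) -> a + b);     // result returned to the driver
    }
  }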
Spectrum of Applications and Algorithms
(Figure: applications placed by difficulty in parallelism, size of synchronization constraints and size of disk I/O; there is also distribution, as seen in grid/edge computing. The spectrum runs from loosely coupled to tightly coupled and from Commodity Clouds through HPC Clouds with High Performance Interconnect to Exascale Supercomputers.)
• Pleasingly Parallel: often independent events; together with MapReduce as in scalable databases this is the current major Big Data category and the Classic Cloud Workload
• Global Machine Learning (e.g. parallel clustering), Deep Learning, Graph Analytics (e.g. subgraph mining) and LDA: linear algebra at the core (typically not sparse), unstructured adaptive sparsity, medium-size jobs; HPC Clouds/Supercomputers, where memory access is also critical
• Large-scale and parameter sweep simulations: structured adaptive sparsity, huge jobs
• These 3 are the focus of Twister2, but we need to preserve capability on the first 2 paradigms
• Note Problem and System Architecture: efficient execution says they must match
• Need a toolkit covering all applications, i.e. the 5 main paradigms, with the same API but different implementations
Comparing Spark, Flink and MPI
• On Global Machine Learning (GML)

Machine Learning with MPI, Spark and Flink
• Three algorithms implemented in three runtimes
  • Multidimensional Scaling (MDS)
  • Terasort
  • K-Means (dropped as no time; looked at later)
• Implementation in Java
  • MDS is the most complex algorithm: three nested parallel loops
  • K-Means: one parallel loop
  • Terasort: no iterations
• With care, Java performance ~ C performance
• Without care, Java performance << C performance (details omitted)
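A minimal sketch of the kind of care involved (illustrative only, not code from the benchmark): inner loops over primitive double[] arrays avoid boxing and per-element allocation, while the boxed version of the same computation is typically far slower in Java.

  // "With care": primitive arrays, no allocation or boxing in the inner loop.
  static double dotPrimitive(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // "Without care": boxed Doubles force unboxing and pointer chasing on every element.
  static double dotBoxed(java.util.List<Double> a, java.util.List<Double> b) {
    double sum = 0.0;
    for (int i = 0; i < a.size(); i++) {
      sum += a.get(i) * b.get(i);
    }
    return sum;
  }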
Multidimensional Scaling: 3 Nested Parallel Sections
• MDS execution time on 16 nodes, with 20 processes in each node, for varying numbers of points
• MDS execution time with 32000 points on varying numbers of nodes; each node runs 20 parallel tasks
• Spark and Flink show no speedup (figure legend: Flink, Spark, MPI)
• MPI is a factor of 20-200 faster than Spark/Flink
• K-Means is also bad; see later
Terasort
Sorting 1TB of data records
• Terasort execution time on 64 and 32 nodes. Only MPI shows the sorting time and communication time, as the other two frameworks don't provide a clear method to measure them accurately. Sorting time includes data save time. MPI-IB is MPI with InfiniBand.
• Partition the data using a sample and regroup
Software: HPC-ABDS and HPC-FaaS
NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
• Ogres Application Analysis
• HPC-ABDS and HPC-FaaS Software: Harp and Twister2 Building Blocks
• SPIDAL Data Analytics Library
Software: MIDAS, HPC-ABDS
HPC-ABDS integrated a wide range of HPC and Big Data technologies. I gave up updating the list in January 2016!
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
Cross-Cutting Functions
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
Layers
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication, collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective; Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame; Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS; Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public clouds
Networking: Google Cloud DNS, Amazon Route 53
21 layers, over 350 software packages (January 29, 2016)
Different choices in software systems in Clouds and HPC. HPC-ABDS takes cloud software augmented by HPC when needed to improve performance: 16 of the 21 layers plus languages.
Harp Plugin for Hadoop: Important Part of Twister2
Work of Judy Qiu
• The Map-Collective runtime merges MapReduce and HPC
• Harp collectives: allreduce, reduce, rotate, push & pull, allgather, regroup, broadcast (a sketch of the Map-Collective pattern follows)
• Runtime software for Harp
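The Map-Collective pattern can be sketched as below. The Collectives interface is a hypothetical stand-in for Harp's table-based collectives (the real Harp API works with partitioned tables inside a CollectiveMapper); the sketch only shows the shape of the computation.

  // Hypothetical collective handle standing in for Harp's table-based collectives.
  interface Collectives {
    double[] allreduce(double[] localContribution);  // element-wise sum across all mappers
  }

  class MapCollectiveExample {
    // Each mapper reduces its own data split, then one collective replaces the shuffle:
    // every mapper ends up holding the same global mean.
    static double globalMean(double[] localValues, Collectives comm) {
      double[] partial = {0.0, localValues.length};
      for (double v : localValues) partial[0] += v;
      double[] global = comm.allreduce(partial);     // [global sum, global count]
      return global[0] / global[1];
    }
  }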
Dynamic Rotation Control for Latent Dirichlet Allocation and Matrix Factorization SGD (stochastic gradient descent)
• Model parameters come either from caching or from rotation; other model-related data is held locally
• Each worker computes until the rotation time arrives, then starts model rotation to address load imbalance (a sketch follows)
• Multi-thread execution
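A minimal sketch of the rotation idea: each worker computes on its cached data against the model slice it currently holds, then passes that slice to the next worker in a ring, so every slice visits every worker once per pass. The Ring interface is a hypothetical placeholder for the underlying rotate communication.

  // Hypothetical ring-communication handle; in Harp this is the rotate collective.
  interface Ring {
    void sendToNext(double[] modelSlice);
    double[] receiveFromPrevious();
  }

  class ModelRotation {
    // One pass of rotation-based training (LDA / SGD style): compute on the local
    // data with the current model slice, then rotate slices around the ring.
    static double[] rotatePass(double[][] localData, double[] mySlice, Ring ring, int numWorkers) {
      for (int step = 0; step < numWorkers; step++) {
        updateModel(localData, mySlice);        // local computation until the rotation time arrives
        ring.sendToNext(mySlice);               // hand this slice to the next worker
        mySlice = ring.receiveFromPrevious();   // take over the previous worker's slice
      }
      return mySlice;
    }

    static void updateModel(double[][] data, double[] slice) {
      // placeholder for the real SGD / sampling update against this model slice
    }
  }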
Harp v. Spark
• Datasets: 5 million points, 10 thousand centroids, 10 feature dimensions
• 10 to 20 nodes of Intel KNL 7250 processors
• Harp-DAAL has 15x speedups over Spark MLlib
Harp v. Torch
• Datasets: 500K or 1 million data points of feature dimension 300
• Running on a single KNL 7250 (Harp-DAAL) vs. a single K80 GPU (PyTorch)
• Harp-DAAL achieves 3x to 6x speedups
Harp v. MPI
• Datasets: Twitter with 44 million vertices, 2 billion edges, subgraph templates of 10 to 12 vertices
• 25 nodes of Intel Xeon E5 2670
• Harp-DAAL has 2x to 5x speedups over the state-of-the-art MPI-Fascia solution
Mahout and SPIDAL
• Mahout was the Hadoop machine learning library but was largely abandoned as Spark outperformed Hadoop
• SPIDAL outperforms Spark MLlib and Flink due to better communication and better dataflow or BSP communication
• Has the Harp-(DAAL) optimized machine learning interface
• SPIDAL also has community algorithms
  • Biomolecular Simulation
  • Graphs for Network Science
  • Image processing for pathology and polar science
Qiu Core SPIDAL Parallel HPC Library, with Collectives Used
• DA-MDS: Rotate, AllReduce, Broadcast
• Directed Force Dimension Reduction: AllGather, AllReduce
• Irregular DAVS Clustering: Partial Rotate, AllReduce, Broadcast
• DA Semimetric Clustering (Deterministic Annealing): Rotate, AllReduce, Broadcast
• K-means: AllReduce, Broadcast, AllGather; DAAL
• SVM: AllReduce, AllGather
• SubGraph Mining: AllGather, AllReduce
• Latent Dirichlet Allocation: Rotate, AllReduce
• Matrix Factorization (SGD): Rotate; DAAL
• Recommender System (ALS): Rotate; DAAL
• Singular Value Decomposition (SVD): AllGather; DAAL
• QR Decomposition (QR): Reduce, Broadcast; DAAL
• Neural Network: AllReduce; DAAL
• Covariance: AllReduce; DAAL
• Low Order Moments: Reduce; DAAL
• Naive Bayes: Reduce; DAAL
• Linear Regression: Reduce; DAAL
• Ridge Regression: Reduce; DAAL
• Multi-class Logistic Regression: Regroup, Rotate, AllGather
• Random Forest: AllReduce
• Principal Component Analysis (PCA): AllReduce; DAAL
DAAL implies integration on node with the Intel DAAL Optimized Data Analytics Library
Implementing Twister2 in Detail I
This breaks the 2012-2017 rule of not "competing" with, but rather "enhancing", Apache
http://www.iterativemapreduce.org/
• Analyze the runtimes of existing systems
  • Hadoop, Spark, Flink, Pregel: Big Data processing
  • OpenWhisk and commercial FaaS
  • Storm, Heron, Apex: streaming dataflow
  • Kepler, Pegasus, NiFi: workflow systems
  • Harp Map-Collective; MPI and HPC AMT runtimes like DARMA
  • And approaches such as GridFTP and CORBA/HLA (!) for wide-area data links
• A lot of confusion comes from different communities (database, distributed computing, parallel computing, machine learning, computational/data science) investigating similar ideas with little knowledge exchange and mixed-up (unclear) requirements
Twister2: "Next Generation Grid - Edge - HPC Cloud" Programming Environment
http://www.iterativemapreduce.org/
• Harp-DAAL, with a kernel Machine Learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem. The broad applicability of Harp-DAAL comes from supporting all 5 classes of data-intensive computation, from pleasingly parallel to machine learning and simulations.
• Twister2 is a toolkit of components that can be packaged in different ways
  • Integrated batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink, but with high performance
  • Separate bulk synchronous and dataflow communication
  • Task management as in Mesos, Yarn and Kubernetes
  • Dataflow graph execution models
  • Launching of the Harp-DAAL library
  • Streaming and repository data access interfaces
  • In-memory databases and fault tolerance at dataflow nodes (use RDD to do classic checkpoint-restart)
Integrating HPC and Apache Programming Environments
Approach
• Clearly define and develop functional layers (using existing technology when possible)
• Develop layers as independent components
• Use interoperable common abstractions but multiple polymorphic implementations (see the sketch after this list)
• Allow users to pick and choose according to requirements such as
  • Communication + Data Management
  • Communication + Static graph
• Use HPC features when possible
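As an illustration of "common abstractions with multiple polymorphic implementations", a minimal Java sketch follows; the interface and class names are illustrative and are not Twister2's actual API.

  // Illustrative abstraction layer: one communication API, interchangeable backends.
  interface Communicator {
    double[] allreduce(double[] local);
  }

  final class LoopbackCommunicator implements Communicator {
    // trivial single-process backend, useful for local testing
    public double[] allreduce(double[] local) { return local.clone(); }
  }

  class ComponentSelection {
    // Users compose only the layers they need (e.g. Communication + Data Management),
    // choosing the backend by configuration rather than by rewriting the program.
    static Communicator forConfig(String backend) {
      switch (backend) {
        case "loopback": return new LoopbackCommunicator();
        // case "mpi": ...   // HPC path, omitted in this sketch
        // case "tcp": ...   // commodity cloud path, omitted in this sketch
        default: throw new IllegalArgumentException("unknown backend " + backend);
      }
    }
  }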
Twister2 Components I (Area / Component / Implementation / Comments: User API)

Architecture Specification
• Coordination Points: State and Configuration Management; Program, Data and Message Level. Comments: change execution mode; save and reset state
• Execution Semantics: Mapping of Resources to Bolts/Maps in Containers, Processes, Threads. Comments: different systems make different choices - why?
• Parallel Computing: Spark, Flink, Hadoop, Pregel, MPI modes. Comments: Owner Computes Rule

Job Submission
• (Dynamic/Static) Resource Allocation: Plugins for Slurm, Yarn, Mesos, Marathon, Aurora. Comments: Client API (e.g. Python) for Job Management

Task System (Comments: task-based programming with Dynamic or Static Graph API; FaaS API; support accelerators (CUDA, KNL))
• Task Migration: monitoring of tasks and migrating tasks for better resource utilization
• Elasticity: OpenWhisk
• Streaming and FaaS Events: Heron, OpenWhisk, Kafka/RabbitMQ
• Task Execution: Process, Threads, Queues
• Task Scheduling: Dynamic Scheduling, Static Scheduling, Pluggable Scheduling Algorithms
• Task Graph: Static Graph, Dynamic Graph Generation
Twister2 Components II (Area / Component / Implementation / Comments)

Communication API
• Messages: Heron. Comments: this is user level and could map to multiple communication systems
• Dataflow Communication: fine-grain Twister2 Dataflow communications (MPI, TCP and RMA); coarse-grain Dataflow from NiFi, Kepler? Comments: streaming, ETL data pipelines; define new Dataflow communication API and library
• BSP Communication Map-Collective: conventional MPI, Harp. Comments: MPI Point to Point and Collective API

Data Access (Comments: Data API)
• Static (Batch) Data: File Systems, NoSQL, SQL
• Streaming Data: Message Brokers, Spouts

Data Management
• Distributed Data Set: Relaxed Distributed Shared Memory (immutable data), Mutable Distributed Data. Comments: Data Transformation API; Spark RDD, Heron Streamlet

Fault Tolerance
• Check Pointing: upstream (streaming) backup; lightweight; Coordination Points; Spark/Flink, MPI and Heron models. Comments: streaming and batch cases are distinct; crosses all components

Security
• Storage, Messaging, Execution: research needed. Comments: crosses all components
Different applications at different layers (figure: e.g. Spark and Flink; Hadoop, Heron and Storm; or none)
Implementing Twister2 in Detail II
Look at Communication in detail
http://www.iterativemapreduce.org/
Communication Models
• MPI characteristics: tightly synchronized applications
  • Efficient communications (µs latency) with use of advanced hardware
  • In-place communications and computations (process scope for state)
• Basic dataflow: model a computation as a graph
  • Nodes do computations, with tasks as computations, and edges are asynchronous communications
  • A computation is activated when its input data dependencies are satisfied
• Streaming dataflow: Pub-Sub with data partitioned into streams
  • Streams are unbounded, ordered data tuples
  • Order of events is important, and data is grouped into time windows (see the sketch after this list)
• Machine Learning dataflow: iterative computations that keep track of state
  • There is both Model and Data, but typically only the model is communicated
  • Collective communication operations such as AllReduce and AllGather (no differential operators in Big Data problems)
  • Can use in-place MPI style communication
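A minimal, engine-neutral sketch of the time-window grouping mentioned above for streaming dataflow: ordered (timestamp, value) tuples are assigned to fixed tumbling windows keyed by window start time. This is plain Java, not a specific streaming engine's API.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.TreeMap;

  class TumblingWindows {
    record Event(long timestampMillis, double value) {}

    // Assign each event of an ordered, unbounded stream to a fixed (tumbling) time window;
    // the window start time is the key, so downstream operators can reduce per window.
    static TreeMap<Long, List<Event>> assign(List<Event> orderedEvents, long windowMillis) {
      TreeMap<Long, List<Event>> windows = new TreeMap<>();
      for (Event e : orderedEvents) {
        long windowStart = (e.timestampMillis() / windowMillis) * windowMillis;
        windows.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(e);
      }
      return windows;
    }
  }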
(Figure: dataflow graph with source (S), worker (W) and gather (G) tasks connected by asynchronous edges)
Twister2 Dataflow Communications
• Twister:Net offers two communication models
  • BSP (Bulk Synchronous Processing) communication using TCP or MPI, separated from its task management, plus extra Harp collectives
  • plus a new Dataflow library, DFW, built using MPI software but at the data-movement, not message, level
• Non-blocking
• Dynamic data sizes
• Streaming model
  • The batch case is modeled as a finite stream
• The communications are between a set of tasks in an arbitrary task graph
• Key-based communications
• Communications spilling to disk
• Target tasks can be different from source tasks (see the routing sketch below)
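A minimal sketch of the keyed routing behind such a partition operation; it is conceptual only (the real DFW operators are configured through Twister2's communication API). A record's key selects a target task, and the target task set need not overlap the source task set.

  import java.util.List;

  class KeyedPartitionSketch {
    // Route a record to a target task chosen by its key; source and target tasks
    // are separate sets, so the chosen task may live on a different worker.
    static int targetFor(String key, List<Integer> targetTaskIds) {
      int index = Math.floorMod(key.hashCode(), targetTaskIds.size());
      return targetTaskIds.get(index);
    }
  }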
Twister:Net
• Communication operators are stateful
  • They buffer data
  • Handle imbalanced, dynamically sized communications
  • Act as a combiner
• Thread safe
• Initialization: MPI or TCP/ZooKeeper
• Buffer management: the messages are serialized by the library
• Back-pressure: uses flow control by the underlying channel
Architecture: optimized operations vs basic (Flink, Heron)
• Operators: Reduce, Gather, Partition, Broadcast, AllReduce, AllGather, Keyed-Partition, Keyed-Reduce, KeyedGather
• Batch and streaming versions of the above are currently available
Bandwidth & Latency Kernel: latency and bandwidth between two tasks running in two nodes
• Latency of MPI and Twister:Net with different message sizes on a two-node setup
• Bandwidth utilization of Flink, Twister2 and OpenMPI over 1Gbps, 10Gbps and InfiniBand, with Flink on IPoIB
• Latency for Reduce and Gather operations on 32 nodes with 256-way parallelism. The time is for 1 million messages in each parallel unit, with the given message size. For the BSP-Object case we do two MPI calls, with MPI AllReduce/MPI AllGather first to get the lengths of the messages and then the actual call. The InfiniBand network is used.
• Total time for Flink and Twister:Net for Reduce and Partition operations on 32 nodes with 640-way parallelism. The time is for 1 million messages in each parallel unit, with the given message size.
Flink, BSP and DFW Performance
K-Means algorithm performance with AllReduce communication
• Left: K-means job execution time on 16 nodes with varying numbers of centers; 2 million points with 320-way parallelism. Right: K-Means with 4, 8 and 16 nodes, each node having 20 tasks; 2 million points with 16000 centers.
Sorting records
• Left: Terasort time on a 16 node cluster with 384-way parallelism; BSP and DFW show the communication time. Right: Terasort on 32 nodes with 0.5TB and 1TB datasets and parallelism of 320. Right: 16 node cluster (Victor); left: 32 node cluster (Juliet) with InfiniBand.
• Partition the data using a sample and regroup
• For the DFW case, a single node can get congested if many processes send messages simultaneously.
• The BSP algorithm waits for others to send messages in a ring topology and can be inefficient compared to the DFW case, where processes do not wait.
Twister:Net and Apache Heron for Streaming
• Latency of Apache Heron and Twister:Net DFW (Dataflow) for Reduce, Broadcast and Partition operations on 16 nodes with 256-way parallelism
Robot Algorithms
• Robot with a Laser Range Finder; map built from robot data. Robots need to avoid collisions when they move.
• N-Body Collision Avoidance and SLAM (Simultaneous Localization and Mapping)
• Streaming workflow (figure): message brokers (RabbitMQ, Kafka), a gateway sending to pub-sub and persisting to storage, and a stream application with some tasks running in parallel; multiple streaming workflows
• Streaming SLAM algorithm on Apache Storm, hosted in the FutureSystems OpenStack cloud, which is accessible through the IU network
• End-to-end delays without any processing are less than 10 ms
• Rao-Blackwellized particle filter based SLAM
Performance of SLAM: Storm v. Twister2
(Figures: speedup of the Storm and Twister2 SLAM implementations for 180 and 640 laser readings)
Implementing Twister2 in Detail III
State
http://www.iterativemapreduce.org/
Resource Allocation
• Job Submission & Management: twister2 submit
• Resource Managers: Slurm, Nomad, Kubernetes, Mesos
• It takes around 5 seconds to initialize a worker in Kubernetes
• It takes around 3 seconds to initialize a worker in Mesos
• When 3 workers are deployed in one executor or pod, initialization times are faster in both systems
Kubernetes and Mesos Worker Initialization Times
(Figure: worker start times in seconds for Kubernetes and Mesos with 3, 9, 18, 36 and 54 total workers, comparing 3 workers per pod/executor against 1 worker per pod/executor)
Task System
• Generate the computation graph dynamically
  • Dynamic scheduling of tasks
  • Allows fine-grained control of the graph
• Generate the computation graph statically (see the sketch after this list)
  • Dynamic or static scheduling
  • Suitable for streaming and data query applications
  • Hard to express complex computations, especially with loops
• Hybrid approach
  • Combine both static and dynamic graphs
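A minimal sketch of a statically generated task graph, with illustrative builder names rather than Twister2's actual task graph API: tasks and their parallelism are declared up front, and edges name the communication operation between them, so a scheduler can place everything before execution.

  import java.util.ArrayList;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;

  class TaskGraphSketch {
    // Vertices are tasks with a parallelism; edges are communications between tasks.
    final Map<String, Integer> parallelism = new LinkedHashMap<>();
    final List<String[]> edges = new ArrayList<>();   // {source, target, operation}

    TaskGraphSketch addTask(String name, int par) { parallelism.put(name, par); return this; }
    TaskGraphSketch connect(String from, String to, String op) {
      edges.add(new String[]{from, to, op}); return this;
    }

    static TaskGraphSketch wordCountLike() {
      // Built once before execution, so the scheduler can place tasks statically.
      return new TaskGraphSketch()
          .addTask("source", 4)
          .addTask("map", 4)
          .addTask("reduce", 2)
          .connect("source", "map", "partition")
          .connect("map", "reduce", "keyed-reduce");
    }
  }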
(Figure: user-defined operators connected by communication links)
Task Graph Execution
(Figure: the User Graph is turned by the Scheduler into a Scheduler Plan, combined with the Worker Plan, and passed over the network to the Execution Planner and Executor, producing an Execution Plan)
• Task Scheduler is pluggable
• Executor is pluggable
• Scheduler runs on all the workers
Scheduling Algorithms
• Streaming: round robin, first fit (a round-robin sketch follows this list)
• Batch: data locality aware
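A minimal sketch of the round-robin policy used for streaming graphs: task instances are dealt out to workers in turn so each worker gets an even share. Names are illustrative.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  class RoundRobinScheduler {
    // Assign task instances to workers in turn: instance i goes to worker i % numWorkers.
    static Map<Integer, List<String>> schedule(List<String> taskInstances, int numWorkers) {
      Map<Integer, List<String>> plan = new HashMap<>();
      for (int w = 0; w < numWorkers; w++) plan.put(w, new ArrayList<>());
      for (int i = 0; i < taskInstances.size(); i++) {
        plan.get(i % numWorkers).add(taskInstances.get(i));
      }
      return plan;
    }
  }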
Dataflow at Different Grain Sizes
(Figure: internal execution dataflow nodes with Maps, Reduce and Iterate connected by HPC communication)
• Coarse-grain dataflow links jobs in a pipeline such as: data preparation, clustering, dimension reduction, visualization
• But internally to each job you can also elegantly express the algorithm as dataflow, although with more stringent performance constraints
• P = loadPoints()
• C = loadInitCenters()
• for (int i = 0; i < 10; i++) {
•   T = P.map().withBroadcast(C)
•   C = T.reduce() }
• Iterate
Corresponding to the classic Spark K-means dataflow
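The same loop written against the Spark Java API, as a hedged sketch (the pseudocode above is schematic; this version spells out the broadcast of C, the reduce that computes the new centers, and the collect that returns them to the driver each iteration).

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.broadcast.Broadcast;
  import scala.Tuple2;
  import java.util.ArrayList;
  import java.util.List;

  class SparkKMeansSketch {
    // One lineage per iteration: broadcast C, map points to (nearestCenter, (point, 1)),
    // reduce by key to sums and counts, and collect the new centers back to the driver.
    static List<double[]> run(JavaSparkContext sc, JavaRDD<double[]> points,
                              List<double[]> centers, int iterations) {
      for (int i = 0; i < iterations; i++) {
        Broadcast<List<double[]>> bc = sc.broadcast(centers);
        centers = new ArrayList<>(points
            .mapToPair(p -> new Tuple2<>(nearest(p, bc.value()), new Tuple2<>(p, 1L)))
            .reduceByKey((a, b) -> new Tuple2<>(add(a._1, b._1), a._2 + b._2))
            .mapValues(sumAndCount -> scale(sumAndCount._1, 1.0 / sumAndCount._2))
            .values()
            .collect());
      }
      return centers;
    }

    static int nearest(double[] p, List<double[]> centers) {
      int best = 0; double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centers.size(); c++) {
        double d = 0;
        for (int j = 0; j < p.length; j++) { double diff = p[j] - centers.get(c)[j]; d += diff * diff; }
        if (d < bestDist) { bestDist = d; best = c; }
      }
      return best;
    }
    static double[] add(double[] a, double[] b) {
      double[] r = new double[a.length];
      for (int j = 0; j < a.length; j++) r[j] = a[j] + b[j];
      return r;
    }
    static double[] scale(double[] a, double s) {
      double[] r = new double[a.length];
      for (int j = 0; j < a.length; j++) r[j] = a[j] * s;
      return r;
    }
  }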
Workflow vs Dataflow: different grain sizes and different performance trade-offs
• Workflow controlled by a workflow engine or a script; dataflow application running as a single job
• The dataflow can expand from Edge to Cloud
NiFi Workflow
Flink MDS Dataflow Graph
Systems State: Spark K-means Dataflow
• P = loadPoints()
• C = loadInitCenters()
• for (int i = 0; i < 10; i++) {
•   T = P.map().withBroadcast(C)
•   C = T.reduce() }   Save state at coordination point: store C in an RDD
• State is handled differently in different systems
  • CORBA, AMT, MPI and Storm/Heron have long-running tasks that preserve state
  • Spark and Flink preserve datasets across dataflow nodes using in-memory databases
• All systems agree on coarse-grain dataflow; they only keep state by exchanging data
Fault Tolerance and State
• A similar form of check-pointing mechanism is already used in both HPC and Big Data
  • although in HPC it is informal, as HPC doesn't typically specify a dataflow graph
  • Flink and Spark do better than MPI due to the use of database technologies; MPI is a bit harder due to richer state, but there is an obvious integrated model using RDD-type snapshots of MPI-style jobs
• Checkpoint after each stage of the dataflow graph (at the location of intelligent dataflow nodes)
  • Natural synchronization point
  • Let's allow the user to choose when to checkpoint (not every stage)
  • Save state as the user specifies; Spark just saves the model state, which is insufficient for complex algorithms (see the sketch below)
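A minimal sketch of user-chosen checkpointing at coordination points: state is persisted only at the stages the user marks, and the saved state is whatever the user supplies rather than just the model. Plain Java serialization to files stands in for the real storage layer.

  import java.io.IOException;
  import java.io.ObjectInputStream;
  import java.io.ObjectOutputStream;
  import java.io.Serializable;
  import java.nio.file.Files;
  import java.nio.file.Path;

  class CoordinationCheckpoint {
    // Persist user-specified state at a coordination point; the user decides at which
    // stages this is called, rather than checkpointing after every stage.
    static void save(int stage, Serializable userState, Path dir) throws IOException {
      Path file = dir.resolve("stage-" + stage + ".ckpt");
      try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
        out.writeObject(userState);
      }
    }

    // On restart, resume from the coordination point the user last saved.
    static Object restore(int stage, Path dir) throws IOException, ClassNotFoundException {
      Path file = dir.resolve("stage-" + stage + ".ckpt");
      try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
        return in.readObject();
      }
    }
  }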
Implementing Twister2: Futures
http://www.iterativemapreduce.org/
Twister2 Timeline: End of August 2018
• Twister:Net Dataflow Communication API
  • Dataflow communications with MPI or TCP
• Harp for Machine Learning (custom BSP communications)
  • Rich collectives
  • Around 30 ML algorithms
• HDFS integration
• Task graph
  • Streaming: Storm model
  • Batch analytics: Hadoop
• Deployments on Docker, Kubernetes, Mesos (Aurora), Nomad, Slurm
Twister2 Timeline: End of December 2018
• Native MPI integration with Mesos, Yarn
• Naiad-model based task system for Machine Learning
• Link to Pilot Jobs
• Fault tolerance
  • Streaming
  • Batch
• Hierarchical dataflows with Streaming, Machine Learning and Batch integrated seamlessly
• Data abstractions for streaming and batch (Streamlets, RDD)
• Workflow graphs (Kepler, Spark) with linkage defined by Data Abstractions (RDD)
• End-to-end applications
Twister2 Timeline: After December 2018
• Dynamic task migrations
• RDMA and other communication enhancements
• Integrate parts of Twister2 components as big data systems enhancements (i.e. run current Big Data software invoking Twister2 components)
  • Heron (easiest), Spark, Flink, Hadoop (like Harp today)
• Support different APIs (i.e. run Twister2 looking like current Big Data software)
  • Hadoop
  • Spark (Flink)
  • Storm
• Refinements like Marathon with Mesos etc.
• Function as a Service and Serverless
• Support higher level abstractions
  • Twister:SQL
Summary of Twister2: Next Generation HPC Cloud + Edge + Grid
• We have built a high performance data analysis library, SPIDAL
• We have integrated HPC into many Apache systems with HPC-ABDS, with a rich set of collectives
• We have done a preliminary analysis of the different runtimes of Hadoop, Spark, Flink, Storm, Heron, Naiad and DARMA (HPC Asynchronous Many Task) and identified key components
• There are different technologies for different circumstances, but they can be unified by high level abstractions such as communication/data/task APIs
• Apache systems use dataflow communication, which is natural for distributed systems but slower for classic parallel computing
• There is no standard dataflow library (why?). Add dataflow primitives in MPI-4?
• HPC could adopt some of the tools of Big Data, as in Coordination Points (dataflow nodes) and state management (fault tolerance) with RDD (datasets)
• Could integrate dataflow and workflow in a cleaner fashion
• It is not clear that so many big data and resource management approaches are needed