Twister2: A High-Performance Big Data Programming Environment


Geoffrey Fox, May 21, 2018
Judy Qiu, Supun Kamburugamuve - Department of Intelligent Systems Engineering
Work with Shantenu Jha, Kannan Govindarajan, Pulasthi Wickramasinghe, Gurhan Gunduz, Ahmet Uyar
gcf@indiana.edu, http://www.dsc.soic.indiana.edu/, http://spidal.org/

HPBDC 2018: The 4th IEEE International Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing

Abstract
• We analyse the components that are needed in programming environments for Big Data analysis systems with scalable HPC performance and the functionality of ABDS - the Apache Big Data Software Stack.
• One highlight is Harp-DAAL, a machine learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem.
• Another highlight is Twister2, which consists of a set of middleware components to support batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink but with high performance.
• Twister2 covers bulk synchronous and dataflow communication; task management as in Mesos, Yarn and Kubernetes; dataflow graph execution models; launching of the Harp-DAAL library; streaming and repository data access interfaces; in-memory databases; and fault tolerance at dataflow nodes.
• Similar capabilities are available in current Apache systems, but as integrated packages that do not allow the customization needed for different application scenarios.


Requirements
• On general principles, parallel and distributed computing have different requirements even if they sometimes offer similar functionality.
• The Apache stack ABDS typically uses distributed computing concepts.
  • For example, the Reduce operation is different in MPI (Harp) and Spark (see the sketch after this slide).
• Large-scale simulation requirements are well understood.
• Big Data requirements are not agreed, but there are a few key use types:
  1) Pleasingly parallel processing (including local machine learning, LML), as of different tweets from different users, with perhaps a MapReduce style of statistics and visualizations; possibly streaming
  2) Database model with queries, again supported by MapReduce for horizontal scaling
  3) Global machine learning (GML), with a single job using multiple nodes as in classic parallel computing
  4) Deep learning, which certainly needs HPC - possibly only multiple small systems
• Current workloads stress 1) and 2) and are suited to current clouds and to Apache Big Data Software (with no HPC).
• This explains why Spark, with poor GML performance, can be so successful.
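The Reduce difference called out above can be made concrete. Below is a minimal sketch, assuming Spark is on the classpath and run in local mode, of the Spark-style reduce (a dataflow action whose result returns to the driver); the closing comment contrasts it with the in-place MPI/Harp allreduce style. Class and variable names are illustrative only.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative only: contrasts the dataflow-style reduce of Spark with the
// in-place collective style of MPI/Harp mentioned on this slide.
public class ReduceSemantics {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("reduce-demo").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
      // Spark: reduce is an action on an immutable RDD; the result is
      // shipped back to the driver, and workers hold no persistent state.
      int sum = data.reduce(Integer::sum);
      System.out.println("driver-side sum = " + sum);
    }
    // MPI/Harp: an allreduce is executed by long-running processes that each
    // hold part of the data; the combined result is left in place on every
    // rank, so iterative algorithms can reuse it without a driver round-trip.
  }
}
```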


Spectrum of Applications and Algorithms

[Figure: applications arranged by difficulty in parallelism and size of synchronization constraints (there is also the distribution seen in grid/edge computing), from loosely coupled to tightly coupled:
• Pleasingly Parallel - often independent events
• MapReduce as in scalable databases - the current major Big Data category and classic cloud workload, on commodity clouds
• Global Machine Learning, e.g. parallel clustering
• Deep Learning
• Graph Analytics, e.g. subgraph mining; LDA; linear algebra at the core (typically not sparse) - on HPC clouds/supercomputers, where memory access is also critical
• Large-scale simulations and parameter sweep simulations - on HPC clouds with high-performance interconnects and exascale supercomputers
Other axes: structured adaptive sparsity (huge jobs) vs. unstructured adaptive sparsity (medium-size jobs), and size of disk I/O.]

Need a toolkit covering all applications with the same API but different implementations. Three of these categories are the focus of Twister2, but we need to preserve capability on the first two paradigms.

Note: efficient execution says that the Problem and the System Architecture must match. Need a toolkit covering the 5 main paradigms with the same API but different implementations.

Comparing Spark, Flink and MPI
• On Global Machine Learning (GML)

Machine Learning with MPI, Spark and Flink
• Three algorithms implemented in three runtimes:
  • Multidimensional Scaling (MDS)
  • Terasort
  • K-Means (dropped as no time; looked at later)
• Implementation in Java
  • MDS is the most complex algorithm - three nested parallel loops
  • K-Means - one parallel loop
  • Terasort - no iterations
• With care, Java performance ~ C performance
• Without care, Java performance << C performance (details omitted)


Multidimensional Scaling: 3 Nested Parallel Sections

[Charts: MDS execution time on 16 nodes with 20 processes in each node with a varying number of points; MDS execution time with 32,000 points on a varying number of nodes, each node running 20 parallel tasks. Curves shown for Flink, Spark and MPI.]

• Spark and Flink show no speedup.
• MPI is a factor of 20-200 faster than Spark/Flink.
• K-means is also bad - see later.

Terasort

Sorting 1TB of data records.

[Chart: Terasort execution time on 64 and 32 nodes. Only MPI shows the sorting time and communication time separately, as the other two frameworks do not provide a clear method to measure them accurately. Sorting time includes data save time. MPI-IB is MPI with InfiniBand.]

Partition the data using a sample and regroup (see the sketch below).
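A minimal sketch of the sample-and-regroup partitioning step named above: splitter keys are taken from a small random sample, and each record is routed to the task whose key range contains it. The class and parameter choices are illustrative, not taken from any of the benchmarked codes.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch of "partition the data using a sample and regroup":
// pick splitter keys from a small random sample, then route each key to the
// task whose range contains it (TeraSort-style range partitioning).
public class SamplePartitioner {
  private final long[] splitters;          // numTasks - 1 boundary keys

  SamplePartitioner(long[] sampleKeys, int numTasks) {
    Arrays.sort(sampleKeys);
    splitters = new long[numTasks - 1];
    for (int i = 0; i < splitters.length; i++) {
      splitters[i] = sampleKeys[(i + 1) * sampleKeys.length / numTasks];
    }
  }

  int taskFor(long key) {                  // which target task receives this key
    int pos = Arrays.binarySearch(splitters, key);
    return pos >= 0 ? pos + 1 : -(pos + 1);
  }

  public static void main(String[] args) {
    long[] sample = new Random(42).longs(10_000).toArray();
    SamplePartitioner p = new SamplePartitioner(sample, 32);
    System.out.println("key 0 goes to task " + p.taskFor(0L));
  }
}
```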

Software: HPC-ABDS and HPC-FaaS

NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
• Ogres Application Analysis
• HPC-ABDS and HPC-FaaS Software: Harp and Twister2 Building Blocks
• SPIDAL Data Analytics Library


Software: MIDAS HPC-ABDS

HPC-ABDS integrated a wide range of HPC and Big Data technologies. I gave up updating the list in January 2016!


Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies

Cross-Cutting Functions 1) Message and Data Protocols: Avro, Thrift, Protobuf 2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups 3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth 4) Monitoring: Ambari, Ganglia, Nagios, Inca

17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML 16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere 15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird 14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem 13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs 12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store 12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC 12) Extraction Tools: UIMA, Tika 11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL 11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame Public Cloud: Azure Table, Amazon Dynamo, Google DataStore 11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs 8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, 
FUSE, Gluster, Lustre, GPFS, GFFS Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage 7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis 6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api 5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds Networking: Google Cloud DNS, Amazon Route 53

21 layers, over 350 software packages (January 29, 2016)

Different choices in software systems in Clouds and HPC. HPC-ABDS takes cloud software augmented by HPC when needed to improve performance: 16 of 21 layers plus languages.


Harp Plugin for Hadoop: Important Part of Twister2

Work of Judy Qiu

The Map-Collective runtime merges MapReduce and HPC.

[Figure: Harp collective and communication operations - allreduce, reduce, rotate, push & pull, allgather, regroup, broadcast.]

Runtime software for Harp

Dynamic Rotation Control for Latent Dirichlet Allocation and Matrix Factorization SGD (stochastic gradient descent)

[Figure: multi-threaded execution over model-related data; model parameters come from rotation and other model parameters from caching. Each worker computes until the time arrives, then starts model rotation to address load imbalance. A sketch of this rotation follows.]
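A minimal sketch of the timed model rotation described above, using hypothetical plain-Java classes rather than the Harp API: each worker updates its local model slice until the rotation time arrives, then the slices shift around a ring. In a real run the workers are concurrent processes; here they are simulated in turn.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch (not the Harp API) of dynamic rotation control: compute
// on the local model slice until the rotation time arrives, then rotate the
// slices so every worker eventually sees every slice.
public class ModelRotation {
  public static void main(String[] args) {
    int workers = 4;
    long rotateEveryMs = 10;                        // the rotation timer from the slide
    List<double[]> modelSlices = new ArrayList<>();
    for (int i = 0; i < workers; i++) modelSlices.add(new double[1024]);

    for (int round = 0; round < workers; round++) {
      for (int w = 0; w < workers; w++) {
        long deadline = System.currentTimeMillis() + rotateEveryMs;
        double[] slice = modelSlices.get(w);
        // Compute until the time arrives (stand-in for SGD/LDA model updates);
        // bounding by time rather than by work addresses load imbalance.
        while (System.currentTimeMillis() < deadline) {
          slice[0] += 1e-6;
        }
      }
      // Rotate: worker w's slice moves on to worker (w + 1) % workers.
      Collections.rotate(modelSlices, 1);
    }
    System.out.println("finished " + workers + " rotation rounds");
  }
}
```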

Harp v. Spark, Harp v. Torch, Harp v. MPI

Harp v. Spark:
• Datasets: 5 million points, 10 thousand centroids, 10 feature dimensions
• 10 to 20 nodes of Intel KNL 7250 processors
• Harp-DAAL has 15x speedups over Spark MLlib

Harp v. Torch:
• Datasets: 500K or 1 million data points of feature dimension 300
• Running on a single KNL 7250 (Harp-DAAL) vs. a single K80 GPU (PyTorch)
• Harp-DAAL achieves 3x to 6x speedups

Harp v. MPI:
• Datasets: Twitter with 44 million vertices, 2 billion edges, subgraph templates of 10 to 12 vertices
• 25 nodes of Intel Xeon E5-2670
• Harp-DAAL has 2x to 5x speedups over the state-of-the-art MPI-Fascia solution


Mahout and SPIDAL
• Mahout was the Hadoop machine learning library but was largely abandoned as Spark outperformed Hadoop.
• SPIDAL outperforms Spark MLlib and Flink due to better communication and better dataflow or BSP communication.
• It has the Harp-(DAAL) optimized machine learning interface.
• SPIDAL also has community algorithms:
  • Biomolecular simulation
  • Graphs for network science
  • Image processing for pathology and polar science


Qiu Core SPIDAL Parallel HPC Library with Collectives Used

• DA-MDS: Rotate, AllReduce, Broadcast
• Directed Force Dimension Reduction: AllGather, AllReduce
• Irregular DAVS Clustering: Partial Rotate, AllReduce, Broadcast
• DA Semimetric Clustering (Deterministic Annealing): Rotate, AllReduce, Broadcast
• K-means: AllReduce, Broadcast, AllGather (DAAL)
• SVM: AllReduce, AllGather
• SubGraph Mining: AllGather, AllReduce
• Latent Dirichlet Allocation: Rotate, AllReduce
• Matrix Factorization (SGD): Rotate (DAAL)
• Recommender System (ALS): Rotate (DAAL)
• Singular Value Decomposition (SVD): AllGather (DAAL)
• QR Decomposition (QR): Reduce, Broadcast (DAAL)
• Neural Network: AllReduce (DAAL)
• Covariance: AllReduce (DAAL)
• Low Order Moments: Reduce (DAAL)
• Naive Bayes: Reduce (DAAL)
• Linear Regression: Reduce (DAAL)
• Ridge Regression: Reduce (DAAL)
• Multi-class Logistic Regression: Regroup, Rotate, AllGather
• Random Forest: AllReduce
• Principal Component Analysis (PCA): AllReduce (DAAL)

DAAL implies integrated on node with the Intel DAAL Optimized Data Analytics Library.

Implementing Twister2 in detail I
http://www.iterativemapreduce.org/

This breaks the 2012-2017 rule of not "competing" with but rather "enhancing" Apache.

• Analyze the runtimes of existing systems:
  • Hadoop, Spark, Flink, Pregel - Big Data processing
  • OpenWhisk and commercial FaaS
  • Storm, Heron, Apex - streaming dataflow
  • Kepler, Pegasus, NiFi - workflow systems
  • Harp Map-Collective, MPI and HPC AMT runtimes like DARMA
  • And approaches such as GridFTP and CORBA/HLA (!) for wide-area data links
• A lot of confusion comes from different communities (database, distributed and parallel computing, machine learning, computational/data science) investigating similar ideas with little knowledge exchange and mixed up (unclear) requirements.

Twister2: "Next Generation Grid-Edge-HPC Cloud" Programming Environment
http://www.iterativemapreduce.org/

• Harp-DAAL, with a kernel machine learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem. The broad applicability of Harp-DAAL supports all 5 classes of data-intensive computation, from pleasingly parallel to machine learning and simulations.
• Twister2 is a toolkit of components that can be packaged in different ways:
  • Integrated batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink but with high performance
  • Separate bulk synchronous and dataflow communication
  • Task management as in Mesos, Yarn and Kubernetes
  • Dataflow graph execution models
  • Launching of the Harp-DAAL library
  • Streaming and repository data access interfaces
  • In-memory databases and fault tolerance at dataflow nodes (use RDD to do classic checkpoint-restart)

Integrating HPC and Apache Programming Environments

Approach
• Clearly define and develop functional layers (using existing technology when possible)
• Develop layers as independent components
• Use interoperable common abstractions but multiple polymorphic implementations (see the sketch below)
• Allow users to pick and choose according to requirements such as
  • Communication + Data Management
  • Communication + Static graph
• Use HPC features when possible
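A small sketch of the "common abstraction, multiple polymorphic implementations" idea, using hypothetical interfaces that are not the actual Twister2 API: the same Channel abstraction has TCP-backed and MPI-backed implementations, and the user picks one per deployment.

```java
// Hypothetical sketch (not the actual Twister2 API) of the layering approach:
// one communication abstraction, several interchangeable implementations,
// chosen by the application rather than fixed by an integrated stack.
interface Channel {
  void send(int destination, byte[] message);
}

final class TcpChannel implements Channel {
  @Override public void send(int destination, byte[] message) {
    // ... plain sockets; works anywhere, no special hardware assumed
  }
}

final class MpiChannel implements Channel {
  @Override public void send(int destination, byte[] message) {
    // ... delegates to an MPI point-to-point call when an HPC fabric is available
  }
}

final class JobConfig {
  // Users pick and choose the layers they need (e.g. communication + data
  // management, or communication + static task graph) at assembly time.
  static Channel channelFor(String env) {
    return "hpc".equals(env) ? new MpiChannel() : new TcpChannel();
  }
}
```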


Twister2 Components I

Area: Architecture Specification (the Comments column describes the user API)
• Coordination Points - implementation: State and Configuration Management at Program, Data and Message level; user API: change execution mode, save and reset state
• Execution Semantics - implementation: mapping of resources to Bolts/Maps in containers, processes, threads; comment: different systems make different choices - why?
• Parallel Computing - implementation: Spark, Flink, Hadoop, Pregel, MPI modes; comment: Owner Computes Rule

Area: Job Submission
• (Dynamic/Static) Resource Allocation - implementation: plugins for Slurm, Yarn, Mesos, Marathon, Aurora; user API: client API (e.g. Python) for job management

Area: Task System (user API: task-based programming with a Dynamic or Static Graph API; FaaS API; support for accelerators (CUDA, KNL))
• Task migration - monitoring of tasks and migrating tasks for better resource utilization
• Elasticity - OpenWhisk
• Streaming and FaaS Events - Heron, OpenWhisk, Kafka/RabbitMQ
• Task Execution - process, threads, queues
• Task Scheduling - dynamic scheduling, static scheduling, pluggable scheduling algorithms
• Task Graph - static graph, dynamic graph generation

Twister2 Components II

Area: Communication API
• Messages - Heron; this is user level and could map to multiple communication systems
• Dataflow Communication - fine-grain Twister2 dataflow communications over MPI, TCP and RMA (coarse-grain dataflow from NiFi, Kepler?); streaming and ETL data pipelines; defines a new dataflow communication API and library
• BSP Communication (Map-Collective) - conventional MPI, Harp; MPI point-to-point and collective API

Area: Data Access and Data API
• Static (Batch) Data - file systems, NoSQL, SQL
• Streaming Data - message brokers, spouts

Area: Data Management
• Distributed Data Set - relaxed distributed shared memory (immutable data), mutable distributed data; Data Transformation API; Spark RDD, Heron Streamlet

Area: Fault Tolerance
• Check Pointing - upstream (streaming) backup; lightweight; coordination points; Spark/Flink, MPI and Heron models; streaming and batch cases are distinct; crosses all components

Area: Security
• Storage, messaging, execution - research needed; crosses all components

Different applications at different layers

[Figure: which systems sit at each layer - Spark, Flink; Hadoop, Heron, Storm; none.]

Implementing Twister2 in detail II
Look at Communication in detail
http://www.iterativemapreduce.org/

Communication Models
• MPI characteristics: tightly synchronized applications
  • Efficient communications (µs latency) with use of advanced hardware
  • In-place communications and computations (process scope for state)
• Basic dataflow: model a computation as a graph
  • Nodes do computations, with tasks as the computations, and edges are asynchronous communications
  • A computation is activated when its input data dependencies are satisfied (see the sketch after this slide)
• Streaming dataflow: pub-sub with data partitioned into streams
  • Streams are unbounded, ordered data tuples
  • Order of events is important, and data is grouped into time windows
• Machine learning dataflow: iterative computations that keep track of state
  • There is both model and data, but typically only the model is communicated
  • Collective communication operations such as AllReduce and AllGather (no differential operators in Big Data problems)
  • Can use in-place MPI-style communication
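A minimal, generic-Java sketch (not Twister2 code) of the dataflow activation rule stated above: a node buffers asynchronously arriving inputs and fires its task only once all input dependencies are satisfied.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Illustrative dataflow node (not Twister2 code): edges deliver messages
// asynchronously, and the node's computation is activated only when all of
// its input dependencies are satisfied.
public final class DataflowNode {
  private final int expectedInputs;
  private final Map<String, Object> inputs = new ConcurrentHashMap<>();
  private final Consumer<Map<String, Object>> task;

  DataflowNode(int expectedInputs, Consumer<Map<String, Object>> task) {
    this.expectedInputs = expectedInputs;
    this.task = task;
  }

  // Called by upstream edges; messages may arrive from any thread, in any order.
  void onInput(String edge, Object value) {
    inputs.put(edge, value);
    if (inputs.size() == expectedInputs) {
      task.accept(inputs);                 // all dependencies satisfied: fire
    }
  }

  public static void main(String[] args) {
    DataflowNode join = new DataflowNode(2,
        in -> System.out.println("activated with " + in.keySet()));
    join.onInput("left", 1);               // not yet activated
    join.onInput("right", 2);              // second dependency arrives: fires
  }
}
```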

[Dataflow figure: tasks labeled S, W and G connected by dataflow edges.]

Twister2 Dataflow Communications
• Twister:Net offers two communication models:
  • BSP (Bulk Synchronous Processing) communication using TCP or MPI, separated from its task management, plus extra Harp collectives
  • plus a new dataflow library DFW, built using MPI software but at the data-movement rather than the message level:
    • Non-blocking
    • Dynamic data sizes
    • Streaming model
    • The batch case is modeled as a finite stream
    • The communications are between a set of tasks in an arbitrary task graph
    • Key-based communications
    • Communications spilling to disk (see the sketch after this slide)
    • Target tasks can be different from source tasks
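A hypothetical sketch, not the actual DFW API, of the keyed communication with disk spilling listed above: values are buffered per key, and a key's buffer is written to a temporary file once it passes a threshold, so the operation is not limited by memory.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not the DFW API): a keyed, in-memory buffer used by a
// dataflow operation that spills a key's values to disk once the buffer grows
// past a threshold, so communication size is not limited by memory.
final class SpillingKeyedBuffer {
  private static final int SPILL_THRESHOLD = 10_000;
  private final Map<String, List<byte[]>> buffer = new HashMap<>();
  private final Path spillDir;

  SpillingKeyedBuffer(Path spillDir) { this.spillDir = spillDir; }

  void add(String key, byte[] value) throws IOException {
    List<byte[]> values = buffer.computeIfAbsent(key, k -> new ArrayList<>());
    values.add(value);
    if (values.size() >= SPILL_THRESHOLD) {
      spill(key, values);
      values.clear();
    }
  }

  private void spill(String key, List<byte[]> values) throws IOException {
    Path file = Files.createTempFile(spillDir, key + "-", ".spill");
    try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
      out.writeObject(new ArrayList<>(values));   // later merged by the target task
    }
  }
}
```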


Twister:Net

Architecture
• Communication operators are stateful:
  • They buffer data
  • Handle imbalanced, dynamically sized communications
  • Act as a combiner
• Thread safe
• Initialization
  • MPI
  • TCP/ZooKeeper
• Buffer management
  • The messages are serialized by the library
• Back-pressure
  • Uses flow control by the underlying channel

Optimized operations vs. basic (Flink, Heron):
• Reduce, Gather, Partition, Broadcast
• AllReduce, AllGather, Keyed-Partition
• Keyed-Reduce, Keyed-Gather
Batch and streaming versions of the above are currently available.

Bandwidth & Latency Kernel: latency and bandwidth between two tasks running in two nodes.

[Charts: Latency of MPI and Twister:Net with different message sizes on a two-node setup. Bandwidth utilization of Flink, Twister2 and OpenMPI over 1Gbps, 10Gbps and InfiniBand, with Flink on IPoIB.]

[Charts: Latency for Reduce and Gather operations on 32 nodes with 256-way parallelism; the time is for 1 million messages in each parallel unit, with the given message size. For the BSP-Object case we do two MPI calls, with MPI_AllReduce/MPI_AllGather first to get the lengths of the messages and then the actual call; an InfiniBand network is used. Total time for Flink and Twister:Net for Reduce and Partition operations on 32 nodes with 640-way parallelism; the time is for 1 million messages in each parallel unit, with the given message size.]

Flink, BSP and DFW Performance

K-means algorithm performance (AllReduce communication):
[Charts: Left - K-means job execution time on 16 nodes with varying numbers of centers, 2 million points with 320-way parallelism. Right - K-means with 4, 8 and 16 nodes, each node having 20 tasks; 2 million points with 16,000 centers.]

Sorting records: partition the data using a sample and regroup.
[Charts: Left - Terasort time on a 16-node cluster with 384-way parallelism; BSP and DFW show the communication time. Right - Terasort on 32 nodes with 0.5TB and 1TB datasets, parallelism of 320. Right: 16-node cluster (Victor); left: 32-node cluster (Juliet) with InfiniBand.]

• For the DFW case, a single node can get congested if many processes send messages simultaneously.
• The BSP algorithm waits for others to send messages in a ring topology and can be inefficient compared to the DFW case, where processes do not wait.

Twister:Net and Apache Heron for Streaming
[Chart: Latency of Apache Heron and Twister:Net DFW (dataflow) for Reduce, Broadcast and Partition operations on 16 nodes with 256-way parallelism.]

Robot Algorithms
• N-Body collision avoidance: robots need to avoid collisions when they move.
• SLAM (Simultaneous Localization and Mapping): a robot with a laser range finder; the map is built from the robot data.

Streaming SLAM workflow (Apache Storm):
• A gateway sends laser readings to pub-sub message brokers (RabbitMQ, Kafka) and persists them to storage (see the sketch below).
• A streaming workflow runs the SLAM algorithm as a stream application with some tasks running in parallel; multiple streaming workflows are supported.
• Hosted in the FutureSystems OpenStack cloud, which is accessible through the IU network.
• End-to-end delays without any processing are less than 10 ms.
• Rao-Blackwellized particle filter based SLAM.
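A minimal sketch of the gateway's "sending to pub-sub" step using the Kafka producer API; the broker address and the topic name "laser-scans" are assumptions for illustration, and the record format is invented.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Minimal sketch of the gateway publishing a laser-range reading to a Kafka
// topic consumed by the streaming SLAM job. Broker address and topic name
// ("laser-scans") are illustrative assumptions.
public class LaserScanGateway {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      String scan = "robot-1,1534200000000,0.52 0.53 0.55";  // id, timestamp, ranges
      producer.send(new ProducerRecord<>("laser-scans", "robot-1", scan));
    }
  }
}
```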

Performance of SLAM: Storm v. Twister2

[Charts: Storm implementation speedup and Twister2 implementation speedup, each with 180 and 640 laser readings.]

Implementing Twister2 in detail III
State
http://www.iterativemapreduce.org/

Resource Allocation
• Job submission & management: twister2 submit
• Resource managers:
  • Slurm
  • Nomad
  • Kubernetes
  • Mesos
• It takes around 5 seconds to initialize a worker in Kubernetes.
• It takes around 3 seconds to initialize a worker in Mesos.
• When 3 workers are deployed in one executor or pod, initialization times are faster in both systems.

Kubernetes and Mesos Worker Initialization Times

[Charts: worker start times (sec) vs. total number of workers (3, 9, 18, 36, 54), comparing 3 workers per pod with 1 worker per pod for Kubernetes, and 3 workers per executor with 1 worker per executor for Mesos.]

Task System
• Generate the computation graph dynamically:
  • Dynamic scheduling of tasks
  • Allows fine-grained control of the graph
• Generate the computation graph statically:
  • Dynamic or static scheduling
  • Suitable for streaming and data query applications
  • Hard to express complex computations, especially with loops
• Hybrid approach:
  • Combine both static and dynamic graphs

Task Graph Execution

[Figure: a user graph of user-defined operators and communications passes through the Scheduler (using the Worker Plan to produce a Scheduler Plan) and the Execution Planner to an Execution Plan run by the Executor over the network.]

• The task scheduler is pluggable
• The executor is pluggable
• The scheduler runs on all the workers

Scheduling algorithms (a round-robin sketch follows):
• Streaming: round robin, first fit
• Batch: data locality aware
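A small illustrative round-robin scheduler (plain Java, not the pluggable Twister2 scheduler interface): task instances are placed onto workers in a cycle, the simplest of the streaming policies listed above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative round-robin scheduler (not the actual pluggable Twister2
// interface): task instances are assigned to workers in a cycle.
public class RoundRobinScheduler {
  static Map<Integer, List<String>> schedule(List<String> taskInstances, int numWorkers) {
    Map<Integer, List<String>> plan = new HashMap<>();
    for (int w = 0; w < numWorkers; w++) plan.put(w, new ArrayList<>());
    int next = 0;
    for (String task : taskInstances) {
      plan.get(next).add(task);
      next = (next + 1) % numWorkers;        // cycle through the workers
    }
    return plan;
  }

  public static void main(String[] args) {
    List<String> tasks = List.of("source-0", "source-1", "map-0", "map-1", "sink-0");
    System.out.println(schedule(tasks, 2));  // {0=[source-0, map-0, sink-0], 1=[source-1, map-1]}
  }
}
```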

Dataflow at Different Grain Sizes

[Figure: coarse-grain dataflow links jobs in a pipeline (data preparation, clustering, dimension reduction, visualization); inside a job, internal execution dataflow nodes (maps, reduce, iterate) use HPC communication.]

Coarse-grain dataflow links jobs in such a pipeline, but internally to each job you can also elegantly express the algorithm as dataflow, with more stringent performance constraints.

• P = loadPoints()
• C = loadInitCenters()
• for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()
  }  // Iterate

Corresponding to the classic Spark K-means dataflow (a runnable sketch follows).
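A minimal runnable version of the same loop in Spark's Java API, assuming local mode and a tiny one-dimensional dataset: the centers are broadcast each iteration, points are assigned to their closest center in a map, and a reduce-by-key computes the new centers. This illustrates the dataflow; it is not code from the Twister2 project.

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

// Illustrative Spark K-means dataflow: broadcast the centers, map points to
// their closest center, reduce per center, update the centers, and iterate.
public class SparkKMeans {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("kmeans-dataflow").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<Double> points = sc.parallelize(
          Arrays.asList(0.1, 0.2, 0.3, 9.8, 9.9, 10.1));   // P = loadPoints()
      double[] centers = {0.0, 5.0};                        // C = loadInitCenters()

      for (int i = 0; i < 10; i++) {
        Broadcast<double[]> c = sc.broadcast(centers);      // withBroadcast(C)
        Map<Integer, Tuple2<Double, Integer>> sums = points
            .<Integer, Tuple2<Double, Integer>>mapToPair(
                p -> new Tuple2<>(closest(p, c.value()), new Tuple2<>(p, 1)))
            .reduceByKey((a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()))
            .collectAsMap();                                // C = T.reduce()
        for (Map.Entry<Integer, Tuple2<Double, Integer>> e : sums.entrySet()) {
          centers[e.getKey()] = e.getValue()._1() / e.getValue()._2();
        }
      }
      System.out.println("centers = " + Arrays.toString(centers));
    }
  }

  static int closest(double p, double[] centers) {
    int best = 0;
    for (int j = 1; j < centers.length; j++) {
      if (Math.abs(p - centers[j]) < Math.abs(p - centers[best])) best = j;
    }
    return best;
  }
}
```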

Workflow vs. Dataflow: different grain sizes and different performance trade-offs

• Workflow controlled by a workflow engine or a script
• Dataflow application running as a single job
• The dataflow can expand from Edge to Cloud

[Figures: a NiFi workflow; the Flink MDS dataflow graph.]

Systems State: Spark K-means Dataflow

• P = loadPoints()
• C = loadInitCenters()
• for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()  // Save state at the coordination point: store C in an RDD
  }

• State is handled differently in different systems:
  • CORBA, AMT, MPI and Storm/Heron have long-running tasks that preserve state
  • Spark and Flink preserve datasets across dataflow nodes using in-memory databases
• All systems agree on coarse-grain dataflow; state is kept only by exchanging data

Fault Tolerance and State
• A similar form of check-pointing mechanism is already used in HPC and Big Data, although HPC is informal as it doesn't typically specify state as a dataflow graph.
• Flink and Spark do better than MPI due to their use of database technologies; MPI is a bit harder due to richer state, but there is an obvious integrated model using RDD-type snapshots of MPI-style jobs.
• Checkpoint after each stage of the dataflow graph (at the location of intelligent dataflow nodes):
  • A natural synchronization point
  • Let the user choose when to checkpoint (not every stage); see the sketch below
  • Save state as the user specifies; Spark just saves the model state, which is insufficient for complex algorithms
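A hypothetical sketch (plain Java serialization, not a Twister2 API) of user-chosen checkpointing at a coordination point: the user registers exactly which pieces of state to save and decides which stages checkpoint, rather than saving only the model.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: at a coordination point the user decides whether to
// checkpoint and registers which pieces of state to save - not only the model.
final class CoordinationPointCheckpoint {
  private final Map<String, Serializable> state = new LinkedHashMap<>();

  void register(String name, Serializable value) { state.put(name, value); }

  // Called at the end of a dataflow stage; the user picks which stages checkpoint.
  void maybeCheckpoint(int stage, boolean userWantsCheckpoint, Path dir) throws IOException {
    if (!userWantsCheckpoint) return;                    // not every stage is saved
    Path file = dir.resolve("stage-" + stage + ".ckpt");
    try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
      out.writeObject(new LinkedHashMap<>(state));       // model + any auxiliary state
    }
  }
}
```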


Implementing Twister2 Futures
http://www.iterativemapreduce.org/

Twister2 Timeline: End of August 2018
• Twister:Net dataflow communication API
  • Dataflow communications with MPI or TCP
• Harp for machine learning (custom BSP communications)
  • Rich collectives
  • Around 30 ML algorithms
• HDFS integration
• Task graph
  • Streaming - Storm model
  • Batch analytics - Hadoop
• Deployments on Docker, Kubernetes, Mesos (Aurora), Nomad, Slurm

Twister2 Timeline: End of December 2018
• Native MPI integration with Mesos, Yarn
• Naiad-model-based task system for machine learning
• Link to Pilot Jobs
• Fault tolerance
  • Streaming
  • Batch
• Hierarchical dataflows with streaming, machine learning and batch integrated seamlessly
• Data abstractions for streaming and batch (Streamlets, RDD)
• Workflow graphs (Kepler, Spark) with linkage defined by data abstractions (RDD)
• End-to-end applications

Twister2 Timeline: After December 2018
• Dynamic task migrations
• RDMA and other communication enhancements
• Integrate parts of Twister2 components as big data systems enhancements (i.e. run current Big Data software invoking Twister2 components)
  • Heron (easiest), Spark, Flink, Hadoop (like Harp today)
• Support different APIs (i.e. run Twister2 looking like current Big Data software)
  • Hadoop
  • Spark (Flink)
  • Storm
• Refinements like Marathon with Mesos etc.
• Function as a Service and serverless
• Support higher-level abstractions
  • Twister:SQL

Summary of Twister2: Next Generation HPC Cloud + Edge + Grid
• We have built a high-performance data analysis library, SPIDAL.
• We have integrated HPC into many Apache systems with HPC-ABDS, with a rich set of collectives.
• We have done a preliminary analysis of the different runtimes of Hadoop, Spark, Flink, Storm, Heron, Naiad and DARMA (HPC Asynchronous Many-Task) and identified key components.
• There are different technologies for different circumstances, but they can be unified by high-level abstractions such as communication, data and task APIs.
• Apache systems use dataflow communication, which is natural for distributed systems but slower for classic parallel computing.
• There is no standard dataflow library (why?). Add dataflow primitives in MPI-4?
• HPC could adopt some of the tools of Big Data, such as coordination points (dataflow nodes) and state management (fault tolerance) with RDD (datasets).
• We could integrate dataflow and workflow in a cleaner fashion.
• It is not clear that so many big data and resource management approaches are needed.