Stream all the things

Stream All the Things Streaming Architectures for Data Streams that Never End Dean Wampler, Ph.D. (@deanwampler)

Transcript of Stream all the things

Page 1: Stream all the things

Stream All the Things

Streaming Architectures for Data Streams that Never End

Dean Wampler, Ph.D. (@deanwampler)

Page 2: Stream all the things

Free, as in 🍺

bit.ly/fastdata-ORbook

My book, published last fall, describes the points in this talk in greater depth. I’ve refined the talk a bit since this was published.

Page 3: Stream all the things

Streaming Engines in Context…

Page 4: Stream all the things

Classic Batch Architecture: Hadoop

Page 5: Stream all the things

[Figure: Hadoop cluster schematic. Labels: Logs; Flume; Sqoop; DBs; YARN (ResourceManager, NodeManager); Batch (MapReduce, Spark); Persistence (S3, HDFS, Disk, SQL/NoSQL, Search).]

Schematic view of Hadoop. You need 3 core things…

Page 6: Stream all the things

[Figure: Hadoop cluster schematic, repeated from Page 5.]

1. Storage tier, either HDFS (Hadoop Distributed File System) or alternatives like S3 or databases.

Page 7: Stream all the things

[Figure: Hadoop cluster schematic, repeated from Page 5.]

2. You need a compute engine for processing data. First we had MapReduce, then a few other tools to replace it; finally, Spark has emerged as the most viable successor.

Page 8: Stream all the things

[Figure: Hadoop cluster schematic, repeated from Page 5.]

3. Finally, you need a manager of resources, scheduler of jobs and tasks, etc.

Page 9: Stream all the things

[Figure: Hadoop cluster schematic, repeated from Page 5.]

Other tools are built on this foundation or supplement it, like tools for ingesting data, such as Sqoop and Flume.

Page 10: Stream all the things

New Streaming, “Fast Data” Architecture

(but it also supports batch)

Page 11: Stream all the things

[Figure: Fast Data architecture. Labels: Mesos, YARN, Cloud, …; Logs; Sockets; REST; ZooKeeper Cluster (ZK); Kafka Cluster; Microservices (RP, Go, Node.js, …); Low Latency (Flink, Kafka Streams, Akka Streams); Beam; Mini-batch (Spark Streaming); Batch (Spark); Persistence (S3, HDFS, Disk, SQL/NoSQL, Search). Callouts are numbered 1 to 11.]

Numbering is in the report, so it’s easier for the text to refer back to the figure.

Page 12: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

1 & 2. Data sources include streams of data over sockets and from logs. You might also ingest through REST channels, but unless they are async, the overhead will be too high, so REST might be used only for communicating with the microservices (3) you write to complete your environment.

Page 13: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

3. A real fast data environment is similar to more general services: you’ll have the “heavy hitters” like Spark, Kafka, etc., but you’ll also need to write support microservices to complete the environment.

Page 14: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

4. Some services, like Kafka, require ZooKeeper for managing shared state, such as leaders.

Page 15: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

5. Kafka is the core of this architecture, the place where data is ingested in queues, organized into topics. Publishers and consumers are decoupled and N-to-M. Kafka has massive scalability and excellent resiliency and data durability. All services can communicate with each other through Kafka, too, rather than having to manage arbitrary point-to-point connections.
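To make the decoupling concrete, here is a minimal in-memory sketch of a topic-based log. It is not the Kafka API: the `ToyBroker` class, its `publish` and `poll` methods, and the "dashboard" and "archiver" group names are all hypothetical, chosen only to show that publishers append to a topic without knowing who reads it, and that each consumer group tracks its own position.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a topic-based log (hypothetical, not the Kafka API)."""

    def __init__(self):
        # topic name -> append-only list of records
        self._topics = defaultdict(list)
        # (consumer group, topic) -> index of the next record that group will read
        self._offsets = defaultdict(int)

    def publish(self, topic, record):
        """Append a record to a topic; the publisher knows nothing about consumers."""
        self._topics[topic].append(record)

    def poll(self, group, topic):
        """Return the records this group has not seen yet and advance its offset."""
        start = self._offsets[(group, topic)]
        records = self._topics[topic][start:]
        self._offsets[(group, topic)] = start + len(records)
        return records


broker = ToyBroker()

# Two publishers (stores A and B) write to the same topic.
broker.publish("sales", {"store": "A", "amount": 10})
broker.publish("sales", {"store": "B", "amount": 25})

# Two consumer groups read the same topic independently of each other.
dashboard_first = broker.poll("dashboard", "sales")
archiver_first = broker.poll("archiver", "sales")

# A later record is delivered only once to a group that has already caught up.
broker.publish("sales", {"store": "A", "amount": 5})
dashboard_second = broker.poll("dashboard", "sales")
```

Adding a third reader here needs no change to the publishers, which is the property the N-to-M description refers to. A real deployment adds partitioning, replication, and durable storage, none of which this sketch attempts.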

Page 16: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

6, 8, and 10. Many pure streaming, mini-batch streaming, and batch engines are vying for your attention. We’ll focus in on this area after this overview.

Page 17: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

7, 9, and 10. Data can also be read and written between these compute engines and storage, not just Kafka.

Page 18: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

11. You can run this architecture on Mesos, YARN (Hadoop), on premise or in the cloud. (At Lightbend, we think Mesos is the best choice.)

Page 19: Stream all the things

Streaming Engines

Now let’s zero in…

Page 20: Stream all the things

Features to Consider

When deciding what to use, there are several dimensions or features to consider.

Page 21: Stream all the things

• Low latency? How low?
• …

If you have tight latency requirements, you can’t use a mini-batch or batch engine. If your constraints are more flexible, you can do more sophisticated things, like training ML models, writing to databases, etc.

Page 22: Stream all the things

• Low latency? How low?
• High volume? How high?
• …

Some tools are better at large volumes than others. If you have low volumes, you’ll want tools that are efficient at low volumes, whereas some of the high-volume tools are efficient per record only when their overhead is amortized over the stream.

Page 23: Stream all the things

• Low latency? How low?
• High volume? How high?
• Integration with other tools?
  • Which ones?
• …

Connecting everything through Kafka can eliminate this requirement, but often that’s not ideal and you may need some tools to have direct connections to other tools.

Page 24: Stream all the things

• …
• Which kinds of data processing, analytics?
  • Process individual events?
  • Process records in bulk?

Is your data processing essentially data warehouse kinds of analytics or similar SQL queries? Are you doing ETL of data? Are you doing aggregations? Who is consuming this data? Are you doing complex event processing (CEP) where you need to make decisions individually, per event? Or is it a stream with processing “en masse”, like typical SQL and ETL processing?

We’ll comment on how these characteristics are realized in the specific tools we’ll discuss next, but if you’re using other tools, consider how they fit these dimensions.
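The difference between per-event and bulk processing can be sketched in a few lines. This is an illustration only: the order records, the `flag_large_orders` and `total_by_region` function names, and the threshold of 500 are assumptions made up for the example, not taken from any of the tools discussed.

```python
def flag_large_orders(events, threshold):
    """Per-event style: make a decision on each event as it is seen."""
    for event in events:
        if event["amount"] > threshold:
            yield {"alert": "large_order", "order_id": event["order_id"]}


def total_by_region(events):
    """Bulk style: aggregate over a whole set of records, like a SQL GROUP BY."""
    totals = {}
    for event in events:
        totals[event["region"]] = totals.get(event["region"], 0) + event["amount"]
    return totals


# Hypothetical order events.
orders = [
    {"order_id": 1, "region": "east", "amount": 40},
    {"order_id": 2, "region": "west", "amount": 900},
    {"order_id": 3, "region": "east", "amount": 60},
]

alerts = list(flag_large_orders(orders, threshold=500))
totals = total_by_region(orders)
```

The first function can emit a result the moment a single event arrives, so it suits CEP-style decisions. The second only produces a meaningful answer over a group of records, which is why it tolerates (and benefits from) mini-batch or batch execution.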

Page 25: Stream all the things

Best of Breed Streaming Engines

Page 26: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

Focusing in on the pure (low-latency) streaming and mini-batch engines on the right…

Page 27: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

• Apache Beam
  • Based on Google Dataflow
  • Requires a “runner”
  • (Dataflow is the runner in Google Cloud…)
  • Most sophisticated streaming semantics.

Google has spent years thinking about all the things a streaming engine has to handle if you want to do accurate calculations on streams, accounting for all the contingencies that can happen. The latest incarnation is Google Dataflow, available as a service in Google Cloud Platform. Google recently open-sourced the part of Dataflow for defining “dataflows” with these sophisticated semantics, called Apache Beam. Beam doesn’t provide a runner, so third-party tools provide this capability.

Page 28: Stream all the things

Here is an example of the scenarios you have to handle. Suppose you are computing per-minute aggregations (like sales per minute). The analysis machine has one clock that is not necessarily in sync with the other servers processing sales. Worse, there is an unavoidable time delay for events to arrive at the analysis server. Hence you need to process by event time and you need to account for late-arriving data: not only small delays, where some events from one minute arrive within the following minute, but even large delays due to network partitions, busy servers, etc. This is one example of the challenges posed by streaming when you want accurate rather than approximate results.
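The per-minute scenario can be sketched as follows. This is a simplified illustration, not how Beam or any particular engine implements it: the `MinuteWindows` class, the 30-second allowed lateness, and the sample events are all assumptions. Events are assigned to windows by their own timestamps, and a window is only finalized once a "watermark" (the largest event time seen so far, minus the allowed lateness) has passed the end of that window.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # per-minute windows, as in the sales-per-minute scenario


class MinuteWindows:
    """Per-minute sums keyed by event time, with a simple watermark (illustrative)."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.open_windows = defaultdict(int)  # window start -> running sum
        self.closed = {}                      # window start -> finalized sum
        self.too_late = []                    # events that missed their window
        self.max_event_time = 0

    def watermark(self):
        # Assumption: no event older than this is still expected to arrive.
        return self.max_event_time - self.allowed_lateness

    def on_event(self, event_time, amount):
        window_start = event_time - (event_time % WINDOW_SECONDS)
        if window_start + WINDOW_SECONDS <= self.watermark():
            # The window for this event is already final; set the event aside.
            self.too_late.append((event_time, amount))
            return
        self.open_windows[window_start] += amount
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize every open window that the watermark has now passed.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= self.watermark():
                self.closed[start] = self.open_windows.pop(start)


windows = MinuteWindows(allowed_lateness=30)
# (event_time_seconds, amount) in ARRIVAL order; event times are out of order.
for event_time, amount in [(5, 10), (65, 20), (50, 5), (95, 1), (40, 7)]:
    windows.on_event(event_time, amount)
```

In this run the event stamped 50 arrives after one stamped 65 but is still counted in the first minute, because the watermark has not yet passed that window. The event stamped 40 arrives after the first minute was finalized and is set aside. Choosing the allowed lateness is the trade-off the note describes: a larger value gives more accurate sums but delays results.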

Page 29: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

• Apache Flink
  • High volume
  • Low latency
  • Apache Beam runner
  • Evolving SQL and ML support

Flink provides two unique capabilities among these choices: 1) low-latency processing at scale, and 2) it is the most mature runner for Apache Beam dataflows (at least that I know of) outside Google’s own Dataflow engine.

Page 30: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

• Akka Streams
  • Complex event processing
  • Efficient per-event
  • Not for large scale “pipes”
  • Low latency
  • “Might” become an Apache Beam runner

There is a lesser-known engine called Gearpump (Intel project) that offers similar capabilities to Flink, but is implemented in Akka, so it could provide a Beam runner capability on top of Akka Streams. Hence, Akka Streams might be extended to run Beam flows for the situation where event processing (i.e., CEP) is the preferred model, as opposed to “fat” data pipes. Akka also provides a powerful Actor model for rich concurrency, and libraries for clustering, state management, and interop with many sources and sinks (the “Alpakka” project).

Page 31: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

• Kafka Streams
  • Low-overhead processing of Kafka topics.
  • Ideal for:
  • ETL of raw data
  • Last value for a key, aggregations (“KTable” abstraction)

Kafka Streams is a great 80% solution that works closely with Kafka (in terms of production deployment and overhead). It is designed to read Kafka topics, perform processing like transformations, filtering, aggregations, “last-seen” value for keys, etc., then write results back to Kafka. It nicely addresses many common design challenges, but isn’t designed to be a complete solution for stream processing, e.g., running SQL queries and training ML models.
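The “last-seen value for a key” idea can be illustrated without the Kafka Streams library. The sketch below is plain Python, not the KTable API: `latest_by_key` and `count_by_key` are hypothetical helper names, and treating a `None` value as a delete marker is an assumption made for this example.

```python
def latest_by_key(changelog):
    """Fold a stream of (key, value) updates into a table of last-seen values."""
    table = {}
    for key, value in changelog:
        if value is None:
            # Assumption for this sketch: None marks the key as deleted.
            table.pop(key, None)
        else:
            table[key] = value
    return table


def count_by_key(events):
    """A simple aggregation: how many updates were seen for each key."""
    counts = {}
    for key, _ in events:
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical stream of user-location updates, in arrival order.
updates = [
    ("alice", "NYC"),
    ("bob", "SF"),
    ("alice", "Chicago"),
    ("bob", None),
]

current_locations = latest_by_key(updates)
update_counts = count_by_key(updates)
```

The same stream yields two different views: the table keeps only the newest value per key, while the aggregation accounts for every record. A real stream processor maintains these views incrementally as records arrive rather than recomputing them over a list.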

Page 32: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

• Spark
  • Streaming: mini-batch model
  • Latency > ~500 msecs
  • Ideal for:
  • Rich SQL
  • ML model training
  • Beam support? Coming…

Spark Streaming is a mini-batch model, although this is slowly being replaced with a true, low-latency streaming model. So, today, use it for more expensive calculations, like training ML models, but don’t use it when the lowest-latency processing is required. Lightbend’s Fast Data Platform is working on tools to make it easy to train ML models in Spark and serve them with the other tools. Another advantage of Spark is the rich ecosystem of tools for a wide variety of batch and streaming scenarios.
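A small sketch shows why a mini-batch model puts a floor under latency. It does not use Spark: the `mini_batches` and `worst_case_wait` functions, the 500-millisecond interval, and the sample arrival times are assumptions chosen for illustration. Events are grouped by fixed arrival-time intervals, and nothing in a batch is processed until its interval ends.

```python
def mini_batches(events, interval):
    """Group (arrival_time, payload) pairs into fixed arrival-time intervals."""
    batches = {}
    for arrival_time, payload in events:
        batch_start = (arrival_time // interval) * interval
        batches.setdefault(batch_start, []).append((arrival_time, payload))
    return batches


def worst_case_wait(batches, interval):
    """Longest time any event waited for its batch interval to end."""
    return max(
        (batch_start + interval) - arrival_time
        for batch_start, batch in batches.items()
        for arrival_time, _ in batch
    )


# Hypothetical arrival times in milliseconds.
events = [(0, "a"), (120, "b"), (499, "c"), (500, "d"), (980, "e")]
interval_ms = 500

batches = mini_batches(events, interval_ms)
longest_wait_ms = worst_case_wait(batches, interval_ms)
```

An event that arrives right at the start of an interval waits the full interval before processing even begins, so end-to-end latency cannot drop much below the batch interval. In exchange, each batch can be handed to bulk operations such as SQL queries or model training.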

Page 33: Stream all the things

What about Microservices?

I’m going to argue that service architectures (classically three tier, but evolving…) and data architectures (classically Big Data like Hadoop) are converging.

Page 34: Stream all the things

[Figure: Fast Data architecture diagram, repeated from Page 11.]

• How is this…

Page 35: Stream all the things

• … like this?

• Image from:

You can get this great report by Jonas Bonér here: lightbend.com/reactive-microservices-architecture

Page 36: Stream all the things

• A Data app | microservice:
  • Has one responsibility
  • …

[Figure: Fast Data architecture diagram, repeated from Page 11.]

Each data app, streaming or batch, is typically doing one thing, like ETL this Kafka topic to Cassandra, or compute aggregates for a dashboard, etc. Similarly, services evolving now into microservices are also supposed to do one thing and communicate with other services for the “help” they need.

Page 37: Stream all the things

• A Data app | microservice:
  • Has one responsibility
  • Ingests and processes a never-ending stream of data | messages
  • …

[Figure: Fast Data architecture diagram, repeated from Page 11.]

A streaming data app will have a never-ending stream of data to process. Similarly, requests for service will never stop coming to microservices.

Page 38: Stream all the things

• A Data app | microservice:
  • Has one responsibility
  • Ingests and processes a never-ending stream of data | messages
  • Must be available, responsive, resilient, scalable, … reactive

[Figure: Fast Data architecture diagram, repeated from Page 11.]

Hence, both kinds of systems must provide high availability to respond to the stream of data | messages coming in. That means that both must be resilient against failure and scale on demand to meet the load. These are the reactive principles.

Page 39: Stream all the things

• Thesis:
  • Microservice and Data architectures are converging.
  • Similar design problems.
  • Data tends to dominate as environment grows.

[Figure: Fast Data architecture diagram, repeated from Page 11.]

So, both kinds of systems have similar design problems, hence both should be implemented in similar ways. Also, organizations that aren’t particularly data focused, especially when they are new endeavors, often find that data grows to be a dominant concern, a sign of success! For example, Twitter started as a classic three-tier application, but now most of their infrastructure looks like a petroleum refinery.

Page 40: Stream all the things

• <Marketing_Plug />
  • Lightbend Fast Data Platform
  • Converged data services and microservice tools
  • Quick start, expert guidance
  • Intelligent cluster management tools.

[Figure: Fast Data architecture diagram, repeated from Page 11.]

Page 41: Stream all the things

Free, as in 🍺

bit.ly/fastdata-ORbook

Once again, the link to my book, published last fall, that describes the points in this talk in greater depth. I’ve refined the talk a bit since this was published.

Page 42: Stream all the things

Interested in Lightbend Fast Data Platform?

lightbend.com/fast-data-platform

This URL takes you to information about the product we’re building, Fast Data Platform, that helps teams succeed with these technologies, from initial installation and configuration, to writing applications, to runtime production monitoring, to advanced machine-learning based automation.