Stream All the Things
Streaming Architectures for Data Streams that Never End
Dean Wampler, Ph.D. (@deanwampler)
Free, as in 🍺
bit.ly/fastdata-ORbook
My book, published last fall, that describes the points in this talk in greater depth. I’ve refined the talk a bit since this was published.
Streaming Engines in Context…
Classic Batch Architecture: Hadoop
[Figure: schematic of Hadoop. Labels: Logs, Flume; DBs, Sqoop; Batch: MapReduce, Spark, …; YARN: ResourceManager, NodeManager (NM); Persistence: HDFS, S3, SQL/NoSQL, Search, Disk.]
Schematic view of Hadoop. You need 3 core things…
1. Storage tier, either HDFS (Hadoop Distributed File System) or alternatives like S3 or databases.
2. You need a compute engine for processing data. First we had MapReduce, then a few other tools to replace it; finally, Spark has emerged as the most viable successor.
3. Finally, you need a manager of resources, scheduler of jobs and tasks, etc.
Other tools are built on this foundation or supplement it, like tools for ingesting data, such as Sqoop and Flume.
New Streaming, “Fast Data” Architecture
(but it also supports batch)
[Figure: the Fast Data architecture, with callouts numbered 1 to 11. Labels: Logs, Sockets, REST; ZooKeeper Cluster; Kafka Cluster; Microservices (RP, Go, Node.js, …); Low Latency (Flink, Kafka Streams, Akka Streams, Beam, …); Mini-batch (Spark Streaming); Batch (Spark, …); Persistence (S3, HDFS, SQL/NoSQL, Search, Disk); Mesos, YARN, Cloud, …]
Numbering is in the report, so it’s easier for the text to refer back to the figure.
1 & 2. Data sources include streams of data over sockets and from logs. You might also ingest through REST channels, but unless they are async, the overhead will be too high, so REST might be used only for communicating with the microservices (3) you write to complete your environment.
3. A real fast data environment is similar to more general services: you’ll have the “heavy hitters” like Spark, Kafka, etc., but you’ll also need to write support microservices to complete the environment.
4. Some services, like Kafka, require ZooKeeper for managing shared state, such as leaders.
5. Kafka is the core of this architecture, the place where data is ingested in queues, organized into topics. Publishers and consumers are decoupled and N-to-M. Kafka has massive scalability and excellent resiliency and data durability. All services can communicate with each other through Kafka, too, rather than having to manage arbitrary point-to-point connections.
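To make the decoupling concrete, here is a minimal in-memory sketch of the idea, not Kafka's actual API: a topic is an append-only log, and each consumer tracks its own read offset, so any number of publishers and consumers can share it without knowing about each other. The `Topic` class and the names `sales`, `dashboard`, and `archiver` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class Topic:
    """A toy append-only log standing in for one topic (illustrative only)."""

    def __init__(self):
        self.log = []                    # records in arrival order
        self.offsets = defaultdict(int)  # next offset to read, per consumer

    def publish(self, record):
        # Publishers only append; they never know who will read.
        self.log.append(record)

    def poll(self, consumer, max_records=10):
        # Each consumer advances its own offset, independent of the others.
        start = self.offsets[consumer]
        batch = self.log[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch


sales = Topic()
for producer in ("web", "mobile"):       # two independent publishers
    for n in range(2):
        sales.publish(f"{producer}-order-{n}")

dashboard = sales.poll("dashboard")                    # reads everything
archiver_first = sales.poll("archiver", max_records=3)  # reads at its own pace
archiver_rest = sales.poll("archiver")
```

Because reading does not remove records from the log, a slow consumer such as the hypothetical `archiver` never holds up a fast one.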
6, 8, and 10. Many pure streaming, mini-batch streaming, and batch engines are vying for your attention. We’ll focus on this area after this overview.
7, 9, and 10. Data can also be read and written between these compute engines and storage, not just Kafka.
11. You can run this architecture on Mesos, YARN (Hadoop), on premise or in the cloud. (At Lightbend, we think Mesos is the best choice.)
Streaming Engines
Now let’s zero in…
Features to Consider
When deciding what to use, there are several dimensions or features to consider.
• Low latency? How low?
• …
If you have tight latency requirements, you can’t use a mini-batch or batch engine. If your constraints are more flexible, you can do more sophisticated things, like training ML models, writing to databases, etc.
• Low latency? How low?
• High volume? How high?
• …
Some tools are better at large volumes than others. If you have low volumes, you’ll want tools that are efficient at low volumes, whereas some of the high-volume tools are efficient per record when amortized over the stream.
• Low latency? How low?
• High volume? How high?
• Integration with other tools?
• Which ones?
• …
Connecting everything through Kafka can eliminate this requirement, but often that’s not ideal and you may need some tools to have direct connections to other tools.
• …
• Which kinds of data processing, analytics?
• Process individual events?
• Process records in bulk?
Is your data processing essentially data-warehouse-style analytics or similar SQL queries? Are you doing ETL of data? Are you doing aggregations? Who is consuming this data? Are you doing complex event processing (CEP), where you need to make decisions individually, per event? Or is it a stream with processing “en masse”, like typical SQL and ETL processing?
We’ll comment on how these characteristics are realized in the specific tools we’ll discuss next, but if you’re using other tools, consider how they fit these dimensions.
Best of Breed Streaming Engines
Focusing in on the pure (low-latency) streaming and mini-batch engines on the right…
• Apache Beam
• Based on Google Dataflow
• Requires a “runner”
• (Dataflow is the runner in Google Cloud…)
• Most sophisticated streaming semantics.
Google has spent years thinking about all the things a streaming engine has to handle if you want to do accurate calculations on streams, accounting for all the contingencies that can happen. The latest incarnation is Google Dataflow, available as a service in Google Cloud Platform. Google recently open-sourced the part of Dataflow for defining “dataflows” with these sophisticated semantics, called Apache Beam. Beam doesn’t provide a runner, so third-party tools provide this capability.
Here is an example of the scenarios you have to handle. Suppose you are computing per-minute aggregations (like sales per minute). The analysis machine has one clock that is not necessarily in sync with the other servers processing sales. Worse, there is an unavoidable time delay for events to arrive at the analysis server. Hence you need to process by event time and you need to account for late-arriving data, not only small delays where some events for one minute will arrive within the following minute, but even large delays due to network partitions, busy servers, etc. This is one example of the challenges posed by streaming when you want accurate vs. approximate results.
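The sales-per-minute scenario can be sketched in a few lines. This is a simplified illustration, not how Beam or any particular runner implements it: events are assigned to one-minute windows by their own timestamps (event time), and a crude watermark, here just the highest event time seen so far, decides when an event is too late to count. The two-minute `ALLOWED_LATENESS_SECS` and the function name `aggregate_sales` are assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SECS = 60
ALLOWED_LATENESS_SECS = 120  # assumption: accept events up to 2 minutes late


def aggregate_sales(events):
    """Sum sale amounts per one-minute event-time window.

    events: iterable of (event_time_secs, amount), in ARRIVAL order,
    which may differ from event-time order.
    """
    totals = defaultdict(float)
    dropped = []
    watermark = 0  # crude watermark: the highest event time seen so far
    for event_time, amount in events:
        watermark = max(watermark, event_time)
        window_start = event_time - event_time % WINDOW_SECS
        window_end = window_start + WINDOW_SECS
        if window_end + ALLOWED_LATENESS_SECS <= watermark:
            # The window closed long ago; a real system might instead
            # emit a correction or route the event to a side output.
            dropped.append((event_time, amount))
        else:
            totals[window_start] += amount
    return dict(totals), dropped


# The sale at t=58 arrives after the one at t=62 (a small delay, still counted).
# The sale at t=20 arrives after t=250 (a large delay, too late to count).
arrivals = [(5, 10.0), (62, 4.0), (58, 1.0), (130, 2.0), (250, 3.0), (20, 7.0)]
totals, dropped = aggregate_sales(arrivals)
```

The allowed lateness is the knob that trades accuracy against how long window state must be kept open.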
• Apache Flink
• High volume
• Low latency
• Apache Beam runner
• Evolving SQL and ML support
Flink provides two unique capabilities among these choices: 1) low-latency processing at scale, and 2) the most mature runner for Apache Beam dataflows (at least that I know of) outside Google’s own Dataflow engine.
• Akka Streams
• Complex event processing
• Efficient per-event
• Not for large scale “pipes”
• Low latency
• “Might” become an Apache Beam runner
There is a lesser-known engine called Gearpump (Intel project) that offers similar capabilities to Flink, but is implemented in Akka, so it could provide a Beam runner capability on top of Akka Streams. Hence, Akka Streams might be extended to run Beam flows for the situation where event processing (i.e., CEP) is the preferred model, as opposed to “fat” data pipes. Akka also provides a powerful Actor model for rich concurrency, and libraries for clustering, state management, and interop with many sources and sinks (the “Alpakka” project).
• Kafka Streams
• Low-overhead processing of Kafka topics.
• Ideal for:
• ETL of raw data
• Last value for a key, aggregations (“KTable” abstraction)
Kafka Streams is a great 80% solution that works closely with Kafka (in terms of production deployment and overhead). It is designed to read Kafka topics, perform processing like transformations, filtering, aggregations, “last-seen” value for keys, etc., then write results back to Kafka. It nicely addresses many common design challenges, but isn’t designed to be a complete solution for stream processing, e.g., running SQL queries and training ML models.
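The “last-seen value for a key” idea can be illustrated without any Kafka dependency. The sketch below is plain Python, not the Kafka Streams API: it folds a stream of keyed records into a table that keeps only the latest value per key, which is roughly what the “KTable” abstraction represents. Treating a `None` value as a delete mirrors Kafka's tombstone convention; the function names and sample records are hypothetical.

```python
from collections import Counter


def latest_by_key(records):
    """Fold (key, value) records into a table of the last value per key."""
    table = {}
    for key, value in records:
        if value is None:
            table.pop(key, None)  # tombstone: the key is deleted
        else:
            table[key] = value    # a later record overwrites an earlier one
    return table


def count_by_key(records):
    """A simple aggregation over the same stream: records seen per key."""
    return Counter(key for key, _ in records)


# A hypothetical stream of user-location updates.
updates = [("alice", "NYC"), ("bob", "SF"), ("alice", "LA"), ("bob", None)]
table = latest_by_key(updates)
counts = count_by_key(updates)
```

The same records support both views: the stream is the full history, while the table is its current state.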
• Spark
• Streaming: mini-batch model
• Latency > ~500 msecs
• Ideal for:
• Rich SQL
• ML model training
• Beam support? Coming…
Spark Streaming is a mini-batch model, although this is slowly being replaced with a true, low-latency streaming model. So, today, use it for more expensive calculations, like training ML models, but don’t use it when the lowest-latency processing is required. Lightbend’s Fast Data Platform is working on tools to make it easy to train ML models in Spark and serve them with the other tools. Another advantage of Spark is the rich ecosystem of tools for a wide variety of batch and streaming scenarios.
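To see why a mini-batch model puts a floor under latency, consider this small sketch. It is plain Python rather than Spark's API, and the generator name `mini_batches` is hypothetical: records are collected into fixed intervals and handed on only when an interval closes, so a record that arrives early in an interval waits for the rest of it. The 500 ms default echoes the latency figure on the slide.

```python
def mini_batches(records, interval_ms=500):
    """Group (arrival_ms, value) records into fixed-length batches.

    records must be ordered by arrival time. Yields (batch_end_ms, values)
    each time an interval closes, including empty intervals.
    """
    batch, batch_end = [], None
    for arrival_ms, value in records:
        if batch_end is None:
            # Align the first batch to the interval containing the first record.
            batch_end = (arrival_ms // interval_ms + 1) * interval_ms
        while arrival_ms >= batch_end:
            yield batch_end, batch
            batch, batch_end = [], batch_end + interval_ms
        batch.append(value)
    if batch_end is not None:
        yield batch_end, batch  # flush the final, possibly partial, batch


stream = [(10, "a"), (480, "b"), (510, "c"), (1600, "d")]
batches = list(mini_batches(stream))
```

Record `"a"` arrives at 10 ms but is not emitted until the batch closes at 500 ms, which is the latency cost paid for the chance to run heavier per-batch work.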
What about Microservices?
I’m going to argue that service architectures (classically three tier, but evolving…) and data architectures (classically Big Data like Hadoop) are converging.
• How is this…
• … like this?
• Image from:
You can get this great report by Jonas Bonér here: lightbend.com/reactive-microservices-architecture
• A Data app | microservice:
• Has one responsibility
• …
Each data app, streaming or batch, is typically doing one thing, like ETL this Kafka topic to Cassandra, or compute aggregates for a dashboard, etc. Similarly, services evolving now into microservices are also supposed to do one thing and communicate with other services for the “help” they need.
• A Data app | microservice:
• Has one responsibility
• Ingests and processes a never-ending stream of data | messages
• …
A streaming data app will have a never-ending stream of data to process. Similarly, requests for service will never stop coming to microservices.
• A Data app | microservice:
• Has one responsibility
• Ingests and processes a never-ending stream of data | messages
• Must be available, responsive, resilient, scalable, … reactive
Hence, both kinds of systems must provide high availability to respond to the stream of data | messages coming in. That means that both must be resilient against failure and scale on demand to meet the load. These are the reactive principles.
• Thesis:
• Microservice and Data architectures are converging.
• Similar design problems.
• Data tends to dominate as environment grows.
So, both kinds of systems have similar design problems, hence both should be implemented in similar ways. Also, organizations that aren’t particularly data focused, especially when they are new endeavors, often find that data grows to be a dominant concern, a sign of success! For example, Twitter started as a classic three-tier application, but now most of their infrastructure looks like a petroleum refinery.
• <Marketing_Plug />
• Lightbend Fast Data Platform
• Converged data services and microservice tools
• Quick start, expert guidance
• Intelligent cluster management tools.
Free, as in 🍺
bit.ly/fastdata-ORbook
Once again, the link to my book, published last fall, that describes the points in this talk in greater depth. I’ve refined the talk a bit since this was published.
Interested in Lightbend Fast Data Platform?
lightbend.com/fast-data-platform
This URL takes you to information about the product we’re building, Fast Data Platform, that helps teams succeed with these technologies, from initial installation and configuration, to writing applications, to runtime production monitoring, to advanced machine-learning based automation.