DataFlow & Beam
Gabe Hamilton
So you've built your perfect video game.
People all over the world are playing it.
Now for Billing, High Scores, etc.
People are playing your game on servers all over the world.
It's time to start crunching all your data for billing, high scores, error reports, etc.
The time that events happened is important.
You charge per minute played, with surge pricing!
Data often arrives late.
Network delays. Servers go down and send their data hours later.
What we're going to cover
What is Dataflow?
Start a demo!
DataFlow Code
Batches & Streaming
Event Time
What is Google DataFlow?
Distributed Streaming (and Batch) Data processing engine
Pulls in data from Sources
Writes data to Sinks
Spins up data processing nodes, pushes your code out to them
Like Hadoop but handles unbounded data? Yep, similar idea
Up and running in 10 mins
https://cloud.google.com/dataflow/getting-started
create a project
add dataflow API
create a google storage bucket
gcloud auth login
git clone [email protected]:gabehamilton/DataflowGroovySDK-examples.git
gradlew run -Pargs="project=PROJECT_NAME stagingLocation=BUCKET_NAME"
(requires a JDK)
Let's see some code - config
// Start from the default options and view them as Dataflow-specific options
DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions)
options.setProject("myproject")
options.setStagingLocation("gs://aStagingBucket")
options.setNumWorkers(1000) // !!! default is 3
options.setStreaming(true);
Pipeline pipeline = Pipeline.create(options);
Let's see some code - pipeline
pipeline // Extract and sum username/score pairs from the event data.
    .apply(TextIO.Read.from(options.getInput())) // Read events from a text file
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    .apply("SumByUser", new ExtractAndSumScore("user"))
    .apply("WriteUserScoreSums", new WriteToBigQuery(options.getTableName()));
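To make the parse-then-sum shape of that pipeline concrete, here is a minimal plain-Java sketch that runs locally without the Dataflow SDK. It is not the code behind ParseEventFn or ExtractAndSumScore; the class name UserScoreSketch and the comma-separated line format "user,team,score,timestampMillis" are assumptions made for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class UserScoreSketch {

    /** One parsed game event: who scored, and how much. */
    static final class GameEvent {
        final String user;
        final int score;

        GameEvent(String user, int score) {
            this.user = user;
            this.score = score;
        }
    }

    /** Parses one line. ASSUMED format: "user,team,score,timestampMillis". */
    static GameEvent parse(String line) {
        String[] parts = line.split(",");
        return new GameEvent(parts[0].trim(), Integer.parseInt(parts[2].trim()));
    }

    /** Parses every line, then sums the scores per user (sorted by user name). */
    static Map<String, Integer> sumByUser(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            GameEvent event = parse(line);
            totals.merge(event.user, event.score, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "alice,red,10,1000",
            "bob,blue,7,2000",
            "alice,red,5,3000");
        System.out.println(sumByUser(lines)); // prints {alice=15, bob=7}
    }
}
```

In a real pipeline the parse step would be a ParDo applied to each element in parallel and the sum would be a per-key combine spread across workers; the sketch only shows the data flow on one machine.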
Components
PCollection - Parallel Collection
    Standard interface to a set of data. Can be a streaming data set.
PTransform - Parallel Transform
    Takes Input, produces Output
ParDo - Parallel Do
    Your custom Transform Function
Running Dataflow - staging files
Code is staged in the staging bucket before getting pushed to Workers
Staged files - detail
We don't need no stinking Batches
Handles Batches!
Streaming!
Not real time streaming, unbounded data set streaming.
Continuously processing your:
    User Scores, Billing, Analytics
    Risk, Spam, and other deviations from mean
Event Time
Dataflow lets you work in event time (when the event says it happened)
rather than processing time (when the event was received)
Allows Out of Order processing
A plane full of mobile users just landed, turned their phones back on and started delivering the past 2 hours of data
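The difference between event time and processing time can be sketched in plain Java, without the Dataflow SDK. The sketch below assigns each event to a fixed one-hour window using only its own timestamp, so delayed events (the plane that just landed) still land in the hour they belong to. The class and method names are hypothetical, and timestamps are assumed to be milliseconds since some epoch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EventTimeWindowSketch {
    static final long HOUR_MILLIS = 60L * 60L * 1000L;

    /** Start of the fixed one-hour window that contains the event timestamp. */
    static long windowStart(long eventTimeMillis) {
        return Math.floorDiv(eventTimeMillis, HOUR_MILLIS) * HOUR_MILLIS;
    }

    /** Counts events per hourly window by event time; arrival order plays no role. */
    static Map<Long, Integer> countPerWindow(List<Long> eventTimesInArrivalOrder) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long eventTime : eventTimesInArrivalOrder) {
            counts.merge(windowStart(eventTime), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // Arrival order: an event from 02:10, then two delayed events from
        // 00:15 and 00:50, then an event from 02:20.
        List<Long> arrivals = List.of(
            2 * HOUR_MILLIS + 10 * minute,
            15 * minute,
            50 * minute,
            2 * HOUR_MILLIS + 20 * minute);
        // Both delayed events are counted in the 00:00 window, not the 02:00 one.
        System.out.println(countPerWindow(arrivals));
    }
}
```

Grouping by processing time would instead have put all four events into whichever hour they happened to arrive in.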
Features for working with Event Time
Windowing: hourly, session based
Watermarks: "All the data is in."
    Fixed: end of match, end of file
    Heuristic: the data is probably in
    Percentile: 90% of the data is in
Triggers: emitting partial results
Accumulations: ways of dealing with late data
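As a rough illustration of what a heuristic watermark could look like, the plain-Java sketch below guesses that "the data is probably in" up to the newest event time seen so far minus a fixed slack, and flags anything older than that as late. This is an assumed toy heuristic for illustration, not how Dataflow computes its watermark; the class name and the slack parameter are hypothetical.

```java
class HeuristicWatermarkSketch {
    private final long slackMillis;
    private long maxEventTime = Long.MIN_VALUE; // nothing observed yet

    HeuristicWatermarkSketch(long slackMillis) {
        this.slackMillis = slackMillis;
    }

    /** Current guess: all data with event time before this value is probably in. */
    long watermark() {
        return maxEventTime == Long.MIN_VALUE ? Long.MIN_VALUE : maxEventTime - slackMillis;
    }

    /** Records one event and reports whether it arrived behind the watermark (late). */
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < watermark();
        maxEventTime = Math.max(maxEventTime, eventTimeMillis);
        return late;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        HeuristicWatermarkSketch wm = new HeuristicWatermarkSketch(5 * minute);
        System.out.println(wm.observe(60 * minute)); // first event, never late
        System.out.println(wm.observe(58 * minute)); // within the 5 minute slack
        System.out.println(wm.observe(50 * minute)); // behind the watermark, so late
    }
}
```

A larger slack marks fewer events as late but delays the moment a window can be declared complete, which is the trade-off triggers and allowed lateness are meant to manage.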
Streaming Event Time Example
Window: Hourly - events per hour
    How many errors occurred between 5-6pm.
Trigger: Each minute
    As we process data, update windows every minute.
Allowed Lateness: 12 hours
    After 12 hours, discard any late data that arrives.
Accumulation: Discarding
    Once a window's result is emitted, throw out what was counted so far; the next update holds only the data that arrived since.
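The interplay of trigger firings, allowed lateness and accumulation mode can be sketched for a single window in plain Java. This is a simplified model under stated assumptions, not the Dataflow implementation: the class WindowPaneSketch is hypothetical, time is measured in whole hours, and the caller supplies the watermark and decides when the trigger fires.

```java
class WindowPaneSketch {
    private final long windowEnd;        // end of the window, in event time
    private final long allowedLateness;  // how long after windowEnd data is still accepted
    private final boolean discarding;    // true: each firing reports only new elements
    private int paneCount = 0;
    private int dropped = 0;

    WindowPaneSketch(long windowEnd, long allowedLateness, boolean discarding) {
        this.windowEnd = windowEnd;
        this.allowedLateness = allowedLateness;
        this.discarding = discarding;
    }

    /** Adds one element of this window; it is dropped once the watermark is past end + lateness. */
    boolean add(long watermark) {
        if (watermark > windowEnd + allowedLateness) {
            dropped++;
            return false;
        }
        paneCount++;
        return true;
    }

    /** Trigger firing: emits the current count, then resets it in discarding mode. */
    int fire() {
        int emitted = paneCount;
        if (discarding) {
            paneCount = 0;
        }
        return emitted;
    }

    int droppedCount() {
        return dropped;
    }

    public static void main(String[] args) {
        // Errors between 5pm and 6pm: window ends at hour 18, 12 hours of allowed lateness.
        WindowPaneSketch window = new WindowPaneSketch(18, 12, true);
        window.add(17); window.add(17); window.add(17);
        System.out.println(window.fire());  // 3 errors in the first pane
        window.add(19); window.add(20);     // late, but within 12 hours
        System.out.println(window.fire());  // 2: only the new elements (discarding)
        System.out.println(window.add(31)); // false: watermark is past 18 + 12
    }
}
```

With discarding set to false (accumulating mode) the second firing would report 5, the running total, so a downstream consumer would replace its previous value instead of adding to it.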
Code - streaming event time
.apply(Window.into(FixedWindows.of(ONE_HOUR)) // ONE_HOUR = Duration.standardHours(1)
    .triggering(AfterWatermark.pastEndOfWindow() // fire when the watermark passes the end of the window
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(ONE_MINUTE))
        .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(TEN_MINUTES)))
    .withAllowedLateness(TWELVE_HOURS)
    .discardingFiredPanes())
What is Apache Beam?
A standard for running pipelines on different engines
Direct PipelineRunner (i.e. local)
Dataflow PipelineRunner
Flink PipelineRunner
Spark PipelineRunner (new)
What to remember
Process lots of data
Out of order & Late data
On cluster of your choice
Locally testable
Answers
Image credits:
http://fav.me/d80wco9 Game mashup
http://mrg.bz/UwguyD Red Beams
http://mrg.bz/ccBto0 Blue Beams
http://mrg.bz/QfHhyS Steel beam frame
http://mrg.bz/Dtcc1B Clock