DataFlow & Beam


Date posted: 13-Jan-2017

DataFlow & Beam
Gabe Hamilton

So you've built your perfect video game.

People all over the world are playing it.

Now for Billing, High Scores, etc.
  • People are playing your game on servers all over the world.
  • It's time to start crunching all your data for billing, high scores, error reports, etc.

The time that events happened is important.
  • You charge per minute played, with surge pricing!
Data often arrives late.
  • Network delays; servers go down and send their data hours later.

Google DataFlow?

Apache Beam?

Yes!

What we're going to cover
  • What is Dataflow?
  • Start a demo!
  • DataFlow Code
  • Batches & Streaming
  • Event Time

What is Google DataFlow?

  • Distributed Streaming (and Batch) data processing engine
  • Pulls in data from Sources
  • Writes data to Sinks
  • Spins up data processing nodes, pushes your code out to them
  • Like Hadoop, but handles unbounded data? Yep, similar idea

Up and running in 10 mins
https://cloud.google.com/dataflow/getting-started
  • create a project
  • add dataflow API
  • create a google storage bucket
  • gcloud auth login

    git clone git@github.com:gabehamilton/DataflowGroovySDK-examples.git

    gradlew run -Pargs="project=PROJECT_NAME stagingLocation=BUCKET_NAME"

(requires a JDK)

Let's see some code - config

    // Groovy: a bare class name stands in for Java's DataflowPipelineOptions.class
    DataflowPipelineOptions options =
        PipelineOptionsFactory.create().as(DataflowPipelineOptions)

    options.setProject("myproject")
    options.setStagingLocation("gs://aStagingBucket")
    options.setNumWorkers(1000)   // !!! default is 3
    options.setStreaming(true)
    Pipeline pipeline = Pipeline.create(options)

Let's see some code - pipeline

    pipeline
        // Extract and sum username/score pairs from the event data.
        .apply(TextIO.Read.from(options.getInput()))   // Read events from a text file
        .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
        .apply("SumByUser", new ExtractAndSumScore("user"))
        .apply("WriteUserScoreSums", new WriteToBigQuery(options.getTableName()));

Components
  • PCollection - Parallel Collection
      Standard interface to a set of data. Can be a streaming data set.
  • PTransform - Parallel Transform
      Takes Input, produces Output
  • ParDo - Parallel Do
      Your custom Transform Function
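
For readers without the SDK at hand, the parse-then-sum shape of that pipeline can be imitated with plain Java collections. This is only an analogy, not the Dataflow API: the class name UserScoreSketch, the GameEvent type and the "user,score" line format are all assumptions made for illustration. The parse step plays the role of a ParDo (one element in, one element out) and sumByUser plays the role of the "SumByUser" transform.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy of the parse -> sum-by-user steps. NOT the Dataflow API.
class UserScoreSketch {

    // Hypothetical event shape, assumed for illustration only.
    static final class GameEvent {
        final String user;
        final int score;

        GameEvent(String user, int score) {
            this.user = user;
            this.score = score;
        }
    }

    // Plays the role of a ParDo: one input line in, one parsed event out.
    // Assumes each line looks like "user,score".
    static GameEvent parse(String line) {
        String[] parts = line.split(",");
        return new GameEvent(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Plays the role of the "SumByUser" step: group events by user, add up scores.
    static Map<String, Integer> sumByUser(List<String> lines) {
        return lines.stream()
                .map(UserScoreSketch::parse)
                .collect(Collectors.groupingBy(
                        e -> e.user,
                        TreeMap::new,
                        Collectors.summingInt(e -> e.score)));
    }

    public static void main(String[] args) {
        System.out.println(sumByUser(Arrays.asList("ann,10", "bob,5", "ann,7")));
    }
}
```

The real engine differs in the part that matters: the collection may be unbounded and is spread across many workers, which is why the SDK has you describe the steps rather than loop over the data yourself.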

Demo

Running Dataflow - staging files

Our code

Output

Dependencies

Code is staged in the staging bucket before getting pushed to Workers

Staged files - detail

We don't need no stinking Batches
  • Handles Batches!

Streaming!
Not real time streaming, unbounded data set streaming.
Continuously processing your
  • User Scores, Billing, Analytics
  • Risk, Spam, and other deviations from mean

Event Time
  • Dataflow lets you work in event time: when the event says it happened
  • rather than processing time: when the event was received

Allows Out of Order processing
  • A plane full of mobile users just landed, turned their phones back on and started delivering the past 2 hours of data.
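
A small plain-Java sketch may help show why event time makes arrival order irrelevant. It is not Dataflow code: the class EventTimeWindows and its methods are hypothetical names, and the timestamps are made up. Each event is placed in the hourly window that contains its own timestamp, so a reading delivered two hours late still lands in the hour it belongs to.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of event-time windowing with the standard library only. NOT the Dataflow API.
class EventTimeWindows {
    static final Duration ONE_HOUR = Duration.ofHours(1);

    // Start of the fixed window (aligned to the epoch) that contains eventTime.
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long ms = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(ms - Math.floorMod(ms, sizeMs));
    }

    // Count events per hourly window, keyed by event time.
    // The list is in arrival (processing) order, which does not affect the result.
    static Map<Instant, Integer> countPerHour(List<Instant> eventTimesInArrivalOrder) {
        Map<Instant, Integer> counts = new TreeMap<>();
        for (Instant t : eventTimesInArrivalOrder) {
            counts.merge(windowStart(t, ONE_HOUR), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Instant> arrivals = Arrays.asList(
                Instant.parse("2017-01-13T19:05:00Z"),
                Instant.parse("2017-01-13T17:40:00Z"), // phone was offline, delivered late
                Instant.parse("2017-01-13T17:10:00Z"), // same flight, also late
                Instant.parse("2017-01-13T19:30:00Z"));
        System.out.println(countPerHour(arrivals));
    }
}
```

Grouping by processing time instead would have put all four events in whichever hour they happened to arrive.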

Features for working with Event Time
  • Windowing
      hourly, session based
  • Watermarks
      All the data is in.
      Fixed: end of match, end of file
      Heuristic: the data is probably in
      Percentile: 90% of the data is in
  • Triggers
      emitting partial results
  • Accumulations
      ways of dealing with late data

Streaming Event Time Example
  • Window: Hourly (events per hour)
      How many errors occurred between 5-6pm.
  • Trigger: Each minute
      As we process data, update windows every minute.
  • Allowed Lateness: 12 hours
      After 12 hours, discard any late data that arrives.
  • Accumulation: Discarding
      Once a window fires, throw out what it has already emitted; the next firing reports only the data that arrived since.
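
The lateness and accumulation settings in this example can be sketched in plain Java. This is an illustration under stated assumptions, not the SDK's implementation: LateDataSketch, accepted and DiscardingCounter are hypothetical names, and the exact boundary behaviour (what happens when the watermark equals window end plus lateness) is simplified. The first method checks whether late data for a window is still accepted; the counter shows discarding accumulation, where each firing reports only what arrived since the previous firing.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of allowed lateness and discarding accumulation. NOT the Dataflow API.
class LateDataSketch {

    // True while data for the window ending at windowEnd is still accepted.
    // The watermark is the pipeline's estimate of how far event time has progressed.
    static boolean accepted(Instant windowEnd, Instant watermark, Duration allowedLateness) {
        return !watermark.isAfter(windowEnd.plus(allowedLateness));
    }

    // Discarding accumulation for a single window:
    // every firing emits the count since the last firing, then resets.
    static final class DiscardingCounter {
        private int sinceLastFiring = 0;

        void add() {
            sinceLastFiring++;
        }

        int fire() {
            int pane = sinceLastFiring;
            sinceLastFiring = 0;
            return pane;
        }
    }

    public static void main(String[] args) {
        Instant sixPm = Instant.parse("2017-01-13T18:00:00Z"); // end of the 5-6pm window
        Duration twelveHours = Duration.ofHours(12);

        // Two hours after the window closed: still inside the allowed lateness.
        System.out.println(accepted(sixPm, Instant.parse("2017-01-13T20:00:00Z"), twelveHours));
        // Thirteen hours after: the late data is dropped.
        System.out.println(accepted(sixPm, Instant.parse("2017-01-14T07:00:00Z"), twelveHours));

        DiscardingCounter errors = new DiscardingCounter();
        errors.add();
        errors.add();
        System.out.println(errors.fire()); // first firing
        errors.add();
        System.out.println(errors.fire()); // later firing: only the new element
    }
}
```

With discarding panes, whatever consumes the output has to add the firings together to get the window total; an accumulating mode would instead re-emit the running total on each firing.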

Code - streaming event time

    .apply(Window.into(FixedWindows.of(ONE_HOUR))   // ONE_HOUR = Duration.standardHours(1)
        .triggering(
            AfterWatermark.pastEndOfWindow()
                // partial results while the window is still open
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(ONE_MINUTE))
                // updates for data that shows up after the watermark
                .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(TEN_MINUTES)))
        .withAllowedLateness(TWELVE_HOURS)
        .discardingFiredPanes())

Demo 2
StreamingFraudDetection

What is Apache Beam?
A standard for running pipelines on different engines
  • Direct PipelineRunner (i.e. local)
  • Dataflow PipelineRunner
  • Flink PipelineRunner
  • Spark PipelineRunner (new)


What to remember
  • Process lots of data
  • Out of order & Late data
  • On cluster of your choice
  • Locally testable

Questions?

Answers

Thanks!

Image credits:
  • http://fav.me/d80wco9 Game mashup
  • http://mrg.bz/UwguyD Red Beams
  • http://mrg.bz/ccBto0 Blue Beams
  • http://mrg.bz/QfHhyS Steel beam frame
  • http://mrg.bz/Dtcc1B Clock