Dataflow & Beam
Gabe Hamilton
So you’ve built your perfect video game.
People all over the world are playing it.
Now for Billing, High Scores, etc.
People are playing your game on servers all over the world.
It’s time to start crunching all your data for billing, high scores, error reports, etc.
The time that events happened is important.
You charge per minute played, with surge pricing!
Data often arrives late.
Networks have delays. Servers go down and send their data hours later.
Google Dataflow?
Apache Beam?
Yes!
What we’re going to cover
What is Dataflow?
Start a demo!
Dataflow Code
Batches & Streaming
Event Time
What is Google Dataflow?
Distributed Streaming (and Batch) Data processing engine
Pulls in data from Sources
Writes data to Sinks
Spins up data processing nodes, pushes your code out to them
Like Hadoop, but handles unbounded data? Yep, similar idea.
Up and running in 10 mins
1. https://cloud.google.com/dataflow/getting-started
a. create a project
b. add dataflow API
c. create a google storage bucket
d. gcloud auth login
2. git clone [email protected]:gabehamilton/DataflowGroovySDK-examples.git
3. gradlew run -Pargs="project=PROJECT_NAME stagingLocation=BUCKET_NAME" (requires a JDK)
Let's see some code - config
DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions)
options.setProject('myproject')
options.setStagingLocation("gs://aStagingBucket")
options.setNumWorkers(1000) // ← !!! default is 3
options.setStreaming(true);
Pipeline pipeline = Pipeline.create(options);
Let's see some code - pipeline
pipeline // Extract and sum username/score pairs from the event data.
    .apply(TextIO.Read.from(options.getInput())) // Read events from a text file
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    .apply("SumByUser", new ExtractAndSumScore("user"))
    .apply("WriteUserScoreSums", new WriteToBigQuery(options.getTableName()));
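To make the parse-then-sum idea concrete without a cluster, here is a minimal plain-Java sketch of what the ParseGameEvent and SumByUser steps compute. It does not use the Dataflow SDK. The class name `UserScoreSketch` and the `user,score` line format are assumptions made for illustration; the real event format used by `ParseEventFn` is not shown on the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the ParseGameEvent -> SumByUser steps.
// Assumption: each input line looks like "user,score".
class UserScoreSketch {

    // Parse each line and add its score to that user's running total.
    static Map<String, Integer> sumByUser(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length != 2) {
                continue; // skip malformed events instead of failing the whole run
            }
            try {
                int score = Integer.parseInt(parts[1].trim());
                totals.merge(parts[0].trim(), score, Integer::sum);
            } catch (NumberFormatException e) {
                // skip events whose score is not a number
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("alice,10", "bob,5", "alice,7", "garbage");
        System.out.println(sumByUser(events)); // prints {alice=17, bob=5}
    }
}
```

In the real pipeline each step runs in parallel across workers; the sketch only shows the per-user result the steps should agree on.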
Components
PCollection - Parallel Collection
Standard interface to a set of data.
Can be a streaming data set.
PTransform - Parallel Transform
takes Input, produces Output
ParDo - Parallel Do
Your custom Transform Function
Demo
Running Dataflow - staging files
Our code
Output
Dependencies
Code is staged in the staging bucket before getting pushed to Workers
Staged files - detail
We don’t need no stinking Batches
Handles Batches!
Streaming!
Not real time streaming, unbounded data set streaming.
Continuously processing your
User Scores, Billing, Analytics
Risk, Spam, and other deviations from mean
Event Time
Dataflow lets you work in event time
when the event says it happened
rather than processing time
when the event was received
Allows Out of Order processing
a plane full of mobile users just landed, turned their phones back on, and started delivering the past 2 hours of data
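The difference between event time and processing time can be sketched in plain Java, with no Dataflow SDK involved. The class `EventTimeSketch`, its method names, and the timestamps are all made up for illustration. Each event is counted in the hourly window its own timestamp falls into, so the order of arrival does not change the result.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EventTimeSketch {

    // Start of the fixed-size window that contains the given event time.
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long ms = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(ms - Math.floorMod(ms, sizeMs));
    }

    // Count events per hourly window using each event's own timestamp.
    // The list is in arrival order, which is deliberately ignored.
    static Map<Instant, Integer> countPerHour(List<Instant> eventTimesInArrivalOrder) {
        Map<Instant, Integer> counts = new TreeMap<>();
        for (Instant t : eventTimesInArrivalOrder) {
            counts.merge(windowStart(t, Duration.ofHours(1)), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Instant> arrivals = Arrays.asList(
            Instant.parse("2016-01-01T19:05:00Z"),  // delivered right away
            Instant.parse("2016-01-01T17:30:00Z"),  // phone was off, delivered 2 hours late
            Instant.parse("2016-01-01T17:45:00Z"),  // same flight, also late
            Instant.parse("2016-01-01T19:10:00Z"));
        System.out.println(countPerHour(arrivals));
        // prints {2016-01-01T17:00:00Z=2, 2016-01-01T19:00:00Z=2}
    }
}
```

Grouping by processing time instead would have put all four events in the 19:00 hour, which is wrong for per-minute billing.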
Features for working with Event Time
Windowing: hourly, session based
Watermarks: all the data is in.
Fixed: end of match, end of file
Heuristic: the data is probably in
Percentile: 90% of the data is in
Triggers: emitting partial results.
Accumulations: ways of dealing with late data
Streaming Event Time Example
Window: Hourly - events per hour
Trigger: Each minute
Allowed Lateness: 12 hours
Accumulation: Discarding
How many errors occurred between 5-6pm?
As we process data, update windows every minute.
After 12 hours, discard any late data that arrives.
When a window fires, throw out what it already emitted; the next update carries only the data that arrived since the previous firing.
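Here is a rough plain-Java model of a single window with allowed lateness and discarding accumulation. It is a sketch under stated assumptions, not how the Dataflow service is implemented: the class `LateDataSketch` is hypothetical, the watermark is passed in by the caller, and "discarding" is modeled as each firing emitting only the events added since the previous firing.

```java
import java.time.Duration;
import java.time.Instant;

class LateDataSketch {
    private final Instant start;
    private final Instant end;
    private final Duration allowedLateness;
    private int pending = 0; // events added since the last firing

    LateDataSketch(Instant start, Duration size, Duration allowedLateness) {
        this.start = start;
        this.end = start.plus(size);
        this.allowedLateness = allowedLateness;
    }

    // Accept an event unless it falls outside the window, or the watermark
    // has already passed the end of the window plus the allowed lateness.
    boolean add(Instant eventTime, Instant watermark) {
        boolean inWindow = !eventTime.isBefore(start) && eventTime.isBefore(end);
        boolean expired = watermark.isAfter(end.plus(allowedLateness));
        if (!inWindow || expired) {
            return false; // dropped
        }
        pending++;
        return true;
    }

    // Discarding mode: emit only what arrived since the last firing, then forget it.
    int fire() {
        int pane = pending;
        pending = 0;
        return pane;
    }

    public static void main(String[] args) {
        Instant five = Instant.parse("2016-01-01T17:00:00Z"); // the 5-6pm window
        LateDataSketch w = new LateDataSketch(five, Duration.ofHours(1), Duration.ofHours(12));

        w.add(five.plusSeconds(60), five.plusSeconds(90));
        w.add(five.plusSeconds(120), five.plusSeconds(150));
        System.out.println(w.fire()); // prints 2

        // Late, but the watermark is only 3 hours in: still accepted.
        w.add(five.plusSeconds(300), five.plus(Duration.ofHours(3)));
        System.out.println(w.fire()); // prints 1 (only the new event)

        // Watermark is past 6pm + 12 hours: the event is dropped.
        System.out.println(w.add(five.plusSeconds(600), five.plus(Duration.ofHours(14)))); // prints false
    }
}
```

With discarding panes, whatever reads the output has to add the panes together (2 + 1 = 3 errors for the hour); an accumulating mode would instead re-emit the running total on every firing.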
Code - streaming event time
.apply(Window.into(FixedWindows.of(ONE_HOUR)) // Duration.standardHours(1)
    .triggering(AfterWatermark.pastEndOfWindow() // fire on time when the watermark passes the window end
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(ONE_MINUTE))
        .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(TEN_MINUTES)))
    .withAllowedLateness(TWELVE_HOURS)
    .discardingFiredPanes())
Demo 2
Streaming
FraudDetection
What is Apache Beam?
A standard for running pipelines on different engines
Direct PipelineRunner (i.e. local)
Dataflow PipelineRunner
Flink PipelineRunner
Spark PipelineRunner (new)
Apache Beam
What to remember
Process lots of data
Out of order & Late data
On the cluster of your choice
Locally testable
Questions?
Answers Answers Answers Answers Answers Answers
Answers Answers Answers Answers Answers Answers
Answers Answers Answers Answers Answers Answers
Answers Answers Answers Answers Answers Answers
Answers Answers Answers Answers Answers Answers
Thanks!
Image credits:
http://fav.me/d80wco9 Game mashup
http://mrg.bz/UwguyD Red Beams
http://mrg.bz/ccBto0 Blue Beams
http://mrg.bz/QfHhyS Steel beam frame
http://mrg.bz/Dtcc1B Clock