Apache Beam and Google Cloud Dataflow - IDG - final
Google Cloud Dataflow
The next generation of managed big data service, based on the Apache Beam programming model
Szabolcs Feczak, Cloud Solutions Engineer, Google
9th Cloud & Data Center World 2016 - IDG
-
Goals
1. You leave here understanding the fundamentals of the Apache Beam model and the Google Cloud Dataflow managed service.
2. We have some fun.
-
Background and Historical overview
-
The trade-off quadrant of Big Data
[Diagram: Completeness, Speed, Cost Optimization, and Complexity traded off against Time to Answer]
-
MapReduce
Hadoop
Flume
Storm
Spark
MillWheel
Flink
Apache Beam
[Timeline annotations: Batch, Streaming, Pipelines, Unified API, No Lambda, Iterative, Interactive, Exactly Once, State, Timers, Auto-Awesome, Watermarks, Windowing, High-level API, Managed Service, Triggers, Open Source, Unified Engine; * = optimizer]
-
Deep dive, probing familiarity with the subject
1M Devices
16.6K Events/sec
43B Events/month
518B Events/year
-
Before Apache Beam
Batch OR Stream
Accuracy OR Speed
Simplicity OR Sophistication
Savings OR Scalability
-
After Apache Beam
Batch AND Stream
Accuracy AND Speed
Simplicity AND Sophistication
Savings AND Scalability
Balancing correctness, latency and cost with a unified batch and streaming model
-
http://research.google.com/search.html?q=dataflow
-
Apache Beam (incubating)
The Dataflow submission to the Apache Incubator was accepted on February 1, 2016, and the resulting project is now called Apache Beam. http://incubator.apache.org/projects/beam.html
Software Development Kits:
Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
Python (ALPHA)
Scala: github.com/darkjh/scalaflow and github.com/jhlch/scala-dataflow-dsl
Runners:
Spark runner @ github.com/cloudera/spark-dataflow
Flink runner @ github.com/dataArtisans/flink-dataflow
-
Where might you use Apache Beam?
ETL: Movement, Filtering, Enrichment, Shaping
Analysis: Reduction, Batch computation, Continuous computation
Orchestration: Composition, External orchestration, Simulation
-
Why would you go with a managed service?
-
Cloud Dataflow Managed Service advantages (GA since August 2015)
[Diagram: user code & SDK is deployed & scheduled onto GCP; the managed service's Work Manager and Job Manager run the job and surface progress & logs in a Monitoring UI]
-
Worker Lifecycle Management (Cloud Dataflow Service)
Deploy, Schedule & Monitor, Tear Down
-
Challenge: cost optimization
Time & life never stop
Data rates & schema are not static
Scaling models are not static
Non-elastic compute is wasteful and can create lag
-
Auto-scaling (Cloud Dataflow Service)
[Diagram: the worker pool grows and shrinks with load: 800 QPS at 10:00, 1200 QPS at 11:00, 5000 QPS at 12:00, 50 QPS at 13:00]
-
Dynamic Work Rebalancing (Cloud Dataflow Service)
[Diagram: 100 mins. with static work assignment vs. 65 mins. with dynamic rebalancing]
-
Graph Optimization (Cloud Dataflow Service)
ParDo fusion: producer-consumer and sibling fusion, with intelligent fusion boundaries
Combiner lifting: e.g. partial aggregations before reduction
[Diagram: producer-consumer fusion merges ParallelDo steps C and D into C+D; sibling fusion merges C and D reading the same input; combiner lifting rewrites A → GBK → + → B into A+ → GBK → + → B; GBK = GroupByKey, + = CombineValues]
http://research.google.com/search.html?q=flume%20java
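This optimization is invisible in user code; as a hedged example, a per-key combine like the one below is eligible for lifting, so each worker emits partial sums before the GroupByKey shuffle and a final combine merges them afterwards:

// The service can lift this combiner: partial aggregation runs before the
// shuffle, shrinking the data that has to cross the network.
PCollection<KV<String, Integer>> totals =
    keyedValues.apply(Sum.integersPerKey());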
-
Deep dive into the programming model
-
The Apache Beam Logical Model
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
-
What are you computing?
A Pipeline represents a graph
Nodes are data processing transformations
Edges are data sets flowing through the pipeline
Optimized and executed as a unit for efficiency
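A minimal sketch of what building such a graph looks like with the Dataflow Java SDK 1.x of the time (the bucket paths are placeholders, not from the talk):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class MinimalPipeline {
  public static void main(String[] args) {
    // The Pipeline object holds the graph; each apply() adds a node.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://my-bucket/input-*.txt"))  // source node
     .apply(TextIO.Write.to("gs://my-bucket/output"));       // sink node

    p.run();  // the whole graph is optimized and executed as a unit
  }
}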
-
What are you computing?
A PCollection is a homogeneous collection: every element has the same type
May be bounded or unbounded in size
Each element has an implicit timestamp
Initially created from backing data stores
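One way to picture bounded vs. unbounded creation, continuing the pipeline p from the sketch above (the Pub/Sub topic name is hypothetical):

// Bounded: a batch source with a known, finite size.
PCollection<String> lines =
    p.apply(TextIO.Read.from("gs://my-bucket/logs-*.txt"));

// Unbounded: a streaming source that never finishes; each element carries
// an implicit timestamp (here, the time it was published to Pub/Sub).
PCollection<String> events =
    p.apply(PubsubIO.Read.topic("projects/my-project/topics/edits"));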
-
Challenge: completeness when processing continuous data
[Diagram: events with event time 8:00 arrive spread across processing time from 9:00 to 14:00]
-
What are you computing?
PTransforms transform PCollections into other PCollections:
Element-wise (Map + Reduce = ParDo)
Aggregating (Combine, Join, Group)
Composite
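As a sketch of the element-wise and aggregating cases under the 1.x-style DoFn API (the "user,points" input format is an assumption for illustration):

// Element-wise ParDo: parse each "user,points" line into a key-value pair.
PCollection<KV<String, Integer>> scores = lines.apply(
    ParDo.of(new DoFn<String, KV<String, Integer>>() {
      @Override
      public void processElement(ProcessContext c) {
        String[] parts = c.element().split(",");
        c.output(KV.of(parts[0], Integer.parseInt(parts[1])));
      }
    }));

// Aggregating transform: combine all values observed for each key.
PCollection<KV<String, Integer>> totals = scores.apply(Sum.integersPerKey());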
-
Composite PTransforms (Apache Beam SDK)
[Diagram: Count as a subgraph of Pair With Ones, GroupByKey, and Sum Values]
Define new PTransforms by building up subgraphs of existing transforms
Some utilities are included in the SDK: Count, RemoveDuplicates, Join, Min, Max, Sum, ...
You can define your own: DoSomething, DoSomethingElse, etc.
Why bother? Code reuse and a better monitoring experience
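A hedged sketch of rolling your own composite, mirroring how the SDK assembles Count from smaller pieces (the word-splitting regex is an illustrative assumption):

// A reusable composite transform: its body is a subgraph of existing
// transforms, and it appears as one named step in the monitoring UI.
public static class CountWords
    extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> apply(PCollection<String> lines) {
    return lines
        .apply(ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            for (String word : c.element().split("\\W+")) {
              if (!word.isEmpty()) { c.output(word); }
            }
          }
        }))
        .apply(Count.<String>perElement());  // itself a composite
  }
}

Callers then write a single named step: wordCounts = lines.apply(new CountWords());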
-
Example: Computing Integer Sums
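The transcript does not preserve the code from this slide; a plausible reconstruction of the classic batch answer to the What question, assuming an input of (key, value) pairs:

// What: per-key integer sums over the entire bounded input.
PCollection<KV<String, Integer>> totals =
    input.apply(Sum.integersPerKey());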
-
Where in Event Time?
Windowing divides data into event-time-based finite chunks.
Required when doing aggregations over unbounded data.
[Diagram: elements on three keys grouped into Fixed, Sliding, and Session windows]
-
Example: Fixed 2-minute Windows
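A sketch of the Where step, layering the slide's 2-minute fixed windows onto the integer-sum pipeline:

// Where: the same per-key sum, now computed per 2-minute event-time window.
PCollection<KV<String, Integer>> windowedTotals = input
    .apply(Window.<KV<String, Integer>>into(
        FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());

// The other WindowFns from the earlier slide drop in the same way, e.g.:
//   SlidingWindows.of(Duration.standardMinutes(2)).every(Duration.standardMinutes(1))
//   Sessions.withGapDuration(Duration.standardMinutes(1))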
-
When in Processing Time?
Triggers control when results are emitted.
Triggers are often relative to the watermark.
[Diagram: processing time vs. event time, with the watermark's skew between them]
-
Example: Triggering at the Watermark
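A sketch of the When step: a trigger that fires once when the watermark passes the end of each window (the explicit lateness and accumulation settings are shown for completeness):

// When: emit one result per window, once the watermark says it is complete.
PCollection<KV<String, Integer>> result = input
    .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply(Sum.integersPerKey());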
-
Example: Triggering for Speculative & Late Data
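For speculative and late results, the triggering clause in the previous sketch grows early and late firings; the one-minute delay and one-day lateness are illustrative choices, not values from the slides:

// When: speculative firings every minute until the watermark passes,
// then one firing per late element, for up to a day of lateness.
.triggering(AfterWatermark.pastEndOfWindow()
    .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(1)))
    .withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardDays(1))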
-
How do Refinements Relate?
How should multiple outputs per window accumulate? The appropriate choice depends on the consumer.

Firing          Elements   Discarding   Accumulating   Acc. & Retracting
Speculative     3          3            3              3
Watermark       5, 1       6            9              9, -3
Late            2          2            11             11, -9
Total Observed  11         11           23             11
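In code, the How column maps to the pane accumulation mode set on the window; a minimal sketch (retractions are part of the model but, to the best of my knowledge, were not yet exposed in this SDK):

// How: discardingFiredPanes() emits only new elements each firing
// (3, then 6, then 2); accumulatingFiredPanes() re-emits the running
// result (3, then 9, then 11).
.accumulatingFiredPanes()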
-
Example: Add Newest, Remove Previous
-
Customizing What, Where, When, How
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
-
The key takeaway
-
Optimizing Your Time To Answer
Typical data processing: Programming, plus Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling growing scale, and Utilization improvements
Data processing with Cloud Dataflow: Programming, and more time to dig into your data
-
How much more time?
You save not only on processing time; code complexity and size shrink as well!
Source: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
-
What do customers have to say about Google Cloud Dataflow
"We are utilizing Cloud Dataflow to overcome elasticity challenges with our current Hadoop cluster. Starting with some basic ETL workflow for BigQuery ingestion, we transitioned into full blown clickstream processing and analysis. This has helped us significantly improve performance of our overall system and reduce cost."
Sudhir Hasbe, Director of Software Engineering, Zulily.com
"The current iteration of Qubit's real-time data supply chain was heavily inspired by the ground-breaking stream processing concepts described in Google's MillWheel paper. Today we are happy to come full circle and build streaming pipelines on top of Cloud Dataflow, which has delivered on the promise of a highly-available and fault-tolerant data processing system with an incredibly powerful and expressive API."
Jibran Saithi, Lead Architect, Qubit
"We are very excited about the productivity benefits offered by Cloud Dataflow and Cloud Pub/Sub. It took half a day to rewrite something that had previously taken over six months to build using Spark."
Paul Clarke, Director of Technology, Ocado
"Boosting performance isn't the only thing we want to get from the new system. Our bet is that by using cloud-managed products we will have a much lower operational overhead. That in turn means we will have much more time to make Spotify's products better."
Igor Maravić, Software Engineer working at Spotify
http://engineering.zulily.com/
-
Demo Time!
-
Let's build something - Demo!
1. Ingest a stream of Wikipedia edits (https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org), create a pipeline and run a Dataflow job to extract the top 10 active editors and the top 10 pages edited, then inspect the result set in our data warehouse (BigQuery)
2. Extract words from a Shakespeare corpus, count the occurrences of each word, and write sharded results as blobs into a key-value store (Cloud Storage); a sketch follows
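A minimal sketch of demo 2 under the same SDK; the output path is a placeholder, while gs://dataflow-samples/shakespeare/* is the public sample corpus:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class ShakespeareWordCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
     // Split each line into words.
     .apply(ParDo.of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("[^a-zA-Z']+")) {
           if (!word.isEmpty()) { c.output(word); }
         }
       }
     }))
     // Count occurrences of each word.
     .apply(Count.<String>perElement())
     // Format and write sharded text blobs to Cloud Storage.
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/wordcount/output"));
    p.run();
  }
}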
-
Thank You!
cloud.google.com/dataflow
cloud.google.com/blog/big-data/
cloud.google.com/solutions/articles#bigdata
cloud.google.com/newsletter
research.google.com