Transcript of "Processing Data Like Google: Using the Dataflow/Beam Model"
Processing Data Like Google: Using the Dataflow/Beam Model
Todd Reedy, Google for Work Sales Engineer, Google
Goals:
● Write interesting computations
● Run in both batch & streaming
● Use custom timestamps
● Handle late data
https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg
Weave provides the fastest way to connect smart devices to the network, the cloud, and an ecosystem of (business) apps & smart devices.
Weave setup works on Android, iOS, and desktop.
Want more details? Visit: https://developers.google.com/weave/
[Diagram labels: CRM, Telemetry]
Brillo is a free OS, tightly integrated with Android, tailored for IoT.
Brillo makes it cost-effective to build a secure connected device, with verified boot to ensure it runs only trusted code.
Brillo comes with services to help you understand how customers interact with it, and to keep it updated over time.
Run Brillo on the device itself, or use it as a hub to aggregate data from nearby (micro)processors.
[Brillo stack diagram: OTA Updates / Weave / Metrics, on top of Android HAL, Kernel, Hardware]
Agenda
1. Data Shapes
2. Google’s Data Processing Story
3. The Beam Model
4. Demo: Google Cloud Dataflow

1. Data Shapes
Data...
...can be big...
...really, really big (Tuesday, Wednesday, Thursday)...
...maybe even infinitely big (a stream running from 1:00 through 14:00 and on)...
...with unknown delays (elements with event time 8:00 may arrive at 9:00, 10:00, or later).
Data Processing Tradeoffs
● Completeness (1 + 1 = 2)
● Latency
● Cost ($$$)
Requirements differ by pipeline; each example slide rated Completeness, Latency, and Cost as Important or Not Important:
● Requirements: Billing Pipeline
● Requirements: Live Cost Estimate Pipeline
● Requirements: Abuse Detection Pipeline
● Requirements: Abuse Detection Backfill Pipeline
2. Google’s Data Processing Story
Google’s Data-Related Papers
[timeline, 2002-2016: GFS (2003), MapReduce (2004), Big Table (2006), Dremel (2010), Pregel (2010), FlumeJava (2010), Colossus (~2010), Spanner (2012), MillWheel (2013), Dataflow (2015)]
MapReduce: Batch Processing
[diagram: M (map) workers feeding R (reduce) workers]
FlumeJava: Easy and Efficient MapReduce Pipelines
● Higher-level API with simple data processing abstractions.
  ○ Focus on what you want to do to your data, not what the underlying system supports.
● A graph of transformations is automatically transformed into an optimized series of MapReduces.
Batch Patterns: Creating Structured Data
[MapReduce diagram]

Batch Patterns: Repetitive Runs
[one MapReduce run per day: Tuesday, Wednesday, Thursday]

Batch Patterns: Time Based Windows
[MapReduce over hourly windows: Tuesday [11:00 - 12:00), [12:00 - 13:00), ..., [23:00 - 0:00)]

Batch Patterns: Sessions
[per-user sessions (Jose, Lisa, Ingo, Asha, Cheryl, Ari) spanning Tuesday and Wednesday]
MillWheel: Streaming Computations
● Framework for building low-latency data-processing applications
● User provides a DAG of computations to be performed
● System manages state and persistent flow of elements
Streaming Patterns: Element-wise transformations
[processing-time timeline, 8:00 through 14:00]

Streaming Patterns: Aggregating Time Based Windows
[processing-time timeline, 8:00 through 14:00]

Streaming Patterns: Event Time Based Windows
[input in processing time vs. output in event time, 10:00 through 15:00]

Streaming Patterns: Session Windows
[input in processing time vs. output in event time, 10:00 through 15:00]
Event-Time Skew
[plot: processing time vs. event time; the lag between ideal and actual progress is the skew]

Watermark
Watermarks describe event time progress.
"No timestamp earlier than the watermark will be seen."
Often heuristic-based.
Too Slow? Results are delayed.
Too Fast? Some data is late.
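The slide's rule of thumb can be made concrete. Below is a minimal plain-Java sketch of a heuristic watermark, not the Dataflow SDK; the class name and the fixed max-delay assumption are illustrative. It tracks the maximum event timestamp seen and subtracts an assumed maximum out-of-order delay:

```java
// Hypothetical heuristic watermark: max event time seen, minus an
// assumed maximum out-of-order delay. Illustrative only, not the SDK.
public class HeuristicWatermark {
    private final long maxDelayMillis; // assumption about how late data can be
    private long maxEventTime = Long.MIN_VALUE;

    public HeuristicWatermark(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Observe an element's event timestamp. */
    public void observe(long eventTimeMillis) {
        maxEventTime = Math.max(maxEventTime, eventTimeMillis);
    }

    /** "No timestamp earlier than the watermark will be seen" -- heuristically. */
    public long current() {
        return maxEventTime == Long.MIN_VALUE ? Long.MIN_VALUE
                                              : maxEventTime - maxDelayMillis;
    }

    /** An element is late if it arrives behind the watermark. */
    public boolean isLate(long eventTimeMillis) {
        return eventTimeMillis < current();
    }

    public static void main(String[] args) {
        HeuristicWatermark wm = new HeuristicWatermark(60_000); // allow 1 minute of skew
        wm.observe(10 * 60_000);                  // element stamped 10 minutes in
        System.out.println(wm.current());         // 540000: 1 minute behind max seen
        System.out.println(wm.isLate(8 * 60_000)); // true: behind the watermark
    }
}
```

A delay estimate that is too generous makes the watermark lag (delayed results); one that is too tight lets genuinely delayed records arrive behind it (late data), exactly the tradeoff the slide names.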
Streaming or Batch?
Completeness (1 + 1 = 2), Latency, Cost ($$$)
Why not both?
3. The Apache Beam Model (formerly the Dataflow Model)
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
What are you computing?
● A Pipeline represents a graph of data processing transformations
● PCollections flow through the pipeline
● Optimized and executed as a unit for efficiency
What are you computing?
● A PCollection<T> is a collection of data of type T
● May be bounded or unbounded in size
● Each element has an implicit timestamp
● Initially created from backing data stores
What are you computing?
PTransforms transform PCollections into other PCollections.
What Where When How
Transform types: Element-Wise, Aggregating, Composite
Example: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = ...;

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> output =
    input.apply(Sum.integersPerKey());
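The snippet above elides its input and the body of ParseFn. As a plain-Java sketch of the same element-wise-then-aggregating shape (no Dataflow dependency; the "team,score" log format is an assumption, since the slide does not show it):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the pipeline above: parse "team,score" log lines
// (an assumed format), then sum the scores per key. Not the Dataflow SDK.
public class IntegerSums {
    /** Element-wise step, standing in for ParDo.of(new ParseFn()). */
    static Map.Entry<String, Integer> parse(String rawLine) {
        String[] parts = rawLine.split(",");
        return Map.entry(parts[0], Integer.parseInt(parts[1]));
    }

    /** Aggregating step, standing in for Sum.integersPerKey(). */
    static Map<String, Integer> sumPerKey(List<String> raw) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (String line : raw) {
            Map.Entry<String, Integer> kv = parse(line);
            sums.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("red,3", "blue,5", "red,4");
        System.out.println(sumPerKey(raw)); // {red=7, blue=5}
    }
}
```

In the real SDK the two steps stay separate PTransforms so the runner can optimize and distribute them; here they are fused only for readability.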
Where in Event Time?
Windowing divides data into event-time-based finite chunks.
Required when doing aggregations over unbounded data.
[diagrams: Fixed windows, Sliding windows, and Session windows, each shown per key (Key 1, Key 2, Key 3)]
Example: Fixed 2-minute Windows

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());
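FixedWindows.of(Minutes(2)) can be understood as bucketing each element by its event timestamp. A minimal sketch of that assignment, in plain Java rather than the SDK:

```java
// Sketch of fixed-window assignment: every element lands in the window
// [start, start + size) containing its event timestamp. Illustrative only.
public class FixedWindowsSketch {
    /** Start of the fixed window containing eventTime, for a given window size. */
    static long windowStart(long eventTimeMillis, long sizeMillis) {
        return eventTimeMillis - (eventTimeMillis % sizeMillis);
    }

    public static void main(String[] args) {
        long twoMinutes = 2 * 60_000;
        long t = 3 * 60_000 + 30_000; // 3m30s past the epoch, in millis
        // 3m30s falls in the [2min, 4min) window.
        System.out.println(windowStart(t, twoMinutes)); // 120000
    }
}
```

Because assignment depends only on the element's timestamp, the same element goes to the same window whether the pipeline runs in batch or streaming mode.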
When in Processing Time?
Triggers control when results are emitted.
Triggers are often relative to the watermark.
[plot: processing time vs. event time, with the watermark advancing]
Example: Triggering at the Watermark

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .trigger(AtWatermark()))
    .apply(Sum.integersPerKey());
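Operationally, triggering at the watermark means holding each window's running result until the watermark passes the window's end. A hedged plain-Java sketch of that behavior (class and method names are illustrative, not the SDK's):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "trigger at the watermark": buffer per-window sums and emit a
// window once the watermark passes its end. Illustrative only, not the SDK.
public class WatermarkTriggerSketch {
    final long windowSizeMillis;
    final Map<Long, Integer> windowSums = new LinkedHashMap<>(); // window start -> sum

    WatermarkTriggerSketch(long windowSizeMillis) {
        this.windowSizeMillis = windowSizeMillis;
    }

    /** Add a value into the fixed window containing its event timestamp. */
    void add(long eventTimeMillis, int value) {
        long start = eventTimeMillis - (eventTimeMillis % windowSizeMillis);
        windowSums.merge(start, value, Integer::sum);
    }

    /** Emit (and clear) every window whose end is at or before the watermark. */
    Map<Long, Integer> onWatermark(long watermarkMillis) {
        Map<Long, Integer> fired = new LinkedHashMap<>();
        windowSums.entrySet().removeIf(e -> {
            boolean closed = e.getKey() + windowSizeMillis <= watermarkMillis;
            if (closed) fired.put(e.getKey(), e.getValue());
            return closed;
        });
        return fired;
    }

    public static void main(String[] args) {
        WatermarkTriggerSketch t = new WatermarkTriggerSketch(120_000);
        t.add(30_000, 3);   // [0, 2min) window
        t.add(130_000, 4);  // [2min, 4min) window
        System.out.println(t.onWatermark(120_000)); // only the first window fires: {0=3}
    }
}
```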
Example: Triggering for Speculative & Late Data
PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .trigger(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
How do Refinements Relate?
How should multiple outputs per window accumulate? The appropriate choice depends on the consumer.

Firing          Elements   Discarding   Accumulating   Acc. & Retracting
Speculative     3          3            3              3
Watermark       5, 1       6            9              9, -3
Late            2          2            11             11, -9
Total Observed  11         11           23             11
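The table's arithmetic can be reproduced directly. A plain-Java sketch of the three accumulation modes applied to the firing sequence above (panes of 3, then 5 and 1, then 2); the method names are illustrative, not SDK API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the three accumulation modes from the table. Each inner array
// is one pane of elements: speculative, watermark, then late firing.
public class AccumulationModes {
    /** Discarding: each firing reports only elements since the last firing. */
    static List<Integer> discarding(int[][] panes) {
        List<Integer> out = new ArrayList<>();
        for (int[] pane : panes) out.add(sum(pane));
        return out;
    }

    /** Accumulating: each firing reports the running total so far. */
    static List<Integer> accumulating(int[][] panes) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (int[] pane : panes) { total += sum(pane); out.add(total); }
        return out;
    }

    /** Accumulating & retracting: emit the new total plus a retraction of the old. */
    static List<Integer> accumulatingAndRetracting(int[][] panes) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (int[] pane : panes) {
            int previous = total;
            total += sum(pane);
            out.add(total);                        // refined result
            if (previous != 0) out.add(-previous); // retract the previous result
        }
        return out;
    }

    private static int sum(int[] xs) { int s = 0; for (int x : xs) s += x; return s; }

    public static void main(String[] args) {
        int[][] panes = {{3}, {5, 1}, {2}};
        System.out.println(discarding(panes));                // [3, 6, 2]
        System.out.println(accumulating(panes));              // [3, 9, 11]
        System.out.println(accumulatingAndRetracting(panes)); // [3, 9, -3, 11, -9]
    }
}
```

Summing each mode's output downstream gives 11, 23, and 11 respectively, matching the table's Total Observed row and showing why accumulating panes double-count for a downstream summer unless retractions are used.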
Example: Add Newest, Remove Previous

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .trigger(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetracting())
    .apply(new Sum());
Example: Add Newest, Remove Previous
What Where When How
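Sessions.withGapDuration(Minutes(1)) groups a key's elements into bursts of activity separated by at least the gap duration. A simplified sketch over already-sorted timestamps (the real implementation merges windows incrementally as elements arrive; this is illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of session windowing: split ascending event timestamps into
// sessions wherever a gap of at least gapMillis appears. Illustrative only.
public class SessionsSketch {
    static List<List<Long>> sessions(List<Long> sortedTimes, long gapMillis) {
        List<List<Long>> out = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long t : sortedTimes) {
            // A gap at least as long as the threshold closes the current session.
            if (!current.isEmpty() && t - current.get(current.size() - 1) >= gapMillis) {
                out.add(current);
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }

    public static void main(String[] args) {
        // Elements at 0s, 30s, and 200s with a 1-minute gap -> two sessions.
        List<List<Long>> s = sessions(List.of(0L, 30_000L, 200_000L), 60_000L);
        System.out.println(s.size()); // 2
    }
}
```

Unlike fixed windows, session boundaries are data-driven, which is why late data can extend or merge sessions and why retractions matter in the example above.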
Customizing What / Where / When / How
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
4. Demo: Google Cloud Dataflow
Cloud Dataflow
A fully-managed cloud service and programming model for batch and streaming big data processing.
Google Cloud Dataflow
Open Source SDKs
● Used to construct a Dataflow pipeline.
● Java available now. Python in the works.
● Pipelines can run…
○ On your development machine
○ On the Dataflow Service on Google Cloud Platform
○ On third-party environments such as Apache Spark or Apache Flink
Fully Managed Dataflow Service
Runs the pipeline on Google Cloud Platform. Includes:
● Graph optimization: Modular code, efficient execution
● Smart Workers: Lifecycle management, Autoscaling, and
Smart task rebalancing
● Easy Monitoring: Dataflow UI, RESTful API and CLI, integration with Cloud Logging, etc.
Demo Time!
● Wrote reusable code (PTransforms)
● Built upon the same code (PTransform) in batch and streaming
● Used event time, not processing time
● Adjusted windowing and late-data settings
Learn More!
tec2016.toddreedy.com