The Next Generation of Data Processing and Open Source

James Malone, Google Product Manager, Apache Beam PPMC
Eric Schmidt, Google Developer Relations


Page 1: The Next Generation of Data Processing and Open Source


Page 2: The Next Generation of Data Processing and Open Source

Agenda

1. The Last Generation - Common historical challenges in large-scale data processing
2. The Next Generation - How large-scale data processing should work
3. Apache Beam - A solution for next generation data processing
4. Why Beam matters - A gaming example to show the power of the Beam model
5. Demo - Let's run a Beam pipeline on 3 engines in 2 separate clouds
6. Things to Remember - Recap and how you can get involved


Page 3: The Next Generation of Data Processing and Open Source

01 The Last Generation

Common historical challenges in large-scale data processing

Page 4: The Next Generation of Data Processing and Open Source

Setting up infrastructure

[Diagram: Decide on tool, Read docs, Get infrastructure, Set up tools, Tune tools, Productionize, Get specialists. Mood labels: Optimistic, Frustrated.]

Page 5: The Next Generation of Data Processing and Open Source

Programming models

[Diagram: Batch model, Batch use case, Batch engine, Batch output; Streaming model, Streaming use case, Streaming engine, Streaming output; Join output. Mood labels: Optimistic, Frustrated.]

Page 6: The Next Generation of Data Processing and Open Source

Data pipeline portability

[Diagram: the stack Data model, Data pipeline, Execution engine 1 shown three times. Mood labels: Frustrated, Happy.]

Page 7: The Next Generation of Data Processing and Open Source

Infrastructure is a pain

Models are disconnected

Pipelines are not portable


Page 8: The Next Generation of Data Processing and Open Source

02 The Next Generation

How data processing should work

Page 9: The Next Generation of Data Processing and Open Source

Infrastructure is an afterthought (was: a pain)

Models are unified (was: disconnected)

Pipelines are portable (was: not portable)

Page 10: The Next Generation of Data Processing and Open Source

Setting up infrastructure

[Diagram: Skim docs, Decide on product, Start service. Mood labels: Optimistic, Happy.]

Page 11: The Next Generation of Data Processing and Open Source

A flexible (unified) model

[Diagram: Unified model, Batch use case, Streaming use case, Runner(s), Output. Mood labels: Optimistic, Happy.]

Page 12: The Next Generation of Data Processing and Open Source

Portable data pipelines

[Diagram: Data model, Data pipeline, and three boxes labeled Execution engine. Mood labels: Happy, Happier.]

Page 13: The Next Generation of Data Processing and Open Source

Why does this matter?

More time can be dedicated to examining data for actionable insights

Less time is spent wrangling code, infrastructure, and tools used to process data

Hands-on with data

Cloud setup and customization

Page 14: The Next Generation of Data Processing and Open Source

03 Apache Beam (incubating)

A solution for next generation data processing

Page 15: The Next Generation of Data Processing and Open Source

What is Apache Beam?

1. The (unified stream + batch) Beam programming model, formerly the Dataflow model
2. Java and Python SDKs
3. Runners for existing distributed processing backends
   a. Apache Flink (thanks to dataArtisans)
   b. Apache Spark (thanks to Cloudera & PayPal)
   c. Google Cloud Dataflow (fast, no-ops)
   d. Local (in-process) runner for testing

+ Future runners for Beam - Apache Gearpump, Apache Apex, MapReduce, others!

Page 16: The Next Generation of Data Processing and Open Source

The Apache Beam vision

1. End users: who want to write pipelines in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to support Beam pipelines

[Diagram: Beam Model: Pipeline Construction (Beam Java, Beam Python, Other Languages); Beam Model: Fn Runners (Apache Flink, Apache Spark, Google Cloud Dataflow); an Execution box under each runner.]

Page 17: The Next Generation of Data Processing and Open Source

Joining several threads into Beam

[Diagram: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Cloud Dataproc, Apache Beam]

Page 18: The Next Generation of Data Processing and Open Source

Creating an Apache Beam community

Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors

Grow - We want to grow the Beam ecosystem and community with active, open involvement so that Beam is part of the larger OSS ecosystem

Learn - We (Google) are also learning a lot as this is our first data-related Apache contribution ;-)

Page 19: The Next Generation of Data Processing and Open Source

Apache Beam Roadmap

Dated milestones:
● 02/01/2016: Enter Apache Incubator
● 02/25/2016: 1st commit to ASF repository
● 06/14/2016: 1st incubating release
● June 2016: Python SDK moves to Beam

Phases:
● Early 2016: Design for use cases, begin refactoring
● Mid 2016: Additional refactoring, non-production uses
● Late 2016: Multiple runners execute Beam pipelines
● End 2016: Beam pipelines run on many runners in production uses

Page 20: The Next Generation of Data Processing and Open Source

04 Why Beam Matters

An example to show the power of the Beam model

Page 21: The Next Generation of Data Processing and Open Source

Apache Beam - A next generation model


Improved abstractions let you focus on your business logic

Batch and stream processing are both first-class citizens -- no need to choose.

Clearly separates event time from processing time.

Page 22: The Next Generation of Data Processing and Open Source

Processing time vs. event time


Page 23: The Next Generation of Data Processing and Open Source

Beam model - asking the right questions


What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

Page 24: The Next Generation of Data Processing and Open Source

The Beam model - what is being computed?

PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
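As a rough illustration of the "what" question, the per-key sum can be sketched in plain Java without the Beam SDK. The class name, the team names, and the scores are hypothetical, and this sketch covers only a bounded, in-memory input; it is not how a Beam runner executes the transform.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SumPerKeySketch {
    // Adds up all values that share a key; a TreeMap keeps the output order stable.
    static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> input) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> element : input) {
            totals.merge(element.getKey(), element.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical (team, points) elements.
        List<Map.Entry<String, Integer>> input = List.of(
            Map.entry("red", 5), Map.entry("blue", 3), Map.entry("red", 7));
        System.out.println(sumPerKey(input)); // prints {blue=3, red=12}
    }
}
```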

Page 25: The Next Generation of Data Processing and Open Source

The Beam model - what is being computed?


Page 26: The Next Generation of Data Processing and Open Source

The Beam model - where in event time?

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
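The "where in event time" idea can be sketched in plain Java: each element is assigned to a two-minute window based on its own event timestamp, not on when it arrives. The class name, keys, and timestamps below are hypothetical, and event times are assumed to be non-negative seconds.

```java
import java.util.Map;
import java.util.TreeMap;

class FixedWindowSketch {
    static final long WINDOW_SECONDS = 120; // two-minute windows, as on the slide

    // Start of the fixed window that contains the given event time.
    static long windowStart(long eventTimeSeconds) {
        return eventTimeSeconds - (eventTimeSeconds % WINDOW_SECONDS);
    }

    // events[i] = {eventTimeSeconds, points}; keys[i] is the key of that element.
    static Map<String, Integer> sumPerKeyAndWindow(String[] keys, long[][] events) {
        Map<String, Integer> totals = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            String slot = keys[i] + "@" + windowStart(events[i][0]);
            totals.merge(slot, (int) events[i][1], Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Listed in arrival order: the element with event time 50 arrives after the one at 130.
        String[] keys = {"red", "red", "red", "blue"};
        long[][] events = {{10, 5}, {130, 7}, {50, 1}, {125, 3}};
        System.out.println(sumPerKeyAndWindow(keys, events));
        // prints {blue@120=3, red@0=6, red@120=7}
    }
}
```

The out-of-order element still lands in the window starting at 0, because grouping depends only on event time.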

Page 27: The Next Generation of Data Processing and Open Source

The Beam model - where in event time?

Page 28: The Next Generation of Data Processing and Open Source

The Beam model - when in processing time?

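// AtWatermark() is slide shorthand; the Beam Java SDK spells it AfterWatermark.pastEndOfWindow().
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());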
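A watermark trigger can be sketched as a small single-key simulation in plain Java: a window's sum is emitted once the watermark passes the end of that window. The class name, the step format, and the watermark values are assumptions made for illustration; a real runner estimates the watermark itself, and this sketch simply drops elements that arrive after their window has fired.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WatermarkTriggerSketch {
    static final long WINDOW_SECONDS = 120;

    // Each step is {eventTimeSeconds, points, watermarkSeconds}: one element arrives,
    // then the (hypothetical) watermark estimate advances to watermarkSeconds.
    static List<String> run(long[][] steps) {
        Map<Long, Long> open = new TreeMap<>(); // window start -> running sum
        List<String> emitted = new ArrayList<>();
        long watermark = 0;
        for (long[] step : steps) {
            long start = step[0] - (step[0] % WINDOW_SECONDS);
            if (start + WINDOW_SECONDS > watermark) {
                open.merge(start, step[1], Long::sum);
            } // otherwise the window has already fired; this sketch drops the late element
            watermark = Math.max(watermark, step[2]);
            // Fire every open window whose end the watermark has reached.
            Iterator<Map.Entry<Long, Long>> it = open.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Long, Long> window = it.next();
                long end = window.getKey() + WINDOW_SECONDS;
                if (end <= watermark) {
                    emitted.add("[" + window.getKey() + "," + end + ")=" + window.getValue());
                    it.remove();
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[][] steps = {{10, 5, 20}, {130, 7, 100}, {50, 1, 125}, {30, 4, 130}, {200, 2, 250}};
        System.out.println(run(steps)); // prints [[0,120)=6, [120,240)=9]
    }
}
```

The element with event time 30 arrives after the watermark has passed 120, so it misses the first window's result. Handling such late data is what the refinement question on the following slides addresses.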

Page 29: The Next Generation of Data Processing and Open Source

The Beam model - when in processing time?

Page 30: The Next Generation of Data Processing and Open Source

The Beam model - how do refinements relate?

// AtWatermark(), AtPeriod(...) and AtCount(...) are slide shorthand for watermark,
// processing-time and element-count triggers.
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
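How refinements relate can be sketched by comparing two accumulation modes for one key and one window that fires several times. In accumulating mode each pane repeats everything seen so far; in discarding mode each pane carries only what arrived since the previous firing. The class name and the arrival data are hypothetical, and this plain-Java sketch models only the arithmetic, not Beam's trigger machinery.

```java
import java.util.ArrayList;
import java.util.List;

class PaneAccumulationSketch {
    // arrivalsPerFiring[i] holds the values that arrived between firing i-1 and firing i.
    static List<Integer> firedPanes(int[][] arrivalsPerFiring, boolean accumulating) {
        List<Integer> panes = new ArrayList<>();
        int runningTotal = 0;
        for (int[] arrivals : arrivalsPerFiring) {
            int delta = 0;
            for (int value : arrivals) {
                delta += value;
            }
            runningTotal += delta;
            // Accumulating panes report the total so far; discarding panes report only the delta.
            panes.add(accumulating ? runningTotal : delta);
        }
        return panes;
    }

    public static void main(String[] args) {
        // Hypothetical early, on-time, and late firings for a single window.
        int[][] arrivals = {{5, 1}, {7}, {3}};
        System.out.println(firedPanes(arrivals, true));  // prints [6, 13, 16]
        System.out.println(firedPanes(arrivals, false)); // prints [6, 7, 3]
    }
}
```

With accumulating panes a consumer can overwrite the previous result with the latest pane; with discarding panes a consumer has to add the panes together itself.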

Page 31: The Next Generation of Data Processing and Open Source

The Beam model - how do refinements relate?

Page 32: The Next Generation of Data Processing and Open Source

Customizing what where when how

1 Classic Batch
2 Windowed Batch
3 Streaming
4 Streaming + Accumulation

Page 33: The Next Generation of Data Processing and Open Source

Apache Beam - the ecosystem

http://beam.incubator.apache.org/capability-matrix

Page 34: The Next Generation of Data Processing and Open Source

05 Demo

Let's run a Beam pipeline on 3 engines in 2 separate locations

Page 35: The Next Generation of Data Processing and Open Source

What we just did

Created 1 Beam pipeline

Ran that one pipeline on three execution engines in two places

● Google Cloud Platform
   ○ Google Cloud Dataflow
   ○ Apache Spark on Google Cloud Dataproc
● Local
   ○ Apache Beam local runner
   ○ Apache Flink

100% portability, 0 problems

Page 36: The Next Generation of Data Processing and Open Source

06 Things to remember

Recap and how you can get involved

Page 37: The Next Generation of Data Processing and Open Source

Apache Beam is designed to provide portable pipelines with a unified programming model

Page 38: The Next Generation of Data Processing and Open Source

Get involved with Apache Beam

Apache Beam (incubating)
http://beam.incubator.apache.org

The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Join the Beam mailing lists! [email protected]@beam.incubator.apache.org

Join the Apache Beam Slack channel

https://apachebeam.slack.com

Follow @ApacheBeam on Twitter

Page 39: The Next Generation of Data Processing and Open Source

A special thank you


A special thank you to Frances Perry and Tyler Akidau for sharing Apache Beam content which was used in this presentation.

Page 40: The Next Generation of Data Processing and Open Source


Thank you