The Next Generation of Data Processing and Open Source

James Malone, Google Product Manager, Apache Beam PPMC

Eric Schmidt, Google Developer Relations

Agenda

1. The Last Generation - Common historical challenges in large-scale data processing

2. The Next Generation - How large-scale data processing should work

3. Apache Beam - A solution for next generation data processing

4. Why Beam matters - A gaming example to show the power of the Beam model

5. Demo - Let's run a Beam pipeline on 3 engines in 2 separate clouds

6. Things to Remember - Recap and how you can get involved


Common historical challenges in large-scale data processing

01 The Last Generation

Setting up infrastructure

Decide on tool -> Read docs -> Get infrastructure -> Set up tools -> Tune tools -> Productionize -> Get specialists

Mood: Optimistic -> Frustrated

Programming models

Batch: Batch use case -> Batch model -> Batch engine -> Batch output

Streaming: Streaming use case -> Streaming model -> Streaming engine -> Streaming output

Batch output + Streaming output -> Join output

Mood: Optimistic -> Frustrated

Data pipeline portability

Three separate stacks, each with its own Data model, Data pipeline, and Execution engine

Mood: Frustrated, Happy

Infrastructure is a pain

Models are disconnected

Pipelines are not portable


How data processing should work

02 The Next Generation


Infrastructure is an afterthought (no longer a pain)

Models are unified (no longer disconnected)

Pipelines are portable

Setting up infrastructure

Skim docs -> Decide on product -> Start service

Mood: Optimistic -> Happy

A flexible (unified) model

Batch use case + Streaming use case -> Unified model -> Runner(s) -> Output

Mood: Optimistic -> Happy

Portable data pipelines

One Data model and one Data pipeline, running on any of three Execution engines

Mood: Happy -> Happier

Why does this matter?

More time can be dedicated to examining data for actionable insights

Less time is spent wrangling code, infrastructure, and tools used to process data

Hands-on with data

Cloud setup and customization


A solution for next generation data processing

03 Apache Beam (incubating)

What is Apache Beam?

1. The (unified stream + batch) Beam programming model, formerly the Dataflow model

2. Java and Python SDKs

3. Runners for Existing Distributed Processing Backends

a. Apache Flink (thanks to dataArtisans)

b. Apache Spark (thanks to Cloudera & PayPal)

c. Google Cloud Dataflow (fast, no-ops)

d. Local (in-process) runner for testing

+ Future runners for Beam - Apache Gearpump, Apache Apex, MapReduce, others!


The Apache Beam vision

1. End users: who want to write pipelines in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to support Beam pipelines


Beam Model: Pipeline Construction - Beam Java, Beam Python, Other Languages

Beam Model: Fn Runners - Apache Flink, Apache Spark, Google Cloud Dataflow

Execution - one per runner

Joining several threads into Beam


MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel

Cloud Dataflow, Cloud Dataproc

Apache Beam

Creating an Apache Beam community

Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors

Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem

Learn - We (Google) are also learning a lot as this is our first data-related Apache contribution ;-)

Apache Beam Roadmap

02/01/2016 - Enter Apache Incubator

02/25/2016 - 1st commit to ASF repository

Early 2016 - Design for use cases, begin refactoring

06/14/2016 - 1st incubating release

June 2016 - Python SDK moves to Beam

Mid 2016 - Additional refactoring, non-production uses

Late 2016 - Multiple runners execute Beam pipelines

End 2016 - Beam pipelines run on many runners in production uses


An example to show the power of the Beam model

04 Why Beam Matters

Apache Beam - A next generation model


Improved abstractions let you focus on your business logic

Batch and stream processing are both first-class citizens -- no need to choose.

Clearly separates event time from processing time.

Processing time vs. event time


Beam model - asking the right questions


What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

The Beam model - what is being computed?


PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
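To make the "what" question concrete without the Beam SDK, here is a minimal plain-Java sketch of a per-key integer sum over a small in-memory input. The class `SumPerKeySketch`, the `Score` type, and the sample data are hypothetical and only illustrate the idea; this is not how `Sum.integersPerKey()` is implemented.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the "what" step; this is not the Beam SDK.
class SumPerKeySketch {

    // Hypothetical element type: one (key, points) pair.
    static final class Score {
        final String key;
        final int points;

        Score(String key, int points) {
            this.key = key;
            this.points = points;
        }
    }

    // Sums the points for each key, keeping keys in first-seen order.
    static Map<String, Integer> sumPerKey(List<Score> input) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Score s : input) {
            totals.merge(s.key, s.points, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Score> input = List.of(
                new Score("red", 5), new Score("blue", 3), new Score("red", 4));
        System.out.println(sumPerKey(input)); // {red=9, blue=3}
    }
}
```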


The Beam model - where in event time?


PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
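As a rough illustration of the "where in event time" question, the following plain-Java sketch assigns each element to a two-minute fixed window based on its event timestamp and then sums per key within each window. The names (`FixedWindowSketch`, `Event`, `windowStart`) and the sample timestamps are assumptions made for this sketch; it does not use the Beam SDK.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of fixed windowing in event time; not the Beam SDK.
class FixedWindowSketch {
    static final long WINDOW_MILLIS = 2 * 60 * 1000L; // two-minute windows

    // Hypothetical element type carrying its own event timestamp.
    static final class Event {
        final String key;
        final int value;
        final long eventTimeMillis;

        Event(String key, int value, long eventTimeMillis) {
            this.key = key;
            this.value = value;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    // Start of the fixed window that contains the given event time.
    static long windowStart(long eventTimeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
    }

    // Groups by window start, then sums per key inside each window.
    static Map<Long, Map<String, Integer>> sumPerKeyPerWindow(List<Event> events) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Event e : events) {
            result.computeIfAbsent(windowStart(e.eventTimeMillis), w -> new TreeMap<>())
                  .merge(e.key, e.value, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Listed in arrival order, which differs from event-time order.
        List<Event> events = List.of(
                new Event("red", 5, 30_000L),
                new Event("red", 4, 150_000L),
                new Event("blue", 3, 90_000L),
                new Event("red", 7, 100_000L));
        System.out.println(sumPerKeyPerWindow(events));
        // {0={blue=3, red=12}, 120000={red=4}}
    }
}
```

The element that arrives last (event time 100 s) still lands in the first window, because the window is chosen from the event timestamp rather than from arrival order.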



The Beam model - when in processing time?


PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
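The "when in processing time" question can be sketched in plain Java as a buffer that holds per-window sums and only emits a window once a watermark has passed the end of that window. Everything here (`WatermarkTriggerSketch`, `onElement`, `onWatermark`) is a hypothetical simplification for a single key, and the slide's `AtWatermark()` appears to be shorthand rather than a literal SDK call.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative sketch of a watermark-driven trigger; not the Beam SDK.
class WatermarkTriggerSketch {
    static final long WINDOW_MILLIS = 2 * 60 * 1000L;

    // Window start -> running sum for windows that have not fired yet.
    private final TreeMap<Long, Integer> openWindows = new TreeMap<>();

    // Adds a value to the window that contains its event time.
    void onElement(int value, long eventTimeMillis) {
        long start = eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
        openWindows.merge(start, value, Integer::sum);
    }

    // Fires (returns and removes) every window whose end is at or before the watermark.
    // Assumption: data arriving after its window fired would start a fresh sum here;
    // late-data handling is deliberately left out of this sketch.
    Map<Long, Integer> onWatermark(long watermarkMillis) {
        NavigableMap<Long, Integer> complete =
                openWindows.headMap(watermarkMillis - WINDOW_MILLIS, true);
        Map<Long, Integer> fired = new TreeMap<>(complete);
        complete.clear(); // clearing the view removes the entries from openWindows
        return fired;
    }

    public static void main(String[] args) {
        WatermarkTriggerSketch trigger = new WatermarkTriggerSketch();
        trigger.onElement(5, 30_000L);
        trigger.onElement(4, 150_000L);
        System.out.println(trigger.onWatermark(100_000L)); // {}
        trigger.onElement(7, 100_000L);
        System.out.println(trigger.onWatermark(120_000L)); // {0=12}
        System.out.println(trigger.onWatermark(240_000L)); // {120000=4}
    }
}
```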



The Beam model - how do refinements relate?


PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
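To show how refinements relate, this plain-Java sketch emits one pane each time a trigger fires within a single window, once in accumulating mode and once in discarding mode. The class `AccumulationSketch`, the `fireAfter` flags, and the sample values are hypothetical; the firing points simply stand in for an early, an on-time, and a late firing.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of accumulating vs. discarding panes; not the Beam SDK.
class AccumulationSketch {

    // values[i] arrives in order; fireAfter[i] says whether the trigger fires
    // right after element i. Returns the sum emitted by each firing.
    static List<Integer> panes(int[] values, boolean[] fireAfter, boolean accumulating) {
        List<Integer> emitted = new ArrayList<>();
        int sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
            if (fireAfter[i]) {
                emitted.add(sum);
                if (!accumulating) {
                    sum = 0; // discarding mode: each pane covers only new elements
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        int[] values = {5, 7, 3, 4};
        boolean[] fireAfter = {false, true, true, true}; // early, on-time, late
        System.out.println(panes(values, fireAfter, true));  // [12, 15, 19]
        System.out.println(panes(values, fireAfter, false)); // [12, 3, 4]
    }
}
```

In accumulating mode the last pane is the full result for the window; in discarding mode a consumer has to add the panes together to get the same total.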



Customizing what where when how

1. Classic Batch

2. Windowed Batch

3. Streaming

4. Streaming + Accumulation

Apache Beam - the ecosystem

http://beam.incubator.apache.org/capability-matrix

Let's run a Beam pipeline on 3 engines in 2 separate locations

05 Demo


Created 1 Beam pipeline

Ran that one pipeline on three execution engines in two places

● Google Cloud Platform
  ○ Google Cloud Dataflow
  ○ Apache Spark on Google Cloud Dataproc

● Local
  ○ Apache Beam local runner
  ○ Apache Flink

100% portability, 0 problems
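The portability idea behind the demo can be sketched in plain Java: the pipeline is defined once, and interchangeable runners decide how to execute it. The `Runner` interface, the two runner classes, and `buildPipeline` are hypothetical names for this sketch only; real Beam runners translate a pipeline for engines such as Flink, Spark, or Cloud Dataflow, which this toy example does not attempt.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative sketch of "one pipeline, many runners"; not the Beam SDK.
class PortablePipelineSketch {

    // Hypothetical runner contract: execute a pipeline over some input.
    interface Runner {
        List<Integer> run(Function<Integer, Integer> pipeline, List<Integer> input);
    }

    // Runs the pipeline on a single thread.
    static final class SequentialRunner implements Runner {
        @Override
        public List<Integer> run(Function<Integer, Integer> pipeline, List<Integer> input) {
            return input.stream().map(pipeline).collect(Collectors.toList());
        }
    }

    // Runs the same pipeline on a parallel stream; result order is preserved.
    static final class ParallelRunner implements Runner {
        @Override
        public List<Integer> run(Function<Integer, Integer> pipeline, List<Integer> input) {
            return input.parallelStream().map(pipeline).collect(Collectors.toList());
        }
    }

    // The pipeline is defined once, with no reference to any runner.
    static Function<Integer, Integer> buildPipeline() {
        Function<Integer, Integer> doubleIt = x -> x * 2;
        return doubleIt.andThen(x -> x + 1);
    }

    public static void main(String[] args) {
        Function<Integer, Integer> pipeline = buildPipeline();
        List<Integer> input = List.of(1, 2, 3);
        for (Runner runner : List.of(new SequentialRunner(), new ParallelRunner())) {
            System.out.println(runner.run(pipeline, input)); // [3, 5, 7]
        }
    }
}
```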

What we just did


Recap and how you can get involved

06 Things to remember

Apache Beam is designed to provide portable pipelines with a unified programming model


Get involved with Apache Beam


Apache Beam (incubating): http://beam.incubator.apache.org

The World Beyond Batch 101 & 102 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Join the Beam mailing lists! (addresses under beam.incubator.apache.org)

Join the Apache Beam Slack channel

https://apachebeam.slack.com

Follow @ApacheBeam on Twitter


A special thank you


A special thank you to Frances Perry and Tyler Akidau for sharing Apache Beam content which was used in this presentation.


Thank you
