The Next Generation of Data Processing and Open Source

James Malone, Google Product Manager, Apache Beam PPMC

Eric Schmidt, Google Developer Relations

Agenda

The Last Generation - Common historical challenges in large-scale data processing

The Next Generation - How large-scale data processing should work

Apache Beam - A solution for next generation data processing

Why Beam matters - A gaming example to show the power of the Beam model

Demo - Let's run a Beam pipeline on 3 engines in 2 separate clouds

Things to Remember - Recap and how you can get involved

Common historical challenges in large-scale data processing

01 The Last Generation

Decide on tool Read docs

Get infrastructure

Setup tools Tune tools

Productionize Get Specialists

Optimistic

Frustrated

Setting up infrastructure

Batch model Streaming model

Batch use case Streaming use case

Batch engine Streaming engine

Batch output Streaming output

Join output

Optimistic

Frustrated

Programming models

Data model

Data pipeline

Execution engine 1

Data model

Data pipeline

Execution engine 1

Data model

Data pipeline

Execution engine 1

Frustrated Happy

Data pipeline portability

Infrastructure is a pain

Models are disconnected

Pipelines are not portable

How data processing should work

02 The Next Generation

Infrastructure is an afterthought

Models are unified

Pipelines are portable

Skim docs

Decide on product

Start service

Optimistic

Setting up infrastructure

Unified model

Batch use case

Runner(s)

Streaming use case

Output

Optimistic

A flexible (unified) model

Data model

Data pipeline

Execution engine

Happier

Portable data pipelines

Why does this matter?

More time can be dedicated to examining data for actionable insights

Less time is spent wrangling code, infrastructure, and tools used to process data

Hands-on with data

Cloud setup and customization

A solution for next generation data processing

03 Apache Beam (incubating)

What is Apache Beam?

1. The (unified stream + batch) Beam programming model

2. Java and Python SDKs

3. Runners for Existing Distributed Processing Backends

a. Apache Flink (thanks to dataArtisans)

b. Apache Spark (thanks to Cloudera & PayPal)

c. Google Cloud Dataflow (fast, no-ops)

d. Local (in-process) runner for testing

+ Future runners for Beam - Apache Gearpump, Apache Apex, MapReduce, others!

The Apache Beam vision

1. End users: who want to write pipelines in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to support Beam pipelines

Beam Model: Fn Runners

Apache Flink

Apache Spark

Beam Model: Pipeline Construction

Beam Java

Other Languages

Beam Python

Execution Execution

Google Cloud

Dataflow

Execution

Joining several threads into Beam

MapReduce

BigTable

Dremel

Colossus

Flume

Megastore

Spanner

PubSub

Millwheel

Cloud Dataflow

Cloud Dataproc

Apache Beam

Creating an Apache Beam community

Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors

Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem

Learn - We (Google) are also learning a lot as this is our first data-related Apache contribution ;-)

Apache Beam Roadmap

02/01/2016 - Enter Apache Incubator

Early 2016 - Design for use cases, begin refactoring

02/25/2016 - 1st commit to ASF repository

Mid 2016 - Additional refactoring, non-production uses

06/14/2016 - 1st incubating release

June 2016 - Python SDK moves to Beam

Late 2016 - Multiple runners execute Beam pipelines

End 2016 - Beam pipelines run on many runners in production uses

An example to show the power of the Beam model

04 Why Beam Matters

Apache Beam - A next generation model

Improved abstractions let you focus on your business logic

Batch and stream processing are both first-class citizens -- no need to choose.

Clearly separates event time from processing time.

Processing time vs. event time
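The difference between the two clocks can be illustrated with a small plain-Java sketch. It does not use the Beam SDK; the `Event` class, the user names, and the timestamps are made up for illustration. Each record carries the time it happened (event time) and the time the pipeline saw it (processing time), and the two orderings disagree whenever data arrives late.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class EventVsProcessingTime {
    /** One game event: when it happened and when the pipeline saw it (seconds). Hypothetical type. */
    static final class Event {
        final String user;
        final long eventTime;
        final long processingTime;

        Event(String user, long eventTime, long processingTime) {
            this.user = user;
            this.eventTime = eventTime;
            this.processingTime = processingTime;
        }

        /** Delay between the event happening and the pipeline observing it. */
        long lag() {
            return processingTime - eventTime;
        }
    }

    /** Returns a copy of the arrivals re-ordered by when they actually happened. */
    static List<Event> byEventTime(List<Event> arrivals) {
        List<Event> sorted = new ArrayList<>(arrivals);
        sorted.sort(Comparator.comparingLong((Event e) -> e.eventTime));
        return sorted;
    }

    public static void main(String[] args) {
        // Listed in arrival (processing-time) order.
        List<Event> arrivals = new ArrayList<>();
        arrivals.add(new Event("alice", 60, 65));
        arrivals.add(new Event("bob", 10, 130)); // played offline, uploaded later
        arrivals.add(new Event("carol", 90, 131));
        for (Event e : byEventTime(arrivals)) {
            System.out.println(e.user + " eventTime=" + e.eventTime + " lag=" + e.lag() + "s");
        }
    }
}
```

In this made-up data, bob's event happened first but arrived second, which is the kind of skew the event-time questions in the Beam model are meant to handle.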

Beam model - asking the right questions

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

The Beam model - what is being computed?

PCollection<KV<String, Integer>> scores = input

.apply(Sum.integersPerKey());

The Beam model - what is being computed?
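As a rough illustration of what a per-key integer sum computes, here is a plain-Java sketch that does not depend on the Beam SDK. The class name, method name, and sample team scores are assumptions made for this example; a Beam runner would perform the same aggregation in a distributed way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SumPerKeySketch {
    /** Sums scores per key; keys[i] is the team that earned scores[i]. */
    static Map<String, Integer> sumPerKey(String[] keys, int[] scores) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            // merge() inserts the score or adds it to the existing total.
            totals.merge(keys[i], scores[i], Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] teams = {"red", "blue", "red"};
        int[] scores = {3, 4, 5};
        System.out.println(sumPerKey(teams, scores)); // prints {red=8, blue=4}
    }
}
```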

The Beam model - where in event time?

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))

The Beam model - where in event time?
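Fixed two-minute windows can be sketched in plain Java as follows. This is not the Beam implementation: it assumes timestamps in seconds and windows aligned to time 0, and the helper names are hypothetical. Each element is assigned to the window containing its event time, and the per-key sum is then computed separately for each window.

```java
import java.util.Map;
import java.util.TreeMap;

final class FixedWindowSketch {
    /** Start of the fixed window (aligned to time 0) that contains the timestamp. */
    static long windowStart(long eventTimeSec, long sizeSec) {
        return eventTimeSec - Math.floorMod(eventTimeSec, sizeSec);
    }

    /** Sums scores per window start, then per key. */
    static Map<Long, Map<String, Integer>> sumPerKeyAndWindow(
            String[] keys, int[] scores, long[] eventTimesSec, long sizeSec) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            long start = windowStart(eventTimesSec[i], sizeSec);
            result.computeIfAbsent(start, s -> new TreeMap<>())
                  .merge(keys[i], scores[i], Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] teams = {"red", "blue", "red", "red"};
        int[] scores = {3, 4, 5, 2};
        long[] eventTimes = {10, 70, 119, 130};
        // Two-minute windows: [0, 120) and [120, 240).
        System.out.println(sumPerKeyAndWindow(teams, scores, eventTimes, 120));
        // prints {0={blue=4, red=8}, 120={red=2}}
    }
}
```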

The Beam model - when in processing time?

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))

.triggering(AtWatermark()))

The Beam model - when in processing time?
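One way to picture a watermark trigger is the simplified single-key simulation below. It is plain Java, not Beam code, and it reads `AtWatermark()` from the slide as "emit a window's result once the watermark passes the end of that window". The class and its methods are hypothetical, and dropping late data is a simplification chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WatermarkTriggerSketch {
    private final long sizeSec;
    // Window start -> running sum for windows that have not fired yet.
    private final TreeMap<Long, Integer> open = new TreeMap<>();
    private long watermark = Long.MIN_VALUE;

    WatermarkTriggerSketch(long sizeSec) {
        this.sizeSec = sizeSec;
    }

    /** Buffers a score in the window that contains its event time. */
    void add(long eventTimeSec, int score) {
        long start = eventTimeSec - Math.floorMod(eventTimeSec, sizeSec);
        if (start + sizeSec <= watermark) {
            return; // window already fired: late data is dropped in this sketch
        }
        open.merge(start, score, Integer::sum);
    }

    /** Advances the watermark and returns results for every window it has passed. */
    List<String> advanceWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        List<String> fired = new ArrayList<>();
        while (!open.isEmpty() && open.firstKey() + sizeSec <= watermark) {
            Map.Entry<Long, Integer> e = open.pollFirstEntry();
            fired.add("[" + e.getKey() + "," + (e.getKey() + sizeSec) + ") sum=" + e.getValue());
        }
        return fired;
    }

    public static void main(String[] args) {
        WatermarkTriggerSketch trigger = new WatermarkTriggerSketch(120);
        trigger.add(10, 3);
        trigger.add(130, 4);
        trigger.add(50, 5);
        System.out.println(trigger.advanceWatermark(100)); // prints []
        System.out.println(trigger.advanceWatermark(120)); // prints [[0,120) sum=8]
    }
}
```

Nothing is emitted while the watermark is still inside the first window, so results are materialized only when the input for that window is believed to be complete.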

The Beam model - how do refinements relate?

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))

.triggering(AtWatermark()

.withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))

.withLateFirings(AtCount(1)))

.accumulatingFiredPanes())

The Beam model - how do refinements relate?
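How successive firings of the same window relate can be sketched as below. This is plain Java rather than Beam code, with a hypothetical `panes` helper and made-up scores. In accumulating mode each pane reports the running total for the window so far; the contrasting discarding mode, shown for comparison, reports only what arrived since the previous firing.

```java
import java.util.ArrayList;
import java.util.List;

final class PaneAccumulationSketch {
    /**
     * Returns the value emitted at each firing of one window.
     * elementsPerFiring[i] holds the scores that arrived before firing i.
     */
    static List<Integer> panes(int[][] elementsPerFiring, boolean accumulating) {
        List<Integer> emitted = new ArrayList<>();
        int total = 0;
        for (int[] batch : elementsPerFiring) {
            int delta = 0;
            for (int score : batch) {
                delta += score;
            }
            total += delta;
            // Accumulating panes refine the previous result; discarding panes report only new data.
            emitted.add(accumulating ? total : delta);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Early firing, on-time firing at the watermark, then one late element.
        int[][] firings = {{3, 4}, {5}, {1}};
        System.out.println(panes(firings, true));  // prints [7, 12, 13]
        System.out.println(panes(firings, false)); // prints [7, 5, 1]
    }
}
```

With accumulation, a downstream consumer can simply overwrite the previous value for the window; without it, the consumer has to add the panes together itself.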

Customizing what where when how

1. Classic Batch

2. Windowed

3. Streaming

4. Streaming + Accumulation

Apache Beam - the ecosystem

http://beam.incubator.apache.org/capability-matrix

Let's run a Beam pipeline on 3 engines in 2 separate locations

05 Demo

Created 1 Beam pipeline

Ran that one pipeline on three execution engines in two places

● Google Cloud Platform
  ○ Google Cloud Dataflow
  ○ Apache Spark on Google Cloud Dataproc

● Local
  ○ Apache Beam local runner
  ○ Apache Flink

100% portability, 0 problems

What we just did

Recap and how you can get involved

06 Things to remember

Apache Beam is designed to provide portable pipelines with a unified programming model

Get involved with Apache Beam

Apache Beam (incubating): http://beam.incubator.apache.org

The World Beyond Batch 101 & 102:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Join the Beam mailing lists!
user-subscribe@beam.incubator.apache.org
dev-subscribe@beam.incubator.apache.org

Join the Apache Beam Slack channel

https://apachebeam.slack.com

Follow @ApacheBeam on Twitter

A special thank you

A special thank you to Frances Perry and Tyler Akidau for sharing Apache Beam content which was used in this presentation.

Thank you
