Apache Beam @ GCPUG.TW Flink.TW 20161006

Page 1

Apache Beam in Data Pipeline

Randy Huang 2016/10/06

Page 2

Who am I

• Data Architect @ VMFive

• Fluentd/Embulk fan

Page 3

Overview

• Define Data Pipeline

• Architecture

• How to write Beam

• Demo

Page 4

Data Pipeline

Input → Algorithm → Output

Page 5

Why Apache Beam?

Page 6

The data pipeline world is chaotic

Page 7

Goal

• Provide an abstraction layer between data processing code and the execution runtime.

• Batch processing and streaming jobs in one unified model.

• The Beam SDK opens the door to write once, run anywhere.*

* on-premises and non-Google clouds
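
A minimal sketch of the "batch and streaming in one model" goal, written in plain Python rather than with the Beam SDK. The processing logic is written once against an iterable and then fed a bounded input (batch) and an unbounded one (streaming). The names `keep_even_times_ten`, `bounded_source`, and `unbounded_source` are hypothetical and not part of any Beam API.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def keep_even_times_ten(records: Iterable[str]) -> Iterator[int]:
    """Processing logic written once; it does not know where records come from."""
    for record in records:
        value = int(record)
        if value % 2 == 0:
            yield value * 10


def bounded_source() -> List[str]:
    """Hypothetical batch input: a finite list of records."""
    return ["1", "2", "3", "4"]


def unbounded_source() -> Iterator[str]:
    """Hypothetical streaming input: records keep arriving forever."""
    n = 1
    while True:
        yield str(n)
        n += 1


# Batch: consume the whole bounded input.
batch_result = list(keep_even_times_ten(bounded_source()))

# Streaming: the same logic, but only the first three results are taken
# because the input never ends.
stream_result = list(islice(keep_even_times_ten(unbounded_source()), 3))

print(batch_result)
print(stream_result)
```

The only difference between the two runs is how the input is produced and how much of the output is consumed; the transformation itself is untouched.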

Page 8

Supported Runners

• Google Cloud Dataflow (blocking/non-blocking)

• Apache Flink 1.1.2

• Apache Spark 1.6.2, Hadoop 2.2.0, Kafka 0.8.2.1

Page 9

API, model, and engine

Page 10

Architecture

• Pipelines

• Translators

• Runners
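
The three layers above can be sketched with a toy model in plain Python. This is not how Beam is implemented; `Pipeline`, `translate_to_lists`, `translate_to_generators`, `EagerRunner`, and `LazyRunner` are all hypothetical names. The pipeline only records what should happen, a translator turns each recorded step into something the target engine understands, and a runner executes the translated steps.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]  # ("map" or "filter", user function)


class Pipeline:
    """Engine-independent description of the work: just a list of steps."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def map(self, fn: Callable[[Any], Any]) -> "Pipeline":
        self.steps.append(("map", fn))
        return self

    def filter(self, fn: Callable[[Any], bool]) -> "Pipeline":
        self.steps.append(("filter", fn))
        return self


def translate_to_lists(pipeline: Pipeline) -> List[Callable[[list], list]]:
    """Translator for the eager engine: each step becomes a list -> list function."""
    def to_native(step: Step) -> Callable[[list], list]:
        kind, fn = step
        if kind == "map":
            return lambda data: [fn(x) for x in data]
        if kind == "filter":
            return lambda data: [x for x in data if fn(x)]
        raise ValueError(f"unknown step kind: {kind}")
    return [to_native(step) for step in pipeline.steps]


def translate_to_generators(pipeline: Pipeline) -> List[Callable[[Iterator], Iterator]]:
    """Translator for the lazy engine: each step becomes a lazy iterator stage."""
    def to_native(step: Step) -> Callable[[Iterator], Iterator]:
        kind, fn = step
        if kind == "map":
            return lambda stream: map(fn, stream)
        if kind == "filter":
            return lambda stream: filter(fn, stream)
        raise ValueError(f"unknown step kind: {kind}")
    return [to_native(step) for step in pipeline.steps]


class EagerRunner:
    """Runs each stage to completion before starting the next one."""

    def run(self, pipeline: Pipeline, data: Iterable[Any]) -> list:
        result = list(data)
        for stage in translate_to_lists(pipeline):
            result = stage(result)
        return result


class LazyRunner:
    """Chains lazy stages and pulls elements through one at a time."""

    def run(self, pipeline: Pipeline, data: Iterable[Any]) -> list:
        stream = iter(data)
        for stage in translate_to_generators(pipeline):
            stream = stage(stream)
        return list(stream)


# The same pipeline definition is handed to two different runners.
pipeline = Pipeline().map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
print(EagerRunner().run(pipeline, [1, 2, 3, 4]))
print(LazyRunner().run(pipeline, [1, 2, 3, 4]))
```

Keeping the translator separate from the runner means a new engine only needs its own translation of the same step vocabulary; the pipeline definition stays unchanged.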

Page 11

Programming tips / Flink

• Use the Flink DataStream API in Java and Scala

• Use the Beam API directly in Java (and soon Python) with the Flink runner

Page 12

SDK

• Four parts:

• Pipeline: Streaming & Batch Processing

• PCollection

• Transform

• I/O: Source & Sink
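
To make the four parts concrete, here is a toy model in plain Python, assuming in-memory text buffers stand in for real sources and sinks. It is not the Beam SDK: `ToyCollection`, `read_lines`, `write_lines`, and `to_upper` are hypothetical names. The one property it borrows from Beam is that a collection is immutable, so applying a transform yields a new collection instead of changing the input.

```python
import io
from typing import Callable, Iterable, Tuple


class ToyCollection:
    """Stand-in for a PCollection: an immutable bag of elements."""

    def __init__(self, elements: Iterable[str]) -> None:
        self._elements: Tuple[str, ...] = tuple(elements)

    def apply(self, transform: Callable[[Iterable[str]], Iterable[str]]) -> "ToyCollection":
        # A transform never mutates its input; it produces a new collection.
        return ToyCollection(transform(self._elements))

    def elements(self) -> Tuple[str, ...]:
        return self._elements


def read_lines(source: io.StringIO) -> ToyCollection:
    """Source: turn an external input into a collection."""
    return ToyCollection(line.rstrip("\n") for line in source)


def write_lines(collection: ToyCollection, sink: io.StringIO) -> None:
    """Sink: write a collection to an external output."""
    for element in collection.elements():
        sink.write(f"{element}\n")


def to_upper(elements: Iterable[str]) -> Iterable[str]:
    """Transform: element-wise processing step."""
    return (element.upper() for element in elements)


# Pipeline: source -> transform -> sink.
source = io.StringIO("beam\nflink\n")
sink = io.StringIO()

raw = read_lines(source)
shouted = raw.apply(to_upper)
write_lines(shouted, sink)

print(sink.getvalue())
```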

Page 13

For Flink users

• We encourage users to use either the Beam or the Flink APIs to implement their Flink jobs for stream data processing.

• But the native Flink API offers:

• backwards-compatible API

• built-in libraries (e.g., CEP and upcoming SQL)

• key-value state (with the ability to query that state in the future)

http://data-artisans.com/why-apache-beam/

Page 14

Demo

• GDELT project

• EventCount by Location

Pipeline
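
The demo code itself is not in this transcript. As a rough idea of what "EventCount by Location" computes, here is a plain-Python sketch, not the Beam demo. GDELT event files are tab-delimited, but the three-column layout, the `LOCATION_COLUMN` index, and the sample rows below are made up for illustration; a real GDELT export has many more columns and the location field sits elsewhere.

```python
import csv
import io
from collections import Counter
from typing import Iterable

# Assumption for this sketch only: location code is the third column.
LOCATION_COLUMN = 2


def count_events_by_location(lines: Iterable[str]) -> Counter:
    """Count events per location, skipping rows without a location value."""
    counts: Counter = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= LOCATION_COLUMN or not row[LOCATION_COLUMN]:
            continue  # malformed row or empty location
        counts[row[LOCATION_COLUMN]] += 1
    return counts


# Made-up sample rows: event id, date, location code.
SAMPLE = (
    "1\t20161006\tTW\n"
    "2\t20161006\tUS\n"
    "3\t20161006\tTW\n"
    "4\t20161006\t\n"
)

counts = count_events_by_location(io.StringIO(SAMPLE))
for location, total in sorted(counts.items()):
    print(location, total)
```

In a Beam pipeline the same shape would typically appear as three stages: read lines, extract the location from each line, and count per key.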

Page 15

Recap

• Write a general data pipeline, then choose your runner

Page 16

Next…

• New runners and SDKs (Python still in development)

• DSL

Page 17

Other things

• BigQuery has DML support!!! https://goo.gl/lcZQVZ

• Data Studio Beta is available in Taiwan

• Embulk

• Fluentd v0.14.6 - 2016/09/07

Page 18

forward secure

Page 19

Remember to set up nginx