Apache Beam @ GCPUG.TW Flink.TW 20161006
Apache Beam in Data Pipeline
Randy Huang 2016/10/06
Who am I
• Data Architect @ VMFive
• Fluentd/Embulk fan
Overview
• Define Data Pipeline
• Architecture
• How to write Beam
• Demo
Data Pipeline
• Input → Algorithm → Output
Why Apache Beam?
The data pipeline world is chaotic
Goal
• Provide an abstraction layer between data-processing code and the execution runtime.
• Batch processing and Streaming Jobs in one world.
• The Beam SDK opens the door to write once, run anywhere.*
*including on-premises and non-Google clouds
Supported Runners
• Google Cloud Dataflow (blocking/non-blocking execution)
• Apache Flink 1.1.2
• Apache Spark 1.6.2 (Hadoop 2.2.0, Kafka 0.8.2.1)
API, model, and engine
Architecture
• Pipelines
• Translators
• Runners
Programming Tips / Flink
• Use the Flink DataStream API in Java and Scala
• Use the Beam API directly in Java (and soon Python) with the Flink runner
SDK
• Four Parts :
• Pipeline : Streaming & Batch Processing
• PCollection
• Transform
• I/O : Source & Sink
For Flink Users
• We encourage users to use either the Beam or Flink APIs to implement their Flink jobs for stream data processing.
• But the native Flink API offers:
• a backwards-compatible API
• built-in libraries (e.g., CEP and the upcoming SQL)
• key-value state (with the ability to query that state in the future)
http://data-artisans.com/why-apache-beam/
Demo
• GDELT project
• EventCount by Location
Pipeline
Recap
• Write the pipeline once in a general form, then choose your runner
Next…
• New runners and SDKs (the Python SDK is still in development)
• DSL
Other Things
• BigQuery now has DML support! https://goo.gl/lcZQVZ
• Data Studio Beta is now available in Taiwan
• Embulk
• Fluentd v0.14.6 - 2016/09/07
• secure forward
• remember to set up nginx