Dataflow II: Finish Dataflow Analysis, Start on Classical Optimizations
JCConf 2015 - Google Dataflow Applications in Cloud Big Data Processing
Uploaded by: simon-su
Category: Technology
Transcript of JCConf 2015 - Google Dataflow Applications in Cloud Big Data Processing
Google Dataflow Applications in Cloud Big Data Processing
Simon Su @ QNAP
https://goo.gl/YuXCw5
var simon = {/** I am at GCPUG.TW **/};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = '[email protected]';
simon.say('Good luck to everybody!');
var sunny = {};
sunny.aboutme = 'https://plus.google.com/u/0/+sunnyHU/posts';
sunny.email = '[email protected]';
sunny.language = ['Java', '.NET', 'NodeJS', 'SQL'];
sunny.skill = ['Project management', 'System Analysis',
    'System design', 'Car ho lan'];
sunny.say('Writing code is too dreary; keep your mood sunny');
https://www.facebook.com/groups/GCPUG.TW/
https://plus.google.com/u/0/communities/116100913832589966421
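Google Cloud Platform User Group Taiwan
We are the Google Cloud Platform Taiwan User Group. Since Google's cloud services began to make their mark in Taiwan, many new services, new knowledge, and new ideas have appeared. Everyone is welcome to share and learn about Google's cloud services together...
GCPUG connects Google Cloud enthusiasts over the Internet to share and exchange their experience of using GCP. If you are a Google Cloud Platform beginner, come and hear how more experienced users work with it. If you are a Google Cloud Platform expert, come and share your valuable experience and exchange ideas with other experts. If you have not started using Google Cloud Platform yet, come right away and hear how we use Google Cloud!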
Before Dataflow...
What does Google provide in the Big Data domain?
Google Cloud Big Data Tools
● Constructs scalable and reliable data pipelines
● Executes processing on Compute Engine instances
● Provides support for:
○ ETL
○ Analytics
○ Real-time computation
○ Process orchestration
● Integrates with GCP services for data processing
○ Cloud Storage
○ Cloud Pub/Sub
○ BigQuery
● Open source Cloud Dataflow Java SDK available
Demo
Run a word count example...
gcloud alpha dataflow jobs list
Eclipse plugin install path: https://dl.google.com/dataflow/eclipse/
Dataflow Programming Models
Pipeline, PCollections, Transforms, Pipeline I/O
Pipeline
• Represents a data processing job
• Consists of two parts: data, and the transforms applied to that data
• Consists of a set of operations
○ Read input -> Transform data -> Write output
• May include multiple inputs and multiple outputs
• May encompass many logical MapReduce operations
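The read, transform, write shape described above can be sketched in plain Java. This is only an analogy that runs on in-memory lists: it does not use the Cloud Dataflow SDK, and `MiniPipeline`, `apply`, `run`, and `demo` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java analogy of a pipeline (NOT the Dataflow SDK; all names are hypothetical).
class MiniPipeline {
    private final List<Function<List<String>, List<String>>> steps = new ArrayList<>();

    // Register a transform; steps run in the order they were added.
    MiniPipeline apply(Function<List<String>, List<String>> step) {
        steps.add(step);
        return this;
    }

    // Feed the input through every registered step and return the final output.
    List<String> run(List<String> input) {
        List<String> data = input;
        for (Function<List<String>, List<String>> step : steps) {
            data = step.apply(data);
        }
        return data;
    }

    // Example job: trim each line, drop empty lines, then upper-case the rest.
    static List<String> demo(List<String> input) {
        return new MiniPipeline()
            .apply(lines -> lines.stream().map(String::trim).collect(Collectors.toList()))
            .apply(lines -> lines.stream().filter(s -> !s.isEmpty()).collect(Collectors.toList()))
            .apply(lines -> lines.stream().map(s -> s.toUpperCase(Locale.ROOT)).collect(Collectors.toList()))
            .run(input);
    }

    public static void main(String[] args) {
        // "Read input": an in-memory list stands in for a real source.
        List<String> input = Arrays.asList(" hello ", "", "dataflow");
        // "Write output": printing stands in for a real sink.
        System.out.println(demo(input)); // prints [HELLO, DATAFLOW]
    }
}
```

In the real service the steps are distributed across workers; here they simply run one after another in a single process.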
[Diagram: Input -> Transform -> Output]
Pipeline I/O
• AvroIO
• PubSubIO
• Custom source / sink API (your source or sink here)
• Newline-delimited files; can be compressed with gzip or bzip2
• Read and write Avro files, local or remote (GCS)
PCollection
• An immutable collection of data of any type in a pipeline
• May be either bounded or unbounded in size
• Bounded: Text, BigQuery, Datastore, custom data
• Unbounded: data source: Pub/Sub; data sinks: Pub/Sub, BigQuery
• Created by using a PTransform to:
○ Build from a java.util.Collection
○ Read from a backing data store
○ Transform an existing PCollection
• Often contains key-value pairs, represented with KV
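To make the "immutable collection of key-value pairs" idea concrete, here is a minimal plain-Java sketch. It is an analogy only: it uses `java.util` types rather than the SDK's `PCollection` and `KV` classes, and `ImmutableCollectionSketch` and `toWordLengths` are hypothetical names.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Plain-Java analogy (NOT the Dataflow SDK): a transform never edits its input;
// it returns a new, read-only collection of key-value pairs.
class ImmutableCollectionSketch {
    // Map.Entry stands in for the SDK's KV type in this sketch.
    static List<Map.Entry<String, Integer>> toWordLengths(List<String> words) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : words) {
            out.add(new SimpleImmutableEntry<>(w, w.length()));
        }
        // Wrap the result so callers cannot add, remove, or replace elements.
        return Collections.unmodifiableList(out);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("data", "flow");
        List<Map.Entry<String, Integer>> pairs = toWordLengths(words);
        System.out.println(pairs); // prints [data=4, flow=4]
        System.out.println(words); // the input is unchanged: [data, flow]
    }
}
```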
Transforms
● A step, or processing operation, that transforms data
○ Convert formats, group, filter data
● Types of transforms
○ ParDo
■ Generic parallel processing; the processing style is similar to a "Mapper"
○ GroupByKey
■ Analogous to the Shuffle phase of a Map/Shuffle/Reduce-style algorithm
■ Use GroupByKey to collect all of the values associated with a unique key
○ Combine
■ Combines the values in your pipeline's PCollection objects, or combines key-grouped values
○ Flatten
■ When multiple PCollection objects contain the same data type, you can merge them into a single logical PCollection using the Flatten transform
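The four transform types listed above can be approximated on in-memory collections. The sketch below is a plain-Java analogy under that assumption, not the SDK's `ParDo`, `GroupByKey`, `Combine`, or `Flatten` classes; `TransformSketch` and its method names are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java analogues of the four transform types (NOT the Dataflow SDK).
class TransformSketch {
    // ParDo analogue: run fn on every element; each element may emit zero or more outputs.
    static <A, B> List<B> parDo(List<A> input, Function<A, List<B>> fn) {
        return input.stream().flatMap(a -> fn.apply(a).stream()).collect(Collectors.toList());
    }

    // GroupByKey analogue: collect all values that share a key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Combine analogue: reduce each key's grouped values to a single sum.
    static <K> Map<K, Integer> sumPerKey(Map<K, List<Integer>> grouped) {
        Map<K, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<K, List<Integer>> e : grouped.entrySet()) {
            sums.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return sums;
    }

    // Flatten analogue: merge several same-typed collections into one.
    static <T> List<T> flatten(List<List<T>> collections) {
        List<T> merged = new ArrayList<>();
        for (List<T> c : collections) {
            merged.addAll(c);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> lines = flatten(Arrays.asList(
            Arrays.asList("big data"), Arrays.asList("big query")));
        List<String> words = parDo(lines, line -> Arrays.asList(line.split(" ")));
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(new SimpleImmutableEntry<>(w, 1));
        }
        System.out.println(sumPerKey(groupByKey(pairs))); // prints {big=2, data=1, query=1}
    }
}
```

A distributed runner would shuffle the pairs between workers during the grouping step; this sketch groups them in a single map.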
[Diagram: Map -> ParDo, Shuffle -> GroupByKey, Reduce -> ParDo]
How does WordCount work?
Look into Word Count...
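As a rough illustration of the word count logic, the sketch below counts words with Java streams. It is an assumption-laden stand-in, not the SDK's WordCount example: `WordCountSketch` and `countWords` are hypothetical names, and the tokenizing rule (split on anything other than letters and apostrophes) is chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java word count (NOT the Dataflow SDK example; names are hypothetical).
class WordCountSketch {
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
            // Map-like step: split each lower-cased line into word tokens.
            .flatMap(line -> Arrays.stream(line.toLowerCase(Locale.ROOT).split("[^a-z']+")))
            // Drop the empty token produced when a line starts with punctuation.
            .filter(word -> !word.isEmpty())
            // Shuffle-and-reduce-like step: group identical words and count each group.
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("To be, or not to be", "that is the question");
        System.out.println(countWords(lines));
        // prints {be=2, is=1, not=1, or=1, question=1, that=1, the=1, to=2}
    }
}
```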
Dataflow use cases
NYC case study
• Functional (transform based) programming model
• Unified programming model for batch & stream processing
• Reduced operational cost of “cluster” management
• Decreased job clock time via platform innovation
• Open source ecosystem of SDKs, extensions, runners..
Summary: where Dataflow fits
Troublesome, scattered data >.<
From: https://whitelassiblog.files.wordpress.com/2010/09/postpaid-flow-basic.png
From: http://rsrit.com/blog/wp-content/uploads/2014/08/Automatically-detect-data-errors-and-inconsistencies-through-ETL-Tools.jpg
After adopting Dataflow, what did we get?
Honey, I made the data simple~